Tidy Data Validator

Upload a CSV, check it against Hadley Wickham's tidy data principles, and get a detailed report with fix suggestions. 100% client-side — your data never leaves your browser.

Tidy Data Rules · CSV Quality · Fix Suggestions

Try it out

Load the example tidy dataset to see the full workflow

Data Input

Drop a CSV file here, or browse to select one

Supports CSV, TSV, and semicolon-delimited files

Use for

  • Checking CSV files before importing into R, Python, or statistical software
  • Validating datasets for reproducible research pipelines
  • Teaching tidy data principles with immediate feedback
  • Auditing shared datasets for common structural problems
  • Pre-screening data before database ingestion or ETL

Don't use for

  • Validating data types against a strict schema (use JSON Schema or Great Expectations)
  • Checking domain-specific business rules beyond structural quality
  • Validating binary or non-tabular file formats

What Is Tidy Data?

Tidy data is a standard way of organizing tabular data, formalized by Hadley Wickham in his 2014 *Journal of Statistical Software* paper. The three principles are:

1. Each variable forms a column — every column represents one and only one measured quantity or attribute.
2. Each observation forms a row — every row represents one complete unit of observation.
3. Each value occupies a single cell — no cell contains multiple values.
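The first principle is easiest to see with a before/after example. A minimal pandas sketch (the company and revenue figures are invented for illustration):

```python
import pandas as pd

# Untidy: the year variable is encoded in the column headers.
untidy = pd.DataFrame({
    "company": ["Acme", "Globex"],
    "revenue_2023": [100, 200],
    "revenue_2024": [110, 210],
})

# Tidy: "year" becomes its own column, one observation per row.
tidy = untidy.melt(id_vars="company", var_name="year", value_name="revenue")
tidy["year"] = tidy["year"].str.removeprefix("revenue_").astype(int)

print(tidy)
#   company  year  revenue
# 0    Acme  2023      100
# 1  Globex  2023      200
# 2    Acme  2024      110
# 3  Globex  2024      210
```

After the reshape, each variable (company, year, revenue) is a column and each company-year observation is a row.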

These principles sound simple, but violations are extremely common in real-world data. Wide-format spreadsheets, merged cells, multi-value fields, and inconsistent types all break tidy structure and cause downstream analysis errors.

Common Tidy Data Violations

Values in column headers: Columns like "revenue_2023", "revenue_2024" encode the year variable in the header instead of having a "year" column. Fix: pivot to long format.
Multiple values per cell: A "symptoms" column containing "fever, cough, fatigue" crams three observations into one cell. Fix: split into separate rows or use indicator columns.
Mixed types: A numeric column with occasional "N/A" or "pending" strings forces the entire column to text. Fix: use proper missing value representations (empty cell, NA).
Inconsistent casing: "Treatment" and "treatment" in the same column creates phantom groups. Fix: standardize case.
Variables in rows: When each row contains a different measurement (height in row 1, weight in row 2), variables are stored in rows instead of columns. Fix: transpose or pivot wider so each variable becomes a column.
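The inconsistent-casing violation is easy to demonstrate: group counts silently split across phantom groups. A small sketch in pandas (the category values are illustrative):

```python
import pandas as pd

# Hypothetical column mixing "Treatment"/"treatment" and "Control"/"control".
s = pd.Series(["Treatment", "treatment", "Control", "control", "Control"])

# Raw counts see four distinct groups instead of the real two.
raw_groups = s.value_counts()

# Case-folding collapses the phantom groups.
clean_groups = s.str.casefold().value_counts()

print(len(raw_groups))    # 4 phantom groups
print(len(clean_groups))  # 2 real groups
```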

Why Tidy Data Matters for Research

Reproducibility: Tidy data works predictably with standard tools (dplyr, pandas, ggplot2). Non-tidy data requires custom reshaping code that is error-prone and hard to reproduce.
Composability: Tidy datasets can be merged, joined, and combined because the structure is consistent. Two tidy tables with a shared key column can be joined in one line of code.
Tool compatibility: Most statistical software (R, Python, Stata, SPSS) expects tidy input. Wide-format data requires manual reshaping before every analysis step.
Error reduction: Structural problems in data cause silent errors — wrong group counts, dropped observations, type coercion. Catching these early saves hours of debugging.

How to Fix Untidy Data

Wide to long (pivot): In R: `pivot_longer()`. In Python: `pd.melt()`. Converts columns like score_2024, score_2025 into rows with "year" and "score" columns.
Split multi-value cells: In R: `separate_rows()`. In Python: `df['col'].str.split(',').explode()`. Creates one row per value.
Standardize types: Replace string "N/A", "n/a", "missing" with proper NA. In R: `na_if()`. In Python: `df.replace()`.
Fix column names: In R: `janitor::clean_names()`. In Python: `df.columns = df.columns.str.lower().str.replace(' ', '_')`.
Remove duplicates: In R: `distinct()`. In Python: `df.drop_duplicates()`. Always check if duplicates are real data or errors first.
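Strung together, the pandas versions of these fixes might look like the following. The dataset and column names are invented for illustration; this is a sketch of the workflow, not a universal recipe:

```python
import pandas as pd

# Invented messy dataset exhibiting several violations at once.
df = pd.DataFrame({
    "Patient ID": [1, 1, 2],
    "Symptoms": ["fever, cough", "fever, cough", "fatigue"],
    "Score": ["10", "10", "N/A"],
})

# Fix column names: strip, lowercase, underscores instead of spaces.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Remove exact duplicate rows (after confirming they are errors).
df = df.drop_duplicates()

# Split multi-value cells: one row per symptom.
df = df.assign(symptoms=df["symptoms"].str.split(",")).explode("symptoms")
df["symptoms"] = df["symptoms"].str.strip()

# Standardize missing values, then restore the numeric type.
df["score"] = pd.to_numeric(df["score"].replace("N/A", pd.NA))

print(df)
```

Order matters here: deduplicate before exploding multi-value cells, or the duplicates multiply.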

Validation Methodology

Rule 1 — Variables as columns: Detects numeric suffixes and common temporal/categorical patterns in column names (e.g., _2023, _2024, _q1, _q2) that suggest values encoded in headers.
Rule 2 — Observations as rows: Identifies rows sharing the same ID with different values, suggesting merged or nested observations.
Rule 3 — Values as cells: Scans for comma-separated lists, slash-separated alternatives (A/B), and semicolon-delimited values within individual cells.
Rule 4 — Consistent types: Checks whether each column contains a consistent data type (all numeric, all text, all dates) or a problematic mix.
Rule 5 — Missing values: Counts and locates empty cells, "N/A", "null", "NA", "NaN", and other missing value representations.
Rule 6 — Column naming: Checks for spaces, special characters, leading/trailing whitespace, and inconsistent casing conventions.
Rule 7 — Duplicate rows: Detects exact row duplicates via string comparison.
Rule 8 — Header issues: Checks for empty column names and duplicate header names.
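Rule 1's header check can be approximated with a pattern match on column names. A sketch of the idea — the regex and the two-match threshold are assumptions for illustration, not the tool's exact implementation:

```python
import re

# Suffixes suggesting a value (year, quarter, month) encoded in a header.
SUFFIX = re.compile(r"_(\d{4}|q[1-4]|\d{1,2})$", re.IGNORECASE)

def headers_encode_values(columns, min_matches=2):
    """Flag a header row when >= min_matches columns share a value-like suffix."""
    matches = [c for c in columns if SUFFIX.search(c)]
    return len(matches) >= min_matches

print(headers_encode_values(["company", "revenue_2023", "revenue_2024"]))  # True
print(headers_encode_values(["id", "name", "score"]))                      # False
```

A single matching column is usually harmless (e.g. a lone `founded_1999`); requiring at least two avoids that false positive.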

All checks run entirely in your browser. No data is transmitted to any server.

Frequently Asked Questions