Tidy Data Validator

Upload a CSV, check it against Hadley Wickham's tidy data principles, and get a detailed report with fix suggestions. 100% client-side — your data never leaves your browser.

Supports CSV, TSV, and semicolon-delimited files

Use for

  • Checking CSV files before importing into R, Python, or statistical software
  • Validating datasets for reproducible research pipelines
  • Teaching tidy data principles with immediate feedback
  • Auditing shared datasets for common structural problems
  • Pre-screening data before database ingestion or ETL

Don't use for

  • Validating data types against a strict schema (use JSON Schema or Great Expectations)
  • Checking domain-specific business rules beyond structural quality
  • Validating binary or non-tabular file formats

What Is Tidy Data?

Tidy data is a standard way of organizing tabular data, formalized by Hadley Wickham in his 2014 *Journal of Statistical Software* paper. The three principles are:

1. Each variable forms a column — every column represents one and only one measured quantity or attribute.
2. Each observation forms a row — every row represents one complete unit of observation.
3. Each value occupies a single cell — no cell contains multiple values.

These principles sound simple, but violations are extremely common in real-world data. Wide-format spreadsheets, merged cells, multi-value fields, and inconsistent types all break tidy structure and cause downstream analysis errors.
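The most common violation, values encoded in column headers, is easiest to see in a small example. Below is a minimal pandas sketch with a hypothetical revenue table; the column and variable names are illustrative, not taken from the tool:

```python
import pandas as pd

# Untidy: the "year" variable is hidden in the column headers (hypothetical data).
wide = pd.DataFrame({
    "country": ["Norway", "Chile"],
    "revenue_2023": [10, 20],
    "revenue_2024": [12, 25],
})

# Tidy: one column per variable (country, year, revenue), one row per observation.
tidy = wide.melt(id_vars="country", var_name="year", value_name="revenue")
tidy["year"] = tidy["year"].str.replace("revenue_", "", regex=False).astype(int)
print(tidy)
```

After the reshape, each of the three variables has its own column and each country-year pair is one row.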

Common Tidy Data Violations

Values in column headers: Columns like "revenue_2023", "revenue_2024" encode the year variable in the header instead of having a "year" column. Fix: pivot to long format.
Multiple values per cell: A "symptoms" column containing "fever, cough, fatigue" crams three observations into one cell. Fix: split into separate rows or use indicator columns.
Mixed types: A numeric column with occasional "N/A" or "pending" strings forces the entire column to text. Fix: use proper missing value representations (empty cell, NA).
Inconsistent casing: "Treatment" and "treatment" in the same column create phantom groups. Fix: standardize case.
Variables in rows: When each row contains a different measurement (height in row 1, weight in row 2), the data needs transposing. Fix: pivot wider so each variable becomes its own column.

Why Tidy Data Matters for Research

Reproducibility: Tidy data works predictably with standard tools (dplyr, pandas, ggplot2). Non-tidy data requires custom reshaping code that is error-prone and hard to reproduce.
Composability: Tidy datasets can be merged, joined, and combined because the structure is consistent. Two tidy tables with a shared key column can be joined in one line of code.
Tool compatibility: Most statistical software (R, Python, Stata, SPSS) expects tidy input. Wide-format data requires manual reshaping before every analysis step.
Error reduction: Structural problems in data cause silent errors — wrong group counts, dropped observations, type coercion. Catching these early saves hours of debugging.
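The composability point is concrete: because tidy tables share a predictable structure, combining them needs no reshaping. A sketch with two hypothetical tidy tables sharing a `sample_id` key:

```python
import pandas as pd

# Two tidy tables with a shared key column (hypothetical data).
measurements = pd.DataFrame({"sample_id": [1, 2, 3], "weight": [4.1, 3.8, 5.0]})
metadata = pd.DataFrame({"sample_id": [1, 2, 3], "site": ["A", "B", "A"]})

# One line of code, as promised: join on the shared key.
combined = measurements.merge(metadata, on="sample_id")
print(combined)
```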

How to Fix Untidy Data

Wide to long (pivot): In R: `pivot_longer()`. In Python: `pd.melt()`. Converts columns like score_2024, score_2025 into rows with "year" and "score" columns.
Split multi-value cells: In R: `separate_rows()`. In Python: `df['col'].str.split(',').explode()`. Creates one row per value.
Standardize types: Replace string "N/A", "n/a", "missing" with proper NA. In R: `na_if()`. In Python: `df.replace()`.
Fix column names: In R: `janitor::clean_names()`. In Python: `df.columns = df.columns.str.lower().str.replace(' ', '_')`.
Remove duplicates: In R: `distinct()`. In Python: `df.drop_duplicates()`. Always check if duplicates are real data or errors first.
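The Python recipes above can be chained into one cleanup pass. This sketch uses a hypothetical untidy table; the column names and delimiter choices are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Patient ID": [1, 1, 2],
    "symptoms": ["fever, cough", "fever, cough", "fatigue"],
    "score_2024": ["10", "10", "N/A"],
})

# Fix column names: lowercase, underscores instead of spaces.
df.columns = df.columns.str.lower().str.replace(" ", "_")

# Remove exact duplicate rows (after confirming they are errors, not real data).
df = df.drop_duplicates().reset_index(drop=True)

# Standardize missing values, then restore the numeric type.
df["score_2024"] = pd.to_numeric(
    df["score_2024"].replace(["N/A", "n/a", "missing"], np.nan)
)

# Split multi-value cells into one row per value.
df["symptoms"] = df["symptoms"].str.split(", ")
df = df.explode("symptoms")
print(df)
```

The result is one row per patient-symptom pair, with a proper numeric score column and NaN for the missing value.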

Validation Methodology

Rule 1 — Variables as columns: Detects numeric suffixes and common temporal/categorical patterns in column names (e.g., _2023, _2024, _q1, _q2) that suggest values encoded in headers.
Rule 2 — Observations as rows: Identifies rows sharing the same ID with different values, suggesting merged or nested observations.
Rule 3 — Values as cells: Scans for comma-separated lists, slash-separated alternatives (A/B), and semicolon-delimited values within individual cells.
Rule 4 — Consistent types: Checks whether each column contains a consistent data type (all numeric, all text, all dates) or a problematic mix.
Rule 5 — Missing values: Counts and locates empty cells, "N/A", "null", "NA", "NaN", and other missing value representations.
Rule 6 — Column naming: Checks for spaces, special characters, leading/trailing whitespace, and inconsistent casing conventions.
Rule 7 — Duplicate rows: Detects exact row duplicates via string comparison.
Rule 8 — Header issues: Checks for empty column names and duplicate header names.
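A few of these rules (1, 3, and 7) can be sketched as simple pandas checks. The regex patterns below are illustrative assumptions, not the tool's exact heuristics:

```python
import re
import pandas as pd

# Rule 1: column names with numeric or temporal suffixes suggest values in headers.
HEADER_VALUE_PATTERN = re.compile(r"_(\d{4}|q[1-4]|\d+)$", re.IGNORECASE)

def header_value_columns(df: pd.DataFrame) -> list[str]:
    return [c for c in df.columns if HEADER_VALUE_PATTERN.search(c)]

# Rule 3: cells containing comma-, semicolon-, or slash-separated lists.
MULTI_VALUE_PATTERN = re.compile(r"[,;/]")

def multi_value_cells(df: pd.DataFrame) -> int:
    hits = df.astype(str).apply(lambda s: s.str.contains(MULTI_VALUE_PATTERN))
    return int(hits.to_numpy().sum())

# Rule 7: exact duplicate rows via string comparison.
def duplicate_rows(df: pd.DataFrame) -> int:
    return int(df.astype(str).duplicated().sum())

df = pd.DataFrame({
    "id": [1, 2, 2],
    "revenue_2023": [10, 20, 20],
    "tags": ["a,b", "c", "c"],
})
print(header_value_columns(df))
print(multi_value_cells(df))
print(duplicate_rows(df))
```

On this small example the checks flag the `revenue_2023` header, one multi-value cell, and one duplicate row.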

All checks run entirely in your browser. No data is transmitted to any server.

Frequently Asked Questions