ToolsConductScience tool
Tidy Data RulesFree in-browser calculator

Tidy Data Validator.

Upload a CSV, check it against Hadley Wickham's tidy data principles, and get a detailed report with fix suggestions. 100% client-side — your data never leaves your browser.

PrivateData stays in your browser
LiveNo sign-up required
Validated2026-04-05
CitableMethods and citation included

Calculator

Results update in place

Try it out

Load example tidy data data to see the full workflow

Data Input

Drop a CSV file here, or

Supports CSV, TSV, and semicolon-delimited files

When to use

  • Checking CSV files before importing into R, Python, or statistical software
  • Validating datasets for reproducible research pipelines
  • Teaching tidy data principles with immediate feedback
  • Auditing shared datasets for common structural problems
  • Pre-screening data before database ingestion or ETL

Do not use for

  • Validating data types against a strict schema (use JSON Schema or Great Expectations)
  • Checking domain-specific business rules beyond structural quality
  • Validating binary or non-tabular file formats

Wide format is not always wrong

Wide-format data (e.g., columns for each year) is common and sometimes preferred for display. However, for analysis, tidy (long) format is almost always easier to work with. The validator flags wide patterns as warnings, not errors, so you can decide based on your use case.

Comma-separated values inside cells are a frequent source of bugs

A cell containing "A, B, C" looks like one value but represents three. This violates the "each value is a cell" principle and causes silent errors in group-by operations, joins, and aggregations. Split these into separate rows.

Mixed types silently break numeric operations

A column with 99% numbers and one "N/A" string will be read as text by most tools, silently converting all values to strings. Use proper NA/null representations that your software recognizes.

Duplicate rows may be valid

Exact duplicate rows are flagged because they often indicate copy-paste errors. However, if your data genuinely has repeated observations (e.g., multiple measurements per subject), duplicates may be correct. Check the context.

Column naming conventions matter for code

Spaces and special characters in column names force quoting in R (backticks) and Python (bracket notation). snake_case or camelCase names are easier to work with programmatically.

1

Method

Validates CSV data against Hadley Wickham's three tidy data principles (2014): (1) each variable forms a column, (2) each observation forms a row, (3) each type of observational unit forms a table. Additionally checks for mixed types, missing values, column naming quality, duplicate rows, and header issues. All parsing and validation runs entirely in the browser.

2

Validated

Last validated 2026-04-05. Calculations are designed for planning and documentation support; verify procurement decisions against manufacturer specifications or institutional SOPs.

3

How to cite

How to Cite

ConductScience Tidy Data Validator (1.0.0). ConductScience, Inc. 2026. Available at: https://conductscience.com/tools/tidy-data-validator

Wickham H. Tidy Data. Journal of Statistical Software. 2014;59(10):1-23. doi:10.18637/jss.v059.i10

What Is Tidy Data?

Tidy data is a standard way of organizing tabular data, formalized by Hadley Wickham in his 2014 *Journal of Statistical Software* paper. The three principles are:

1. Each variable forms a column — every column represents one and only one measured quantity or attribute. 2. Each observation forms a row — every row represents one complete unit of observation. 3. Each value occupies a single cell — no cell contains multiple values.

These principles sound simple, but violations are extremely common in real-world data. Wide-format spreadsheets, merged cells, multi-value fields, and inconsistent types all break tidy structure and cause downstream analysis errors.

Common Tidy Data Violations

Values in column headers: Columns like "revenue_2023", "revenue_2024" encode the year variable in the header instead of having a "year" column. Fix: pivot to long format.
Multiple values per cell: A "symptoms" column containing "fever, cough, fatigue" crams three observations into one cell. Fix: split into separate rows or use indicator columns.
Mixed types: A numeric column with occasional "N/A" or "pending" strings forces the entire column to text. Fix: use proper missing value representations (empty cell, NA).
Inconsistent casing: "Treatment" and "treatment" in the same column creates phantom groups. Fix: standardize case.
Variables in rows: When each row contains a different measurement (height in row 1, weight in row 2), the data needs transposing.

Why Tidy Data Matters for Research

Reproducibility: Tidy data works predictably with standard tools (dplyr, pandas, ggplot2). Non-tidy data requires custom reshaping code that is error-prone and hard to reproduce.
Composability: Tidy datasets can be merged, joined, and combined because the structure is consistent. Two tidy tables with a shared key column can be joined in one line of code.
Tool compatibility: Most statistical software (R, Python, Stata, SPSS) expects tidy input. Wide-format data requires manual reshaping before every analysis step.
Error reduction: Structural problems in data cause silent errors — wrong group counts, dropped observations, type coercion. Catching these early saves hours of debugging.

How to Fix Untidy Data

Wide to long (pivot): In R: `pivot_longer()`. In Python: `pd.melt()`. Converts columns like score_2024, score_2025 into rows with "year" and "score" columns.
Split multi-value cells: In R: `separate_rows()`. In Python: `df['col'].str.split(',').explode()`. Creates one row per value.
Standardize types: Replace string "N/A", "n/a", "missing" with proper NA. In R: `na_if()`. In Python: `df.replace()`.
Fix column names: In R: `janitor::clean_names()`. In Python: `df.columns = df.columns.str.lower().str.replace(' ', '_')`.
Remove duplicates: In R: `distinct()`. In Python: `df.drop_duplicates()`. Always check if duplicates are real data or errors first.

Validation Methodology

Rule 1 — Variables as columns: Detects numeric suffixes and common temporal/categorical patterns in column names (e.g., _2023, _2024, _q1, _q2) that suggest values encoded in headers.
Rule 2 — Observations as rows: Identifies rows sharing the same ID with different values, suggesting merged or nested observations.
Rule 3 — Values as cells: Scans for comma-separated lists, slash-separated alternatives (A/B), and semicolon-delimited values within individual cells.
Rule 4 — Consistent types: Checks whether each column contains a consistent data type (all numeric, all text, all dates) or a problematic mix.
Rule 5 — Missing values: Counts and locates empty cells, "N/A", "null", "NA", "NaN", and other missing value representations.
Rule 6 — Column naming: Checks for spaces, special characters, leading/trailing whitespace, and inconsistent casing conventions.
Rule 7 — Duplicate rows: Detects exact row duplicates via string comparison.
Rule 8 — Header issues: Checks for empty column names and duplicate header names.

All checks run entirely in your browser. No data is transmitted to any server.

Frequently asked

325
Free tools
1,200+
Institutions
100%
Client-side
0
Uploads required