Metadata Anonymization Checker

Scan uploaded files for personally identifiable information (PII). Detect emails, Social Security numbers (SSNs), phone numbers, dates of birth, IP addresses, medical record numbers (MRNs), and GPS coordinates, with severity-rated findings and HIPAA Safe Harbor anonymization guidance.

PII Detection | HIPAA Safe Harbor | Client-Side

Accepts CSV, TSV, JSON, XML, and TXT files. All processing happens locally in your browser.

Use for

  • Screen research datasets before depositing in public repositories (e.g., Dryad, Figshare, ICPSR)
  • Check clinical data exports for residual PHI before analysis on non-secure systems
  • Audit anonymization pipelines by verifying outputs are free of PII patterns
  • Review data files before sharing with collaborators or including as journal supplements
  • Quick pre-check before submitting data to IRB or data safety monitoring boards

Don't use for

  • As the sole method for HIPAA compliance -- this is a screening aid, not a certification tool
  • For binary files (XLSX, DOCX, PDF) -- convert to text-based formats first
  • To detect indirect identifiers or re-identification risk from quasi-identifier combinations
  • As a replacement for expert review by a privacy officer or qualified statistician
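
To make the scope above concrete, here is a minimal sketch of a pattern-based screen. It is not this tool's implementation: the pattern names, the regular expressions, and the severity labels are all assumptions chosen for illustration, and real detectors need broader patterns and validation. Because it only matches the surface form of direct identifiers, a screen like this cannot see quasi-identifier combinations.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns and severity labels, for illustration only.
PATTERNS = {
    "email": ("high", re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")),
    "ssn": ("critical", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    "us_phone": ("medium", re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")),
    "ipv4": ("medium", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
}


@dataclass(frozen=True)
class Finding:
    line: int       # 1-based line number in the scanned text
    kind: str       # key into PATTERNS
    severity: str   # label attached to the pattern
    match: str      # the matched substring


def scan_text(text: str) -> list[Finding]:
    """Return every pattern match in the text, line by line."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for kind, (severity, pattern) in PATTERNS.items():
            for m in pattern.finditer(line):
                findings.append(Finding(line_no, kind, severity, m.group()))
    return findings


if __name__ == "__main__":
    sample = "id,contact\n1,jane.doe@example.org\n2,123-45-6789\n"
    for finding in scan_text(sample):
        print(finding)
```

An empty result means only that none of these patterns matched. It does not show that the file is free of PII.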

Data Anonymization Fundamentals

Research data sharing accelerates scientific progress but introduces privacy risks when datasets contain personally identifiable information (PII). Effective anonymization must balance two goals:

  • Privacy protection -- ensuring no individual can be re-identified from the released data
  • Data utility -- preserving enough information for meaningful analysis

In U.S. biomedical research, the most common approach is HIPAA de-identification using the Safe Harbor method, which requires removing 18 specific categories of identifiers. Removal alone may be insufficient, however, when combinations of seemingly anonymous variables (quasi-identifiers such as age + ZIP code + gender) can be linked to external datasets.
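
The quasi-identifier risk can be checked roughly by counting how many records share each combination of values. The sketch below assumes records are plain dictionaries; the function names and field names are hypothetical. A record whose combination appears only once is a candidate for linkage against an external dataset.

```python
from collections import Counter


def _key(record, quasi_identifiers):
    """Tuple of the record's values for the chosen quasi-identifier fields."""
    return tuple(record[q] for q in quasi_identifiers)


def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    counts = Counter(_key(r, quasi_identifiers) for r in records)
    return min(counts.values()) if counts else 0


def unique_records(records, quasi_identifiers):
    """Records whose quasi-identifier combination occurs exactly once."""
    counts = Counter(_key(r, quasi_identifiers) for r in records)
    return [r for r in records if counts[_key(r, quasi_identifiers)] == 1]


if __name__ == "__main__":
    records = [
        {"age": 34, "zip": "02139", "gender": "F"},
        {"age": 34, "zip": "02139", "gender": "F"},
        {"age": 71, "zip": "02139", "gender": "M"},
    ]
    fields = ["age", "zip", "gender"]
    print(smallest_group(records, fields))
    print(unique_records(records, fields))
```

A smallest group of size 1 means at least one person is unique on those fields within the dataset. Whether that person is actually re-identifiable depends on what external data exists, which this count does not measure.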

Key anonymization techniques include:

  • Suppression -- removing the identifier entirely (e.g., deleting a name column)
  • Generalization -- reducing precision (e.g., exact age to age range, ZIP code to state)
  • Pseudonymization -- replacing identifiers with consistent codes (e.g., a study-specific patient ID) so records can still be linked within the dataset
  • Perturbation -- adding noise to continuous values while approximately preserving their overall distribution
  • k-Anonymity -- ensuring every record is indistinguishable from at least k-1 others on its quasi-identifiers
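
The first four techniques can be sketched in a few lines each. This is a minimal illustration under stated assumptions, not a recommended pipeline: all function and class names are hypothetical, the 10-year age band and the Gaussian noise model are arbitrary choices, and the pseudonym table is held in memory only.

```python
import random
import secrets


def suppress(record, fields):
    """Suppression: drop the named fields from a record."""
    return {k: v for k, v in record.items() if k not in fields}


def generalize_age(age, width=10):
    """Generalization: replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


class Pseudonymizer:
    """Pseudonymization: assign each identifier a random but consistent code.

    The lookup table can reverse the mapping, so it is as sensitive as the
    original identifiers and should be stored separately from the data.
    """

    def __init__(self):
        self._codes = {}

    def code_for(self, identifier):
        if identifier not in self._codes:
            self._codes[identifier] = "P" + secrets.token_hex(8)
        return self._codes[identifier]


def perturb(value, scale, rng):
    """Perturbation: add zero-mean Gaussian noise with standard deviation `scale`.

    The mean is preserved in expectation; the variance of the column increases.
    """
    return value + rng.gauss(0.0, scale)


if __name__ == "__main__":
    record = {"name": "Jane Doe", "mrn": "MRN-001", "age": 34, "weight_kg": 61.5}
    pseudonyms = Pseudonymizer()
    rng = random.Random(0)  # seeded only so the example is repeatable

    released = suppress(record, {"name"})
    released["mrn"] = pseudonyms.code_for(record["mrn"])
    released["age"] = generalize_age(record["age"])
    released["weight_kg"] = perturb(record["weight_kg"], 0.5, rng)
    print(released)
```

Random codes are used here rather than a hash of the identifier, because a code derived from the identifier itself can sometimes be reversed by anyone who can guess the input values.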

The 18 HIPAA Safe Harbor Identifiers

The HIPAA Safe Harbor method (45 CFR 164.514(b)(2)) requires removal of these 18 identifier categories:

  1. Names
  2. Geographic data smaller than state (street address, city, ZIP code)
  3. Dates (except year) directly related to an individual
  4. Phone numbers
  5. Fax numbers
  6. Email addresses
  7. Social Security numbers
  8. Medical record numbers
  9. Health plan beneficiary numbers
  10. Account numbers
  11. Certificate/license numbers
  12. Vehicle identifiers and serial numbers
  13. Device identifiers and serial numbers
  14. Web URLs
  15. IP addresses
  16. Biometric identifiers
  17. Full-face photographs and comparable images
  18. Any other unique identifying number, characteristic, or code

After removing these identifiers, the covered entity must have no actual knowledge that the remaining information could be used alone or in combination to identify an individual.
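
For three of these categories, the regulation allows a reduced form to be kept rather than removing the field outright: the year of a date, ages aggregated into a single "90 or older" category, and the first three digits of a ZIP code when the area they cover has more than 20,000 people (otherwise 000). The sketch below shows those reductions. The function names are hypothetical, and the set of low-population prefixes is a placeholder that must be built from current Census data, not from this example.

```python
from datetime import date


def date_to_year(d: date) -> int:
    """Keep only the year of a date.

    Caveat: for individuals over 89, a year that reveals the age
    (e.g., year of birth) must also be removed or aggregated.
    """
    return d.year


def cap_age(age: int) -> str:
    """Aggregate ages over 89 into a single '90+' category."""
    return "90+" if age >= 90 else str(age)


def zip3(zip_code: str, low_population_prefixes: frozenset) -> str:
    """Keep the first three ZIP digits, or '000' for low-population areas.

    `low_population_prefixes` is an assumed input: the three-digit prefixes
    whose combined area has 20,000 or fewer people.
    """
    prefix = zip_code[:3]
    return "000" if prefix in low_population_prefixes else prefix


if __name__ == "__main__":
    placeholder_prefixes = frozenset({"999"})  # hypothetical, not real Census data
    print(date_to_year(date(2021, 3, 14)))
    print(cap_age(93))
    print(zip3("02139", placeholder_prefixes))
```

Applying these reductions does not by itself satisfy the "no actual knowledge" condition above, which still has to be assessed separately.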

Frequently Asked Questions