
Metadata Anonymization Checker.

Scan uploaded files for personally identifiable information (PII). Detect emails, SSNs, phone numbers, dates of birth, IP addresses, MRNs, and GPS coordinates with severity-rated findings and HIPAA Safe Harbor anonymization guidance.

Private: Data stays in your browser
Live: No sign-up required
Validated: 2026-04-05
Citable: Methods and citation included

Accepts CSV, TSV, JSON, XML, and TXT files. All processing happens locally in your browser.


When to use

  • Screen research datasets before depositing in public repositories (e.g., Dryad, Figshare, ICPSR)
  • Check clinical data exports for residual PHI before analysis on non-secure systems
  • Audit anonymization pipelines by verifying outputs are free of PII patterns
  • Review data files before sharing with collaborators or including as journal supplements
  • Quick pre-check before submitting data to IRB or data safety monitoring boards

Do not use for

  • HIPAA compliance on its own -- this is a screening aid, not a certification tool
  • Binary files (XLSX, DOCX, PDF) -- convert to text-based formats first
  • Detecting indirect identifiers or re-identification risk from quasi-identifier combinations
  • Replacing expert review by a privacy officer or qualified statistician

PII can hide in free-text fields

Clinical notes, comments, and free-text columns often contain names, dates, and other identifiers that structured de-identification misses. Always scan entire files, not just structured columns.

Dates of birth vs. study dates

This tool flags date patterns that could be dates of birth. Study dates and procedure dates may also need anonymization under HIPAA if they relate directly to an individual. Review flagged dates in context.
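
One common remedy for a flagged date is to keep only the year, which Safe Harbor permits. The sketch below illustrates that idea in Python; it assumes dates are written in ISO 8601 form (YYYY-MM-DD), and the names ISO_DATE and truncate_dates_to_year are hypothetical. Other date formats would need their own patterns.

```python
import re

# Assumption: dates appear as YYYY-MM-DD. Other layouts are not handled.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

def truncate_dates_to_year(text: str) -> str:
    """Replace each ISO date with its year only."""
    return ISO_DATE.sub(r"\1", text)

print(truncate_dates_to_year("dob=1984-03-17, visit=2021-11-02"))
# prints: dob=1984, visit=2021
```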

Column headers reveal data structure

Even if PII values are removed, column headers like "patient_name" or "ssn" signal that the dataset was derived from identified sources. Consider renaming columns to neutral labels (e.g., "subject_id").
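
A header check of this kind can be sketched in a few lines of Python. The fragment list in PII_HEADER_PATTERN and the function name flag_headers are illustrative assumptions, not the tool's actual rules.

```python
import re

# Hypothetical header fragments that suggest the data came from identified sources.
PII_HEADER_PATTERN = re.compile(
    r"(name|ssn|social|email|phone|dob|birth|address|mrn)", re.IGNORECASE
)

def flag_headers(header_line: str, delimiter: str = ",") -> list[str]:
    """Return the column headers in the first row that look like PII field names."""
    headers = [h.strip().strip('"') for h in header_line.split(delimiter)]
    return [h for h in headers if PII_HEADER_PATTERN.search(h)]

print(flag_headers("subject_id,patient_name,SSN,visit_score"))
# prints: ['patient_name', 'SSN']
```

Substring matching is deliberately broad, so a harmless header such as "filename" would also be flagged; treat the output as a prompt for review rather than a verdict.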

Pseudonymization is not anonymization

Replacing names with codes (pseudonymization) is reversible if the key exists. Under GDPR, pseudonymized data is still personal data. True anonymization means no key exists to reverse the mapping.
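
The reversibility is easy to see in a small Python sketch. The function name pseudonymize and the sequential "SUBJ-0001" code format are assumptions chosen for illustration; real projects often prefer random codes so that code order does not leak enrollment order.

```python
def pseudonymize(names: list[str]) -> tuple[list[str], dict[str, str]]:
    """Replace each name with a consistent code and return the codes plus the key."""
    key: dict[str, str] = {}
    codes = []
    for name in names:
        if name not in key:
            key[name] = f"SUBJ-{len(key) + 1:04d}"
        codes.append(key[name])
    return codes, key

codes, key = pseudonymize(["Ann Lee", "Bo Chen", "Ann Lee"])
print(codes)  # prints: ['SUBJ-0001', 'SUBJ-0002', 'SUBJ-0001']

# Anyone holding the key can invert the mapping and recover the names.
reverse = {code: name for name, code in key.items()}
print(reverse["SUBJ-0002"])  # prints: Bo Chen
```

As long as the key dictionary is stored anywhere, the released codes remain linkable to individuals.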

Method

The scanner reads uploaded files as plain text and applies regular expressions for each PII category. Email, SSN, phone, IP address, date, MRN, and GPS patterns are matched line-by-line with severity classification (high for direct identifiers, medium for indirect). Column headers in the first row are checked against known PII field name patterns. Results include redacted samples (first/last 2 characters preserved), line numbers, and HIPAA Safe Harbor-aligned anonymization suggestions. All processing is client-side; no data leaves the browser.
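
The line-by-line approach can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation (which runs in the browser): the three regular expressions, the severity assigned to each category, and the names PATTERNS, redact, and scan are all hypothetical, and only a subset of the categories is shown.

```python
import re

# Hypothetical patterns and severities; the tool's real expressions may differ.
PATTERNS = {
    "email": (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "high"),
    "ssn": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "high"),
    "ipv4": (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "medium"),
}

def redact(value: str) -> str:
    """Keep the first and last 2 characters and mask everything in between."""
    if len(value) <= 4:
        return "*" * len(value)
    return value[:2] + "*" * (len(value) - 4) + value[-2:]

def scan(text: str) -> list[dict]:
    """Match every pattern against every line and collect redacted findings."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for category, (pattern, severity) in PATTERNS.items():
            for match in pattern.finditer(line):
                findings.append({
                    "line": line_no,
                    "category": category,
                    "severity": severity,
                    "sample": redact(match.group()),
                })
    return findings

for finding in scan("id,contact\n1,jane.doe@example.org\n2,123-45-6789"):
    print(finding)
```

Simple patterns like these trade precision for recall: the IPv4 expression, for example, also matches version strings such as 1.2.3.4, which is why findings should be reviewed in context.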

Validated

Last validated 2026-04-05. Results are designed for screening and documentation support; verify findings against institutional SOPs before relying on them.

How to cite

ConductScience Metadata Anonymization Checker (v1.0). ConductScience, Inc. 2026. Available at: https://conductscience.com/tools/metadata-anonymization-checker

U.S. Department of Health and Human Services. Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the HIPAA Privacy Rule. 2012.

El Emam K, Arbuckle L. Anonymizing Health Data: Case Studies and Methods to Get You Started. O'Reilly Media; 2013.

Data Anonymization Fundamentals

Research data sharing accelerates scientific progress but introduces privacy risks when datasets contain personally identifiable information (PII). Effective anonymization must balance two goals:

  • Privacy protection -- ensuring no individual can be re-identified from the released data
  • Data utility -- preserving enough information for meaningful analysis

The most common approach in biomedical research is HIPAA de-identification using the Safe Harbor method, which requires removal of 18 specific identifier categories. However, removal alone may be insufficient when combinations of seemingly anonymous variables (quasi-identifiers like age + ZIP + gender) can be linked to external datasets.

Key anonymization techniques include:

  • Suppression -- removing the identifier entirely (e.g., deleting a name column)
  • Generalization -- reducing precision (e.g., exact age to age range, ZIP code to state)
  • Pseudonymization -- replacing identifiers with consistent codes (e.g., patient ID)
  • Perturbation -- adding noise to continuous values while preserving distributions
  • k-Anonymity -- ensuring every record is indistinguishable from at least k-1 others on its quasi-identifier values
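
Generalization and k-anonymity work together, as the following Python sketch suggests. The helper names generalize_age and k_anonymity, the 10-year bin width, and the sample records are assumptions made for illustration.

```python
from collections import Counter

def generalize_age(age: int, width: int = 10) -> str:
    """Map an exact age to a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def k_anonymity(records: list[dict], quasi_identifiers: list[str]) -> int:
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0

records = [
    {"age": 34, "state": "CA"}, {"age": 37, "state": "CA"},
    {"age": 52, "state": "NY"}, {"age": 58, "state": "NY"},
]
print(k_anonymity(records, ["age", "state"]))  # prints: 1

generalized = [{**r, "age": generalize_age(r["age"])} for r in records]
print(k_anonymity(generalized, ["age", "state"]))  # prints: 2
```

With exact ages every record is unique (k = 1); after binning ages into decades, each record shares its quasi-identifier values with one other (k = 2), at the cost of some analytical precision.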

The 18 HIPAA Safe Harbor Identifiers

The HIPAA Safe Harbor method (45 CFR 164.514(b)(2)) requires removal of these 18 identifier categories:

  1. Names
  2. Geographic data smaller than state (street address, city, ZIP code)
  3. Dates (except year) directly related to an individual
  4. Phone numbers
  5. Fax numbers
  6. Email addresses
  7. Social Security numbers
  8. Medical record numbers
  9. Health plan beneficiary numbers
  10. Account numbers
  11. Certificate/license numbers
  12. Vehicle identifiers and serial numbers
  13. Device identifiers and serial numbers
  14. Web URLs
  15. IP addresses
  16. Biometric identifiers
  17. Full-face photographs and comparable images
  18. Any other unique identifying number, characteristic, or code

After removing these identifiers, the covered entity must have no actual knowledge that the remaining information could be used alone or in combination to identify an individual.
