The Data Cleaning Checklist: From Messy CSV to Analysis-Ready
The biggest source of bad analysis isn’t bad math — it’s dirty input. Use this checklist before any aggregation, modeling, or chart hits a stakeholder’s screen.
The 10-Step Checklist
- Deduplicate. Use our Remove Duplicates Tool on key columns first.
- Trim whitespace. Leading/trailing spaces silently break joins. Run text through our Text Cleaner Tool.
- Standardize casing. “john@x.com” and “John@x.com” are the same email but two rows.
- Fix encoding. Watch for mojibake from Excel’s UTF-8 BOM and Latin-1 round-trips.
- Coerce types. Dates as strings, numbers as text, booleans as 0/1 — pick one representation.
- Standardize date formats. ISO 8601 (YYYY-MM-DD) is the only safe choice for cross-tool data.
- Handle missing values. Decide: drop, impute, or flag — and be explicit per column.
- Validate ranges. Flag impossible values: negative ages, future order dates, $0 prices.
- Cross-field consistency. Order date can’t be after delivery date.
- Sample-check the output. Eyeball 20 random rows before declaring it clean.
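For scripted cleanups, several of the steps above map directly onto pandas. A minimal sketch, assuming a hypothetical orders file with email, age, order_date, delivery_date, and price columns (the inline data stands in for a real CSV):

```python
import io
import pandas as pd

# Hypothetical messy input; in practice this would be pd.read_csv("orders.csv").
raw = io.StringIO(
    "email,age,order_date,delivery_date,price\n"
    "john@x.com,34,2024-01-05,2024-01-08,19.99\n"
    "John@x.com ,34,2024-01-05,2024-01-08,19.99\n"
    "sue@y.com,-1,2024-02-01,2024-01-20,0\n"
)
df = pd.read_csv(raw, dtype=str)  # read everything as text, then coerce deliberately

# Trim whitespace and standardize casing BEFORE deduplicating, otherwise
# "John@x.com " and "john@x.com" count as two distinct rows.
df["email"] = df["email"].str.strip().str.lower()
df = df.drop_duplicates(subset=["email", "order_date"])

# Coerce types explicitly; errors="coerce" turns junk into NaN/NaT for review
# instead of raising mid-pipeline.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["price"] = pd.to_numeric(df["price"], errors="coerce")
for col in ("order_date", "delivery_date"):
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d", errors="coerce")

# Validate ranges and cross-field consistency; flag rather than silently drop,
# so the decision to remove rows stays visible.
df["suspect"] = (
    (df["age"] < 0)
    | (df["price"] <= 0)
    | (df["order_date"] > df["delivery_date"])
)
print(df)
```

The ordering matters: trim and lowercase before deduplicating, and coerce types before any range check, or string comparison will quietly give wrong answers.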
Excel-Specific Gotchas
- Gene names like SEPT2 silently convert to dates.
- Leading zeros in IDs/zip codes get stripped.
- Long numbers (16+ digits) lose precision.
Convert via our Excel to CSV Converter with text-import options when in doubt.
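On the pandas side, the leading-zero gotcha has the same fix as the type-coercion step: read everything as text first. A small sketch with made-up zip and account_id values (note the SEPT2-to-date and 16-digit precision problems happen inside Excel itself, before the file ever reaches your script):

```python
import io
import pandas as pd

# Hypothetical export with a zero-padded zip code and a 16-digit account ID.
raw = io.StringIO("zip,account_id\n01234,1234567890123456\n")

# Default parsing infers types: the zip becomes the integer 1234.
naive = pd.read_csv(raw)

raw.seek(0)
# Reading as text preserves the value exactly; coerce columns later, on purpose.
safe = pd.read_csv(raw, dtype=str)

print(naive["zip"].iloc[0])  # 1234  (leading zero lost)
print(safe["zip"].iloc[0])   # 01234
```

IDs, zip codes, and phone numbers are labels, not quantities; keeping them as strings is almost always the right call.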
Document What You Did
Keep a cleaning log alongside your dataset: rules applied, rows dropped, fields imputed. Future-you will need it.
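A cleaning log doesn't need tooling; a list of dicts serialized to JSON next to the dataset is enough. A minimal sketch (the helper name and the row counts are invented for illustration):

```python
import json
from datetime import datetime, timezone

cleaning_log = []

def log_step(rule, rows_before, rows_after, note=""):
    """Append one cleaning rule and its row impact to the log (hypothetical helper)."""
    cleaning_log.append({
        "rule": rule,
        "rows_before": rows_before,
        "rows_after": rows_after,
        "rows_dropped": rows_before - rows_after,
        "note": note,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    })

# Example entries; the counts are made up.
log_step("drop_duplicates on (email, order_date)", 10432, 10117)
log_step("drop rows with null price", 10117, 10095, note="0.2% of rows")

# Store this next to the dataset, e.g. as orders.cleaning_log.json.
print(json.dumps(cleaning_log, indent=2))
```

Recording before/after row counts per rule is the part future-you will actually use: it answers "where did those 300 rows go?" without re-running anything.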
Quick Sanity Summary
After cleaning, run our Dataset Summary Generator for instant means, medians, missing counts and outliers across every column.
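If you'd rather script that sanity pass, a rough pandas equivalent covering the same four numbers (means, medians, missing counts, and a simple IQR outlier count) looks like this, with throwaway inline data standing in for your file:

```python
import io
import pandas as pd

raw = io.StringIO("price,qty\n19.99,1\n5.00,\n200.0,3\n")
df = pd.read_csv(raw)

summary = pd.DataFrame({
    "mean": df.mean(numeric_only=True),
    "median": df.median(numeric_only=True),
    "missing": df.isna().sum(),
})

# Simple 1.5*IQR outlier count per numeric column; NaN never flags as an outlier.
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
summary["outliers"] = ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).sum()
print(summary)
```

One row per column, four numbers each, is usually enough to catch a botched join or a coercion that NaN-ed out half a field.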
FAQs
Is dropping rows with missing data okay? Sometimes. If >5% of rows are affected, prefer imputation or flagging.
How do I clean data at scale? Use Pandas, dbt, or DuckDB pipelines under version control instead of one-off spreadsheets.
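The pipeline idea is just the checklist expressed as named, composable steps you can commit to git. A minimal pandas sketch (dbt and DuckDB express the same steps as SQL models; the column name here is hypothetical):

```python
import io
import pandas as pd

def trim_strings(df):
    """Strip leading/trailing whitespace from every string column."""
    for col in df.select_dtypes("object"):
        df[col] = df[col].str.strip()
    return df

def drop_dupes(df):
    """Deduplicate on the key column after normalization."""
    return df.drop_duplicates(subset=["email"])

# Inline data stands in for pd.read_csv("contacts.csv").
raw = io.StringIO("email\n a@x.com \na@x.com\n")
clean = pd.read_csv(raw).pipe(trim_strings).pipe(drop_dupes)
print(len(clean))  # 1
```

Because each step is a plain function, the pipeline is diffable, reviewable, and re-runnable — everything a one-off spreadsheet cleanup is not.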