DATA CLEANSING
SKY YIN
Photo credit: http://outofmygord.com/2015/04/08/the-messy-part-of-marketing/
DATA QUALITY ISSUES
MISSING DATA
▸ Null, empty string, 0, NA, N/A
▸ Find root cause
▸ Randomly missing vs. systematically missing
▸ Fix missing data
▸ Skip
▸ Fill
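A minimal pandas sketch of the skip/fill options above, using made-up data — note the first step normalizes every "missing" encoding (empty string, NA, N/A, 0) to NaN before choosing a fix:

```python
import numpy as np
import pandas as pd

# Hypothetical messy column: "missing" hides as "", "NA", "N/A", and 0
raw = pd.Series(["34", "", "NA", "0", "N/A", "29"], name="age")

# Normalize every missing encoding to NaN, then coerce to numbers
age = pd.to_numeric(
    raw.replace({"": np.nan, "NA": np.nan, "N/A": np.nan, "0": np.nan})
)

skipped = age.dropna()             # fix option 1: skip rows
filled = age.fillna(age.median())  # fix option 2: fill (here: median)
```

Whether 0 really means "missing" depends on the root cause — for a count column it may be a valid value, which is why finding the root cause comes first.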
DUPLICATED DATA
▸ Detect dups
▸ Unique count
▸ Root cause: bug or process or valid reason?
▸ Dups caused by typos, inconsistent formats, spelling variants, and abbreviations
▸ Be careful with things that look like dups but are actually different
▸ People with same names
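A small pandas sketch with made-up records: normalize format before detecting dups, but keep a second column in the dedup key so two different people with the same name survive:

```python
import pandas as pd

# Hypothetical table: one company typed three ways, plus two distinct Jane Does
users = pd.DataFrame({
    "name": ["Acme Corp.", "acme corp", "ACME CORP", "Jane Doe", "Jane Doe"],
    "city": ["NYC", "NYC", "NYC", "Boston", "Austin"],
})

# Normalize case/punctuation so format variants collapse to one value
users["name_norm"] = users["name"].str.lower().str.rstrip(".").str.strip()

# Dedup on (name, city), not name alone: same-name people are NOT dups
deduped = users.drop_duplicates(subset=["name_norm", "city"])
```

`users["name_norm"].nunique()` gives the unique count that the bullet above suggests as a first dup check.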
OUTLIERS
▸ Outlier detection
▸ Histogram is your friend
▸ Dealing with outliers
▸ Bug or exception
▸ Corrupted data
▸ Intentional wrong input: age, gender, post code
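A sketch of simple outlier detection on fabricated ages — a histogram shows where the mass sits, and an IQR rule (one common choice, not the only one) flags points for inspection rather than silent deletion:

```python
import numpy as np

# Hypothetical ages: 999 is an intentional wrong input, -5 a bug
ages = np.array([22, 25, 31, 28, 35, 41, 29, 999, -5, 33])

# Histogram is your friend: most values share one bin, outliers sit alone
counts, edges = np.histogram(ages, bins=10)

# 1.5 * IQR rule: flag values far outside the interquartile range
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
mask = (ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)
outliers = ages[mask]
```

Each flagged value still needs a decision — bug, corruption, or intentional wrong input — before you drop or correct it.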
SUBTLE PROBLEMS
▸ Order in records
▸ Always sort. Don’t assume order
▸ Hidden link across records
▸ Duplicated session end bug
▸ Need rule-based detection
▸ Don’t know what you don’t know
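A sketch of both points on a made-up event log: sort explicitly instead of assuming input order, then apply a rule-based check for the duplicated-session-end bug (here: flag any session whose end count differs from its start count):

```python
from collections import Counter

# Hypothetical event log: arrives out of order, and s1 has a duplicated end
events = [
    {"session": "s1", "ts": 105, "type": "end"},
    {"session": "s1", "ts": 100, "type": "start"},
    {"session": "s1", "ts": 105, "type": "end"},   # hidden link: dup end
    {"session": "s2", "ts": 200, "type": "start"},
    {"session": "s2", "ts": 230, "type": "end"},
]

# Always sort; never assume the records arrived in order
events.sort(key=lambda e: (e["session"], e["ts"]))

# Rule-based detection: a healthy session has matching start/end counts
counts = Counter((e["session"], e["type"]) for e in events)
suspect = {s for (s, t) in counts
           if counts[(s, "end")] != counts[(s, "start")]}
```

Rules like this only catch bugs you already know about — hence "don't know what you don't know."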
BEYOND ISSUES
▸ Transforming
▸ Encoding
▸ Local time <—> UTC time
▸ Tidy data/normalization
▸ Storage optimization: Parquet, ORC
▸ Flexibility optimization: JSON
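The local ↔ UTC transform above, sketched with the standard-library `zoneinfo` (Python 3.9+) and an invented timestamp — the usual practice is to store UTC and convert to local only for display:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

# Hypothetical log timestamp recorded in US Pacific local time (PDT, UTC-7)
local = datetime(2015, 6, 1, 9, 30, tzinfo=ZoneInfo("America/Los_Angeles"))

utc = local.astimezone(ZoneInfo("UTC"))                   # local -> UTC
back = utc.astimezone(ZoneInfo("America/Los_Angeles"))    # UTC -> local
```

Keeping timestamps timezone-aware end to end avoids the ambiguity that naive local times introduce around DST transitions.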
TOOLS
EXPLORATORY CLEANSING
▸ R: dataframe, data.table, dplyr
▸ Python: pandas, IPython notebook
▸ OpenRefine
▸ Trifacta
PRODUCTION CLEANSING
▸ ETL
▸ Hadoop-based: Pig, Scalding
▸ Spark (can also be used for exploratory cleansing)
▸ ETL management
▸ AWS Data Pipeline
▸ Airbnb's Airflow
USING MACHINE LEARNING TO CLEANSE DATA
▸ Clustering
▸ Use similarity to find dups
▸ Use similarity to find differences
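A toy similarity sketch on invented names using the standard-library `SequenceMatcher` (a stand-in for whatever similarity measure a real clustering step would use) — pairs above a threshold become dup candidates:

```python
from difflib import SequenceMatcher

# Hypothetical customer names: one typo near-dup, one abbreviation variant
names = ["Jonathan Smith", "Jonathon Smith", "Acme Inc", "ACME Incorporated"]

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Pair up records whose similarity crosses a threshold
pairs = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if similarity(a, b) > 0.8
]
```

Note the threshold matters: the typo pair is caught here, but the abbreviation pair ("Acme Inc" vs "ACME Incorporated") scores lower and would need token-based similarity or a lower cutoff.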
PRACTICES
GENERAL PRACTICES
▸ Data pipeline to automate the process
▸ Sushi principle: prefer raw data
▸ Prefer immutable data over mutable
▸ Reproducible: scripts vs tools
MINOR DETAILS
▸ Approximate unique: hyperloglog
▸ Avoid incremental update on counts
▸ Save change if space permitting (S3)
▸ Upsert instead of insert: plain insert is only safe on the first run
OPEN QUESTIONS
▸ Data versioning
▸ Data continuous validation
▸ Automated cleansing
