DATA CLEANSING
SKY YIN
Photo credit: http://outofmygord.com/2015/04/08/the-messy-part-of-marketing/
DATA QUALITY ISSUES
MISSING DATA
▸ Null, empty string, 0, NA, N/A
▸ Find root cause
▸ Randomly missing or systematically missing
▸ Fix missing data
▸ Skip
▸ Fill
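The skip and fill strategies above can be sketched in a few lines of plain Python. This is a minimal illustration, not the speaker's code: the marker set, the `city` field, and the helper names are assumptions made for the example. The value 0 is deliberately left out of the marker set, because whether 0 means "missing" has to be decided per column.

```python
# Assumed set of textual stand-ins for "no value"; 0 is excluded on purpose
# because it is often a legitimate measurement.
MISSING_MARKERS = {"", "na", "n/a", "null", "none"}

def normalize_missing(value):
    """Map the common textual stand-ins for 'no value' to None."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in MISSING_MARKERS:
        return None
    return value

def skip_missing(rows, field):
    """Strategy 1 (skip): drop rows where the field is missing."""
    return [r for r in rows if normalize_missing(r.get(field)) is not None]

def fill_missing(rows, field, default):
    """Strategy 2 (fill): keep every row, substituting a default for missing values."""
    return [
        {**r, field: default if normalize_missing(r.get(field)) is None else r[field]}
        for r in rows
    ]

rows = [
    {"id": 1, "city": "Taipei"},
    {"id": 2, "city": "N/A"},
    {"id": 3, "city": ""},
    {"id": 4, "city": None},
]
skipped = skip_missing(rows, "city")
filled = fill_missing(rows, "city", "unknown")
```

Skipping is the safer choice when values are missing at random; filling with a constant keeps row counts intact but can bias later statistics, so the root cause should be checked first.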
DUPLICATED DATA
▸ Detect dups
▸ Unique count
▸ Root cause: bug or process or valid reason?
▸ Dups caused by typos, inconsistent formats, spelling, and abbreviations
▸ Be careful with things that look like dups but are actually different
▸ People with same names
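A rough sketch of dup detection by normalized key follows. The record layout (`id`, `name`, `email`) and the helper names are hypothetical. Comparing unique counts before and after normalization shows how much duplication is only formatting noise, and adding a second key field illustrates the "people with the same name" caveat.

```python
import re
from collections import defaultdict

def normalize_text(text):
    """Lowercase, drop punctuation, and collapse whitespace so formatting variants compare equal."""
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(cleaned.split())

def find_candidate_dups(records, key_fields):
    """Group records by a normalized key; groups with more than one record are dup candidates."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(normalize_text(str(rec[f])) for f in key_fields)
        groups[key].append(rec)
    return {k: v for k, v in groups.items() if len(v) > 1}

records = [
    {"id": 1, "name": "John Smith", "email": "john@example.com"},
    {"id": 2, "name": "john  smith.", "email": "john@example.com"},
    {"id": 3, "name": "John Smith", "email": "jsmith@example.org"},
]

# Unique count on the raw column vs. on the normalized column.
raw_unique = len({r["name"] for r in records})
normalized_unique = len({normalize_text(r["name"]) for r in records})

# Name alone flags all three records; name plus email separates the namesake.
by_name = find_candidate_dups(records, ["name"])
by_name_and_email = find_candidate_dups(records, ["name", "email"])
```

Candidate groups are only candidates: record 3 shares a name with the others but has a different email, so it may well be a different person.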
OUTLIERS
▸ Outlier detection
▸ Histogram is your friend
▸ Dealing with outliers
▸ Bug or exception
▸ Corrupted data
▸ Intentionally wrong input: age, gender, post code
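As a small illustration of the histogram advice, the sketch below bins a made-up list of ages and then applies Tukey's interquartile-range rule as a second check. The data, the bin width, and the `k=1.5` multiplier are assumptions for the example; the slides do not prescribe a specific detection rule.

```python
import statistics
from collections import Counter

def text_histogram(values, bin_width):
    """Count values per fixed-width bin; returns {bin_start: count} sorted by bin start."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule; k=1.5 is a convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up ages with two suspicious entries (0 and 999).
ages = [29, 31, 32, 34, 35, 36, 38, 40, 41, 43, 0, 999]

hist = text_histogram(ages, 10)
for start, count in hist.items():
    print(f"{start:>4}+ {'#' * count}")

suspects = iqr_outliers(ages)
```

The histogram makes the isolated bins at 0 and 990 visible at a glance; whether they are bugs, corrupted records, or deliberately wrong input still has to be decided by looking at the source.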
SUBTLE PROBLEMS
▸ Order in records
▸ Always sort. Don’t assume order
▸ Hidden link across records
▸ Duplicated session end bug
▸ Need rule-based detection
▸ Don’t know what you don’t know
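The "always sort" and "rule-based detection" points can be combined in one sketch. The event layout (`session_id`, `ts`, `type`) and the rule "a session may end only once" are assumptions chosen to mirror the duplicated session end bug named above; a real pipeline would encode its own rules.

```python
from operator import itemgetter

def find_duplicated_session_ends(events):
    """Return every 'session_end' event beyond the first one for its session."""
    # Never trust arrival order: sort explicitly by session and timestamp first.
    ordered = sorted(events, key=itemgetter("session_id", "ts"))
    seen_end = set()
    extras = []
    for ev in ordered:
        if ev["type"] != "session_end":
            continue
        if ev["session_id"] in seen_end:
            extras.append(ev)  # rule violated: this session already ended
        else:
            seen_end.add(ev["session_id"])
    return extras

# Events arrive out of order, as they often do in logs.
events = [
    {"session_id": "b", "ts": 30, "type": "session_end"},
    {"session_id": "a", "ts": 10, "type": "session_start"},
    {"session_id": "a", "ts": 25, "type": "session_end"},
    {"session_id": "a", "ts": 20, "type": "session_end"},
    {"session_id": "b", "ts": 5, "type": "session_start"},
]
extras = find_duplicated_session_ends(events)
```

No single record here looks wrong on its own; the problem only appears when records of the same session are examined together, which is why per-row validation misses it.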
BEYOND ISSUES
▸ Transforming
▸ Encoding
▸ Local time <-> UTC time
▸ Tidy data/normalization
▸ Storage optimization: Parquet, ORC
▸ Flexibility optimization: JSON
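For the local time to UTC conversion, a minimal sketch with the standard library might look like this. It assumes the source system logs naive timestamps at a fixed UTC+8 offset; that offset and the function names are illustrative only. Zones with daylight saving time need a real time zone database (for example the `zoneinfo` module) rather than a fixed offset.

```python
from datetime import datetime, timedelta, timezone

# Assumption: logs are written in a zone with a fixed UTC+8 offset and no DST.
LOCAL_TZ = timezone(timedelta(hours=8))

def local_to_utc(naive_local):
    """Attach the known local offset to a naive timestamp, then convert to UTC."""
    return naive_local.replace(tzinfo=LOCAL_TZ).astimezone(timezone.utc)

def utc_to_local(aware_utc):
    """Convert an aware UTC timestamp back to the local offset, e.g. for display."""
    return aware_utc.astimezone(LOCAL_TZ)

logged = datetime(2016, 3, 1, 7, 30)   # naive local time as found in the log
as_utc = local_to_utc(logged)          # store and join on this value
round_trip = utc_to_local(as_utc)
```

Note that the conversion moves this record to the previous calendar day in UTC, which matters for any daily aggregation done after cleansing.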
TOOLS
EXPLORATORY CLEANSING
▸ R: data.frame, data.table, dplyr
▸ Python: pandas, IPython Notebook
▸ OpenRefine
▸ Trifacta
PRODUCTION CLEANSING
▸ ETL
▸ Hadoop-based: Pig, Scalding
▸ Spark (can also be used for exploratory cleansing)
▸ ETL management
▸ AWS Data Pipeline
▸ Airbnb Airflow
USE MACHINE LEARNING TO CLEANSE DATA
▸ Clustering
▸ Use similarity to find dups
▸ Use similarity to find difference
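The idea of clustering by similarity can be sketched without any ML library. The example below uses the standard library's `difflib.SequenceMatcher` as a simple stand-in for a similarity measure and a greedy single-pass grouping as a stand-in for a clustering algorithm; the threshold of 0.85 and the company names are made up for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Ratio in [0, 1]; 1.0 means the two strings are identical after lowercasing."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_by_similarity(names, threshold=0.85):
    """Greedy grouping: a name joins the first cluster whose first member is similar enough."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([name])
    return clusters

names = [
    "Acme Corporation",
    "ACME Corporation",
    "Acme Corporatoin",
    "Globex Inc",
    "Globex Inc.",
]
clusters = cluster_by_similarity(names)
```

Clusters with several members are dup candidates; the same scores read the other way (low similarity to everything else) point at records that differ from the rest. Pairwise comparison is quadratic, so larger datasets usually need blocking or an approximate method first.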
PRACTICES
GENERAL PRACTICES
▸ Data pipeline to automate the process
▸ Sushi principle: prefer raw data
▸ Prefer immutable over mutable
▸ Reproducible: scripts vs tools
MINOR DETAILS
▸ Approximate unique: HyperLogLog
▸ Avoid incremental update on counts
▸ Save changes if space permits (S3)
▸ Upsert instead of insert: a plain insert is only effective for the first run
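One reading of the upsert point is that a load job should be safe to rerun. The sketch below uses an in-memory SQLite table to contrast the two; the `daily_clicks` table and its values are hypothetical, and `INSERT OR REPLACE` is used as SQLite's simplest upsert form. It also writes recomputed totals rather than adding deltas, in the spirit of avoiding incremental updates on counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_clicks (day TEXT PRIMARY KEY, clicks INTEGER)")

def load_with_insert(rows):
    """Plain insert: fails as soon as a key already exists."""
    conn.executemany("INSERT INTO daily_clicks VALUES (?, ?)", rows)

def load_with_upsert(rows):
    """Upsert: a row with an existing key is replaced, so reruns are harmless."""
    conn.executemany("INSERT OR REPLACE INTO daily_clicks VALUES (?, ?)", rows)

batch = [("2016-03-01", 120), ("2016-03-02", 95)]
load_with_insert(batch)  # first run works

try:
    load_with_insert(batch)  # rerun of the same job hits the primary key
    insert_rerun_ok = True
except sqlite3.IntegrityError:
    insert_rerun_ok = False

# Totals recomputed from raw data after a fix; loading them twice changes nothing.
corrected = [("2016-03-01", 120), ("2016-03-02", 101)]
load_with_upsert(corrected)
load_with_upsert(corrected)
result = conn.execute("SELECT day, clicks FROM daily_clicks ORDER BY day").fetchall()
```

Because each run writes the full recomputed count for a key, running the job once or five times leaves the table in the same state, which an incremental `clicks = clicks + n` update would not.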
OPEN QUESTIONS
▸ Data versioning
▸ Data continuous validation
▸ Automated cleansing
