1st edition
November 4-5, 2018
Machine Learning School in Doha
1
QCRI
Data Profiling, Data Cleaning, Error Detection
Getting to ML Ready Data
Saravanan Thirumuruganathan
Scientist, QCRI
2
· @bigmlcom · @QatarComputing · #MLSD18 ·QCRI
Outline
• ML Ready data
• Data Preparation Workflow
• Error Detection and Repair
• Demo of Google Facets
3
ML Ready Data
4
ML Ready Data
Each row
• is one instance
• contains all information about this instance
5
ML Ready Data
Each column
• is a field that describes a property of the instance.
6
ML Ready Data
Features Instances Values
7
ML Ready Data : Don’t Dos
• Column headers are values, not variable names.
• Multiple variables are stored in one column.
• Variables are stored in both rows and columns.
8
Column headers are values
Source: Tidy Data by Hadley Wickham
9
Column headers are values
Source: Tidy Data by Hadley Wickham
10
Column headers are values
Source: Tidy Data by Hadley Wickham
Same data as the previous slide, with the column variable renamed to income and value renamed to freq.
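The reshaping behind this fix can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the code used in the Tidy Data paper; the `melt` helper, the sample rows and their counts are hypothetical and exist only to show the idea of turning column headers into values of an `income` variable.

```python
# Hypothetical wide table: each income bracket is a column header,
# so the headers are really values of an "income" variable.
wide = [
    {"religion": "Agnostic", "<$10k": 27, "$10-20k": 34, "$20-30k": 60},
    {"religion": "Atheist", "<$10k": 12, "$10-20k": 27, "$20-30k": 37},
]


def melt(rows, id_vars, var_name, value_name):
    """Turn every non-id column into its own (variable, value) row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column in id_vars:
                continue
            record = {key: row[key] for key in id_vars}
            record[var_name] = column      # the old header becomes a value
            record[value_name] = value     # the old cell becomes the measure
            long_rows.append(record)
    return long_rows


tidy = melt(wide, id_vars=["religion"], var_name="income", value_name="freq")
for record in tidy:
    print(record)
```

In pandas the same reshaping is available as `pandas.melt`, which takes `id_vars`, `var_name` and `value_name` arguments.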
11
ML Ready Data : Don’t Dos
• Column headers are values, not variable names.
• Multiple variables are stored in one column.
• Variables are stored in both rows and columns.
12
Multiple variables in 1 column
Source: Tidy Data by Hadley Wickham
13
Multiple variables in 1 column
Source: Tidy Data by Hadley Wickham
14
Multiple variables in 1 column
Source: Tidy Data by Hadley Wickham
15
Multiple variables in 1 column
Source: Tidy Data by Hadley Wickham
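A column that packs two variables together can be split with a small parser. The sketch below assumes a hypothetical encoding in which the first letter is the sex and the remaining digits are an age range (for example `m014`); the rows, the `AGE_LABELS` table and the `split_sex_age` helper are illustrative assumptions, not taken from the slides.

```python
import re

# Hypothetical rows where "column" packs sex and age range into one code.
rows = [
    {"country": "AD", "year": 2000, "column": "m014", "cases": 0},
    {"country": "AD", "year": 2000, "column": "f1524", "cases": 1},
    {"country": "AE", "year": 2000, "column": "m65", "cases": 10},
]

# Assumed mapping from the digit suffix to a readable age range.
AGE_LABELS = {"014": "0-14", "1524": "15-24", "65": "65+"}
CODE = re.compile(r"^(?P<sex>[mf])(?P<age>\d+)$")


def split_sex_age(row, source="column"):
    """Replace the packed code with separate 'sex' and 'age' fields."""
    match = CODE.match(row[source])
    if match is None:
        raise ValueError(f"unexpected code: {row[source]!r}")
    out = {key: value for key, value in row.items() if key != source}
    out["sex"] = match.group("sex")
    # Fall back to the raw digits when the suffix is not in the lookup table.
    out["age"] = AGE_LABELS.get(match.group("age"), match.group("age"))
    return out


tidy = [split_sex_age(row) for row in rows]
for record in tidy:
    print(record)
```

Raising on an unrecognised code, rather than guessing, keeps malformed values visible for the error detection step later in the workflow.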
16
ML Ready Data : Don’t Dos
• Column headers are values, not variable names.
• Multiple variables are stored in one column.
• Variables are stored in both rows and columns.
17
Variables in both rows and columns
Source: Tidy Data by Hadley Wickham
18
Variables in both rows and columns
Source: Tidy Data by Hadley Wickham
19
Variables in both rows and columns
Source: Tidy Data by Hadley Wickham
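The third item in the Don’t Dos list, variables stored in both rows and columns, needs two steps: unpivot the columns that hold values, then spread the rows that hold variable names. The sketch below assumes a hypothetical weather-style table in which day numbers appear as columns (`d1`, `d2`) and the `element` column holds variable names (`tmax`, `tmin`); the data and the `tidy_weather` helper are made up for illustration.

```python
# Hypothetical table: days are spread across columns (d1, d2) while the
# "element" column stores variable names (tmax, tmin) as row values.
raw = [
    {"id": "S1", "month": 1, "element": "tmax", "d1": 27.8, "d2": None},
    {"id": "S1", "month": 1, "element": "tmin", "d1": 14.5, "d2": None},
    {"id": "S1", "month": 2, "element": "tmax", "d1": None, "d2": 27.3},
    {"id": "S1", "month": 2, "element": "tmin", "d1": None, "d2": 14.4},
]


def tidy_weather(rows, day_prefix="d"):
    """Return one row per (id, month, day) with one column per element."""
    observations = {}
    for row in rows:
        for column, value in row.items():
            suffix = column[len(day_prefix):]
            is_day = column.startswith(day_prefix) and suffix.isdigit()
            if not is_day or value is None:
                continue  # skip id columns and days without a measurement
            key = (row["id"], row["month"], int(suffix))
            # The element name (tmax/tmin) becomes a column of the result.
            observations.setdefault(key, {})[row["element"]] = value
    return [
        {"id": station, "month": month, "day": day, **measures}
        for (station, month, day), measures in sorted(observations.items())
    ]


for record in tidy_weather(raw):
    print(record)
```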
20
Data Preparation
21
Data Preparation
Where is my “ML Ready” data?
22
Data Preparation
23
Bad News
Source: http://visit.crowdflower.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf
24
Day of a Data Scientist
– Mark Schreiber (Merck) reports that his data scientists spend 98% of their time, i.e. 39 hours/week, on grunt work and only 1 hour/week doing the job for which they were hired
– For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights (The New York Times, https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html)
– Reports vary between 60% and 80% of time spent on grunt work, but nobody reports less
25
Good News
Source: https://thodrek.github.io/di-ml/sigmod2018/slides/diml_sigmod2018.pdf
26
Good News
Data Preparation
Data Civilizer (QCRI)
27
Data Preparation Workflow
28
Data Preparation Workflow
• Data Discovery
• Data Integration
• Data Profiling
• Error Detection
• Error Repair
29
Data Discovery
Source: https://tmrresearchblog.com/data-discovery-tools-provide-one-time-query-business-processes/
30
Data Discovery
• Merck has 4000 Oracle databases + countless other repositories
• MIT Data Warehouse has 2400 tables
• GE has 75 procurement systems and supplier databases
• Your organisation has many databases and many repositories!
31
Data Discovery
Goal: predicting the change in stock price of a drug company
• the company stock variations in recent years
• the mentions of the company in social media channels
• the number of drugs about to be approved by the FDA
• the current productivity of the research department
Need: merge tables about drugs, stock performance, social media mentions, etc.
Source: Aurum: A Data Discovery System
32
Data Integration
Schema Alignment and Transformation
Source: Beaver: Towards a Declarative Schema Mapping
33
Data Integration
Source: Guoliang Li: Human-in-the-loop Data Integration
34
Data Integration
Protein Data Bank (>400 tables, >3000 columns)
Source: Beaver: Towards a Declarative Schema Mapping
35
Data Profiling
• Single and multi-column statistics
• Common patterns
• Data types and analysis
• Column dependencies
• Data dependencies
• Unique column combinations
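Two of the profiling tasks listed above, single-column statistics and unique column combinations, can be sketched with the standard library. This is a brute-force illustration under stated assumptions, not how the profiling tools cited in this deck work: the sample `table`, the `column_profile` and `unique_column_combinations` helpers and the `max_size` limit are all hypothetical, and `None` is treated as an ordinary value when testing uniqueness.

```python
from itertools import combinations

# Hypothetical table to profile.
table = [
    {"id": 1, "name": "Ann", "city": "Doha", "zip": "1001"},
    {"id": 2, "name": "Bob", "city": "Doha", "zip": "1001"},
    {"id": 3, "name": "Ann", "city": "Al Khor", "zip": "2002"},
    {"id": 4, "name": "Eve", "city": None, "zip": "2002"},
]


def column_profile(rows, column):
    """Basic single-column statistics: nulls, distinct values, value types."""
    values = [row[column] for row in rows]
    present = [value for value in values if value is not None]
    return {
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "types": sorted({type(value).__name__ for value in present}),
    }


def unique_column_combinations(rows, columns, max_size=2):
    """Minimal column sets (up to max_size) whose values identify every row."""
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(columns, size):
            if any(set(smaller) <= set(combo) for smaller in found):
                continue  # a subset is already unique, so combo is not minimal
            projected = {tuple(row[c] for c in combo) for row in rows}
            if len(projected) == len(rows):
                found.append(combo)
    return found


print(column_profile(table, "city"))
print(unique_column_combinations(table, ["id", "name", "city", "zip"]))
```

Enumerating every combination grows exponentially with the number of columns, so this only suits small tables; it is meant to show what the profiling output looks like, not how to compute it at scale.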
36
Data Profiling Tools in Industry
Source: Data Profiling Tutorial SIGMOD 2017 Abedjan et al
37
Data Profiling Tools in Research
Source: Data Profiling Tutorial SIGMOD 2017 Abedjan et al
38
Error Detection and Repair
39
Dirty Data
• Garbage in, garbage out!
Source: Dilbert : Bad Data
40
Dirty Data
• GE: Normalization of 75 procurement systems could save $100M a year!
• Dirty data costs the global banking industry over $400 billion
• Dirty data costs US businesses $600 billion a year
• Most companies lose 12% of revenue due to inaccurate, inconsistent and incomplete data
Source: (a) Data Integration: The Current Status and the Way Forward (b) https://www.marklogic.com/blog/the-staggering-impact-of-dirty-data/ (c) The Data Warehousing Institute (TDWI) (d) Experian Data Quality
41
Error Detection
• Duplicates / Entity Resolution
• Pattern violations
• Integrity Constraints / Business rule violations
• Outliers
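Of the four error classes above, outliers are the easiest to sketch in isolation. The example below uses the modified z-score based on the median absolute deviation (MAD), which is one common robust heuristic and not necessarily the method the speaker has in mind; the `mad_outliers` helper, the 3.5 threshold and the sample ages are assumptions for illustration.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation from the median.
    """
    center = median(values)
    mad = median([abs(value - center) for value in values])
    if mad == 0:
        return []  # more than half the values are identical; score undefined
    return [
        value for value in values
        if 0.6745 * abs(value - center) / mad > threshold
    ]


ages = [23, 25, 27, 29, 31, 33, 35, 230]  # hypothetical column with a typo
print(mad_outliers(ages))
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value would inflate the latter and could hide itself.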
42
Entity Resolution
• Find records referring to the same real-world entity
• GE : 75 procurement systems, 2M records
• Your company: a missed opportunity for a holistic picture of a client
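Finding records that refer to the same real-world entity usually starts with a similarity score between pairs of records. The sketch below is a deliberately naive version using `difflib.SequenceMatcher` from the standard library; the supplier names, the `normalize` rules and the 0.8 threshold are hypothetical, and comparing all pairs would not scale to millions of records without a blocking step.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical supplier records drawn from different procurement systems.
suppliers = [
    {"id": 1, "name": "General Electric Co."},
    {"id": 2, "name": "general electric company"},
    {"id": 3, "name": "Siemens AG"},
]


def normalize(name):
    """Lowercase, drop simple punctuation, and collapse whitespace."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def similarity(a, b):
    """String similarity in [0, 1] on the normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def candidate_duplicates(records, threshold=0.8):
    """Return (id, id, score) for every pair at or above the threshold."""
    pairs = []
    for left, right in combinations(records, 2):
        score = similarity(left["name"], right["name"])
        if score >= threshold:
            pairs.append((left["id"], right["id"], round(score, 2)))
    return pairs


print(candidate_duplicates(suppliers))
```

The output is a list of candidate pairs rather than a final decision; in practice such candidates are reviewed or passed to a trained matcher.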
43
Pattern Violations
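A pattern violation is a value that does not match the syntactic format expected for its column. A minimal sketch, assuming hypothetical formats (a five-digit zip code and a `ddd-ddd-dddd` phone number) that are not taken from the slide:

```python
import re

# Assumed formats for illustration; real patterns come from data profiling.
PATTERNS = {
    "zip": re.compile(r"\d{5}"),
    "phone": re.compile(r"\d{3}-\d{3}-\d{4}"),
}


def pattern_violations(rows, patterns=PATTERNS):
    """Return (row index, column, value) for every value breaking its pattern."""
    violations = []
    for index, row in enumerate(rows):
        for column, pattern in patterns.items():
            value = row.get(column)
            # Missing values are a separate error class and are skipped here.
            if value is not None and not pattern.fullmatch(value):
                violations.append((index, column, value))
    return violations


customers = [
    {"zip": "60612", "phone": "312-555-0100"},
    {"zip": "6061A", "phone": "312-555-0101"},
    {"zip": "60614", "phone": "3125550102"},
]
print(pattern_violations(customers))
```

`fullmatch` is used so that a value with extra leading or trailing characters is also reported.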
44
Integrity Constraints
Source: Qualitative Data Cleaning : Xu Chu and Ihab Ilyas
45
Integrity Constraints
Source: Qualitative Data Cleaning : Xu Chu and Ihab Ilyas
46
Integrity Constraints
Source: Qualitative Data Cleaning : Xu Chu and Ihab Ilyas
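One widely used kind of integrity constraint is a functional dependency such as zip → city: rows that agree on the zip code must agree on the city. The sketch below detects violations of a single-column dependency; the rule, the sample rows and the `fd_violations` helper are assumptions chosen for illustration and are not taken from the cited tutorial.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return {lhs value: conflicting rhs values} where lhs -> rhs is violated."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[lhs]].add(row[rhs])
    # A group with more than one rhs value breaks the dependency.
    return {
        key: sorted(values)
        for key, values in groups.items()
        if len(values) > 1
    }


addresses = [
    {"zip": "60612", "city": "Chicago"},
    {"zip": "60612", "city": "Chicago"},
    {"zip": "60612", "city": "Cicago"},
    {"zip": "10001", "city": "New York"},
]
print(fd_violations(addresses, lhs="zip", rhs="city"))
```

The check says that at least one row in each reported group is wrong, but not which one; choosing the value to keep is the repair problem.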
47
Data Repair
Source: Qualitative Data Cleaning : Xu Chu and Ihab Ilyas
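Once violations are detected, a repair step chooses new values that satisfy the constraints. The sketch below uses a simple majority vote within each group of a functional dependency such as zip → city. This is one naive heuristic, named here as an assumption and not the repair algorithm from the cited tutorial; the `repair_by_majority` helper and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def repair_by_majority(rows, lhs, rhs):
    """Set rhs to the most frequent value among rows sharing the same lhs.

    Groups with a tie for the most frequent value are left unchanged,
    because the vote gives no basis for preferring one value.
    """
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[lhs]][row[rhs]] += 1

    repaired = []
    for row in rows:
        ranked = counts[row[lhs]].most_common(2)
        best_value, best_count = ranked[0]
        tie = len(ranked) > 1 and ranked[1][1] == best_count
        fixed = dict(row)  # copy so the input rows are not modified
        if not tie:
            fixed[rhs] = best_value
        repaired.append(fixed)
    return repaired


addresses = [
    {"zip": "60612", "city": "Chicago"},
    {"zip": "60612", "city": "Chicago"},
    {"zip": "60612", "city": "Cicago"},
    {"zip": "10001", "city": "New York"},
]
for record in repair_by_majority(addresses, lhs="zip", rhs="city"):
    print(record)
```

Majority voting assumes that errors are rarer than correct values within a group, which may not hold when the same wrong value was copied many times.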
48
NADEEF
Source: NADEEF: A Commodity Data Cleaning System
49
Data Cleaning
• Talk to us if you have any questions about data cleaning!
50
Data Profiling
Demo of Google Facets : https://pair-code.github.io/facets/
51