DATA CLEANING & DATA QUALITY
Admin
¨ Sign-up for presentations
¤ Volunteers for next week
¨ Projects
¤ Finalize project topics by end of the month
2
Poor Data Quality
Poor data quality is a serious problem for organizations
¤ Incorrect decision making
q “multiple versions of the truth”, lack of user confidence in data
¤ Operational inefficiencies and mistakes
q Major Canadian bank faxed sensitive customer data to US junk yard
(incorrect extra digit in fax number)
¤ Waste of time and money [Gartner Research Report]
q Organizations are each losing an average of $8.2M annually
q More than 25% of critical data in Fortune 1000 companies will
continue to be inaccurate and incomplete
3
Examples
Class EDUC MR OCC REL SAL
private Bachelor Married Exec Husband >50k
private Bachelor Married Prof Husband >50k
self-emp Masters Married Exec Wife >50k
? HS Divorced ? Not-fam <50k
self-emp Masters Married Admin Wife <50k
nev-work Primary Single ? child >50k
? Bachelor Divorced ? Not-fam >50k
Sample from US Census
City # Beds # Baths Monthly Rent
Toronto, ON 1 1 $1000
Toronto, ON 1 1 $900
Toronto 1 1 $380
Toronto, Ohio, USA?
About: IBM
An Entity of Type: Public company
LocationCity: United States
Armonk, New York
NumberOfEmployees: 426751
DateOfDeath: 1956
4
Causes of Inconsistencies
¨ Human Error
¨ Weak enforcement of constraints (or lack of constraints)
¨ Results of data integration or information extraction
¨ Violation of constraints due to errors or schema/data evolution
C1: [Salary < $100k] → [rate = 5%]
C2: [Salary > $100k] → [rate = 3%]
ClientID Name NetWorth Salary OtherProducts Amt Rate
225 Jane Walters $300k $85k $15k $150k 5%
420 Tracy Smith $800k $150k $90k $350k 3%
333 Jen Wright $100k $65k $0 $200k 5%
128 Ron Maples $235k $70k $45k $200k 4%
5
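To make the violation concrete, here is a minimal sketch (not from the slides) that encodes C1 and C2 as a Python check over the client table above; the column names and values are copied from the slide, and the treatment of a salary of exactly $100k, which neither rule covers, is an assumption made explicit in the code.

```python
# Minimal sketch: check rules C1 and C2 against the client table above.
clients = [
    {"ClientID": 225, "Name": "Jane Walters", "Salary": 85_000,  "Rate": 0.05},
    {"ClientID": 420, "Name": "Tracy Smith",  "Salary": 150_000, "Rate": 0.03},
    {"ClientID": 333, "Name": "Jen Wright",   "Salary": 65_000,  "Rate": 0.05},
    {"ClientID": 128, "Name": "Ron Maples",   "Salary": 70_000,  "Rate": 0.04},
]

def expected_rate(salary):
    # C1: [Salary < $100k] -> [rate = 5%];  C2: [Salary > $100k] -> [rate = 3%]
    # Neither rule covers Salary == $100k, so return None in that case.
    if salary < 100_000:
        return 0.05
    if salary > 100_000:
        return 0.03
    return None

for c in clients:
    expected = expected_rate(c["Salary"])
    if expected is not None and c["Rate"] != expected:
        print(f"Client {c['ClientID']} ({c['Name']}): rate {c['Rate']:.0%}, "
              f"expected {expected:.0%}")
# -> flags client 128 (Ron Maples), whose 4% rate violates C1.
```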
Objective
¨ Integrity constraints are the primary means for preserving data
integrity
¨ Use constraints as the benchmark for defining conditions the data
should satisfy
¨ Constraints may be stale, or better constraints may be available
to meet the application needs
Leverage the power of constraints to improve data quality.
6
New Semantic Constraints
Functional Dependency (FD) F: X → Y
An instance I satisfies F (I ⊨ F) if, for every pair of tuples t1, t2 in I: if t1[X] = t2[X], then t1[Y] = t2[Y].
F1: [Income] → [Privileges]
F2: [Amt] → [Status]
ClientID Income Privileges Country Amt Status
225 Low None USA $150k Approved
420 High Full USA $350k Approved
333 Middle Limited USA $200k Approved
128 Middle Limited USA $200k Approved
193 Middle Limited UK $0 Pending
550 Low None UK $0 Pending
243 High Full UK $200k Incomplete
376 High Full UK $175k Approved
Conditional Functional Dependencies (CFD) [Maher97, Bohannon+07]
[Country = ‘USA’, Amt] → [Status]
[Status = ‘Pending’] → [Amt = $0]
7
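As a rough illustration (not code from the readings), an FD X → Y can be checked by grouping tuples on X and testing whether Y is single-valued within each group; a CFD with a constant pattern such as [Country = ‘USA’, Amt] → [Status] is checked the same way after first restricting to the tuples that match the pattern. The helper name fd_violations and the row subset below are my own.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return groups of rows that agree on all `lhs` attributes
    but take more than one value on the `rhs` attributes."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].append(row)
    return [grp for grp in groups.values()
            if len({tuple(r[a] for a in rhs) for r in grp}) > 1]

rows = [  # subset of the table above
    {"ClientID": 193, "Country": "UK",  "Amt": 0,       "Status": "Pending"},
    {"ClientID": 243, "Country": "UK",  "Amt": 200_000, "Status": "Incomplete"},
    {"ClientID": 333, "Country": "USA", "Amt": 200_000, "Status": "Approved"},
    {"ClientID": 128, "Country": "USA", "Amt": 200_000, "Status": "Approved"},
]

# Plain FD F2: [Amt] -> [Status] is violated: Amt = $200k maps to two statuses.
print(fd_violations(rows, ["Amt"], ["Status"]))

# CFD [Country = 'USA', Amt] -> [Status]: apply the FD only where the
# constant pattern Country == 'USA' holds; here there is no violation.
usa_rows = [r for r in rows if r["Country"] == "USA"]
print(fd_violations(usa_rows, ["Amt"], ["Status"]))   # -> []
```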
Data and Constraint Repair
ClientID Income OtherProd Rate Country City State
225 High A 2.1% USA San Francisco MD
420 High A 2.1% USA San Francisco CA
333 High B 3.0% USA San Francisco CA
576 High B 3.0% USA San Francisco CA
128 Low C 4.5% UK Reading Berkshire
193 Low C 4.5% UK London London
550 Low B 3.5% UK London London
F1: [Income] → [Rate] ∧ F2: [Country, City] → [State]
1) Repair the data or the constraints?
2) How to find the repairs?
8
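For question 2, one simple and decidedly naive heuristic, sketched below under my own assumptions rather than taken from the readings, is to repair right-hand-side values by majority vote within each group of tuples that agree on the left-hand side of a violated FD. Whether to instead relax the constraint (for example, conclude that F1 needs an extra attribute on its left-hand side) is exactly question 1.

```python
from collections import Counter, defaultdict

def repair_rhs_by_majority(rows, lhs, rhs):
    """Naive repair: within each group of rows agreeing on `lhs`,
    overwrite minority `rhs` values with the most frequent one."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].append(row)
    for group in groups.values():
        majority = Counter(row[rhs] for row in group).most_common(1)[0][0]
        for row in group:
            row[rhs] = majority
    return rows

rows = [  # subset of the table above, violating F2: [Country, City] -> [State]
    {"ClientID": 225, "Country": "USA", "City": "San Francisco", "State": "MD"},
    {"ClientID": 420, "Country": "USA", "City": "San Francisco", "State": "CA"},
    {"ClientID": 333, "Country": "USA", "City": "San Francisco", "State": "CA"},
]
repair_rhs_by_majority(rows, ["Country", "City"], "State")
# -> client 225's State is rewritten from "MD" to "CA"
```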
Outline for Today
¨ Review and discuss this week’s readings
¨ Advice for giving presentations
¨ Project topics
9
This Week’s Papers
1. E. Rahm, H. Do. Data Cleaning: Problems and Current
Approaches. IEEE Data Eng. Bull. 23(4), pp. 3-13, 2000.
2. W. Fan, F. Geerts. Foundations of Data Quality Management.
Synthesis Lectures on Data Management, Morgan & Claypool, 2012.
10
Data Quality and Data Integration
[Figure: four data sources S1-S4, each recording the same person under a different name: Josephine Smith, Jo Smith, J. Smith, Joe Smith]
11
Single vs. Multi-Source
What examples can you provide for each type of
problem?
12
Data Cleaning Steps
1. Data Analysis
¤ Identify the inconsistencies
2. Defining workflow and rules
¤ What is the data cleaning process? What are the data
quality rules? Can we programmatically define the
transformation process?
3. Verification
¤ Validate that the rules and the data cleaning process are
correct. How do we validate? Suggestions?
4. Execution
13
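One way to make steps 2 and 3 concrete, sketched here with made-up rule names and thresholds rather than anything from the paper, is to express each data quality rule as a predicate and run a verification pass that counts violations on a sample before executing the full transformation.

```python
# Each rule is a (name, predicate) pair; a record passes when the predicate is True.
rules = [
    ("age_in_range",   lambda r: r.get("age") is not None and 0 <= r["age"] <= 120),
    ("rent_positive",  lambda r: r.get("rent", 0) > 0),
    ("city_not_blank", lambda r: bool(r.get("city", "").strip())),
]

def verify(sample, rules):
    """Verification step: count how often each rule is violated on a sample."""
    return {name: sum(not pred(r) for r in sample) for name, pred in rules}

sample = [
    {"age": 34, "rent": 1000, "city": "Toronto, ON"},
    {"age": -2, "rent": 900,  "city": "Toronto, ON"},
    {"age": 51, "rent": 380,  "city": ""},
]
print(verify(sample, rules))
# -> {'age_in_range': 1, 'rent_positive': 0, 'city_not_blank': 1}
```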
Data Analysis: Example
14
Example 1
[Bar chart: number of cars sold by colour, over the categories black, blue, bule, orange, red, green, white, yellow. Credit: David Corne]
15
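The check behind this kind of chart can be as simple as a category frequency count: a rare value such as "bule" (almost certainly a misspelling of "blue") stands out immediately. The sales numbers below are invented for illustration.

```python
from collections import Counter

colours = (["black"] * 510 + ["blue"] * 320 + ["bule"] * 12 + ["red"] * 280 +
           ["white"] * 430 + ["green"] * 90 + ["orange"] * 60 + ["yellow"] * 45)

counts = Counter(colours)
# Flag categories that account for less than 2% of records as likely typos.
rare = [c for c, n in counts.items() if n < 0.02 * len(colours)]
print(rare)   # -> ['bule']
```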
Ex 2: What problems do you observe?
[Histogram: online supermarket customers by age, binned as <0, 0-10, 11-20, ..., 81-90, >90]
16
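A matching sketch for the age field: a simple domain-range check flags the impossible values that land in the <0 bin, and any implausibly large ones. The bounds below are assumptions for illustration.

```python
ages = [34, 27, -1, 45, 61, 208, 52, 0, 73]

LOW, HIGH = 0, 110   # assumed plausible range for a customer's age
suspect = [a for a in ages if not (LOW <= a <= HIGH)]
print(suspect)       # -> [-1, 208]: candidates for correction or removal
```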
Tools
¨ Many tools available to help…
¨ Data analysis tools (identify potential errors)
¤ Commercial tools
¨ Domain specific tools
¤ Addresses, customer relationship management, energy
usage data, etc.
¨ Duplicate Elimination
¨ ETL
¤ Good at providing aggregated data summaries
17
Let’s Discuss
1. What do you think is the main contribution of this
paper? What did you take away?
2. Strengths?
3. Weaknesses?
4. Improvements?
18
¨ W. Fan, F. Geerts. Foundations of Data Quality
Management. Synthesis Lectures on Data
Management.
19
Data Quality
¨ DBMSs focus on efficient processing of large
quantities of data, but not on its quality.
¤ If we provide poor quality data, we will get poor
quality results.
¨ Real data is often dirty, duplicated and inconsistent
20
Examples
¨ US: Pentagon asked 200+ dead officers to re-enlist
¨ UK: there are 81M national insurance numbers but
only 60M people are eligible
¨ Australia: 500K dead people retain active medical
cards
¨ Typical data error rate in industry: 1%-5%
Source: F. Geerts, W. Fan, X. Jia
21
Lots of Statistics
¨ Poor data costs companies $600B annually
¨ Incorrectly priced data costs retailers in the US
$2.5B annually
¨ 30% - 80% of time spent on data integration
projects is for data cleaning
¨ The market for data quality tools is growing at
17% annually (vs. 7% for other IT segments)
22
Central Issues in DQ
¨ Data Consistency: whether the data contains errors or
conflicts that arise as violations of rules.
¤ Example: age = 82 and age = 28 for the same patient
¤ Different types of rules may be used to judge consistency
¨ Accuracy: how close a value (representing an entity) is
to the true value.
¤ Example: age <= 200 vs. age = 45
23
Data Quality Issues (cont’d)
¨ Completeness: whether a query can be answered given
the information available.
¤ Example: age = null (missing value) in a patient record, or
missing patient record (missing tuple)
¤ What is the degree of incompleteness?
¨ Currency: whether the data is too outdated to answer a
query.
¤ Example: Census results in 2013, based on old address and
demographic information from 2010
¤ We want timely data! Can you think of an example where
timeliness doesn’t matter?
24
Data Quality Issues (cont’d)
¨ Deduplication: whether two records refer to the
same entity
¤ Example: If Jane Smith and Joan Smith have the same
email address, are they referring to the same person?
25
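A small sketch of that question: treat two records as candidate duplicates when their e-mail addresses match exactly and their names are similar enough. difflib is in the Python standard library; the 0.8 similarity threshold is an assumption, not a value from the readings.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_duplicates(records, threshold=0.8):
    """Pairs of records sharing an e-mail address and having similar names."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            r, s = records[i], records[j]
            if (r["email"] == s["email"]
                    and name_similarity(r["name"], s["name"]) >= threshold):
                pairs.append((r["name"], s["name"]))
    return pairs

records = [
    {"name": "Jane Smith", "email": "jsmith@example.com"},
    {"name": "Joan Smith", "email": "jsmith@example.com"},
    {"name": "Jo Smith",   "email": "jo@other.example"},
]
print(candidate_duplicates(records))   # -> [('Jane Smith', 'Joan Smith')]
```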
Improving DQ with Constraints
¨ Constraints are needed to help improve DQ. How?
¨ Types of errors:
¤ Syntactic: values do not conform to type, length, or domain
range [easier to identify]
¤ Semantic: data values deviate from the true values of the
entity they represent [harder]
¨ Constraints model the intended application semantics
¤ Many different types of constraints exist
¤ Identify patterns, similarity, containment of tuples,
timeliness of data
26
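The syntactic/semantic split can be illustrated with a toy check (the column spec and postal-code pattern below are assumptions): type and format violations are caught mechanically, while a well-formed but wrong value, such as age 28 recorded instead of 82, passes and needs semantic constraints to be caught.

```python
import re

POSTAL_CODE = re.compile(r"^[A-Z]\d[A-Z] ?\d[A-Z]\d$")   # Canadian format

def syntactic_errors(record):
    """Flag values that fail type or format (domain) checks."""
    errors = []
    if not isinstance(record.get("age"), int):
        errors.append("age: not an integer")
    if not POSTAL_CODE.match(record.get("postal_code", "")):
        errors.append("postal_code: bad format")
    return errors

print(syntactic_errors({"age": "45", "postal_code": "M5S 1A1"}))
# -> ['age: not an integer']; a wrong-but-well-formed age would pass this check.
```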
Constraint Considerations
¨ Discovering data quality rules
¨ Identifying errors
¨ Repairing the data (or the constraints)
27
Interactions Between DQ Issues
¨ These issues clearly do not occur in isolation!
¨ Let’s consider a case study…
28
Let’s Discuss
1. What do you think is the main contribution of this
paper? What did you take away?
2. Strengths?
3. Weaknesses?
4. Improvements?
Do you think any one of the properties is
more important than the others?
29
Case Study
¨ The International Seismological Centre (ISC)
(located in the UK)
¨ It’s a charity funded by various governments.
¨ Their role is to be the repository for recording all
earthquake events on the planet.
Credit: David Corne
30
Gathering Data
[Diagram: data gathering centres around the world feeding data to the ISC]
31
Data Delivery
¨ Raw seismograph data flows from local collection points to
‘data gathering’ centres
¨ It is then sent on to the ISC
¨ Some data gathering centres provide raw data, some
provide interpreted data
¤ Some will send partial data if they believe it’s erroneous
32
Data Integration
¨ Integrate the raw data from the incoming sources
¨ The ISC’s job is actually to determine:
¤ where and when the Earth tremors were
¤ Hundreds per month around the world
¤ Reverse engineer from seismograph readings
33
Analysis
¨ Where can errors arise in data …
Accuracy, Consistency, Currency, Completeness, Duplicates
What other factors are important to consider?
34
Presentation Tips
35
Giving Presentations
Things to avoid:
¨ Letting the audience lose focus on you (the presenter)
¤ Do not make the slides the focus of the audience
¨ Losing the attention of the audience altogether
¨ Disrespecting the audience – they can read and don’t
need a voice-over of the words on the slides
36
Advice
1) Limit the number of slides. Guideline: about 2-3
minutes per slide
2) Limit the words on the page
3) Use imagery: pictures help the audience
remember and you will need to describe what the
picture is about
37
Advice (cont’d)
4) A presentation should aim to focus on three major
points. Any more and the audience may forget.
o Make them clear, make them relevant and, most of all… make
them.
5) Make it personal: if you inject a personal touch it will
connect with the audience, and you are likely to be
more engaged and animated.
6) Use stories: an effective way to get a message across.
Make sure the story is relevant to the key message you
are conveying, and keep it short.
38
Project Topics
¨ You are encouraged to come up with your own ideas,
and discuss them with me.
¨ Send me a ½ page proposal idea
¨ Finalize a topic in the next few weeks
¨ I will propose a few projects if needed.
39
