SlideShare a Scribd company logo
From Data Quality to Big Data Quality:
A Data Integration Scenario
Carlo Batini*, Anisa Rula+
*University of Milano-Bicocca
+Department of Information Engineering,
University of Brescia
Data Quality and Data Integration in a 3-dimensional
Data Space
2
- Relational data
- Traditional information systems
Data Types
Big Data
lifecycle
Data Quality Clusters
1
2
3
Big Data Lifecycle
Source selection
Discovery
Retrieval
Browsing
Internet of Things Design
Extraction
Sampling
Modeling
Filtering
Creation
Management
Interpretation
Modeling
Semantic enrichement
Profiling
Metadata management
Persistence
Representation transformation
Matching
Entity resolution (or record linkage)
Integration
Semantic Alignement
Fusion
Abstraction
Refinement
Restructuring
Quality assessment
Elementary data
Statistics
Quality improvement
Elementary data
Statistics
Learning
Supervised
Unsupervised
Reinforcement
Interpretation
Validation
Hypothesis testing
Statistical analysis
Sensitivity analysis
Model validation
Value driven analysis
Correlation
Mining
Discovery
Learning
Supervised
Unsupervised
Reinforcement
Prediction (predictive modeling)
Forecasting
Visualization
Optimization
Acquisition
Analysis
1
Data Quality and Data Integration among the 3Vs
• Volume
• Velocity
• Variety
Data Quality Dimension Clusters*
1. Accuracy, correctness, validity and precision focus on the adherence of data to a given reality of interest. Including time
related issues
2. Completeness, pertinence and relevance refer to the capability of representing all and only the relevant aspects of the
reality of interest
3. Redundancy, minimality, compactness and conciseness refer to the capability of representing the reality of interest
with the minimal use of informative resources
4. Readability, comprehensibility, clarity and simplicity refer to ease of understanding of data by users.
5. Accessibility and availability are related to the ability of the user to access data from his or her culture, physical
status/functions, and technologies available
6. Consistency, cohesion and coherence refer to the capability of data to comply without contradictions to all properties
of the reality of interest, as specified in terms of integrity constraints, data edits, business rules and other formalisms
7. Trust, including believability, reliability and reputation, catching how much data derive from an authoritative source
Relational data
Traditional information systems
DQ Clusters &
cluster dimensions
2
* From Data Quality to Big Data Quality. J. Database Manag. 2015. C. Batini, A. Rula, M. Scannapieco, G. Viscusi
Role of Structural Characteristics – Linked Data
How do data quality dimensions and metrics evolve in the transformation
process from tabular to linked data?
Data type Structural characteristics
Linked Data
• Dereferenceable Resource
• SPARQL Endpoint
• RDF dump
• Interlinking
• Licensing
Relational data
Types of Big Data
3
Enables us managing the huge variability of methods and techniques
needed to manage data quality in linked data
Dimensions and their Evolution
Completeness
Structural characteristics
Dereferenceable
Resource
SPARQL
Endpoint
RDF dumps Interlinking Licensing
Quality
dimension
Accessibility
Resource
accessibility
Dataset
accessibility
Browsing
accessibility
Integration
accessibility
Reuse
accessibility
LD completenesses
dimensions
Accuracy
LD accuracy
dimensions
Toward a Cost Effectiveness Model
Management
Interpretation
Modeling
Semantic enrichement
Profiling
Metadata management
Persistence
Representation transformation
Matching
Entity resolution (or record linkage)
Integration
Semantic Alignement
Fusion
Abstraction
Refinement
Restructuring
Quality assessment
Elementary data
Statistics
Quality improvement
Elementary data
Statistics
…
From relational to RDF
Consider the four steps together
Two Possible Workflows
P…. G….
alppg ttygb
gnnn null
P… G….
alpbg ttygt
gnggr eryuui
Record
linkage
Transfor-
mation &
Alignment
P…. G….
alppg ttygb
gnnn null
kkkyyt vvvy
Quality management
P…. G….
Alppg ttygb
Gnnn null
P… G….
alpbg ttygt
gnggr eryuui
Transfor-
mation
Quality management
Align-
ment
Strategy of Workflow 1
10
• Exploit syntactic distance between concepts
• Differentiate distance:
–for alphanumeric strings use edit distance
–for lists of items use Jaccard distance
P…. G….
alppg ttygb
gnnn null
P… G….
alpbg ttygt
gnggr eryuui
Record
linkage
Transfor-
mation &
Alignment
P…. G….
alppg ttygb
gnnn null
kkkyyt vvvy
Quality management
Running Example – Relational Data
11
Id Denomination Istat Code Type
126759 Torgian 54053 Historic Place
125007 Saint Peter Abbey 246 Building
Id Label Location Type
1 Torgiano Torgiano Populated Place
2 Saint Peter Sicily Historic Region
3 Basilica of Saint Peter null Religious Building
Tourist attraction
Place
Running Example – Attributes Selection
12
Id Denomination Istat Code Type
126759 Torgian 54053 Historic Place
125007 Saint Peter Abbey 246 Building
Id Label Location Type
1 Torgiano Torgiano Populated Place
2 Saint Peter Sicily Historic Region
3 Basilica of Saint Peter null Religious Building
Tourist attraction
Place
First pair – Edit Distance
13
Id Denomination Istat Code Type
126759 Torgian 54053 Historic Place
125007 Saint Peter Abbey 246 Building
Id Label Location Type
1 Torgiano Torgiano Populated Place
2 Saint Peter Sicily Historic Region
3 Basilica of Saint Peter null Religious Building
Tourist attraction
Place
Edit-d = 1
YES!
Second Pair – Jaccard Distance
14
Id Denomination Istat Code Type
126759 Torgian 54053 Historic Place
125007 Saint Peter Abbey 246 Building
Id Label Location Type
1 Torgiano Torgiano Populated Place
2 Saint Peter Sicily Historic Region
3 Basilica of Saint Peter null Religious Building
Tourist attraction
Place
Jaccard =
1 - 2/3 =
1/3
YES!
Third Pair – Jaccard Distance
15
Id Denomination Istat Code Type
126759 Torgian 54053 Historic Place
125007 Saint Peter Abbey 246 Building
Id Label Location Type
1 Torgiano Torgiano Populated Place
2 Saint Peter Sicily Historic Region
3 Basilica of Saint Peter null Religious Building
Tourist attraction
Place
Jaccard =
1 - 2/5 =
3/5
NO!
Workflow 1 – Outcome
16
Id Denomination Istat Code Type
126759 Torgian 54053 Historic Place
125007 Saint Peter Abbey 246 Building
Id Label Location Type
1 Torgiano Torgiano Populated Place
2 Saint Peter Sicily Historic Region
3 Basilica of Saint Peter null Religious Building
Tourist attraction
Place
Final outcome:
1 pair true positive
1 pair false positive
1 pair false negative
Completeness
assessment:
1 unsolvable null
Strategy of Workflow 2
17
• Exploit semantic distance between concepts
• Improve completeness through DBpedia
P…. G….
alppg ttygb
gnnn null
P… G….
alpbg ttygt
gnggr eryuui
Transformation/
Semantic
Enrichment
Alignment/
Matching
http://dbpedia.org/resource/St._Peter
“Saint Peter”
label
location
http://dbpedia.org/resource/Sicily
37
lat
14
long
http://example/resource/ 125970
246 “Saint Peter Abbey”
istatCode label
Match/
Non-Match
Id Denomination Istat Code Type
126759 Torgian 54053 Historic
Place
125007 Saint Peter Abbey 246 Building
Id Label Location Type
1 Torgiano Torgiano Populated Place
2 Saint Peter Sicily Historic Region
3 Basilica of Saint
Peter
null Religious
Building
Running Example – Model Transformation & Alignment 1
18
http://example/resource/126759
“Torgian”
54053
type
istatCode
denomination
HistoricPlace
PopulatedPlace
http://dbpedia.org/resource/Torgiano
“Torgiano”
type
label
Tourist attraction
Place
PopulatedPlace
http://dbpedia.org/resource/Torgiano
“Torgiano”
type
label
http://example/resource/126759
Id Denomination Istat Code Type
126759 Torgian 54053 Historic
Place
125007 Saint Peter Abbey 246 Building
Id Label Location Type
1 Torgiano Torgiano Populated Place
2 Saint Peter Sicily Historic Region
3 Basilica of Saint
Peter
null Religious
Building
“Torgian”
54053
type
istatCode
label
HistoricPlace
Running Example – Model Transformation & Alignment 1
Alignement
Tourist attraction
Place
PopulatedPlace
http://dbpedia.org/resource/Torgiano
“Torgiano”
type
label
http://example/resource/126759
Id Denomination Istat Code Type
126759 Torgian 54053 Historic
Place
125007 Saint Peter Abbey 246 Building
Id Label Location Type
1 Torgiano Torgiano Populated Place
2 Saint Peter Sicily Historic Region
3 Basilica of Saint
Peter
null Religious
Building
“Torgian”
54053
type
istatCode
label
HistoricPlace
Running Example – Model Transformation & Alignment 1
20
Yes!
Tourist attraction
Place
Id Label Location Type
1 Torgiano Torgiano Populated Place
2 Saint Peter Sicily Historic Region
3 Basilica of Saint
Peter
null Religious
Building
Running Example – Model Transformation & Alignment 2
21
Id Denomination Istat Code Type
126759 Torgian 54053 Historic
Place
125007 Saint Peter Abbey 246 Building
http://example/resource/ 125970
246 “Saint Peter Abbey”
istatCode
label
http://dbpedia.org/resource/St._Peter_Basilica
“Basilica of Saint Peter”
label
Tourist attraction
Place
http://dbpedia.org/resource/St._Peter
“Saint Peter”
label
location
http://dbpedia.org/resource/Sicily
37
lat
14
long
http://dbpedia.org/resource/St._Peter
http://example/resource/ 125970
246 “Saint Peter Abbey”
istatCode
label
“Saint Peter”
label
ArchitecturalStructure HistoricPlace
Monument
PopulatedPlace
Building
ReligiousBuilding
Place
…
Region
AdministrativeRegion HistoricRegion
HistoricBuilding
DBpedia Ontology
type
Running Example – Typing
22
location
http://dbpedia.org/resource/Sicily
http://dbpedia.org/resource/St._Peter_Basilica
“Basilica of Saint Peter”
label
http://dbpedia.org/resource/Vatican_City
location
41
lat
12
long
37
lat
14
long
type
type
ArchitecturalStructure HistoricPlace
Monument
PopulatedPlace
Building
ReligiousBuilding
Place
…
Region
AdministrativeRegion HistoricRegion
HistoricBuilding
DBpedia Ontology
http://example/resource/ 125970
246 “Saint Peter Abbey”
istatCode
label
type
Running Example – typing
23
http://dbpedia.org/resource/St._Peter_Basilica
“Basilica of Saint Peter”
label
http://dbpedia.org/resource/Vatican_City
location
41
lat
12
long
http://dbpedia.org/resource/St._Peter
“Saint Peter”
label location
http://dbpedia.org/resource/Sicily
37
lat
14
long
type
type
Distance = 5
No!
ArchitecturalStructure HistoricPlace
Monument
PopulatedPlace
Building
ReligiousBuilding
Place
…
Region
AdministrativeRegion HistoricRegion
HistoricBuilding
type
DBpedia Ontology
Distance = 1
http://example/resource/ 125970
http://dbpedia.org/resource/St._Peter
246 “Saint Peter Abbey”
istatCode
label
“Saint Peter”
label
type
Workflow 2 – Outcome
24
location
http://dbpedia.org/resource/Sicily
http://dbpedia.org/resource/St._Peter_Basilica
“Basilica of Saint Peter”
label
http://dbpedia.org/resource/Vatican_City
location
41
lat
12
long
37
lat
14
long
type
Yes!
Final outcome:
2 pairs true positive
1 pair true negative
0 pairs false positive
0 pair false negative
Solving Missing Data
25
http://dbpedia.org/resource/St._Peter_Basilica
“Basilica of Saint Peter”
label
http://dbpedia.org/resource/Vatican_City
location
41
lat
12
long
Id Label Location Type
1 Torgiano Torgiano Populated Place
2 Saint Peter Sicily Historic Region
3 Basilica of Saint
Peter
null Religious Building
Place
Id Label Location Type
1 Torgiano Torgiano Populated Place
2 Saint Peter Sicily Historic Region
3 Basilica of Saint
Peter
Vatican City Religious Building
Conclusions and Futur Works
26
• Integration followed by transformation if no domain knowledge is needed
• Transformation followed by integration is better when:
– Aligning attributes by using existing vocabularies
– Distance between concepts is defined
– Enriching the dataset with the missing values
• Data integration of may depend on how good is the transformation
outcome
• Define it as an optimization problem
- Define functions for measuring the cost and the gain
- The cost and gain are different for different integration models
Thank you! Questions?
27

More Related Content

Recently uploaded

Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
EverAndrsGuerraGuerr
 
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdfMASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
goswamiyash170123
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Special education needs
 
South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)
Academy of Science of South Africa
 
JEE1_This_section_contains_FOUR_ questions
JEE1_This_section_contains_FOUR_ questionsJEE1_This_section_contains_FOUR_ questions
JEE1_This_section_contains_FOUR_ questions
ShivajiThube2
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
Jisc
 
"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
SACHIN R KONDAGURI
 
Pride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School DistrictPride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School District
David Douglas School District
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
camakaiclarkmusic
 
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat  Leveraging AI for Diversity, Equity, and InclusionExecutive Directors Chat  Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
TechSoup
 
Digital Artifact 1 - 10VCD Environments Unit
Digital Artifact 1 - 10VCD Environments UnitDigital Artifact 1 - 10VCD Environments Unit
Digital Artifact 1 - 10VCD Environments Unit
chanes7
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
EduSkills OECD
 
Azure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHatAzure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHat
Scholarhat
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
Celine George
 
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Dr. Vinod Kumar Kanvaria
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
vaibhavrinwa19
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
Jean Carlos Nunes Paixão
 
Normal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of LabourNormal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of Labour
Wasim Ak
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
MysoreMuleSoftMeetup
 

Recently uploaded (20)

Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdfMASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
 
South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)
 
JEE1_This_section_contains_FOUR_ questions
JEE1_This_section_contains_FOUR_ questionsJEE1_This_section_contains_FOUR_ questions
JEE1_This_section_contains_FOUR_ questions
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 
"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
 
Pride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School DistrictPride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School District
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
 
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat  Leveraging AI for Diversity, Equity, and InclusionExecutive Directors Chat  Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
 
Digital Artifact 1 - 10VCD Environments Unit
Digital Artifact 1 - 10VCD Environments UnitDigital Artifact 1 - 10VCD Environments Unit
Digital Artifact 1 - 10VCD Environments Unit
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
 
Azure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHatAzure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHat
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
 
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
 
Normal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of LabourNormal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of Labour
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
 

Featured

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
Marius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
Expeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
Pixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
ThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
marketingartwork
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
Skeleton Technologies
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
Kurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
SpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Lily Ray
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
Rajiv Jayarajah, MAppComm, ACC
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Christy Abraham Joy
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
Vit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
MindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
RachelPearson36
 

Featured (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

From Data Quality to Big Data Quality: A Data Integration Scenario

  • 1. From Data Quality to Big Data Quality: A Data Integration Scenario Carlo Batini*, Anisa Rula+ *University of Milano-Bicocca +Department of Information Engineering, University of Brescia
  • 2. Data Quality and Data Integration in a 3-dimensional Data Space 2 - Relational data - Traditional information systems Data Types Big Data lifecycle Data Quality Clusters 1 2 3
  • 3. Big Data Lifecycle Source selection Discovery Retrieval Browsing Internet of Things Design Extraction Sampling Modeling Filtering Creation Management Interpretation Modeling Semantic enrichement Profiling Metadata management Persistence Representation transformation Matching Entity resolution (or record linkage) Integration Semantic Alignement Fusion Abstraction Refinement Restructuring Quality assessment Elementary data Statistics Quality improvement Elementary data Statistics Learning Supervised Unsupervised Reinforcement Interpretation Validation Hypothesis testing Statistical analysis Sensitivity analysis Model validation Value driven analysis Correlation Mining Discovery Learning Supervised Unsupervised Reinforcement Prediction (predictive modeling) Forecasting Visualization Optimization Acquisition Analysis 1
  • 4. Data Quality and Data Integration among the 3Vs • Volume • Velocity • Variety
  • 5. Data Quality Dimension Clusters* 1. Accuracy, correctness, validity and precision focus on the adherence of data to a given reality of interest. Including time related issues 2. Completeness, pertinence and relevance refer to the capability of representing all and only the relevant aspects of the reality of interest 3. Redundancy, minimality, compactness and conciseness refer to the capability of representing the reality of interest with the minimal use of informative resources 4. Readability, comprehensibility, clarity and simplicity refer to ease of understanding of data by users. 5. Accessibility and availability are related to the ability of the user to access data from his or her culture, physical status/functions, and technologies available 6. Consistency, cohesion and coherence refer to the capability of data to comply without contradictions to all properties of the reality of interest, as specified in terms of integrity constraints, data edits, business rules and other formalisms 7. Trust, including believability, reliability and reputation, catching how much data derive from an authoritative source Relational data Traditional information systems DQ Clusters & cluster dimensions 2 * From Data Quality to Big Data Quality. J. Database Manag. 2015. C. Batini, A. Rula, M. Scannapieco, G. Viscusi
  • 6. Role of Structural Characteristics – Linked Data How do data quality dimensions and metrics evolve in the transformation process from tabular to linked data? Data type Structural characteristics Linked Data • Dereferenceable Resource • SPARQL Endpoint • RDF dump • Interlinking • Licensing Relational data Types of Big Data 3 Enables us managing the huge variability of methods and techniques needed to manage data quality in linked data
  • 7. Dimensions and their Evolution Completeness Structural characteristics Dereferenceable Resource SPARQL Endpoint RDF dumps Interlinking Licensing Quality dimension Accessibility Resource accessibility Dataset accessibility Browsing accessibility Integration accessibility Reuse accessibility LD completenesses dimensions Accuracy LD accuracy dimensions
  • 8. Toward a Cost Effectiveness Model Management Interpretation Modeling Semantic enrichement Profiling Metadata management Persistence Representation transformation Matching Entity resolution (or record linkage) Integration Semantic Alignement Fusion Abstraction Refinement Restructuring Quality assessment Elementary data Statistics Quality improvement Elementary data Statistics … From relational to RDF Consider the four steps together
  • 9. Two Possible Workflows P…. G…. alppg ttygb gnnn null P… G…. alpbg ttygt gnggr eryuui Record linkage Transfor- mation & Alignment P…. G…. alppg ttygb gnnn null kkkyyt vvvy Quality management P…. G…. Alppg ttygb Gnnn null P… G…. alpbg ttygt gnggr eryuui Transfor- mation Quality management Align- ment
  • 10. Strategy of Workflow 1 10 • Exploit syntactic distance between concepts • Differentiate distance: –for alphanumeric strings use edit distance –for lists of items use Jaccard distance P…. G…. alppg ttygb gnnn null P… G…. alpbg ttygt gnggr eryuui Record linkage Transfor- mation & Alignment P…. G…. alppg ttygb gnnn null kkkyyt vvvy Quality management
  • 11. Running Example – Relational Data 11 Id Denomination Istat Code Type 126759 Torgian 54053 Historic Place 125007 Saint Peter Abbey 246 Building Id Label Location Type 1 Torgiano Torgiano Populated Place 2 Saint Peter Sicily Historic Region 3 Basilica of Saint Peter null Religious Building Tourist attraction Place
  • 12. Running Example – Attributes Selection 12 Id Denomination Istat Code Type 126759 Torgian 54053 Historic Place 125007 Saint Peter Abbey 246 Building Id Label Location Type 1 Torgiano Torgiano Populated Place 2 Saint Peter Sicily Historic Region 3 Basilica of Saint Peter null Religious Building Tourist attraction Place
  • 13. First pair – Edit Distance 13 Id Denomination Istat Code Type 126759 Torgian 54053 Historic Place 125007 Saint Peter Abbey 246 Building Id Label Location Type 1 Torgiano Torgiano Populated Place 2 Saint Peter Sicily Historic Region 3 Basilica of Saint Peter null Religious Building Tourist attraction Place Edit-d = 1 YES!
  • 14. Second Pair – Jaccard Distance 14 Id Denomination Istat Code Type 126759 Torgian 54053 Historic Place 125007 Saint Peter Abbey 246 Building Id Label Location Type 1 Torgiano Torgiano Populated Place 2 Saint Peter Sicily Historic Region 3 Basilica of Saint Peter null Religious Building Tourist attraction Place Jaccard = 1 - 2/3 = 1/3 YES!
  • 15. Third Pair – Jaccard Distance 15 Id Denomination Istat Code Type 126759 Torgian 54053 Historic Place 125007 Saint Peter Abbey 246 Building Id Label Location Type 1 Torgiano Torgiano Populated Place 2 Saint Peter Sicily Historic Region 3 Basilica of Saint Peter null Religious Building Tourist attraction Place Jaccard = 1 - 2/5 = 3/5 NO!
  • 16. Workflow 1 – Outcome 16 Id Denomination Istat Code Type 126759 Torgian 54053 Historic Place 125007 Saint Peter Abbey 246 Building Id Label Location Type 1 Torgiano Torgiano Populated Place 2 Saint Peter Sicily Historic Region 3 Basilica of Saint Peter null Religious Building Tourist attraction Place Final outcome: 1 pair true positive 1 pair false positive 1 pair false negative Completeness assessment: 1 unsolvable null
  • 17. Strategy of Workflow 2 17 • Exploit semantic distance between concepts • Improve completeness through DBpedia P…. G…. alppg ttygb gnnn null P… G…. alpbg ttygt gnggr eryuui Transformation/ Semantic Enrichment Alignment/ Matching http://dbpedia.org/resource/St._Peter “Saint Peter” label location http://dbpedia.org/resource/Sicily 37 lat 14 long http://example/resource/ 125970 246 “Saint Peter Abbey” istatCode label Match/ Non-Match
  • 18. Id Denomination Istat Code Type 126759 Torgian 54053 Historic Place 125007 Saint Peter Abbey 246 Building Id Label Location Type 1 Torgiano Torgiano Populated Place 2 Saint Peter Sicily Historic Region 3 Basilica of Saint Peter null Religious Building Running Example – Model Transformation & Alignment 1 18 http://example/resource/126759 “Torgian” 54053 type istatCode denomination HistoricPlace PopulatedPlace http://dbpedia.org/resource/Torgiano “Torgiano” type label Tourist attraction Place
  • 19. PopulatedPlace http://dbpedia.org/resource/Torgiano “Torgiano” type label http://example/resource/126759 Id Denomination Istat Code Type 126759 Torgian 54053 Historic Place 125007 Saint Peter Abbey 246 Building Id Label Location Type 1 Torgiano Torgiano Populated Place 2 Saint Peter Sicily Historic Region 3 Basilica of Saint Peter null Religious Building “Torgian” 54053 type istatCode label HistoricPlace Running Example – Model Transformation & Alignment 1 Alignement Tourist attraction Place
  • 20. PopulatedPlace http://dbpedia.org/resource/Torgiano “Torgiano” type label http://example/resource/126759 Id Denomination Istat Code Type 126759 Torgian 54053 Historic Place 125007 Saint Peter Abbey 246 Building Id Label Location Type 1 Torgiano Torgiano Populated Place 2 Saint Peter Sicily Historic Region 3 Basilica of Saint Peter null Religious Building “Torgian” 54053 type istatCode label HistoricPlace Running Example – Model Transformation & Alignment 1 20 Yes! Tourist attraction Place
  • 21. Id Label Location Type 1 Torgiano Torgiano Populated Place 2 Saint Peter Sicily Historic Region 3 Basilica of Saint Peter null Religious Building Running Example – Model Transformation & Alignment 2 21 Id Denomination Istat Code Type 126759 Torgian 54053 Historic Place 125007 Saint Peter Abbey 246 Building http://example/resource/ 125970 246 “Saint Peter Abbey” istatCode label http://dbpedia.org/resource/St._Peter_Basilica “Basilica of Saint Peter” label Tourist attraction Place http://dbpedia.org/resource/St._Peter “Saint Peter” label location http://dbpedia.org/resource/Sicily 37 lat 14 long
  • 22. http://dbpedia.org/resource/St._Peter http://example/resource/ 125970 246 “Saint Peter Abbey” istatCode label “Saint Peter” label ArchitecturalStructure HistoricPlace Monument PopulatedPlace Building ReligiousBuilding Place … Region AdministrativeRegion HistoricRegion HistoricBuilding DBpedia Ontology type Running Example – Typing 22 location http://dbpedia.org/resource/Sicily http://dbpedia.org/resource/St._Peter_Basilica “Basilica of Saint Peter” label http://dbpedia.org/resource/Vatican_City location 41 lat 12 long 37 lat 14 long type type
  • 23. ArchitecturalStructure HistoricPlace Monument PopulatedPlace Building ReligiousBuilding Place … Region AdministrativeRegion HistoricRegion HistoricBuilding DBpedia Ontology http://example/resource/ 125970 246 “Saint Peter Abbey” istatCode label type Running Example – typing 23 http://dbpedia.org/resource/St._Peter_Basilica “Basilica of Saint Peter” label http://dbpedia.org/resource/Vatican_City location 41 lat 12 long http://dbpedia.org/resource/St._Peter “Saint Peter” label location http://dbpedia.org/resource/Sicily 37 lat 14 long type type Distance = 5 No!
  • 24. ArchitecturalStructure HistoricPlace Monument PopulatedPlace Building ReligiousBuilding Place … Region AdministrativeRegion HistoricRegion HistoricBuilding type DBpedia Ontology Distance = 1 http://example/resource/ 125970 http://dbpedia.org/resource/St._Peter 246 “Saint Peter Abbey” istatCode label “Saint Peter” label type Workflow 2 – Outcome 24 location http://dbpedia.org/resource/Sicily http://dbpedia.org/resource/St._Peter_Basilica “Basilica of Saint Peter” label http://dbpedia.org/resource/Vatican_City location 41 lat 12 long 37 lat 14 long type Yes! Final outcome: 2 pairs true positive 1 pair true negative 0 pairs false positive 0 pair false negative
  • 25. Solving Missing Data 25 http://dbpedia.org/resource/St._Peter_Basilica “Basilica of Saint Peter” label http://dbpedia.org/resource/Vatican_City location 41 lat 12 long Id Label Location Type 1 Torgiano Torgiano Populated Place 2 Saint Peter Sicily Historic Region 3 Basilica of Saint Peter null Religious Building Place Id Label Location Type 1 Torgiano Torgiano Populated Place 2 Saint Peter Sicily Historic Region 3 Basilica of Saint Peter Vatican City Religious Building
  • 26. Conclusions and Futur Works 26 • Integration followed by transformation if no domain knowledge is needed • Transformation followed by integration is better when: – Aligning attributes by using existing vocabularies – Distance between concepts is defined – Enriching the dataset with the missing values • Data integration of may depend on how good is the transformation outcome • Define it as an optimization problem - Define functions for measuring the cost and the gain - The cost and gain are different for different integration models