Sieve
                    Linked Data
                 Quality Assessment
                    and Fusion


                                               Pablo N. Mendes
                                              Hannes Mühleisen
                                                 Christian Bizer

                                         With contributions from:
Andreas Schultz, Andrea Matteini, Christian Becker, Robert Isele
“sieve”



“A sieve, or sifter, separates wanted elements
from unwanted material using a woven screen
such as a mesh or net.”
                       Source: http://en.wikipedia.org/wiki/Sieve
What is Linked Data?

•   Raw data (RDF)
•   Accessible on the Web
•   Data can link to other data sources

[Figure: five data sources, A to E, each containing Things; data links connect Things in neighbouring sources]




•   Benefits: Ease of access and re-use; enables discovery
Linking Open Data Cloud




http://lod-cloud.net
Linked Data Challenges
•   Data providers have different intentions, experience/knowledge
    •   data may be inaccurate, outdated, spam, etc.

•   Data sources that overlap in content may use...
    •   ... different RDF schemata
    •   ... different identifiers for the same real-world entity
    •   ... conflicting values for properties

•   Integrating public datasets with internal databases poses the
    same problems
An Architecture for Linked Data Applications
LDIF – Linked Data Integration Framework
    1     Collect data: Managed download and update

    2     Translate data into a single target vocabulary

    3     Resolve identifier aliases into local target URIs

    4     Assess quality, filter bad results, resolve conflicts

    5     Output


•       Open source (Apache License, Version 2.0)
•       Collaboration between Freie Universität Berlin and mes|semantics
LDIF Pipeline

1   Collect data
2   Translate data
3   Resolve identities
4   Filter and fuse
5   Output

Step 1, Collect data: supported data sources
•   RDF dumps (various formats)
•   SPARQL Endpoints
•   Crawling Linked Data
LDIF Pipeline

Step 2, Translate data: data sources use a wide range of different RDF vocabularies

[Figure: R2R maps dbpedia-owl:City, schema:Place, and fb:location.citytown to local:City]

•   Mappings expressed in RDF (Turtle)
•   Simple mappings using OWL / RDFS statements
    (x rdfs:subClassOf y)
•   Complex mappings with SPARQL expressivity
•   Transformation functions
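
The simple class-to-class mappings on this slide can be pictured with a minimal Python sketch. This is not R2R syntax (R2R mappings are written in Turtle); the `CLASS_MAPPINGS` table and the `translate` function are hypothetical names used only for illustration, and triples are modelled as plain tuples of prefixed names.

```python
# Hypothetical mapping table taken from the slide's figure:
# each source class is rewritten to the single target class local:City.
CLASS_MAPPINGS = {
    "dbpedia-owl:City": "local:City",
    "schema:Place": "local:City",
    "fb:location.citytown": "local:City",
}

RDF_TYPE = "rdf:type"


def translate(triples, mappings=CLASS_MAPPINGS):
    """Rewrite rdf:type objects into the target vocabulary.

    Triples whose predicate is not rdf:type, or whose class has no
    mapping, are passed through unchanged.
    """
    translated = []
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE and obj in mappings:
            obj = mappings[obj]
        translated.append((subject, predicate, obj))
    return translated


if __name__ == "__main__":
    source = [
        ("ex:Berlin", "rdf:type", "dbpedia-owl:City"),
        ("ex:Berlin", "rdfs:label", "Berlin"),
    ]
    for triple in translate(source):
        print(triple)
```

Complex mappings and transformation functions, as listed above, would need more than a lookup table; the sketch covers only the simplest case.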
LDIF Pipeline

Step 3, Resolve identities: data sources use different identifiers for the same entity

[Figure: Silk compares "Berlin" with the candidates Berlin, Germany; Berlin, CT; Berlin, MD; Berlin, NJ; Berlin, MA and concludes Berlin = Berlin, Germany]

•   Profiles expressed in XML
•   Supports various comparators and
    transformations
LDIF Pipeline

Step 4, Filter and fuse: sources provide different values for the same property

[Figure: four source values for Total Area (891.85 km², 891.82 km², 891.82 km², 891.85 km²) and a Quality input feed into Sieve, which outputs a single Total Area of 891.85 km²]

•   Profiles expressed in XML
•   Supports various scoring and fusion functions
LDIF Pipeline

Step 5, Output

•   Output options:
    •   N-Quads
    •   N-Triples
    •   SPARQL Update Stream
•   Provenance tracking using Named Graphs
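
To show how a named graph can carry provenance in the N-Quads output, here is a minimal Python sketch that serializes one statement as an N-Quads line. It is not LDIF's writer; the function names and the `example.org` IRIs are placeholders, and only plain string literals are handled (no datatypes or language tags).

```python
def nquads_term(value, literal=False):
    """Render an IRI as <...> or a plain literal as "..." with basic escaping."""
    if not literal:
        return f"<{value}>"
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def to_nquad(subject, predicate, obj, graph, literal=False):
    """Serialize subject, predicate, object, and graph IRI as one N-Quads line.

    The fourth term names the graph the statement came from, which is
    what lets a consumer trace the statement back to its source.
    """
    terms = [
        nquads_term(subject),
        nquads_term(predicate),
        nquads_term(obj, literal),
        nquads_term(graph),
    ]
    return " ".join(terms) + " ."


if __name__ == "__main__":
    print(
        to_nquad(
            "http://example.org/Berlin",
            "http://example.org/totalArea",
            "891.85",
            "http://example.org/graph/source1",
            literal=True,
        )
    )
```

Dropping the fourth term gives the corresponding N-Triples line, at the cost of losing the link to the source graph.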
An Architecture for Linked Data Applications




[Figure: architecture diagram highlighting the Data Quality and Fusion Module]
Data Fusion


“fusing multiple records representing the same
real-world object into a single, consistent, and
clean representation”
(Bleiholder & Naumann, 2008)
Conflict resolution strategies

•   Independent of quality assessment metrics
    •   Pick most frequent (democratic voting)
    •   Average, max, min, concatenation
    •   Within interval
•   Based on task-specific quality assessment
    •   Keep highest scored
    •   Keep all that pass a threshold
    •   Trust some sources over others
    •   Weighted voting
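
The strategies listed above can be sketched in a few lines of Python. This is an illustration under assumptions, not Sieve's FusionFunction interface: every function name is hypothetical, and quality scores are assumed to be numbers in [0, 1] supplied alongside each value.

```python
from collections import Counter
from statistics import mean


# Strategies that ignore quality scores.

def most_frequent(values):
    """Democratic voting: return the value that occurs most often."""
    return Counter(values).most_common(1)[0][0]


def average(values):
    """Resolve a numeric conflict by averaging all candidate values."""
    return mean(values)


# Strategies that use quality scores; input is a list of (value, score) pairs.

def keep_highest_scored(scored_values):
    """Keep the single value with the highest quality score."""
    return max(scored_values, key=lambda pair: pair[1])[0]


def pass_threshold(scored_values, threshold):
    """Keep every value whose score reaches the threshold."""
    return [value for value, score in scored_values if score >= threshold]


def weighted_vote(scored_values):
    """Voting in which each occurrence counts with its quality score."""
    totals = {}
    for value, score in scored_values:
        totals[value] = totals.get(value, 0.0) + score
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Values from the Total Area example; the scores are made up.
    scored = [(891.85, 0.9), (891.82, 0.4), (891.82, 0.3)]
    print(keep_highest_scored(scored))
    print(pass_threshold(scored, 0.5))
    print(weighted_vote(scored))
```

Note that plain voting and score-based selection can disagree on the same input: here 891.82 occurs more often, while 891.85 carries the highest score.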
Data Fusion

•   Input:
    •   (Potentially) conflicting data
    •   Quality metadata describing input
•   Execution:
    •   Use existing or custom FusionFunctions
•   Output:
    •   Clean data, according to user’s definition of clean
Configuration: Data Fusion
Sieve: Quality Assessment
•   Quality as “fitness for use”:
    •   Subjective:
        •   what is good enough for me might not be good enough for you
    •   Task-dependent:
        •   temperature: planning a weekend vs. running a biology experiment
    •   Multidimensional:
        •   even correct data may be outdated or not available

    •   Requires task-specific quality assessment.
Data Quality - Conceptual Framework
                            Dimension
                            Accuracy
                            Consistency
                            Objectivity
                            Timeliness
                            Validity
                            Believability
                            Completeness
                            Understandability
                            Relevancy
                            Reputation
                            Verifiability
                            Amount of Data
                            Interpretability
                            Representational Conciseness
                            Representational Consistency
                            Availability
                            Response Time
                            Security
Configuration: Quality Assessment

•   Quality Assessment Metrics are composed of:
    •   ScoringFunction (generically applicable to given data types)
    •   Quality Indicator as input (adaptable to use case)

•   Output: a score in [0;1]
    •   Describes input within a quality dimension,
        according to a user’s definition of quality
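
As a minimal sketch of the idea, the Python function below turns one quality indicator (a last-update date) into a score in [0, 1] for the Timeliness dimension. It is an assumption-laden illustration, not one of Sieve's built-in scoring functions: the name `recency_score`, the linear decay, and the `max_age_days` parameter are all hypothetical.

```python
from datetime import date


def recency_score(last_updated, today, max_age_days):
    """Map a last-update date to a score in [0, 1].

    1.0 means updated today (or later); the score decays linearly and
    reaches 0.0 once the data is max_age_days old or older.
    max_age_days is where the user's own definition of "recent enough"
    enters the metric.
    """
    age_days = (today - last_updated).days
    if age_days <= 0:
        return 1.0
    if age_days >= max_age_days:
        return 0.0
    return 1.0 - age_days / max_age_days


if __name__ == "__main__":
    today = date(2012, 3, 1)
    # The same indicator gives different scores under different tolerances.
    print(recency_score(date(2012, 2, 20), today, max_age_days=20))
    print(recency_score(date(2012, 2, 20), today, max_age_days=365))
```

The scoring logic stays generic while the indicator and the parameter adapt it to the task, which mirrors the split between ScoringFunction and Quality Indicator above.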
Configuration: Quality Assessment
More about Sieve

•   Software: open source (Apache License, Version 2.0)
•   Scoring Functions and Fusion Functions can be extended
    •   Scala/Java interface, methods score/fuse and fromXML


•   Quality scores can be stored and shared with other
    applications
•   Website: http://sieve.wbsg.de
    •   Documentation, examples, downloads, support
Use Case
[Figure: multiple data sources (complementary, heterogeneous) yield conflicting values and quality indicators (multidimensional, task-dependent); a user config with conflict resolution strategies turns them into the fused result. Voilà!]
Evaluating Quality of Data Integration

•   Completeness
    •   How many cities did we find?
    •   How many of the properties did we fill with values?
•   Conciseness
    •   How much redundancy is there in the object identifiers?
    •   How much redundancy is there in the property values?
•   Consistency
    •   How many conflicting values are there?
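
The questions above can be turned into simple counts. The Python sketch below is one possible reading, not the evaluation code behind the talk: the metric definitions, the function names, and the toy dataset (apart from the Berlin area values shown earlier) are assumptions made for illustration.

```python
def completeness(entities, prop):
    """Fraction of entities that have at least one value for prop."""
    if not entities:
        return 0.0
    filled = sum(1 for props in entities.values() if props.get(prop))
    return filled / len(entities)


def conciseness(entities, prop):
    """Distinct values divided by all values for prop, counted per entity.

    1.0 means no redundant values; lower means more repetition.
    """
    total = sum(len(props.get(prop, [])) for props in entities.values())
    distinct = sum(len(set(props.get(prop, []))) for props in entities.values())
    return distinct / total if total else 1.0


def conflicts(entities, prop):
    """Number of entities with more than one distinct value for prop."""
    return sum(
        1 for props in entities.values() if len(set(props.get(prop, []))) > 1
    )


if __name__ == "__main__":
    # Toy data: ex:CityB and ex:CityC are placeholders with made-up values.
    cities = {
        "ex:Berlin": {"totalArea": [891.85, 891.82, 891.82, 891.85]},
        "ex:CityB": {"totalArea": [100.0]},
        "ex:CityC": {},
    }
    print(completeness(cities, "totalArea"))
    print(conciseness(cities, "totalArea"))
    print(conflicts(cities, "totalArea"))
```

Running the same three functions before and after fusion gives a direct comparison of the integrated data against the original sources.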
Results




Generated data that is more complete, concise
 and consistent than in the original sources
Linked Data Application Architecture




My view on this data space can also be
         shared, and reused.

       We can “pay as we go”
THANK YOU!
•   Twitter: @pablomendes
•   E-mail: pablo.mendes@fu-berlin.de

•   Website: http://sieve.wbsg.de
•   Google Group: http://bit.ly/ldifgroup


    Supported in part by:
    Vulcan Inc. as part of its Project Halo
    EU FP7 projects:
    -LOD2 - Creating Knowledge out of Interlinked Data
    -PlanetData - A European Network of Excellence on Large-Scale Data Management

Sieve - Data Quality and Fusion - LWDM2012

  • 1. Sieve Linked Data Quality Assessment and Fusion Pablo N. Mendes Hannes Mühleisen Christian Bizer With contributions from: Andreas Schultz, Andrea Matteini, Christian Becker, Robert Isele
  • 2. “sieve” “A sieve, or sifter, separates wanted elements from unwanted material using a woven screen such as a mesh or net.” Source: http://en.wikipedia.org/wiki/Sieve
  • 3. What is Linked Data? • Raw data (RDF) • Accessible on the Web • Data can link to other data sources [Figure: data sources A to E, each containing Things, connected by data links] • Benefits: Ease of access and re-use; enables discovery
  • 4. Linking Open Data Cloud http://lod-cloud.net
  • 5. Linked Data Challenges • Data providers have different intentions, experience/knowledge • data may be inaccurate, outdated, spam etc. • Data sources that overlap in content may use… • ... different RDF schemata • ... different identifiers for the same real-world entity • …conflicting values for properties • Integrating public datasets with internal databases poses the same problems
  • 6. An Architecture for Linked Data Applications
  • 7. LDIF – Linked Data Integration Framework 1 Collect data: Managed download and update 2 Translate data into a single target vocabulary 3 Resolve identifier aliases into local target URIs 4 Assess quality, filter bad results, resolve conflicts 5 Output • Open source (Apache License, Version 2.0) • Collaboration between Freie Universität Berlin and mes|semantics
  • 8. LDIF Pipeline (1 Collect data, 2 Translate data, 3 Resolve identities, 4 Filter and fuse, 5 Output) Step 1, Collect data. Supported data sources: • RDF dumps (various formats) • SPARQL Endpoints • Crawling Linked Data
  • 9. LDIF Pipeline, step 2 (Translate data): Data sources use a wide range of different RDF vocabularies [Figure: R2R maps dbpedia-owl:City, schema:Place and fb:location.citytown to local:City] • Mappings expressed in RDF (Turtle) • Simple mappings using OWL / RDFS statements (x rdfs:subClassOf y) • Complex mappings with SPARQL expressivity • Transformation functions
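
The class translation shown on slide 9 can be sketched in a few lines of Python. This is only an illustration of the idea of rewriting source classes to one target class: real R2R mappings are expressed in RDF (Turtle), not in Python, and the `CLASS_MAPPING` table, the `translate` helper and the `ex:` identifiers are hypothetical names introduced here.

```python
# Hypothetical mapping table built from the class names shown on the slide.
# Real R2R mappings are written in RDF (Turtle); this dict is only a stand-in.
CLASS_MAPPING = {
    "dbpedia-owl:City": "local:City",
    "schema:Place": "local:City",
    "fb:location.citytown": "local:City",
}


def translate(triples):
    """Rewrite the object of rdf:type triples into the target vocabulary.

    Triples are (subject, predicate, object) tuples of prefixed names.
    Classes without a mapping, and all non-type triples, pass through unchanged.
    """
    translated = []
    for subject, predicate, obj in triples:
        if predicate == "rdf:type":
            obj = CLASS_MAPPING.get(obj, obj)
        translated.append((subject, predicate, obj))
    return translated


source_triples = [
    ("ex:Berlin", "rdf:type", "dbpedia-owl:City"),
    ("ex:Berlin", "rdfs:label", "Berlin"),
]
target_triples = translate(source_triples)
```

After translation, every source that described Berlin as some kind of city uses the same `local:City` class, which is what lets the later pipeline steps compare entities across sources.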
  • 10. LDIF Pipeline, step 3 (Resolve identities): Data sources use different identifiers for the same entity [Figure: among Berlin, Germany; Berlin, CT; Berlin, MD; Berlin, NJ; Berlin, MA, Silk identifies Berlin = Berlin, Germany] • Profiles expressed in XML • Supports various comparators and transformations
  • 11. LDIF Pipeline, step 4 (Filter and fuse): Sources provide different values for the same property [Figure: sources report Total Area as 891.85 km² or 891.82 km²; after quality assessment, Sieve outputs Total Area 891.85 km²] • Profiles expressed in XML • Supports various scoring and fusion functions
  • 12. LDIF Pipeline, step 5 (Output): • Output options: • N-Quads • N-Triples • SPARQL Update Stream • Provenance tracking using Named Graphs
  • 13. An Architecture for Linked Data Applications Data Quality and Fusion Module
  • 14. Data Fusion “fusing multiple records representing the same real-world object into a single, consistent, and clean representation” (Bleiholder & Naumann, 2008)
  • 15. Conflict resolution strategies • Independent of quality assessment metrics • Pick most frequent (democratic voting) • Average, max, min, concatenation • Within interval • Based on task-specific quality assessment • Keep highest scored • Keep all that pass a threshold • Trust some sources over others • Weighted voting
  • 16. Data Fusion • Input: • (Potentially) conflicting data • Quality metadata describing input • Execution: • Use existing or custom FusionFunctions • Output: • Clean data, according to user’s definition of clean
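
The conflict resolution strategies on slides 15 and 16 can be sketched as plain Python functions. This is a minimal sketch under stated assumptions, not Sieve's actual FusionFunction interface: the `claims` structure, the source names and the per-source `scores` are hypothetical, and only the area values come from the slides.

```python
from collections import Counter
from statistics import mean

# Hypothetical input: (value, source) pairs for one property of one entity.
claims = [(891.85, "source_a"), (891.82, "source_b"), (891.85, "source_c")]

# Hypothetical quality scores in [0, 1], one per source.
scores = {"source_a": 0.9, "source_b": 0.6, "source_c": 0.4}


def most_frequent(claims):
    """Democratic voting: keep the value asserted by the most sources."""
    counts = Counter(value for value, _ in claims)
    return counts.most_common(1)[0][0]


def average(claims):
    """Quality-independent numeric fusion: arithmetic mean of all values."""
    return mean(value for value, _ in claims)


def keep_highest_scored(claims, scores):
    """Keep the value coming from the best-scored source."""
    value, _ = max(claims, key=lambda claim: scores[claim[1]])
    return value


def pass_threshold(claims, scores, threshold):
    """Keep every value whose source score reaches the threshold."""
    return [value for value, source in claims if scores[source] >= threshold]


def weighted_vote(claims, scores):
    """Voting where each source's vote counts as much as its score."""
    totals = {}
    for value, source in claims:
        totals[value] = totals.get(value, 0.0) + scores[source]
    return max(totals, key=totals.get)
```

The first two functions ignore quality metadata entirely, while the last three take the scores as a second input, mirroring the split between quality-independent and quality-based strategies on slide 15.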
  • 18. Sieve: Quality Assessment • Quality as “fitness for use”: • Subjective: • good for me might not be enough for you • Task dependent: • temperature: planning a weekend vs biology experiment • Multidimensional: • even correct data may be outdated or not available • Requires task-specific quality assessment.
  • 19. Data Quality - Conceptual Framework. Dimensions: Accuracy, Consistency, Objectivity, Timeliness, Validity, Believability, Completeness, Understandability, Relevancy, Reputation, Verifiability, Amount of Data, Interpretability, Rep. Conciseness, Rep. Consistency, Availability, Response Time, Security
  • 20. Configuration: Quality Assessment • Quality Assessment Metrics are composed of: • ScoringFunction (generically applicable to given data types) • Quality Indicator as input (adaptable to use case) • Output: a score in [0;1] that describes the input within a quality dimension, according to a user’s definition of quality
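
As an illustration of a metric built from a scoring function and a quality indicator, the sketch below scores recency. It assumes a "last updated" date per named graph as the indicator and a linear decay as the scoring function; the function name, the graph names and the 60-day window are hypothetical choices made here, not Sieve's configuration.

```python
from datetime import date


def time_closeness(last_updated: date, today: date, max_age_days: int) -> float:
    """Map a 'last updated' indicator to a recency score in [0, 1].

    The score is 1.0 for data updated today, decays linearly with age,
    and is 0.0 for anything older than max_age_days.
    """
    age_days = (today - last_updated).days
    score = 1.0 - age_days / max_age_days
    return min(1.0, max(0.0, score))


# Hypothetical quality indicators: last update date of each named graph.
last_updated_by_graph = {
    "graph_a": date(2012, 3, 31),
    "graph_b": date(2012, 3, 1),
    "graph_c": date(2011, 12, 1),
}

today = date(2012, 3, 31)
recency_scores = {
    graph: time_closeness(updated, today, max_age_days=60)
    for graph, updated in last_updated_by_graph.items()
}
```

The same scoring function could be reused with a different indicator or window for another task, which is the sense in which the function is generic and the indicator is adaptable to the use case.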
  • 22. More about Sieve • Software: Open Source, Apache V2 • Scoring Functions and Fusion Functions can be extended • Scala/Java interface, methods score/fuse and fromXML • Quality scores can be stored and shared with other applications • Website: http://sieve.wbsg.de • Documentation, examples, downloads, support
  • 23. Use Case: Multiple data sources (Complementary) (Heterogeneous) • Conflicting values • Quality indicators (Multidimensional) (Task-dependent) • User config (Conflict Resolution Strategies) • Voilà!
  • 24. Evaluating Quality of Data Integration • Completeness • How many cities did we find? • How many of the properties did we fill with values? • Conciseness • How much redundancy is there in the object identifiers? • How much redundancy is there in the property values? • Consistency • How many conflicting values are there?
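
The three evaluation questions on slide 24 can be turned into simple counts. The formulas below are one plausible reading, chosen here for illustration; the slides do not give the exact definitions used in the evaluation, and the `cities` data and property names are hypothetical.

```python
# Hypothetical integrated data: entity -> property -> list of values found.
cities = {
    "Berlin": {"totalArea": [891.85, 891.82], "population": [3450000]},
    "Hamburg": {"totalArea": [755.0, 755.0], "population": []},
}
expected_properties = ["totalArea", "population"]


def completeness(data, properties):
    """Share of (entity, property) slots that have at least one value."""
    filled = sum(
        1 for props in data.values() for p in properties if props.get(p)
    )
    return filled / (len(data) * len(properties))


def conciseness(data):
    """Share of stored values that are distinct within their slot.

    1.0 means no redundant property values; lower means more repetition.
    """
    total = sum(len(vals) for props in data.values() for vals in props.values())
    distinct = sum(
        len(set(vals)) for props in data.values() for vals in props.values()
    )
    return distinct / total


def conflicts(data):
    """Number of (entity, property) slots holding more than one distinct value."""
    return sum(
        1
        for props in data.values()
        for vals in props.values()
        if len(set(vals)) > 1
    )
```

Computing these numbers before and after fusion gives a rough way to check whether the integrated data became more complete, concise and consistent.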
  • 25. Results Generated data that is more complete, concise and consistent than in the original sources
  • 26. Linked Data Application Architecture: My view on this data space can also be shared and reused. We can “pay as we go”.
  • 27. THANK YOU! • Twitter: @pablomendes • E-mail: pablo.mendes@fu-berlin.de • Website: http://sieve.wbsg.de • Google Group: http://bit.ly/ldifgroup • Supported in part by: Vulcan Inc. as part of its Project Halo; EU FP7 projects: LOD2 - Creating Knowledge out of Interlinked Data; PlanetData - A European Network of Excellence on Large-Scale Data Management