Crowdsourcing Linked Data Quality Assessment
Maribel Acosta, Amrapali Zaveri, Elena Simperl, Dimitris Kontokostas, Sören Auer
and Jens Lehmann
@ISWC2013

KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association

www.kit.edu
Motivation
Varying quality of Linked Data sources
Some quality issues require certain interpretation
that can be easily performed by humans
dbpedia:Dave_Dobbyn dbprop:dateOfBirth “3”.

Solution: Include human verification in the
process of LD quality assessment
Direct application: Detecting patterns in errors
may help to identify (and correct) the extraction
mechanisms that cause them

28.10.2013

Acosta et al. – Crowdsourcing Linked Data Quality Assessment

Institut für Angewandte Informatik und Formale
Beschreibungsverfahren (AIFB)
Research questions
RQ1: Is it possible to detect quality issues in LD data sets
via crowdsourcing mechanisms?

RQ2: What type of crowd is most suitable for each type of
quality issue?

RQ3: Which types of errors are made by lay users and
experts when assessing RDF triples?
Related work
Crowdsourcing & Linked Data
  DBpedia
  ZenCrowd: Entity resolution
  CrowdMAP: Ontology alignment
  GWAP for LD

Web of data quality assessment
  Assessing LD mappings (Automatic)
  Quality characteristics of LD data sources (Semi-automatic)
  WIQA, Sieve (Manual)

Our work
OUR APPROACH

Methodology

Dataset {s p o .} → Correct {s p o .}
                  → Incorrect + Quality issue

Steps to implement the methodology
1  Selecting LD quality issues to crowdsource
2  Selecting the appropriate crowdsourcing approaches
3  Designing and generating the interfaces to present the data
   to the crowd
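
The flow above (a triple is either correct, or incorrect together with a quality issue) can be sketched as a small data structure. This is only an illustration: the names `Triple`, `Assessment`, `assess` and `toy_judge` are hypothetical and do not appear in the slides, and the toy judge is a stand-in for the crowd, not the authors' procedure.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Triple:
    """An RDF triple {s p o .} taken from the dataset."""
    s: str
    p: str
    o: str


@dataclass(frozen=True)
class Assessment:
    """Outcome for one triple: correct, or incorrect plus a quality issue."""
    triple: Triple
    correct: bool
    issue: Optional[str] = None  # only set when the triple is incorrect


def assess(triples: Iterable[Triple],
           judge: Callable[[Triple], Optional[str]]) -> List[Assessment]:
    """Apply a judge (hypothetical stand-in for the crowd) to every triple.

    The judge returns None for a correct triple, or the name of the
    quality issue it found.
    """
    results = []
    for t in triples:
        issue = judge(t)
        results.append(Assessment(t, correct=issue is None, issue=issue))
    return results


def toy_judge(t: Triple) -> Optional[str]:
    """Illustration only: a birth date shorter than a year is suspicious."""
    if t.p == "dbprop:dateOfBirth" and len(t.o) < 4:
        return "incorrect object"
    return None


sample = [Triple("dbpedia:Dave_Dobbyn", "dbprop:dateOfBirth", "3")]
report = assess(sample, toy_judge)
```

Keeping the quality issue next to the verdict makes it straightforward to group incorrect triples by issue type later on.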

1  Selecting LD quality issues to crowdsource

Three categories of quality problems occur
in DBpedia [Zaveri2013] and can be crowdsourced:
Incorrect object
 Example: dbpedia:Dave_Dobbyn dbprop:dateOfBirth “3”.

Incorrect data type or language tags
 Example: dbpedia:Torishima_Izu_Islands foaf:name “

”@en.

Incorrect link to “external Web pages”
 Example: dbpedia:John-Two-Hawks dbpedia-owl:wikiPageExternalLink
<http://cedarlakedvd.com/>

2  Selecting appropriate crowdsourcing approaches (1)

Find stage: Contest
  LD Experts
  Difficult task
  Final prize
  TripleCheckMate [Kontoskostas2013]

Verify stage: Microtasks
  Workers
  Easy task
  Micropayments
  MTurk (http://mturk.com)

Adapted from [Bernstein2010]
3  Presenting the data to the crowd

Microtask interfaces: MTurk tasks
Incorrect object

• Selection of foaf:name or
rdfs:label to extract human-readable descriptions
• Values extracted automatically
from Wikipedia infoboxes
• Link to the Wikipedia article via
foaf:isPrimaryTopicOf

Incorrect data type or language tag

Incorrect outlink

• Preview of external pages by
implementing HTML iframe
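
The iframe preview for the outlink task could look roughly like the fragment generated below. This is a minimal sketch, not the interface used in the study: the function name `outlink_task_html`, the question wording, the iframe size and the radio buttons are all assumptions, and the label "John Two-Hawks" is simply derived from the example resource.

```python
from html import escape


def outlink_task_html(subject_label: str, external_url: str) -> str:
    """Return an HTML fragment for one 'incorrect outlink' microtask.

    Hypothetical layout: a question naming the subject, an iframe preview
    of the external page, and two radio buttons for the worker's answer.
    """
    label = escape(subject_label)           # escape text shown to the worker
    url = escape(external_url, quote=True)  # escape the attribute value
    return (
        f"<p>Is the page below related to {label}?</p>\n"
        f'<iframe src="{url}" width="600" height="400"></iframe>\n'
        '<label><input type="radio" name="answer" value="correct"> Yes</label>\n'
        '<label><input type="radio" name="answer" value="incorrect"> No</label>'
    )


fragment = outlink_task_html("John Two-Hawks", "http://cedarlakedvd.com/")
```

Showing the page inline spares the worker a click, which matters when each task pays only a micropayment.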

EXPERIMENTAL STUDY

Experimental design
• Crowdsourcing approaches:
• Find stage: Contest with LD experts
• Verify stage: Microtasks (5 assignments)

• Creation of a gold standard:
• Two of the authors of this paper (MA, AZ) generated the gold
standard for all the triples obtained from the contest
• Each author independently evaluated the triples
• Conflicts were resolved via mutual agreement

• Metric: precision
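
Majority voting over the 5 assignments and the precision metric can be sketched as follows. The slides name the metric but do not spell out the formula, so the standard definition TP / (TP + FP) is assumed here; the triple ids, answers and gold labels are invented toy data, not results from the study.

```python
from collections import Counter
from typing import Dict, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer among the assignments of one triple."""
    return Counter(answers).most_common(1)[0][0]


def precision(predicted: Dict[str, str], gold: Dict[str, str],
              positive: str = "incorrect") -> float:
    """Standard precision TP / (TP + FP) for the `positive` label."""
    tp = sum(1 for k, v in predicted.items()
             if v == positive and gold[k] == positive)
    fp = sum(1 for k, v in predicted.items()
             if v == positive and gold[k] != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Invented toy data: 5 assignments per triple, as in the Verify stage.
assignments = {
    "t1": ["incorrect", "incorrect", "incorrect", "incorrect", "correct"],
    "t2": ["correct", "correct", "incorrect", "correct", "correct"],
    "t3": ["incorrect", "incorrect", "incorrect", "correct", "correct"],
}
gold_standard = {"t1": "incorrect", "t2": "correct", "t3": "correct"}

crowd = {tid: majority_vote(votes) for tid, votes in assignments.items()}
crowd_precision = precision(crowd, gold_standard)
```

With an odd number of assignments and two possible answers, a majority always exists, so no tie-breaking rule is needed in this sketch.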

Overall results
                                  LD Experts                Microtask workers
Number of distinct participants   50                        80
Total time                        3 weeks (predefined)      4 days
Total triples evaluated           1,512                     1,073
Total cost                        ~ US$ 400 (predefined)    ~ US$ 43
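
As a rough reading of the totals above, the cost per evaluated triple can be worked out as below. This is back-of-the-envelope arithmetic on the approximate amounts from the slide, not a figure reported by the authors; the two stages involve different tasks, so the values are not directly comparable.

```python
def cost_per_triple(total_cost_usd: float, triples: int) -> float:
    """Average cost in US$ per evaluated triple."""
    return total_cost_usd / triples


# Approximate totals taken from the overall results.
experts = cost_per_triple(400, 1512)   # contest with LD experts
workers = cost_per_triple(43, 1073)    # MTurk microtasks

print(f"LD experts:        ~US$ {experts:.2f} per triple")
print(f"Microtask workers: ~US$ {workers:.2f} per triple")
```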
Precision results: Incorrect object task
• MTurk workers can be used to reduce the error rates of LD experts for
the Find stage
Triples compared    LD Experts    MTurk (majority voting: n=5)
509                 0.7151        0.8977

• 117 DBpedia triples had predicates related to dates with
incorrect/incomplete values:
“2005 Six Nations Championship” Date 12 .
• 52 DBpedia triples had erroneous values from the source:
“English (programming language)” Influenced by ? .
  • Experts classified all these triples as incorrect
  • Workers compared values against Wikipedia and successfully classified these
    triples as “correct”

Precision results: Incorrect data type task
Triples compared    LD Experts    MTurk (majority voting: n=5)
341                 0.8270        0.4752

[Bar chart: number of triples (Experts TP, Experts FP, Crowd TP, Crowd FP) per data type: Date, English, Millimetre, Nanometre, Number, Number with decimals, Second, Volt, Year, Not specified / URI]
Precision results: Incorrect link task
Triples compared    Baseline    LD Experts    MTurk (n=5 majority voting)
223                 0.2598      0.1525        0.9412

• We analyzed the 189 misclassifications by the experts:
[Pie chart: Freebase links, Wikipedia images, External links; slices of 11%, 39% and 50%]

• The 6% misclassifications by the workers correspond to
pages with a language different from English.
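
The counts on this slide are consistent with the precision values for the link task, as the short check below shows. It assumes that the reported precision is the share of compared triples classified in agreement with the gold standard, which the slides do not state explicitly.

```python
triples_compared = 223
expert_precision = 0.1525
worker_precision = 0.9412

# Triples the experts got wrong, under the assumption stated above.
expert_misclassified = round(triples_compared * (1 - expert_precision))

# Share of triples the workers got wrong, in percent.
worker_error_percent = round((1 - worker_precision) * 100)

print(expert_misclassified)   # matches the 189 misclassifications
print(worker_error_percent)   # matches the 6% quoted for the workers
```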
Final discussion
RQ1: Is it possible to detect quality issues in LD data sets via
crowdsourcing mechanisms?

Both forms of crowdsourcing can be applied to detect certain
LD quality issues
RQ2: What type of crowd is most suitable for each type of quality issue?

The effort of LD experts must be applied to those tasks
demanding domain-specific skills. The MTurk crowd was
exceptionally good at performing data comparisons
RQ3: Which types of errors are made by lay users and experts?

Lay users do not have the skills to solve domain-specific
tasks, while experts' performance is very low on tasks that
demand extra effort (e.g., checking an external page)
CONCLUSIONS & FUTURE WORK

Conclusions & Future Work
A crowdsourcing methodology for LD quality assessment:
Find stage: LD experts
Verify stage: MTurk workers

Crowdsourcing approaches are feasible in detecting the
studied quality issues
Application: Detecting patterns in errors to fix the extraction
mechanisms

Future Work
Conducting new experiments (other quality issues and domains)
Integration of the crowd into curation processes and tools
References & Acknowledgements
[Bernstein2010]
M. S. Bernstein, G. Little, R. C. Miller, B. Hartmann, M. S. Ackerman, D. R.
Karger, D. Crowell, and K. Panovich. Soylent: a word processor with a crowd
inside. In Proceedings of the 23rd annual ACM symposium on User interface
software and technology, UIST ’10, pages 313–322, New York, NY, USA, 2010. ACM.

[Kontoskostas2013]
D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A Tool for
Crowdsourcing the Quality Assessment of Linked Data. Knowledge
Engineering and the Semantic Web, 2013.

[Zaveri2013]
A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer.
Quality assessment methodologies for linked open data. Under
review, http://www.semantic-web-journal.net/content/quality-assessment-methodologies-linked-open-data.

Approach

Find stage: Contest (LD Experts, difficult task, final prize), TripleCheckMate
Verify stage: Microtasks (Workers, easy task, micropayments), MTurk
MTurk tasks: Incorrect object, Incorrect data type, Incorrect outlink

Results: Precision
                           Object values    Data types    Interlinks
Linked Data experts        0.7151           0.8270        0.1525
MTurk (majority voting)    0.8977           0.4752        0.9412


QUESTIONS?
Institut für Angewandte Informatik und Formale
Beschreibungsverfahren (AIFB)

More Related Content

What's hot

The Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for ScienceThe Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for Science
Paul Groth
 
Extracting, Aligning, and Linking Data to Build Knowledge Graphs
Extracting, Aligning, and Linking Data to Build Knowledge GraphsExtracting, Aligning, and Linking Data to Build Knowledge Graphs
Extracting, Aligning, and Linking Data to Build Knowledge Graphs
Craig Knoblock
 
The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?
Frank van Harmelen
 
Knowledge Graph Maintenance
Knowledge Graph MaintenanceKnowledge Graph Maintenance
Knowledge Graph Maintenance
Paul Groth
 
Data Discovery and Visualization
Data Discovery and VisualizationData Discovery and Visualization
Data Discovery and Visualization
Dr. Neil Brittliff
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
University of Washington
 
The State of Linked Government Data
The State of Linked Government DataThe State of Linked Government Data
The State of Linked Government Data
Richard Cyganiak
 
More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?
Paul Groth
 
The Implications of Plagiarism and Contract Cheating for the Assessment of Da...
The Implications of Plagiarism and Contract Cheating for the Assessment of Da...The Implications of Plagiarism and Contract Cheating for the Assessment of Da...
The Implications of Plagiarism and Contract Cheating for the Assessment of Da...Thomas Lancaster
 
Democratizing Data Science in the Cloud
Democratizing Data Science in the CloudDemocratizing Data Science in the Cloud
Democratizing Data Science in the Cloud
University of Washington
 
End-to-End Learning for Answering Structured Queries Directly over Text
End-to-End Learning for  Answering Structured Queries Directly over Text End-to-End Learning for  Answering Structured Queries Directly over Text
End-to-End Learning for Answering Structured Queries Directly over Text
Paul Groth
 
Knowledge Graph Futures
Knowledge Graph FuturesKnowledge Graph Futures
Knowledge Graph Futures
Paul Groth
 
Content + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learningContent + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learning
Paul Groth
 
Intelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software EngineeringIntelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software Engineering
Tao Xie
 
The need for a transparent data supply chain
The need for a transparent data supply chainThe need for a transparent data supply chain
The need for a transparent data supply chain
Paul Groth
 
Deep neural networks for matching online social networking profiles
Deep neural networks for matching online social networking profilesDeep neural networks for matching online social networking profiles
Deep neural networks for matching online social networking profiles
Traian Rebedea
 
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...
Tao Xie
 
ISEC'18 Tutorial: Research Methodology on Pursuing Impact-Driven Research
ISEC'18 Tutorial: Research Methodology on Pursuing Impact-Driven ResearchISEC'18 Tutorial: Research Methodology on Pursuing Impact-Driven Research
ISEC'18 Tutorial: Research Methodology on Pursuing Impact-Driven Research
Tao Xie
 
Science Data, Responsibly
Science Data, ResponsiblyScience Data, Responsibly
Science Data, Responsibly
University of Washington
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data Science
University of Washington
 

What's hot (20)

The Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for ScienceThe Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for Science
 
Extracting, Aligning, and Linking Data to Build Knowledge Graphs
Extracting, Aligning, and Linking Data to Build Knowledge GraphsExtracting, Aligning, and Linking Data to Build Knowledge Graphs
Extracting, Aligning, and Linking Data to Build Knowledge Graphs
 
The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?
 
Knowledge Graph Maintenance
Knowledge Graph MaintenanceKnowledge Graph Maintenance
Knowledge Graph Maintenance
 
Data Discovery and Visualization
Data Discovery and VisualizationData Discovery and Visualization
Data Discovery and Visualization
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
 
The State of Linked Government Data
The State of Linked Government DataThe State of Linked Government Data
The State of Linked Government Data
 
More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?
 
The Implications of Plagiarism and Contract Cheating for the Assessment of Da...
The Implications of Plagiarism and Contract Cheating for the Assessment of Da...The Implications of Plagiarism and Contract Cheating for the Assessment of Da...
The Implications of Plagiarism and Contract Cheating for the Assessment of Da...
 
Democratizing Data Science in the Cloud
Democratizing Data Science in the CloudDemocratizing Data Science in the Cloud
Democratizing Data Science in the Cloud
 
End-to-End Learning for Answering Structured Queries Directly over Text
End-to-End Learning for  Answering Structured Queries Directly over Text End-to-End Learning for  Answering Structured Queries Directly over Text
End-to-End Learning for Answering Structured Queries Directly over Text
 
Knowledge Graph Futures
Knowledge Graph FuturesKnowledge Graph Futures
Knowledge Graph Futures
 
Content + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learningContent + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learning
 
Intelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software EngineeringIntelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software Engineering
 
The need for a transparent data supply chain
The need for a transparent data supply chainThe need for a transparent data supply chain
The need for a transparent data supply chain
 
Deep neural networks for matching online social networking profiles
Deep neural networks for matching online social networking profilesDeep neural networks for matching online social networking profiles
Deep neural networks for matching online social networking profiles
 
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...
 
ISEC'18 Tutorial: Research Methodology on Pursuing Impact-Driven Research
ISEC'18 Tutorial: Research Methodology on Pursuing Impact-Driven ResearchISEC'18 Tutorial: Research Methodology on Pursuing Impact-Driven Research
ISEC'18 Tutorial: Research Methodology on Pursuing Impact-Driven Research
 
Science Data, Responsibly
Science Data, ResponsiblyScience Data, Responsibly
Science Data, Responsibly
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data Science
 

Viewers also liked

HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing
HARE: A Hybrid SPARQL Engine to Enhance Query Answers via CrowdsourcingHARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing
HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing
Maribel Acosta Deibe
 
Conference Live: Accessible and Sociable Conference Semantic Data
Conference Live: Accessible and Sociable Conference Semantic DataConference Live: Accessible and Sociable Conference Semantic Data
Conference Live: Accessible and Sociable Conference Semantic Data
Anna Lisa Gentile
 
CrowdSem 2013 Workshop @ISWC2013
CrowdSem 2013 Workshop @ISWC2013CrowdSem 2013 Workshop @ISWC2013
CrowdSem 2013 Workshop @ISWC2013
Lora Aroyo
 
Semantic Data Management in Graph Databases: ESWC 2014 Tutorial
Semantic Data Management in Graph Databases: ESWC 2014 TutorialSemantic Data Management in Graph Databases: ESWC 2014 Tutorial
Semantic Data Management in Graph Databases: ESWC 2014 Tutorial
Maribel Acosta Deibe
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataEUCLID project
 
Crowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality AssessmentCrowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality Assessment
Amrapali Zaveri, PhD
 
Linked Data Quality Assessment: A Survey
Linked Data Quality Assessment: A SurveyLinked Data Quality Assessment: A Survey
Linked Data Quality Assessment: A Survey
Amrapali Zaveri, PhD
 

Viewers also liked (7)

HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing
HARE: A Hybrid SPARQL Engine to Enhance Query Answers via CrowdsourcingHARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing
HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing
 
Conference Live: Accessible and Sociable Conference Semantic Data
Conference Live: Accessible and Sociable Conference Semantic DataConference Live: Accessible and Sociable Conference Semantic Data
Conference Live: Accessible and Sociable Conference Semantic Data
 
CrowdSem 2013 Workshop @ISWC2013
CrowdSem 2013 Workshop @ISWC2013CrowdSem 2013 Workshop @ISWC2013
CrowdSem 2013 Workshop @ISWC2013
 
Semantic Data Management in Graph Databases: ESWC 2014 Tutorial
Semantic Data Management in Graph Databases: ESWC 2014 TutorialSemantic Data Management in Graph Databases: ESWC 2014 Tutorial
Semantic Data Management in Graph Databases: ESWC 2014 Tutorial
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked Data
 
Crowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality AssessmentCrowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality Assessment
 
Linked Data Quality Assessment: A Survey
Linked Data Quality Assessment: A SurveyLinked Data Quality Assessment: A Survey
Linked Data Quality Assessment: A Survey
 

Similar to Crowdsourcing Linked Data Quality Assessment

Linked Data Quality Assessment – daQ and Luzzu
Linked Data Quality Assessment – daQ and LuzzuLinked Data Quality Assessment – daQ and Luzzu
Linked Data Quality Assessment – daQ and Luzzu
jerdeb
 
PhD Consortium ADBIS presetation.
PhD Consortium ADBIS presetation.PhD Consortium ADBIS presetation.
PhD Consortium ADBIS presetation.
Giuseppe Ricci
 
EarthCube Monthly Community Webinar- Nov. 22, 2013
EarthCube Monthly Community Webinar- Nov. 22, 2013EarthCube Monthly Community Webinar- Nov. 22, 2013
EarthCube Monthly Community Webinar- Nov. 22, 2013
EarthCube
 
How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?
andrea huang
 
RDF Data Quality Assessment - connecting the pieces
RDF Data Quality Assessment - connecting the piecesRDF Data Quality Assessment - connecting the pieces
RDF Data Quality Assessment - connecting the pieces
Connected Data World
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNL
Anubhav Jain
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown Bag
DataTactics
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)
Rich Heimann
 
Keynote speech - Carole Goble - Jisc Digital Festival 2015
Keynote speech - Carole Goble - Jisc Digital Festival 2015Keynote speech - Carole Goble - Jisc Digital Festival 2015
Keynote speech - Carole Goble - Jisc Digital Festival 2015
Jisc
 
RARE and FAIR Science: Reproducibility and Research Objects
RARE and FAIR Science: Reproducibility and Research ObjectsRARE and FAIR Science: Reproducibility and Research Objects
RARE and FAIR Science: Reproducibility and Research Objects
Carole Goble
 
Early Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data CubesEarly Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data Cubes
Enrico Daga
 
Assigning semantic labels to data sources
Assigning semantic labels to data sourcesAssigning semantic labels to data sources
Assigning semantic labels to data sources
Craig Knoblock
 
Crowdsourcing the Quality of Knowledge Graphs: A DBpedia Study
Crowdsourcing the Quality of Knowledge Graphs:A DBpedia StudyCrowdsourcing the Quality of Knowledge Graphs:A DBpedia Study
Crowdsourcing the Quality of Knowledge Graphs: A DBpedia Study
Maribel Acosta Deibe
 
Survey on MapReduce in Big Data Clustering using Machine Learning Algorithms
Survey on MapReduce in Big Data Clustering using Machine Learning AlgorithmsSurvey on MapReduce in Big Data Clustering using Machine Learning Algorithms
Survey on MapReduce in Big Data Clustering using Machine Learning Algorithms
IRJET Journal
 
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015Ioan Toma
 
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
Yongyao Jiang
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptbutest
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detection
Symeon Papadopoulos
 
Camp 4-data workshop presentation
Camp 4-data workshop presentationCamp 4-data workshop presentation
Camp 4-data workshop presentation
Paolo Missier
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detection
Sotiris Beis
 

Similar to Crowdsourcing Linked Data Quality Assessment (20)

Linked Data Quality Assessment – daQ and Luzzu
Linked Data Quality Assessment – daQ and LuzzuLinked Data Quality Assessment – daQ and Luzzu
Linked Data Quality Assessment – daQ and Luzzu
 
PhD Consortium ADBIS presetation.
PhD Consortium ADBIS presetation.PhD Consortium ADBIS presetation.
PhD Consortium ADBIS presetation.
 
EarthCube Monthly Community Webinar- Nov. 22, 2013
EarthCube Monthly Community Webinar- Nov. 22, 2013EarthCube Monthly Community Webinar- Nov. 22, 2013
EarthCube Monthly Community Webinar- Nov. 22, 2013
 
How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?
 
RDF Data Quality Assessment - connecting the pieces
RDF Data Quality Assessment - connecting the piecesRDF Data Quality Assessment - connecting the pieces
RDF Data Quality Assessment - connecting the pieces
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNL
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown Bag
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)
 
Keynote speech - Carole Goble - Jisc Digital Festival 2015
Keynote speech - Carole Goble - Jisc Digital Festival 2015Keynote speech - Carole Goble - Jisc Digital Festival 2015
Keynote speech - Carole Goble - Jisc Digital Festival 2015
 
RARE and FAIR Science: Reproducibility and Research Objects
RARE and FAIR Science: Reproducibility and Research ObjectsRARE and FAIR Science: Reproducibility and Research Objects
RARE and FAIR Science: Reproducibility and Research Objects
 
Early Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data CubesEarly Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data Cubes
 
Assigning semantic labels to data sources
Assigning semantic labels to data sourcesAssigning semantic labels to data sources
Assigning semantic labels to data sources
 
Crowdsourcing the Quality of Knowledge Graphs: A DBpedia Study
Crowdsourcing the Quality of Knowledge Graphs:A DBpedia StudyCrowdsourcing the Quality of Knowledge Graphs:A DBpedia Study
Crowdsourcing the Quality of Knowledge Graphs: A DBpedia Study
 
Survey on MapReduce in Big Data Clustering using Machine Learning Algorithms
Survey on MapReduce in Big Data Clustering using Machine Learning AlgorithmsSurvey on MapReduce in Big Data Clustering using Machine Learning Algorithms
Survey on MapReduce in Big Data Clustering using Machine Learning Algorithms
 
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
 
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.ppt
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detection
 
Camp 4-data workshop presentation
Camp 4-data workshop presentationCamp 4-data workshop presentation
Camp 4-data workshop presentation
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detection
 


Crowdsourcing Linked Data Quality Assessment

  • 1. Crowdsourcing Linked Data Quality Assessment. Maribel Acosta, Amrapali Zaveri, Elena Simperl, Dimitris Kontokostas, Sören Auer and Jens Lehmann. @ISWC2013. KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association, www.kit.edu
  • 2. Motivation. The quality of Linked Data (LD) sources varies. Some quality issues require interpretation that humans can perform easily, for example: dbpedia:Dave_Dobbyn dbprop:dateOfBirth “3”. Solution: include human verification in the process of LD quality assessment. Direct application: detecting patterns in errors may make it possible to identify (and correct) the extraction mechanisms. (28.10.2013, Acosta et al. – Crowdsourcing Linked Data Quality Assessment, Institut für Angewandte Informatik und Formale Beschreibungsverfahren (AIFB))
  • 3. Research questions. RQ1: Is it possible to detect quality issues in LD data sets via crowdsourcing mechanisms? RQ2: What type of crowd is most suitable for each type of quality issue? RQ3: Which types of errors are made by lay users and experts when assessing RDF triples?
  • 4. Related work (diagram labels): DBpedia; Assessing LD mappings; ZenCrowd: entity resolution; (Automatic); Crowdsourcing & Linked Data; CrowdMAP: ontology alignment; Web of data quality assessment; Quality characteristics of LD data sources; (Semi-automatic); WIQA, Sieve; (Manual); GWAP for LD; Our work.
  • 5. OUR APPROACH
  • 6. Methodology. Diagram: triples {s p o .} from the dataset are classified as either correct, or incorrect together with a quality issue. Steps to implement the methodology: (1) selecting LD quality issues to crowdsource; (2) selecting the appropriate crowdsourcing approaches; (3) designing and generating the interfaces to present the data to the crowd.
  • 7. Step 1: Selecting LD quality issues to crowdsource. Three categories of quality problems occur in DBpedia [Zaveri2013] and can be crowdsourced. Incorrect object, example: dbpedia:Dave_Dobbyn dbprop:dateOfBirth “3”. Incorrect data type or language tags, example: dbpedia:Torishima_Izu_Islands foaf:name “ ”@en. Incorrect link to “external Web pages”, example: dbpedia:John-Two-Hawks dbpedia-owl:wikiPageExternalLink <http://cedarlakedvd.com/>
  • 8. Step 2: Selecting appropriate crowdsourcing approaches (1). Find stage: contest; LD experts; difficult task; final prize; tool: TripleCheckMate [Kontoskostas2013]. Verify stage: microtasks; workers; easy task; micropayments; platform: MTurk, http://mturk.com. Adapted from [Bernstein2010].
  • 9. Step 3: Presenting the data to the crowd. Microtask interfaces (MTurk tasks) for three issue types: incorrect object; incorrect data type or language tag; incorrect outlink. Interface features: selection of foaf:name or rdfs:label to extract human-readable descriptions; values extracted automatically from Wikipedia infoboxes; link to the Wikipedia article via foaf:isPrimaryTopicOf; for the incorrect outlink task, a preview of external pages implemented with an HTML iframe.
  • 10. EXPERIMENTAL STUDY
  • 11. Experimental design. Crowdsourcing approaches: Find stage, a contest with LD experts; Verify stage, microtasks (5 assignments). Creation of a gold standard: two of the authors of this paper (MA, AZ) generated the gold standard for all the triples obtained from the contest; each author independently evaluated the triples; conflicts were resolved via mutual agreement. Metric: precision.
  • 12. Overall results. LD experts: 50 distinct participants; total time 3 weeks (predefined); 1,512 triples evaluated in total; total cost ~ US$ 400 (predefined). Microtask workers: 80 distinct participants; total time 4 days; 1,073 triples evaluated in total; total cost ~ US$ 43. (Slide footer: Maribel Acosta - Identifying DBpedia Quality Issues via Crowdsourcing)
  • 13. Precision results: Incorrect object task. MTurk workers can be used to reduce the error rates of LD experts for the Find stage. Triples compared: 509; precision of LD experts: 0.7151; precision of MTurk (majority voting, n=5): 0.8977. 117 DBpedia triples had predicates related to dates with incorrect/incomplete values, e.g. “2005 Six Nations Championship” Date 12 . 52 DBpedia triples had erroneous values from the source, e.g. “English (programming language)” Influenced by ? . Experts classified all these triples as incorrect; workers compared the values against Wikipedia and successfully classified these triples as “correct”.
  • 14. Precision results: Incorrect data type task. Triples compared: 341; precision of LD experts: 0.8270; precision of MTurk (majority voting, n=5): 0.4752. [Chart: number of triples (y-axis, 0 to 140) per data type (x-axis: Date, English, Millimetre, Nanometre, Number, Number with decimals, Second, Volt, Year, Not specified / URI); series: Experts TP, Experts FP, Crowd TP, Crowd FP.]
  • 15. Precision results: Incorrect link task. Triples compared: 223; precision of the baseline: 0.2598; precision of LD experts: 0.1525; precision of MTurk (majority voting, n=5): 0.9412. We analyzed the 189 misclassifications by the experts. [Chart: shares of 11%, 39% and 50% over the categories Freebase links, Wikipedia images and External links.] The 6% misclassifications by the workers correspond to pages with a language different from English.
  • 16. Final discussion. RQ1: Is it possible to detect quality issues in LD data sets via crowdsourcing mechanisms? Both forms of crowdsourcing can be applied to detect certain LD quality issues. RQ2: What type of crowd is most suitable for each type of quality issue? The effort of LD experts should be applied to those tasks that demand domain-specific skills; the MTurk crowd was exceptionally good at performing data comparisons. RQ3: Which types of errors are made by lay users and experts? Lay users do not have the skills to solve domain-specific tasks, while experts' performance is very low on tasks that demand extra effort (e.g., checking an external page).
  • 17. CONCLUSIONS & FUTURE WORK
  • 18. Conclusions & Future Work. A crowdsourcing methodology for LD quality assessment: Find stage with LD experts, Verify stage with MTurk workers. Crowdsourcing approaches are feasible for detecting the studied quality issues. Application: detecting patterns in errors to fix the extraction mechanisms. Future work: conducting new experiments (other quality issues and domains); integration of the crowd into curation processes and tools.
  • 19. References & Acknowledgements. [Bernstein2010] M. S. Bernstein, G. Little, R. C. Miller, B. Hartmann, M. S. Ackerman, D. R. Karger, D. Crowell, and K. Panovich. Soylent: a word processor with a crowd inside. In Proceedings of the 23rd annual ACM symposium on User interface software and technology, UIST ’10, pages 313–322, New York, NY, USA, 2010. ACM. [Kontoskostas2013] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. Knowledge Engineering and the Semantic Web, 2013. [Zaveri2013] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality assessment methodologies for linked open data. Under review, http://www.semantic-web-journal.net/content/quality-assessmentmethodologies-linked-open-data.
  • 20. Approach. Find stage: contest; LD experts; difficult task; final prize; TripleCheckMate. Verify stage: microtasks; workers; easy task; micropayments; MTurk. MTurk tasks: incorrect object; incorrect data type; incorrect outlink. Results (precision) for object values / data types / interlinks: Linked Data experts 0.7151 / 0.8270 / 0.1525; MTurk (majority voting) 0.8977 / 0.4752 / 0.9412. QUESTIONS?
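The slides report MTurk results under majority voting over n=5 assignments per triple. A minimal Python sketch of that aggregation step follows. The function name `majority_vote`, the answer labels, and the choice to return `None` on a tie are assumptions made for illustration; the slides do not show how the authors implemented the aggregation.

```python
from collections import Counter


def majority_vote(answers):
    """Aggregate worker answers for one triple.

    Returns the answer given most often, or None when the two most
    frequent answers are tied (hypothetical tie policy, not from the slides).
    """
    if not answers:
        raise ValueError("need at least one worker answer")
    ranked = Counter(answers).most_common()
    # A tie between the two leading answers means there is no majority.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Five hypothetical assignments for a single triple.
assignments = ["incorrect", "correct", "incorrect", "incorrect", "correct"]
verdict = majority_vote(assignments)
```

With an odd number of assignments and two possible answers a tie cannot occur; the `None` branch only matters if workers may choose among more than two options.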

Editor's Notes

  1. As we know, the Linking Open Data cloud is a great source of data. However, the varying quality of Linked Data sets often poses serious problems for developers aiming to consume and integrate LD in their applications. Leaving aside the factual flaws of the original sources, several quality issues are introduced during the RDFication process. Solution: include human verification in the process of LD quality assessment in order to detect the quality issues that cannot be easily detected by other means. Direct application: detecting patterns in errors may make it possible to identify (and correct) the extraction mechanisms in order
  2. TP = a triple that is identified as “incorrect” by the crowd, and the triple is indeed incorrect. FP = a triple identified as “incorrect” by the crowd, but was actually correct in the data set.
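With TP and FP defined as in note 2, the precision metric used in the study is the standard ratio TP / (TP + FP). The Python sketch below computes it. The function name `precision` is illustrative, and the worked numbers rest on an assumption: that all 223 triples compared in the incorrect link task were flagged by the experts and that the 189 reported misclassifications are the false positives.

```python
def precision(true_positives, false_positives):
    """Precision = TP / (TP + FP) over triples flagged as incorrect."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no triples were flagged as incorrect")
    return true_positives / flagged


# Assumed reading of the incorrect link task: 223 triples compared,
# 189 of them misclassified by the experts, leaving 34 true positives.
expert_link_precision = precision(223 - 189, 189)
```

Under that reading the value rounds to 0.1525, which agrees with the precision reported for LD experts on the incorrect link task.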