User-driven Quality
Evaluation of DBpedia
Amrapali Zaveri, Dimitris Kontokostas,
Mohamed A. Sherif, Lorenz Bühmann,
Mohamed Morsey, Sören Auer, Jens Lehmann
Outline
❏Data Quality
❏Data Quality Assessment Methodology
❏Evaluating Quality of DBpedia
❏ Manual
❏ Semi-automatic
❏Results
❏Conclusion & Future Work
Data Quality
● Data Quality (DQ) is defined as:
○ fitness for a certain use case*
● On the Data Web - varying quality of information
covering various domains
● High quality datasets
○ curated over decades - e.g. life science domain
○ crowdsourcing process - extracted from unstructured
and semi-structured information, e.g. DBpedia
* J. Juran. The Quality Control Handbook. McGraw-Hill, New York, 1974.
Data Quality Assessment
Methodology
4 Step Methodology:
❏ Step 1: Resource selection
❏ Per Class
❏ Completely random
❏ Manual
❏ Step 2: Evaluation mode
selection
❏ Manual
❏ Semi-automatic
❏ Automatic
❏ Step 3: Resource evaluation
❏ Step 4: DQ improvement
❏ Direct
❏ Indirect
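The three resource selection modes of Step 1 can be sketched in Python. This is only a minimal illustration, assuming resources are held in a dict mapping a resource URI to its class; the function names and the sample data are hypothetical and are not taken from the original tooling.

```python
import random

# Hypothetical sample data: resource URI -> class name.
RESOURCES = {
    "dbpedia:Leipzig": "City",
    "dbpedia:Berlin": "City",
    "dbpedia:Himalia": "Moon",
    "dbpedia:Fernando_Alonso": "FormulaOneRacer",
}


def select_per_class(resources, cls, rng):
    """Per-class mode: pick a random resource that belongs to one class."""
    candidates = sorted(uri for uri, c in resources.items() if c == cls)
    if not candidates:
        raise ValueError(f"no resources of class {cls!r}")
    return rng.choice(candidates)


def select_random(resources, rng):
    """Completely random mode: pick any resource, ignoring its class."""
    return rng.choice(sorted(resources))


def select_manual(resources, uri):
    """Manual mode: the evaluator names the resource directly."""
    if uri not in resources:
        raise KeyError(uri)
    return uri


rng = random.Random(0)  # seeded only to make the sketch repeatable
picked_city = select_per_class(RESOURCES, "City", rng)
picked_any = select_random(RESOURCES, rng)
picked_manual = select_manual(RESOURCES, "dbpedia:Himalia")
```

Sorting the candidates before sampling keeps the choice independent of dict ordering, which helps when a selection has to be reproduced later.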
Evaluating Quality of DBpedia
– Manual
❏Phase 1: Creation of quality problem
taxonomy
❏Phase 2: User-driven quality assessment
Quality Problem Taxonomy
Dimension | Category | Sub-category | D | F | DBpedia-specific
Accuracy | Triple incorrectly extracted | Object value is incompletely extracted | - | E | -
Accuracy | Triple incorrectly extracted | Object value is incorrectly extracted | - | E | -
Accuracy | Triple incorrectly extracted | Special template not properly recognized | √ | E | √
Accuracy | Datatype problems | Datatype incorrectly extracted | √ | E | -
Accuracy | Implicit relationships between attributes | One fact is encoded in several attributes | - | M | √
Accuracy | Implicit relationships between attributes | Several facts are encoded in one attribute | - | E | -
Accuracy | Implicit relationships between attributes | Attribute value computed from another attribute value | - | E + M | √
D = Detectable: problem detection can be automated.
F = Fixable: the issue can be solved by amending the extraction framework (E), the mappings wiki (M) or Wikipedia (W).
Quality Problem Taxonomy
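Dimension | Category | Sub-category | D | F | DBpedia-specific
Relevancy | Irrelevant information extracted | Extraction of attributes containing layout information | √ | E | √
Relevancy | Irrelevant information extracted | Redundant attribute values | √ | - | -
Relevancy | Irrelevant information extracted | Image related information | √ | E | √
Relevancy | Irrelevant information extracted | Other irrelevant information | √ | E | -
Representational Consistency | Representation of number values | Inconsistency in representation of number values | √ | W | -
Interlinking | External links | External websites | √ | W | -
Interlinking | Interlinks with other datasets | Links to Wikimedia | √ | E | -
Interlinking | Interlinks with other datasets | Links to Freebase | √ | E | -
Interlinking | Interlinks with other datasets | Links to Geospecies | √ | E | -
Interlinking | Interlinks with other datasets | Links generated via Flickr wrapper | √ | E | -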
Evaluating Quality of DBpedia
– Manual
❏Phase 1: Creation of quality problem
taxonomy
❏Phase 2: User-driven quality assessment
User-driven quality assessment
Type: Contest-based
Participants: LD experts
Task: Detect and classify LD quality issues
Time: 1 month
Reward: 300 EUR prize
Tool: TripleCheckMate
Crowdsourcing
● HITs (Human Intelligence Tasks)
● Submit to a crowdsourcing platform (e.g. Amazon Mechanical Turk)
● Financial reward for each HIT
DQ Assessment Tool -
TripleCheckMate
http://nl.dbpedia.org:8080/TripleCheckMate-Demo/
Evaluating Results -
Manual Methodology
Total no. of users 58
Total no. of distinct resources evaluated 521
Total no. of resources evaluated 792
Total no. of distinct resources without problems 86
Total no. of distinct resources with problems 435
Total no. of distinct incorrect triples 2928
Total no. of distinct incorrect triples in the dbprop namespace 1745
Total no. of inter-evaluations 268
No. of resources with evaluators having different opinions 89
Resource-based inter-rater agreement (Cohen’s kappa) 0.34
Triple-based inter-rater agreement (Cohen’s kappa) 0.38
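For readers unfamiliar with the agreement measure reported above, Cohen's kappa for two raters is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below computes it in Python; the two judgement lists are made up for illustration and are not the contest data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgements on six resources ("problem" vs "ok").
rater_1 = ["problem", "problem", "ok", "ok", "problem", "ok"]
rater_2 = ["problem", "ok", "ok", "ok", "problem", "problem"]
kappa = cohens_kappa(rater_1, rater_2)
```

With these illustrative lists the raters agree on four of six items while chance agreement is 0.5, so kappa comes out at one third.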
Evaluating Results -
Manual Methodology
No. of triples evaluated for correctness 700
No. of triples evaluated to be correct 567
No. of triples evaluated incorrectly 133
% of triples correctly evaluated 81
Average no. of problems per resource 5.69
Average no. of problems per resource in the dbprop namespace 3.45
Average no. of triples per resource 47.19
% of triples affected 11.93
% of triples affected in the dbprop namespace 7.11
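Some of the figures on the two result slides can be cross-checked with simple arithmetic. The short Python sketch below uses only numbers quoted on the slides; the variable names are illustrative.

```python
# Figures copied from the result slides.
distinct_without_problems = 86
distinct_with_problems = 435
distinct_resources = 521

triples_checked = 700
triples_correct = 567
triples_incorrect = 133

# Resources with and without problems should add up to all distinct resources.
resources_consistent = (
    distinct_without_problems + distinct_with_problems == distinct_resources
)

# Correctly and incorrectly evaluated triples should add up to all checked triples.
triples_consistent = triples_correct + triples_incorrect == triples_checked

# Share of correctly evaluated triples, as a percentage.
pct_correct = 100 * triples_correct / triples_checked
print(resources_consistent, triples_consistent, round(pct_correct))
```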
Evaluating Quality of DBpedia
– Semi-automatic
❏ Step 1: Automatic creation of an extended
schema
❏ DL-Learner*
❏ for all properties in DBpedia, axioms expressing the (inverse)
functional, irreflexive and asymmetric characteristic were
generated
❏ minimum confidence value of 0.95
❏ Step 2: Manual evaluation of the generated
axioms
❏ 100 random axioms per type
❏ Restricted evaluation of those axioms where at least one
violation is found
❏ Taking target context into account
*J. Lehmann. DL-Learner: learning concepts in description logics. Journal of Machine Learning
Research (JMLR), 10:2639-2642, 2009.
Evaluation Results
- Semi-automatic
❏ Irreflexivity:
❏ dbpedia:2012_Coppa_Italia_Final dbpedia-owl:followingEvent
dbpedia:2012_Coppa_Italia_Final
❏ Asymmetry:
❏ dbpedia-owl:starring with domain Work and range Actor
❏ Functionality:
❏ 2 different values 2600.0 and 1630.0 for the density of the moon Himalia.
❏ Inverse Functionality:
❏ Domain: dbpedia-owl:FormulaOneRacer
Range: dbpedia-owl:GrandPrix
Violation:
dbpedia:Fernando_Alonso dbpedia-owl:firstWin
dbpedia:2003_Hungarian_Grand_Prix .
dbpedia:WikiProject_Formula_one dbpedia-owl:firstWin
dbpedia:2003_Hungarian_Grand_Prix .
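The four kinds of violation illustrated above can be detected with straightforward checks over a set of triples. The Python sketch below is a simplified illustration and not the DL-Learner implementation; the in-memory triple list is hypothetical, and the property name `dbpedia-owl:density` is an assumption (the slide only says "density").

```python
from collections import defaultdict

# Sample triples modelled on the slide's examples.
# "dbpedia-owl:density" is an assumed property name.
TRIPLES = [
    ("dbpedia:2012_Coppa_Italia_Final", "dbpedia-owl:followingEvent",
     "dbpedia:2012_Coppa_Italia_Final"),
    ("dbpedia:Himalia", "dbpedia-owl:density", 2600.0),
    ("dbpedia:Himalia", "dbpedia-owl:density", 1630.0),
    ("dbpedia:Fernando_Alonso", "dbpedia-owl:firstWin",
     "dbpedia:2003_Hungarian_Grand_Prix"),
    ("dbpedia:WikiProject_Formula_one", "dbpedia-owl:firstWin",
     "dbpedia:2003_Hungarian_Grand_Prix"),
]


def irreflexive_violations(triples, prop):
    """Triples that relate a resource to itself."""
    return [(s, p, o) for s, p, o in triples if p == prop and s == o]


def asymmetric_violations(triples, prop):
    """Pairs (s, o) for which the reverse statement (o, s) also exists."""
    pairs = {(s, o) for s, p, o in triples if p == prop}
    # s <= o reports each offending pair once (and covers s == o).
    return sorted((s, o) for s, o in pairs if (o, s) in pairs and s <= o)


def functional_violations(triples, prop):
    """Subjects that have more than one distinct value."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            values[s].add(o)
    return {s: sorted(v) for s, v in values.items() if len(v) > 1}


def inverse_functional_violations(triples, prop):
    """Objects that are shared by more than one distinct subject."""
    subjects = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            subjects[o].add(s)
    return {o: sorted(v) for o, v in subjects.items() if len(v) > 1}
```

Each function only flags candidates; as the methodology notes, a human still has to decide whether the axiom itself is correct in the target context.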
Evaluation Results -
Semi-automatic methodology
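Characteristic | #Properties (total) | #Properties (violated) | Correct | #Violations (min) | #Violations (max) | #Violations (avg.) | #Violations (total)
Irreflexivity | 142 | 24 | 24 | 1 | 133 | 9.8 | 236
Asymmetry | 500 | 144 | 81 | 1 | 628 | 16.7 | 1358
Functionality | 739 | 671 | 76 | 1 | 91581 | 2624.7 | 199480
Inverse Functionality | 52 | 49 | 13 | 8 | 18236 | 1685.2 | 21908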
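As a reading aid for the table above: the average number of violations appears to equal the total number of violations divided by the count in the "Correct" column. That relationship is inferred from the numbers and is not stated on the slide, so the Python check below should be taken as an assumption; it uses only figures from the table.

```python
# (correct, reported average, total violations) per characteristic,
# copied from the table above.
ROWS = {
    "Irreflexivity": (24, 9.8, 236),
    "Asymmetry": (81, 16.7, 1358),
    "Functionality": (76, 2624.7, 199480),
    "Inverse Functionality": (13, 1685.2, 21908),
}

# Recompute the average under the assumed definition total / correct.
recomputed = {
    name: total / correct for name, (correct, _avg, total) in ROWS.items()
}

for name, (correct, avg, total) in ROWS.items():
    print(f"{name}: reported {avg}, recomputed {recomputed[name]:.2f}")
```

Under this reading every recomputed value lies within 0.1 of the reported one.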
Conclusion & Future Work
● Empirical quality analysis for more than 500
resources of a large linked dataset extracted
from crowdsourced content
● Future work:
○ Fix problems detected (Improvement step)
○ Assess other LOD sources
○ Adopt an agile methodology to improve quality of LOD
○ Revisit quality analysis (in regular intervals)
Thank You
Questions?
http://aksw.org/AmrapaliZaveri
zaveri@informatik.uni-leipzig.de
Twitter: @amrapaliz
