DBpedia exposes data from Wikipedia as machine-readable Linked Data. The DBpedia data extraction process generates RDF data in two ways: (a) using mappings that map the data from Wikipedia infoboxes to the DBpedia ontology and other vocabularies, and (b) using infobox properties, i.e., properties that are not defined in the DBpedia ontology but are auto-generated from the infobox attribute-value pairs. The work presented in this paper inspects the quality issues of the properties used in the Spanish DBpedia dataset according to the conciseness, consistency, syntactic validity, and semantic accuracy quality dimensions.
An analysis of the quality issues of the properties available in the Spanish DBpedia
1. An Analysis of the Quality Issues
of the Properties Available
in the Spanish DBpedia
Nandana Mihindukulasooriya, Mariano Rico,
Raúl García-Castro, and Asunción Gómez-Pérez
Ontology Engineering Group (OEG)
Departamento de Inteligencia Artificial
Escuela Técnica Superior de Ingenieros Informáticos
Universidad Politécnica de Madrid
Acknowledgments:
4V (TIN2013-46238-C4-2-R) and LIDER (EU FP7 610782) projects
http://loupe.linkeddata.es
2. Collaborative editing in Wikipedia
3. Spontaneous data model creation
4. Can spontaneous data models
support us in data quality
assessment?
But, first, what is the quality of such
spontaneous data models?
5. DBpedia – Exposing Wikipedia’s content as Linked Data
Diagram labels: RDF triple store, rendering
6. esDBpedia – the Spanish DBpedia chapter
http://es.dbpedia.org/
7. Can esDBpedia’s
spontaneous data model
support us in data quality
assessment?
But, first, what is the quality of the
properties of such spontaneous
data model?
8. Quality Dimensions for Datasets
A. Conciseness. A dataset does not contain
redundant concepts with different identifiers
B. Consistency. A dataset does not contain
conflicting or contradictory data
C. Syntactic Validity. Values belong to the
legal value range for the represented domain
and do not violate the syntactic rules
D. Semantic Accuracy. Values correctly
represent real world facts
9. Extraction and inspection of property statistics
http://loupe.linkeddata.es/
10. Information extracted about properties
Property statistics template, with example data:
General information
  URI: http://es.dbpedia.org/property/edad
  Local name: edad
  Namespace: http://es.dbpedia.org/property/
  Number of triples: 4623
Subject analysis
  IRI subject count: 4623 (100 %)
  Extracted domain classes (i.e., ?subject a ?class)
    dbpedia-owl:Agent: 2611 (56.48 %)
    schema:Person: 1515 (32.77 %)
    …
Object analysis
  URI object count: 186 (4.02 %)
  Extracted range classes (i.e., ?object a ?class)
    skos:Concept: 17 (9.14 %)
    schema:Place: 2 (1.08 %)
    …
  Literal object count: 4437 (95.98 %)
  Numerical object count: 2491 (53.88 %)
  Integer object count: 2382 (51.52 %)
  Average of numerics: 3.53
  Max numeric sample: 8.79E11, 1.5E8, 1.5E7, 8.2E6, 8121540
  Min numeric sample: -5, 0, 1, 1.08, 1.2
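
As a rough illustration of how counts like those in the template could be derived, the following Python sketch tallies IRI, literal, numeric, and integer objects for a single property. The tuple layout `(subject, predicate, object, is_iri)`, the `PropertyStats` class, and the `property_stats` function are assumptions made for this example; they are not taken from the Loupe tool.

```python
from dataclasses import dataclass


@dataclass
class PropertyStats:
    triples: int = 0
    iri_objects: int = 0
    literal_objects: int = 0
    numeric_objects: int = 0
    integer_objects: int = 0


def _parses_as(cast, text):
    """Return True if cast(text) succeeds (e.g. cast=int or cast=float)."""
    try:
        cast(text)
        return True
    except ValueError:
        return False


def property_stats(triples, prop):
    """Tally object kinds for one property.

    `triples` is an iterable of (subject, predicate, object, is_iri) tuples,
    a hypothetical stand-in for whatever a real RDF store would return.
    Literal objects are expected to be plain strings.
    """
    stats = PropertyStats()
    for _subject, predicate, obj, is_iri in triples:
        if predicate != prop:
            continue
        stats.triples += 1
        if is_iri:
            stats.iri_objects += 1
            continue
        stats.literal_objects += 1
        if _parses_as(float, obj):
            stats.numeric_objects += 1
            if _parses_as(int, obj):
                stats.integer_objects += 1
    return stats
```

Percentages such as "literal object count / number of triples" follow directly from these counters. A production version would more likely read the datatype of each literal from the store instead of re-parsing its lexical form.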
12. A. Conciseness
• Many redundant properties in esDBpedia
• 97.93% of the properties are auto-generated
• Causes
• Capitalization (857): partidosEnPrimera,partidosenprimera
• Synonyms: causaDeMuerte, causaDeFallecimiento
• Prepositions: causaDeFallecimiento, causaFallecimiento
• Spelling (7,495): apeliido, apelldio, apellid
• Singular/plural: apellido, apellidos
• Gender: administrador, administradora
• Accent usage (1,252): administracion, administración
• Parsing (107): altitudMin/máx, residencia/trabajo, idioma/s
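
Two of the causes listed above, capitalization and accent usage, lend themselves to a simple normalization check. The sketch below groups property local names that become identical once case and accents are folded away. The function names `normalize` and `redundancy_candidates` are hypothetical, and this is only one possible way to surface candidates, not the method used in the study.

```python
import unicodedata
from collections import defaultdict


def normalize(local_name):
    """Strip combining accent marks and fold case."""
    decomposed = unicodedata.normalize("NFD", local_name)
    without_accents = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return without_accents.casefold()


def redundancy_candidates(local_names):
    """Group names that collide after normalization.

    Returns {normalized_name: sorted list of original names} for every
    group with more than one distinct original name.
    """
    groups = defaultdict(set)
    for name in local_names:
        groups[normalize(name)].add(name)
    return {
        key: sorted(names) for key, names in groups.items() if len(names) > 1
    }
```

Note that this folding also turns "ñ" into "n", which may be too aggressive for Spanish, so the groups should be treated as candidates for manual review. Synonyms, misspellings, and singular/plural variants would need other techniques, such as string-distance measures or a dictionary.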
13. B. Consistency
• OWL properties with IRI and literal values
• 3,380 properties
• Use of strings and URLs interchangeably
• esdbpedia:lugarDeEntierro
• "Madrid"@es
• http://es.dbpedia.org/resource/Madrid
• Diverse and incorrect domain and range types
• esdbpedia:edad has range of type dbo:Place
• esdbpedia:lugarmuerte has range of type dbo:Person
• esdbpedia:pais has range of type dbo:Actor
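
The first consistency issue, properties used with both IRI and literal values, can be checked mechanically. The following minimal sketch counts both kinds of object per property and reports the properties that have at least one of each. The tuple layout and the function name `mixed_object_properties` are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def mixed_object_properties(triples):
    """Return properties whose objects mix IRIs and literals.

    `triples` is an iterable of (subject, predicate, object, is_iri) tuples,
    a hypothetical layout chosen for this example.
    """
    counts = defaultdict(Counter)
    for _subject, predicate, _obj, is_iri in triples:
        counts[predicate]["iri" if is_iri else "literal"] += 1
    return {
        predicate: dict(kinds)
        for predicate, kinds in counts.items()
        if kinds["iri"] and kinds["literal"]
    }
```

Applied to the lugarDeEntierro example above, where one subject points to the string "Madrid" and another to the Madrid resource, the property would be reported with one literal and one IRI object.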
14. C. Syntactic Validity
• IRIs represented as strings
• Many properties with IRI values and very few string values
• Common cause:
• IRIs encoded as strings -> “http://...”@es
• Numerical values represented as strings
• 3,675 properties with more than 99% integer objects and very few string literals
• Common cause:
• Numerics encoded as strings -> “2^^xsd:integer”
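
Both patterns on this slide can be spotted by looking at the lexical form of string literals. The sketch below classifies a string as IRI-like, numeric-like (with or without a trailing `^^xsd:` suffix that ended up inside the string), or plain. The regular expressions and the function name `classify_string_literal` are illustrative assumptions and deliberately simple; they do not implement full IRI or XSD validation.

```python
import re

# Hypothetical, intentionally loose patterns for this example.
IRI_LIKE = re.compile(r"^https?://\S+$")
NUMERIC_LIKE = re.compile(r"^[+-]?\d+(\.\d+)?(\^\^xsd:\w+)?$")


def classify_string_literal(value):
    """Classify the lexical form of a string literal.

    Returns "iri-as-string", "numeric-as-string", or "plain".
    """
    if IRI_LIKE.match(value):
        return "iri-as-string"
    if NUMERIC_LIKE.match(value):
        return "numeric-as-string"
    return "plain"
```

Running such a check only on the few string literals of a property whose objects are otherwise almost all IRIs or integers keeps the number of false positives low.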
15. D. Semantic Accuracy
• Outliers
• Numerical values allow an automatic analysis
• Properties such as diameter or edad with negative values
• Harder to detect automatically
• Our plan is to use data fusion approaches to compare values
from multiple sources
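
For numerical properties, the simplest automatic check is a plausibility range. The sketch below returns the values of a property that fall outside given bounds. The function name `out_of_range` and the bounds used in the usage note are assumptions for illustration; the slides do not state which thresholds, if any, were applied.

```python
def out_of_range(values, minimum, maximum):
    """Return the values outside the closed interval [minimum, maximum]."""
    return [value for value in values if not (minimum <= value <= maximum)]
```

For a property such as edad (age), assumed bounds of 0 and 130 would flag a negative value like -5 and a very large value like 8121540, both of which appear in the numeric samples shown earlier. Errors that stay within a plausible range are not caught this way, which is where comparing values across sources would help.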
16. Conclusions and future work
• DBpedia’s spontaneous data model can support
quality assessment
• Errors in DBpedia are introduced in many stages
• Crowd-sourced data
• Mappings
• Extraction framework
• Some errors can be eliminated with pre-processing
and cleaning
• Quality assessment currently semi-automatic
• Currently working towards its automation
• We plan to investigate if the quality issues are the
same in other DBpedia instances