1. 4V: Volume, Velocity, Variety and Validity in Innovative Data Management
(4V: Volumen, Velocidad, Variedad y Validez en la gestión innovadora de datos, TIN2013-46238)
Progress Report – WP3
Zaragoza, 15 June 2016
Ontology Engineering Group (OEG)
Escuela Técnica Superior de Ingenieros Informáticos
Universidad Politécnica de Madrid
Campus de Montegancedo,
Boadilla del Monte, 28660, Spain
2. Outline
• Loupe
• Ongoing work
• Quality Assessment and Repair
• Conciseness
• Consistency
• Collaborations
• A two-fold quality assurance approach for dynamic KBs: The 3cixty use case
Nandana Mihindukulasooriya, OEG
3. Loupe - An Online Tool for Inspecting Datasets in the Linked Data Cloud
Demo @ ISWC2015
4. Loupe - Overview
Explore the vocabularies used and the abstract triple patterns in more than 2 billion triples, including all DBpedia datasets, Wikidata, LinkedBrainz, and Bio2RDF.
Loupe helps to understand data, uncover patterns, formulate queries, and detect quality issues.
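A minimal sketch of what counting abstract triple patterns could look like. It assumes that an abstract triple pattern is a (subject type, property, object type) combination; that reading, the function name, and the sample triples are illustrative assumptions, not Loupe's actual implementation.

```python
from collections import Counter

RDF_TYPE = "rdf:type"

def abstract_triple_patterns(triples):
    """Count (subject type, property, object type) patterns in a list of triples."""
    # First pass: collect the declared types of every resource.
    types = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)

    # Second pass: lift each non-type triple to the type level.
    patterns = Counter()
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        for s_type in types.get(s, {"(untyped)"}):
            for o_type in types.get(o, {"(untyped or literal)"}):
                patterns[(s_type, p, o_type)] += 1
    return patterns

# Hypothetical sample data, written with prefixed names for brevity.
triples = [
    ("dbr:Madrid", RDF_TYPE, "dbo:City"),
    ("dbr:Zaragoza", RDF_TYPE, "dbo:City"),
    ("dbr:Spain", RDF_TYPE, "dbo:Country"),
    ("dbr:Madrid", "dbo:country", "dbr:Spain"),
    ("dbr:Zaragoza", "dbo:country", "dbr:Spain"),
]
patterns = abstract_triple_patterns(triples)
```

Both `dbo:country` triples collapse into one pattern, which is the kind of summary that helps when formulating a query against an unfamiliar dataset.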
5. Loupe – Google Analytics
6. Loupe – Google Analytics (II)
• Users from 84 countries
• Spain (23.76%), US (16.69%), Germany (10.64%), UK (9.14%), Italy (4.51%)
8. Loupe – Use Case Analysis
• Dataset descriptions
• Dataset statistics
• Dataset profiling
• Dataset exploration
• Class/property browsing
• Triple pattern browsing
• Dataset discovery and recommendation
• Keywords, vocabularies
• SPARQL queries
• RDF shapes
• Quality assessment
• Consistency
• Misused vocabularies
• Guided SPARQL query generation
• Auto-complete based on abstract triple patterns
• Vocabulary reuse and recommendation
• Recommendation of vocabularies based on popularity
• Ontology development feedback
• Common properties
9. Loupe – LOD Laundromat integration
• Current status of Loupe
• 2 billion triples from 32 datasets
• LOD Laundromat
• 32 billion triples from 650K documents
• cleaned for syntax errors and duplicates
• coverage of smaller documents
• Collaboration with VU University Amsterdam
• Steps
• Fully automatic dataset download, SPARQL endpoint creation, indexing, clean up
• UI changes to handle a large number of datasets
• Vocabulary usage datasets
10. Loupe Ontology – Vocabulary Usage Statistics of LOD
• Analysis of existing metrics
• VoID
• DCAT
• RDFStats
• LODStats
• VoID-Ext
• Analysis of use case requirements
• Statistics
• Profiling
• Discovery
• Recommendation
12. An Analysis of the Quality Issues of the Properties Available in the Spanish DBpedia
CAEPIA 2015, Albacete
13. Analyzed Quality Dimensions
A. Conciseness. A dataset does not contain redundant concepts with different identifiers.
B. Consistency. A dataset does not contain conflicting or contradictory data.
C. Syntactic Validity. Values belong to the legal value range for the represented domain and do not violate the syntactic rules.
D. Semantic Accuracy. Values correctly represent real-world facts.
14. Conciseness
• Many redundant properties in esDBpedia
• 97.93% are auto-generated
• Causes
• Capitalization (857): partidosEnPrimera, partidosenprimera
• Synonyms: causaDeMuerte, causaDeFallecimiento
• Prepositions: causaDeFallecimiento, causaFallecimiento
• Spelling (7,495): apeliido, apelldio, apellid
• Singular/plural: apellido, apellidos
• Gender: administrador, administradora
• Accent usage (1,252): administracion, administración
• Parsing (107): altitudMin/máx, residencia/trabajo, idioma/s
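Two of the causes above, capitalization and accent usage, can be detected mechanically. The following is a minimal sketch under that assumption; the function names are hypothetical, and it does not cover synonyms, spelling errors, or the other causes listed.

```python
import unicodedata
from collections import defaultdict

def normalize(name):
    """Lowercase a property name and strip accents."""
    decomposed = unicodedata.normalize("NFD", name)
    # Combining marks (category "Mn") carry the accents after NFD decomposition.
    without_accents = "".join(
        c for c in decomposed if unicodedata.category(c) != "Mn"
    )
    return without_accents.lower()

def redundancy_candidates(property_names):
    """Group names that become identical after normalization."""
    groups = defaultdict(set)
    for name in property_names:
        groups[normalize(name)].add(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}

names = [
    "partidosEnPrimera", "partidosenprimera",
    "administracion", "administración",
    "apellido",
]
candidates = redundancy_candidates(names)
```

Each resulting group is only a candidate for merging; a human or a further check would still need to confirm that the properties share the same meaning.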
15. Consistency
• Diverse and incorrect domain and range types
• esdbpedia:edad has range of type dbo:Place
• esdbpedia:lugarmuerte has range of type dbo:Person
• esdbpedia:pais has range of type dbo:Actor
• OWL properties with IRI and literal values
• 3,380 properties
• Use of strings and URLs interchangeably
• esdbpedia:lugarDeEntierro
• "Madrid"@es
• http://es.dbpedia.org/resource/Madrid
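A minimal sketch of how properties with both IRI and literal values could be flagged. It assumes triples are already parsed into tuples whose object is tagged as "iri" or "literal"; the subjects and the tagging scheme are hypothetical.

```python
from collections import defaultdict

def mixed_value_properties(triples):
    """Return the properties whose objects are sometimes IRIs and sometimes literals."""
    kinds = defaultdict(set)
    for _subject, prop, (kind, _value) in triples:
        kinds[prop].add(kind)
    return sorted(p for p, seen in kinds.items() if {"iri", "literal"} <= seen)

# Hypothetical subjects; the property and its two value styles come from the slide.
triples = [
    ("esdbr:A", "esdbpedia:lugarDeEntierro", ("literal", "Madrid")),
    ("esdbr:B", "esdbpedia:lugarDeEntierro",
     ("iri", "http://es.dbpedia.org/resource/Madrid")),
    ("esdbr:A", "esdbpedia:edad", ("literal", "42")),
]
mixed = mixed_value_properties(triples)
```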
17. How to query for the birth place of a person in DBpedia?
DBpedia (lang) | Syntactically similar | Semantically similar
English | birthplace, birthplace, placeofbirth, birthplace, birthdplace, birthPalce, birthplace, PlaceOfBirth, laceOfBirth, oplaceOfBirth, birthplace, birthplace, birthPalce, birthPlae, birthPace, birthPlaxe, birtPlace, birthPlcace, bithPlace, brithPlace, nbirthPlace, birthplace, birghPlace, birthdplace, biRthPlace, birth, placebirth, placeOfBirth, placOfBirth, birthPlaceOf, birthPlae | cityofbirth, cityofbirthPlace, cityOfBirth, birthLocation
Spanish | birthPlace, placeOfBirth, birthPlace, birthplace, lugarDeNacimiento, lugarNacimiento, lugarNacimiento, lugarnacimiento, lugardenacimiento, lugarNacimento, lugarNaciento | ciudaddenacimiento, ciudadDenacimiento, paisdenacimiento, paisNacimiento
German | geburtsort, birthplace, birthPlace, placeOfBirth, placeofbirth | geburtsland, countryofbirth
18. Conciseness
• Less-concise datasets
• Multiple identifiers with same semantics
• Issues
• Harder to understand data and vocabularies used
• Harder to write queries
• Harder to reuse
• Causes
• Less concise mappings
• Diverse distributed mappings created by multiple teams
• No policies or guidance on consistent vocabulary usage
• No tools for recommending classes/properties
• Crowd-sourced ontologies
• Missing or minimal labels/descriptions
19. RDF generation process
[Diagram: RDF generation pipeline with four layers: Data sources, Transformation, Storage, Access]
• Data sources: structured data, unstructured data
• Transformation components: Bulk RDF Transformation (e.g., LOD Refine, DBpedia extraction framework, ad-hoc programs), Query Rewriting, RDF Mappings (e.g., R2RML, Mappings Wiki, D2R mappings, LOD Refine RDF skeletons)
• Storage and access components: Triple Store, Web Server, SPARQL Endpoint (e.g., Virtuoso, Fuseki), RDF Dumps, Linked Data Resources (e.g., Pubby, ELDA)
• Clients: SPARQL Clients, Linked Data Clients
21. Issues in DBpedia mappings
• 16 DBpedia chapters
• Crowd-sourced mappings using the mappings wiki
• 5,553 template mappings
• Mostly using DBpedia ontology
• 739 classes, 3,049 properties
• Non-concise usage of similar properties
• elevation & height, formationYear & foundingYear, team & club, occupation & profession, foundedBy & founder
• Plan for repair
• Detection of inconsistent property usage
• Feedback to the ontology team
• Feedback and guidance to the mapping teams
• Automatic cleaning of the mappings (in RML)
22. Repairing conciseness issues in mappings
[Diagram: the RDF generation pipeline (Data sources, Transformation, Storage, Access), as on slide 19]
23. Detecting non-concise mappings based on data
dbr:Adobe_Systems dbo:formationYear "1982"^^xsd:gYear
dbr:Adobe_Systems dbo:foundingYear "1982"^^xsd:gYear
(one triple from DBpedia EN, the other from DBpedia ES)
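A minimal sketch of data-based detection, assuming that two properties are candidates for being equivalent when they attach the same value to the same subject in two different DBpedia chapters. The function name is hypothetical, and the sketch assumes the subjects of both chapters have already been aligned to the same identifiers.

```python
from collections import Counter

def co_occurring_properties(triples_a, triples_b):
    """Count property pairs (one per dataset) that share the same subject and value."""
    # Index the second dataset by (subject, value) for constant-time lookup.
    index_b = {}
    for s, p, o in triples_b:
        index_b.setdefault((s, o), set()).add(p)

    pairs = Counter()
    for s, p, o in triples_a:
        for q in index_b.get((s, o), ()):
            if q != p:
                pairs[(p, q)] += 1
    return pairs

# The two triples from the slide, one per chapter.
chapter_a = [("dbr:Adobe_Systems", "dbo:formationYear", "1982")]
chapter_b = [("dbr:Adobe_Systems", "dbo:foundingYear", "1982")]
pairs = co_occurring_properties(chapter_a, chapter_b)
```

A single shared value is weak evidence; in practice one would likely require a pair to co-occur on many subjects before reporting it to the mapping teams.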
25. RDF generation process
[Diagram: the RDF generation pipeline (Data sources, Transformation, Storage, Access), as on slide 19]
26. Property Maps
Property Map Generation
• Step 1: Group properties into clusters according to their domain and range
• Step 2: Multilingual NL preprocessing
• Step 3: Aggregate properties by similarity (syntactic and semantic)
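The three steps can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' method: lowercasing stands in for the multilingual NL preprocessing, only syntactic similarity is computed (with the standard library's `difflib.SequenceMatcher`), and the 0.8 threshold and greedy grouping are arbitrary choices.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def preprocess(name):
    """Step 2 stand-in: lowercase only (real NL preprocessing would do more)."""
    return name.lower()

def similar(a, b, threshold=0.8):
    """Syntactic similarity only; semantic similarity is out of scope here."""
    return SequenceMatcher(None, preprocess(a), preprocess(b)).ratio() >= threshold

def property_map(properties, threshold=0.8):
    # Step 1: cluster properties by (domain, range).
    clusters = defaultdict(list)
    for name, domain, range_ in properties:
        clusters[(domain, range_)].append(name)

    # Step 3: within each cluster, greedily group names similar to a group's first member.
    result = {}
    for key, names in clusters.items():
        groups = []
        for name in names:
            for group in groups:
                if similar(name, group[0], threshold):
                    group.append(name)
                    break
            else:
                groups.append([name])
        result[key] = groups
    return result

# Hypothetical domain/range assignments for property names seen in the deck.
properties = [
    ("birthPlace", "dbo:Person", "dbo:Place"),
    ("birthPalce", "dbo:Person", "dbo:Place"),
    ("lugarNacimiento", "dbo:Person", "dbo:Place"),
    ("lugarNacimento", "dbo:Person", "dbo:Place"),
    ("foundingYear", "dbo:Organisation", "xsd:gYear"),
]
pmap = property_map(properties)
```

With syntactic similarity alone, `birthPlace` and `lugarNacimiento` stay in separate groups; linking them is what the semantic part of step 3 would have to add.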
32. Consistency – (Incorrect) inferences
dbr:Juan_Alberto_Belloch owl:sameAs dbr:Pedro_Santisteve_Roche .
dbr:Aragón a dbo:Country .
• Open World Assumption and Non-Unique Name Assumption
• Works better for inferencing than validation
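The contrast between inferencing and validation can be illustrated with a small sketch. The input triples and the two axioms (a functional property `ex:mayor` and a range of `dbo:Country` for `ex:country`) are hypothetical, chosen only so that the rules reproduce the two inferences shown above; the slide does not state which axioms produced them.

```python
def infer(triples, ranges, functional):
    """Open-world reading: apply a range rule and a functional-property rule once."""
    inferred = set()
    # Range rule: the object of p is inferred to be an instance of p's range.
    for _s, p, o in triples:
        if p in ranges:
            inferred.add((o, "rdf:type", ranges[p]))
    # Functional-property rule: without a unique name assumption, two values
    # for the same subject are inferred to denote the same individual.
    values = {}
    for s, p, o in triples:
        if p in functional:
            values.setdefault((s, p), []).append(o)
    for objs in values.values():
        for other in objs[1:]:
            if other != objs[0]:
                inferred.add((objs[0], "owl:sameAs", other))
    return inferred - set(triples)

def validate(triples, ranges, functional):
    """Closed-world reading of the same axioms: report violations instead."""
    types = {(s, o) for s, p, o in triples if p == "rdf:type"}
    problems = []
    for _s, p, o in triples:
        if p in ranges and (o, ranges[p]) not in types:
            problems.append(f"{o} is not typed as {ranges[p]}")
    counts = {}
    for s, p, o in triples:
        if p in functional:
            counts.setdefault((s, p), set()).add(o)
    for (s, p), objs in counts.items():
        if len(objs) > 1:
            problems.append(f"{s} has {len(objs)} values for {p}")
    return problems

# Hypothetical input data and axioms.
triples = [
    ("dbr:Zaragoza", "ex:mayor", "dbr:Juan_Alberto_Belloch"),
    ("dbr:Zaragoza", "ex:mayor", "dbr:Pedro_Santisteve_Roche"),
    ("dbr:Zaragoza", "ex:country", "dbr:Aragón"),
]
ranges = {"ex:country": "dbo:Country"}
functional = {"ex:mayor"}
inferred = infer(triples, ranges, functional)
problems = validate(triples, ranges, functional)
```

The same data that silently yields two questionable facts under inference yields two reported problems under validation, which is the point of the second bullet.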
41. Automated testing with SPARQL Interceptor
• A set of user-defined SPARQL queries (as unit tests)
• Knowledge-base specific
[Diagram: the test SPARQL queries are derived from system requirements, schema constraints, conventions and other restrictions, and inputs from exploratory testing]
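The idea of SPARQL queries as unit tests can be sketched as follows. This is not the SPARQL Interceptor API: the `SparqlTest` class, the `run_tests` function, and the canned endpoint are hypothetical, and the second query uses a made-up `example.org` class. A real setup would pass a function that sends each ASK query to the knowledge base's SPARQL endpoint.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SparqlTest:
    name: str
    ask_query: str   # a SPARQL ASK query
    expected: bool   # the answer the knowledge base should give

def run_tests(tests: List[SparqlTest], run_ask: Callable[[str], bool]) -> List[str]:
    """Return the names of the tests whose ASK result differs from the expectation."""
    return [t.name for t in tests if run_ask(t.ask_query) != t.expected]

tests = [
    SparqlTest(
        name="no resource is both a place and a person",
        ask_query=(
            "ASK { ?x a <http://dbpedia.org/ontology/Place> , "
            "<http://dbpedia.org/ontology/Person> }"
        ),
        expected=False,
    ),
    SparqlTest(
        name="at least one event exists",
        ask_query="ASK { ?e a <http://example.org/Event> }",
        expected=True,
    ),
]

# Canned answers standing in for a live endpoint: the second check fails.
canned = {tests[0].ask_query: False, tests[1].ask_query: False}
failures = run_tests(tests, lambda query: canned[query])
```

Keeping the query executor as a parameter makes the same test set usable against a staging endpoint, a production endpoint, or canned answers as here.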