A Candidate Dataset Discovery and Linkage Recommendation System for Linked Data

Published in: Education, Technology


Transcript

  • 1. A Candidate Dataset Discovery and Linkage Recommendation System for Linked Data
    OC WG Meeting – 26.6.2012
    Michael Luger
  • 2. Outline
    ● Introduction
    ● Linked Data
    ● Candidate Dataset Selection
    ● Linkage Recommendation
    ● Conclusion
  • 3. Introduction
    ● Maturing of Semantic Web technologies
    ● Growing amount of structured data becoming available as Linked Data
      – Data publishers from a variety of domains should be encouraged to participate
    ● Establishment of links between Linked Data resources → Data Linking
      – Draws influences from Record Linkage, Ontology Matching
      – Need for solutions to facilitate this process
    ● Data Linking can benefit from the exploitation of various information sources
      – Available Metadata
      – Ontology Alignments
      – Statistical Information
      – User Input
  • 4. Linked Data – Linking Open Data Initiative
    LOD Cloud – September 2011
    ● 300 datasets
    ● 30 billion triples
    ● 500 million RDF links
  • 5. Linked Data – Publishing Workflow
    [Workflow diagram; step labels: Ontology Selection, Dataset Selection, Conversion, Matcher Configuration, Publication, Matching, Interlinking, Application]
  • 6. Dataset Discovery – Overview
    ● Goal
      – Given a source dataset, discover candidate target datasets from the LOD cloud
    [Diagram: information extraction is applied to the source dataset (together with user-supplied information sources) and, pre-computed, to the set of LOD datasets; each side yields representative information sources (keywords, schema elements, topics, ...), which are then compared]
    ● Results
      – Top-ranked similar datasets from the set of LOD datasets
  • 7. Dataset Discovery – Metadata Initiatives (i)
    ● CKAN
      – Data hub platform covering LOD datasets
      – Allows organizations to publish metadata about their data in a structured format
    Example – CKAN metadata published about the dataset bbc-music.
  • 8. Dataset Discovery – Metadata Initiatives (ii)
    ● Ontologies for Linked Data
      – VoID – Vocabulary of Interlinked Datasets
        ● RDFS vocabulary for describing linked datasets
        ● Main concepts – void:Dataset, void:Linkset
      – LOV – Linked Open Vocabularies
        ● RDFS vocabulary for describing ontologies used by linked datasets
      – VOAF – Vocabulary of a Friend
        ● Extension of LOV
        ● Allows describing relationships between ontologies and topic classifications
  • 9. Dataset Discovery – Metadata Initiatives (iii)
    Example – LOV / VOAF metadata about the Music Ontology.
  • 10. Dataset Discovery – Information Sources
    ● Schema Elements
      – Classes and properties, extracted from datasets using SPARQL
      – Ontology Alignments
    ● Keywords
      – Class and property labels, extracted from schema information
      – CKAN metadata (tags, groups, …)
      – User-supplied keywords
    ● Vocabularies
      – Derived from schema elements
      – Retrieved from VoID metadata
    ● Topics
      – Retrieved from LOV / VOAF metadata
      – User-supplied topics
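The extraction of schema elements listed on slide 10 can be illustrated with a minimal sketch. It assumes the dataset is available as in-memory (subject, predicate, object) tuples; against a real endpoint, queries shaped like the two SPARQL strings below would be sent instead. The function name and the tuple representation are illustrative assumptions, not part of the presented system.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Queries of this shape would typically be sent to a SPARQL endpoint.
CLASSES_QUERY = "SELECT DISTINCT ?class WHERE { ?s a ?class }"
PROPERTIES_QUERY = "SELECT DISTINCT ?property WHERE { ?s ?property ?o }"


def extract_schema_elements(triples: Iterable[Triple]) -> Tuple[Set[str], Set[str]]:
    """In-memory equivalent of the two queries above (hypothetical helper)."""
    triples = list(triples)  # allow generators to be scanned twice
    # Classes are the objects of rdf:type statements.
    classes = {o for (_, p, o) in triples if p == RDF_TYPE}
    # Properties are all predicates in use, including rdf:type itself.
    properties = {p for (_, p, _) in triples}
    return classes, properties
```

The resulting sets are the kind of "schema elements" that can then be compared between a source dataset and each candidate target dataset.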
  • 11. Dataset Discovery – Computation
    ● Computation of overlap score O between the source dataset D_S and each potential target dataset D_T
    ● D_S … Source Dataset
    ● D_T … Target Dataset
    ● KW_DS, KW_DT, SE_DS, SE_DT, T_DS, T_DT ... Sets of Keywords, Schema Elements, Topics
    ● kw_DS, kw_DT, se_DS, se_DT, t_DS, t_DT ... Individual Keywords, Schema Elements, Topics
    ● M_kw, M_se, M_t … Matching Predicates
    ● w_kw, w_se, w_t ... Weights
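The exact formula for the overlap score O is not preserved in this transcript, so the following is only one plausible reading of the symbols on slide 11, not the author's definition. The sketch assumes that, for each kind of information source (keywords, schema elements, topics), the score is the fraction of source items that have a match in the target under the matching predicate, and that the three fractions are combined as a weighted average. All function names and the dictionary layout are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Matcher = Callable[[str, str], bool]
Info = Dict[str, Set[str]]  # e.g. {"kw": {...}, "se": {...}, "t": {...}}


def matched_fraction(source: Set[str], target: Set[str], matches: Matcher) -> float:
    """Share of source items with at least one matching target item."""
    if not source:
        return 0.0
    hits = sum(1 for s in source if any(matches(s, t) for t in target))
    return hits / len(source)


def overlap_score(source: Info, target: Info,
                  matchers: Dict[str, Matcher],
                  weights: Dict[str, float]) -> float:
    """Assumed form: weighted average of per-kind matched fractions."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    score = sum(
        weights[kind] * matched_fraction(
            source.get(kind, set()), target.get(kind, set()), matchers[kind])
        for kind in weights
    )
    return score / total_weight


def rank_targets(source: Info, candidates: Dict[str, Info],
                 matchers: Dict[str, Matcher],
                 weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort candidate target datasets by descending overlap score."""
    scored = [(name, overlap_score(source, info, matchers, weights))
              for name, info in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Raising a weight such as the one for schema elements lets a user emphasise the information source they trust most, which matches the role the slides give to user-supplied weights.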
  • 12. Dataset Discovery – Results
    ● CKAN Metadata
      – Complete but can skew results (non-representative keywords)
    ● VoID Descriptions
      – Not available for all the datasets
    ● Schema Elements
      – Problems with retrieval for some datasets
      – Ontology alignments improve results significantly
      – Potential through application of ontology matching
    ● Topics
      – LOV provides a very general classification
      – Potential through application of Upper Level Ontologies
    ● Computation works well in combination with user input (weights, custom input)
  • 13. Linkage Recommendation – Overview
    ● Matching Linked Data resources → Data Linking
      – Given two datasets, links between their resources are established by means of assessing similarity through matching
      – Reliance on well-established value-matching techniques
      – Adapted to RDF-type data characteristics (graph-like structure, heterogeneous vocabularies, large collections)
    ● Existing data linking tools
      – Fully automated approaches exhibit limited applicability
      – Semi-automated approaches demand manual configuration → Linkage Specification
    ● Goal
      – User-interactive environment for creating linkage specifications
      – Recommendation of linkage specifications
      – Exploitation of available information sources
      – Ability to perform matching and evaluate results in an iterative way
  • 14. Linkage Recommendation – Specification (Silk LSL)
    <Prefixes>
      <Prefix id="rdfs" namespace="http://www.w3.org/2000/01/rdf-schema#" />
      <Prefix id="dbpedia" namespace="http://dbpedia.org/ontology/" />
      <Prefix id="gn" namespace="http://www.geonames.org/ontology#" />
    </Prefixes>
    <DataSources>
      <DataSource id="dbpedia">
        <Param name="endpointURI" value="http://example.org/sparql" />
        <Param name="graph" value="http://dbpedia.org" />
      </DataSource>
      <DataSource id="geonames">
        <Param name="endpointURI" value="http://example2.org/sparql" />
        <Param name="graph" value="http://sws.geonames.org/" />
      </DataSource>
    </DataSources>
    <Interlinks>
      <Interlink id="cities">
        <LinkType>owl:sameAs</LinkType>
        <SourceDataset dataSource="dbpedia" var="a">
          <RestrictTo> ?a rdf:type dbpedia:City </RestrictTo>
        </SourceDataset>
        <TargetDataset dataSource="geonames" var="b">
          <RestrictTo> ?b rdf:type gn:P </RestrictTo>
        </TargetDataset>
        <LinkageRule>
          <Aggregate type="average">
            <Compare metric="jaro">
              <Input path="?a/rdfs:label" />
              <Input path="?b/gn:name" />
            </Compare>
            <Compare metric="num">
              <Input path="?a/dbpedia:populationTotal" />
              <Input path="?b/gn:population" />
            </Compare>
          </Aggregate>
        </LinkageRule>
        <...>
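The linkage rule on slide 14 averages a string comparison of labels with a numeric comparison of population figures. The sketch below mimics that shape in plain Python so the aggregation is easy to follow. It does not call Silk and does not reproduce Silk's metrics: `difflib.SequenceMatcher` stands in for the `jaro` metric, and a relative-difference measure stands in for `num`. The record layout and function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, Union

Record = Dict[str, Union[str, float]]


def string_similarity(a: str, b: str) -> float:
    """Stand-in for the 'jaro' metric: difflib ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def numeric_similarity(a: float, b: float) -> float:
    """Stand-in for the 'num' metric: 1 minus the relative difference."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 1.0  # both values are zero, so they are identical
    return max(0.0, 1.0 - abs(a - b) / largest)


def link_confidence(source: Record, target: Record) -> float:
    """Average of the two comparisons, like <Aggregate type="average">."""
    scores = [
        string_similarity(str(source["label"]), str(target["name"])),
        numeric_similarity(float(source["populationTotal"]),
                           float(target["population"])),
    ]
    return sum(scores) / len(scores)
```

A pair of resources would be proposed as an owl:sameAs link when the confidence exceeds a chosen threshold; the threshold itself is a configuration choice that the slide does not show.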
  • 15. Linkage Recommendation – Information Sources (i)
    ● User input
      – Tabular view for browsing the contents of two chosen datasets
      – Filtering of contents by specifying constraints
      – Ability to specify and execute linkage specification
      – Display of additional information sources inline
    [Architecture diagram; labels: User, Alignment Server (storage / retrieval), Matcher & Analysis Service, Source Dataset and Target Dataset (each with classes and properties), each accessed through a SPARQL endpoint]
  • 16. Linkage Recommendation – Information Sources (ii)
    ● Ontology Alignments
      – Management of ontology correspondences as part of the user interface
    ● Statistics
      – Property Discriminability & Coverage
        $dis(p) = \frac{|\{o \mid t = \langle i, p, o \rangle\}|}{|\{t \mid t = \langle i, p, o \rangle\}|}$
        $cov(p) = \frac{|\{i \mid t = \langle i, p, o \rangle\}|}{|\{i \mid t = \langle i, *, o \rangle\}|}$
      – Sample Property Pair Analysis
        ● Instance-based ontology matching based on instance samples
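The two statistics on slide 16 can be computed directly from a set of triples. The sketch below reads discriminability as the number of distinct objects of a property divided by the number of triples using that property, and coverage as the number of instances that have the property divided by the number of all instances. It assumes the input is a collection of distinct (instance, property, object) tuples; the function names are illustrative.

```python
from typing import Collection, Tuple

Triple = Tuple[str, str, str]


def discriminability(triples: Collection[Triple], prop: str) -> float:
    """Distinct objects of `prop` divided by the triples using `prop`."""
    objects = [o for (_, p, o) in triples if p == prop]
    if not objects:
        return 0.0
    return len(set(objects)) / len(objects)


def coverage(triples: Collection[Triple], prop: str) -> float:
    """Instances having `prop` divided by all instances in the data."""
    all_instances = {i for (i, _, _) in triples}
    if not all_instances:
        return 0.0
    with_prop = {i for (i, p, _) in triples if p == prop}
    return len(with_prop) / len(all_instances)
```

A property with high discriminability and high coverage, such as a name that is present on most instances and rarely repeated, is a natural candidate for a comparison in a linkage rule, whereas a property like a shared type scores low on discriminability.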
  • 17. Linkage Recommendation – User Interface (i)
  • 18. Linkage Recommendation – User Interface (ii)
  • 19. Linkage Recommendation – User Interface (iii)
  • 20. Conclusion
    ● Linked Data is exhibiting growing adoption
    ● Development of a Data Linking system
      – Focus on dataset selection & matcher configuration
      – Combination of available information sources and user feedback
    ● Outlook
      – Data publishers should be encouraged to participate in Linked Data
      – High-quality data is key
      – Important aspects also include up-to-date views on data, trust, provenance information, ...
      – Ongoing research and growing tool support