SchemEX -- Building an Index for Linked Open Data


General introduction to Linked Open Data and schema extraction using SchemEX. Download full slide set to enjoy all animations.

Published in: Education, Technology

  1. SchemEX – Building an Index for Linked Open Data
     Ansgar Scherp, Thomas Gottron, Mathias Konrath
     University of Koblenz-Landau, Germany
     Oslo, Norway, August 2012
     SchemEX – Building an Index for LOD, Slide 1 of 44
  2. Learning Goals
     • Understand the motivation and fundamentals of Linked Open Data (LOD).
     • Explain why an index for LOD is needed and how to create such an index efficiently.
  3. Scenario
     • Tim plans to travel
       – from London
       – to a customer in Cologne
  4. Website of the German Railway
     It works, so why bother?
  5. Let's Try Different Queries
     • Bottlenecks in public transportation?
     • Compare the connections with flights?
     • Visualize on a map?
     • …
     None of these queries can be answered, because the data is …
  6. … locked in silos!
     – High integration effort
     – Lack of data reuse
     Photo: B. Jagendorf, http://www.flickr.com/photos/bobjagendorf/, CC-BY
  7. Linked Data
     • Publishing and interlinking of data
     • Different quality and purpose
     • From different sources on the Web
     World Wide Web     | Linked Data
     Documents          | Data
     Hyperlinks         | Typed links
     HTML               | RDF
     Addresses (URIs)   | Addresses (URIs)
     Example: http://www.uio.no/
  8. Relevance of Linked Data?
  9. Linked Data: May '07 to Sept. '11
     Domains in the LOD cloud: Web 2.0, Media, Publications, eGovernment, Cross-Domain, Life Sciences, Geographic
     < 31 billion triples
     Source: http://lod-cloud.net
  10. Linked Data Principles
      1. Identification
      2. Interlinkage
      3. Dereferencing
      4. Description
  11. Example: Big Lynx
      Figure: Matt Briggs, Scott Miller, "?", Big Lynx Company
      < 31 billion triples
      Source: http://lod-cloud.net
  12. 1. Use URIs for Identification
      Matt Briggs: http://biglynx.co.uk/people/matt-briggs
      Scott Miller: http://biglynx.co.uk/people/scott-miller
      Photo: B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY
  13. Example: Big Lynx
      Figure: Matt Briggs, Scott Miller, Big Lynx Company
      How can relationships such as "knows" be modeled?
  14. Resource Description Framework (RDF)
      • Description of resources with RDF triples
        Matt Briggs | is a      | Person
        Subject     | Predicate | Object
      @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
      @prefix foaf: <http://xmlns.com/foaf/0.1/> .
      <http://biglynx.co.uk/people/matt-briggs> rdf:type foaf:Person .
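As a rough illustration of the triple on slide 14, the following Python sketch expands prefixed names into full URIs and stores the statement as a (subject, predicate, object) tuple. The `PREFIXES` table and the `expand` helper are names made up for this example; they do not belong to any RDF library.

```python
# Minimal sketch: an RDF triple as a plain Python tuple.
# PREFIXES and expand() are hypothetical helpers for illustration only.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand(term: str) -> str:
    """Expand a prefixed name such as 'foaf:Person' into a full URI."""
    if term.startswith(("http://", "https://")):
        return term  # already a full URI
    prefix, _, local = term.partition(":")
    return PREFIXES[prefix] + local


# Subject, predicate, object: "Matt Briggs is a Person".
triple = (
    expand("http://biglynx.co.uk/people/matt-briggs"),
    expand("rdf:type"),
    expand("foaf:Person"),
)
```

A real application would more likely rely on an RDF library; the tuple form is only meant to make the subject/predicate/object structure visible.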
  15. 1. Use URIs also for Relations
      http://biglynx.co.uk/people/matt-briggs
      http://biglynx.co.uk/people/scott-miller
      Photo: B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY
  16. Example: Big Lynx
      Figure: Dave Smith "lives here" London (DBpedia); Matt Briggs "same person" Matt Briggs (Matt's private website); Scott Miller; Big Lynx Company; …
  17. 2. Establishing Interlinkage
      • Relation links between resources
        <http://biglynx.co.uk/people/dave-smith> foaf:based_near <http://dbpedia.org/resource/London> .
      • Identity links between resources
        <http://biglynx.co.uk/people/matt-briggs> owl:sameAs <http://www.matt-briggs.eg.uk#me> .
  18. Example: Big Lynx
      Figure: Dave Smith "lives here" (foaf:based_near) London (DBpedia); Matt Briggs "same person" (owl:sameAs) Matt Briggs (Matt's private website); Big Lynx Company
  19. 3. Dereferencing of URIs
      • Looking up web documents
      • How can we "look up" things of the real world?
      http://biglynx.co.uk/people/matt-briggs
  20. Two Approaches
      1. Hash URIs
         – The URI contains a part separated by #, e.g., http://biglynx.co.uk/vocab/sme#Team
      2. Negotiation via a "303 See Other" response
         – Request: http://biglynx.co.uk/people/matt-briggs
         – Response: "Look here:" http://biglynx.co.uk/people/matt-briggs.rdf
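The two approaches on slide 20 can be sketched without any network access. In the Python sketch below, the `REDIRECTS_303` table is a made-up stand-in for a server that answers with "303 See Other", and `document_for` is a hypothetical helper name; only `urllib.parse.urldefrag` comes from the standard library.

```python
from urllib.parse import urldefrag

# Hypothetical stand-in for a web server: maps the URI of a "thing" to the
# document URI it would point to in a "303 See Other" response.
REDIRECTS_303 = {
    "http://biglynx.co.uk/people/matt-briggs":
        "http://biglynx.co.uk/people/matt-briggs.rdf",
}


def document_for(uri: str) -> str:
    """Return the URI of the document expected to describe `uri`."""
    base, fragment = urldefrag(uri)
    if fragment:
        # Hash URI: the part after '#' is stripped before the HTTP request,
        # so the describing document is the URI without the fragment.
        return base
    # 303 approach: the server redirects to a separate describing document.
    return REDIRECTS_303.get(uri, uri)
```

For example, `document_for("http://biglynx.co.uk/vocab/sme#Team")` yields the vocabulary document, while the URI for Matt Briggs resolves through the simulated redirect table.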
  21. Example: Big Lynx
      Figure: Dave Smith foaf:based_near London (DBpedia); Matt Briggs owl:sameAs Matt Briggs (Matt's private website); Big Lynx Company
      Description of Matt?
  22. 4. Description of URIs
      Graph figure: nodes foaf:Person, dp:Birmingham, biglynx:matt-briggs, biglynx:dave-smith, dp:London, _:point; edge labels rdf:type, foaf:based_near, foaf:knows, ex:loc, wgs84:lat "51.509", wgs84:long "-0.118"
  23. RDF / RDF Schema Vocabulary
      • Set of URIs defined in the rdf: and rdfs: namespaces
      • rdf: rdf:type, rdf:Property, rdf:XMLLiteral, rdf:List, rdf:first, rdf:rest, rdf:Seq, rdf:Bag, rdf:Alt, ..., rdf:value
      • rdfs: rdfs:domain, rdfs:range, rdfs:Resource, rdfs:Literal, rdfs:Datatype, rdfs:Class, rdfs:subClassOf, rdfs:subPropertyOf, rdfs:comment, …, rdfs:label
  24. Semantic Web Layer Cake (Simplified)
  25. Learning Goals
      • Understand the motivation and fundamentals of Linked Open Data (LOD).
      • Explain why an index for LOD is needed and how to create such an index efficiently.
  26. Scenario
      • People who are politicians and actors
      • Who else?
      • Where do they live?
      • Whom do they know? Whom are they married to?
  27. Problem
      • No single federated query interface is provided
      • Execute such queries on the LOD cloud ("politicians and actors"):
        SELECT ?x
        FROM …
        WHERE {
          ?x rdf:type ex:Actor .
          ?x rdf:type ex:Politician .
        }
  28. Solution in Principle
      • Suitable index structure for looking up sources, e.g., for "politicians and actors"
  29. The Naive Approach
      1. Download the entire LOD cloud
      2. Put it into a (really) large triple store
      3. Process the data and extract the schema
      4. Provide lookup
      Drawbacks:
      - Big machinery
      - Late in processing the data
      - High effort to scale with the LOD cloud
  30. Idea
      • Schema-level index
        – Define families of graph patterns
        – Assign instances to graph patterns
        – Map graph patterns to context (source URI)
      • Construction
        – Stream-based for scalability
        – Little loss of accuracy
      • Note
        – The index is defined over instances
        – But it stores the context
  31. Input Data
      • n-Quads: <subject> <predicate> <object> <context>
      • Example:
        <http://www.w3.org/People/Connolly/#me>
        <http://www.w3.org/1999/02/22-rdf-syntax-ns#
        <http://xmlns.com/foaf/0.1/Person>
        <http://dig.csail.mit.edu/2008/webdav/timbl/
      Figure: w3p:#me, foaf:Person, context http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf
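To make the n-quad input format on slide 31 concrete, here is a minimal parsing sketch in Python. It assumes all four terms are URIs in angle brackets; real N-Quads also allow literals and blank nodes, which this sketch deliberately ignores. The function name `parse_quad` and the context URI `http://example.org/source.rdf` are made up for the example.

```python
import re

# Simplified pattern: four URI terms in angle brackets, terminated by a dot.
# Literals and blank nodes, which real N-Quads permit, are not handled here.
QUAD = re.compile(r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")


def parse_quad(line: str):
    """Split one URI-only n-quad line into (subject, predicate, object, context)."""
    match = QUAD.match(line)
    if match is None:
        raise ValueError(f"not a URI-only n-quad: {line!r}")
    return match.groups()


# Illustrative line; the context URI is a hypothetical data source.
line = (
    "<http://biglynx.co.uk/people/matt-briggs> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://xmlns.com/foaf/0.1/Person> "
    "<http://example.org/source.rdf> ."
)
subject, predicate, obj, context = parse_quad(line)
```

The fourth term, the context, is what a schema-level index eventually maps graph patterns to.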
  32. Building the Schema and Index
      Figure (layers from top to bottom):
      • RDF classes: C1, C2, C3, …, Ck
      • consistsOf
      • Type clusters: TC1, TC2, …, TCm
      • hasEQClass
      • Equivalence classes: EQC1, EQC2, …, EQCn (connected by properties p1, p2)
      • hasDataSource
      • Data sources: DS1, DS2, DS3, DS4, DS5, …, DSx
  33. Layer 1: RDF Classes
      • All instances of a particular type
        SELECT ?x
        FROM …
        WHERE {
          ?x rdf:type foaf:Person .
        }
      Figure labels: C1; DS 1, DS 2, DS 3; foaf:Person; timbl:card#i; http://dig.csail.mit.edu/2008/...; http://www.w3.org/People/Berners-Lee/card
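Layer 1 on slide 33 can be pictured as a lookup table from an RDF class to the data sources that contain instances of it. The Python sketch below is one possible reading, not the SchemEX implementation; the sample quads use prefixed names for brevity, and the second context URI (`http://example.org/biglynx.rdf`) is invented.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

# Hypothetical input: (subject, predicate, object, context) quads.
quads = [
    ("timbl:card#i", RDF_TYPE, "foaf:Person",
     "http://www.w3.org/People/Berners-Lee/card"),
    ("timbl:card#i", RDF_TYPE, "pim:Male",
     "http://www.w3.org/People/Berners-Lee/card"),
    ("biglynx:matt-briggs", RDF_TYPE, "foaf:Person",
     "http://example.org/biglynx.rdf"),
]


def build_class_index(quads):
    """Map each RDF class to the set of data sources with instances of it."""
    index = defaultdict(set)
    for _subject, predicate, obj, context in quads:
        if predicate == RDF_TYPE:
            index[obj].add(context)
    return index


class_index = build_class_index(quads)
```

Looking up `class_index["foaf:Person"]` then answers the slide's query at the level of sources: it names where to look, not which instances match.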
  34. Layer 2: Type Clusters
      • All instances belonging to exactly the same set of types
        SELECT ?x
        FROM …
        WHERE {
          ?x rdf:type foaf:Person .
          ?x rdf:type pim:Male .
        }
      Figure labels: C1, C2; TC1; DS 1, DS 2, DS 3; foaf:Person, pim:Male; tc4711; timbl:card#i; http://www.w3.org/People/Berners-Lee/card
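A type cluster, as described on slide 34, groups instances by their exact set of types. The following Python sketch shows one way to compute that grouping in a single batch; the function name `build_type_clusters`, the sample quads, and the context URI `http://example.org/biglynx.rdf` are assumptions made for illustration.

```python
from collections import defaultdict


def build_type_clusters(quads):
    """Map each exact set of types to the data sources holding such instances."""
    types = defaultdict(set)     # instance -> set of its types
    sources = defaultdict(set)   # instance -> contexts in which it was typed
    for subject, predicate, obj, context in quads:
        if predicate == "rdf:type":
            types[subject].add(obj)
            sources[subject].add(context)

    clusters = defaultdict(set)  # frozenset of types -> data sources
    for instance, type_set in types.items():
        clusters[frozenset(type_set)] |= sources[instance]
    return clusters


# Hypothetical sample data with prefixed names.
quads = [
    ("timbl:card#i", "rdf:type", "foaf:Person",
     "http://www.w3.org/People/Berners-Lee/card"),
    ("timbl:card#i", "rdf:type", "pim:Male",
     "http://www.w3.org/People/Berners-Lee/card"),
    ("biglynx:matt-briggs", "rdf:type", "foaf:Person",
     "http://example.org/biglynx.rdf"),
]
clusters = build_type_clusters(quads)
```

Using a `frozenset` as the key makes "exactly the same set of types" literal: an instance typed only as `foaf:Person` lands in a different cluster than one typed as both `foaf:Person` and `pim:Male`.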
  35. Layer 3: Equivalence Classes
      • Two instances are equivalent iff:
        – They are in the same TC
        – They have the same properties
        – The property targets are in the same TC
      • Similar to 1-bisimulation
      Figure labels: C1, C2, C3; TC1, TC2; p; EQC1; DS 1, DS 2, DS 3
  36. Layer 3: Equivalence Classes
        SELECT ?x
        WHERE {
          ?x rdf:type foaf:Person .
          ?x rdf:type pim:Male .
          ?x foaf:maker ?y .
          ?y rdf:type foaf:PersonalProfileDocument .
        }
      Figure labels: tc4711 (foaf:Person, pim:Male) -maker- tc1234 (foaf:PPD); eqc0815; timbl:card#i foaf:maker timbl:card; http://www.w3.org/People/Berners-Lee/card
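The three conditions on slide 35 can be turned into a grouping key: the instance's own type cluster plus the set of (property, type cluster of the target) pairs. The Python sketch below follows that reading under simplifying assumptions (URI objects only, everything held in memory, no mapping to data sources); `equivalence_classes` and the `ex:` instances are invented for the example.

```python
from collections import defaultdict


def equivalence_classes(quads):
    """Group instances by (own type cluster, {(property, target type cluster)})."""
    types = defaultdict(set)   # instance -> its types
    links = defaultdict(set)   # instance -> {(property, target)}
    for s, p, o, _context in quads:
        if p == "rdf:type":
            types[s].add(o)
        else:
            links[s].add((p, o))

    def tc(node):
        # Type cluster of a node; empty if the node has no known types.
        return frozenset(types.get(node, ()))

    classes = defaultdict(set)
    for instance in set(types) | set(links):
        signature = frozenset(
            (p, tc(target)) for p, target in links.get(instance, ())
        )
        classes[(tc(instance), signature)].add(instance)
    return classes


PPD = "foaf:PersonalProfileDocument"
quads = [  # hypothetical sample data; contexts are placeholders
    ("timbl:card#i", "rdf:type", "foaf:Person", "ctx1"),
    ("timbl:card#i", "rdf:type", "pim:Male", "ctx1"),
    ("timbl:card#i", "foaf:maker", "timbl:card", "ctx1"),
    ("timbl:card", "rdf:type", PPD, "ctx1"),
    ("ex:alice", "rdf:type", "foaf:Person", "ctx2"),
    ("ex:alice", "rdf:type", "pim:Male", "ctx2"),
    ("ex:alice", "foaf:maker", "ex:alice-doc", "ctx2"),
    ("ex:alice-doc", "rdf:type", PPD, "ctx2"),
    ("ex:bob", "rdf:type", "foaf:Person", "ctx3"),
    ("ex:bob", "rdf:type", "pim:Male", "ctx3"),
]
eqcs = equivalence_classes(quads)
```

In this data, `timbl:card#i` and `ex:alice` share an equivalence class because both are typed `foaf:Person` and `pim:Male` and both link via `foaf:maker` to a `foaf:PersonalProfileDocument`; `ex:bob` has the same types but no such link, so it falls into a different class.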
  37. The SchemEX Approach
      • Stream-based schema extraction
      • While crawling the data
      Pipeline figure (components): LOD crawler, RDF dump, parser (NxParser), n-quad stream, FIFO instance cache, schema extractor, schema-level index, RDF triple store, RDBMS
  38. Building the Index from a Stream
      • Stream of n-quads (coming from an LD crawler):
        … Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1
      • Figure: FIFO cache of instances (1 to 6) assigned to classes C1, C2, C3
      • Linear runtime complexity with respect to the number of input triples
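The stream-based construction on slides 37 and 38 can be approximated with a bounded FIFO cache of instances: each quad is touched once, and an instance's schema information is written to the index when the instance leaves the cache. The Python sketch below indexes type clusters only and evicts by insertion order; these choices, and the name `stream_index`, are assumptions for illustration rather than a description of the actual SchemEX code.

```python
from collections import OrderedDict, defaultdict


def stream_index(quads, cache_size):
    """Build a type-cluster index from a quad stream with a bounded FIFO cache."""
    cache = OrderedDict()      # instance -> (types, contexts), oldest first
    index = defaultdict(set)   # type cluster -> data sources

    def flush(types, contexts):
        if types:
            index[frozenset(types)] |= contexts

    for subject, predicate, obj, context in quads:
        if subject not in cache:
            if len(cache) >= cache_size:
                # Cache full: evict the oldest instance and index what is known.
                _, (types, contexts) = cache.popitem(last=False)
                flush(types, contexts)
            cache[subject] = (set(), set())
        types, contexts = cache[subject]
        if predicate == "rdf:type":
            types.add(obj)
            contexts.add(context)

    for types, contexts in cache.values():  # end of stream: flush the rest
        flush(types, contexts)
    return index


# Hypothetical stream: the two type statements about "a" are separated by "b".
stream = [
    ("a", "rdf:type", "foaf:Person", "ctx1"),
    ("b", "rdf:type", "ex:Thing", "ctx2"),
    ("a", "rdf:type", "pim:Male", "ctx1"),
]
small = stream_index(stream, cache_size=1)
large = stream_index(stream, cache_size=2)
```

With `cache_size=1`, instance `a` is evicted before its second type arrives, so the index records two partial clusters instead of `{foaf:Person, pim:Male}`; with `cache_size=2` the cluster is complete. This is the kind of accuracy loss that a larger window reduces, while each quad is still processed only once.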
  39. Computing SchemEX: TimBL Data Set
      • Analysis of a smaller data set
      • 11 M triples, TimBL's FOAF profile
      • LDspider with ~2k triples / sec
      • Different cache sizes: 100, 1k, 10k, 50k, 100k
      • Compared SchemEX with a reference schema
      • Index queries on all types, TCs, and EQCs
      • Good precision/recall ratio at 50k+
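Slide 39 compares the stream-based index with a reference schema in terms of precision and recall. A minimal Python sketch of that comparison for a single index query is given below; the data source names and the convention of returning 1.0 for empty sets are assumptions, and the numbers are not taken from the evaluation.

```python
def precision_recall(retrieved: set, relevant: set):
    """Precision and recall of the sources returned for one index query."""
    true_positives = len(retrieved & relevant)
    # Convention assumed here: an empty set counts as perfect (1.0).
    precision = true_positives / len(retrieved) if retrieved else 1.0
    recall = true_positives / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical result of one query: sources from the stream-based index
# versus sources from the reference schema.
retrieved = {"ds1", "ds2", "ds3"}
relevant = {"ds1", "ds2", "ds4", "ds5"}
precision, recall = precision_recall(retrieved, relevant)
```

Here two of the three retrieved sources are correct and two of the four relevant sources are found, so precision is 2/3 and recall is 1/2.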
  40. Quality of Stream-based Index Construction
      • Runtime hardly increases with window size
      • Memory consumption scales with window size
  41. Computing SchemEX: Full BTC 2011 Data
      Cache size: 50k
  42. Billion Triple Challenge 2011
      …
  43. Conclusions: SchemEX
      • Linked Open Data (LOD) approach
        – Publishing and interlinking data on the Web
      • SchemEX
        – Stream-based approach to LOD schema extraction
        – Scalable to arbitrary amounts of Linked Data
        – Applicable on commodity hardware (4 GB RAM, single CPU)
  44. Learning Goals
      • Understand the motivation and fundamentals of Linked Open Data (LOD).
      • Explain why an index for LOD is needed and how to create such an index efficiently.
  45. Recommended Readings
      • Maciej Janik, Ansgar Scherp, Steffen Staab: The Semantic Web: Collective Intelligence on the Web. Informatik Spektrum 34(5): 469-483 (2011). URL: http://dx.doi.org/10.1007/s00287-011-0535-x
      • Mathias Konrath, Thomas Gottron, Steffen Staab, Ansgar Scherp: SchemEX – Efficient construction of a data catalogue by stream-based indexing of linked data. J. of Web Semantics: Science, Services and Agents on the World Wide Web, available online 23 June 2012. URL: http://www.sciencedirect.com/science/article/pii/S1570826812000716
      • Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool Publishers, 2011. URL: http://dx.doi.org/10.2200/S00334ED1V01Y201102WBE001