SchemEX -- Building an Index for Linked Open Data

General introduction to Linked Open Data and schema extraction using SchemEX. Download full slide set to enjoy all animations.

Presentation Transcript

  • SchemEX – Building an Index for Linked Open Data
    Ansgar Scherp, Thomas Gottron, Mathias Konrath
    University of Koblenz-Landau, Germany
    Oslo, Norway, August 2012
    SchemEX – Building an Index for LOD, Slide 1 of 44
  • Learning Goals
    • Understand the motivation and fundamentals of Linked Open Data (LOD).
    • Explain why an index for LOD is needed and how to create such an index efficiently.
  • Scenario
    • Tim plans to travel
        – from London
        – to a customer in Cologne
  • Website of the German Railway
    • It works, so why bother?
  • Let's Try Different Queries
    • Bottlenecks in public transportation?
    • Compare the connections with flights?
    • Visualize on a map?
    • …
    • None of these queries can be answered, because the data …
  • … is locked in silos!
        – High integration effort
        – Lack of data reuse
    Image credit: B. Jagendorf, http://www.flickr.com/photos/bobjagendorf/, CC-BY
  • Linked Data
    • Publishing and interlinking of data
    • Different quality and purpose
    • From different sources in the Web

    World Wide Web      Linked Data
    Documents           Data
    Hyperlinks          Typed links
    HTML                RDF
    Addresses (URIs)    Addresses (URIs)

    Example: http://www.uio.no/
  • Relevance of Linked Data?
  • Linked Data: May '07 → Sept. '11
    • Domains: Web 2.0, Media, Publications, eGovernment, Cross-Domain, Life Sciences, Geographic
    • < 31 billion triples
    • Source: http://lod-cloud.net
  • Linked Data Principles
    1. Identification
    2. Interlinkage
    3. Dereferencing
    4. Description
  • Example: Big Lynx
    (Diagram labels: Matt Briggs, Scott Miller, Big Lynx Company; the relationship between them is marked "?")
  • 1. Use URIs for Identification
    • Matt Briggs: http://biglynx.co.uk/people/matt-briggs
    • Scott Miller: http://biglynx.co.uk/people/scott-miller
    Image credit: B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY
  • Example: Big Lynx Matt Briggs Scott Miller Big Lynx Company How to model relationships like knows?SchemEX – Building an Index for LOD Slide 13 of 44
  • Resource Description Framework (RDF)
    • Description of resources with RDF triples

        Matt Briggs    is a         Person
        Subject        Predicate    Object

        @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
        @prefix foaf: <http://xmlns.com/foaf/0.1/> .
        <http://biglynx.co.uk/people/matt-briggs> rdf:type foaf:Person .
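The triple on this slide can be sketched in a few lines of Python. This is only an illustration that assumes plain tuples as the in-memory representation; the helper `to_ntriples` is a hypothetical name and not part of any RDF library.

```python
# Namespace URIs as used on the slide (rdf: and foaf: prefixes)
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"

# "Matt Briggs is a Person" as a (subject, predicate, object) tuple
triple = (
    "http://biglynx.co.uk/people/matt-briggs",
    RDF + "type",
    FOAF + "Person",
)


def to_ntriples(t):
    """Serialize a URI-only triple as one N-Triples line (hypothetical helper)."""
    s, p, o = t
    return f"<{s}> <{p}> <{o}> ."


line = to_ntriples(triple)
print(line)
```

The prefixed form `rdf:type` in the Turtle snippet is shorthand for the full predicate URI that this sketch builds by concatenation.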
  • 1. Use URIs Also for Relations
    • http://biglynx.co.uk/people/matt-briggs
    • http://biglynx.co.uk/people/scott-miller
    Image credit: B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY
  • Example: Big Lynx
    (Diagram labels: Dave Smith "lives here" London; Matt Briggs "same person" Matt Briggs on Matt's private website; Scott Miller; Big Lynx Company; DBpedia; …)
  • 2. Establishing Interlinkage
    • Relation links between resources
        <http://biglynx.co.uk/people/dave-smith> foaf:based_near <http://dbpedia.org/resource/London> .
    • Identity links between resources
        <http://biglynx.co.uk/people/matt-briggs> owl:sameAs <http://www.matt-briggs.eg.uk#me> .
  • Example: Big Lynx
    (Diagram labels: Dave Smith "lives here" London, modeled as foaf:based_near; Matt Briggs "same person" Matt Briggs on Matt's private website, modeled as owl:sameAs; Big Lynx Company; DBpedia)
  • 3. Dereferencing of URIs
    • Looking up web documents
    • How can we "look up" things of the real world?
    • http://biglynx.co.uk/people/matt-briggs
  • Two Approaches
    1. Hash URIs
        – URI contains a part separated by #, e.g., http://biglynx.co.uk/vocab/sme#Team
    2. Negotiation via "303 See Other"
        – Request: http://biglynx.co.uk/people/matt-briggs
        – Response: "Look here:" http://biglynx.co.uk/people/matt-briggs.rdf
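The two approaches can be sketched without any network access. The following is a minimal illustration, not a real HTTP client: the `REDIRECTS` table is a hypothetical stand-in for a server's "303 See Other" configuration, and the function names are assumptions made for this example.

```python
from urllib.parse import urldefrag

# Hypothetical stand-in for a server that answers "303 See Other"
REDIRECTS = {
    "http://biglynx.co.uk/people/matt-briggs":
        "http://biglynx.co.uk/people/matt-briggs.rdf",
}


def document_for_hash_uri(uri):
    """Approach 1: drop the fragment; the description lives in the remaining document."""
    return urldefrag(uri).url


def dereference(uri):
    """Return the URI of the document that describes the given resource."""
    if "#" in uri:
        return document_for_hash_uri(uri)
    # Approach 2: follow the (simulated) 303 redirect if one is configured
    return REDIRECTS.get(uri, uri)


team_doc = dereference("http://biglynx.co.uk/vocab/sme#Team")
matt_doc = dereference("http://biglynx.co.uk/people/matt-briggs")
print(team_doc)
print(matt_doc)
```

In a real client the fragment is stripped before the request is sent, so the server never sees it; with 303 redirects the server decides where the description lives.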
  • Example: Big Lynx
    (Diagram labels: Dave Smith foaf:based_near London; Matt Briggs owl:sameAs Matt Briggs on Matt's private website; Big Lynx Company; DBpedia)
    • Description of Matt?
  • 4. Description of URIs
    (Graph diagram. Node labels: biglynx:matt-briggs, biglynx:dave-smith, foaf:Person, dp:Birmingham, dp:London, _:point, "51.509", "-0.118". Edge labels: rdf:type, foaf:based_near, foaf:knows, ex:loc, wgs84:lat, wgs84:long.)
  • RDF / RDF Schema Vocabulary
    • Set of URIs defined in the rdf: and rdfs: namespaces

        rdf:type          rdfs:domain
        rdf:Property      rdfs:range
        rdf:XMLLiteral    rdfs:Resource
        rdf:List          rdfs:Literal
        rdf:first         rdfs:Datatype
        rdf:rest          rdfs:Class
        rdf:Seq           rdfs:subClassOf
        rdf:Bag           rdfs:subPropertyOf
        rdf:Alt           rdfs:comment
        ...               …
        rdf:value         rdfs:label
  • Semantic Web Layer Cake (Simplified)
  • Learning Goals
    • Understand the motivation and fundamentals of Linked Open Data (LOD).
    • Explain why an index for LOD is needed and how to create such an index efficiently.
  • Scenario
    • People who are politicians and actors
    • Who else?
    • Where do they live?
    • Whom do they know?
    • … are they married to?
  • Problem
    • No single federated query interface provided
    • Execute those queries on the LOD cloud

        SELECT ?x
        FROM …
        WHERE {
          ?x rdf:type ex:Actor .
          ?x rdf:type ex:Politician .
        }

      "politicians and actors"
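What the query asks for can be illustrated on a handful of in-memory triples. This sketch is not a SPARQL engine: the data (`ex:alice`, `ex:bob`, `ex:carol`) and the helper `instances_of` are hypothetical, and prefixed names are kept as plain strings for brevity.

```python
RDF_TYPE = "rdf:type"

# Hypothetical sample data; only ex:alice has both types
triples = [
    ("ex:alice", RDF_TYPE, "ex:Actor"),
    ("ex:alice", RDF_TYPE, "ex:Politician"),
    ("ex:bob", RDF_TYPE, "ex:Actor"),
    ("ex:carol", RDF_TYPE, "ex:Politician"),
]


def instances_of(data, cls):
    """All subjects that carry an rdf:type statement for the given class."""
    return {s for s, p, o in data if p == RDF_TYPE and o == cls}


# Both triple patterns share ?x, so the answer is the intersection
result = instances_of(triples, "ex:Actor") & instances_of(triples, "ex:Politician")
print(result)
```

On the LOD cloud the difficulty is not this intersection but knowing which data sources contain matching triples in the first place, which is what the index is for.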
  • Principle Solution
    • Suitable index structure for looking up sources
      "politicians and actors"
  • The Naive Approach
    1. Download the entire LOD cloud
    2. Put it into a (really) large triple store
    3. Process the data and extract schema
    4. Provide lookup
    Drawbacks:
        – Big machinery
        – Late in processing the data
        – High effort to scale with LOD cloud
  • Idea Schema-level index  Define families of graph patterns  Assign instances to graph patterns  Map graph patterns to context (source URI) Construction  Stream-based for scalability  Little loss of accuracy Note  Index defined over instances  But stores the contextSchemEX – Building an Index for LOD Slide 30 of 44
  • Input Data n-Quads <subject> <predicate> <object> <context> Example: <http://www.w3.org/People/Connolly/#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns# <http://xmlns.com/foaf/0.1/Person> <http://dig.csail.mit.edu/2008/webdav/timbl/ http://dig.csail.mit.edu/2008/ webdav/timbl/foaf.rdf w3p: #me foaf: PersonSchemEX – Building an Index for LOD Slide 31 of 44
  • Building the Schema and Index
    (Layered diagram, top to bottom)
    • RDF classes: C1, C2, C3, …, Ck
    • consistsOf
    • Type clusters: TC1, TC2, …, TCm
    • hasEQClass
    • Equivalence classes: EQC1, EQC2, …, EQCn (connected by properties p1, p2)
    • hasDataSource
    • Data sources: DS1, DS2, DS3, DS4, DS5, …, DSx
  • Layer 1: RDF Classes
    • All instances of a particular type

        SELECT ?x
        FROM …
        WHERE {
          ?x rdf:type foaf:Person .
        }

    (Diagram labels: class C1 = foaf:Person with data sources DS 1, DS 2, DS 3; instance timbl:card#i of type foaf:Person; sources http://dig.csail.mit.edu/2008/... and http://www.w3.org/People/Berners-Lee/card)
  • Layer 2: Type Clusters
    • All instances belonging to exactly the same set of types

        SELECT ?x
        FROM …
        WHERE {
          ?x rdf:type foaf:Person .
          ?x rdf:type pim:Male .
        }

    (Diagram labels: classes C1 = foaf:Person and C2 = pim:Male form type cluster TC1 = tc4711 with data sources DS 1, DS 2, DS 3; instance timbl:card#i of types foaf:Person and pim:Male; source http://www.w3.org/People/Berners-Lee/card)
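A type-cluster lookup table can be sketched as a dictionary keyed by the exact set of types. This is a simplified illustration, not the SchemEX implementation: it holds all quads in memory, uses prefixed names as plain strings, and adds a hypothetical second instance (`ex:alice` from `http://example.org/alice`) to show that different type sets land in different clusters.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"
CARD = "http://www.w3.org/People/Berners-Lee/card"

quads = [
    ("timbl:card#i", RDF_TYPE, "foaf:Person", CARD),
    ("timbl:card#i", RDF_TYPE, "pim:Male", CARD),
    # Hypothetical extra instance with a different type set
    ("ex:alice", RDF_TYPE, "foaf:Person", "http://example.org/alice"),
]


def type_clusters(data):
    """Map each exact set of types to the data sources containing such instances."""
    types = defaultdict(set)    # instance -> set of types
    sources = defaultdict(set)  # instance -> contexts that mention its types
    for s, p, o, c in data:
        if p == RDF_TYPE:
            types[s].add(o)
            sources[s].add(c)
    index = defaultdict(set)
    for s, ts in types.items():
        index[frozenset(ts)] |= sources[s]
    return dict(index)


index = type_clusters(quads)
print(index[frozenset({"foaf:Person", "pim:Male"})])
```

Because the key is the exact type set, `ex:alice` is not returned for the two-type cluster even though she is a `foaf:Person`.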
  • Layer 3: Equivalence Classes Two instances are C1 C2 C3 equivalent iff:  They are in the same TC TC1 TC2  They have the same p properties EQC1  The property targets are in the same TC DS 1 DS 2 DS 3  Similar to 1-BisimulationSchemEX – Building an Index for LOD Slide 35 of 44
  • Layer 3: Equivalence Classes

        SELECT ?x
        WHERE {
          ?x rdf:type foaf:Person .
          ?x rdf:type pim:Male .
          ?x foaf:maker ?y .
          ?y rdf:type foaf:PersonalProfileDocument .
        }

    (Diagram labels: type cluster tc4711 = {foaf:Person, pim:Male}; type cluster tc1234 = {foaf:PPD}; equivalence class eqc0815 = tc4711 -maker- tc1234; instance timbl:card#i foaf:maker timbl:card; source http://www.w3.org/People/Berners-Lee/card)
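The equivalence condition can be sketched by giving every instance a key made of its own type cluster plus the set of (property, target type cluster) pairs. This is an in-memory illustration under stated assumptions, not the SchemEX implementation: prefixed names are plain strings, the data is limited to the slide's example, and targets that never appear as subjects get an empty type cluster.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"
CARD = "http://www.w3.org/People/Berners-Lee/card"

quads = [
    ("timbl:card#i", RDF_TYPE, "foaf:Person", CARD),
    ("timbl:card#i", RDF_TYPE, "pim:Male", CARD),
    ("timbl:card#i", "foaf:maker", "timbl:card", CARD),
    ("timbl:card", RDF_TYPE, "foaf:PersonalProfileDocument", CARD),
]


def equivalence_classes(data):
    """Map (type cluster, {(property, target type cluster)}) to data sources."""
    types = defaultdict(set)    # instance -> set of types
    props = defaultdict(set)    # instance -> set of (property, target)
    sources = defaultdict(set)  # instance -> contexts
    for s, p, o, c in data:
        sources[s].add(c)
        if p == RDF_TYPE:
            types[s].add(o)
        else:
            props[s].add((p, o))
    index = defaultdict(set)
    for s in sources:
        own_cluster = frozenset(types.get(s, ()))
        edges = frozenset(
            (p, frozenset(types.get(o, ()))) for p, o in props.get(s, ())
        )
        index[(own_cluster, edges)] |= sources[s]
    return dict(index)


eqc_index = equivalence_classes(quads)
key = (
    frozenset({"foaf:Person", "pim:Male"}),
    frozenset({("foaf:maker", frozenset({"foaf:PersonalProfileDocument"}))}),
)
print(eqc_index[key])
```

Looking only one hop ahead (the target's type cluster, not the target's own properties) is what makes the condition resemble 1-bisimulation rather than full bisimulation.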
  • The SchemEX Approach
    • Stream-based schema extraction
    • While crawling the data
    (Pipeline diagram. Components: LOD-Crawler, RDF-Dump, Parser (NxParser), Nquad-Stream, FIFO Instance-Cache, Schema-Extractor, Schema-Level Index, RDF Triple Store, RDBMS)
  • Building the Index from a Stream
    • Stream of n-quads (coming from an LD crawler): … Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1
    (Diagram: quads enter a FIFO cache of instances, which are assigned to classes C1, C2, C3)
    • Linear runtime complexity with respect to the number of input triples
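The stream-based construction can be sketched with a bounded FIFO cache of instances. This is a simplified illustration under stated assumptions, not the SchemEX code: it builds only the type-cluster layer, the eviction policy (write the oldest instance to the index when the cache is full) is an assumption, and the quads `a`, `b`, `d1`, `d2` are hypothetical.

```python
from collections import OrderedDict, defaultdict

RDF_TYPE = "rdf:type"


def build_index(quad_stream, cache_size):
    """One pass over the stream; each quad is touched once, so runtime is linear."""
    cache = OrderedDict()     # instance -> (types, contexts), oldest first
    index = defaultdict(set)  # type cluster -> data sources

    def flush(instance, entry):
        types, contexts = entry
        index[frozenset(types)] |= contexts

    for s, p, o, c in quad_stream:
        if s not in cache:
            if len(cache) >= cache_size:
                # Cache full: evict the oldest instance with what is known so far
                flush(*cache.popitem(last=False))
            cache[s] = (set(), set())
        types, contexts = cache[s]
        contexts.add(c)
        if p == RDF_TYPE:
            types.add(o)
    while cache:
        flush(*cache.popitem(last=False))
    return dict(index)


stream = [
    ("a", RDF_TYPE, "foaf:Person", "d1"),
    ("b", RDF_TYPE, "foaf:Person", "d2"),
    ("a", RDF_TYPE, "pim:Male", "d1"),
]
large = build_index(stream, cache_size=2)
small = build_index(stream, cache_size=1)
print(large)
print(small)
```

With a cache of size 2 instance `a` is still cached when its second type arrives, so it lands in the cluster {foaf:Person, pim:Male}. With a cache of size 1 it is evicted early and ends up split across two clusters, which illustrates the small loss of accuracy that a limited window can cause.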
  • Computing SchemEX: TimBL Data Set
    • Analysis of a smaller data set
    • 11 M triples, TimBL's FOAF profile
    • LDspider with ~ 2k triples / sec
    • Different cache sizes: 100, 1k, 10k, 50k, 100k
    • Compared SchemEX with reference schema
    • Index queries on all Types, TCs, EQCs
    • Good precision/recall ratio at 50k+
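Comparing an index lookup against a reference result can be sketched with set-based precision and recall. This is a generic illustration; the slide does not state exactly how the authors computed their figures, and both the sample sets and the convention of returning 1.0 for an empty set are assumptions.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set of data sources against a reference set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Assumption: an empty set counts as perfect (1.0) to avoid division by zero
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical lookup: the stream-based index returns d1 and d2,
# while the reference schema lists only d1 for the same query
precision, recall = precision_recall({"d1", "d2"}, {"d1"})
print(precision, recall)
```

Averaging such pairs over all index queries on types, TCs and EQCs would give one way to compare cache sizes.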
  • Quality of Stream-based Index Construction
    • Runtime hardly increases with window size
    • Memory consumption scales with window size
  • Computing SchemEX: Full BTC 2011 Data
    • Cache size: 50 k
  • Billion Triple Challenge 2011 …
  • Conclusions: SchemEX
    • Linked Open Data (LOD) approach
        – Publishing and interlinking data on the web
    • SchemEX
        – Stream-based approach to LOD schema extraction
        – Scalable to arbitrary amounts of Linked Data
        – Applicable on commodity hardware (4 GB RAM, single CPU)
  • Learning Goals
    • Understand the motivation and fundamentals of Linked Open Data (LOD).
    • Explain why an index for LOD is needed and how to create such an index efficiently.
  • Recommended Readings
    • Maciej Janik, Ansgar Scherp, Steffen Staab: The Semantic Web: Collective Intelligence on the Web. Informatik Spektrum 34(5): 469-483 (2011). URL: http://dx.doi.org/10.1007/s00287-011-0535-x
    • Mathias Konrath, Thomas Gottron, Steffen Staab, Ansgar Scherp: SchemEX — Efficient construction of a data catalogue by stream-based indexing of linked data. J. of Web Semantics: Science, Services and Agents on the World Wide Web, available online 23 June 2012. URL: http://www.sciencedirect.com/science/article/pii/S1570826812000716
    • Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool Publishers, 2011. URL: http://dx.doi.org/10.2200/S00334ED1V01Y201102WBE001