Linked open data - how to juggle with more than a billion triples

Slides of my inaugural lecture at the University of Mannheim, Germany, in October 2012. Download this slide set to enjoy all animations.



  1. 1. How to Juggle with more than a Billion Triples?
     Ansgar Scherp, Research Group on Data and Web Science, Universität Mannheim, October 2012
     Image source: http://www.flickr.com/photos/pedromourapinheiro/2122754745/
     Ansgar Scherp – ansgar@informatik.uni-mannheim.de
  2. 2. My thanks go to …
     Marianna, Daniel Eißing, Simon Schenk, Mathias Konrath, Carsten Saathoff, Daniel Schmeiß, Thomas Franz, Anton Baumesberger, Thomas Gottron, Frederik Jochum, Steffen Staab, Alexander Kleinen, Arne Peters, Bastian Krayer, and many more …
  3. 3. Scenario
     • Tim plans to travel
       – from London
       – to a customer in Cologne
  4. 4. Website of the German Railway
     It works, why bother …?
  5. 5. Let's Try Different Queries
     • Bottlenecks in public transportation?
     • Compare the connections with flights?
     • Visualize on a map?
     • …
     None of these queries can be answered, because the data …
  6. 6. … is locked in silos!
     – High integration effort
     – Lack of data reuse
     Image: B. Jagendorf, http://www.flickr.com/photos/bobjagendorf/, CC-BY
  7. 7. Linked Data
     • Publishing and interlinking of data
     • Different quality and purpose
     • From different sources on the Web

     World Wide Web      | Linked Data
     Documents           | Data
     Hyperlinks          | Typed links
     HTML                | RDF
     Addresses (URIs)    | Addresses (URIs)

     Example: http://www.uni-mannheim.de/
  8. 8. Relevance of Linked Data?
  9. 9. Linked Data: May '07 to Sept. '11
     Domains: Web 2.0, Media, Publications, eGovernment, Cross-Domain, Life Sciences, Geographic
     < 31 billion triples
     Source: http://lod-cloud.net
  10. 10. Linked Data Principles
      1. Identification
      2. Interlinkage
      3. Dereferencing
      4. Description
  11. 11. Example: Big Lynx
      [Figure: Matt Briggs, Scott Miller, "?", Big Lynx company]
      < 31 billion triples. Source: http://lod-cloud.net
  12. 12. 1. Use URIs for Identification
      • Matt Briggs: http://biglynx.co.uk/people/matt-briggs
      • Scott Miller: http://biglynx.co.uk/people/scott-miller
      Image: B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY
  13. 13. Example: Big Lynx
      [Figure: Matt Briggs, Scott Miller, Big Lynx company]
      How to model relationships like "knows"?
  14. 14. Resource Description Framework (RDF)
      • Description of resources with RDF triples
        Subject: Matt Briggs | Predicate: is a | Object: Person

      @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
      @prefix foaf: <http://xmlns.com/foaf/0.1/> .
      <http://biglynx.co.uk/people/matt-briggs> rdf:type foaf:Person .
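The triple on this slide can be mirrored in a few lines of Python. This is only a minimal sketch and not part of the original talk: the `Triple` type and the `expand` helper are hypothetical names chosen for illustration, and the prefix table uses the standard RDF and FOAF namespace URIs.

```python
from typing import NamedTuple

# Standard namespace URIs for the two prefixes used in the example.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


class Triple(NamedTuple):
    """Hypothetical container for one RDF statement."""
    subject: str
    predicate: str
    object: str


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as 'foaf:Person' into a full URI."""
    prefix, local = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local


# "Matt Briggs is a Person" as subject, predicate, object.
matt_is_person = Triple(
    subject="http://biglynx.co.uk/people/matt-briggs",
    predicate=expand("rdf:type"),
    object=expand("foaf:Person"),
)
```

A real application would more likely rely on an RDF library; the tuple form is used here only to make the subject/predicate/object structure explicit.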
  15. 15. 1. Use URIs also for Relations
      • http://biglynx.co.uk/people/matt-briggs
      • http://biglynx.co.uk/people/scott-miller
      Image: B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY
  16. 16. Example: Big Lynx
      [Figure: Dave Smith "lives here" London (DBpedia); Matt Briggs (Big Lynx company) is the "same person" as Matt Briggs on Matt's private website; Scott Miller; …]
  17. 17. 2. Establishing Interlinkage
      • Relation links between resources
        <http://biglynx.co.uk/people/dave-smith> foaf:based_near <http://dbpedia.org/resource/London> .
      • Identity links between resources
        <http://biglynx.co.uk/people/matt-briggs> owl:sameAs <http://www.matt-briggs.eg.uk#me> .
  18. 18. Example: Big Lynx
      [Figure: Dave Smith "lives here" (foaf:based_near) London (DBpedia); Matt Briggs (Big Lynx company) "same person" (owl:sameAs) Matt Briggs on Matt's private website]
  19. 19. 3. Dereferencing of URIs
      • Looking up web documents
      • How can we "look up" things of the real world?
        http://biglynx.co.uk/people/matt-briggs
  20. 20. Two Approaches
      1. Hash URIs
         – The URI contains a part separated by #, e.g., http://biglynx.co.uk/vocab/sme#Team
      2. Negotiation via "303 See Other"
         – Request: http://biglynx.co.uk/people/matt-briggs
         – Response: "Look here:" http://biglynx.co.uk/people/matt-briggs.rdf
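The two dereferencing approaches can be sketched without any network access. The following is a rough illustration under stated assumptions: the `SEE_OTHER` table is a hypothetical stand-in for a server that answers with "303 See Other", and `resolve` is an illustrative helper name, not an existing API.

```python
from urllib.parse import urldefrag


def document_for_hash_uri(uri: str) -> str:
    """For a hash URI, the document to fetch is the URI without its fragment."""
    document_url, _fragment = urldefrag(uri)
    return document_url


# Hypothetical redirect table standing in for a server that answers
# "303 See Other" for URIs that identify real-world things.
SEE_OTHER = {
    "http://biglynx.co.uk/people/matt-briggs":
        "http://biglynx.co.uk/people/matt-briggs.rdf",
}


def resolve(uri: str) -> str:
    """Return the URL of the document that describes the thing named by uri."""
    if "#" in uri:
        return document_for_hash_uri(uri)
    # Simulated 303 response; fall back to the URI itself if none is known.
    return SEE_OTHER.get(uri, uri)


team_document = resolve("http://biglynx.co.uk/vocab/sme#Team")
matt_document = resolve("http://biglynx.co.uk/people/matt-briggs")
```

With a hash URI the client strips the fragment itself, so one request suffices; with the 303 approach the server decides where the description lives, at the cost of an extra round trip.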
  21. 21. Example: Big Lynx
      [Figure: Dave Smith foaf:based_near London (DBpedia); Matt Briggs (Big Lynx company) owl:sameAs Matt Briggs on Matt's private website; open question: "Description of Matt?"]
  22. 22. 4. Description of URIs
      [Figure: RDF graph around biglynx:matt-briggs with the edges rdf:type foaf:Person, foaf:based_near dp:Birmingham, foaf:knows biglynx:dave-smith, and ex:loc _:point; the blank node _:point has wgs84:lat "51.509" and wgs84:long "-0.118"; biglynx:dave-smith has foaf:based_near dp:London; further edges are elided (…)]
  23. 23. Formalization of Description
      Given an RDF graph $G = (V, P, E)$ with $V = R \cup B \cup L$ and $E \subseteq (R \cup B) \times P \times V$:
      $$\mathrm{SimpleCBD}(n) = \bigcup_{j=0}^{\infty} I_j \quad \text{with}$$
      $$I_0 = \{ (s, p, o) \mid (s, p, o) \in E \wedge s = n \}$$
      $$I_{j+1} = \{ (o, p', o') \in E \mid \exists (s, p, o) \in I_j : o \in B \wedge (o, p', o') \notin \bigcup_{k=0}^{j} I_k \}$$
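The recursive definition on this slide can be read as a graph traversal: start with all triples whose subject is n, then keep following objects that are blank nodes. The sketch below is one possible reading, not the author's implementation. It assumes triples are plain string tuples and that blank nodes are recognisable by the `_:` prefix; the small example graph, including the attachment of `ex:loc` to `biglynx:matt-briggs`, is illustrative.

```python
def simple_cbd(triples, node, is_blank=lambda term: term.startswith("_:")):
    """Collect all triples with subject `node`, then follow blank-node objects."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, []).append((s, p, o))

    description = set()
    visited = set()
    frontier = [node]
    while frontier:
        current = frontier.pop()
        if current in visited:
            continue  # triples of this subject are already in the description
        visited.add(current)
        for triple in by_subject.get(current, []):
            description.add(triple)
            if is_blank(triple[2]):
                frontier.append(triple[2])
    return description


# Illustrative graph using the prefixed names from the Big Lynx example.
graph = [
    ("biglynx:matt-briggs", "rdf:type", "foaf:Person"),
    ("biglynx:matt-briggs", "foaf:knows", "biglynx:dave-smith"),
    ("biglynx:matt-briggs", "ex:loc", "_:point"),
    ("_:point", "wgs84:lat", "51.509"),
    ("_:point", "wgs84:long", "-0.118"),
    ("biglynx:dave-smith", "foaf:based_near", "dp:London"),
]
cbd = simple_cbd(graph, "biglynx:matt-briggs")
```

The description of Matt contains his own three triples plus the two triples of the blank node `_:point`, but not the triple about Dave Smith, because `biglynx:dave-smith` is a named resource and is therefore not expanded.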
  24. 24. W3C RDF / RDF Schema Vocabulary
      • Set of URIs defined in the rdf: / rdfs: namespaces
      • rdf: rdf:type, rdf:Property, rdf:XMLLiteral, rdf:List, rdf:first, rdf:rest, rdf:Seq, rdf:Bag, rdf:Alt, …, rdf:value
      • rdfs: rdfs:domain, rdfs:range, rdfs:Resource, rdfs:Literal, rdfs:Datatype, rdfs:Class, rdfs:subClassOf, rdfs:subPropertyOf, rdfs:comment, …, rdfs:label
  25. 25. Semantic Web Layer Cake (Simplified)
  26. 26. Exploration of Linked Data
      [Figure: WordNet, Swoogle, GeoNames]
      < 31 billion triples. Source: http://lod-cloud.net
  27. 27. Naive Approach
      • Download all data
      • Store it in a really big database
      • Program the queries
      • Design the user interface
      [Figure: RDFS rules, WordNet, Swoogle, Geo, GeoNames]
      Drawbacks: inflexible, monolithic, not scalable
  28. 28. SemaPlorer Approach
      Flexible, extensible, scalable
      [Figure: queries over Geo, RDFS rules, and fulltext; > 1 billion triples from WordNet, Swoogle, GeoNames, and other sources; property labels birthplace / placeOfBirth; "12 months in 2005/06"; "700 million triples"]
  29. 29. SemaPlorer – Semantic Social Media
      Watch video online: http://vimeo.com/2057249
  30. 30. Billion Triple Challenge 2008 [JWS 2009]
  31. 31. Searching for Linked Data Sources
      Query: persons that are both politicians and actors?
      < 31 billion triples. Source: http://lod-cloud.net
  32. 32. Idea: Index of Data Sources
      Query "Politician and Actor":
      SELECT ?x
      FROM …
      WHERE {
        ?x rdf:type ex:Actor .
        ?x rdf:type ex:Politician .
      }
      [Figure: the query is answered via an index ("Index ?")]
  33. 33. The Naive Approach
      1. Download the entire LOD cloud
      2. Put it into a (really) large triple store
      3. Process the data and extract the schema
      4. Provide lookup
      Drawbacks:
      – Big machinery
      – Data is processed late
      – High effort to scale with the LOD cloud
  34. 34. Idea Schema-level index  Define families of graph patterns  Assign instances to graph patterns  Map graph patterns to context (source URI) Construction  Stream-based for scalability  Little loss of accuracy Note  Index defined over instances  But stores the contextAnsgar Scherp – ansgar@informatik.uni-mannheim.de Slide 34
  35. 35. Input Data
      n-Quads: <subject> <predicate> <object> <context>
      Example:
      <http://www.w3.org/People/Connolly/#me>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#
      <http://xmlns.com/foaf/0.1/Person>
      <http://dig.csail.mit.edu/2008/webdav/timbl/
      [Figure: context http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf containing w3p:#me and foaf:Person]
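To make the quad layout concrete, here is a minimal parsing sketch. It is an illustration only: it handles lines made of four `<...>` terms and deliberately ignores literals and blank nodes, which real n-quads data also contains. The example line uses hypothetical example.org URIs, and `parse_quad` is an illustrative name.

```python
import re

# One line of four <...> terms followed by the terminating dot.
QUAD_PATTERN = re.compile(
    r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$"
)


def parse_quad(line: str):
    """Split a URI-only n-quad line into (subject, predicate, object, context)."""
    match = QUAD_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a URI-only n-quad: {line!r}")
    return match.groups()


# Hypothetical example line; the URIs are placeholders.
line = (
    "<http://example.org/people/alice> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://xmlns.com/foaf/0.1/Person> "
    "<http://example.org/people.rdf> ."
)
subject, predicate, obj, context = parse_quad(line)
```

The fourth term is what distinguishes a quad from a triple: it records the source document, which is the piece of information the index later returns.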
  36. 36. SchemEX Approach
      • Stream-based schema extraction
      • While crawling the data
      [Figure: pipeline with the components LOD crawler, RDF dump, NxParser (parser), n-quad stream, FIFO instance cache, schema extractor, schema-level index, RDF triple store, RDBMS]
  37. 37. Building the Index from a Stream
      • Stream of n-quads (coming from an LD crawler): …, Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1
      [Figure: FIFO cache of instances with their classes C1, C2, C3]
      • Linear runtime complexity with respect to the number of input triples
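The stream-based construction can be sketched with a bounded first-in, first-out cache of instances. This is a simplified illustration under several assumptions, not the SchemEX implementation: it indexes only sets of types, it writes an instance to the index when the instance is evicted from the cache, and the names `stream_type_index`, `quads`, and the `ds1`/`ds2` contexts are hypothetical.

```python
from collections import OrderedDict

RDF_TYPE = "rdf:type"


def stream_type_index(quads, cache_size):
    """Map each observed set of types to the contexts in which it was seen."""
    cache = OrderedDict()  # subject -> (set of types, set of contexts)
    index = {}             # frozenset of types -> set of contexts

    def flush(types, contexts):
        if types:
            index.setdefault(frozenset(types), set()).update(contexts)

    for s, p, o, c in quads:
        if s not in cache:
            if len(cache) >= cache_size:
                # Evict the oldest instance (FIFO) and write it to the index.
                _, (types, contexts) = cache.popitem(last=False)
                flush(types, contexts)
            cache[s] = (set(), set())
        types, contexts = cache[s]
        contexts.add(c)
        if p == RDF_TYPE:
            types.add(o)

    # Flush whatever is still cached when the stream ends.
    for types, contexts in cache.values():
        flush(types, contexts)
    return index


quads = [
    ("a", "rdf:type", "foaf:Person", "ds1"),
    ("b", "rdf:type", "foaf:Person", "ds2"),
    ("a", "rdf:type", "pim:Male", "ds1"),
]
index_large_cache = stream_type_index(quads, cache_size=2)
index_small_cache = stream_type_index(quads, cache_size=1)
```

Each quad is handled with a constant number of dictionary operations, which matches the linear runtime stated on the slide. With a cache of size 1, instance `a` is evicted before its second type arrives, so the combined type set `{foaf:Person, pim:Male}` is never recorded: this is the kind of accuracy loss that a larger cache reduces.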
  38. 38. Building the Schema and Index
      • RDF classes: C1, C2, C3, …, Ck
        (consistsOf)
      • Type clusters: TC1, TC2, …, TCm
        (hasEQClass; properties p1, p2)
      • Equivalence classes: EQC1, EQC2, …, EQCn
        (hasDataSource)
      • Data sources: DS1, DS2, DS3, DS4, DS5, …, DSx
  39. 39. Layer 1: RDF Classes
      All instances of a particular type
      SELECT ?x
      FROM …
      WHERE {
        ?x rdf:type foaf:Person .
      }
      [Figure: class C1 (foaf:Person) linked to the data sources DS 1, DS 2, DS 3; example: timbl:card#i of type foaf:Person in http://www.w3.org/People/Berners-Lee/card; further source http://dig.csail.mit.edu/2008/...]
  40. 40. Layer 2: Type Clusters
      All instances belonging to exactly the same set of types
      SELECT ?x
      FROM …
      WHERE {
        ?x rdf:type foaf:Person .
        ?x rdf:type pim:Male .
      }
      [Figure: type cluster TC1 (tc4711) combines the classes C1 (foaf:Person) and C2 (pim:Male) and is linked to the data sources DS 1, DS 2, DS 3; example: timbl:card#i of types foaf:Person and pim:Male in http://www.w3.org/People/Berners-Lee/card]
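A lookup on the type-cluster layer might look as follows. This is a hedged sketch: the index content is made up for illustration (the second source is a hypothetical example.org URL), `sources_for_types` is an illustrative name, and matching every cluster that contains all queried types is an assumption about how a query with several `rdf:type` patterns would be answered.

```python
def sources_for_types(index, query_types):
    """Return all data sources whose type cluster contains every queried type."""
    wanted = frozenset(query_types)
    sources = set()
    for cluster, cluster_sources in index.items():
        if wanted <= cluster:  # the cluster has at least the queried types
            sources |= cluster_sources
    return sources


# Illustrative index: type cluster -> data sources (contexts).
type_cluster_index = {
    frozenset({"foaf:Person", "pim:Male"}): {
        "http://www.w3.org/People/Berners-Lee/card",
    },
    frozenset({"foaf:Person"}): {
        "http://example.org/people.rdf",  # hypothetical source
    },
}

male_persons = sources_for_types(type_cluster_index, {"foaf:Person", "pim:Male"})
all_persons = sources_for_types(type_cluster_index, {"foaf:Person"})
```

A type cluster groups instances with exactly the same set of types, but a query for two types is also satisfied by instances that carry additional types, which is why the lookup tests for a superset and not for equality.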
  41. 41. Layer 3: Equivalence Classes
      Two instances are equivalent iff:
      – They are in the same TC
      – They have the same properties
      – The property targets are in the same TC
      Similar to 1-bisimulation
      [Figure: classes C1, C2, C3; type clusters TC1, TC2 connected by property p; equivalence class EQC1; data sources DS 1, DS 2, DS 3]
  42. 42. Layer 3: Equivalence Classes
      SELECT ?x
      WHERE {
        ?x rdf:type foaf:Person .
        ?x rdf:type pim:Male .
        ?x foaf:maker ?y .
        ?y rdf:type foaf:PersonalProfileDocument .
      }
      [Figure: equivalence class eqc0815 links type cluster tc4711 (foaf:Person, pim:Male) via foaf:maker to type cluster tc1234 (foaf:PPD); example: timbl:card#i foaf:maker timbl:card in http://www.w3.org/People/Berners-Lee/card]
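The equivalence criterion (same type cluster, same properties, property targets in the same type cluster) can be sketched as a grouping by signature. This is a non-streaming illustration under assumptions: triples are string tuples, targets without any type get the empty type cluster, the direction of `foaf:maker` follows the query on the slide, and `ex:alice` is a hypothetical instance added for contrast.

```python
def type_clusters(triples):
    """Map every subject to the frozenset of its rdf:type objects."""
    types = {}
    for s, p, o in triples:
        types.setdefault(s, set())
        if p == "rdf:type":
            types[s].add(o)
    return {s: frozenset(t) for s, t in types.items()}


def equivalence_classes(triples):
    """Group instances by (own type cluster, {(property, target type cluster)})."""
    tc = type_clusters(triples)
    links = {}
    for s, p, o in triples:
        if p != "rdf:type":
            links.setdefault(s, set()).add((p, tc.get(o, frozenset())))
    classes = {}
    for instance, own_cluster in tc.items():
        signature = (own_cluster, frozenset(links.get(instance, set())))
        classes.setdefault(signature, set()).add(instance)
    return classes


triples = [
    ("timbl:card#i", "rdf:type", "foaf:Person"),
    ("timbl:card#i", "rdf:type", "pim:Male"),
    ("timbl:card#i", "foaf:maker", "timbl:card"),
    ("timbl:card", "rdf:type", "foaf:PersonalProfileDocument"),
    ("ex:alice", "rdf:type", "foaf:Person"),  # hypothetical second person
]
classes = equivalence_classes(triples)
```

Here `timbl:card#i` and `ex:alice` end up in different equivalence classes: they differ in their type clusters, and only `timbl:card#i` has a `foaf:maker` link to an instance typed as `foaf:PersonalProfileDocument`.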
  43. 43. Computing SchemEX: TimBL Data Set
      • Analysis of a smaller data set
      • 11 M triples, TimBL's FOAF profile
      • LDspider with ~ 2k triples / sec
      • Different cache sizes: 100, 1k, 10k, 50k, 100k
      • Compared SchemEX with a reference schema
      • Index queries on all types, TCs, EQCs
      • Good precision/recall ratio at 50k+
      • Commodity hardware (4 GB RAM, single CPU)
  44. 44. Quality of Stream-based Index Construction
      + Runtime hardly increases with window size
      + Memory consumption scales with window size
  45. 45. Computing SchemEX: Full BTC 2011 Data
      Cache size: 50k
  46. 46. Billion Triple Challenge 2011 … [JWS 2012]
  47. 47. And 2012? Get the Google Feeling!
  48. 48. Semantic Data Management Chain
      • Research topics in a greater context
      • Chain: Publish, Collect, Aggregate, Use
      • Projects: SchemEX*, OntoMDE, SemaPlorer*, Kreuzverweis.com, Core Ontologies, Mobile Facets
      * Winner of the Billion Triple Challenge 2011/2008
      See: dws.informatik.uni-mannheim.de
  49. 49. Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 49
  50. 50. Recommended Readings
      • Maciej Janik, Ansgar Scherp, Steffen Staab: The Semantic Web: Collective Intelligence on the Web. Informatik Spektrum 34(5): 469-483 (2011). URL: http://dx.doi.org/10.1007/s00287-011-0535-x
      • Simon Schenk, Carsten Saathoff, Steffen Staab, Ansgar Scherp: SemaPlorer - Interactive semantic exploration of data and media based on a federated cloud infrastructure. J. Web Sem. 7(4): 298-304 (2009). URL: http://dx.doi.org/10.1016/j.websem.2009.09.006
      • Mathias Konrath, Thomas Gottron, Steffen Staab, Ansgar Scherp: SchemEX - Efficient construction of a data catalogue by stream-based indexing of linked data. J. of Web Semantics: Science, Services and Agents on the World Wide Web, available online 23 June 2012. URL: http://www.sciencedirect.com/science/article/pii/S1570826812000716
      • Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool Publishers, 2011. URL: http://dx.doi.org/10.2200/S00334ED1V01Y201102WBE001
