How to Juggle with more
than a Billion Triples?

Ansgar Scherp
Research Group on Data and
Web Science

Universität Mannheim
October 2012
Image source: http://www.flickr.com/photos/pedromourapinheiro/2122754745/
Ansgar Scherp – ansgar@informatik.uni-mannheim.de
My thanks go to …
•    Marianna                                       •   Daniel Eißing
•    Simon Schenk                                   •   Mathias Konrath
•    Carsten Saathoff                               •   Daniel Schmeiß
•    Thomas Franz                                   •   Anton Baumesberger
•    Thomas Gottron                                 •   Frederik Jochum
•    Steffen Staab                                  •   Alexander Kleinen
•    Arne Peters
•    Bastian Krayer                                      And many more …


Scenario

• Tim plans to travel
  – from London
  – to a customer in Cologne




Website of the German Railway




It works, why bother…?
Let's Try Different Queries

• Bottlenecks in public transportation?
• Compare the connections with flights?
• Visualize on a map?
• …

→ All these queries cannot be answered,
  because the data …

… locked in Silos!


 – High integration effort
 – Lack of data reuse
Image: B. Jagendorf, http://www.flickr.com/photos/bobjagendorf/, CC-BY
Linked Data
• Publishing and interlinking of data
• Different quality and purpose
• From different sources on the Web

        World Wide Web        Linked Data
        Documents             Data
        Hyperlinks            Typed Links
        HTML                  RDF
        Addresses (URIs)      Addresses (URIs)

Example: http://www.uni-mannheim.de/
Relevance of Linked Data?




Linked Data: May '07 → Sept. '11
[Figure: LOD cloud diagram with the domain clusters Media, Geographic, Publications, Web 2.0, eGovernment, Cross-Domain, and Life Sciences]
< 31 Billion Triples. Source: http://lod-cloud.net
Linked Data Principles


1.        Identification
2.        Interlinkage
3.        Dereferencing
4.        Description




Example: Big Lynx
[Figure: Matt Briggs, Scott Miller, and the Big Lynx Company, with a question mark, next to the LOD cloud]
< 31 Billion Triples. Source: http://lod-cloud.net
1. Use URIs for Identification




Matt Briggs:  http://biglynx.co.uk/people/matt-briggs
Scott Miller: http://biglynx.co.uk/people/scott-miller
Image: B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY
Example: Big Lynx
[Figure: Matt Briggs, Scott Miller, and the Big Lynx Company]

→ How to model relationships like knows?

Resource Description Framework (RDF)
• Description of resources with RDF triples

      Subject          Predicate    Object
      Matt Briggs      is a         Person

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://biglynx.co.uk/people/matt-briggs>
    rdf:type foaf:Person .
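A minimal Python sketch of the same triple may help to see what the prefixes do. It represents a triple as a plain tuple and expands prefixed names to full URIs. The `PREFIXES` table and the `expand` helper are illustrative names and not part of the talk.

```python
# Prefix table mirroring the two @prefix declarations on the slide.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand(term: str) -> str:
    """Expand a prefixed name such as foaf:Person to a full URI."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    # Already a full URI, or a prefix this sketch does not know.
    return term


# (subject, predicate, object)
triple = (
    "http://biglynx.co.uk/people/matt-briggs",
    expand("rdf:type"),
    expand("foaf:Person"),
)
```

Subject, predicate, and object are all URIs here, which is the point of the next principle: relations are identified by URIs as well.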
1. Use URIs also for Relations




http://biglynx.co.uk/people/matt-briggs
http://biglynx.co.uk/people/scott-miller
Image: B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY
Example: Big Lynx
[Figure: Dave Smith "lives here": London (DBpedia); Matt Briggs of the Big Lynx Company is the "same person" as Matt Briggs on Matt's private website; Scott Miller; …]
2. Establishing Interlinkage
• Relation links between resources
       <http://biglynx.co.uk/people/dave-smith>
           foaf:based_near
           <http://dbpedia.org/resource/London> .

• Identity links between resources
       <http://biglynx.co.uk/people/matt-briggs>
           owl:sameAs
           <http://www.matt-briggs.eg.uk#me> .
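Identity links can be followed mechanically. The sketch below, which is an illustration and not part of the talk, collects all URIs that denote the same thing as a given URI by treating `owl:sameAs` as symmetric and transitive. The function name `identity_set` and the prefixed-name strings are assumptions made for brevity.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]
OWL_SAME_AS = "owl:sameAs"  # prefixed name used as a plain string in this sketch


def identity_set(triples: Iterable[Triple], start: str) -> Set[str]:
    """Return all URIs connected to start via owl:sameAs links."""
    neighbours: Dict[str, Set[str]] = {}
    for s, p, o in triples:
        if p == OWL_SAME_AS:
            # owl:sameAs is symmetric, so record both directions.
            neighbours.setdefault(s, set()).add(o)
            neighbours.setdefault(o, set()).add(s)
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search yields the transitive closure
        node = queue.popleft()
        for other in neighbours.get(node, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen
```

Relation links such as `foaf:based_near` are ignored here on purpose: they connect different things, whereas identity links connect different names for one thing.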
Example: Big Lynx
[Figure: Dave Smith foaf:based_near ("lives here") London (DBpedia); Matt Briggs of the Big Lynx Company owl:sameAs ("same person") Matt Briggs on Matt's private website]
3. Dereferencing of URIs

• Looking up web documents

• How can we "look up" things of the real world?

Example: http://biglynx.co.uk/people/matt-briggs

Two Approaches
1. Hash URIs
   – URI contains a part separated by #, e.g.,
     http://biglynx.co.uk/vocab/sme#Team

2. Negotiation via "303 See Other" redirect
   – Request: http://biglynx.co.uk/people/matt-briggs
   – Response: "Look here:"
     http://biglynx.co.uk/people/matt-briggs.rdf
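The two approaches can be sketched without any network access. In the following Python example, which is an illustration under stated assumptions, a hash URI is resolved by stripping the fragment on the client side, and the 303 redirect is simulated with a lookup table. The `SEE_OTHER` table and the `document_uri` function are hypothetical stand-ins for a real server and client.

```python
from urllib.parse import urldefrag

# Hypothetical redirect table standing in for a server's "303 See Other" answers.
SEE_OTHER = {
    "http://biglynx.co.uk/people/matt-briggs":
        "http://biglynx.co.uk/people/matt-briggs.rdf",
}


def document_uri(thing_uri: str) -> str:
    """Return the URI of the document that describes thing_uri."""
    document, fragment = urldefrag(thing_uri)
    if fragment:
        # Hash URI: the client drops the part after # before sending the request.
        return document
    # Otherwise follow the simulated 303 redirect, if the table has one.
    return SEE_OTHER.get(thing_uri, thing_uri)
```

In both cases the URI of the real-world thing stays distinct from the URI of the document that describes it.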


Example: Big Lynx
[Figure: Big Lynx example with foaf:based_near (Dave Smith, London in DBpedia) and owl:sameAs (Matt Briggs, Matt's private website). Question: Description of Matt?]
4. Description of URIs
[Figure: RDF graph around biglynx:matt-briggs.
 Nodes: foaf:Person, dp:Birmingham, biglynx:dave-smith, dp:London, _:point, "51.509", "-0.118", and further nodes (…).
 Edge labels: rdf:type, foaf:based_near, foaf:knows, ex:loc, wgs84:lat, wgs84:long.]
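Formalization of Description
• Given an RDF graph $G = (V, P, E)$ with
  $V = R \cup B \cup L$ and $E \subseteq (R \cup B) \times P \times V$
  (R: URI resources, B: blank nodes, L: literals)

• $\mathrm{SimpleCBD}(n) = \bigcup_{j=0}^{\infty} I_j$ with

  $I_0 = \{\, (s, p, o) \mid (s, p, o) \in E \wedge s = n \,\}$

  $I_{j+1} = \{\, (o, p', o') \in E \mid \exists\, (s, p, o) \in I_j : o \in B \wedge (o, p', o') \notin \bigcup_{k=0}^{j} I_k \,\}$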
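The level-wise definition translates almost directly into a loop. The following Python sketch is one possible reading of the formula and not the author's implementation. It assumes that triples are plain string tuples and that blank nodes are written with the `_:` prefix. The example graph is illustrative and only loosely based on the Big Lynx figure.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def is_blank(node: str) -> bool:
    # Assumption of this sketch: blank nodes carry the "_:" prefix.
    return node.startswith("_:")


def simple_cbd(graph: Set[Triple], n: str) -> Set[Triple]:
    """Collect the triples about n and, transitively, about blank-node objects."""
    level = {t for t in graph if t[0] == n}  # I_0
    result = set(level)
    while level:
        # Blank nodes that occur as objects in the current level I_j.
        frontier = {o for (_, _, o) in level if is_blank(o)}
        # I_{j+1}: triples about those blank nodes that were not collected before.
        level = {t for t in graph if t[0] in frontier and t not in result}
        result |= level
    return result


example = {
    ("biglynx:matt-briggs", "rdf:type", "foaf:Person"),
    ("biglynx:matt-briggs", "foaf:knows", "biglynx:dave-smith"),
    ("biglynx:matt-briggs", "ex:loc", "_:point"),
    ("_:point", "wgs84:lat", "51.509"),
    ("_:point", "wgs84:long", "-0.118"),
    ("biglynx:dave-smith", "foaf:based_near", "dp:London"),
}
cbd = simple_cbd(example, "biglynx:matt-briggs")
```

The description follows the blank node `_:point` but stops at `biglynx:dave-smith`, because a URI resource can be looked up on its own.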

W3C RDF / RDF Schema Vocabulary
• Set of URIs defined in the rdf: and rdfs: namespaces

  rdf:                   rdfs:
  • rdf:type             • rdfs:domain
  • rdf:Property         • rdfs:range
  • rdf:XMLLiteral       • rdfs:Resource
  • rdf:List             • rdfs:Literal
  • rdf:first            • rdfs:Datatype
  • rdf:rest             • rdfs:Class
  • rdf:Seq              • rdfs:subClassOf
  • rdf:Bag              • rdfs:subPropertyOf
  • rdf:Alt              • rdfs:comment
  • rdf:value            • rdfs:label
  • …                    • …
Semantic Web Layer Cake (Simplified)




Exploration of Linked Data


[Figure: LOD cloud; labeled sources: WordNet, Swoogle, GeoNames]
< 31 Billion Triples. Source: http://lod-cloud.net
Naive Approach
• Download all data
• Store in a really big database
• Programming of queries
• Design of user interface

[Figure labels: WordNet, Swoogle, GeoNames; RDFS, Rules, Geo]
Assessment: inflexible, monolithic, not scalable
SemaPlorer Approach
• Flexible
• Extensible
• Scalable

[Figure: RDFS + Rules + Fulltext + Geo Queries over the sources WordNet, Swoogle, and GeoNames; property labels birthplace and placeOfBirth]
> 1 Billion Triples
12 months in 2005/06 → 700 million triples
SemaPlorer – Semantic Social Media




Watch video online: http://vimeo.com/2057249
Billion Triple Challenge 2008




                                                    [JWS 2009]
Searching for Linked Data Sources




       Persons that are
       – Politicians and
       – Actors
       ?

< 31 Billion Triples. Source: http://lod-cloud.net
Idea: Index of Data Sources
SELECT ?x
FROM …
WHERE {
 ?x rdf:type ex:Actor .
 ?x rdf:type ex:Politician .
}

[Figure: the query "Politician and Actor" is posed against an index of data sources]
The Naive Approach
1. Download the entire LOD cloud
2. Put it into a (really) large triple store
3. Process the data and extract schema
4. Provide lookup

– Big machinery
– Late in processing the data
– High effort to scale with LOD cloud

Idea
• Schema-level index
  – Define families of graph patterns
  – Assign instances to graph patterns
  – Map graph patterns to context (source URI)
• Construction
  – Stream-based for scalability
  – Little loss of accuracy
• Note
  – Index defined over instances
  – But stores the context
Input Data
• N-Quads
         <subject> <predicate> <object> <context>
• Example (URIs cut off at the slide edge are marked with …):
            <http://www.w3.org/People/Connolly/#me>
            <http://www.w3.org/1999/02/22-rdf-syntax-ns#…
            <http://xmlns.com/foaf/0.1/Person>
            <http://dig.csail.mit.edu/2008/webdav/timbl/…
[Figure: node w3p:#me linked to foaf:Person within the context http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf]
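To make the four-part structure concrete, here is a small Python sketch that splits one N-Quads line into subject, predicate, object, and context. It is a deliberate simplification: it only accepts lines whose four terms are URIs in angle brackets and does not handle literals or blank nodes. The names `Quad` and `parse_quad` are illustrative, and the tools shown in the talk use their own parser (NxParser).

```python
import re
from typing import NamedTuple


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    context: str


# Simplification: four URI terms in angle brackets, followed by a dot.
_QUAD = re.compile(
    r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$"
)


def parse_quad(line: str) -> Quad:
    """Parse a URI-only N-Quads line into its four components."""
    match = _QUAD.match(line)
    if match is None:
        raise ValueError(f"not a URI-only N-Quads line: {line!r}")
    return Quad(*match.groups())
```

The fourth component is what the index stores later on: it names the data source in which the statement was found.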



SchemEX Approach
• Stream-based schema extraction
• While crawling the data


[Figure: processing pipeline. Inputs: LOD-Crawler, RDF-Dump, Triple Store → Nquad-Stream → Parser (NxParser) → FIFO Instance-Cache → Schema-Extractor → Schema-Level Index (RDF, RDBMS)]
Building the Index from a Stream
• Stream of N-Quads (coming from an LD crawler)
      … Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1

[Figure: FIFO cache with slots 1 to 4; the cached instances are assigned to the classes C1, C2, and C3]

• Linear runtime complexity w.r.t. the number of input triples
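A bounded FIFO cache explains both the linear runtime and the small loss of accuracy. The Python sketch below is an assumption about how such a cache could work and not the SchemEX code. It groups quads by subject and, when the cache is full, evicts the oldest instance and passes it on. The names `stream_instances` and `emit` are hypothetical.

```python
from collections import OrderedDict
from typing import Callable, Iterable, Set, Tuple

Quad = Tuple[str, str, str, str]  # (subject, predicate, object, context)


def stream_instances(
    quads: Iterable[Quad],
    cache_size: int,
    emit: Callable[[str, Set[Quad]], None],
) -> None:
    """Group quads by subject in a bounded FIFO cache and emit evicted instances."""
    cache: "OrderedDict[str, Set[Quad]]" = OrderedDict()
    for quad in quads:
        subject = quad[0]
        if subject not in cache and len(cache) >= cache_size:
            # Cache full: the oldest instance leaves the cache and is processed.
            old_subject, old_quads = cache.popitem(last=False)
            emit(old_subject, old_quads)
        cache.setdefault(subject, set()).add(quad)
    # End of stream: flush the remaining instances in insertion order.
    while cache:
        emit(*cache.popitem(last=False))
```

Every quad is touched once with constant-time dictionary operations, so the runtime grows linearly with the input. If statements about an already evicted instance arrive later, that instance is emitted twice with partial information, which is one way to picture the loss of accuracy with small caches.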
Building the Schema and Index
[Figure: layered index structure]
• RDF classes: C1, C2, C3, …, Ck
    consistsOf
• Type clusters: TC1, TC2, …, TCm
    hasEQClass
• Equivalence classes: EQC1, EQC2, …, EQCn (connected via properties p1, p2)
    hasDataSource
• Data sources: DS1, DS2, DS3, DS4, DS5, …, DSx
Layer 1: RDF Classes
• All instances of a particular type

 SELECT ?x
 FROM …
 WHERE {
    ?x rdf:type foaf:Person .
 }

[Figure: class C1 (foaf:Person) points to the data sources DS 1, DS 2, and DS 3. Example: timbl:card#i is a foaf:Person; data sources http://dig.csail.mit.edu/2008/... and http://www.w3.org/People/Berners-Lee/card]
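As a rough illustration of this first layer, the following Python sketch builds a mapping from each RDF class to the contexts (data sources) that contain at least one instance of it. It is a simplified reading of the slide and not the SchemEX implementation. The function name `build_class_index` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Quad = Tuple[str, str, str, str]  # (subject, predicate, object, context)
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def build_class_index(quads: Iterable[Quad]) -> Dict[str, Set[str]]:
    """Map each class URI to the contexts that state an instance of it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for _subject, predicate, obj, context in quads:
        if predicate == RDF_TYPE:
            index[obj].add(context)
    return dict(index)
```

A lookup such as `index["http://xmlns.com/foaf/0.1/Person"]` then answers the query on the slide at the level of data sources rather than instances.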



Layer 2: Type Clusters
• All instances belonging to exactly the same set of types

 SELECT ?x
 FROM …
 WHERE {
    ?x rdf:type foaf:Person .
    ?x rdf:type pim:Male .
 }

[Figure: type cluster TC1 (tc4711) combines the classes C1 (foaf:Person) and C2 (pim:Male) and points to the data sources DS 1, DS 2, and DS 3. Example: timbl:card#i in http://www.w3.org/People/Berners-Lee/card]
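A type cluster can be modeled as the exact set of types of an instance. The Python sketch below is an illustration under that assumption and not the SchemEX code. It first gathers the types per instance and then maps each distinct type set to the contexts in which such instances occur. The function name `build_type_cluster_index` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Quad = Tuple[str, str, str, str]  # (subject, predicate, object, context)
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def build_type_cluster_index(quads: Iterable[Quad]) -> Dict[FrozenSet[str], Set[str]]:
    """Map each exact set of types to the contexts of the instances having it."""
    types: Dict[str, Set[str]] = defaultdict(set)     # instance -> its types
    contexts: Dict[str, Set[str]] = defaultdict(set)  # instance -> its contexts
    for subject, predicate, obj, context in quads:
        if predicate == RDF_TYPE:
            types[subject].add(obj)
            contexts[subject].add(context)
    index: Dict[FrozenSet[str], Set[str]] = defaultdict(set)
    for instance, type_set in types.items():
        # The frozenset is the type cluster key: equal sets land in one cluster.
        index[frozenset(type_set)] |= contexts[instance]
    return dict(index)
```

Unlike intersecting two class lookups, this only returns sources in which a single instance carries all requested types.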
Layer 3: Equivalence Classes
• Two instances are equivalent iff:
  – They are in the same TC
  – They have the same properties
  – The property targets are in the same TC
→ Similar to 1-Bisimulation

[Figure: classes C1, C2, C3; type clusters TC1 and TC2 connected by property p; equivalence class EQC1; data sources DS 1, DS 2, DS 3]

Layer 3: Equivalence Classes
SELECT ?x
WHERE {
   ?x rdf:type foaf:Person .
   ?x rdf:type pim:Male .
   ?x foaf:maker ?y .
   ?y rdf:type foaf:PersonalProfileDocument .
}

[Figure: equivalence class eqc0815 covers instances of type cluster tc4711 (foaf:Person, pim:Male) that have a foaf:maker link into type cluster tc1234 (foaf:PPD). Example: timbl:card#i and timbl:card in http://www.w3.org/People/Berners-Lee/card]
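The equivalence condition can be expressed as a signature per instance. The Python sketch below, which is an interpretation of the three conditions and not the SchemEX code, uses the instance's own type cluster together with the set of (property, type cluster of the target) pairs. Instances with equal signatures fall into the same class. The function name `equivalence_classes` and the prefixed-name strings are assumptions.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Triple = Tuple[str, str, str]
RDF_TYPE = "rdf:type"  # prefixed name used as a plain string in this sketch


def equivalence_classes(triples: Iterable[Triple]) -> Dict[tuple, Set[str]]:
    """Group instances by type cluster, properties, and target type clusters."""
    types: Dict[str, Set[str]] = defaultdict(set)
    links: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)
        else:
            links[s].add((p, o))

    def type_cluster(node: str) -> FrozenSet[str]:
        return frozenset(types.get(node, ()))

    classes: Dict[tuple, Set[str]] = defaultdict(set)
    for instance in set(types) | set(links):
        signature = (
            type_cluster(instance),
            frozenset((p, type_cluster(o)) for p, o in links.get(instance, ())),
        )
        classes[signature].add(instance)
    return dict(classes)
```

Only one step along each property is considered, which matches the remark that the construction is similar to 1-bisimulation.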
Computing SchemEX: TimBL Data Set
• Analysis of a smaller data set
• 11 M triples, TimBL's FOAF profile
• LDspider with ~ 2k triples / sec

• Different cache sizes: 100, 1k, 10k, 50k, 100k
• Compared SchemEX with reference schema
• Index queries on all Types, TCs, EQCs
• Good precision/recall ratio at 50k+
• Commodity hardware (4 GB RAM, single CPU)
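The comparison against a reference schema can be pictured as a set comparison per index query. The Python sketch below shows the standard precision and recall computation. It assumes that a query against the stream-based index and the same query against the reference index each return a set of data sources. The exact evaluation protocol is not given on the slide, and the function name `precision_recall` is illustrative.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Compare the result of one index query with the reference result."""
    hits = len(retrieved & relevant)
    # Convention of this sketch: an empty set counts as perfect on its side.
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall
```

For example, if the stream-based index returns two data sources that are both correct while the reference lists four, precision is 1.0 and recall is 0.5.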
Quality of Stream-based Index Construction

+ Runtime hardly increases with window size
+ Memory consumption scales with window size
Computing SchemEX: Full BTC 2011 Data




Cache size: 50 k
Billion Triple Challenge 2011




  …




                                                    [JWS 2012]
And 2012? Get the Google Feeling!




Semantic Data Management Chain
• Research topics in a greater context

      Publish → Collect → Aggregate → Use

[Figure: work placed along the chain: SchemEX*, OntoMDE, SemaPlorer*, Kreuzverweis.com, Core Ontologies, Mobile Facets]

* Winners of the Billion Triple Challenge: SchemEX (2011), SemaPlorer (2008)
    See: dws.informatik.uni-mannheim.de
Recommended Readings
• Maciej Janik, Ansgar Scherp, Steffen Staab: The Semantic Web:
  Collective Intelligence on the Web. Informatik Spektrum 34(5): 469-483
  (2011) URL: http://dx.doi.org/10.1007/s00287-011-0535-x
• Simon Schenk, Carsten Saathoff, Steffen Staab, Ansgar Scherp:
  SemaPlorer - Interactive semantic exploration of data and media based on
  a federated cloud infrastructure. J. Web Sem. 7(4): 298-304 (2009)
  URL: http://dx.doi.org/10.1016/j.websem.2009.09.006
• Mathias Konrath, Thomas Gottron, Steffen Staab, Ansgar Scherp:
  SchemEX — Efficient construction of a data catalogue by stream-based
  indexing of linked data, J. of Web Semantics: Science, Services and
  Agents on the World Wide Web, Available online 23 June 2012
  URL: http://www.sciencedirect.com/science/article/pii/S1570826812000716
• Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global
  Data Space, Morgan & Claypool Publishers, 2011
  URL: http://dx.doi.org/10.2200/S00334ED1V01Y201102WBE001




Linked open data - how to juggle with more than a billion triples

  • 1.
    How to Jugglewith more than a Billion Triples? Ansgar Scherp Research Group on Data and Web Science Universität Mannheim October 2012 Image source: http://www.flickr.com/photos/pedromourapinheiro/2122754745/ 1 Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide
  • 2.
    My thanks goto … • Marianna • Daniel Eißing • Simon Schenk • Mathias Konrath • Carsten Saathoff • Daniel Schmeiß • Thomas Franz • Anton Baumesberger • Thomas Gottron • Frederik Jochum • Steffen Staab • Alexander Kleinen • Arne Peters • Bastian Krayer And many more … Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 2
  • 3.
    Scenario • Tim plansto travel – from London – to a customer in Cologne Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 3
  • 4.
    Website of theGerman Railway It works, why bother…? Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 4
  • 5.
    Let„s Try DifferentQueries  Bottlenecks in public transportation?  Compare the connections with flights?  Visualize on a map? …  All these queries cannot be answered, because the data … Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 5
  • 6.
    … locked inSilos! – High Integration Effort – Lack in Reuse of Data Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 6 B. Jagendorf, http://www.flickr.com/photos/bobjagendorf/, CC-BY
  • 7.
    Linked Data • Publishingand interlinking of data • Different quality and purpose • From different sources in the Web World Wide Web Linked Data Documents Data Hyperlinks Typed Links HTML RDF Addresses (URIs) Addresses (URIs) Example: http://www.uni-mannheim.de/ Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 7
  • 8.
    Relevance of LinkedData? Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 8
  • 9.
    Linked Data: May„07  Sept. „11 Web 2.0 Media Publications eGovernment Cross-Domain Life Geographic Sciences Ansgar Billion–Triples < 31 Scherp ansgar@informatik.uni-mannheim.de Source: http://lod-cloud.net Slide 9
  • 10.
    Linked Data Principles 1. Identification 2. Interlinkage 3. Dereferencing 4. Description Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 10
  • 11.
    Example: Big Lynx Matt Briggs Scott Miller ? Big Lynx Company Ansgar Scherp – ansgar@informatik.uni-mannheim.de < 31 Milliarde Triple Source: http://lod-cloud.net Slide 11
  • 12.
    1. Use URIsfor Identification Matt Briggs Scott Miller http://biglynx.co.uk/ people/matt-briggs http://biglynx.co.uk/ people/scott-miller Ansgar Scherp – ansgar@informatik.uni-mannheim.de B. Gazen,http://www.flickr.com/photos/bayat/, CC-BY Slide 12
  • 13.
    Example: Big Lynx Matt Briggs Scott Miller Big Lynx Company  How to model relationships like knows? Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 13
  • 14.
    Resource DescriptionFramework (RDF) •Description of Ressources with RDF triple Matt Briggs is a Person Subject Predicate Object @prefix rdf:<http://w3.org/1999/02/22-rdf- syntax-ns#> . @prefix foaf:<http://xmlns.com/foaf/0.1/> . <http://biglynx.co.uk/people/matt-briggs> rdf:type foaf:Person . Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 14
  • 15.
    1. Use URIsalso for Relations http://biglynx.co.uk/ people/matt-briggs http://biglynx.co.uk/ people/scott-miller Ansgar Scherp – ansgar@informatik.uni-mannheim.de B. Gazen,http://www.flickr.com/photos/bayat/, CC-BY Slide 15
  • 16.
    Example: Big Lynx Dave Smith London „lives here― Matt Briggs „same Scott Miller Big Lynx … person― Company DBpedia Matt Briggs Matts private Webseite Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 16
  • 17.
    2. Establishing Interlinkage •Relation links between ressources <http://biglynx.co.uk/people/dave-smith> foaf:based_near <http://dbpedia.org/resource/London> .  Identity links between ressources <http://biglynx.co.uk/people/matt-briggs> owl:sameAs <http://www.matt-briggs.eg.uk#me> . Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 17
  • 18.
    Example: Big Lynx Dave Smith London „lives here― foaf:based_near Matt Briggs „same owl:sameAs Person― Big Lynx Company DBpedia Matt Briggs Matts private Webseite Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 18
  • 19.
    3. Dereferencing ofURIs • Looking up of web documents • How can we ―look up‖ things of the real world? http://biglynx.co.uk/ people/matt-briggs Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 19
  • 20.
    Two Approaches 1. HashURIs – URI contains a part separated by #, e.g., http://biglynx.co.uk/vocab/sme#Team 2. Negotiation via „303 See Other― request http://biglynx.co.uk/people/matt-briggs Response: „Look here:― http://biglynx.co.uk/people/matt-briggs.rdf Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 20
  • 21.
    Example: Big Lynx Dave Smith London foaf:based_near Description of Matt Briggs Matt? owl:sameAs Big Lynx Company DBpedia Matt Briggs Matts private Webseite Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 21
  • 22.
    4. Description ofURIs foaf:Person … … dp:Birmingham rdf:type foaf:based_near … biglynx:matt-briggs ex:loc _:point foaf:knows wgs84: wgs84: long biglynx:dave-smith lat ―-0.118‖ foaf:based_near ―51.509‖ dp:London … … Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 22
  • 23.
    Formalization of Description Given a RDF graph G (V , P, E ) with V R B L and E ( R B) P V ∩∞  SimpleCBD(n) = I j with j=0 I 0 = { (s, p, o) | (s, p, o) E s=n} I j+1 = { (o, p‗, o‗) E| (s, p, o) Ij : o B ∩j (o, p‗, o‗) Ik} k=0 Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 23
  • 24.
    W3C RDF /RDF Schema Vocabulary • Set of URIs defined in rdf:/rdfs: namespace • rdf:type • rdfs:domain • rdf:Property • rdfs:range • rdf:XMLLiteral • rdfs:Resource • rdf:List • rdfs:Literal • rdf:first • rdfs:Datatype • rdf:rest • rdfs:Class • rdf:Seq • rdfs:subClassOf • rdf:Bag • rdfs:subPropertyOf • rdf:Alt • rdfs:comment • ... • … • rdf:value • rdfs:label Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 24
  • 25.
    Semantic Web LayerCake (Simplified) Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 25
  • 26.
    Exploration of LinkedData Word Net Swoogle Geo Names Ansgar Scherp – ansgar@informatik.uni-mannheim.de < 31 Billion Triples Source: http://lod-cloud.net Slide 26
  • 27.
    Naive Approach • Downloadall data • Store in really big database RDFS • Programming of WordNet Rules queries Swoogle Geo • Design of user interface GeoNames Inflexible Monolithic Not Ansgar Scherp – ansgar@informatik.uni-mannheim.de scaleable Slide 27
  • 28.
    SemaPlorer Approach Flexible Extensible Scaleable birthplace placeOfBirth birthplace Geo RDFS Rules Fulltext Queries > 1 Billion Triples WordNet + + Swoogle + + GeoNames 12 Month in 2005/06 Ansgar Scherp – ansgar@informatik.uni-mannheim.de  700 Mio. Triple Slide 28
  • 29.
    SemaPlorer – SemanticSocial Media Ansgar Scherpvideo online: http://vimeo.com/2057249 Watch – ansgar@informatik.uni-mannheim.de Slide 29
  • 30.
    Billion Triple Challenge2008 [JWS 2009] Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 30
  • 31.
    Searching for LinkedData Sources ? Persons that are - Politicians and - Actors ? <Ansgar Scherp – ansgar@informatik.uni-mannheim.de 31 Milliarde Triples Quelle: http://lod-cloud.net Slide 31
  • 32.
    Idea: Index ofData Sources SELECT ?x FROM … WHERE { ?x rdf:type ex:Actor . ?x rdf:type ex:Politician . } Index ? Query “Politician and Actor” Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 32
  • 33.
    The Naive Approach 1. Download the entire LOD cloud 2. Put it into a (really) large triple store 3. Process the data and extract schema 4. Provide lookup - Big machinery - Late in processing the data - High effort to scale with LOD cloud Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide 33
  • 34.
    Idea
    • Schema-level index
      – Define families of graph patterns
      – Assign instances to graph patterns
      – Map graph patterns to context (source URI)
    • Construction
      – Stream-based for scalability
      – Little loss of accuracy
    • Note
      – Index defined over instances
      – But stores the context
  • 35.
    Input Data • n-Quads: <subject> <predicate> <object> <context> • Example: <http://www.w3.org/People/Connolly/#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns# <http://xmlns.com/foaf/0.1/Person> <http://dig.csail.mit.edu/2008/webdav/timbl/ • Figure labels: http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf, w3p:#me, foaf:Person
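    To make the four-part input format concrete, here is a minimal Python sketch of reading one n-quad line. It is deliberately simplified (one regular expression, no escape decoding, context required) and is not the NxParser used in the talk; `Quad` and `parse_nquad` are names invented for this example.

```python
import re
from typing import NamedTuple, Optional


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    context: str


# One term: an IRI in angle brackets, a blank node, or a quoted literal with
# an optional language tag or datatype. Simplified on purpose.
_TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
_QUAD = re.compile(r'^\s*' + r'\s+'.join([_TERM] * 4) + r'\s*\.\s*$')


def parse_nquad(line: str) -> Optional[Quad]:
    """Return the four terms of an n-quad line, or None if it does not match."""
    match = _QUAD.match(line)
    if match is None:
        return None
    # Strip the angle brackets from IRIs; keep literals and blank nodes as written.
    return Quad(*(t[1:-1] if t.startswith("<") else t for t in match.groups()))
```

    The fourth term, the context, is what the index later returns as the data source.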
  • 36.
    SchemEX Approach • Stream-based schema extraction • While crawling the data • Figure components: LOD-Crawler, RDF-Dump, Parser (NxParser), Nquad-Stream, FIFO Instance-Cache, Schema-Extractor, Schema-Level Index, RDF Triple Store, RDBMS
  • 37.
    Building the Index from a Stream • Stream of n-quads (coming from an LD crawler): … Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1 • [Figure: FIFO cache with instances 1–6 and classes C1–C3] • Linear runtime complexity w.r.t. the number of input triples
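    The windowed construction can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the SchemEX implementation: it only looks at rdf:type statements, keys the index by the set of types, and evicts the oldest instance when the cache is full. The function name `build_index` and the tuple layout are invented for the example.

```python
from collections import OrderedDict, defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def build_index(quads, cache_size):
    """Map each observed set of types to the contexts where it was seen.

    quads: iterable of (subject, predicate, object, context) tuples.
    Each quad is touched once, so runtime is linear in the input size.
    """
    if cache_size < 1:
        raise ValueError("cache_size must be at least 1")
    cache = OrderedDict()      # subject -> (set of types, set of contexts)
    index = defaultdict(set)   # frozenset of types -> set of contexts

    def flush(types, contexts):
        if types:
            index[frozenset(types)] |= contexts

    for s, p, o, c in quads:
        if s not in cache:
            if len(cache) >= cache_size:
                # FIFO eviction: the oldest instance is written to the index.
                _, (types, contexts) = cache.popitem(last=False)
                flush(types, contexts)
            cache[s] = (set(), set())
        types, contexts = cache[s]
        contexts.add(c)
        if p == RDF_TYPE:
            types.add(o)

    for types, contexts in cache.values():
        flush(types, contexts)
    return dict(index)
```

    If statements about an instance arrive after it has been evicted, the instance is indexed twice with partial type sets. That is where a small window loses accuracy.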
  • 38.
    Building the Schema and Index
    • RDF classes: C1, C2, C3, …, Ck
      – consistsOf
    • Type clusters: TC1, TC2, …, TCm
      – hasEQClass
    • Equivalence classes: EQC1, EQC2, …, EQCn (linked by properties p1, p2)
      – hasDataSource
    • Data sources: DS1, DS2, DS3, DS4, DS5, …, DSx
  • 39.
    Layer 1: RDF Classes • All instances of a particular type • SELECT ?x FROM … WHERE { ?x rdf:type foaf:Person . } • Figure labels: C1, foaf:Person; DS 1, DS 2, DS 3; http://dig.csail.mit.edu/2008/...; timbl:card#i, foaf:Person; http://www.w3.org/People/Berners-Lee/card
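    As a rough illustration of this layer, the following Python sketch maps each RDF class to the data sources (contexts) that contain an instance of it. The function name `class_index` and the tuple layout are assumptions made for the example.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def class_index(quads):
    """Map each class URI to the set of contexts holding an instance of it.

    quads: iterable of (subject, predicate, object, context) tuples.
    """
    index = defaultdict(set)
    for _subject, predicate, obj, context in quads:
        if predicate == RDF_TYPE:
            index[obj].add(context)
    return dict(index)
```

    Answering the query above then amounts to one dictionary lookup for foaf:Person.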
  • 40.
    Layer 2: Type Clusters • All instances belonging to exactly the same set of types • SELECT ?x FROM … WHERE { ?x rdf:type foaf:Person . ?x rdf:type pim:Male . } • Figure labels: C1, C2 (foaf:Person, pim:Male); TC1 (tc4711); DS 1, DS 2, DS 3; timbl:card#i; http://www.w3.org/People/Berners-Lee/card
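    A type cluster can be represented by the exact set of types of an instance. The Python sketch below groups instances that way and records their data sources; it reads the whole input before grouping, unlike the stream-based construction, and `type_cluster_index` is a name invented for the example.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def type_cluster_index(quads):
    """Map each exact set of types to the contexts of its instances.

    quads: iterable of (subject, predicate, object, context) tuples.
    """
    types = defaultdict(set)    # subject -> its types
    sources = defaultdict(set)  # subject -> contexts of its type statements
    for subject, predicate, obj, context in quads:
        if predicate == RDF_TYPE:
            types[subject].add(obj)
            sources[subject].add(context)

    index = defaultdict(set)
    for subject, subject_types in types.items():
        index[frozenset(subject_types)] |= sources[subject]
    return dict(index)
```

    Because the key is the full set, an instance typed only as foaf:Person falls into a different cluster than one typed as both foaf:Person and pim:Male.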
  • 41.
    Layer 3: Equivalence Classes
    • Two instances are equivalent iff:
      – They are in the same TC
      – They have the same properties
      – The property targets are in the same TC
    • Similar to 1-Bisimulation
    • Figure labels: C1, C2, C3; TC1, TC2; p; EQC1; DS 1, DS 2, DS 3
  • 42.
    Layer 3: Equivalence Classes • SELECT ?x WHERE { ?x rdf:type foaf:Person . ?x rdf:type pim:Male . ?x foaf:maker ?y . ?y rdf:type foaf:PersonalProfileDocument . } • Figure labels: tc4711 (foaf:Person, pim:Male); tc1234 (foaf:PPD); eqc0815 (tc4711 -maker- tc1234); timbl:card#i foaf:maker timbl:card; http://www.w3.org/People/Berners-Lee/card
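    The equivalence criterion can be sketched in Python by giving every instance a signature: its own type cluster plus, for each outgoing property, the type cluster of the target. Instances with equal signatures fall into the same equivalence class. This is an illustrative batch version over plain triples that leaves out the data sources; the name `equivalence_classes` and the prefixed strings in the usage are made up.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def equivalence_classes(triples):
    """Group instances by (own type cluster, {(property, target type cluster)})."""
    types = defaultdict(set)  # instance -> its types
    links = defaultdict(set)  # instance -> (property, target) pairs
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            types[subject].add(obj)
        else:
            links[subject].add((predicate, obj))

    def type_cluster(node):
        return frozenset(types.get(node, ()))

    classes = defaultdict(set)
    for instance in set(types) | set(links):
        signature = (
            type_cluster(instance),
            frozenset((p, type_cluster(o)) for p, o in links.get(instance, ())),
        )
        classes[signature].add(instance)
    return dict(classes)
```

    Two persons who each link via foaf:maker to a document of the same type cluster end up together, while a person without such a link forms a separate class.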
  • 43.
    Computing SchemEX: TimBL Data Set • Analysis of a smaller data set • 11 M triples, TimBL's FOAF profile • LDspider with ~ 2k triples / sec • Different cache sizes: 100, 1k, 10k, 50k, 100k • Compared SchemEX with reference schema • Index queries on all Types, TCs, EQCs • Good precision/recall ratio at 50k+ • Commodity hardware (4GB RAM, single CPU)
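    For one index query, precision and recall against a reference schema can be computed as in the Python sketch below. The function name and the convention of returning 1.0 for an empty set are assumptions for illustration; the talk does not state how the evaluation was implemented.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of retrieved data sources against a reference set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Assumed convention: an empty set on either side counts as a perfect score.
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall
```

    Here `retrieved` would be the data sources returned by the stream-built index for a type, type cluster, or equivalence class, and `relevant` the sources listed in the reference schema.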
  • 44.
    Quality of Stream-based Index Construction + Runtime hardly increases with window size + Memory consumption scales with window size
  • 45.
    Computing SchemEX: Full BTC 2011 Data • Cache size: 50k
  • 46.
    Billion Triple Challenge 2011 … [JWS 2012]
  • 47.
    And 2012? Get the Google Feeling!
  • 48.
    Semantic Data Management Chain • Research topics in a greater context • Chain: Publish, Collect, Aggregate, Use • Topics: SchemEX*, OntoMDE, SemaPlorer*, Kreuzverweis.com, Core Ontologies, Mobile Facets • * Winner of Billion Triple Challenge 2011/2008 • See: dws.informatik.uni-mannheim.de
  • 50.
    Recommended Readings
    • Maciej Janik, Ansgar Scherp, Steffen Staab: The Semantic Web: Collective Intelligence on the Web. Informatik Spektrum 34(5): 469-483 (2011). URL: http://dx.doi.org/10.1007/s00287-011-0535-x
    • Simon Schenk, Carsten Saathoff, Steffen Staab, Ansgar Scherp: SemaPlorer - Interactive semantic exploration of data and media based on a federated cloud infrastructure. J. Web Sem. 7(4): 298-304 (2009). URL: http://dx.doi.org/10.1016/j.websem.2009.09.006
    • Mathias Konrath, Thomas Gottron, Steffen Staab, Ansgar Scherp: SchemEX - Efficient construction of a data catalogue by stream-based indexing of linked data. J. of Web Semantics: Science, Services and Agents on the World Wide Web, available online 23 June 2012. URL: http://www.sciencedirect.com/science/article/pii/S1570826812000716
    • Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool Publishers, 2011. URL: http://dx.doi.org/10.2200/S00334ED1V01Y201102WBE001