SchemEX
Creating the Yellow Pages of the LOD Cloud

Mathias Konrath, Thomas Gottron, Ansgar Scherp
Scenario
• People who are politicians and actors




• Who else?
• Where do they live?
• Whom do they know?
SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp   2 of 12
Problem
• Execute those queries on the LOD cloud
• No single federated query interface provided




       “politicians
       and actors”

SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp   3 of 12
Principle Solution
• Suitable index structure for looking up sources




       “politicians
       and actors”

SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp   4 of 12
The Naive Approach
1.     Download the entire LOD cloud
2.     Put it into a (really) large triple store
3.     Process the data and extract schema
4.     Provide lookup

- Big machinery
- Late in processing the data
- High effort to scale with LOD cloud



SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp   5 of 12
Yes, we can …



SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp   6 of 12
The SchemEX Approach
• Stream-based schema extraction
• While crawling the data


                                          FIFO
LOD-Crawler                                                Instance-
 RDF-Dump                                                    Cache
                                                                        RDF
 Triple Store                                                          RDBMS
                              NxParser

    Nquad-                                                 Schema-
                                Parser                                 Schema
    Stream                                                 Extractor

SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp   7 of 12
Efficient Instance Cache
• Observe a quadruple stream from LD spider




• Ring queue, backed up by a hash map
• Organizes triples with same subject URI
• Dismiss oldest, when cache full (FIFO)
→ Runtime complexity O(1)
SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp   8 of 12
Building the Schema and Index
                                                                              RDF
       c1               c2               c3                …         ck
                                                                             classes
                                         consistsOf
                                                                              Type
        TC1                     TC2                        …         TCm     clusters
hasEQ
Class                 p1                            p2
       EQC1                   EQC2                         … EQCn          Equivalence
                                                                             classes
                                           hasDataSource

                                                           …                 Data
  DS1 DS2 DS3 DS4 DS5                                            DSx        sources
SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp   9 of 12
Computing SchemEX: TimBL Data Set
• Analysis of a smaller data set
• 11 M triples, TimBL’s FOAF profile
• LDspider with ~ 2k triples / sec


•   Different cache sizes: 100, 1k, 10k, 50k, 100k
•   Compared SchemEX with reference schema
•   Index queries on all Types, TCs, EQCs
•   Good precision/recall ratio at 50k+

SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp   10 of 12
Computing SchemEX: Full BTC 2011 Data




Cache size: 50 k
SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp   11 of 12
Conclusions: SchemEX
• Stream-based approach to schema extraction
• Scalable to arbitrary amount of Linked Data
• Applicable on commodity hardware
  (4GB RAM, standard single CPU)




• Lookup-index to find relevant data sources
• Support federated queries on the LOD cloud
SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp   12 of 12
BACKUP




SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp   13 of 12
SchemEX Computation: Window Sizes
                                      Runtime increases hardly with
                                          greater window sizes




 Crawled TimBL dataset                                     Memory consumption scales
  (11M triples)                                                with window size


SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp   14 of 12
SchemEX Quality: Precision




SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp   15 of 12
SchemEX Quality: Recall




SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp   16 of 12
Example Data Graph




SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp   17 of 12
Output Vocabulary: voiD




SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp   18 of 12
SchemEX Extraction: Progress Plot

                  Type-cluster
                  Equivalence classes
 Count




                                 ##processed instances
                                        processed 12           instances
SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp 19 of

SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud

  • 1.
    SchemEX Creating the YellowPages of the LOD Cloud Mathias Konrath, Thomas Gottron, Ansgar Scherp
  • 2.
    Scenario • People whoare politicians and actors • Who else? • Where do they live? • Whom do they know? SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp 2 of 12
  • 3.
    Problem • Execute thosequeries on the LOD cloud • No single federated query interface provided “politicians and actors” SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp 3 of 12
  • 4.
    Principle Solution • Suitableindex structure for looking up sources “politicians and actors” SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp 4 of 12
  • 5.
    The Naive Approach 1. Download the entire LOD cloud 2. Put it into a (really) large triple store 3. Process the data and extract schema 4. Provide lookup - Big machinery - Late in processing the data - High effort to scale with LOD cloud SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp 5 of 12
  • 6.
    Yes, we can… SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp 6 of 12
  • 7.
    The SchemEX Approach •Stream-based schema extraction • While crawling the data FIFO LOD-Crawler Instance- RDF-Dump Cache RDF Triple Store RDBMS NxParser Nquad- Schema- Parser Schema Stream Extractor SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp 7 of 12
  • 8.
    Efficient Instance Cache •Observe a quadruple stream from LD spider • Ring queue, backed up by a hash map • Organizes triples with same subject URI • Dismiss oldest, when cache full (FIFO) → Runtime complexity O(1) SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp 8 of 12
  • 9.
    Building the Schemaand Index RDF c1 c2 c3 … ck classes consistsOf Type TC1 TC2 … TCm clusters hasEQ Class p1 p2 EQC1 EQC2 … EQCn Equivalence classes hasDataSource … Data DS1 DS2 DS3 DS4 DS5 DSx sources SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp 9 of 12
  • 10.
    Computing SchemEX: TimBLData Set • Analysis of a smaller data set • 11 M triples, TimBL’s FOAF profile • LDspider with ~ 2k triples / sec • Different cache sizes: 100, 1k, 10k, 50k, 100k • Compared SchemEX with reference schema • Index queries on all Types, TCs, EQCs • Good precision/recall ratio at 50k+ SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp 10 of 12
  • 11.
    Computing SchemEX: FullBTC 2011 Data Cache size: 50 k SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp 11 of 12
  • 12.
    Conclusions: SchemEX • Stream-basedapproach to schema extraction • Scalable to arbitrary amount of Linked Data • Applicable on commodity hardware (4GB RAM, standard single CPU) • Lookup-index to find relevant data sources • Support federated queries on the LOD cloud SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp 12 of 12
  • 13.
    BACKUP SchemEX – MathiasKonrath, Thomas Gottron, Ansgar Scherp 13 of 12
  • 14.
    SchemEX Computation: WindowSizes Runtime increases hardly with greater window sizes Crawled TimBL dataset Memory consumption scales (11M triples) with window size SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp 14 of 12
  • 15.
    SchemEX Quality: Precision SchemEX– Mathias Konrath, Thomas Gottron, Ansgar Scherp 15 of 12
  • 16.
    SchemEX Quality: Recall SchemEX– Mathias Konrath, Thomas Gottron, Ansgar Scherp 16 of 12
  • 17.
    Example Data Graph SchemEX– Mathias Konrath, Thomas Gottron, Ansgar Scherp 17 of 12
  • 18.
    Output Vocabulary: voiD SchemEX– Mathias Konrath, Thomas Gottron, Ansgar Scherp 18 of 12
  • 19.
    SchemEX Extraction: ProgressPlot Type-cluster Equivalence classes Count ##processed instances processed 12 instances SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp 19 of