SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud

Slides for the SchemEX presentation at the Billion Triple Challenge 2011.

Please download original file to enjoy all animations.

Published in: Technology


  1. SchemEX - Creating the Yellow Pages of the LOD Cloud
     Mathias Konrath, Thomas Gottron, Ansgar Scherp
  2. Scenario
     • People who are politicians and actors
     • Who else?
     • Where do they live?
     • Whom do they know?
  3. Problem
     • Execute those queries on the LOD cloud
     • No single federated query interface provided
     (Figure: example query “politicians and actors”)
  4. Principle Solution
     • Suitable index structure for looking up sources
     (Figure: example query “politicians and actors”)
  5. The Naive Approach
     1. Download the entire LOD cloud
     2. Put it into a (really) large triple store
     3. Process the data and extract schema
     4. Provide lookup
     - Big machinery
     - Late in processing the data
     - High effort to scale with LOD cloud
  6. Yes, we can …
  7. The SchemEX Approach
     • Stream-based schema extraction
     • While crawling the data
     (Figure: pipeline diagram with the labels LOD-Crawler, RDF-Dump, NxParser, Nquad-Stream, Parser, FIFO Instance-Cache, Schema-Extractor, Schema, RDF, Triple Store, RDBMS)
  8. Efficient Instance Cache
     • Observe a quadruple stream from LD spider
     • Ring queue, backed up by a hash map
     • Organizes triples with same subject URI
     • Dismiss oldest, when cache full (FIFO)
     → Runtime complexity O(1)
  9. Building the Schema and Index
     (Figure: layered index structure. RDF classes c1, c2, c3, …, ck; type clusters TC1, TC2, …, TCm, linked to classes via consistsOf; equivalence classes EQC1, EQC2, …, EQCn, linked via hasEQClass, with properties p1, p2; data sources DS1, DS2, …, DSx, linked via hasDataSource)
  10. Computing SchemEX: TimBL Data Set
     • Analysis of a smaller data set
     • 11M triples, TimBL’s FOAF profile
     • LDspider with ~2k triples/sec
     • Different cache sizes: 100, 1k, 10k, 50k, 100k
     • Compared SchemEX with reference schema
     • Index queries on all Types, TCs, EQCs
     • Good precision/recall ratio at 50k+
  11. Computing SchemEX: Full BTC 2011 Data
     Cache size: 50k
  12. Conclusions: SchemEX
     • Stream-based approach to schema extraction
     • Scalable to arbitrary amount of Linked Data
     • Applicable on commodity hardware (4GB RAM, standard single CPU)
     • Lookup-index to find relevant data sources
     • Support federated queries on the LOD cloud
  13. BACKUP
  14. SchemEX Computation: Window Sizes
     • Runtime hardly increases with greater window sizes
     • Memory consumption scales with window size
     (Figure: measurements on the crawled TimBL dataset, 11M triples)
  15. SchemEX Quality: Precision
  16. SchemEX Quality: Recall
  17. Example Data Graph
  18. Output Vocabulary: voiD
  19. SchemEX Extraction: Progress Plot
     (Figure: count of type clusters and equivalence classes plotted against the number of processed instances)
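
The FIFO instance cache (slide 8) and the type-cluster layer of the index (slide 9) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `InstanceCache` and `build_type_cluster_index` are hypothetical, `OrderedDict` (a hash map with insertion ordering) stands in for the ring queue backed by a hash map, every context URI of an instance is treated as one of its data sources, and equivalence classes are left out.

```python
from collections import OrderedDict, defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class InstanceCache:
    """Bounded FIFO cache that groups quads by subject URI (assumed design)."""

    def __init__(self, capacity):
        self.capacity = capacity
        # subject URI -> list of (predicate, object, context) statements,
        # kept in the order in which subjects were first seen
        self.instances = OrderedDict()

    def add(self, subject, predicate, obj, context):
        """Store one quad; return the evicted (subject, statements) pair or None."""
        evicted = None
        if subject not in self.instances:
            if len(self.instances) >= self.capacity:
                # Cache full: dismiss the oldest instance (FIFO), O(1)
                evicted = self.instances.popitem(last=False)
            self.instances[subject] = []
        self.instances[subject].append((predicate, obj, context))
        return evicted

    def flush(self):
        """Yield all remaining instances, oldest first, once the stream ends."""
        while self.instances:
            yield self.instances.popitem(last=False)


def build_type_cluster_index(quads, capacity):
    """Map each type cluster (set of rdf:type objects) to its data sources."""
    index = defaultdict(set)
    cache = InstanceCache(capacity)

    def record(statements):
        types = frozenset(o for p, o, _ in statements if p == RDF_TYPE)
        if types:
            # Assumption: every context of the instance counts as a data source
            index[types].update(c for _, _, c in statements)

    for s, p, o, c in quads:
        evicted = cache.add(s, p, o, c)
        if evicted is not None:
            record(evicted[1])
    for _, statements in cache.flush():
        record(statements)
    return dict(index)
```

In this sketch, an instance whose statements arrive further apart in the stream than the cache can bridge is evicted early and recorded as several partial type clusters. That is one plausible reading of why the slides report a good precision/recall ratio only from a cache size of about 50k upwards.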
