Friday talk 11.02.2011

MiniViva presentation as part of the DERI Friday talk events

Published in: Technology, Education

  1. Querying Live Linked Data, by Jürgen Umbrich. Mini Viva presentation (11.02.2011)
  2. Querying in the Linked Data space
     • millions of diverse but often interrelated data sources
     • “data everywhere” on the Web
     • no complete control over the data
     • [Diagram: query processor (QP) between static access (crawl, index; Yars2, Virtuoso) and dynamic access (live distributed querying)]
  3. Linked Data is Dynamic
     • Dataset: Web data (’08–’09)
     • 24 weekly snapshots
     • 4-hop neighborhood from Tim Berners-Lee's FOAF file
     • 550K RDF/XML docs, 3.3M unique entities [Umbrich et al. 2010]
     • Findings (entity level): 68% / 32% (static / dynamic)
     • Change frequency: 52% (<1 week), 24% (>1 week, <=1 month), 10% (>1 month, <=3 months), 14% (>3 months, <=6 months)
  4. Accessing Linked Data
     • Use URIs for things
     • Use HTTP URIs so that people can look them up
     • Provide useful information, using standards (RDF, SPARQL)
     • Include links to other URIs
     • Direct correspondence between thing-URI and source-URI: an HTTP GET on http://umbrich.net/foaf.rdf returns the RDF/XML document describing http://umbrich.net/foaf.rdf#me (#me), which links via foaf:based_near to http://dbpedia.org/resource/Galway
  5. Accessing Linked Data
     • Direct correspondence between thing-URI and source-URI: an HTTP GET on http://umbrich.net/foaf.rdf returns the RDF/XML document describing http://umbrich.net/foaf.rdf#me (#me)
     • Re-direct correspondence between thing-URI and source-URI: an HTTP GET on http://dbpedia.org/resource/Galway is redirected to http://dbpedia.org/data/Galway (RDF/XML) or http://dbpedia.org/page/Galway (HTML)
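The two correspondences on slides 4 and 5 can be sketched as a small lookup helper. This is a minimal illustration, not the system from the talk: the function name `source_uri` and the `SIMULATED_REDIRECTS` table are assumptions, and the table stands in for the redirect a real HTTP GET would follow so that the sketch runs offline.

```python
from urllib.parse import urldefrag

# Assumption: a hard-coded stand-in for the redirect a real HTTP GET would
# follow, so that the sketch runs without network access.
SIMULATED_REDIRECTS = {
    "http://dbpedia.org/resource/Galway": "http://dbpedia.org/data/Galway",
}


def source_uri(thing_uri: str) -> str:
    """Return the URI of the document expected to describe thing_uri."""
    document, fragment = urldefrag(thing_uri)
    if fragment:
        # Direct correspondence: dropping the fragment yields the source.
        return document
    # Re-direct correspondence: the server points to the source document.
    return SIMULATED_REDIRECTS.get(thing_uri, thing_uri)


print(source_uri("http://umbrich.net/foaf.rdf#me"))
print(source_uri("http://dbpedia.org/resource/Galway"))
```

In a live setting the second branch would issue the HTTP GET and read the redirect target from the response instead of consulting a table.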
  6. The Problem
     • Example query: SELECT ?friendLabel WHERE { juum:me foaf:knows ?f . ?f foaf:name ?friendLabel . }
     • What are the query-relevant sources?
     • [Diagram: the triple patterns juum:me foaf:knows ?f . and ?f foaf:name ?friendLabel . matched against the sources umbrich.net/foaf.rdf, polleres.net/foaf.rdf and sw.deri.org/~aidanh/]
  7. Source Selection Approaches
     • Index: Quad Store (e.g. Yars2)
     • [Diagram: the triple patterns juum:me foaf:knows ?f . and ?f foaf:name ?friendLabel . with the results “Aidan Hogan” and “Axel Polleres”, shown for the quad store and for HTTP GET lookups]
  8. Source Selection Approaches
     • Quad Store (e.g. Yars2)
     • Direct execution / graph traversal [Hartig et al. 2009]
     • [Diagram: the triple patterns juum:me foaf:knows ?f . and ?f foaf:name ?friendLabel . with the results “Aidan Hogan” and “Axel Polleres”, shown for the quad store and for HTTP GET lookups during traversal]
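Direct execution by graph traversal can be illustrated with a toy, in-memory stand-in for the Web. The sketch below is not the algorithm of [Hartig et al. 2009]; it only shows the idea of dereferencing URIs as bindings are discovered. The `WEB` dictionary, the `#me` fragment identifiers for the two friends, and the helper names are assumptions made for illustration.

```python
from urllib.parse import urldefrag

# Toy stand-in for the Web: source URI -> triples in that document.
# The friends' "#me" identifiers are hypothetical.
WEB = {
    "http://umbrich.net/foaf.rdf": [
        ("http://umbrich.net/foaf.rdf#me", "foaf:knows",
         "http://polleres.net/foaf.rdf#me"),
        ("http://umbrich.net/foaf.rdf#me", "foaf:knows",
         "http://sw.deri.org/~aidanh/#me"),
    ],
    "http://polleres.net/foaf.rdf": [
        ("http://polleres.net/foaf.rdf#me", "foaf:name", "Axel Polleres"),
    ],
    "http://sw.deri.org/~aidanh/": [
        ("http://sw.deri.org/~aidanh/#me", "foaf:name", "Aidan Hogan"),
    ],
}


def http_get(uri):
    """Simulated lookup: return the triples of the document behind uri."""
    document, _fragment = urldefrag(uri)
    return WEB.get(document, [])


def friend_names(person):
    """Evaluate { person foaf:knows ?f . ?f foaf:name ?friendLabel }."""
    lookups = 1  # dereference the known subject to find bindings for ?f
    friends = [o for s, p, o in http_get(person)
               if s == person and p == "foaf:knows"]
    names = []
    for friend in friends:
        lookups += 1  # dereference each binding of ?f to find ?friendLabel
        names += [o for s, p, o in http_get(friend)
                  if s == friend and p == "foaf:name"]
    return sorted(names), lookups


print(friend_names("http://umbrich.net/foaf.rdf#me"))
```

The lookup counter makes the cost visible: every new binding triggers another HTTP GET, which is the overhead that index-based source selection tries to avoid.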
  9. Source Selection Approaches
     • Schema-Level Indices [Stuckenschmidt et al. 2004]
     • Data Summaries [Umbrich et al. 2010]
     • Inverted Indices [Heflin et al. 2010] (e.g. Sindice V1.0)
     • Quad Store (e.g. Yars2)
     • Direct execution / graph traversal [Hartig et al. 2009]
     • [Diagram: approaches compared by index size, query time and results (recall, freshness) of the query system]
  10. Approximate Data Summaries
      • Combined description of schema level and instance level
      • Use approximation to reduce index size (incurs false positives)
      • Index growth only with the number of sources
      • Multidimensional numerical dataspace
      • [Figure: hash-based data summaries; axes s and o, each ranging from 1 to 30]
  11. Hash-based Data Summaries
      • Input: triple + source information, e.g. juum:me foaf:knows ah:ah <http…foaf.rdf>
      • Hash triples: [24, 5, 2] <http…foaf.rdf>
      • Insert hash-triple into dataspace and store source information with buckets: INS([24, 5, 2], http…foaf.rdf)
      • Query for relevant sources: QUERY(juum:me ?p ?o) -> (24, ?, ?)
      • [Figure: equi-width histogram; axes s and o, each ranging from 1 to 30 with ticks at 10 and 20]
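The insert and query steps of slide 11 can be sketched as follows. This is a minimal sketch under stated assumptions: CRC32 is an arbitrary choice of hash function, so the coordinates will differ from the illustrative [24, 5, 2] on the slide; the bucket width of 10 is read off the axis ticks; and the class and function names are hypothetical.

```python
import zlib
from collections import defaultdict
from itertools import product

SPACE = 30  # coordinates fall into 1..30, as on the slide's axes
WIDTH = 10  # equi-width buckets: 1-10, 11-20, 21-30 (assumed from the ticks)


def h(term):
    """Map an RDF term to a coordinate in 1..SPACE (CRC32 is an assumption)."""
    return zlib.crc32(term.encode("utf-8")) % SPACE + 1


def bucket(coord):
    """Return the index of the equi-width bucket containing coord."""
    return (coord - 1) // WIDTH


class HistogramSummary:
    def __init__(self):
        self.buckets = defaultdict(set)  # bucket coordinates -> sources

    def insert(self, triple, source):
        """INS: hash the triple and store the source with its bucket."""
        key = tuple(bucket(h(term)) for term in triple)
        self.buckets[key].add(source)

    def query(self, pattern):
        """QUERY: pattern is (s, p, o) with None for variables."""
        n = SPACE // WIDTH
        axes = [range(n) if term is None else [bucket(h(term))]
                for term in pattern]
        sources = set()
        for key in product(*axes):
            sources |= self.buckets.get(key, set())
        return sources  # may contain false positives, never false negatives


summary = HistogramSummary()
summary.insert(("juum:me", "foaf:knows", "ah:ah"), "http…foaf.rdf")
print(summary.query(("juum:me", None, None)))
```

Because several terms share a bucket, a query can return sources that do not actually contain a matching triple. That is the false-positive cost slide 10 accepts in exchange for an index that grows only with the number of sources.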
  12. QTree: Efficient source selection
      • Combination of histograms and R-tree, inheriting the benefits of both data structures
        • optimal for sparse data
      • Buckets store cardinality and set of sources => top-k source ranking, e.g. R 1,1 (1: { http://…/foaf.rdf })
      • [Figure: equi-width histogram compared with QTree; axes s and o, each ranging from 1 to 30 with ticks at 10 and 20]
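The top-k ranking that becomes possible once buckets carry a cardinality and a set of sources can be sketched as below. This is not a QTree implementation: it only covers the ranking step over buckets already matched by a query, and the scoring rule (splitting a bucket's cardinality evenly among its sources) is an assumption, as are the example.org sources and the function name `rank_sources`.

```python
from collections import Counter


def rank_sources(matching_buckets, k):
    """Rank sources from (cardinality, sources) pairs and return the top k."""
    scores = Counter()
    for cardinality, sources in matching_buckets:
        for source in sources:
            # Assumed scoring rule: share the bucket's count evenly.
            scores[source] += cardinality / len(sources)
    # Sort by descending score, then by name for a deterministic order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [source for source, _score in ranked[:k]]


buckets = [
    (1, {"http://…/foaf.rdf"}),  # the bucket shown on the slide
    (4, {"http://example.org/a.rdf", "http://example.org/b.rdf"}),
    (3, {"http://example.org/a.rdf"}),
]
print(rank_sources(buckets, 2))
```

With a ranking like this, a query processor can stop after the k most promising sources instead of dereferencing every candidate.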
  13. Evaluation: Source Selection
      • J. Umbrich, K. Hose, M. Karnstedt, A. Harth, A. Polleres. "Comparing Data Summaries for Processing Live Queries over Linked Data". In WWW Journal, Special Issue "Querying the Data Web", 2011
  14. Source Selection Approaches
      • Schema-Level Indices [Stuckenschmidt et al. 2004]
      • Data Summaries [Umbrich et al. 2010]
      • Inverted Indices [Heflin et al. 2010] (e.g. Sindice V1.0)
      • Quad Store (e.g. Yars2)
      • Direct execution / graph traversal [Hartig et al. 2009]
      • [Diagram: approaches compared by index size, query time and results (recall, freshness) of the query system]
  15. Querying in the Linked Data Space
      • millions of diverse but often interrelated data sources
      • “data everywhere” on the Web
      • no complete control over the data
      • Combined query of RDF stores and the Linked Data Web
      • [Diagram: query processor (QP) between static access (crawl, MAT, index) and dynamic access (live distributed querying)]
  16. Improved Query Time & Fresh Results
      • Combined querying: decrease query time by avoiding unnecessary HTTP lookups and still returning fresh results
      • [Plot: query time against number of query executions; curves for live querying, index querying and combined querying (learning about source dynamics)]
  17. Current Research Question: how to query RDF stores and the Linked Data Web in combination?
  18. Combined Query Processing
      • Live results on top of SPARQL stores
      • Decide (at query time) whether to access the static store or the Web resources, by integrating knowledge about the dynamics of sources into the query processor
      • [Diagram: a query enters the Query Processor, which uses Source Selection and Dynamics to combine the SPARQL index (Yars2, Virtuoso) with the Linked Data Web and return live results]
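The query-time decision between the static store and the live Web can be sketched as a simple rule over per-source dynamics. Everything here is an assumption for illustration: the talk presents this as an open research question, so the staleness ratio, the threshold, the `SourceInfo` fields and the example.org sources are hypothetical rather than the author's method.

```python
from dataclasses import dataclass


@dataclass
class SourceInfo:
    uri: str
    days_since_crawl: float
    # Assumed output of the dynamics mining step; must be positive.
    est_change_interval_days: float


def plan_access(sources, staleness_threshold=1.0):
    """Split sources into those read from the store and those fetched live."""
    from_store, live = [], []
    for src in sources:
        # A ratio of 1.0 or more means a change is expected since the crawl.
        staleness = src.days_since_crawl / src.est_change_interval_days
        if staleness >= staleness_threshold:
            live.append(src.uri)
        else:
            from_store.append(src.uri)
    return from_store, live


sources = [
    SourceInfo("http://example.org/city.rdf", 7, 180),   # rarely changes
    SourceInfo("http://example.org/sensor.rdf", 7, 1),   # changes daily
]
from_store, live = plan_access(sources)
print(from_store, live)
```

Only the source expected to have changed costs an HTTP lookup, which is the trade-off between query time and freshness that slide 16 describes.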
  19. Mining Dynamic/Static Patterns
      • Goal
        • acquire knowledge about dynamic patterns (e.g. geo:lat, geo:long)
        • consider the context of a node (e.g. the location value of a city vs. the location value of a GPS sensor)
      • Based on two datasets (started in March 2010)
        • Daily 3-hop neighborhood crawls from 20 seed URIs
        • Weekly snapshots over ~10 months: 10% sampling from a billion-triple crawl (fixed URI list, contains ~2K web vocabularies)
      • Learn to predict change events
  20. Query Processor
      • Collaboration with Yuan (APEXLAB)
      • Elaboration on how dynamic query planning can support data access decisions, taking dynamic patterns into account
      • Investigation of one of the possible approaches
  21. Evaluation
      • Based on simulation
        • using our dynamic mining dataset
      • Based on real-world data
        • Linked Stream Data effort
        • using the gathered knowledge from our dynamic mining
      • Evaluation criteria
        • Query time (number of HTTP lookups)
        • Result freshness
        • Recall (number of results)
  22. Questions?
      • How to query RDF stores and the Linked Data Web in combination to return fresh results?
      • [Diagram: a query enters the Query Processor, which uses Source Selection and Dynamics together with the SPARQL index to return live results]
  23. Literature
      • [Hartig 2009] O. Hartig, Ch. Bizer, and J.-Ch. Freytag. Executing SPARQL Queries over the Web of Linked Data. In ISWC’09, 2009.
      • [Stuckenschmidt] H. Stuckenschmidt, R. Vdovjak, J. Broekstra, and G.-J. Houben. Towards distributed processing of RDF path queries. JWET, 2(2/3):207–230, 2005.
      • [Umbrich 2010] J. Umbrich, M. Hausenblas, A. Hogan, A. Polleres, S. Decker. Towards Understanding Dataset Dynamics: Change Frequency of Linked Data Sources. LODW 2010 at WWW 2010, 2010.
      • [Heflin 2010] Y. Li and J. Heflin. Using Reformulation Trees to Optimize Queries over Distributed Heterogeneous Sources. In Proceedings of the 9th International Semantic Web Conference (ISWC2010), 2010.
