Friday talk 11.02.2011

MiniViva presentation as part of the DERI Friday talk events

  • Property name
  • Webpages with browser bars
  • Qtree optimal for sparse data Same number of buckets , but more fine grained source selection
  • Qtree highlight red
  • Swse/yars2, sindice/virtuoso

Friday talk 11.02.2011 Presentation Transcript

  • Querying Live Linked Data, by Jürgen Umbrich. MiniViva presentation (11.02.2011)
  • Querying in the Linked Data space: millions of diverse but often interrelated data sources; “data everywhere” on the Web; no complete control over the data. [Diagram labels: QP; static: crawl, Index (Yars2, Virtuoso); dynamic: live distributed querying]
  • Linked Data is Dynamic
    • Dataset – Web data (’08 – ‘09)
    • 24 weekly snapshots
    • 4-hop neighborhood from the Tim Berners-Lee FOAF file
    • 550K RDF/XML docs, 3.3M unique entities
      • [ Umbrich et al. 2010 ]
    Findings (entity level): 68% static, 32% dynamic. Change frequency (values paired with labels in the order they appear on the slide): 52% <1 week; 24% >1 week and <= 1 month; 10% >1 month and <= 3 months; 14% >3 months and <= 6 months.
  • Accessing Linked Data
    • Use URIs for things
    • Use HTTP URIs so that people can look it up
    • Provide useful information, using standards (RDF, SPARQL)
    • Include links to other URIs
    Direct correspondence between thing-URI and source-URI: an HTTP GET for http://umbrich.net/foaf.rdf#me retrieves the RDF/XML document http://umbrich.net/foaf.rdf, which describes #me (e.g. foaf:based_near http://dbpedia.org/resource/Galway).
  • Accessing Linked Data (continued). Re-direct correspondence between thing-URI and source-URI: an HTTP GET for http://dbpedia.org/resource/Galway is redirected to http://dbpedia.org/data/Galway (RDF/XML) or http://dbpedia.org/page/Galway (HTML). Direct correspondence, as before: http://umbrich.net/foaf.rdf#me is described in the RDF/XML document http://umbrich.net/foaf.rdf.
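The two kinds of correspondence can be sketched offline as follows. This is a minimal illustration, not the speaker's code: the function name `source_uri` and the `KNOWN_REDIRECTS` table are assumptions made for this example, and the table stands in for what a real HTTP client would learn from the server's redirect response.

```python
from urllib.parse import urldefrag

# Hypothetical redirect table; a live client would follow HTTP redirects instead.
KNOWN_REDIRECTS = {
    "http://dbpedia.org/resource/Galway": "http://dbpedia.org/data/Galway",
}

def source_uri(thing_uri, redirects=None):
    """Return the document URI that describes thing_uri.

    Direct correspondence: drop the fragment (#me).
    Re-direct correspondence: look the URI up in the redirect table.
    """
    redirects = redirects or {}
    doc_uri, _fragment = urldefrag(thing_uri)
    return redirects.get(doc_uri, doc_uri)

direct = source_uri("http://umbrich.net/foaf.rdf#me")
redirected = source_uri("http://dbpedia.org/resource/Galway", KNOWN_REDIRECTS)
print(direct)
print(redirected)
```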
  • The Problem. Example query: SELECT ?friendLabel WHERE { juum:me foaf:knows ?f . ?f foaf:name ?friendLabel . } What are the query-relevant sources? Triple patterns: juum:me foaf:knows ?f . and ?f foaf:name ?friendLabel . Sources shown: umbrich.net/foaf.rdf, polleres.net/foaf.rdf, sw.deri.org/~aidanh/
  • Source Selection Approaches: Index, Quad Store (e.g. Yars2). [Diagram: the triple patterns juum:me foaf:knows ?f . and ?f foaf:name ?friendLabel . with HTTP GET lookups and the results “Aidan Hogan” and “Axel Polleres”]
  • Source Selection Approaches: Quad Store (e.g. Yars2); Direct execution/graph traversal [Hartig et al. 2009]. [Diagram: the triple patterns juum:me foaf:knows ?f . and ?f foaf:name ?friendLabel . resolved through HTTP GET lookups, with the results “Aidan Hogan” and “Axel Polleres”]
  • Source Selection Approaches, compared: Schema-Level Indices [Stuckenschmidt et al. 2004]; Data Summaries [Umbrich et al. 2010]; Inverted Indices [Heflin et al. 2010] (e.g. Sindice V1.0); Quad Store (e.g. Yars2); Direct execution/graph traversal [Hartig et al. 2009]. [Chart labels: Index Size, Query time, Results (recall, freshness), Query System]
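Direct execution by graph traversal can be illustrated with a toy, in-memory "Web". This is a sketch of the general idea only, not the algorithm of [Hartig et al. 2009]: the `WEB` dictionary, the example.org URIs and the helper names are all hypothetical, and only the example query and the two result names come from the slides.

```python
from urllib.parse import urldefrag

ME = "http://umbrich.net/foaf.rdf#me"

# Hypothetical stand-in for the Web: document URI -> triples served by it.
WEB = {
    "http://umbrich.net/foaf.rdf": [
        (ME, "foaf:knows", "http://example.org/aidan.rdf#me"),
        (ME, "foaf:knows", "http://example.org/axel.rdf#me"),
    ],
    "http://example.org/aidan.rdf": [
        ("http://example.org/aidan.rdf#me", "foaf:name", "Aidan Hogan"),
    ],
    "http://example.org/axel.rdf": [
        ("http://example.org/axel.rdf#me", "foaf:name", "Axel Polleres"),
    ],
}

def http_get(uri):
    """Simulated lookup: strip the fragment and return (document URI, triples)."""
    doc, _fragment = urldefrag(uri)
    return doc, WEB.get(doc, [])

def friend_names(me):
    """Evaluate: me foaf:knows ?f . ?f foaf:name ?friendLabel . by following links."""
    fetched = set()
    doc, triples = http_get(me)
    fetched.add(doc)
    friends = [o for s, p, o in triples if s == me and p == "foaf:knows"]
    names = []
    for friend in friends:
        doc, triples = http_get(friend)  # one lookup per discovered URI
        fetched.add(doc)
        names += [o for s, p, o in triples if s == friend and p == "foaf:name"]
    return sorted(names), fetched

names, fetched = friend_names(ME)
print(names, len(fetched))
```

The relevant sources are discovered only while the query runs, which is why this approach needs no index but pays for every lookup at query time.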
  • Approximate Data Summaries
    • Combined description of schema level and instance level
    • Use approximation to reduce index size (incurs false positives)
    • Index growth only with the number of sources
    • Multidimensional numerical dataspace
    Hash-based data summaries [Figure: two-dimensional dataspace with axes s and o, each ranging from 1 to 30]
  • Hash-based Data Summaries
    • Input: triple + source information, e.g. juum:me foaf:knows ah:ah <http…foaf.rdf>
    • Hash the triple: [ 24 , 5 , 2 ] <http…foaf.rdf>
    • Insert the hash-triple into the dataspace and store the source information with the buckets: INS( [ 24 , 5 , 2 ] , http…foaf.rdf )
    Equi-width histogram
    • Query for relevant sources: QUERY ( juum:me ?p ?o ) -> ( 24, ?, ? )
    [Figure: dataspace with axes s and o, each from 1 to 30, with bucket boundaries at 10 and 20]
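The insert and query steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the hash function (CRC32 modulo 30), the class name `HistogramSummary` and the source URI http://example.org/foaf.rdf are choices made for this example and do not reproduce the hash values shown on the slide; the range 1..30 and the bucket boundaries at 10 and 20 follow the figure.

```python
import zlib
from collections import defaultdict

SPACE = 30  # each dimension ranges over 1..30, as in the figure
WIDTH = 10  # equi-width buckets: 1-10, 11-20, 21-30

def h(term):
    """Deterministic hash of an RDF term into 1..SPACE (assumed hash function)."""
    return zlib.crc32(term.encode("utf-8")) % SPACE + 1

def bucket(coord):
    """Index of the equi-width bucket that contains coord."""
    return (coord - 1) // WIDTH

class HistogramSummary:
    def __init__(self):
        self.buckets = defaultdict(set)  # bucket key -> set of source URIs

    def insert(self, triple, source):
        """INS: hash the triple and store the source with its bucket."""
        key = tuple(bucket(h(term)) for term in triple)
        self.buckets[key].add(source)

    def query(self, pattern):
        """QUERY: None marks a variable; returns candidate sources.

        The result may contain false positives, because different terms
        can fall into the same bucket.
        """
        wanted = [None if term is None else bucket(h(term)) for term in pattern]
        sources = set()
        for key, srcs in self.buckets.items():
            if all(w is None or w == k for w, k in zip(wanted, key)):
                sources |= srcs
        return sources

summary = HistogramSummary()
summary.insert(("juum:me", "foaf:knows", "ah:ah"), "http://example.org/foaf.rdf")
candidates = summary.query(("juum:me", None, None))
print(candidates)
```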
  • QTree: Efficient source selection (equi-width histogram vs. QTree)
    • Combination of histograms and R-tree inheriting the benefit of both data structures
      • optimal for sparse data
    • Buckets store cardinality and set of sources => Top-k source ranking e.g. R 1,1 ( 1: { http://…/foaf.rdf } )
    [Figure: dataspace with axes s and o, each from 1 to 30, with marks at 10 and 20]
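Top-k source ranking from buckets that store cardinalities and sources might look like the sketch below. It is not the QTree implementation: the flat list of buckets (no tree hierarchy), the per-source counts in each bucket and all names are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Bucket:
    low: tuple    # lower corner of the bounding box (s, p, o)
    high: tuple   # upper corner of the bounding box (s, p, o)
    counts: dict  # source URI -> number of triples in this bucket (assumed layout)

def overlaps(bucket, q_low, q_high):
    """True if the bucket's bounding box intersects the query box."""
    return all(
        b_lo <= q_hi and q_lo <= b_hi
        for b_lo, b_hi, q_lo, q_hi in zip(bucket.low, bucket.high, q_low, q_high)
    )

def top_k_sources(buckets, q_low, q_high, k):
    """Rank sources by the summed cardinality of all overlapping buckets."""
    score = Counter()
    for b in buckets:
        if overlaps(b, q_low, q_high):
            score.update(b.counts)
    return [source for source, _count in score.most_common(k)]

buckets = [
    Bucket((1, 1, 1), (10, 10, 10), {"a": 3, "b": 1}),
    Bucket((20, 1, 1), (30, 30, 30), {"c": 5}),
    Bucket((5, 5, 5), (12, 12, 12), {"b": 4}),
]
# Subject hashed to 8; predicate and object are variables (full range 1..30).
ranking = top_k_sources(buckets, (8, 1, 1), (8, 30, 30), k=2)
print(ranking)
```

Because buckets are bounding boxes rather than a fixed grid, empty regions of the dataspace need no buckets at all, which fits the slide's remark that the structure suits sparse data.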
  • Evaluation: Source Selection. J. Umbrich, K. Hose, M. Karnstedt, A. Harth, A. Polleres. "Comparing Data Summaries for Processing Live Queries over Linked Data". In WWW Journal, Special Issue "Querying the Data Web", 2011
  • Source Selection Approaches, compared: Schema-Level Indices [Stuckenschmidt et al. 2004]; Data Summaries [Umbrich et al. 2010]; Inverted Indices [Heflin et al. 2010] (e.g. Sindice V1.0); Quad Store (e.g. Yars2); Direct execution/graph traversal [Hartig et al. 2009]. [Chart labels: Index Size, Query time, Results (recall, freshness), Query System]
  • Querying in the Linked Data Space: millions of diverse but often interrelated data sources; “data everywhere” on the Web; no complete control over the data. Combined querying of RDF stores and the Linked Data Web. [Diagram labels: QP; static: crawl, MAT, Index; dynamic: live distributed querying]
  • Improved Query Time & Fresh Results: by learning about source dynamics, combined querying decreases query time by avoiding unnecessary HTTP lookups while still returning fresh results. [Chart: query time against number of query executions, for live querying, index querying and combined querying]
  • Current Research Question: how to combine querying of RDF stores and the Linked Data Web?
  • Combined Query Processing
    • Live results on top of SPARQL stores
    Decide at query time whether to access the static store or the Web resources, by integrating knowledge about the dynamics of sources into the query processor. [Diagram components: query, Query Processor, Source Selection, Dynamics, SPARQL Index (Yars2, Virtuoso), Linked Data Web, live results]
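One way to picture the query-time decision is the sketch below. It is an assumption-laden illustration, not the talk's query processor: the `SourceInfo` fields, the time unit (days) and the rule "fetch live once the indexed copy is older than the estimated change interval" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SourceInfo:
    uri: str
    last_crawled: float               # day on which the store's copy was fetched
    change_interval: Optional[float]  # estimated days between changes; None = static

def plan_access(sources, now):
    """Split the relevant sources into live lookups and store lookups."""
    live, store = [], []
    for src in sources:
        age = now - src.last_crawled
        if src.change_interval is not None and age >= src.change_interval:
            live.append(src.uri)   # copy is probably stale: HTTP lookup
        else:
            store.append(src.uri)  # copy is probably fresh: use the index
    return live, store

sources = [
    SourceInfo("http://example.org/sensor.rdf", last_crawled=0, change_interval=1),
    SourceInfo("http://example.org/city.rdf", last_crawled=0, change_interval=None),
    SourceInfo("http://example.org/blog.rdf", last_crawled=0, change_interval=30),
]
live, store = plan_access(sources, now=7)
print(live, store)
```

Only the source expected to have changed since the crawl triggers an HTTP lookup; the other two are answered from the store, which is the saving the previous slide describes.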
  • Mining Dynamic/Static Patterns
    • Goal
      • Acquire knowledge about dynamic patterns (e.g. geo:lat, geo:long)
      • Consider the context of a node (e.g. the location value of a city vs. the location value of a GPS sensor)
    • Based on two datasets (started in March 2010)
      • Daily 3-hop neighborhood crawls from 20 seed URIs
      • Weekly snapshots over ~10 months: 10% sample from a billion-triple crawl (fixed URI list, contains ~2K Web vocabularies)
    • Learn to predict change events
    Dynamics
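A very simple way to measure how dynamic a property is across snapshots is sketched below. The snapshot layout (a dictionary from (subject, predicate) to value), the toy data and the function name are assumptions for illustration; the talk does not specify its mining method.

```python
from collections import defaultdict

def predicate_change_rates(snapshots):
    """Fraction of consecutive observations in which a predicate's value changed.

    snapshots: list of dicts mapping (subject, predicate) -> value, in time order.
    """
    changed = defaultdict(int)
    observed = defaultdict(int)
    for prev, curr in zip(snapshots, snapshots[1:]):
        for (subject, predicate), old_value in prev.items():
            if (subject, predicate) in curr:
                observed[predicate] += 1
                if curr[(subject, predicate)] != old_value:
                    changed[predicate] += 1
    return {p: changed[p] / observed[p] for p in observed}

# Toy data: a city's latitude never changes, a GPS sensor's latitude always does.
snapshots = [
    {("ex:Galway", "geo:lat"): "53.27", ("ex:sensor1", "geo:lat"): "53.10",
     ("ex:Galway", "rdfs:label"): "Galway"},
    {("ex:Galway", "geo:lat"): "53.27", ("ex:sensor1", "geo:lat"): "53.20",
     ("ex:Galway", "rdfs:label"): "Galway"},
    {("ex:Galway", "geo:lat"): "53.27", ("ex:sensor1", "geo:lat"): "53.30",
     ("ex:Galway", "rdfs:label"): "Galway"},
]
rates = predicate_change_rates(snapshots)
print(rates)
```

In this toy data geo:lat changes in half of its observations, yet all of those changes come from the sensor, which illustrates why the context of a node matters and the predicate alone is not enough.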
  • Query Processor
    • Collaboration with Yuan (APEXLAB)
    • Elaborate how dynamic query planning can support data-access decisions, taking dynamic patterns into account
    • Investigation of one of the possible approaches
    Query Processor
  • Evaluation
    • Based on simulation
      • using our dynamic mining dataset
    • Based on real-world data
      • Linked Stream Data effort
      • Using the gathered knowledge from our dynamic mining
    • Evaluation criteria
      • Query time (number of HTTP lookups)
      • Result freshness
      • Recall (number of results)
  • How to combine querying of RDF stores and the Linked Data Web to return fresh results? Questions? [Diagram components: query, Query Processor, Source Selection, Dynamics, SPARQL Index, live results]
  • Literature
    • [Hartig 2009] O. Hartig, Ch. Bizer, and J.-Ch. Freytag. Executing SPARQL Queries over the Web of Linked Data. In ISWC’09, 2009.
    • [Stuckenschmidt] H. Stuckenschmidt, R. Vdovjak, J. Broekstra, and G.-J. Houben. Towards distributed processing of RDF path queries. JWET, 2(2/3):207–230, 2005.
    • [Umbrich 2010] J. Umbrich, M. Hausenblas, A. Hogan, A. Polleres, S. Decker. Towards Understanding Dataset Dynamics: Change Frequency of Linked Data Sources. LODW 2010 at WWW 2010, 2010.
    • [Heflin 2010] Y. Li and J. Heflin. Using Reformulation Trees to Optimize Queries over Distributed Heterogeneous Sources. In Proceedings of the 9th International Semantic Web Conference (ISWC2010), 2010.