A HYBRID FRAMEWORK FOR QUERYING
PhD viva presentation slides

Published in Technology
  • e.g. Sindice, Watson, SWSE, Virtuoso
  • No mention of stream processing. No infrastructure needed; not asking for events. Ad hoc. How to do query processing.
  • This setup allows for studying (i) dynamics within the datasets, (ii) dynamics between datasets (esp. links), and (iii) the growth of Linked Data and the arrival of new sources (although to a lesser extent).
  • Make it clearer.
  • Don’t mention two stores
  • We showed for two prominent stores that the problem exists.
  • denote
  • More links and connect more parts of the graph
  • Snapshot live
  • Overlay, dereferencing schema knowledge
  • Reasonable increase. Most inferences look reasonable.
  • We run it live. Query generation to the slide.
  • Add some query times here. If you would assume linear query times. Use a table with ratios.
  • Materialised store, outdated results: we need to check them again. But that means we do not use the data, only the source information.
  • Shrink the source index compared to the materialised index.
  • Introduce an example query to show that LTBQE is limited and that we can fix it by doing source selection. We do not need a full materialised index, since we retrieve the source and compute the query over it if we do a live lookup.
  • Investigate several lightweight source selection approaches to further improve the query times, increase the result recall and loosen the query type restriction of pure link traversal based query approaches.
  • Could combine with the previous one. Attach source to point. If we stored the source information for each point, we would end up with a full index with dic., so we split the numerical data space into buckets.
  • QTree is optimal for sparse data. Same number of buckets, but more fine-grained source selection.
  • Show experiments in a different way. Use the diagram again.
  • Introduce bit by bit: interfaces, coherence, query planner.
  • involve monitoring a large range of Linked Data sources to build a comprehensive, global picture of the dynamicity of the Web of Data. Previous empirical studies [17,15] have shown varying levels of dynamicity across Linked Data sources; furthermore, we speculate that dynamicity varies by the schema of data [17]. In terms of benefits, cache-independent estimates can be applied generically to any store (and indeed to other use-cases) [6]; however, they give no indication as to the specific coverage or update rates, etc., of the cache engine at hand.
  • Materialised stores: LOD cache, Sindice SPARQL. Use store icons.
  • Triple pattern estimates. Centered predicates. Query sampling URIs. Distinct predicates for caches.
  • More details
  • Filter out queries which produced empty results (offline sources)
  • We verified that Linked Data is dynamic, which has an impact on the results of materialised engines. LTBQE approaches offer fresh results but work only for dereferenceable URIs, and we can improve the recall through reasoning extensions. A compact data summary such as the QTree poses no query restrictions and can find more sources that can answer the query than LTBQE. Materialised caches and LTBQE can be combined in a hybrid execution framework to deliver fresh and fast results by integrating the knowledge about data dynamics.
  • Image not optimal.
  • Add a label and delete two sources.
  • Explain to Claudio; maybe remove it.

Transcript

  • 1. A HYBRID FRAMEWORK FOR QUERYING LINKED DATA DYNAMICALLY JÜRGEN UMBRICH PhD Viva November 26th, 2012
  • 2. Classical Query Approach MATERIALISED STORE centralised data warehousing fast query times 26/11/2012 PhD Viva, Jürgen Umbrich Slide 1 of 39 1
  • 3. MOTIVATING EXAMPLE: GIVE ME THE CURRENT TEMPERATURE OF THE EUROPEAN CAPITALS.
  • 4. Research Questions
    - How dynamic is Linked Data and what is the impact for store-based query processing?
    - Can the performance of live querying be improved by applying lightweight reasoning?
    - How effective are hash-based data summaries for source selection in live query processing?
    - How can live and store-based processing be combined to obtain a trade-off between fast and fresh results?
  • 5. How dynamic is Linked Data and what is the impact for store-based query processing? [LDOW 2010] [DESWEB 2010] [COLD 2011]
  • 6. DYNAMIC LINKED DATA OBSERVATORY
    Allows studying and assessing the dynamics of Linked Data.
    - DataHub and BTC
    - 95K static URIs
    - 95K dynamic (2 hops)
    - once a week
    - started in March 2012
    [http://world.yale.edu] Weekly dumps are freely available at http://swse.deri.org/dyldo
  • 7. DYNAMICS OF LINKED DATA: How fast does a source change?
    Pie chart over 15 weeks: no changes 58%, every week 17%, only once 8%, rest 17%.
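The four categories on this slide can be illustrated with a minimal sketch. It assumes each source is represented by one content fingerprint per weekly snapshot; the function name and that representation are illustrative assumptions and are not taken from the slides.

```python
def change_category(snapshots):
    """Classify a source by how often consecutive weekly snapshots differ.

    `snapshots` is a list with one comparable fingerprint per week
    (for example a content hash); this representation is an assumption.
    """
    changes = sum(1 for prev, cur in zip(snapshots, snapshots[1:]) if prev != cur)
    transitions = len(snapshots) - 1
    if changes == 0:
        return "no changes"
    if changes == 1:
        return "only once"
    if changes == transitions:
        return "every week"
    return "rest"
```

Counting the categories over all monitored sources would give shares of the kind shown in the pie chart.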
  • 8. DYNAMICS OF LINKED DATA: Can we observe different types of changes?
    Pie chart over 15 weeks: only value updates 24%, adds/dels 19%, others 14%; the remaining 20% and 23% are split between "only adds" and "value updates & adds/dels".
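One way to tell value updates apart from additions and deletions is to diff two snapshots of the same source. The sketch below treats a changed object for an unchanged subject/predicate pair as a value update; that definition and the function name are assumptions made for illustration, not the definition used in the thesis.

```python
def change_types(old, new):
    """Return the kinds of change between two snapshots (sets of (s, p, o) triples)."""
    removed, added = old - new, new - old
    old_keys = {(s, p) for s, p, _ in removed}
    new_keys = {(s, p) for s, p, _ in added}
    kinds = set()
    if old_keys & new_keys:
        kinds.add("value update")  # same subject/predicate, different object
    if new_keys - old_keys:
        kinds.add("add")           # statement with a new subject/predicate pair
    if old_keys - new_keys:
        kinds.add("del")           # subject/predicate pair disappeared
    return kinds
```

Multi-valued properties blur the line between an update and an add plus a delete, so a real classification would need a more careful rule.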
  • 9. IMPLICATIONS FOR CENTRALISED QUERYING: How coherent are the results?
  • 10. COHERENCE OF QUERIES
    Two pie charts (LOD cache, SPARQL endpoints) with the categories complete coherent, partially coherent and complete incoherent. Extracted shares: 1%, 15%, 35%, 43%, 56%, 50%.
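The three categories can be sketched as a comparison between the result set returned by a store and the result set obtained live from the sources. The exact definition used in the experiments is not given on the slide; equality, overlap and disjointness are assumed here.

```python
def coherence(store_results, live_results):
    """Compare a store's answers with live answers for the same query."""
    store, live = set(store_results), set(live_results)
    if store == live:
        return "complete coherent"
    if not (store & live):
        return "complete incoherent"
    return "partially coherent"
```

Queries with empty live results (offline sources) would be filtered out before this comparison.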
  • 11. PROBLEM WITH CLASSICAL QUERY APPROACH
    MATERIALISED STORE: centralised data warehousing
    - fast query times
    - outdated results
    - limited coverage
  • 12. LTBQE: LINK TRAVERSAL BASED QUERY EXECUTION
    Exploiting Linked Data principles:
    - dereferencing URIs
    - following links
    Query:
      SELECT ?f ?img
      WHERE { oh:olaf foaf:knows ?f .
              ?f foaf:depiction ?img . }
    Figure: RDF graph spread over the documents ohDoc and cbDoc.
    - ohDoc: oh:olaf with foaf:name "Olaf Hartig", owl:sameAs dblpA:Olaf_Hartig, foaf:img http://..., foaf:knows cb:chris
    - cbDoc: cb:chris with foaf:depiction http://..., owl:sameAs dblpA:Christian_Bizer, foaf:name "Chris Bizer"
    - an rdfs:seeAlso link between the documents
    Result: ?f = cb:chris, ?img = http://..
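The traversal on this slide can be sketched in a few lines. The sketch is a toy model, not the implementation evaluated in the thesis: the dictionary `WEB` stands in for HTTP dereferencing, the function names are illustrative assumptions, and patterns are evaluated strictly left to right.

```python
# Toy stand-in for HTTP dereferencing: URI -> triples of the document it resolves to.
WEB = {
    "oh:olaf": [("oh:olaf", "foaf:knows", "cb:chris"),
                ("oh:olaf", "foaf:name", "Olaf Hartig")],
    "cb:chris": [("cb:chris", "foaf:depiction", "http://..."),
                 ("cb:chris", "foaf:name", "Chris Bizer")],
}

def is_var(term):
    return term.startswith("?")

def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    result = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            if result.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return result

def ltbqe(patterns):
    """Evaluate patterns left to right, dereferencing URIs found along the way."""
    seen, data = set(), set()

    def deref(uri):
        if uri not in seen:
            seen.add(uri)
            data.update(WEB.get(uri, []))  # non-dereferenceable URIs yield nothing

    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for b in bindings:
            bound = tuple(b.get(t, t) for t in pattern)  # substitute known variables
            for term in bound:
                if not is_var(term):
                    deref(term)
            for triple in list(data):
                nb = match(bound, triple, b)
                if nb is not None:
                    next_bindings.append(nb)
        bindings = next_bindings
    return bindings
```

Calling `ltbqe([("oh:olaf", "foaf:knows", "?f"), ("?f", "foaf:depiction", "?img")])` first fetches the document behind oh:olaf, discovers cb:chris, and only then fetches the second document. This shows why recall depends on dereferenceability and execution order.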
  • 13. PERFORMANCE FACTORS OF LTBQE
    - query time is influenced by
      - source selection
      - number of sequential lookups
    - result recall is influenced by
      - dereferenceability
      - execution order
      - connectivity
  • 14. Can the performance of live querying be improved by applying lightweight reasoning? [RR 2012] [SWJ submission]
  • 15. OUR CONTRIBUTION TO LTBQE
    Improved recall with reasoning extensions to make more raw data available:
    - subset of RDFS
    - explicit owl:sameAs
  • 16. HOW REASONING CAN HELP LTBQE
    Query:
      SELECT ?label
      WHERE { oh:olaf foaf:knows ?f .
              ?f rdfs:label ?label . }
    Figure: the graph from the LTBQE example, extended with the document dblpADoc:Christian_Bizer (reached from cb:chris via owl:sameAs dblpA:Christian_Bizer; it also holds dblpP:Hartig09 with a foaf:maker link) and the schema triple foaf:name rdfs:subPropertyOf rdfs:label.
    Result: ?label = "Christian Bizer", "Chris Bizer"
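How the two extensions surface extra answers can be sketched with a small fixpoint over two rules: rdfs:subPropertyOf inheritance and owl:sameAs applied to subjects. The toy triples are an assumed reading of the garbled figure, and the function `infer` is illustrative; the actual system also dereferences the owl:sameAs targets and the schema documents.

```python
DATA = {
    ("oh:olaf", "foaf:knows", "cb:chris"),
    ("cb:chris", "foaf:name", "Chris Bizer"),
    ("cb:chris", "owl:sameAs", "dblpA:Christian_Bizer"),
    ("dblpA:Christian_Bizer", "foaf:name", "Christian Bizer"),
    ("foaf:name", "rdfs:subPropertyOf", "rdfs:label"),
}

def infer(triples):
    """Apply subPropertyOf and (symmetric) sameAs-on-subject until nothing new appears."""
    facts = set(triples)
    while True:
        sub_props = {(s, o) for s, p, o in facts if p == "rdfs:subPropertyOf"}
        same = {(s, o) for s, p, o in facts if p == "owl:sameAs"}
        same |= {(o, s) for s, o in same}
        new = set()
        for s, p, o in facts:
            new |= {(s, sup, o) for sub, sup in sub_props if sub == p}
            new |= {(b, p, o) for a, b in same if a == s}
        if new <= facts:
            return facts
        facts |= new

labels = {o for s, p, o in infer(DATA) if s == "cb:chris" and p == "rdfs:label"}
```

Without inference the pattern `?f rdfs:label ?label` matches nothing in this data; with it, both names become labels of cb:chris.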
  • 17. LTBQE ANALYSIS
    Investigate how practical LTBQE is and how much more raw data and results can be made available with our extensions.
    - How many URIs can be dereferenced?
    - How much additional data with our extensions?
    - How do our extensions perform in practice?
  • 18. LTBQE ANALYSIS: EXPERIMENTS. How many URIs can be dereferenced?
    BTC 2011, 25.4m URIs
    position               %URIs   available data
    <URI> ?p ?o .          85%     95%
    ?s ?p <URI> .          46%     44%
    ?s <URI> ?o .          1%      0.00…%
    ?s rdf:type <URI> .    10%     0.2%
    <URI>                  44%     51%
    Schema data. Improved query time by around 50% by reducing the number of lookups.
  • 19. LTBQE ANALYSIS: EXPERIMENTS. How much additional data with our extensions?
    BTC 2011, 18.65m URIs
    position                  %URIs   available data
    <URI> rdfs:seeAlso ?o .   2%      1.006x
    <URI> owl:sameAs ?o .     16%     2.5x
    RDFS reasoning*           81%     1.78x
    * rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, rdfs:range; authoritative TBox [Bonatti] extracted from BTC 2011
  • 20. QUERY GENERATION. How do our extensions perform in practice?
    Existing benchmarks target either a single domain or provide only a few queries.
    QWalk: random walk based query generation.
    - BTC 2011
    - 1100 queries
    - 100 each for 11 "typical" shapes
  • 21. THROUGHPUT: AVERAGE RESULT/TIME RATIO (cells colour-coded from worst to best)
    shape       LTBQE   Core-   seeAlso   sameAs   RDFS    Comb
    entity-s    1       1.68    1.67      2.15     1.29    1.53
    entity-o    3.97    6.48    6.16      5.7      5.37    4.33
    entity-so   2.02    2.82    2.66      3.71     3.73    4.8
    star-3-0    0.11    0.16    0.15      0.15     0.24    0.2
    star-2-1    0.58    1.12    1         1.04     2.14    1.75
    star-1-2    0.17    1.6     1.35      1.6      70.97   58.85
    star-0-3    0.18    0.35    0.33      0.94     0.24    0.68
    s-path-2    0.44    0.72    0.68      0.7      0.83    0.78
    s-path-3    1.76    2.45    2.56      2.46     2.43    2.1
    o-path-2    1.38    8.39    7.76      10.55    6.36    6.89
    o-path-3    0.95    5.7     5.84      6.08     5.04    4.68
    Overall average query time of ~12 seconds.
  • 22. LIMITATION OF LTBQE: JOIN OVER LITERALS
    Query:
      SELECT ?p2
      WHERE { oh:olaf foaf:name ?name .
              ?p2 foaf:name ?name . }
    Figure: ohDoc and dblpADoc:Olaf_Hartig both contain the literal "Olaf Hartig" as a foaf:name value; answering the query requires a join over that literal.
    - materialised store: outdated results
    - LTBQE: ?
  • 23. ALTERNATIVE: SOURCE SELECTION
    Figure: a SOURCE INDEX points the QUERY ENGINE to the relevant documents (ohDoc, dblpADoc:Olaf_Hartig).
  • 24. How effective are hash-based data summaries for source selection in live query processing? [WWW 2010] [WWWJ 2011]
  • 25. APPROXIMATE DATA SUMMARIES
    - Combined description of
      - schema and
      - instance data
    - Use approximation to reduce index size (incurs false positives)
    - Hash-based approach
    - Space complexity: O(buckets * #sources)
    - QTree: combination of histograms and R-tree, inheriting the benefit of both data structures
      - optimal for sparse data
  • 26. HASH-BASED DATA SUMMARIES
    - Input: triple + source
    - Hash: triple
    - Insert: 3D point and save source information
    Example:
    - Data: oh:olaf foaf:name "Olaf Hartig" . (from ohDoc)
    - Hash: [ 24 , 5 , 2 ] , ohDoc
    - Insert: ([ 24 , 5 , 2 ] , ohDoc)
    Figure: 3D coordinate space with axes s, p and o, each ranging from 1 to 30.
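The three steps (input, hash, insert) can be sketched as follows. The hash function is an arbitrary stand-in (CRC32 folded into the range 1 to 30 read off the figure), so it will not reproduce the point [24, 5, 2] from the slide; all names are illustrative assumptions.

```python
import zlib

MAX_COORD = 30  # assumed axis size, read off the 1..30 axes in the figure

def hash_term(term):
    """Map an RDF term to a coordinate in 1..MAX_COORD (stand-in hash function)."""
    return zlib.crc32(term.encode("utf-8")) % MAX_COORD + 1

def to_point(triple):
    """Hash subject, predicate and object into one 3D point."""
    return tuple(hash_term(t) for t in triple)

summary = {}  # point -> set of sources; exact per-point index, before any bucketing

def insert(triple, source):
    summary.setdefault(to_point(triple), set()).add(source)

insert(("oh:olaf", "foaf:name", "Olaf Hartig"), "ohDoc")
```

Storing the source set per point is what would grow into a full index; the next slide replaces points by buckets to keep the summary small.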
  • 27. EFFICIENT SOURCE SELECTION
    Summarise data with buckets and store cardinality and source information.
    Query: Lookup { oh:olaf ?p ?o }, hashed to ( 24 , ? , ? )
    Figure: the same data space (axes s and o, 1 to 30) partitioned once by an equi-width histogram and once by a QTree; the lookup selects the buckets that intersect the query region and returns their sources (ohDoc).
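The bucket lookup can be sketched for the simpler of the two structures, the equi-width histogram; the QTree itself is not reproduced here. Bucket width, hash function and class name are assumptions chosen for illustration.

```python
import zlib
from collections import defaultdict

MAX_COORD, WIDTH = 30, 10  # assumed axis range and bucket width

def coord(term):
    """Stand-in hash: map a term to a coordinate in 0..MAX_COORD-1."""
    return zlib.crc32(term.encode("utf-8")) % MAX_COORD

def bucket_of(triple):
    return tuple(coord(t) // WIDTH for t in triple)

class HistogramSummary:
    def __init__(self):
        # bucket -> source -> number of triples (the cardinality per source)
        self.buckets = defaultdict(lambda: defaultdict(int))

    def insert(self, triple, source):
        self.buckets[bucket_of(triple)][source] += 1

    def sources_for(self, pattern):
        """Sources that may hold matching triples; None marks a variable."""
        wanted = [None if t is None else coord(t) // WIDTH for t in pattern]
        hits = set()
        for bucket, per_source in self.buckets.items():
            if all(w is None or w == b for w, b in zip(wanted, bucket)):
                hits.update(per_source)
        return hits
```

A lookup such as `sources_for(("oh:olaf", None, None))` never misses a source that contains a matching triple, but it may return sources that merely share a bucket. These are the false positives mentioned for approximate summaries.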
  • 28. EVALUATION
    Number of estimated sources as the crucial performance factor.
    Figure: number of sources (log scale) for the QTree, other approaches and the actually relevant sources.
  • 29. TRADE-OFF: FRESH OR FAST
    - ACCESSING DATA AT RUNTIME: fresh results, slow query times
    - MATERIALISED STORE: fast query times, outdated results, limited coverage
  • 30. How can live and store query processing be combined to obtain a trade-off between fast and fresh results? [DESWEB 2012] [EKAW 2012] [ISWC 2012]
  • 31. HYBRID SPARQL EXECUTION IDEA
    GIVE ME THE CURRENT TEMPERATURE OF THE EUROPEAN CAPITALS.
    - dynamic part: fresh results
    - static part: fast query times
  • 32. HYBRID SPARQL: ARCHITECTURE
    Diagram components: coherence monitor, index query interface, live query interface, query planner; two arrows labelled "update".
  • 33. COHERENCE MONITOR
    Computes and stores statistics about the freshness and coverage of the cache for individual query patterns.
    - store independent: can be applied to any store; no indication of specific coverage or update rates
    - store specific: more sensitive to the update patterns and coverage of the store
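A store-specific monitor can be sketched as a small bookkeeping class that compares store answers with live answers for sampled queries and keeps one score per query pattern. The score used here (share of live results also present in the store) and all names are assumptions; the slides do not state how the estimates are computed.

```python
from collections import defaultdict

class CoherenceMonitor:
    def __init__(self):
        self.samples = defaultdict(list)  # pattern key -> per-sample scores

    def record(self, key, store_results, live_results):
        """Add one sample for a pattern key, e.g. the predicate of a triple pattern."""
        live = set(live_results)
        if not live:
            return  # nothing to compare against (for example an offline source)
        self.samples[key].append(len(live & set(store_results)) / len(live))

    def estimate(self, key, default=0.0):
        """Average score for a key; unseen keys fall back to `default`."""
        scores = self.samples.get(key)
        return sum(scores) / len(scores) if scores else default
```

The query planner can then read `estimate(key)` as a value between 0 and 1 when deciding whether a pattern is static enough to be answered from the store.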
  • 34. COHERENCE OF PREDICATES
    Two pie charts (LOD cache, SPARQL endpoints) with the categories complete coherent, partially coherent and complete incoherent. Extracted shares: 10%, 30%, 23%, 46%, 67%, 24%. Example predicates: sioc:account_of, swivt:creationDate, foaf:knows.
  • 35. COHERENCE ESTIMATES
  • 36. QUERY PLANNER
    - finding best query plan
    - identifying dynamic/static patterns
    - delegation and merging
  • 37. QUERY PLANNING
    Figure: two query plans over the triple patterns tp1 to tp4, one selectivity-based and one coherence-based.
    Pattern   Selectivity   Coherence
    tp1       0.98          0.86
    tp2       0.43          0.32
    tp3       0.21          0.00
    tp4       0.15          0.91
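The difference between the two strategies can be sketched by sorting the patterns on either column of the table. Ascending order is an assumption; it yields the sequences tp4, tp3, tp2, tp1 and tp3, tp2, tp1, tp4, which is one consistent reading of the interleaved plan labels on the slide.

```python
# Values copied from the slide's table: pattern -> (selectivity, coherence).
STATS = {
    "tp1": (0.98, 0.86),
    "tp2": (0.43, 0.32),
    "tp3": (0.21, 0.00),
    "tp4": (0.15, 0.91),
}

def order_by(stats, column):
    """Order pattern names ascending by column 0 (selectivity) or 1 (coherence)."""
    return sorted(stats, key=lambda tp: stats[tp][column])

selectivity_plan = order_by(STATS, 0)
coherence_plan = order_by(STATS, 1)
```

The two orders disagree most on tp4: it has the lowest selectivity value and the highest coherence, so it comes first in one plan and last in the other.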
  • 38. REAL WORLD EXPERIMENTS
    Evaluation of different hybrid query plan strategies.
    Methodology:
    - QWalk: various types of SPARQL SELECT queries
      - star-shaped, path-shaped, mixed
      - different numbers of patterns
      - at least one static and dynamic pattern
    - Variable counting ordering
    - Single split with threshold (e.g. 0.5)
    - Static part is executed first
    - Link traversal based query execution
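The combination of variable counting ordering and a single threshold split can be sketched as below. The triple patterns and the `ex:` prefix are hypothetical; the coherence values are borrowed from the table on the query planning slide, and treating "coherence at or above the threshold" as static is an assumption.

```python
# Hypothetical triple patterns for tp1..tp4.
PATTERNS = {
    "tp1": ("?c", "rdf:type", "ex:Capital"),
    "tp2": ("?c", "ex:temperature", "?t"),
    "tp3": ("?c", "ex:population", "?n"),
    "tp4": ("?c", "ex:locatedIn", "ex:Europe"),
}
COHERENCE = {"tp1": 0.86, "tp2": 0.32, "tp3": 0.00, "tp4": 0.91}

def split_plan(patterns, coherence, threshold=0.5):
    """Order patterns by number of variables, then split into static and dynamic parts."""
    def n_vars(name):
        return sum(term.startswith("?") for term in patterns[name])

    ordered = sorted(patterns, key=n_vars)  # stable sort: ties keep input order
    static = [n for n in ordered if coherence[n] >= threshold]   # answered by the store
    dynamic = [n for n in ordered if coherence[n] < threshold]   # answered live
    return static, dynamic

static_part, dynamic_part = split_plan(PATTERNS, COHERENCE)
```

The static part would be sent to the store first, and its bindings then seed the live execution of the dynamic part.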
  • 39. REAL WORLD EXPERIMENTS
    Plot (avg. of 43 queries): recall relative to live querying (y-axis) against speedup (x-axis) for the live and store baselines and for hybrid plans that vary the ordering (coh, sel) and the split (rnd., thres., fixed, opt).
  • 40. CONCLUSION
    - How dynamic is Linked Data and what is the impact for store-based query processing?
      We verified that Linked Data is dynamic and that it impacts the result freshness and completeness of cache-based query engines.
    - Can the performance of live querying be improved by applying lightweight reasoning?
      Our source selection and reasoning optimisations improve query time and result recall compared to the state of the art.
    - How effective are hash-based data summaries for source selection in live query processing?
      The QTree loosens the query restrictions of pure live querying and outperforms similar source selection approaches.
    - How can live and cache query processing be combined to obtain a trade-off between fast and fresh results?
      Hybrid query execution with the knowledge of data dynamics for fast and fresh results.
  • 41. FUTURE WORK
    - Dynamic Linked Data Observatory
      - Extended experiments
      - Data mining to discover dynamic relations
    - Hybrid Query Execution
      - Develop a cost model which combines selectivity and coherence
      - Automatically find best plan and split
      - Combination of different query approaches
    - SPARQL as the query language for the Web
      - Navigational features