Searching Linked Data        From Finding Relevant Sources to Computing Answers        Invited Presentation @ Internationa...
Agenda Searching Linked Data      Opportunities & challenges Keyword Query Routing      Problem Definition      Summa...
Linked Data    - 203 linked datasets serve 25 billion RDF triples interconnected by 395 million links    - As of 09-2010 +...
Opportunities         “Articles from awarded researchers at Stanford ”     Freebase contains data about people           ...
Problems         “Articles from awarded researchers at Stanford ”                       Large number of unknown,         ...
Searching Linked Data     Given the needs (expressed as sets of keywords),               are there answers in linked dat...
Agenda Searching Linked Data      Opportunities & challenges Keyword Query Routing      Problem Definition      Summa...
LOD Data Graph     Web data modeled as a set of interlinked data graphs     Each data graph represent a source     Data...
LOD Schema Graph     Web data modeled as a set of interlinked data graphs     Each data graph represent a source     Da...
LOD Source Graph      Web data modeled as a set of interlinked data graphs      Each data graph represent a source     ...
Keyword Query Answers          User information need                                                 „stanford           a...
Problem Definition      Keyword query result (also called Steiner graph) is a       subgraph of data graph that for every...
A Valid Keyword Routing Plan          User information need                                                 „stanford     ...
Agenda Searching Linked Data      Opportunities & challenges Keyword Query Routing      Problem Definition      Summa...
Keyword Sets      One keyword set for every data source      Elements stand for distinct keywords mentioned in a source ...
Element-level Keyword-Element Relationship Graph (E- KERG)           A keyword-element captures a keyword k and the data ...
Schema-level Keyword-Element Relationship Graph (S-KERG)           A keyword-element captures a keyword k and the schema ...
Data-Source-level Keyword-Element Relationship Graph (D-KERG)           A keyword-element captures a keyword k and the so...
Agenda Searching Linked Data      Opportunities & challenges Keyword Query Routing      Problem Definition      Summa...
Experiments      Chunk of the BTC dataset containing 10M RDF       triples from 154 sources, linked via 500K mappings    ...
Validity                P@k measure the percentage of plans that are valid out of the top-k plans                P@5 for...
Performance                Times increased with higher values for dmax                                       Sharp for E...
Agenda Searching Linked Data      Opportunities & challenges Keyword Query Routing      Problem Definition      Summa...
Mixed Query Processing Strategy Combination of top-down and bottom-up  strategies            Top-down: partial local ind...
Agenda Searching Linked Data      Opportunities & challenges Keyword Query Routing      Problem Definition      Summa...
Stream-based Query Processing                                                                                             ...
Agenda Searching Linked Data      Opportunities & challenges Keyword Query Routing      Problem Definition      Summa...
Corrective Source Ranking Prefer more relevant sources Relevancy of a source is based on            Current query      ...
Evaluation Three systems: top-down (TD), bottom-up (BU), mixed (MI) 8 queries over various datasets (DBpedia, Geonames, ...
Results    Overall early result reporting        25% results: MI 8.7s, BU 15.1s        50% results: MI 12.8s, BU 22.0s    ...
Result Arrival TimesThanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu   KIT – University of the State of Baden-Wuertt...
Agenda Searching Linked Data      Opportunities & challenges Keyword Query Routing      Problem Definition      Summa...
Conclusions      Keyword query routing         Helps users without knowledge of linked data and schemas to           fin...
Thanks for Your Attention!                                                                  Institute AIFB, KIT           ...
Searching Linked Data

Searching Linked Data - From Finding Relevant Sources to Computing Answers

Invited Presentation @ International Workshop on Scalable Semantic Computing, Hangzhou, China, November 2010.

Published in: Education, Technology
  • More complex information needs; more precise results; more integrated results.
  • So far, these requirements have proven to be a large burden. Given that the amount of linked data is large and continuously evolving, it is inherently difficult to know what is in there (i.e., the data and the schema) and to formulate the corresponding structured queries for addressing some given information needs. Hence, it is desirable to have a mechanism that allows users to express information needs in their own words. Another aspect of dealing with the large Web of linked data is scalability. Processing the needs against the entire Web might be too time-consuming and unnecessary, especially when users are interested in, and want to choose, some particular sources of information. Processing against a relevant subset of linked data identified by the user is more scalable and possibly the only practical solution for the large Web of linked data.
  • (Rank combinations of sources.) (Automatically process relevant combinations of sources.) Concerning these problems, the question we deal with is: given the needs expressed by users as sets of keywords, are there corresponding answers in linked data, and what combination of data sources shall be used to produce them? Further, the aim is not to directly compute results but to quickly identify, and let users and the system focus on, the combinations of sources that produce non-empty results. They recognized the fact that the computational complexity resulting from a large-scale setting can be partially addressed when allowing users to choose and retrieve answers from only some particular databases. Given a set of keywords, the goal is to find and rank the single most relevant databases that contain the answers. Following this line, we propose specific solutions for the linked data context. The differ-
  • Linked data can be conceived as a set of data graphs, each representing a particular source. As a working definition, we present a simple graph-based model of linked data called the Web graph. In that model, we distinguish between the Web data graph, representing relationships between individual data elements; the Web schema graph, which captures information about groups of elements; and the Web source graph, which contains information at the level of data sources. This is a simple model of linked data that omits details not necessary for this work. In particular, data elements may correspond to RDF resources, blank nodes or literals. Schema elements might stand for classes or data types. For keyword query routing, these distinctions are not relevant; what matters is that the elements can be recognized via their labels. While different kinds of links can be established, the ones frequently found are sameAs links, which denote that two RDF resources or two classes are the same. There is also no need to distinguish the types of links. Only the fact that sources can be reached via some kind of link $m \in M$ matters.
  • A valid plan in our example is RP = {Freebase, DBLP, DBPedia}. Note that validity does not imply relevance. That is, a valid plan ensures that results can be produced, but for the users, these results may differ in relevance. A proper account of relevance and the ranking of routing plans based on the relevance of their results go beyond the scope of this paper, which is focused on efficiency aspects of computing valid plans. We assume a fixed ranking function, which equally applies to all summaries discussed in this paper. We refer the interested readers to our report [8], which discusses relevance and the ranking function.
  • Keywords map against elements of the entire data web. Routing is simply based on coverage. Consider further factors for data source identification, i.e. characteristics of the data, the data sources and the links between them. Keyword query routing: keyword routing in a truly distributed setting, such that several data sources might be used to answer a set of keywords. Only the highly relevant data sources are selected to answer the user query.
  • Elements stand for all the keywords that are mentioned in elements of the graphs $G$. Every $n^{KS}_k \in N^{KS}_K$ is in fact a tuple $(k, G_k)$ that represents a keyword $k$ and the graphs $G_k \subseteq G$ mentioning $k$.
  • As opposed to E-KERG, this one is indeed a summary model because it clusters two element-level relationships $(\langle k_i, n^K_i(n_i, g_i, K_i)\rangle, \langle k_j, n^K_j(n_j, g_j, K_j)\rangle)$ and $(\langle k_v, n^K_v(n_v, g_v, K_v)\rangle, \langle k_w, n^K_w(n_w, g_w, K_w)\rangle)$ into one schema-level relationship when they capture the same keyword relationships (i.e., $k_i = k_v$ and $k_j = k_w$) between the same classes (i.e., $n'_i = n'_v$ and $n'_j =$
  • Intuitively speaking, this procedure simply retrieves sources that cover the keywords and, in order to cover all |K| query keywords, uses |K|-combinations of these sources as routing plans.
  • - Summary allows for routing plan computation that is complete but not sound, different complexities (see paper)
  • Values represent the average computed over all 30 queries. Using E-KERG, precision was up to 100 percent, i.e., for $d^{sum}_{max} = d^{data}_{max} = 4$. With P@5 always above 0.6 when dmax > 1, S-KERG and D-KERG also achieved relatively good results. P@5 for KS was only 6%. Clearly, dmax had a positive effect: more valid plans were computed when a higher value was used for dmax. However, using dmax = 4 instead of 3 did not yield clear improvement. Fig. 4b shows the effect of query length |K|. Quite clearly, queries with a larger number of keywords resulted in lower precision. It dropped as low as 0.23 when using D-KERG for queries with 5 keywords. KS is the model that produces only very few valid plans. This result was improved by one order of magnitude when relationships between keywords were used. The more fine-grained the relationships captured by a model, the larger the percentage of valid plans. Even a summary at the level of sources produced reasonably high-quality results, i.e., every second plan was a valid one.
  • Performance is measured as the average response time for computing routing plans. Fig. 5a shows the performance for queries at various settings using different values for dmax. This parameter had no effect on the KS results but clearly influenced the performance achieved with KERG summaries. Times increased with higher values for dmax. While this increase was sharp for E-KERG and S-KERG, the time performance of D-KERG was relatively stable. In particular, the time required by D-KERG was no more than 10ms on average. While the times shown are the actual times obtained for the other models, only the lower bound is shown for E-KERG, because we applied a timeout of 6min. Fig. 5c shows the exact times obtained for E-KERG and the queries that had to be aborted due to timeout. For dmax = 4, for instance, one out of every three queries was aborted. As expected, more time was needed as the number of query keywords increased, as illustrated in Fig. 5b. It seems that all models except D-KERG performed poorly on complex queries.
  • Process data as they come instead of blocking / waiting
  • In terms of total execution time, MI and BU are comparable, while TD is significantly faster in most cases. While TD incurs more overhead for the initial source selection because of the larger index, it enables the exclusion of sources. Due to the high network cost, not retrieving irrelevant sources results in a significant performance gain. Using only a partial index, MI is not able to restrict the number of sources that have to be retrieved. Even after the final result was reported, other relevant sources had to be processed, but did not contribute to the final result. This indicates that early result reporting, resulting in better responsiveness, is very important in some cases, where processing all sources might be very costly and not needed. Clearly, TD produced results earlier than MI, which was better than BU.
  • We presented a solution to the novel problem of keyword query routing. It helps users without knowledge of the evolving linked data and schema to find combinations of sources that contain answers corresponding to their needs. This solution also partially addresses the aspect of efficiency, as queries can then be evaluated against the relevant sources identified by the user, instead of using the entire Web of linked data. We have proposed a family of summary models. Through theoretical and experimental analysis, we showed that it is important to capture keyword relationships. Compared to the KS model, representing the naive baseline that stores only single keywords, the KERG models relying on relationships could produce a much
  • Searching Linked Data
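
The keyword-set (KS) baseline described in the notes, which retrieves the sources that cover each keyword and combines them into routing plans, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `ks_routing_plans`, the dictionary layout and the toy keyword sets are hypothetical and loosely modeled on the running "stanford article award" example, not taken from the authors' implementation.

```python
from itertools import product

# Hypothetical toy keyword sets: one set of distinct keywords per source.
KEYWORD_SETS = {
    "Freebase": {"stanford", "university", "john", "mccarthy"},
    "DBLP": {"article", "john", "mccarthy"},
    "DBPedia": {"award", "turing", "john", "mccarthy"},
}


def ks_routing_plans(keyword_sets, query):
    """Return candidate routing plans (sets of sources) that cover the query.

    For every query keyword, collect the sources mentioning it, then pick
    one source per keyword. Picks that use the same sources collapse into
    one plan. Connectivity between keywords is NOT checked, so the plans
    are only candidates and may turn out to be invalid.
    """
    per_keyword = []
    for keyword in query:
        sources = sorted(s for s, kws in keyword_sets.items() if keyword in kws)
        if not sources:
            return []  # some keyword is not covered by any source
        per_keyword.append(sources)
    plans = {frozenset(choice) for choice in product(*per_keyword)}
    # Deterministic order: smaller plans first, then alphabetical.
    return sorted(plans, key=lambda plan: (len(plan), sorted(plan)))


if __name__ == "__main__":
    for plan in ks_routing_plans(KEYWORD_SETS, ["stanford", "article", "award"]):
        print(sorted(plan))
```

Because this baseline only looks at coverage, it cannot tell whether the keyword elements are actually connected, which is consistent with the low share of valid plans reported for KS in the notes above.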

    1. Searching Linked Data: From Finding Relevant Sources to Computing Answers. Invited Presentation @ International Workshop on Scalable Semantic Computing, Hangzhou, China, November 2010. Thanh Tran, Günter Ladwig, Veli Bicer, Lei Zhang, Daniel Herzig, Yongtao Ma, Andreas Wagner, Rudi Studer, from AIFB Institute, KIT. Contact: Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu. KIT – University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association.
    2. Agenda. Searching Linked Data: opportunities & challenges. Keyword Query Routing: problem definition, summary models, experiments. Linked Data Query Processing: combining top-down & bottom-up, stream-based query processing, corrective source ranking. Conclusions.
    3. Linked Data. 203 linked datasets serve 25 billion RDF triples interconnected by 395 million links (as of 09-2010), plus other linked data not covered by the LOD cloud.
    4. Opportunities. Query: “Articles from awarded researchers at Stanford”. Freebase contains data about people; DBPedia contains information about awards; DBLP contains bibliographic data. This enables more complex information needs, more precise results and more integrated results.
    5. Problems. Query: “Articles from awarded researchers at Stanford”. Large number of unknown, unexplored & irrelevant sources! What is in there? What is out there? What is relevant? Formulating queries is a hard task (USABILITY): Which data sources? Which schema elements? Processing queries is expensive (SCALABILITY): Process against all data sources? Explore all links to other sources? Structured query: $(z).\ \exists x, y.\ prizes(x, \text{Turing Award}) \land worksAt(x, y) \land name(y, \text{Stanford}) \land publication(x, z)$
    6. Searching Linked Data. Given the needs (expressed as sets of keywords): are there answers in linked data? What combination of data sources produces them? How to incorporate related unexplored linked sources? Keyword Query Routing to Identify Relevant Linked Data Sources: identify valid combinations of sources; let the user choose a combination of sources; identify schema elements. Focused, Adaptive and Stream-based Linked Data Query Processing: focus on this combination of sources and explore related linked sources (cf. LARKC).
    8. LOD Data Graph. Web data modeled as a set of interlinked data graphs. Each data graph represents a source. Data graph vs. schema graph vs. source graph. [Figure: data graph spanning Freebase, DBLP and DBPedia; data elements uni1, per1 to per4, pub1 to pub3, prize1 and prize2 are connected by employ, author, prizes and sameAs edges and carry name, title and label literals such as "Stanford University", "John McCarthy" and "Turing Award".]
    9. LOD Schema Graph. Web data modeled as a set of interlinked data graphs. Each data graph represents a source. Data graph vs. schema graph vs. source graph. [Figure: schema graph spanning Freebase, DBLP and DBPedia; classes include University, Person, Written Work, Article, Author and Prize, connected by employ, author, prizes and sameAs edges.]
    10. LOD Source Graph. Web data modeled as a set of interlinked data graphs. Each data graph represents a source. Data graph vs. schema graph vs. source graph. [Figure: source graph with the nodes Freebase, DBLP and DBPedia connected by links; edge labels include author and sameAs.]
    11. Keyword Query Answers. User information need: „stanford article award“. [Figure: data graph spanning Freebase, DBLP and DBPedia for this query, with the elements and edges of slide 8 plus a type edge to Article.]
    12. Problem Definition. A keyword query result (also called a Steiner graph) is a subgraph of the data graph that contains, for every keyword, a matching data element (called a keyword element), and in which these elements are pairwise connected over a path. A d-max Steiner graph is a Steiner graph where the paths between keyword elements have length d-max or less. Keyword query routing: compute valid sets of data sources, called keyword routing plans. A plan is valid if the union of its sources produces non-empty keyword query results.
    13. A Valid Keyword Routing Plan. User information need: „stanford article award“. [Figure: data graph spanning Freebase, DBLP and DBPedia for this query, as on slide 11.]
    15. Keyword Sets. One keyword set for every data source. Elements stand for distinct keywords mentioned in a source. [Figure: the data graph of Freebase, DBLP and DBPedia with one keyword set per source, containing keywords such as Stanford, University, John, McCarthy, Turing, Award, Music and Smith.]
    16. Element-level Keyword-Element Relationship Graph (E-KERG). A keyword-element captures a keyword k and the data element mentioning k. A relationship between two keyword-elements exists iff there is a path between their associated data elements. In a d-max KERG, the paths to be considered have length d-max or less. [Figure: the data graph of Freebase, DBLP and DBPedia with keyword-elements attached to data elements such as uni1, per1 to per4, pub4, prize1 and prize2.]
    17. Schema-level Keyword-Element Relationship Graph (S-KERG). A keyword-element captures a keyword k and the schema element that contains some instances (data elements) mentioning k. A relationship between two keyword-elements exists if there is a path between some instances of their associated schema elements. Elements are grouped when they capture the same keyword; relationships are grouped when they hold between the same classes. [Figure: the E-KERG of slide 16 with keyword-elements grouped by classes such as University, Person, Author, Article and Prize.]
    18. Data-Source-level Keyword-Element Relationship Graph (D-KERG). A keyword-element captures a keyword k and the source that contains some instances (data elements) mentioning k. A relationship between two keyword-elements exists if there is a path between some instances of their associated sources. Elements are grouped when they capture the same keyword; relationships are grouped when they hold between the same sources. [Figure: the graph of slide 17, with keyword-elements grouped by the sources Freebase, DBLP and DBPedia.]
    20. Experiments. Chunk of the BTC dataset containing 10M RDF triples from 154 sources, linked via 500K mappings. 30 manually crafted, valid multi-data-source keyword queries, i.e., queries that produce non-empty keyword answers and involve more than 2 sources. Examples: "Town River America"; "Beijing Conference Database 2007".
    21. Validity. P@k measures the percentage of plans that are valid out of the top-k plans. P@5 for KS was only 6%; P@5 was up to 100% for E-KERG (dmax = 4). More valid plans were computed when a higher value was used for dmax. dmax = 3 seems to be a good tradeoff. Queries with a larger number of keywords resulted in lower precision. [Figure: two plots of P@5 (0.0 to 1.0) for E-KERG, S-KERG, D-KERG and KS, one against dmax (0 to 4) and one against |K| (2 to 5).]
    22. Performance. Times increased with higher values for dmax: sharply for E-KERG and S-KERG, relatively stable for D-KERG. Times increase with the number of keywords. All models except D-KERG performed poorly on complex queries. E-KERG needed more than 100s for queries with more than 2 keywords. Time for D-KERG was no more than 10ms on average. [Figure: two plots of query processing time (ms, logarithmic axis from 1 to 1000000) for S-KERG, D-KERG, KS and E-KERG, one against dmax (0 to 4) and one against |K| (2 to 5).]
    24. Mixed Query Processing Strategy. Combination of top-down and bottom-up strategies. Top-down: partial local index of sources, not assumed to be complete. Bottom-up: new sources are discovered at run-time. Corrective Source Ranking: deal with heterogeneous source descriptions; adaptive re-ranking. Stream-based Query Processing: deal with the unpredictable nature of Linked Data access. (Slide footer: ISWC 2010, Shanghai, China.)
    26. Stream-based Query Processing. Compile-time: construct the query plan; probe the local index for sources. Because of network latency, do not block: evaluation is driven by incoming data. Run-time: retrieve sources; push data into the query plan; discover new sources; rank sources. [Figure: architecture with a query plan of joins over the patterns worksAt(?x, dbpedia:KIT), knows(?x, ?y) and name(?y, ?n) that emits results; source retrievers that retrieve sources from Linked Data and push data into the plan; a source ranker holding scored sources (e.g. Source 1 with score 1.0, Source 2 with score 0.7) fed by discovered sources, samples and the local source index.]
    28. Corrective Source Ranking. Prefer more relevant sources. Relevancy of a source is based on the current query, any available intermediate results and the overall optimization goal. Define a set of source features and derive concrete source metrics; not all metrics are available for all sources (heterogeneity). Refine previously computed metrics using newly discovered information (intermediate results, samples).
    29. Evaluation. Three systems: top-down (TD), bottom-up (BU), mixed (MI). 8 queries over various datasets (DBpedia, Geonames, NYT). To make the approaches comparable, sources were restricted to those discoverable by the BU approach: ~6200 sources, containing ~500k triples. Sources were hosted on a local proxy server with an artificial delay of 2 seconds. 25% of sources were randomly chosen to construct the index for MI.
    30. Results. Overall early result reporting: 25% of results after 8.7s (MI) vs. 15.1s (BU); 50% of results after 12.8s (MI) vs. 22.0s (BU); an improvement of ~42%. Detailed results for two queries:

                          Query 1                       Query 6
                          BU       MI       TD          BU       MI       TD
        25% Results       24810.5  10300.0  11038.0     8222.5   4743.5   5545.0
        50% Results       43464.5  40782.0  15787.0     10961.5  7650.5   5634.0
        Total             84066.5  86895.5  44323.5     24086.0  20711.0  16469.0
        Src. Selection    0.0      853.0    1444.5      0.0      1331.0   1863.5
        Ranking           25.5     2404.0   411.5       23.5     292.5    335.0
    31. Result Arrival Times.
    33. Conclusions. Keyword query routing helps users without knowledge of linked data and schemas to find combinations of sources that contain answers corresponding to their needs, and lets processing focus on relevant combinations. Summarizing at the level of sources (D-KERG) represents the most practical trade-off: it produces results in less than 10ms, of which every second one was valid. Stream-based query processing helps to deal with the unpredictable nature of Linked Data. A corrective, mixed strategy that incorporates new sources and knowledge at run-time for optimization (source ranking) helped to report early results 42% faster on average.
    34. Thanks for Your Attention! Institute AIFB, KIT, ducthanh.tran@kit.edu
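
The validity notion from the Problem Definition slide (a plan is valid if the union of its sources yields a non-empty d-max Steiner graph) can be illustrated with a small brute-force check in Python. This is a minimal sketch under assumptions, not the summary-based computation presented in the talk: the names `hop_distances` and `is_valid_plan`, the treatment of links as undirected, and the toy graph wiring (which only approximates the running example) are all hypothetical.

```python
from collections import deque
from itertools import combinations, product

# Hypothetical toy data graph approximating the running example.
SOURCE_OF = {
    "uni1": "Freebase", "per2": "Freebase",
    "per1": "DBLP", "pub1": "DBLP",
    "per3": "DBPedia", "prize1": "DBPedia",
}
EDGES = [
    ("uni1", "per2"),    # employ
    ("per2", "per1"),    # sameAs (Freebase to DBLP)
    ("per1", "pub1"),    # author
    ("per1", "per3"),    # sameAs (DBLP to DBPedia)
    ("per3", "prize1"),  # prizes
]
KEYWORD_ELEMENTS = {"stanford": ["uni1"], "article": ["pub1"], "award": ["prize1"]}
QUERY = ["stanford", "article", "award"]


def hop_distances(start, adjacency, allowed):
    """Breadth-first search from start, visiting only nodes in allowed."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour in allowed and neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def is_valid_plan(plan, query, edges, source_of, keyword_elements, dmax):
    """True if the sources in plan contain one keyword element per keyword
    such that all chosen elements are pairwise within dmax hops."""
    allowed = {node for node, src in source_of.items() if src in plan}
    adjacency = {}
    for a, b in edges:  # assumption: links are traversable in both directions
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    candidates = [
        [e for e in keyword_elements.get(k, ()) if e in allowed] for k in query
    ]
    if any(not c for c in candidates):
        return False  # some keyword has no element inside the plan
    for choice in product(*candidates):
        if all(
            hop_distances(a, adjacency, allowed).get(b, dmax + 1) <= dmax
            for a, b in combinations(choice, 2)
        ):
            return True
    return False


if __name__ == "__main__":
    full_plan = {"Freebase", "DBLP", "DBPedia"}
    print(is_valid_plan(full_plan, QUERY, EDGES, SOURCE_OF, KEYWORD_ELEMENTS, 4))
```

In this toy graph the longest pairwise path (from uni1 to prize1) has four hops, so the three-source plan passes with dmax = 4 and fails with dmax = 3. Checking every plan this way on the full data graph is exactly the cost the KERG summaries are meant to avoid.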
