Summary Models for Routing Keywords to Linked Data Sources. Thanh Tran, Lei Zhang, Rudi Studer, AIFB Institute, KIT
Agenda Introduction      Opportunities & challenges      Contributions Problem Definition      LOD Data      Keyword...
Semantic Data    - 203 linked datasets serve 25 billion RDF triples interconnected by 395 million links    - As of 09-2010...
Opportunities         “Articles from awarded researchers at Stanford ”     Freebase contains data about people           ...
Problems         “Articles from awarded researchers at Stanford ”                                      Large number of un...
Keyword Query Routing     Given the needs expressed as sets of keywords,               are there “corresponding answers”...
Contributions     Introduce the novel problem of keyword query routing     Propose the multi-level relationship graph to...
Agenda Introduction      Opportunities & challenges      Contributions Problem Definition      LOD Data      Keyword...
LOD Element-level Graph     Web data modeled as a set of interlinked data graphs     Each data graph represent a source ...
LOD Schema-level Graph  Web data modeled as a set of interlinked data graphs  Each data graph represent a source  Eleme...
LOD Source-level Graph Web data modeled as a set of interlinked data graphs Each data graph represent a source Element-...
“Corresponding” Answers          User information need                                                 „stanford          ...
Problem Definition      Keyword query result (also called Steiner graph) is a       subgraph of the union of the data- an...
A Valid Keyword Routing Plan          User information need                                                 „stanford     ...
The Search Space           Multi-level inter-relationship graphs capture the entire search space           Relationships...
Agenda Introduction      Opportunities & challenges      Contributions Problem Definition      LOD Data      Keyword...
Keyword Sets      One keyword set for every data source      Elements stand for distinct keywords mentioned in a source ...
Element-level Keyword-Element Relationship Graph (E- KERG)           A keyword-element captures a keyword k and the data ...
Schema-level Keyword-Element Relationship Graph (S-KERG)           A keyword-element captures a keyword k and the schema ...
Data-Source-level Keyword-Element Relationship Graph (D-KERG)           A keyword-element captures a keyword k and the so...
Agenda Introduction      Opportunities & challenges      Contributions Problem Definition      LOD Data      Keyword...
Theoretical Results      When Steiner graphs can be found for K in the       data, then there will be keyword routing pla...
Experiments      Chunk of the BTC dataset containing 10M RDF       triples from 154 sources, linked via 500K mappings    ...
Validity                P@k measure the percentage of plans that are valid out of the top-k plans                P@5 up ...
Performance                Times increased with higher values for dmax                                       Sharp for E...
Conclusions      Keyword query routing helps users without knowledge of linked data       and schemas to find combination...
Thanks for Your Attention!                                                                  Institute AIFB, KIT           ...
  • More complex information needs; more precise results; more integrated results.
  • So far, these requirements have proven to be a large burden. Because the amount of linked data is large and continuously evolving, it is inherently difficult to know what is in there (i.e., the data and the schema) and to formulate the corresponding structured queries for a given information need. Hence, it is desirable to have a mechanism that allows users to express information needs in their own words. Another aspect of dealing with the large Web of linked data is scalability. Processing the needs against the entire Web might be too time-consuming and unnecessary, especially when users are interested in, and want to choose, particular sources of information. Processing against a relevant subset of linked data identified by the user is more scalable and possibly the only practical solution for the large Web of linked data.
  • (Rank combinations of sources.) (Automatically process relevant combinations of sources.) Concerning these problems, the question we deal with is: given the needs expressed by users as sets of keywords, are there corresponding answers in linked data, and what combination of data sources shall be used to produce them? Further, the aim is not to directly compute results but to quickly identify, and let users and the system focus on, the combinations of sources that produce non-empty results. Prior work recognized that the computational complexity resulting from a large-scale setting can be partially addressed by allowing users to choose and retrieve answers from only some particular databases. Given a set of keywords, the goal there is to find and rank the single most relevant databases that contain the answers. Following this line, we propose specific solutions for the linked data context. The differ…
  • This novel keyword query routing problem raises additional challenges. Most notably, query keywords may be covered by several linked sources, resulting in a large search space. The size of this search space grows exponentially with the number of sources and their associated links. Targeting this problem of scale, we report the following contributions in this paper:
      - We propose solutions for keyword query routing which enable the exploitation of linked data. Without putting any burden on the users, approaches of this kind help to find relevant sources containing complex answers to ad-hoc information needs in the large and evolving Web of linked data.
      - We propose a multi-level relationship graph to capture the search space of the keyword query routing problem. Based on this, we elaborate on a family of summary models, which compactly represent the Web of linked data. These models capture information at different levels, representing summaries of different granularities. In a theoretical analysis, we prove that finer-grained models can improve the result quality. This, however, comes at the expense of higher complexity. Thus, the models represent different trade-offs between effectiveness and efficiency.
      - In the experiments, we investigate these trade-offs by analyzing the precision and the processing time needed using different models. The experiments were carried out in a real-world setting using more than 150 publicly available datasets and an open-source implementation we made available at http://code.google.com/p/rdfstores/. Results of using summaries are promising. While the "best" one shall be determined w.r.t. a concrete application, there is one model that seems to represent the most practical trade-off: the D-KERG model, which summarizes elements according to sources, produces results in less than 10 ms, of which every second one is valid.
  • Linked data can be conceived as a set of data graphs, each representing a particular source. As a working definition, we present a simple graph-based model of linked data called the Web graph. In that model, we distinguish between the Web data graph, which represents relationships between individual data elements; the Web schema graph, which captures information about groups of elements; and the Web source graph, which contains information at the level of data sources. This is a simple model of linked data that omits details not necessary for this work. In particular, data elements may correspond to RDF resources, blank nodes or literals. Schema elements might stand for classes or data types. For keyword query routing, these distinctions are not relevant; what matters is that the elements can be recognized via their labels. While different kinds of links can be established, the ones frequently found are sameAs links, which denote that two RDF resources or two classes are the same. There is also no need to distinguish the types of links. Only the fact that sources can be reached via some kind of link $m \in M$ matters.
  • A valid plan in our example is $RP = \{\text{Freebase}, \text{DBLP}, \text{DBPedia}\}$. Note that validity does not imply relevance. That is, a valid plan ensures that results can be produced, but for the users, these results may differ in relevance. A proper account of relevance, and the ranking of routing plans based on the relevance of their results, go beyond the scope of this paper, which is focused on the efficiency aspects of computing valid plans. We assume a fixed ranking function, which applies equally to all summaries discussed in this paper. We refer interested readers to our report [8], which discusses relevance and the ranking function. In short: relevance is not considered; the focus is on efficiency.
  • - Keywords map against elements of the entire data Web.
    - Routing is simply based on coverage.
    - Consider further factors for data source identification, i.e., characteristics of the data, the data sources and the links between them.
    - Keyword query routing: keyword routing in a truly distributed setting, such that several data sources might be used to answer a set of keywords. Only the highly relevant data sources are selected to answer the user query.
  • Elements stand for all the keywords that are mentioned in elements of the graphs $G$. Every $n^{KS}_k \in N^{KS}_K$ is in fact a tuple $(k, G_k)$ that represents a keyword $k$ and the graphs $G_k \subseteq G$ mentioning $k$.
  • As opposed to E-KERG, this one is indeed a summary model because it clusters two element-level relationships $(\langle k_i, n^K_i(n_i, g_i, K_i)\rangle, \langle k_j, n^K_j(n_j, g_j, K_j)\rangle)$ and $(\langle k_v, n^K_v(n_v, g_v, K_v)\rangle, \langle k_w, n^K_w(n_w, g_w, K_w)\rangle)$ into one schema-level relationship when they capture the same keyword relationships (i.e., $k_i = k_v$ and $k_j = k_w$) between the same classes (i.e., $n'_i = n'_v$ and $n'_j = \dots$
  • Intuitively speaking, this procedure simply retrieves sources that cover the keywords and, in order to cover all $|K|$ query keywords, uses $|K|$-combinations of these sources as routing plans.
  • Valid plans (D-KERG) ≤ valid plans (S-KERG) ≤ valid plans (E-KERG). All plans are valid for E-KERG when d-max (summary) ≥ d-max (Steiner graph).
    This procedure is the same for all KERGs. Given that the underlying data contain results, we provide proofs in the report [8] to show that applying this procedure on the S-KERG summary will yield routing plans, i.e., when Steiner graphs can be found for $K$ in the data, then there will be corresponding graphs that can be found in the summary. Thus, given $K$, the procedure will output a non-empty set of $RP$ if $W$ contains a result for $K$. In the same manner, it is straightforward to show that E-KERG and D-KERG can provide this guarantee. However, we show formally in [8] that the converse does not hold, i.e., the graphs derived from the summary are not necessarily valid, so there might be no corresponding Steiner graph in the data. Thus, the fact that a routing plan can be derived from the summaries does not guarantee that there exists a result for $K$. This formal result is interesting because it makes clear that while the …
    In summary, the percentage of valid plans for D-KERG is less than or equal to that for S-KERG, which in turn is less than or equal to that for E-KERG. When the $d^{sum}_{max}$ value of E-KERG is sufficiently large to cover all paths relevant for Steiner graph computation, i.e., $d^{sum}_{max} = d^{data}_{max}$, this percentage is 100 for E-KERG. By chance, the percentage of valid plans for KS might be higher than that for the summary models, but in general it is expected to be lower (because relationships between elements are not considered).
    Compared to the KERG models, KS does not capture relationships between keywords at all. Given two keywords $k_i, k_j$, the sources which cover these keywords can be derived from KS, e.g. the graphs $n''_i, n''_j$. However, this does not imply that there exist two elements $n_i \in n''_i$ and $n_j \in n''_j$ with $n_i \rightarrow n_j$. More generally, a combination of sources derived from KS covers all keywords but does not ensure that elements matching these keywords are connected, and thus does not necessarily correspond to a Steiner graph.
  • Values represent the average computed over all 30 queries. Using E-KERG, precision was up to 100 percent, i.e., for $d^{sum}_{max} = d^{data}_{max} = 4$. With P@5 always above 0.6 when $d_{max} > 1$, S-KERG and D-KERG also achieved relatively good results. P@5 for KS was only 6%. Clearly, $d_{max}$ had a positive effect: more valid plans were computed when a higher value was used for $d_{max}$. However, using $d_{max} = 4$ instead of 3 did not yield a clear improvement. Fig. 4b shows the effect of query length $|K|$. Quite clearly, queries with a larger number of keywords resulted in lower precision. It dropped as low as 0.23 when using D-KERG for queries with 5 keywords. KS is the model that produces only very few valid plans. This result was improved by one order of magnitude when relationships between keywords were used. The more fine-grained a model captures the relationships, the larger was the percentage of valid plans. Even a summary at the level of sources produced reasonably high-quality results, i.e., every second plan was a valid one.
  • Performance is measured as the average response time for computing routing plans. Fig. 5a shows the performance for queries at various settings using different values for $d_{max}$. This parameter had no effect on the KS results but clearly influenced the performance achieved with KERG summaries. Times increased with higher values for $d_{max}$. While this increase was sharp for E-KERG and S-KERG, the time performance of D-KERG was relatively stable. In particular, the time required by D-KERG was no more than 10 ms on average. While the times shown are the actual times obtained for the other models, only the lower bound is shown for E-KERG, because we applied a timeout of 6 min. Fig. 5c shows the exact times obtained for E-KERG and the queries that had to be aborted due to timeout. For $d_{max} = 4$, for instance, one out of every three queries was aborted. As expected, more time was needed when the number of query keywords increased, as illustrated in Fig. 5b. It seems that all models except D-KERG performed poorly on complex queries.
  • We presented a solution to the novel problem of keyword query routing. It helps users without knowledge of the evolving linked data and schema to find combinations of sources that contain answers corresponding to their needs. This solution also partially addresses the aspect of efficiency, as queries can then be evaluated against the relevant sources identified by the user instead of using the entire Web of linked data. We have proposed a family of summary models. Through theoretical and experimental analysis, we showed that it is important to capture keyword relationships. Compared to the KS model, representing the naive baseline that stores only single keywords, the KERG models relying on relationships could produce a much …
  • Summary Models for Routing Keywords to Linked Data Sources

    1. Summary Models for Routing Keywords to Linked Data Sources. Thanh Tran, Lei Zhang, Rudi Studer, AIFB Institute, KIT (ducthanh.tran@kit.edu). KIT – University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association.
    2. Agenda: Introduction (opportunities & challenges; contributions). Problem Definition (LOD data; keyword query answer; keyword query routing). Summary Models (keyword sets; element-level vs. schema-level vs. source-level summary; validity of results vs. complexity). Theoretical / Experimental Results. Conclusions.
    3. Semantic Data: 203 linked datasets serve 25 billion RDF triples interconnected by 395 million links (as of 09-2010), plus other data (e.g. LON, ontologies, RDFa), and increasing rapidly.
    4. Opportunities: "Articles from awarded researchers at Stanford". Freebase contains data about people; DBPedia contains information about awards; DBLP contains bibliographic data. Linked data thus supports more complex information needs, more precise results and more integrated results.
    5. Problems: "Articles from awarded researchers at Stanford". Large number of unknown & irrelevant sources! What is in there? What is relevant? Formulating queries is a hard task (USABILITY): Which data sources? Which schema elements? Processing queries is expensive (SCALABILITY): Process against all data sources? Example structured query: $(z).\ \exists x, y.\ \mathit{prizes}(x, \text{Turing Award}) \land \mathit{worksAt}(x, y) \land \mathit{name}(y, \text{Stanford}) \land \mathit{publication}(x, z)$
    6. Keyword Query Routing: Given the needs expressed as sets of keywords, are there "corresponding answers" in linked data, and what combination of data sources can be used to produce them? For usability: identify valid combinations of sources using keywords; present schema elements for the user to formulate a query. For scalability: let the user choose a combination of sources; process only relevant combinations of sources.
    7. Contributions: Introduce the novel problem of keyword query routing. Propose the multi-level relationship graph to capture its search space. Introduce various summary models, which aim to compactly represent the search space. Investigate the resulting trade-offs between result quality and efficiency through theoretical analysis and practical experiments using publicly available linked data sources.
    8. Agenda: Introduction (opportunities & challenges; contributions). Problem Definition (LOD data; keyword query answer; keyword query routing). Summary Models (keyword sets; element-level vs. schema-level vs. source-level summary; validity of results vs. complexity). Theoretical / Experimental Results. Conclusions.
    9. LOD Element-level Graph: Web data modeled as a set of interlinked data graphs. Each data graph represents a source. Element-level graph vs. schema-level graph vs. source-level graph. [Figure: element-level example graph spanning Freebase, DBLP and DBPedia; university, person, publication and prize elements are connected by employ, author, prizes and sameAs edges and labeled via name, title and label.]
    10. LOD Schema-level Graph: Web data modeled as a set of interlinked data graphs. Each data graph represents a source. Element-level graph vs. schema-level graph vs. source-level graph. [Figure: schema-level example graph spanning Freebase, DBLP and DBPedia; classes such as University, Person, Article, Written Work, Author and Prize are connected by employ, author, prizes and sameAs edges.]
    11. LOD Source-level Graph: Web data modeled as a set of interlinked data graphs. Each data graph represents a source. Element-level graph vs. schema-level graph vs. source-level graph. [Figure: source-level example graph in which Freebase, DBLP and DBPedia are connected by edges labeled author and sameAs.]
    12. "Corresponding" Answers: User information need: "stanford article award". [Figure: the element-level example graph over Freebase, DBLP and DBPedia, showing the elements that match the query keywords.]
    13. Problem Definition: A keyword query result (also called a Steiner graph) is a subgraph of the union of the data- and schema-level graphs that contains a matching element for every keyword, with these elements pairwise connected over a path. A d-max Steiner graph is a Steiner graph in which every path between keyword elements has length d-max or less. Keyword query routing: compute valid sets of data sources, called keyword routing plans. A plan is valid if its sources produce non-empty keyword query results.
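The d-max condition from the problem definition can be illustrated with a small connectivity check. This is a minimal sketch, not the authors' algorithm: the toy graph, the element names (`uni1`, `per2`, ...) and the helper names are assumptions loosely modeled on the running example, edges are treated as undirected, and shortest paths are measured in the whole graph rather than inside a candidate subgraph.

```python
from collections import deque
from itertools import combinations


def undirected(edges):
    """Build an adjacency map from (u, v) pairs, ignoring edge direction."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def hop_distance(adj, start, goal):
    """Breadth-first search: number of edges on a shortest path, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def connects_within(adj, keyword_elements, d_max):
    """True if every pair of keyword elements is joined by a path of length <= d_max."""
    for a, b in combinations(keyword_elements, 2):
        d = hop_distance(adj, a, b)
        if d is None or d > d_max:
            return False
    return True


# Hypothetical toy graph (assumed, for illustration only).
TOY_EDGES = [
    ("uni1", "per2"),    # employ (Freebase)
    ("per2", "per1"),    # sameAs (Freebase <-> DBLP)
    ("per1", "pub1"),    # author (DBLP)
    ("per1", "per3"),    # sameAs (DBLP <-> DBPedia)
    ("per3", "prize1"),  # prizes (DBPedia)
]
adj = undirected(TOY_EDGES)
hits = ["uni1", "pub1", "prize1"]  # elements matching "stanford", "article", "award"

print(connects_within(adj, hits, 4))
print(connects_within(adj, hits, 3))
```

In this toy graph the longest shortest path between keyword elements (from `uni1` to `prize1`) has four edges, so the check succeeds for a bound of 4 and fails for a bound of 3.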
    14. A Valid Keyword Routing Plan: User information need: "stanford article award". [Figure: the element-level example graph over Freebase, DBLP and DBPedia, highlighting a keyword query result that spans the three sources.]
    15. The Search Space: Multi-level inter-relationship graphs capture the entire search space: relationships between elements and between different levels. The search space is too large! The naive solution, applying existing approaches to keyword search for computing Steiner graphs, is not applicable: Steiner graphs might span several linked sources, and the search space grows exponentially with the number of sources and their associated links.
    16. Agenda: Introduction (opportunities & challenges; contributions). Problem Definition (LOD data; keyword query answer; keyword query routing). Summary Models (keyword sets; element-level vs. schema-level vs. source-level KERG; validity of results vs. complexity). Theoretical / Experimental Results. Conclusions.
    17. Keyword Sets: One keyword set for every data source. Elements stand for distinct keywords mentioned in a source. [Figure: the example graph over Freebase, DBLP and DBPedia, with one keyword set per source containing keywords such as Stanford, University, John, McCarthy, Smith, Music, Turing and Award.]
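A keyword-set (KS) baseline of this kind can be sketched as an inverted index from keywords to sources, with routing plans formed by picking one covering source per query keyword. This is an illustrative sketch under assumptions: the labels, the whitespace tokenization and the function names are hypothetical, and the plan ordering is arbitrary rather than the ranking function used by the authors.

```python
from itertools import product


def build_keyword_sets(source_labels):
    """Map each keyword to the set of sources whose element labels mention it."""
    index = {}
    for source, labels in source_labels.items():
        for label in labels:
            for token in label.lower().split():
                index.setdefault(token, set()).add(source)
    return index


def ks_routing_plans(index, query):
    """Pick one covering source per keyword; each distinct source set is a plan."""
    candidates = [sorted(index.get(k.lower(), ())) for k in query]
    if any(not c for c in candidates):
        return []  # some keyword is not covered by any source
    plans = {frozenset(choice) for choice in product(*candidates)}
    # Deterministic but arbitrary order: smaller plans first.
    return sorted(plans, key=lambda p: (len(p), sorted(p)))


# Hypothetical labels per source (assumed, for illustration only).
source_labels = {
    "Freebase": ["Stanford University", "John McCarthy"],
    "DBLP": ["John McCarthy", "Article"],
    "DBPedia": ["John McCarthy", "Turing Award", "Music Award"],
}
index = build_keyword_sets(source_labels)

plans_a = ks_routing_plans(index, ["stanford", "article", "award"])
plans_b = ks_routing_plans(index, ["john", "award"])
print(plans_a)
print(plans_b)
```

Because the index records only coverage, a plan such as Freebase plus DBPedia for "john award" is returned even if the matching elements are not connected, which is why KS plans are often not valid.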
    18. Element-level Keyword-Element Relationship Graph (E-KERG): A keyword-element captures a keyword k and the data element mentioning k. A relationship between two keyword-elements exists iff there is a path between their associated data elements. In a d-max KERG, the paths to be considered have length d-max or less. [Figure: the example graph over Freebase, DBLP and DBPedia, with keyword-elements that pair keywords (e.g. Stanford, John, McCarthy, Award) with data elements (e.g. uni1, per1, per2, per3, prize1).]
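The construction of a d-max E-KERG can be sketched with a bounded breadth-first search from every element that mentions a keyword. This is a minimal sketch under assumptions: the toy graph, the `mentions` map and the function names are hypothetical, edges are treated as undirected, and only relationships between distinct data elements are recorded.

```python
from collections import deque


def within_hops(adj, start, d_max):
    """Return {element: distance} for all elements within d_max hops of start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == d_max:
            continue  # do not expand beyond the distance bound
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def build_e_kerg(adj, mentions, d_max):
    """mentions maps element -> set of keywords; returns (nodes, edges).

    A node is a keyword-element (keyword, element). An edge is an unordered
    pair of keyword-elements whose data elements are at most d_max hops apart.
    """
    nodes = [(k, e) for e, kws in mentions.items() for k in kws]
    edges = set()
    for e1 in mentions:
        reach = within_hops(adj, e1, d_max)
        for e2 in mentions:
            if e1 != e2 and e2 in reach:
                for k1 in mentions[e1]:
                    for k2 in mentions[e2]:
                        edges.add(frozenset({(k1, e1), (k2, e2)}))
    return nodes, edges


# Hypothetical toy graph as an undirected adjacency map (assumed).
adj = {
    "uni1": {"per2"},
    "per2": {"uni1", "per1"},
    "per1": {"per2", "pub1", "per3"},
    "pub1": {"per1"},
    "per3": {"per1", "prize1"},
    "prize1": {"per3"},
}
mentions = {"uni1": {"stanford"}, "pub1": {"article"}, "prize1": {"award"}}

nodes3, edges3 = build_e_kerg(adj, mentions, 3)
nodes4, edges4 = build_e_kerg(adj, mentions, 4)
print(len(edges3), len(edges4))
```

Raising the bound from 3 to 4 adds the relationship between ("stanford", uni1) and ("award", prize1), which mirrors why larger d-max values capture more relationships at a higher cost.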
    19. Schema-level Keyword-Element Relationship Graph (S-KERG): A keyword-element captures a keyword k and the schema element which contains some instances (data elements) mentioning k. A relationship between two keyword-elements exists if there is a path between some instances of their associated schema elements. Groups elements (relationships) when they capture the same pair of keywords in the same class (the same keyword relationships between the same pair of classes). [Figure: the example graph over Freebase, DBLP and DBPedia, with keyword-elements grouped by classes such as University, Person, Author, Article and Prize.]
    20. Data-Source-level Keyword-Element Relationship Graph (D-KERG): A keyword-element captures a keyword k and the source which contains some instances (data elements) mentioning k. A relationship between two keyword-elements exists if there is a path between some instances of their associated sources. Groups elements (relationships) when they capture the same pair of keywords in the same source (the same keyword relationships between the same pair of sources). [Figure: the example graph over Freebase, DBLP and DBPedia, with keyword-elements grouped by source.]
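Source-level summarization can be sketched by replacing each data element in an element-level relationship with its source and merging duplicates; routing plans are then combinations of source-level keyword-elements, one per query keyword, that are pairwise related in the summary. This is a sketch under assumptions, not the authors' implementation: the input relationships, the `source_of` map and the function names are hypothetical.

```python
from itertools import combinations, product


def summarize_to_sources(element_edges, source_of):
    """Collapse element-level relationships into source-level ones."""
    summary = set()
    for (k1, e1), (k2, e2) in element_edges:
        pair = tuple(sorted([(k1, source_of[e1]), (k2, source_of[e2])]))
        summary.add(pair)
    return summary


def plans_from_summary(summary, query):
    """Source sets whose keyword-elements are pairwise related in the summary."""
    by_keyword = {}
    for pair in summary:
        for kw, src in pair:
            by_keyword.setdefault(kw, set()).add((kw, src))
    candidates = [sorted(by_keyword.get(k, ())) for k in query]
    plans = set()
    for choice in product(*candidates):
        related = all(
            tuple(sorted((a, b))) in summary for a, b in combinations(choice, 2)
        )
        if related:
            plans.add(frozenset(src for _, src in choice))
    return plans


# Hypothetical element-level relationships and element-to-source map (assumed).
element_edges = [
    (("stanford", "uni1"), ("article", "pub1")),
    (("article", "pub1"), ("award", "prize1")),
    (("stanford", "uni1"), ("award", "prize1")),
    (("article", "pub2"), ("award", "prize2")),  # merges with the second edge
]
source_of = {
    "uni1": "Freebase",
    "pub1": "DBLP",
    "pub2": "DBLP",
    "prize1": "DBPedia",
    "prize2": "DBPedia",
}

summary = summarize_to_sources(element_edges, source_of)
plans = plans_from_summary(summary, ["stanford", "article", "award"])
print(len(summary), plans)
```

Four element-level relationships shrink to three source-level ones here. The summary no longer records which elements were connected, which is why a plan derived from it is not guaranteed to be valid in the data.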
    21. Agenda: Introduction (opportunities & challenges; contributions). Problem Definition (LOD data; keyword query answer; keyword query routing). Summary Models (keyword sets; element-level vs. schema-level vs. source-level KERG; validity of results vs. complexity). Theoretical / Experimental Results. Conclusions.
    22. Theoretical Results: When Steiner graphs can be found for K in the data, then there will be a keyword routing plan that can be found in the KERG. The keyword routing plans derived from the summary are not necessarily valid, i.e., there might be no corresponding Steiner graph in the data. Detailed results, algorithms and complexity results are in the paper!
    23. Experiments: Chunk of the BTC dataset containing 10M RDF triples from 154 sources, linked via 500K mappings. 30 manually crafted, valid multi-data-source keyword queries, i.e., queries that produce non-empty keyword answers and involve more than 2 sources. Examples: "Town River America"; "Beijing Conference Database 2007".
    24. Validity: P@k measures the percentage of plans that are valid out of the top-k plans. P@5 up to 100% for E-KERG (dmax = 4); P@5 for KS only 6%. More valid plans were computed when a higher value was used for dmax. dmax = 3 seems to be a good trade-off. Queries with a larger number of keywords resulted in lower precision. [Figures: P@5 vs. dmax and P@5 vs. |K| for E-KERG, S-KERG, D-KERG and KS.]
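The P@k measure can be written down in a few lines. This is a sketch under assumptions: the plan names and the validity judgments are made up for illustration, and when fewer than k plans are returned the sketch divides by the number of returned plans, which the slides do not specify.

```python
def precision_at_k(ranked_plans, is_valid, k):
    """Fraction of the top-k ranked plans that are valid."""
    top = ranked_plans[:k]
    if not top:
        return 0.0
    return sum(1 for plan in top if is_valid(plan)) / len(top)


def mean_precision_at_k(runs, k):
    """Average P@k over (ranked_plans, is_valid) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(precision_at_k(plans, check, k) for plans, check in runs) / len(runs)


# Hypothetical ranked plans and validity judgments (assumed).
ranked = ["plan_a", "plan_b", "plan_c", "plan_d", "plan_e", "plan_f"]
valid = {"plan_a", "plan_c", "plan_f"}

p5 = precision_at_k(ranked, valid.__contains__, 5)
avg = mean_precision_at_k(
    [(ranked, valid.__contains__), (["plan_a"], valid.__contains__)], 5
)
print(p5, avg)
```

Only two of the first five plans are valid in this example, so P@5 is 0.4; the valid plan at rank six does not count.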
    25. Performance: Times increased with higher values for dmax: sharply for E-KERG and S-KERG, relatively stable for D-KERG. Times increase with the number of keywords. All models except D-KERG had poor performance w.r.t. complex queries. E-KERG needed more than 100 s for queries with more than 2 keywords. Time for D-KERG was no more than 10 ms on average. [Figures: query processing time (ms) vs. dmax and vs. |K| for S-KERG, D-KERG, KS and E-KERG.]
    26. Conclusions: Keyword query routing helps users without knowledge of linked data and schemas to find combinations of sources that contain answers corresponding to their needs. Summarizing relationships is essential for dealing with the large-scale linked data Web (E-KERG achieved poor performance, requiring more than 100 s for complex queries). Summarizing at the level of sources (D-KERG) represents the most practical trade-off, producing results in less than 10 ms, of which every second one was valid. However, validity is still low for complex queries (<30% with 4 keywords). These are baseline approaches for a novel problem. Further improve validity and consider relevance! Combine keyword query routing with source and structured query processing to compute final results!
    27. Thanks for Your Attention! Institute AIFB, KIT, ducthanh.tran@kit.edu
