Web Queries: From a Web of Data to a Semantic Web
Upcoming SlideShare
Loading in...5
×
 

Like this? Share it with your network

Share

Web Queries: From a Web of Data to a Semantic Web

on

  • 1,873 views

Keyword-based Web query languages are one significant effort towards combining the virtues of Web search, viz. being accessible to untrained users and able to cope with vastly heterogeneous data, with ...

Keyword-based Web query languages are one significant effort towards combining the virtues of Web search, viz. being accessible to untrained users and able to cope with vastly heterogeneous data, with those of database-style Web queries. These languages operate essentially in the same setting as XQuery or SPARQL but with an interface for untrained users.

Keyword-based query languages trade some of the precision that languages like XQuery enable by allowing to formulate exactly what data to select and how to process it, for an easier interface accessible to untrained users. The yardstick for these languages becomes an easily accessible interface that does not sacrifice the essential premise of database-style Web queries, that selection and construction are precise enough to fully automate data processing tasks.

To ground the discussion of keyword-based query languages, we give a summary of what we perceive as the main contributions of research and development on Web query languages in the past decade. This summary focuses specifically on what sets Web query languages apart from their predecessors for databases.

Further, this tutorial (1) gives an overview over keyword-based query languages for XML and RDF (2) discusses where the existing approaches succeed and what, in our opinion, are the most glaring open issues, and (3) where, beyond keyword-based query languages, we see the need, the challenges, and the opportunities for combining the ease of use of Web search with the virtues of Web queries.

Statistics

Views

Total Views
1,873
Views on SlideShare
1,852
Embed Views
21

Actions

Likes
0
Downloads
47
Comments
0

3 Embeds 21

http://www.pms.ifi.lmu.de 12
http://www.techgig.com 7
http://www.slideshare.net 2

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment
  • Web search as Google or Yahoo!
  • discovering information among the vast amount available on the Web, operate on (nearly) all the Web <br /> operate on all kinds of Web documents at the same time
  • very simple interface (usually just a bag of words), many casual users are familiar <br /> limited to &#xFB01;ltering & ranking relevant documents (for human consumption) <br /> search request are often rather vaguely related to the search intent and, at best, a ranking can be provided <br /> highly parallel (and thus scalable) evaluation of Web searches <br /> requires a human in the loop to ultimately judge the relevance of a search result
  • that&#x2019;s search, but what about &#x201C;querying&#x201D;
  • allow automated processing, aggregation, and deduction of data
  • allow automated processing, aggregation, and deduction of data
  • allow automated processing, aggregation, and deduction of data
  • really? If you believe that I have some bank stocks
  • database-style queries on Web (mostly XML or RDF) data <br /> operate on Web data formats such as XML and RDF, the presumptive foundation for the Semantic Web <br /> (a small set of ) documents, operate on small fraction of the Web&#x2BC;s data <br /> require knowledge of the language itself as well as of the data to be queried <br /> require signi&#xFB01;cant training to be employed effectively, comparable to a programming task <br /> but: allow automated processing, aggregation, and deduction of data <br /> precise selection of data items in Web documents <br /> scale not better than traditional SQL database technology, and thus are clearly unable to process signi&#xFB01;cant subsets of all Web data
  • database-style queries on Web (mostly XML or RDF) data <br /> operate on Web data formats such as XML and RDF, the presumptive foundation for the Semantic Web <br /> (a small set of ) documents, operate on small fraction of the Web&#x2BC;s data <br /> require knowledge of the language itself as well as of the data to be queried <br /> require signi&#xFB01;cant training to be employed effectively, comparable to a programming task <br /> but: allow automated processing, aggregation, and deduction of data <br /> precise selection of data items in Web documents <br /> scale not better than traditional SQL database technology, and thus are clearly unable to process signi&#xFB01;cant subsets of all Web data
  • database-style queries on Web (mostly XML or RDF) data <br /> operate on Web data formats such as XML and RDF, the presumptive foundation for the Semantic Web <br /> (a small set of ) documents, operate on small fraction of the Web&#x2BC;s data <br /> require knowledge of the language itself as well as of the data to be queried <br /> require signi&#xFB01;cant training to be employed effectively, comparable to a programming task <br /> but: allow automated processing, aggregation, and deduction of data <br /> precise selection of data items in Web documents <br /> scale not better than traditional SQL database technology, and thus are clearly unable to process signi&#xFB01;cant subsets of all Web data
  • sort by score
  • hierarchical databases have been there before? <br /> -> what&#x2019;s new? declarative access/query languages
  • XPath axes and their efficient evaluation already are and will remain one of the chief outcomes of XML reasearch
  • trotz hoher Funktionalit&#xE4;t, einfache aber m&#xE4;chtig <br /> & = combiniert mit
  • example of composition is still an easy one, hard (but too long for this talk) are construction of intermediary trees
  • Dec 2008/Jan 2009 spike in XQuery news: release of new XQuery 1.1 working draft?
  • Dec 2008/Jan 2009 spike in XQuery news: release of new XQuery 1.1 working draft?
  • Dec 2008/Jan 2009 spike in XQuery news: release of new XQuery 1.1 working draft?
  • @TODO: &#x201C;query/results&#x201D; ?

Web Queries: From a Web of Data to a Semantic Web Presentation Transcript

  • 1. Web Queries From a Web of Data to a Semantic Web François Bry, Tim Furche, Klara Weiand based on work in the European project KiWi “Knowledge in a Wiki” 1
  • 2. 2
  • 3. 3
  • 4. keyword search ranked collection of documents 4
  • 5. ? 5
  • 6. ? but what does Web querying do? 5
  • 7. 1 100 200 300 400 500 Share value: 27.40$ 6
  • 8. 1 100 200 300 400 500 Share value: 27.40$ Automatically sell at < 28.00$ Action! 6
  • 9. ? 7
  • 10. ? if you believe that I have some bank stocks for you ... 7
  • 11. or query: value of specific element Share value: 27.40$ 8
  • 12. or query: value of specific element Share value: 27.40$ rule: automatically sell at < 28.00$ Action! 8
  • 13. Web Search Web Queries Scale “Web”, TB/PBs a few “documents”, GBs for basic selection lang. high, Parallelizability very high (NC) otherwise very low independent documents, trees (XML) or graphs (RDF), Data heterogenous homogenous (almost) everyone, many casual Used by few experts users Expertise Level/Knowledge low very high required Expressiveness very low very high Result presentation Ranking, clustering... programmed Matching often fuzzy or vague only precise answers Documents (or summaries Return value parts of trees or graphs thereof) Actionability very low very high 9
  • 14. web search Expressive Power Ease of use web queries In a nutshell... 10
  • 15. Web Queries Search + Query = Easy + Automation Goals: make web querying accessible to casual users e.g., to enable data aggregation/mashups by users allow precise queries over vastly heterogeneous data precise queries and rules critical to the (Social) Semantic Web unless they are accessible no widespread adoption Classification of approaches to combining Web search & query enhance search: add data extraction / object search enhance querying: add keyword search / information retrieval keyword-based QLs for structured data: grounds-up redesign 11
  • 16. Search + Query Approach 1: Enhance Search “Peek” into web documents Extract data items / Web objects from web documents to provide more fine-grained answers Examples Google (Squared) Google Rich Snippets Yahoo! Search Monkey 12
  • 17. Search + Query Approach 1: Enhance Search “Peek” into web documents Extract data items / Web objects from web documents to provide more fine-grained answers Examples Google (Squared) Google Rich Snippets Yahoo! Search Monkey 12
  • 18. Search + Query Approach 2: Enhance Querying Enhance existing (XML) web query languages by adding information retrieval functionality ranking, scoring, fuzzy matching Examples XQuery and XPath Full Text 1.0, W3C Cand. Recommendation 13
  • 19. Search + Query Approach 2: Enhance Querying Enhance existing (XML) web query languages by adding information retrieval functionality ranking, scoring, fuzzy matching Examples XQuery and XPath Full Text 1.0, W3C Cand. Recommendation 1 doc('bib.xml')/bib/book[title ftcontains 'programming'] 2 for where $b score $s in doc('bib.xml')/bib/book $b ftcontains 'web' ftand 'query' order by $s descending return <title> {$b//title} </title> 13
  • 20. Search + Query Approach 3: Keyword Queries Keyword query for structured Web (XML and RDF) data apply web search paradigm to querying tree and graph data operates in the same setting as e.g. XQuery and SPARQL few or single document, no Web scale but: easier handling of very heterogeneous data querying of semi-structured data through easy-to-use interface Ultimate goal: Easy usage combined with enough power → to automate data processing tasks 14
  • 21. Search + Query Approach 3: Keyword Queries Keyword query for structured Web (XML and RDF) data apply web search paradigm to querying tree and graph data operates in the same setting as e.g. XQuery and SPARQL few or single document, no Web scale K. Weiand, T. Furche, and F. Bry. Quo vadis, web queries? but: easier handling of very heterogeneous data In Web4Web, 2008. querying of semi-structured data through easy-to-use interface Ultimate goal: Easy usage combined with enough power → to automate data processing tasks Focus of this tutorial 14
  • 22. Web Queries Overview 1. Summary of web query research in the 00s 1.1. XML 1.2. RDF 2. Keyword query languages 2.1. Motivation 2.2. Classification 2.3. Issues 3. KWQL 4. Discussion and Outlook 15
  • 23. 16
  • 24. Part 1 Web Queries Summary of XML & RDF Query Language Research 17
  • 25. 1.1 XML Tree Data & Tree Queries Data–XPath–XQuery 18
  • 26. nted of that element. The following listing shows a small XML . Fi- fragment that illustrates elements and element nesting: s an <bib xmlns:dc="http://purl.org/dc/elements/1.1/"> n of 2 <article journal="Computer Journal" id="12"> -E). <dc:title>...Semantic Web...</dc:title> how 4 <year>2005</year> <authors> m to 6 <author> the <first>John</first> <last>Doe</last> </author> Web 8 <author> ches <first>Mary</first> <last>Smith</last> </author> most 10 </authors> </article> ased 12 <article journal="Web Journal"> the <dc:title>...Web...</dc:title> arch 14 <year>2003</year> <authors> 16 <author> <first>Peter</first> <last>Jones</last> </author> 18 <author> <first>Sue</first> <last>Robinson</last> </author> tion 20 </authors> XML </article> eral. 22 </bib> lica- ML, In addition, we can observe attributes (name, value (sometimes graph), mostly uniform edges ordered tree pairs and associated with start tags) that are essentially like elements rop- but may only contain character data, no other nested 19 122] attributes or elements. Also, by definition, element order is tion significant, attribute order is not. For instance
  • 27. ery languages ble interfaces we can write other elements or character data as children implemented of that element. The following listing shows a small XML ion VI-D). Fi- fragment that illustrates elements and element nesting: ries not as an <bib xmlns:dc="http://purl.org/dc/elements/1.1/"> r extension of 2 <article journal="Computer Journal" id="12"> Section VI-E). <dc:title>...Semantic Web...</dc:title> mary of how 4 <year>2005</year> <authors> d RDF aim to 6 <author> !"# ther with the <first>John</first> <last>Doe</last> </author> aditional Web 8 <author> $%$ g approaches <first>Mary</first> <last>Smith</last> </author> , are the most 10 </authors> </article> eyword-based 12 <article journal="Web Journal"> nges, and the <dc:title>...Web...</dc:title> of Web search 14 <year>2003</year> <authors> 16 <author> !&# !"-#. ND RDF <first>Peter</first> <last>Jones</last> </author> 18 <author> '()%*+, '()%*+, <first>Sue</first> <last>Robinson</last> </author> epresentation 20 </authors> </article> ata in general. 22 </bib> er of applica- kup (XHTML, In addition, we can observe attributes (name, value pairs G 7 [141]) and associated with start tags) that are essentially like elements (Apple’s prop- !-# but may only contain character data, no other nested !3# !5#. !"&# !"3# !"5# !"/#. !&-# nd XSLT [122] attributes or elements. Also, by definition, element order is r serialization significant, attribute order is not. For instance )%)+, 4,'( '0)12(6 720(8'+ )%)+, 4,'( '0)12(6 720(8'+ rmats such as <author><last>Doe</last><first>John</first></author> describing the represents different information than the author element rText Markup in lines 6–9, but e on the web, <article id="12" journal="Computer are fixed. XML Journal">...</article>!!!"#$%&'()*" !/#. !9#. !":# !&=#. by specifying 4556 7.%89($2"-.92'&: !!!+$,!!! 455> +$,"-.92'&: represents the same element+$,"!!! item as lines 2–15. information Figure 1 gives a graphical representation of the XML doc- '0)12( '0)12( '0)12( '0)12( ation in XML ument that is referenced in preceding illustrations. When et [68] which represented as a graph, an XML document without links is ML document. a labeled tree where each node in the tree corresponds to parts, closely an element and its type. Edges connect nodes and their children, that is, elements and the elements nested in el, we provide ates from the them, elements and their content and elements and their !:# !<# !"=# !""# !"<# !"9#. !&"# !&&# attributes. Since the visual distinction between the parent- ved and thus ped. This view child relationship can be made without edge labels and ;(6) +'6) ;(6) +'6) ;(6) +'6) ;(6) +'6) since attributes are not addressed or receive no special uch as Xcerpt treatment in the research presented in this text, edges will follow XPath not be labeled in the following figures. Elements, attributes, and character data are XML’s most XML is a syn- common information types. In addition, XML documents ems are called may also contain comments, processing instructions (name- end tags, both -./' 0.$ 1&23 #%)(/ ;$($2 -.'$< #9$ =.,)'<.' XML value pair with specific semantics that can be placed any- r>...</author> where an element can be placed), document level informa- place of ‘. . . ’, tion (such as the XML or the document type declarations), entities, and notations, which are essentially just other kinds of information containers. Fig. 1. Visual representation of sample XML document ordered tree (sometimes graph), mostly uniform edges On top of these information types, two additional fa- B. Resource Description Framework (RDF) cilities relevant to the information content of XML docu- 19 As the second preeminent data format on the S ments are introduced by subsequent specifications: Names- Web, the Resource Description Format (RDF) [109 paces [33] and Base URIs [140]. Namespaces allow the
  • 28. ? what’s new about trees? didn’t we try that before? 20
  • 29. r ce sto an self parent child preceding-sibling following-sibling fol ing lo win d ce g pre descendant XPath axis for navigation/selection in tree, context, horizontal axes for order 21
  • 30. XPath: Navigation in Trees Intro in 5 Points Data: rooted, ordered, unranked, finite trees (no ID/IDREF resolution) Paths are sequences of steps (axis & test on properties of node) adorned with existential predicates (in []) to obtain tree queries some more advanced features: value joins, aggregation, … Answers are sets of nodes from the input document /child::html/descendant::h1[not(preceding::h1  =   “Introduction”)]/child::p[attribute::class=”abstract”] no variables, no construction, no grouping/ordering 22
  • 31. ￿ axis ￿Nodes (n) = {(n ￿ : R axis (n, n ￿ )} are conc ￿ λ ￿Nodes (n) = {(n ￿ : Labλ (n ￿ )} widely su ￿ node() ￿Nodes (n) = Nodes(T ) tension o ￿ axis::nt[qual] ￿Nodes (n)= {n ￿ : n ￿ ∈ ￿ axis ￿Nodes ∧ n ￿ ∈ ￿ nt ￿Nodes ∧ in XQue ￿ qual ￿Bool (n ￿ )} in XQue ￿ step/path ￿Nodes (n) = {n ￿￿ : n ￿ ∈ ￿ step ￿Nodes (n)∧ which ca n ￿￿ ∈ ￿ path ￿Nodes (n ￿ )} Indeed, ￿ path1 ∪ path2 ￿Nodes (n) = ￿ path1 ￿Nodes (n) ∪ ￿ path2 ￿Nodes (n) normaliz h) X ￿ path ￿Bool (n) = ￿ path ￿Nodes (n) ￿ ￿ sion of X use cases ￿ path1 ∧ path2 ￿Bool (n) = ￿ path1 ￿Bool (n) ∧ ￿ path2 ￿Bool (n) not only ￿ path1 ∨ path2 ￿Bool (n) = ￿ path1 ￿Bool (n) ∨ ￿ path2 ￿Bool (n) The mos ￿ ¬path ￿Bool (n) = ¬ ￿ path ￿Bool (n) 1) Sequ ￿ lab() = λ ￿Bool (n) = Labλ (n) expr ￿ path1 = path2 ￿Bool (n) = ∃n ￿ , n ￿￿ : n ￿ ∈ ￿ path1 ￿Nodes (n)∧ sequ n ￿￿ ∈ ￿ path2 ￿Nodes (n)∧ ￿ (n ￿ , n ￿￿ ) from 23 trast TABLE I of n
  • 32. XPath: Navigation in Trees A Success Story Commercial success: One of the most successful QLs widely implemented and used, several W3C standards Research success: designed with little concern for formal “beauty” but with few restrictions: turns out it hit a formal sweet spot Expressiveness: monadic datalog (datalog with only unary intensional predicates, i.e., answer predicates) also: two-variable first-order logic (FO2) Polynomial complexity, linear for navigational XPath 24
  • 33. XPath: Navigation in Trees A Success Story (cont.) Completeness: falls short of being first-order complete for first-order completeness a “conditional axis” (UNTIL operator) needed e.g., all a nodes reachable by a path of only b nodes but: partially justified by results on elimination of reverse axes reverse axis: parent, ancestor, preceeding, … eliminating these axes possible in navigational XPath though in few cases at exponential cost or resulting in introduction of expensive node-identity joins we can safely ignore them in most cases in conditional XPath this elimination is not possible 25
  • 34. XPath: Navigation in Trees A Success Story (cont.) Completeness: falls short of being first-order complete for first-order completeness a “conditional axis” (UNTIL operator) needed XPath, the M. Marx. Conditional First Order Complete XPath e.g., all a nodes reachable by a path of only b nodes Dialect. PODS 2004. but: partially justified by results on elimination of reverse axes reverse axis: parent, ancestor, preceeding, … eliminating these axes possible in navigational XPath D. Olteanu, H. Meuss, T. Furche, and F. Bry. XPath: Looking Forward. though in few cases at exponential cost or resulting in XMLDM @ EDBT, LNCS 2490, 2002. introduction of expensive node-identity joins we can safely ignore them in most cases in conditional XPath this elimination is not possible C. Ley and M. Benedikt. How big must complete XML query languages be? ICDT 2009. 25
  • 35. Following (x, y ) :⇔ x <pre y ∧ x <post y From these axes, all others can be defined in first-order logic. pre- and post-orders are sufficient to represent the full tree XPath: Navigation in Trees structure. A Success Story (cont.) Evaluation: naïve decomposition: Structural Joins sequence of joins, descendant with closure over child relation Computing descendants by a single theta-join on the better: structural joins for descendant (single lookup using tree labeling) representation relation. e.g., pre/post encoding R pre post lab a:1:7 1 7 a fits nicely with relational storage 2 3 b b:2:3 a:5:6 3 1 a 4 2 c 5 6 a a:3:1 c:4:2 b:6:4 d:7:5 6 4 b 7 5 d twig joins: stack-based, holistic, not easily adapted to relational DBS Example CREATE VIEW descendant AS SELECT r1.pre, r2.pre FROM R r1, R r2 WHERE r1.pre < r2.pre AND r2.post < r1.post; 26 Such joins are called structural joins [Al-Khalifa et al., 2002].
  • 36. XPath: Navigation in Trees A Success Story (cont.) 1 simple, fairly easy to learn 2 formal “sweet” spot 3 efficient evaluation 27
  • 37. XPath: Navigation in Trees A Success Story (cont.) 1 simple, fairly easy to learn M. Benedikt and C. Koch. 2 formal “sweet” spot XPath Leashed. In ACM Computing Surveys, 2007 3 efficient evaluation 27
  • 38. ? that’s it? everyone happy? 28
  • 39. let  $auction  :=  doc(’auction.xml’)   for  $o  in  $auction/open_auctions/open_auction   return   <corrupt  id=’{  $o/@id  }’>   {  if  (sum($o/(initial  |  bidder/increase))  =  $o/current)   then  text  {  ’no’  }   else  $auction//people/person[@id  =  $o//@person]/name  }   </corrupt>   29
  • 40. XQuery: Graph Queries & Construction From XPath to Hell Adds an enormous set of features to XPath price: highly complex language, Turing-complete, very hard to implement here: just some highlights Variables: XPath plus variables → no longer FO2 “an article that has the conference’s chair as author” not any more polynomial, but NP-/PSPACE-complete Construction: harmless if only at end of query but: composition (i.e., querying of data constructed in the same query) “find all papers written by the author with the most papers” NEXPTIME-complete or in EXPSPACE (can not be captured by datalog) 30
  • 41. XQuery: Graph Queries & Construction From XPath to Hell Recursion: programmable with composition → Turing complete i.e., like Prolog or datalog with value invention Sequences: rather than sets without composition little effect with composition: drastically harder optimization as iteration semantics of a query must be precisely observed 31
  • 42. s type [α] (Haskell), Array (Ruby) or 0 Q3 iter pos item for y in properly reflect this ble<T > (Linq). To 2001 to 2008 loop(s0 ) loop(s1 ) πiter:outer, Operator 1 1 ’WD/CR/PR’ Seman return herently unordered relational database s1 iter iter pos:pos1 , item 1 2 ’WD/CR/PR’ . . . . . . if ( y lt 2007) we embed order in the data and use 1 ’REC’ 1 ￿pos1 :￿sort,pos￿πa:b. . . project ···else . σa 1 8 ’REC’ select r then ’WD/CR/PR’ elseon the bles with columns pos|item (shown ’REC’ iter pos item . ✶ 7 1 ’REC’ . iter=inner × Cartes present such sequences. Note that the 8 1 ’REC’ . s need not be dense and not even be of 8∪ ✶a=b , ￿a=b . equi-jo (a) Query Q3 . ordered domain will do. In specific cases (b) Associated loop tables. @item:’REC’ @item:’WD/CR/PR’ ∪, ∪, (disjoin may already reflected by the items xi @pos:1 @pos:1 δ elimina ···then f a sequence of encoded nodes resulting πiter πiter @a:c pos ’WD/···’ iter item attach on step evaluation)—column itemand its W3C Recommendation Figure 3: may σitem σitem1 #a 1 1 ’WD/CR/PR’ 2 1 ’WD/CR/PR’ attach track maturity level (Query Q3 ). Annotations ¬s0,1 e of pos. ￿a:￿b1 ,..,bn ￿ . . . . . . attach item1 :￿item￿ . . . y lt 2007 ￿ 6 1 ’WD/CR/PR’ attach Query denote semantics is principally dynamic iteration scopes. #a:￿b1 ,..,bn ￿/c iter pos item πiter,pos,item:item2 π￿a:￿b1 ,..,bn ￿ attach 1 1 true as the core language construct: any 2 1 true . . . ￿item2 :￿item1 ,item￿ outer:iter, sidered (Linq) . . . Enumerable.Range(2001,8).Select( to be iteratively evaluated in the agga:b/c inner, sort:pos attach . . . ✶ most enclosing for loop. The top level of ? ’WD/CR/PR’ : ’REC’ ) y => y < 2007 8 1 false iter1 =iter ssumed (Links) to be wrapped inside the scope π :iter, for (y <- [2001,2002,. . .,2008])iter11 :item @item:2007 Table 1: Excerpt of item gle-iteration loop for _ in (0) return e [if (y < 2007) then "WD/CR/PR" else "REC"] occur free in e (i.e., the choice of 0 is y @pos:1 @pos:1 (with agg ∈ {count 2007 iter pos item iter pos item (Ruby) ndamental idea behind(2001..2008).collect { 2001 loop lifting is to 1 1 πiter:inner, πiter:inner 1 1 2007 item 2 1 2007 |y|a y < 2007 ? ’WD/CR/PR’ : ’REC’ } ode that consumes and emits “fully un- 2 1 2002 . . . . . . . . . . . . . . . . . . esentation of e’s value. if y unrolling then "WD/CR/PR" else "REC" | #inner (Haskell) [ Here, < 2007 8 1 2008 gramming languag 8 1 2007 le that y <- [2001..2008] ] 2001 to 2008 iter pos item set-oriented algebr 1 1 2001 ble with schema iter|pos|item holds the if y < 2007 else ’REC’ (Python) [ ’WD/CR/PR’ 1 2 2002 . . . work for database- for y in range(2001,2008) ] ues that e assumes during its iterative . . . . 1 . . 8 2008 not stumble if pro 32 stances [13]. Figure 4: A sample of iterative constructs found in able has key ￿iter, pos￿ since e may yield ’s companion languages (paraphrases of Q3 ). code to evaluatecode. s in each distinct iteration—only if e’s Figure 5: Loop-lifted algebraic Algebraic Q3 W
  • 43. s type [α] (Haskell), Array (Ruby) or 0 Q3 iter pos item for y in properly reflect this ble<T > (Linq). To 2001 to 2008 loop(s0 ) loop(s1 ) πiter:outer, Operator 1 1 ’WD/CR/PR’ Seman return herently unordered relational database s1 iter iter pos:pos1 , item 1 2 ’WD/CR/PR’ . . . . . . if ( y lt 2007) we embed order in the data and use 1 ’REC’ 1 ￿pos1 :￿sort,pos￿πa:b. . . project ···else . σa 1 8 ’REC’ select r then ’WD/CR/PR’ elseon the bles with columns pos|item (shown ’REC’ iter pos item . ✶ 7 1 ’REC’ . iter=inner × Cartes present such sequences. Note that the 8 1 ’REC’ . s need not be dense and not even be of 8∪ ✶a=b , ￿a=b . equi-jo (a) Query Q3 . ordered domain will do. In specific cases (b) Associated loop tables. @item:’REC’ @item:’WD/CR/PR’ ∪, ∪, (disjoin may already reflected by the items xi @pos:1 @pos:1 δ elimina ···then f a sequence of encoded nodes resulting πiter πiter @a:c pos ’WD/···’ iter item attach on step evaluation)—column itemand its W3C Recommendation Figure 3: may σitem σitem1 #a 1 1 ’WD/CR/PR’ 2 1 ’WD/CR/PR’ attach track maturity level (Query Q3 ). Annotations ¬s0,1 e of pos. ￿a:￿b1 ,..,bn ￿ . . . . . . attach . . item1 :￿item￿ . y lt 2007 ￿ 6 1 ’WD/CR/PR’ attach Query denote semantics is principally dynamic iteration scopes. #a:￿b1 ,..,bn ￿/c iter pos item πiter,pos,item:item2 π￿a:￿b1 ,..,bn ￿ attach 1 1 true as the core language construct: any 2 1 true . . . ￿item2 :￿item1 ,item￿ outer:iter, sidered (Linq) . . . Enumerable.Range(2001,8).Select( to be iteratively evaluated in the agga:b/c inner, sort:pos attach . . . ✶ most enclosing for loop. The top level of ? ’WD/CR/PR’ : ’REC’ ) y => y < 2007 8 1 false iter1 =iter ssumed (Links) to be wrapped inside the scope π :iter, for (y <- [2001,2002,. . .,2008])iter11 :item @item:2007 Table 1: Excerpt of item gle-iteration loop for _ in (0) return e [if (y < 2007) then "WD/CR/PR" else "REC"] occur free in e (i.e., the choice of 0 is y @pos:1 @pos:1 (with agg ∈ {count 2007 iter pos item iter pos item (Ruby) ndamental idea behind(2001..2008).collect { 2001 loop lifting is to 1 1 πiter:inner, πiter:inner 1 1 2007 item 2 1 2007 |y|a y < 2007 ? ’WD/CR/PR’ : ’REC’ } ode that consumes and emits “fully un- 2 1 2002 . . . . . . . . . . . . . . . . . . esentation of e’s value. if y unrolling then "WD/CR/PR" else "REC" | #inner (Haskell) [ Here, < 2007 8 1 2008 gramming languag 8 1 2007 le that P. Boncz, T. Grust, M. van Keulen, S. 2001 to 2008 y <- [2001..2008] ] set-oriented algebr Manegold, J. Rittinger, J. Teubner. iter pos item 1 1 2001 MonetDB/XQuery: ’WD/CR/PR’ (Python) [ A Fast XQuery ble with schema iter|pos|item holds the if y < 2007 else ’REC’ 1 2 2002 work for database- . . . Processor Poweredy in range(2001,2008) ] for by a ues that e assumes during its iterative . . . . . . not stumble if pro Relational Engine, SIGMOD 2006. 1 8 2008 32 stances [13]. Figure 4: A sample of iterative constructs found in able has key ￿iter, pos￿ since e may yield ’s companion languages (paraphrases of Q3 ). code to evaluatecode. s in each distinct iteration—only if e’s Figure 5: Loop-lifted algebraic Algebraic Q3 W
  • 44. XQuery: Graph Queries & Construction From XPath to Hell 1 significantly harder than XPath composition & recursion 2 expensive operations yet significant steps toward 3 fast implementations 33
  • 45. XQuery: Graph Queries & Construction From XPath to Hell 1 significantly harder than XPath composition & recursion M. Benedikt and C. Koch. 2 expensive operations Interpreting Tree-to-Tree Queries. ICALP 2006. yet significant steps toward 3 fast implementations 33
  • 46. XQuery: Graph Queries & Construction From XPath to Hell 1 significantly harder than XPath composition & recursion M. Benedikt and C. Koch. 2 expensive operations Interpreting Tree-to-Tree Queries. ICALP 2006. P. Boncz, T. Grust, et al. yet significant steps toward 3 fast implementations MonetDB/XQuery: A Fast XQuery Processor Powered by a Relational Engine, SIGMOD 2006. 33
  • 47. XQuery: Graph Queries & Construction From XPath to Hell 1 significantly harder than XPath composition & recursion M. Benedikt and C. Koch. 2 expensive operations Interpreting Tree-to-Tree Queries. ICALP 2006. P. Boncz, T. Grust, et al. yet significant steps toward 3 fast implementations MonetDB/XQuery: A Fast XQuery Processor Powered by a Relational Engine, SIGMOD 2006. 33
  • 48. 34
  • 49. 1.2 RDF Graph Data for the Web 35
  • 50. Legend Other Smith Class Literal Resource Person last Mary type first Computer 11 Bag Journal Person number name Journal type _2 type type _1 first John 2005 isPartOf author year last smith2005 type Article Doe RDF title ...Semantic Web... unordered, unranked, unrooted finite graph, edge and node labeled, different node kinds 36
  • 51. ? what’s new about that? isn’t that just a fancy repr. of a ternary relation? 37
  • 52. RDF: Semantics for the Web? What’s new about RDF? data model of RDF very similar to relational but: “unique” identifiers in form of URIs that can be shared on the Web uniqueness only thanks to careful assignment practice (unlike GUIDs) human-readable (unlike GUIDs) but: existential information in form of blank nodes like named null values in SQL, also known as Codd tables but: implied information due to RDF/S semantics (and, thus, entailment) e.g., class hierarchy and typing, domain and range of properties, axiomatic triples notion of redundancy-free or lean graph/answer 38
  • 53. bib:hasPart.5 queries are rang (CONSTRUCT or S CONSTRUCT { ?j bib:hasPart ?a } (WHERE clause) o 2 WHERE { ?a rdf:type bib:Article AND ?a bib:isPartOf ?j AND ?j bib:name ‘Computer Journal’ } expressions (in c FILTER expressio The query illustrates SPARQLs fundamental query con- condition must struct: a pattern (s, p, o) for RDF triples (whose components first limitation i are usually thought of as subject, predicate, object). Any allowed in stand RDF triple is also a triple pattern, but triple patterns allow priori and rewritt variables for each component. Furthermore, SPARQL also (as FILTER expre allows literals in subject position, anticipating the same which, in turn, change also in RDF itself. We use the variant syntax for boolean value” i SPARQL discussed in [161] to ease the definition of syntax Finally, we a and semantics of the language. For instance, standard CONSTRUCT clause SPARQL, uses . instead of AND for triple conjunction. We variables occurri consider two forms of SPARQL queries, viz. SELECT queries literals, and all v that return list of variable bindings and CONSTRUCT queries only ever bound that return new RDF graphs. Triple patterns contained in a The first conditio CONSTRUCT clause (or “template”) are instantiated with the adding appropria variable bindings provided by the evaluation of the triple query body. pattern in the WHERE clause. We omit named graphs and Following [16 assume that all queries are on the single input graph. An queries based
  • 54. ￿ (s, p, o) ￿D Subst = {θ : dom(θ) = Vars((s, p, o)) ∧ t θ ∈ D} ￿ pattern1 AND pattern2 ￿D Subst = ￿ pattern1 ￿D ￿ ￿ pattern2 ￿D Subst Subst ￿ pattern1 UNION pattern2 ￿D Subst = ￿ pattern1 ￿D ∪ ￿ pattern2 ￿D Subst Subst ￿ pattern1 MINUS pattern2 ￿D Subst = ￿ pattern1 ￿D ￿ pattern2 ￿D Subst Subst ￿ pattern1 OPT pattern2 ￿D Subst = ￿ pattern1 ￿D ￿ ￿ pattern2 ￿D Subst Subst ￿ pattern FILTER condition ￿D Subst = {θ ∈ ￿ pattern ￿D : Vars(condition) ⊂ dom(θ) Subst ∧ ￿ condition ￿D (θ)} Bool ￿ condition1 ∧ condition2 ￿D (θ) Bool = ￿ condition1 ￿D (θ) ∧ ￿ condition2 ￿D (θ) Bool Bool ￿ condition1 ∨ condition2 ￿D (θ) Bool = ￿ condition1 ￿D (θ) ∨ ￿ condition2 ￿D (θ) Bool Bool ￿ ¬condition ￿D (θ) Bool = ¬ ￿ condition ￿D (θ) Bool ￿ BOUND(?v) ￿D (θ) Bool = vθ ￿ nil ￿ isLITERAL(?v) ￿D (θ) Bool = vθ ∈ L ￿ isIRI(?v) ￿D (θ) Bool = vθ ∈ I ￿ isBLANK(?v) ￿D (θ) Bool = vθ ∈ B ￿ ?v = literal ￿D (θ) Bool = vθ = literal ￿ ?u = ?v ￿D (θ) Bool = uθ = vθ ∧ uθ ￿ nil ￿ triple ￿D (θ) Graph = tripleθ if ∀v ∈ Vars(triple) : vθ ￿ nil, ￿ otherwise ￿ template1 AND template2 ￿D (θ) = ￿ template1 ￿D (θ) ∪ ￿ template2 ￿D (θ) Graph Graph Graph ￿ ￿ CONSTRUCT t WHERE p ￿D = ￿ std(t ) ￿D (θ) θ∈￿ P ￿D Subst Graph ￿ SELECT V WHERE p ￿D = πV (￿ P ￿D ) Subst TABLE V S EMANTICS FOR SPARQL
  • 55. RDF: Semantics for the Web? Simple RDF QL? SPARQL: Simple Protocol and RDF Query Language really simple? same expressiveness as full relational algebra, PSPACE-complete but: no composition, no order → simpler than full SQL or XQuery really an RDF query language? blank node construction only limited (no quantifier alternation) talks vaguely about extension to entailment regimes but no support in SPARQL as defined no support for lean answers (justified partially by high computational cost) 40
  • 56. RDF: Semantics for the Web? SPARQL: Simple RDF QL? Complexity: SPARQL with only AND, FILTER, and UNION: NP-complete i.e., no computationally interesting subset of full relational SPJU queries full SPARQL: PSPACE-complete again no computationally interesting subset of full first-order queries why? due to negation “hidden” in OPTIONAL reduction from 3SAT using isBound to encode negation lacks completeness w.r.t. RDF transformations: relational algebra: can express all PSPACE-transformations on relations SPARQL: fails at the same for RDF 41
  • 57. RDF: Semantics for the Web? SPARQL: Simple RDF QL? J. Perez, M. Arenas, and Complexity: C. Gutierrez. Semantics and Complexity of SPARQL. SPARQL with only AND, FILTER, and UNION: NP-complete ISWC 2006 i.e., no computationally interesting subset of full relational SPJU queries full SPARQL: PSPACE-complete again no computationally interesting subset of full first-order queries why? due to negation “hidden” in OPTIONAL reduction from 3SAT using isBound to encode negation lacks completeness w.r.t. RDF transformations: relational algebra: can express all PSPACE-transformations on relations SPARQL: fails at the same for RDF 41
  • 58. Select all persons and return one blank node connected to all these persons using “member” Mary John member Joe member member Not expressible in SPARQL SPARQL limitations of blank node construction Why not? no quantifier alternation
  • 59. Select all persons and return one blank node connected to all these persons using “member” Mary John member Joe member member T. Furche, C. Ley, B. Linse, B. Marnette, and O. F. Bry, Poppe. SPARQLog: SPARQL with Rules and Quantification. In: Semantic Web Information Management: A Model-based Perspective, 2009. Not expressible in SPARQL SPARQL limitations of blank node construction Why not? no quantifier alternation
  • 60. RDF: Semantics for the Web? SPARQL: Future doesn’t hit a sweet spot (like XPath) at least from a formal, database perspective contributions from database community very rare yet a significant improvement over most previous RDF QLs already forms basis for future QL research on RDF rule extensions for RDF “From SPARQL to Rules” (Polleres, WWW 2007), Networked Graphs (Schenk et al, WWW 2008) no or limited blank node support but: with rules, restricted quantification as in SPARQL (∃∀) as expressive as unrestricted quantifier alternation 43
  • 61. RDF: Semantics for the Web? SPARQL: Future doesn’t hit a sweet spot (like XPath) at least from a formal, database perspective contributions from database community very rare yet a significant improvement over most previous RDF QLs already forms basis for future QL research on RDF rule extensions for RDF “From SPARQL to Rules” (Polleres, WWW 2007), Networked Graphs (Schenk et al, WWW 2008) no or limited blank node support but: with rules, restricted quantification as in SPARQL (∃∀) as expressive as unrestricted quantifier alternation F. Bry, T. Furche, C. Ley, B. Linse, B. Marnette, and O. Poppe. SPARQLog: SPARQL with Rules and Quantification. In: Semantic Web Information Management: A Model-based Perspective, 2009. 43
  • 62. RDF: Semantics for the Web? SPARQL: Summary 1 syntax for first-order queries on RDF foundation for research but little 2 innovation in the language itself 3 lackluster support for RDF specifics 44
  • 63. RDF: Semantics for the Web? SPARQL: Summary 1 syntax for first-order queries on RDF J. Perez, M. Arenas, and foundation for research but little C. Gutierrez. Semantics and 2 innovation in the language itself Complexity of SPARQL. ISWC 2006 3 lackluster support for RDF specifics 44
  • 64. 45
  • 65. Part 2 Keyword Queries Classification & main issues 46
  • 66. 47
  • 67. Queries as Keywords Main Characteristics Queries are (mostly) unstructured bags of words Used in general purpose web search engines but also elsewhere (Amazon, Facebook,...) Implicit conjunctive semantics (with limitations) Often combined with IR techniques ranking, fuzzy matching Research focuses on application to semi-structured data general structured data (Web objects & tables, relations) 48
  • 68. Queries as Keywords Why Keyword Qs for Structured Data Success in other areas Users are already familiarized with the paradigm Allow casual users to query structured data without having deep knowledge of the query language the structure & schema of the data Enable querying of heterogeneous data 49
  • 69. Queries as Keywords Classification of Keyword QLs Data type XML RDF Implementation stand-alone systems translation to conventional query language keyword-enhanced query languages 50
  • 70. Queries as Keywords Classification of Keyword QLs Complexity of atomic queries keyword-only label-keyword keyword-enhanced Querying of elements Values Node labels Edge labels 51
  • 71. Queries as Keywords XML Keyword QLs Stand-alone Enhancement Keyword-only 9 0 Label-keyword 2 0 Keyword-enhanced -- 3 52
  • 72. Queries as Keywords XML Keyword QLs K. Weiand, T. Furche, and F. Bry. Quo vadis, web queries? In Web4Web, 2008. Stand-alone Enhancement Keyword-only 9 0 Label-keyword 2 0 Keyword-enhanced -- 3 52
  • 73. Queries as Keywords RDF Keyword QLs Stand-alone Translation Keyword-only 2 2 Label-keyword 0 1 Keyword-enhanced -- -- 53
  • 74. Queries as Keywords RDF Keyword QLs K. Weiand, T. Furche, and F. Bry. Quo vadis, web queries? In Web4Web, 2008. Stand-alone Translation Keyword-only 2 2 Label-keyword 0 1 Keyword-enhanced -- -- 53
  • 75. Queries as Keywords XML Keyword QLs More Popular Time first XML keyword query languages are as old as RDF Familiarity with XML Complexity of RDF Graph-shaped Labeled Edges Blank nodes 54
  • 76. Queries as Keywords XML Keyword QLs More Popular Time first XML keyword query languages are as old as RDF Familiarity with XML Complexity of RDF Graph-shaped Labeled Edges Blank nodes 54
  • 77. Queries as Keywords Focus Issues 1. Grouping keyword matches 2. Determining answer representations 3. Expressive power 4. Ranking 5. Limitations 55
  • 78. 2.1 Computing Query Answers Turning keyword matches into answers 56
  • 79. Computing Query Answers Problem Setting Query K yields match lists L1 ... L|K| Answer sets S1 … Sm are constructed from L may contain either only one match per match list (m = |K|) or several (m ≥ |K|) Data-centric vs. document-centric XML 57
  • 80. Computing Query Answers K ={Smith, web}, Problem Setting L1 ={11}, L 2 ={3,23}, S1 ={11,3}, S 2 ={11,23} (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 58
  • 81. Computing Query Answers Problem Setting Entire XML document usually too big to serve as a good query answer Matched nodes alone usually not informative Meaningful results a smaller unit has to be found 59
  • 82. Computing Query Answers Approaches 1. Determine entities based on the schema only keyword matches within one entity or apply LCA at entity-level done manually or using schema partitioning with cardinality as criterion Lowest Entity Node, Minimal Information Unit, XSeek... 2. Determine entities by connecting keyword matches Answer entity: contains (at least) one match for each keyword 60
  • 83. Computing Query Answers Connecting Keyword Matches Classic concept: Lowest Common Ancestor (LCA) Use LCA to find the maximally specific concept that is common to the keyword matches LCA: Lowest node that is ancestor to all elements in an answer set 61
  • 84. Computing Query Answers Lowest Common Ancestor (LCA) LCA(S1={3,11}) = 2 (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 62
  • 85. Computing Query Answers Lowest Common Ancestor (LCA) LCA(S2={11,23}) = 1 — False positive (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 63
  • 86. 2.1.1 Improving LCA Approaches to improve LCA for computing answer entities 64
  • 87. Computing Query Answers Lowest Common Ancestor (LCA) Many approaches to improve over LCA SLCA, MLCA, Interconnection Semantics, VLCA, Amoeba Join... Filter LCAs to avoid false positives Result is subset of LCA result 65
  • 88. Improving LCA Overview of Approaches 1. Interconnection relationship 2. XKSearch: Smallest LCA 3. Meaningful LCA (MLCA) 4. Amoeba Join 5. (Valuable LCA (VLCA), Compact LCA, CVLCA) 6. (Relaxed Tightest Fragment (RTF)) 7. XRank: Exclusive LCA 66
  • 89. Improving LCA Interconnection Relationship Idea: Two different nodes with the same label correspond to different entities of the same type. Two nodes are interconnected if, on the shortest path between them, every node label occurs only once Interconnection in answer sets: star-related: a node in Si interconnected with all nodes in Si all-pairs related: all nodes in Si pair-wise interconnected 67
  • 90. Improving LCA Interconnection Relationship Idea: Two different nodes with the same label correspond to different entities of the same type. Two nodes are interconnected if, on the shortest path between them, every node label occurs only once Interconnection in answer sets: star-related: a node in Si interconnected with all nodes in Si all-pairs related: all nodes in Si pair-wise interconnected S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv. XSearch: A Semantic Search Engine for XML. VLDB 2003 67
  • 91. Improving LCA Interconnection Relationship (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 68
  • 92. Improving LCA Interconnection Relationship (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 69
  • 93. Improving LCA Interconnection Relationship For |K| = 2, star-related ≡ all-pairs related Consider S1={3, 8, 11} 70
  • 94. Improving LCA Interconnection Relationship 3 and 8 are interconnected (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 71
  • 95. Improving LCA Interconnection Relationship ...so are 3 and 11 (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 72
  • 96. Improving LCA Interconnection Relationship ...but 8 and 11 are not (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 73
  • 97. Improving LCA Interconnection Relationship S1={3, 8, 11} is star-related but not all-pairs related False negative when using all-pairs relatedness Consider S1={3,7,9} 74
  • 98. Improving LCA Interconnection Relationship (1) bib (2) (11) article article (3) (4) (5) (10) (12) (13) (14) (19) name year authors journal title year authors journal (6) (8) (15) (17) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (9) (16) (18) name name name name John Doe Mary Smith Peter Jones Sue Robinson 75
  • 99. Improving LCA Interconnection Relationship (1) bib (2) (11) article article (3) (4) (5) (10) (12) (13) (14) (19) name year authors journal title year authors journal (6) (8) (15) (17) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (9) (16) (18) name name name name John Doe Mary Smith Peter Jones Sue Robinson 76
  • 100. Improving LCA Interconnection Relationship (1) bib (2) (11) article article (3) (4) (5) (10) (12) (13) (14) (19) name year authors journal title year authors journal (6) (8) (15) (17) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (9) (16) (18) name name name name John Doe Mary Smith Peter Jones Sue Robinson 77
  • 101. Improving LCA Interconnection Relationship S1={3,7,9} is a false negative in both measures False positives for both when node labels are different but refer to similar concepts e.g. “article” vs. “book” vs. “text” False negatives when node labels are identical but refer to different concepts e.g. the name of a person vs. the name of a journal 78
  • 102. Improving LCA XKSearch: Smallest LCA (SLCA) Idea: Enhance LCA with a minimality constraint SLCA nodes are LCAs which do not have LCA nodes among their descendants Similar to XRank/ELCA but stricter 79
  • 103. Improving LCA XKSearch: Smallest LCA (SLCA) Idea: Enhance LCA with a minimality constraint SLCA nodes are LCAs which do not have LCA nodes among their descendants Similar to XRank/ELCA but stricter Y. Xu and Y. Papakonstantinou. Efficient Keyword Search for Smallest LCAs in XML Databases. SIGMOD 2005 79
  • 104. Improving LCA XKSearch: Smallest LCA (SLCA) (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 80
  • 105. Improving LCA XKSearch: Smallest LCA (SLCA) No false positive (1) bib (2) (13) (1035) ... article article article (3) (4) (5) (12) (14) (15) (16) (23) (1038) ... ... title year authors journal title year authors journal authors (6) (9) (17) (20) (1039) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author author (7) (8) (10) (11) (18) (19) (21) (22) (1040) (1041) first last first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson Tom Smith 81
  • 106. Improving LCA XKSearch: Smallest LCA (SLCA) But of course... (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 82
  • 107. Improving LCA SLCA: False Negatives (1) article (2) (3) (4) (11) (12) title year authors journal references (5) (8) (13) ...Semantic Web... 2005 Computer Journal author author article (6) (7) (9) (10) (14) (15) first last first last title authors (16) John Doe Mary Smith Web Querying... author (17) (18) first last Mary Smith 83
  • 108. Improving LCA Meaningful LCA (MLCA) MLCA: LCA where for each pair of nodes, no descendant LCA of nodes with the same label exists Similar to SLCA but less strict since only applies when node label constraint is fulfilled Problems similar to SLCA and Interconnection Semantics 84
  • 109. Improving LCA Meaningful LCA (MLCA) MLCA: LCA where for each pair of nodes, no descendant LCA of nodes with the same label exists Similar to SLCA but less strict since only applies when node label constraint is fulfilled Problems similar to SLCA and Interconnection Semantics Y. Li, C. Yu, and H. V. Jagadish. Schema-Free XQuery. VLDB 2004 84
  • 110. Improving LCA Meaningful LCA (MLCA) S1={3,11},S2={23,11} (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 85
  • 111. Improving LCA Meaningful LCA (MLCA) K={Smith, journal} (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 86
  • 112. Improving LCA Amoeba Join Si is a valid answer set if LCA(Si) ∈ Si only allows matches where one matched node is an ancestor of (or identical to) all matched nodes K={Smith, web}, L1={11}, L2={3,23}, S1={11,3}, S2= {11,23} 87
  • 113. Improving LCA Amoeba Join Si is a valid answer set if LCA(Si) ∈ Si only allows matches where one matched node is an ancestor of (or identical to) all matched nodes K={Smith, web}, L1={11}, L2={3,23}, S1={11,3}, S2= {11,23} T. Saito and S. Morishita. Amoeba Join: Overcoming Structural Fluctuations in XML Data. WebDB 87
  • 114. Improving LCA Amoeba Join K={web, Smith} — No false positive (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 88
  • 115. Improving LCA Amoeba Join K={web, Smith} — False negative (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 89
  • 116. Improving LCA Amoeba Join K={web, Smith, article} (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 90
  • 117. Improving LCA Amoeba Join K={web, Smith, article} — False positive (1) article (2) (3) (4) (11) (12) title year authors journal references (5) (8) (13) ...Semantic Web... 2005 Computer Journal author author article (6) (7) (9) (10) (14) (15) first last first last title authors (16) John Doe Mary Smith Web Querying... author (17) (18) first last Mary Smith 91
  • 118. Improving LCA Compact LCA (CLCA) CLCA: LCA node of an answer set Si where LCA(Si) dominates all nodes in Si Node vi dominates node vj if there is no vj ∈ Si where LCA (Si) is descendant of vi Simpler: CLCA is an LCA where no node in the associated answer set can be part of a more specific answer set Similar to SLCA 92
  • 119. Improving LCA Compact Valuable LCA (VLCA) CVLCA: CLCA node which is also a VLCA node Recall problems of SLCA and Interconnection Semantics 93
  • 120. Improving LCA Relaxed Tightest Fragment (RTF) Subtrees are complete with respect to matches while being as small as possible Similar to XRank: maximum match without contained descendant LCAs K={Smith, web}, L1={11}, L2={3,23}, S1={11,3,23}, S2= {11,3}, S3={11,23} and LCA(11,3)=2, LCA(11,14)=1, LCA(11,3,23)=1 94
  • 121. Improving LCA Relaxed Tightest Fragment (RTF) S1={11,3,23}, c1: ✖ (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 95
  • 122. Improving LCA Relaxed Tightest Fragment (RTF) S3={11,23}, c1: ✔, c2: ✖ (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 96
  • 123. Improving LCA Relaxed Tightest Fragment (RTF) S2={11,3}, c1: ✔, c2: ✔, c3: ✔ (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 97
  • 124. Improving LCA XRank: Exclusive LCA Idea: Prefer more specific LCAs unless reason not to (more matches) Result node is LCA node which does not contain further LCAs or which is still LCA if LCA subtrees are ignored As before, K={Smith, web}, L1={11}, L2={3,23}, S1= {11,3}, S2={11,23} and LCA(11,3)=2, LCA(11,23)=1 98
  • 125. Improving LCA XRank: Exclusive LCA Idea: Prefer more specific LCAs unless reason not to (more matches) Result node is LCA node which does not contain further LCAs or which is still LCA if LCA subtrees are ignored As before, K={Smith, web}, L1={11}, L2={3,23}, S1= {11,3}, S2={11,23} and LCA(11,3)=2, LCA(11,23)=1 Lin Guo et al. XRANK: Ranked keyword search over XML documents. SIGMOD 2003 98
  • 126. Improving LCA XRank: Exclusive LCA Only (2) is a valid result node (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 99
  • 127. Improving LCA XRank: Exclusive LCA But... (1) bib (2) (13) (1035) ... article article article (3) (4) (5) (12) (14) (15) (16) (23) (1038) ... ... title year authors journal title year authors journal authors (6) (9) (17) (20) (1039) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author author (7) (8) (10) (11) (18) (19) (21) (22) (1040) (1041) first last first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson Tom Smith 100
  • 128. Improving LCA XRank: Exclusive LCA Objection: XRank targeted at document-centric XML Does this solve the problem? Consider K={XML,RDF} 101
  • 129. Improving LCA XRank: Exclusive LCA S1={13}, S2={38,80}... Is 10 a good return node? (1) article (2) (3) (10) article authors body (4) (7) (11) (37) (78) ...Semantic Web... ... ... author author section section section (5) (6) (8) (9) (12) (13) (38) (79) (80) ... first last first last title content title title content John Doe Mary Smith Introduction... ...XML...RDF ...XML... Conclusion ...RDF... 102
  • 130. Improving LCA Summary No heuristic with perfect precision and recall Data-driven solutions, not universally applicable Monotonicity and consistency are desirable but often violated In some cases, no heuristic produces suitable results 103
  • 131. Improving LCA Summary K={Smith,2003} (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 104
  • 132. Improving LCA Summary K={Doe, Smith} (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 105
  • 133. Queries as Keywords RDF Summarize RDF graph (Q2RDF, Q2Semantic) Generate queries or query results by connecting matches Dijkstra’s Algorithm (Q2RDF) Kruskal’s Minimum Spanning Tree Algorithm (SPARK) Templates based on types (SemSearch) Cost-based heuristics (Q2Semantic) 106
  • 134. 2.2 Determining Return Values From an answer node to an answer representation 107
  • 135. Determining Return Values Approach 1: LCA LCA (or similar) node, e.g. ELCA, MLCA (2) article 108
  • 136. Determining Return Values Approach 2: Matched Nodes Matched nodes (Interconnection Semantics) (3) (11) title last ...Semantic Web... Smith 109
  • 137. Determining Return Values Approach 3: LCA & Path Path from LCA to matched nodes e.g. BANKS, RTF (2) article (3) (5) title authors (9) ...Semantic Web... author (11) last Smith 110
  • 138. Determining Return Values Approach 4: Entire Subtree LCA or entity subtree e.g. SLCA, XRank, VLCA (2) article (3) (4) (5) (12) title year authors journal (6) (9) ...Semantic Web... 2005 Computer Journal author author (7) (8) (10) (11) first last first last John Doe Mary Smith 111
  • 139. Determining Return Values Comparison & Summary Neither approach is always satisfying subtree may be too big path or node may not provide enough information More controlled return value is desirable use query elements to determine a suitable return value automatically 112
  • 140. Determining Return Values Exemplar: XSeek Processing steps: 1. Match query on node labels and content 2. Group matches using SLCA 3. Extract return nodes from query: If a term in K matches a node label and no descendant content is matched, consider the term a return node (explicit) 4. When there are no return nodes, return SLCA entity subtree (implicit) 113
  • 141. Determining Return Values Exemplar: XSeek Terms in the query that are not return nodes are search predicates i.e. keywords to find Entities and their attributes inferred from the schema using cardinality information Final return value: explicit or implicit return nodes and entities’ attributes 114
  • 142. Determining Return Values Exemplar: XSeek (1) Entity node (1:n) Attribute node (1:1) Connection node bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 115
  • 143. Determining Return Values Exemplar: XSeek K={Smith, web} (2) (9) article author (3) (4) (12) (10) (11) title year journal first last ...Semantic Web... 2005 Computer Journal Mary Smith 116
  • 144. Determining Return Values Exemplar: XSeek K={Smith, title} (3) (11) title last ...Semantic Web... Smith 117
  • 145. Determining Return Values Exemplar: XSeek K={Smith, title} (3) (11) title last ...Semantic Web... Smith Z. Liu, J. Walker and Y.Chen. XSeek: a semantic XML search engine using keywords. VLDB 2007 117
  • 146. Queries as Keywords Expressiveness Queries: unordered lists with implicit conjunction + Conjunction, disjunction (Multiway, Abbaci et al.) + Inclusion, sibling, negation, precedence (Abbaci et al.) User-selected return value (XSeek, MIU) Numeric comparison operators (MIU) Optional terms (Interconnection) label:keyword terms (Interconnection, XSearch) Keyword-enhanced languages 118
  • 147. Queries as Keywords Expressiveness (1) bib (2) (13) article article (3) (4) (5) (12) (14) (15) (16) (23) title year authors journal title year authors journal (6) (9) (17) (20) ...Semantic Web... 2005 Computer Journal ...XML... 2003 Web Journal author author author author (7) (8) (10) (11) (18) (19) (21) (22) first last first last first last first last John Doe Mary Smith Peter Jones Sue Robinson 119
  • 148. Queries as Keywords Expressiveness Transform query into binary tree Construct set of matching nodes 23 and their ancestors for each term AND Bottom-up processing: web NOT 4 ... 23 Apply operator to sets of nodes 3 23 (i.e. intersection for conjunction) 2 semantic 1 3 The nodes remaining at the root are 13 2 1 valid answers (LCA-like) 120
  • 149. Queries as Keywords Expressiveness Transform query into binary tree Construct set of matching nodes 23 and their ancestors for each term AND Bottom-up processing: web NOT 4 ... 23 Apply operator to sets of nodes 3 23 (i.e. intersection for conjunction) 2 semantic 1 3 The nodes remaining at the root are 13 2 1 valid answers (LCA-like) F. Abbaci et al. Index and Search XML Documents by Combining Content. International Conference on Internet Computing 2006 120
  • 150. Queries as Keywords Ranking Size of answer subtree Distance between matched nodes and answer tree root node tf-idf/vector space model XRank: PageRank-like ranking 121
  • 151. Queries as Keywords XRank — Ranking factors Result specificity: vertical distance Keyword proximity: horizontal distance Hyperlink awareness: PageRank value, adapted for XML distinction between XML edges and IDREF links Bidirectional propagation between XML edges distinction between following forward and backward links 122
  • 152. Queries as Keywords XRank — Ranking factors Result specificity: vertical distance Keyword proximity: horizontal distance Hyperlink awareness: PageRank value, adapted for XML distinction between XML edges and IDREF links Bidirectional propagation between XML edges distinction between following forward and backward links Lin Guo et al. XRANK: Ranked keyword search over XML documents. SIGMOD 2003 122
  • 153. Queries as Keywords Limitations Mostly limited to tree data Determining semantic entities in structured data Assumption: No element outside of LCA subtree is relevant No universal solution, data-driven Relatively low expressiveness 123
  • 154. Queries as Keywords Limitations Query answers Little control over selection Result may be too verbose or not informative enough No construction or aggregation Exception: Keyword-enhanced languages No querying of data in mixed formats (RDF & XML) 124
  • 155. Part 3 KWQL Keyword-based query language for Semantic Wikis 125
  • 156. Semantic Wikis: The (Semantic) Web in the Small As on the Semantic Web we have pages and links between them and annotations content created by many different people But we also have central control, organization and administration a small (or at least manageable) number of pages Strong social factors and collaboration →Wikis as a testbed for the “real” web 126
  • 157. KWQL: Keyword-Based QL for Wikis Characteristics KWQL can access all elements the user interacts with Combined querying of text, annotation and metadata Querying of informal to formal annotations Combination of selection criteria from several data sources in one query Aggregation and construction Data construction Embedded queries Continuous queries 127
  • 158. KWQL: Keyword-Based QL for Wikis Characteristics KWQL can access all elements the user interacts with Combined querying of text, annotation and metadata Querying of informal to formal annotations Combination of selection criteria from several data sources in one query F. Bry and K. Weiand. Flavors of Aggregation and construction KWQL, a Keyword Query Language for a Semantic Wiki. Data construction SOFSEM, 2010. Embedded queries Continuous queries 127
  • 159. KWQL: Keyword-Based QL for Wikis Characteristics Varying complexity of queries Simple label-keyword queries Conjunction/disjunction/optional Structural queries Link traversal 128
  • 160. KWQL: Keyword-Based QL for Wikis Examples 1 Java 2 author:"Mary" 3 ci(text:Java OR (tag(name:XML) AND author:Mary)) ci(tag(name:Java) link(target:ci(title:Lucene) 4 tag(name:uses))) ci(title:Contents text:($A "-" ALL($T,","))) 5 @ ci(title:$T author:$A) 129
  • 161. KWQL: Keyword-Based QL for Wikis visKWQL KWQL’s visual counterpart Query by example paradigm Round-tripping between KWQL and visKWQL Visualization of textual queries visKWQL as a tool to learn KWQL 130
  • 162. KWQL: Keyword-Based QL for Wikis KWQL and visKWQL 131
  • 163. KWQL: Keyword-Based QL for Wikis KWQL and visKWQL 132
  • 164. KWQL: Keyword-Based QL for Wikis KWQL and visKWQL 133
  • 165. KWQL: Keyword-Based QL for Wikis KWQL and visKWQL 134
  • 166. Part 4 Conclusion 135
  • 167. Web Queries Conclusion Is it even possible to find a universal grouping mechanism? How easy to use can a query language be while still being powerful enough? How complicated can a query language be without becoming too hard to use for casual users? 136
  • 168. Web Queries Conclusion Slides and links at http://pms.ifi.lmu.de/wise 137