Keyword Proximity Search in XML Trees - Andrada Astefanoaie - presentation
 

    Presentation Transcript

    • Keyword Proximity Search in XML Trees. Andrada Astefanoaie. XML and Database Systems, SS 2010
    • Outline: I. Introduction; II. Framework; III. Algorithms: Indexed XML Data; IV. Processing Unindexed XML Data; V. Experimental Evaluation; VI. Overview
    • Keyword Search: Keyword search is a user-friendly information discovery technique that has been extensively studied for text documents. Keyword proximity search is well suited to XML documents.
    • Notation: An XML document is a directed tree with labeled nodes. Each node v is labeled with a tag λ(v) and has a 4-tuple id(v): start and end correspond to the first and the final times v is visited in a depth-first traversal of the XML tree, and depth is the depth of the node from the root of the tree. If v is a leaf, it has a string value val(v) that contains a list of keywords. A keyword query is a set of keywords k1, . . . , km; it returns a compact representation of the set of trees that connect the nodes that contain the keywords.
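The start, end, and depth values described in the notation can be computed in one depth-first pass. The following is a minimal sketch, not code from the presentation: the `Node` class, the function names `label_tree` and `is_ancestor`, and the use of one shared counter for start and end are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str                      # the tag, λ(v)
    value: str = ""               # val(v), only meaningful for leaves
    children: list = field(default_factory=list)
    start: int = 0
    end: int = 0
    depth: int = 0


def label_tree(root):
    """Assign start, end, and depth to every node in one depth-first pass."""
    clock = 0

    def visit(node, depth):
        nonlocal clock
        clock += 1
        node.start = clock        # first time the node is visited
        node.depth = depth
        for child in node.children:
            visit(child, depth + 1)
        clock += 1
        node.end = clock          # final time the node is visited

    visit(root, 0)
    return root


def is_ancestor(u, v):
    """u is a proper ancestor of v iff v's interval is nested inside u's."""
    return u.start < v.start and v.end < u.end
```

With this numbering, an ancestor test needs only two integer comparisons, which is what makes the (start, end) pair useful when nodes are read from index lists rather than from the tree itself.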
    • Notation: [Figure: example XML tree with nodes r, c1, s1-s3, p1-p6, a1-a10]
    • Notation, Definition: The minimum connecting tree (MCT) of nodes v1, . . . , vm of a tree is the minimum-size subtree that connects v1, . . . , vm. The root of the MCT is the lowest common ancestor (LCA) of the nodes v1, . . . , vm. Examples: [Figures: MCTs for the query “Tom, Harry” and MCTs for the query “Tom, Dick, Harry”, drawn on the example tree]
    • Notation, DMCT: Let v1, . . . , vm ∈ T. The Distance MCT (DMCT) TD = d(TM) of the MCT TM of nodes v1, . . . , vm is the minimum node-labeled and edge-labeled tree such that: (1) TD contains v1, . . . , vm; (2) TD contains the LCAs u1, . . . , uk of any pair of nodes (vi, vj), where vi, vj ∈ {v1, . . . , vm}, i ≠ j; (3) there is an edge labeled with l between any two distinct nodes n, n’ ∈ {v1, . . . , vm, u1, . . . , uk} if there is a path of length l from n’ to n in TM and the path does not contain any node n’’ ∈ {u1, . . . , uk} other than n and n’.
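The DMCT definition can be made concrete with a short sketch. It is written under assumptions that are not in the slides: the tree is given as a hypothetical child-to-parent dictionary `parent`, node ids are plain strings, and the function names are invented for illustration. The sketch keeps the keyword nodes and their pairwise LCAs, then labels each edge with the number of tree edges it collapses.

```python
from itertools import combinations


def ancestors(node, parent):
    """Return [node, parent of node, ..., root]."""
    chain = [node]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def lca(u, v, parent):
    """Lowest common ancestor of u and v in the tree described by parent."""
    seen = set(ancestors(u, parent))
    for w in ancestors(v, parent):
        if w in seen:
            return w
    raise ValueError("nodes are not in the same tree")


def distance_mct(nodes, parent):
    """Return (kept_nodes, edges) where edges maps (upper, lower) -> length."""
    # Keep the keyword nodes plus the LCA of every pair of them.
    kept = set(nodes) | {lca(u, v, parent) for u, v in combinations(nodes, 2)}
    # The shallowest kept node is the root of the DMCT.
    root = min(kept, key=lambda n: len(ancestors(n, parent)))
    edges = {}
    for n in kept:
        if n == root:
            continue
        # Walk up until the next kept node, counting collapsed tree edges.
        steps, up = 1, parent[n]
        while up not in kept:
            steps += 1
            up = parent[up]
        edges[(up, n)] = steps
    return kept, edges
```

For example, with a hypothetical fragment in which a1 and a2 sit under p1, a3 sits under p2, and both p1 and p2 sit under s1, the DMCT of a1 and a3 keeps only s1, a1, and a3, with both edges labeled 2.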
    • Notation, GDMCT: A Grouped DMCT (GDMCT) of a tree T is a labeled tree where edges are labeled with numbers and nodes are labeled with lists of node ids from T. A DMCT D belongs to a GDMCT G (D ∈ G) if D and G are isomorphic. Assuming that f is the mapping of the nodes of D to the nodes of G, which induces a corresponding mapping, also called f, of the edges of D to the edges of G, the following must hold: (1) if nD is a node of D, nG is a node of G, and f(nD) = nG, then the label of nG contains the id of nD; (2) if eD is an edge of D, eG is an edge of G, and f(eD) = eG, then the label of eD and the label of eG are the same number.
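One way to picture the grouping is to give every DMCT a canonical shape built from its edge labels, and to collect node ids position by position for DMCTs that share a shape. The sketch below is an illustration only: the input format (a root id plus a dictionary from node id to a list of `(child, edge_length)` pairs), the function names, and the decision to ignore which keyword each node matches are all assumptions, not the presentation's method.

```python
from collections import defaultdict


def canonical(children, node):
    """Return (shape, ids): a hashable shape and node ids in matching order."""
    parts = []
    for child, length in children.get(node, []):
        sub_shape, sub_ids = canonical(children, child)
        parts.append(((length, sub_shape), sub_ids))
    # Sort subtrees by (edge label, shape) so sibling order does not matter.
    parts.sort(key=lambda part: part[0])
    shape = tuple(key for key, _ in parts)
    ids = [node] + [i for _, sub_ids in parts for i in sub_ids]
    return shape, ids


def group_dmcts(dmcts):
    """Group DMCTs by shape; each group holds one id list per node position."""
    members = defaultdict(list)
    for root, children in dmcts:
        shape, ids = canonical(children, root)
        members[shape].append(ids)
    return {shape: [list(column) for column in zip(*rows)]
            for shape, rows in members.items()}
```

Two DMCTs whose roots have two children at distance 2 each fall into the same group, and the group's root position then lists both root ids.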
    • Problems: Problem 1: All GDMCTs Problem. Query K Result: “Tom, Harry” 5 3
    • Problems: Problem 2: Lowest GDMCTs Problem. Query K Result: “Tom, Harry” 5 3
    • All GDMCTs: Nested Loop Algorithm: The nested loops algorithm (NL) for the case of indexed XML data operates over separate lists of nodes, L(k), one for each query keyword k, to identify the GDMCTs whose sizes are no more than the user-provided threshold K. Master index: an inverted index (a hash table) that holds the list L(k) for each keyword; each node n has a path-id (the list of node ids along the path from the root of T to n). [Figure: examples of some entries in the master index for our tree]
    • All GDMCTs: Nested Loop Algorithm: NL (1) checks all combinations of nodes from the keyword lists; (2) for each combination, computes an MCT (minimum connecting tree); (3) merges the resulting MCT into the list of result GDMCTs if its size is within the user-specified threshold.
    • All GDMCTs: Nested Loop Algorithm: Example: for the query “Tom, Harry” and K = 3, NL examines the 12 node pairs, computes 12 MCTs, determines that 2 of them meet the threshold K, and returns 2 GDMCTs. [Figure: the 2 resulting GDMCTs]
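The nested loops idea can be sketched in a few lines over path-ids. This is a hedged illustration rather than the presentation's pseudocode: the index contents below are hypothetical, MCT size is assumed to mean the number of edges, and the final merge into GDMCTs is left out so that only the enumerate-and-filter step is shown.

```python
from itertools import product


def mct_from_paths(paths):
    """Return (lca, size) of the MCT for nodes given by root-to-node paths."""
    shared = 0
    # Count how many leading ids all paths have in common.
    while (all(shared < len(p) for p in paths)
           and len({p[shared] for p in paths}) == 1):
        shared += 1
    lca = paths[0][shared - 1]
    nodes = set()
    for p in paths:
        nodes.update(p[shared - 1:])   # the LCA and everything below it
    return lca, len(nodes) - 1         # a tree with n nodes has n - 1 edges


def nested_loops(index, keywords, K):
    """Check every combination of nodes, one node per keyword list."""
    results = []
    for combo in product(*(index[k] for k in keywords)):
        lca, size = mct_from_paths(combo)
        if size <= K:                  # keep MCTs within the threshold
            results.append((lca, [p[-1] for p in combo], size))
    return results
```

The cost is visible in the `product` call: the loop body runs once per combination, so the work grows with the product of the list lengths even when almost no combination meets the threshold.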
    • All GDMCTs: Nested Loop Algorithm: Inefficiency: NL checks all the combinations of nodes from the keyword lists. The grouping of the results into GDMCTs is not tightly integrated with the algorithm, and a lookup to the array R is required for each relevant MCT found.
    • All GDMCTs: Stack-Based Algorithm: Index Structure and Algorithm. The stack-based algorithm (SA) for computing GDMCTs on indexed XML data operates over lists of nodes, two for each query keyword. Indexed by keyword, the master index contains 2 lists: L(k), of the nodes of T that contain k, and Ld(k), of the ancestors of nodes in L(k).
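A small sketch of how the two lists per keyword could be built is given here. The input format is an assumption: `parent` is a hypothetical child-to-parent dictionary and `values` maps leaf ids to their string values. Storing Ld(k) as a set is also a simplification, since the slides do not say how the lists are ordered.

```python
def build_master_index(parent, values):
    """Return (L, Ld): nodes containing each keyword, and their ancestors."""
    L, Ld = {}, {}
    for node, text in values.items():
        for keyword in set(text.split()):
            L.setdefault(keyword, []).append(node)
            ancestors_of_k = Ld.setdefault(keyword, set())
            up = node
            while up in parent:        # climb to the root, recording ancestors
                up = parent[up]
                ancestors_of_k.add(up)
    return L, Ld
```

Intersecting `Ld["Tom"]` with `Ld["Harry"]` then yields exactly the nodes that have both a Tom node and a Harry node somewhere beneath them.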
    • All GDMCTs: Stack-Based Algorithm: Index Structure and Algorithm. For example, the entries for Tom, Dick and Harry are: [Figure: master index entries]
    • All GDMCTs: Stack-Based Algorithm: Index Structure and Algorithm. [Figure: high-level description of the SA] The description shows how the selected list of nodes is traversed in a depth-first manner and how nodes are pushed onto and popped from the stack.
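The push and pop pattern of such a traversal can be sketched independently of the GDMCT bookkeeping. The following is a generic illustration, not the SA pseudocode from the slide: it assumes each list entry is a hypothetical `(id, start, end)` triple and only records the sequence of stack operations.

```python
def stack_traversal(nodes):
    """Replay a depth-first traversal over (id, start, end) entries."""
    stack, events = [], []
    for node_id, start, end in sorted(nodes, key=lambda n: n[1]):
        # Pop every node whose interval closed before this node starts,
        # i.e. every stacked node that is not an ancestor of the new node.
        while stack and stack[-1][2] < start:
            events.append(("pop", stack.pop()[0]))
        stack.append((node_id, start, end))
        events.append(("push", node_id))
    while stack:                       # input exhausted: unwind the stack
        events.append(("pop", stack.pop()[0]))
    return events
```

At any moment the stack holds a single root-to-node path, so whatever per-node state an algorithm keeps is combined with the parent's state at the moment a node is popped.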
    • All GDMCTs: Stack-Based Algorithm: Index Structure and Algorithm. The novel part of the SA is the processing and bookkeeping performed at each stack operation. [Figure: pseudocode]
    • All GDMCTs: Stack-Based Algorithm: Index Structure and Algorithm. Functions that are called from POP(S). [Figure: pseudocode]
    • All GDMCTs: Stack-Based Algorithm: Illustrative Example. Query: “Tom, Harry”, K = 3. [Figures: master index lists; the intersection of the lists]
    • All GDMCTs: Stack-Based Algorithm: Illustrative Example. Query: “Tom, Harry”, K = 3. Some of the initial stack states of the execution of the Stack Algorithm: [Figures: stack states 1-3, shown next to the master index lists and the intersection of the lists]
    • All GDMCTs: Stack-Based Algorithm: Illustrative Example. Query: “Tom, Harry”, K = 3. Some of the initial stack states of the execution of the Stack Algorithm: [Figures: stack states 4-6]
    • All GDMCTs: Stack-Based Algorithm: Illustrative Example. Query: “Tom, Harry”, K = 3. Some of the initial stack states of the execution of the Stack Algorithm: [Figures: stack states 7-9] Entries from the lists continue to be examined, and new GDMCTs are created and pruned, until all the answers are output.
    • Lowest GDMCTs: Stack-Based Algorithm: The key observation is that once we output the GDMCTs of a node u, none of the ancestors of u in the stack can be LCAs of returned GDMCTs; hence, we can remove all of them from the stack. Specifically, we can add the following lines after line 5: [Figure: added pseudocode lines]
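What "lowest" means can be illustrated with a plain post-filter. This is not the stack pruning described on the slide: it is a quadratic check, assumed here only to make the definition concrete, that keeps a candidate LCA when no other candidate lies strictly inside its (start, end) interval. The input dictionary and its numbers are hypothetical.

```python
def lowest_lcas(candidates):
    """Keep candidate LCAs that have no other candidate as a descendant.

    candidates maps node id -> (start, end) from a depth-first numbering.
    """
    lowest = []
    for node, (start, end) in candidates.items():
        has_descendant = any(
            start < other_start and other_end < end
            for other_start, other_end in candidates.values()
        )
        if not has_descendant:
            lowest.append(node)
    return lowest
```

The stack-based variant reaches the same answer without comparing every pair, because the ancestors of an output node are exactly the entries below it on the stack.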
    • LCAs: Stack-Based Algorithms: The Stack Algorithm can also be easily modified to solve the All LCAs Problem and the Lowest LCAs Problem, where the user is not interested in the GDMCTs but only in the LCA nodes. First, Merge(.) could be simplified: no merging of GDMCTs would need to be done, and line 33 could be replaced by: [Figure: replacement line] Second, we can output an LCA early, when the first GDMCT (with all keywords) is computed for that node (in Procedure CreateNewGDMCTs(.)), instead of waiting until the node is popped from the stack.
    • Complexity Analysis: Total number of GDMCTs: in the worst case, the number of DMCTs and of GDMCTs is exponential in the number of keywords. Under reasonable assumptions, the worst-case number of GDMCTs is smaller than that of DMCTs. Complexity of Finding Isomorphic GDMCTs: given the canonical representation presented in this chapter, one can linearize the GDMCTs in an XML-like nested representation with start and end tags, obtained from the node annotations. Theorem 1. The time complexity of SA is O( L  K  (i 1 L(ki ) ) 2 ) m
    • Processing Unindexed XML Data: Both the NL Algorithm and the SA have adaptations that work without index lists by doing a single pass over the data tree. The streaming version of the Stack Algorithm requires the following changes to the Stack Algorithm SA(k1, . . . , km, K): [Figure: list of changes]
    • Experimental Evaluation: Parameters affecting the performance of the presented algorithms: 1) the value of the threshold K; 2) the number m of keywords; 3) the size of the data set. Tests show that the algorithms based on the Stack Algorithm usually perform better than the Nested Loops Algorithms on both indexed and unindexed data.
    • Overview: Two main problems were presented: 1) identifying and presenting in a compact manner all MCTs that explain how the keywords are connected; 2) identifying only MCTs whose root is not an ancestor of the root of another MCT. Solutions were presented for two settings: 1) when the XML data has been preprocessed and relevant indices have been constructed (Nested Loop Algorithm, Stack Algorithm); 2) when the XML data has not been preprocessed, i.e., the XML data can only be processed sequentially. The benefits of the algorithms are shown by the experimental evaluation.
    • Resource: Name: Keyword Proximity Search in XML Trees. Authors: Vangelis Hristidis, Nick Koudas, Yannis Papakonstantinou and Divesh Srivastava. Publication: IEEE Transactions on Knowledge and Data Engineering, Vol 18, No 4, April 2006
    • Keyword Proximity Search in XML Trees. Thank you!