Distance Based Indexing

    Distance Based Indexing - Presentation Transcript

    1. Distance-based Indexing for Metric Space & Almost-Metric Space. Donghui Zhang, Northeastern University
    2. Problem Statement
      • Given a set S of objects and a metric distance function d(), the similarity search problem is defined as: for an arbitrary object q and a threshold ε, find
      • { o | o ∈ S ∧ d(o, q) < ε }
      • Solution without an index: for every o ∈ S, compute d(q, o). Not efficient!
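The index-free baseline can be sketched in a few lines. This is a minimal illustration, not code from the presentation: the name `linear_scan`, the integer data and the absolute-difference distance are all assumptions chosen for the example.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def linear_scan(objects: Iterable[T], q: T, eps: float,
                d: Callable[[T, T], float]) -> List[T]:
    """Return every object o with d(o, q) < eps; one distance call per object."""
    return [o for o in objects if d(o, q) < eps]


# Hypothetical data: integers compared by absolute difference.
S = [1, 4, 9, 12, 20]
result = linear_scan(S, q=10, eps=3, d=lambda a, b: abs(a - b))
print(result)  # [9, 12]
```

Every query costs |S| distance computations, which is what an index tries to avoid when d is expensive.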
    3. Metric Function
      • d(x, x) = 0;
      • d(x, y) > 0, where x ≠ y;
      • d(x, y) = d(y, x);
      • d(x, y) ≤ d(x, z) + d(y, z).
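The four conditions can be spot-checked on a finite sample. The sketch below is illustrative only: `is_metric`, the tolerance `tol` and the sample points are assumptions, and passing the check on a sample does not prove that a function is a metric (a failure does disprove it).

```python
from itertools import product


def is_metric(points, d, tol=1e-12):
    """Check the four metric conditions on every pair and triple of a finite sample."""
    for x in points:
        if abs(d(x, x)) > tol:                    # d(x, x) = 0
            return False
    for x, y in product(points, repeat=2):
        if x != y and d(x, y) <= 0:               # d(x, y) > 0 for x != y
            return False
        if abs(d(x, y) - d(y, x)) > tol:          # symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(y, z) + tol:     # triangle inequality
            return False
    return True


sample = [0.0, 1.0, 2.5, 7.0]
print(is_metric(sample, lambda a, b: abs(a - b)))    # True
print(is_metric(sample, lambda a, b: (a - b) ** 2))  # False: triangle inequality fails
```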
    4. Spatial-Index Approach
      • If every object can be mapped to a location in space (e.g. 2-D point), there are existing solutions.
      • R-tree, Quad-tree, X-tree, …
      • Idea: break space hierarchically into partitions and store objects that are close to each other in the same partition; at query time, prune whole partitions if possible.
    5. Spatial Indexes Do not Apply
      • In our problem, objects can be arbitrary and we only know the distance function.
      • E.g. objects can be pictures, dogs, and so on. How to map a dog as a multi-dim point? Not clear. But suppose we got the “magical” distance function.
    6. VP-tree
      • Vantage point tree, by Peter N. Yianilos, “Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces”, Proc. ACM-SIAM Symposium on Discrete Algorithms, 1993.
      • Idea: build a binary search tree in which each node corresponds to an object. The root is picked at random; the n/2 objects closest to it go into the left sub-tree and the remaining objects into the right sub-tree.
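A minimal sketch of this construction follows. It is one possible reading of the idea on the slide, not Yianilos's published algorithm: the `Node` class, the `build` and `objects_in` helpers, the seeded random generator and the integer test data are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    obj: object
    left: Optional["Node"] = None    # objects closest to obj
    right: Optional["Node"] = None   # remaining, farther objects


def build(objects: List[object], d: Callable, rng: random.Random) -> Optional[Node]:
    """Pick a random root, sort the rest by distance to it, split at the middle."""
    if not objects:
        return None
    objects = list(objects)
    root = objects.pop(rng.randrange(len(objects)))
    ranked = sorted(objects, key=lambda o: d(root, o))
    half = len(ranked) // 2
    return Node(root,
                left=build(ranked[:half], d, rng),
                right=build(ranked[half:], d, rng))


def objects_in(node: Optional[Node]) -> list:
    """List every object stored under a node."""
    if node is None:
        return []
    return [node.obj] + objects_in(node.left) + objects_in(node.right)


# Hypothetical data: ten integers compared by absolute difference.
d = lambda a, b: abs(a - b)
tree = build(list(range(10)), d, random.Random(0))
near, far = objects_in(tree.left), objects_in(tree.right)
print(tree.obj, sorted(near), sorted(far))
```

With ten objects, one becomes the root, four land in the left sub-tree and five in the right, and no object on the left is farther from the root than any object on the right.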
    7. An Example
      • S = { o_1, …, o_10 }.
      • Randomly pick o_1 as root.
      • Compute the distance between o_1 and each o_i, and sort in increasing order of distance:
          object:        o_3  o_7  o_6  o_9  o_10  o_2  o_8  o_5  o_4
          d(o_1, o_i):   5    6    18   34   96    102  111  300  401
      • Build the tree recursively: o_1 is the root; the left sub-tree holds o_3, o_7, o_6, o_9 (distances up to 34); the right sub-tree holds o_10, o_2, o_8, o_5, o_4 (distances from 96).
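The root-level split of this example can be reproduced directly from the listed distances. The sketch assumes that the nine numbers in the figure pair with the nine objects in the order shown; the variable names are illustrative.

```python
# Distances d(o1, oi) as read off the example (pairing assumed from the figure order).
dist_to_root = {"o2": 102, "o3": 5, "o4": 401, "o5": 300, "o6": 18,
                "o7": 6, "o8": 111, "o9": 34, "o10": 96}

ranked = sorted(dist_to_root, key=dist_to_root.get)
half = len(ranked) // 2            # 9 objects: 4 go left, 5 go right
left, right = ranked[:half], ranked[half:]

print(left)                        # ['o3', 'o7', 'o6', 'o9']
print(right)                       # ['o10', 'o2', 'o8', 'o5', 'o4']
print(dist_to_root[left[-1]], dist_to_root[right[0]])  # 34 96
```

The two boundary values, 34 and 96, are the largest distance on the left and the smallest distance on the right.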
    8. Query Processing
      • Given object q, compute d(q, root). Intuitively, if it is small, search the left sub-tree; otherwise, search the right sub-tree.
      • Let maxDL = max{ d(root, o_i) | o_i ∈ left sub-tree } (stored in the index).
      • Under what circumstances can we prune the left sub-tree?
    9. To Prune the Left Sub-Tree…
      • We need: ∀ o_i ∈ left sub-tree, d(q, o_i) ≥ ε.   (1)
      • We know: d(q, o_i) + d(o_i, root) ≥ d(q, root), or
      • d(q, o_i) ≥ d(q, root) - d(o_i, root), which implies:
      • d(q, o_i) ≥ d(q, root) - maxDL.   (2)
      • To guarantee (1), it is sufficient to have:
      • d(q, root) - maxDL ≥ ε.   (3)
      • Summary: given q, compare it with the tree root. If d(q, root) is so large that (3) is true, we know (1) is true and we can prune the left sub-tree.
    10. To Prune the Right Sub-Tree…
      • Similarly, we define
      • minDR = min{ d(root, o_i) | o_i ∈ right sub-tree }.
      • Given q, compare it with the tree root. If d(q, root) is so small that minDR - d(q, root) ≥ ε is true, we can prune the right sub-tree.
      • Note: these prunings are done at each level of the tree.
    11. Can we always prune?
      • No.
      • If d(q, root) - maxDL < ε, we cannot prune the left sub-tree;
      • If minDR - d(q, root) < ε, we cannot prune the right sub-tree;
      • Combined:
      • If minDR - ε < d(q, root) < maxDL + ε, we have to examine both sub-trees.
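Both pruning rules fit into one recursive range search. The sketch below is an illustration under assumptions, not the presentation's code: nodes are plain dictionaries, and `build`, `range_search`, the `stats` counter and the test data (multiples of 7 under absolute difference) are all invented for the example.

```python
import random


def build(objects, d, rng):
    """Build a VP-tree node as a dict, storing maxDL and minDR; None if empty."""
    if not objects:
        return None
    objects = list(objects)
    root = objects.pop(rng.randrange(len(objects)))
    ranked = sorted(objects, key=lambda o: d(root, o))
    half = len(ranked) // 2
    near, far = ranked[:half], ranked[half:]
    return {
        "obj": root,
        "max_dl": d(root, near[-1]) if near else None,  # maxDL
        "min_dr": d(root, far[0]) if far else None,     # minDR
        "left": build(near, d, rng),
        "right": build(far, d, rng),
    }


def range_search(node, q, eps, d, out, stats):
    """Append to out every object o under node with d(o, q) < eps."""
    if node is None:
        return
    dq = d(q, node["obj"])
    stats["distance_calls"] += 1
    if dq < eps:
        out.append(node["obj"])
    # Left sub-tree is pruned when d(q, root) - maxDL >= eps.
    if node["left"] is not None and dq - node["max_dl"] < eps:
        range_search(node["left"], q, eps, d, out, stats)
    # Right sub-tree is pruned when minDR - d(q, root) >= eps.
    if node["right"] is not None and node["min_dr"] - dq < eps:
        range_search(node["right"], q, eps, d, out, stats)


# Hypothetical data set and query.
d = lambda a, b: abs(a - b)
S = list(range(0, 1000, 7))
q, eps = 500, 10
tree = build(S, d, random.Random(0))
found, stats = [], {"distance_calls": 0}
range_search(tree, q, eps, d, found, stats)
brute = [o for o in S if d(o, q) < eps]
print(sorted(found), sorted(found) == brute)  # [497, 504] True
```

When d(q, root) falls between minDR - ε and maxDL + ε, neither condition prunes and both sub-trees are visited; `stats["distance_calls"]` shows how many distance computations a given query actually needed.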
    12. Almost Metric
      • Almost Metric was introduced in the paper “Distance Based Indexing for String Proximity Search”, ICDE’03.
      • It is similar to a metric, except that the condition d(x, y) ≤ d(x, z) + d(y, z) is relaxed to d(x, y) ≤ f * (d(x, z) + d(y, z)) for some constant f.
      • Can the VP-tree be used in an almost metric space?
    13. A thought on f
      • Must be: f ≥ 1. Why?
      • d(x, y) ≤ f * (d(x, z) + d(y, z))
      • d(x, y) ≤ f * (d(x, y) + d(y, y))   (let z = y)
      • d(x, y) ≤ f * d(x, y)   (since d(y, y) = 0)
      • f ≥ 1   (since d(x, y) > 0 for x ≠ y)
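To see the constant f on concrete numbers, the sketch below computes the smallest f that works for a finite sample. The squared difference (x - y)² is used here as a hypothetical almost-metric that is not taken from the slides; it satisfies the relaxed inequality with f = 2 because (u + v)² ≤ 2(u² + v²). The helper `smallest_f` and the sample points are assumptions.

```python
from itertools import product


def smallest_f(points, d):
    """Smallest f with d(x, y) <= f * (d(x, z) + d(y, z)) for all triples in the sample."""
    best = 0.0
    for x, y, z in product(points, repeat=3):
        denom = d(x, z) + d(y, z)
        if denom > 0:                      # denom == 0 only when x == y == z
            best = max(best, d(x, y) / denom)
    return best


pts = [0, 1, 2, 3, 4]
print(smallest_f(pts, lambda a, b: abs(a - b)))    # 1.0 (a true metric)
print(smallest_f(pts, lambda a, b: (a - b) ** 2))  # 2.0 (e.g. x=0, y=2, z=1)
```

In both cases the triple with z = y gives a ratio of exactly 1, which is why no sample can produce a value below 1.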
    14. To Prune the Right Sub-Tree…
      • We need: ∀ o_i ∈ right sub-tree, d(q, o_i) ≥ ε.   (1)
      • We know: d(o_i, root) ≤ f * (d(q, o_i) + d(q, root)), or
      • d(q, o_i) ≥ d(o_i, root) / f - d(q, root), which implies:
      • d(q, o_i) ≥ minDR / f - d(q, root).   (2)
      • To guarantee (1), it is sufficient to have:
      • minDR / f - d(q, root) ≥ ε.   (3)
      • Summary: given q, compare it with the tree root. If d(q, root) is so small that (3) is true, we know (1) is true and we can prune the right sub-tree.
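Starting from the relaxed inequality d(o_i, root) ≤ f * (d(q, o_i) + d(q, root)) and dividing through by f, a sufficient test for pruning the right sub-tree is minDR / f - d(q, root) ≥ ε; with f = 1 this reduces to the metric-space rule. The sketch below encodes that test. The function name `can_prune_right` and the sample numbers are assumptions (minDR = 96 is borrowed from the earlier example).

```python
def can_prune_right(d_q_root: float, min_dr: float, eps: float, f: float = 1.0) -> bool:
    """True when every object in the right sub-tree is guaranteed to be >= eps away from q."""
    if f < 1.0:
        raise ValueError("f must be at least 1")
    return min_dr / f - d_q_root >= eps


# Hypothetical query at distance 20 from the root, threshold 10, minDR = 96.
print(can_prune_right(20, 96, 10, f=1))  # True:  96 - 20 = 76 >= 10
print(can_prune_right(20, 96, 10, f=2))  # True:  48 - 20 = 28 >= 10
print(can_prune_right(20, 96, 10, f=4))  # False: 24 - 20 = 4  <  10
```

A larger f weakens the lower bound on d(q, o_i), so the same sub-tree can be pruned less often.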

    + ketan533ketan533, 2 years ago
