Distance Based Indexing
  • 1. Distance-based Indexing for Metric Space & Almost-Metric Space. Donghui Zhang, Northeastern University
  • 2. Problem Statement
    • Given a set S of objects and a metric distance function d(), the similarity search problem is defined as: for an arbitrary object q and a threshold ε, find
    • { o | o ∈ S ∧ d(o, q) < ε }
    • Solution without an index: for every o ∈ S, compute d(q, o). Not efficient!
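For reference, the index-free solution can be sketched as a linear scan. The function name and the sample data (numbers on a line with absolute difference as the distance) are illustrative assumptions, not part of the slides.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")

def range_search_scan(objects: Iterable[T], d: Callable[[T, T], float],
                      q: T, eps: float) -> List[T]:
    """Return every o with d(o, q) < eps by computing one distance per object."""
    return [o for o in objects if d(o, q) < eps]

# Hypothetical data: numbers on a line, absolute difference as the metric.
S = [1, 4, 9, 16, 25]
print(range_search_scan(S, lambda a, b: abs(a - b), q=10, eps=7))  # [4, 9, 16]
```

Every query costs |S| distance computations, which is what an index tries to avoid.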
  • 3. Metric Function
    • d(x, x) = 0;
    • d(x, y) > 0, where x ≠ y;
    • d(x, y) = d(y, x);
    • d(x, y) ≤ d(x, z) + d(y, z).
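The four conditions can be spot-checked on a finite sample, as in the sketch below. The helper name and sample points are assumptions for illustration. Passing the check on a sample does not prove that a function is a metric in general, but one failing triple is enough to show that it is not.

```python
from itertools import product

def is_metric_on(sample, d, tol=1e-12):
    """Check the four metric conditions on every pair and triple of a finite sample."""
    for x, y in product(sample, repeat=2):
        if x == y and abs(d(x, y)) > tol:
            return False                      # violates d(x, x) = 0
        if x != y and d(x, y) <= 0:
            return False                      # violates d(x, y) > 0 for x != y
        if abs(d(x, y) - d(y, x)) > tol:
            return False                      # violates symmetry
    for x, y, z in product(sample, repeat=3):
        if d(x, y) > d(x, z) + d(y, z) + tol:
            return False                      # violates the triangle inequality
    return True

points = [0.0, 1.0, 2.5, 4.0]                 # hypothetical sample
print(is_metric_on(points, lambda a, b: abs(a - b)))    # True
print(is_metric_on(points, lambda a, b: (a - b) ** 2))  # False: 16 > 6.25 + 2.25
```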
  • 4. Spatial-Index Approach
    • If every object can be mapped to a location in space (e.g. 2-D point), there are existing solutions.
    • R-tree, Quad-tree, X-tree, …
    • Idea: break space hierarchically into partitions and store objects that are close to each other in the same partition; at query time, prune whole partitions if possible.
  • 5. Spatial Indexes Do not Apply
    • In our problem, objects can be arbitrary and we only know the distance function.
    • E.g. objects can be pictures, dogs, and so on. How would one map a dog to a multi-dimensional point? Not clear. But suppose we are given a “magical” distance function.
  • 6. VP-tree
    • Vantage point tree, by Peter N. Yianilos, “Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces”, Proc. ACM-SIAM Symposium on Discrete Algorithms, 1993.
    • Idea: build a binary search tree in which each node corresponds to an object. The root is picked at random; the n/2 objects closest to it go in the left subtree and the remaining objects go in the right subtree.
  • 7. An Example
    • S = { o1, …, o10 }.
    • Randomly pick o1 as root.
    • Compute the distance between o1 and each oi, and sort in increasing order of distance:
        o3: 5, o7: 6, o6: 18, o9: 34, o10: 96, o2: 102, o8: 111, o5: 300, o4: 401
    • Build the tree recursively: root o1; left sub-tree { o3, o7, o6, o9 } (distances up to 34); right sub-tree { o10, o2, o8, o5, o4 } (distances from 96).
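The first level of this example can be reproduced with a short sketch. The distances are the ones listed on the slide. The helper name and the choice of floor division for the "n/2" split are assumptions.

```python
# Distances d(o1, oi) as listed in the example.
dist_to_root = {"o3": 5, "o7": 6, "o6": 18, "o9": 34, "o10": 96,
                "o2": 102, "o8": 111, "o5": 300, "o4": 401}

def split_by_distance(dist):
    """Sort objects by distance to the root; the nearer half goes left, the rest right."""
    ordered = sorted(dist, key=dist.get)
    half = len(ordered) // 2          # assumption: floor division for an odd count
    return ordered[:half], ordered[half:]

left, right = split_by_distance(dist_to_root)
print(left)                                   # ['o3', 'o7', 'o6', 'o9']
print(right)                                  # ['o10', 'o2', 'o8', 'o5', 'o4']
print(max(dist_to_root[o] for o in left))     # 34
print(min(dist_to_root[o] for o in right))    # 96
```

The two boundary values, 34 and 96, are the numbers shown next to the sub-trees in the example.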
  • 8. Query Processing
    • Given object q, compute d(q, root). Intuitively, if it is small, search the left tree; otherwise, search the right tree.
    • Let maxDL = max{ d(root, oi) | oi ∈ left tree } (stored in the index).
    • Under what circumstances can we prune the left sub-tree?
  • 9. To Prune the Left Sub-Tree…
    • We need: ∀ oi ∈ left tree, d(q, oi) ≥ ε. (1)
    • We know: d(q, oi) + d(oi, root) ≥ d(q, root), or
    • d(q, oi) ≥ d(q, root) - d(oi, root), which, since d(oi, root) ≤ maxDL, implies:
    • d(q, oi) ≥ d(q, root) - maxDL. (2)
    • To guarantee (1), it’s sufficient to have:
    • d(q, root) - maxDL ≥ ε. (3)
    • Summary: given q, compare with the tree root. If d(q, root) is so large that (3) is true, we know (1) is true and we can prune the left sub-tree.
  • 10. To Prune the Right Sub-Tree…
    • Similarly, we define
    • minDR = min{ d(root, oi) | oi ∈ right tree }.
    • Given q, compare with the tree root. If d(q, root) is so small that minDR - d(q, root) ≥ ε is true, we can prune the right sub-tree.
    • Note: these prunings are done at each level of the tree.
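The two pruning tests can be combined into a range search, sketched below under stated assumptions. The class and function names, the seeded random generator, and the 1-D sample data are illustrative choices and do not come from the cited paper. Each node stores its own maxDL and minDR, and both tests are applied at every level.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, List, Optional

@dataclass
class Node:
    obj: Any
    max_dl: float = 0.0            # largest root distance in the left sub-tree
    min_dr: float = float("inf")   # smallest root distance in the right sub-tree
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build(objs: List[Any], d: Callable, rng: random.Random) -> Optional[Node]:
    """Pick a random root, put the nearer half left and the rest right, recurse."""
    if not objs:
        return None
    objs = list(objs)
    root = objs.pop(rng.randrange(len(objs)))
    node = Node(root)
    objs.sort(key=lambda o: d(root, o))
    half = len(objs) // 2
    near, far = objs[:half], objs[half:]
    if near:
        node.max_dl = d(root, near[-1])
    if far:
        node.min_dr = d(root, far[0])
    node.left, node.right = build(near, d, rng), build(far, d, rng)
    return node

def search(node, d, q, eps, out):
    """Collect every object o with d(o, q) < eps, skipping prunable sub-trees."""
    if node is None:
        return out
    dq = d(q, node.obj)
    if dq < eps:
        out.append(node.obj)
    if node.left is not None and not (dq - node.max_dl >= eps):
        search(node.left, d, q, eps, out)     # left sub-tree could not be pruned
    if node.right is not None and not (node.min_dr - dq >= eps):
        search(node.right, d, q, eps, out)    # right sub-tree could not be pruned
    return out

# Hypothetical data: numbers on a line, absolute difference as the metric.
d = lambda a, b: abs(a - b)
S = [3, 8, 15, 21, 27, 34, 40, 52, 60, 77]
tree = build(S, d, random.Random(0))
print(sorted(search(tree, d, q=25, eps=8, out=[])))   # [21, 27]
```

Because the pruning tests are only sufficient conditions, the result matches a linear scan whichever roots are picked; the choice of roots affects only how many distances are computed.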
  • 11. Can we always prune?
    • No.
    • If d(q, root) - maxDL < ε, cannot prune left;
    • If minDR - d(q, root) < ε, cannot prune right;
    • Combine together:
    • If minDR - ε < d(q, root) < maxDL + ε, we have to examine both sub-trees.
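The combined condition can be sketched as a small decision function. The function name is hypothetical, the bounds 34 and 96 are the two values shown in the earlier example (read here as maxDL and minDR), and ε = 40 is an assumed threshold.

```python
def subtrees_to_visit(dq_root, max_dl, min_dr, eps):
    """Return which sub-trees must be searched, given d(q, root) and the stored bounds."""
    visit = []
    if dq_root - max_dl < eps:      # left sub-tree cannot be pruned
        visit.append("left")
    if min_dr - dq_root < eps:      # right sub-tree cannot be pruned
        visit.append("right")
    return visit

for dq in (10, 65, 120):
    print(dq, subtrees_to_visit(dq, max_dl=34, min_dr=96, eps=40))
# 10 ['left']
# 65 ['left', 'right']
# 120 ['right']
```

With a smaller threshold such as ε = 20, d(q, root) = 65 prunes both sub-trees, since 65 - 34 ≥ 20 and 96 - 65 ≥ 20.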
  • 12. Almost Metric
    • Almost Metric was introduced in the paper “Distance Based Indexing for String Proximity Search”, ICDE’03.
    • It is similar to a metric, except that the condition d(x, y) ≤ d(x, z) + d(y, z) is relaxed to d(x, y) ≤ f * ( d(x, z) + d(y, z) ) for some constant f.
    • Can the VP-tree be used in an almost metric space?
  • 13. A thought on f
    • Must be: f ≥ 1. Why?
    • d(x, y) ≤ f * ( d(x, z) + d(y, z) )
    • d(x, y) ≤ f * ( d(x, y) + d(y, y) ) (let z = y)
    • d(x, y) ≤ f * d(x, y) (since d(y, y) = 0)
    • f ≥ 1 (since d(x, y) > 0 when x ≠ y)
  • 14. To Prune the Right Sub-Tree…
    • We need: ∀ oi ∈ right tree, d(q, oi) ≥ ε. (1)
    • We know: d(oi, root) ≤ f * ( d(q, oi) + d(q, root) ), or
    • d(q, oi) ≥ d(oi, root) / f - d(q, root), which, since d(oi, root) ≥ minDR, implies:
    • d(q, oi) ≥ minDR / f - d(q, root). (2)
    • To guarantee (1), it’s sufficient to have:
    • minDR / f - d(q, root) ≥ ε. (3)
    • Summary: given q, compare with the tree root. If d(q, root) is so small that (3) is true, we know (1) is true and we can prune the right sub-tree.
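Rearranging the almost-metric inequality d(oi, root) ≤ f * ( d(q, oi) + d(q, root) ) for d(q, oi) gives d(q, oi) ≥ d(oi, root) / f - d(q, root), so the factor f has to appear in the pruning test. The sketch below illustrates why, using squared difference on a line as an assumed example of an almost metric with f = 2 (it satisfies (a-b)² ≤ 2((a-c)² + (c-b)²) but not the triangle inequality). The objects, threshold, and function names are hypothetical.

```python
def sq(a, b):
    """Squared difference: an almost metric with f = 2, not a metric."""
    return (a - b) ** 2

F = 2  # constant in d(x, y) <= F * (d(x, z) + d(y, z)) for squared difference

def can_prune_right(dq_root, min_dr, eps, f=1.0):
    """Right sub-tree test: min_dr / f - d(q, root) >= eps (f = 1 is the metric rule)."""
    return min_dr / f - dq_root >= eps

root, o, q, eps = 0.0, 3.0, 1.5, 3.0     # hypothetical 1-D objects and threshold
min_dr = sq(root, o)                      # 9.0; o is the only right sub-tree object
dq_root = sq(q, root)                     # 2.25

print(sq(q, o) < eps)                               # True: o belongs in the answer
print(can_prune_right(dq_root, min_dr, eps, f=1))   # True: the metric rule would drop o
print(can_prune_right(dq_root, min_dr, eps, f=F))   # False: the relaxed rule keeps o
```

The price of correctness is weaker pruning: the larger f is, the less often the test succeeds.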
