# Distance Based Indexing



**1. Distance-Based Indexing for Metric Space & Almost-Metric Space**
Donghui Zhang, Northeastern University
**2. Problem Statement**
- Given a set S of objects and a metric distance function d(). The similarity search problem is defined as: for an arbitrary query object q and a threshold ε, find { o | o ∈ S ∧ d(o, q) < ε }.
- Solution without an index: for every o ∈ S, compute d(q, o). Not efficient!
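The no-index baseline above can be written directly. This is a minimal sketch: the function name `range_query_linear` and the example data are illustrative, while `S`, `d`, `q`, and `epsilon` follow the slide's notation.

```python
def range_query_linear(S, d, q, epsilon):
    """Return every object o in S with d(o, q) < epsilon, by scanning all of S."""
    return [o for o in S if d(o, q) < epsilon]

# 1-D numbers with absolute difference as the (trivially metric) distance:
S = [1, 4, 9, 16, 25]
d = lambda x, y: abs(x - y)
range_query_linear(S, d, q=10, epsilon=7)  # -> [4, 9, 16]
```

Every query costs |S| distance computations, which is exactly the inefficiency the VP-tree below is meant to avoid.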
**3. Metric Function**
- d(x, x) = 0;
- d(x, y) > 0, where x ≠ y;
- d(x, y) = d(y, x);
- d(x, y) ≤ d(x, z) + d(y, z).
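The four axioms can be spot-checked on a finite sample of objects; `check_metric` is a hypothetical helper, not part of any cited work.

```python
from itertools import product

def check_metric(objects, d):
    """Spot-check the four metric axioms on every triple drawn from a finite sample."""
    for x, y, z in product(objects, repeat=3):
        assert d(x, x) == 0                    # identity
        if x != y:
            assert d(x, y) > 0                 # positivity
        assert d(x, y) == d(y, x)              # symmetry
        assert d(x, y) <= d(x, z) + d(y, z)    # triangle inequality
    return True

check_metric([0, 1, 5, 9], lambda a, b: abs(a - b))  # -> True: |x - y| is a metric
```

Passing on a sample does not prove d is a metric, but a single failing triple disproves it.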
**4. Spatial-Index Approach**
- If every object can be mapped to a location in space (e.g. a 2-D point), there are existing solutions: R-tree, Quad-tree, X-tree, …
- Idea: break the space hierarchically into partitions and store objects that are close to each other in the same partition; at query time, prune whole partitions if possible.
**5. Spatial Indexes Do Not Apply**
- In our problem, objects can be arbitrary and we only know the distance function.
- E.g. objects can be pictures, dogs, and so on. How do we map a dog to a multi-dimensional point? Not clear. But suppose we are given the "magical" distance function.
**6. VP-tree**
- The vantage point tree, by Peter N. Yianilos, "Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces", Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA), 1993.
- Idea: build a binary search tree where each node corresponds to an object; the root is randomly picked, and the n/2 objects closest to it go into the left subtree.
**7. An Example**
- S = { o1, …, o10 }.
- Randomly pick o1 as the root.
- Compute the distance between o1 and each oi, and sort in increasing order of distance:

| oi        | o3 | o7 | o6 | o9 | o10 | o2  | o8  | o5  | o4  |
|-----------|----|----|----|----|-----|-----|-----|-----|-----|
| d(o1, oi) | 5  | 6  | 18 | 34 | 96  | 102 | 111 | 300 | 401 |

- The closer half { o3, o7, o6, o9 } (distances up to 34) forms the left subtree; the farther half { o10, o2, o8, o5, o4 } (distances from 96 up) forms the right subtree.
- Build the tree recursively.
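The recursive construction described on this slide can be sketched as follows. `VPNode` and `build_vp_tree` are illustrative names, the vantage point is chosen at random as the slide says, and objects are assumed distinct; this is not the paper's code.

```python
import random

class VPNode:
    def __init__(self, obj):
        self.obj = obj        # the vantage point stored at this node
        self.max_dl = 0.0     # max d(obj, o) over objects in the left subtree
        self.min_dr = 0.0     # min d(obj, o) over objects in the right subtree
        self.left = None
        self.right = None

def build_vp_tree(objects, d):
    """Build a VP-tree: closer half of the objects goes left, farther half right."""
    if not objects:
        return None
    root = random.choice(objects)
    # Sort the remaining objects by distance to the vantage point (assumes distinct objects).
    rest = sorted((o for o in objects if o != root), key=lambda o: d(root, o))
    node = VPNode(root)
    half = len(rest) // 2
    near, far = rest[:half], rest[half:]
    if near:
        node.max_dl = d(root, near[-1])   # boundary distance, e.g. 34 in the slide's example
        node.left = build_vp_tree(near, d)
    if far:
        node.min_dr = d(root, far[0])     # boundary distance, e.g. 96 in the slide's example
        node.right = build_vp_tree(far, d)
    return node
```

With n = 10 objects this reproduces the slide's 4/5 split: nine non-root objects, of which the four nearest go left.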
**8. Query Processing**
- Given a query object q, compute d(q, root). Intuitively, if it is small, search the left subtree; otherwise, search the right subtree.
- Let maxDL = max{ d(root, oi) | oi ∈ left tree } (stored in the index).
- Under what circumstance can we prune the left sub-tree?
**9. To Prune the Left Sub-Tree…**
- We need: ∀ oi ∈ left tree, d(q, oi) ≥ ε.  (1)
- We know d(q, oi) + d(oi, root) ≥ d(q, root), i.e. d(q, oi) ≥ d(q, root) − d(oi, root), which implies:
  d(q, oi) ≥ d(q, root) − maxDL.  (2)
- To guarantee (1), it is sufficient to have:
  d(q, root) − maxDL ≥ ε.  (3)
- Summary: given q, compare it with the tree root. If d(q, root) is so large that (3) holds, then (1) holds and we can prune the left sub-tree.
**10. To Prune the Right Sub-Tree…**
- Similarly, we define minDR = min{ d(root, oi) | oi ∈ right tree }.
- Given q, compare it with the tree root. If d(q, root) is so small that minDR − d(q, root) ≥ ε, we can prune the right sub-tree.
- Note: these prunings are applied at every level of the tree.
**11. Can We Always Prune?**
- No.
- If d(q, root) − maxDL < ε, we cannot prune the left subtree;
- if minDR − d(q, root) < ε, we cannot prune the right subtree.
- Combining the two: if minDR − ε < d(q, root) < maxDL + ε, we have to examine both sub-trees.
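The pruning rules of slides 9–11 combine into one recursive range query. This is a sketch over a hypothetical `Node` with the `max_dl`/`min_dr` bounds the slides describe, not code from the talk; note both recursive calls may fire, matching slide 11's "no pruning" case.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Node:
    obj: Any
    max_dl: float = 0.0            # max d(obj, o) over the left subtree
    min_dr: float = 0.0            # min d(obj, o) over the right subtree
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def vp_range_query(node, d, q, epsilon, out=None):
    """Collect every object o with d(q, o) < epsilon, pruning subtrees when safe."""
    if out is None:
        out = []
    if node is None:
        return out
    dq = d(q, node.obj)
    if dq < epsilon:
        out.append(node.obj)
    # Slide 9: the left subtree can hold answers unless d(q, root) - maxDL >= epsilon.
    if dq - node.max_dl < epsilon:
        vp_range_query(node.left, d, q, epsilon, out)
    # Slide 10: the right subtree can hold answers unless minDR - d(q, root) >= epsilon.
    if node.min_dr - dq < epsilon:
        vp_range_query(node.right, d, q, epsilon, out)
    return out

# A hand-built two-level tree over numbers with d(x, y) = |x - y|:
tree = Node(0, max_dl=5, min_dr=100, left=Node(5), right=Node(100))
d = lambda x, y: abs(x - y)
vp_range_query(tree, d, q=3, epsilon=4)  # -> [0, 5]; the right subtree is pruned
```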
**12. Almost Metric**
- The almost-metric was introduced in the paper "Distance Based Indexing for String Proximity Search", ICDE '03.
- It is similar to a metric, except that the triangle inequality d(x, y) ≤ d(x, z) + d(y, z) is relaxed to d(x, y) ≤ f · (d(x, z) + d(y, z)) for some constant f.
- Can the VP-tree be used in an almost-metric space?
**13. A Thought on f**
- It must be that f ≥ 1. Why?
- d(x, y) ≤ f · (d(x, z) + d(y, z))
- d(x, y) ≤ f · (d(x, y) + d(y, y))  (let z = y)
- d(x, y) ≤ f · d(x, y)  (since d(y, y) = 0)
- f ≥ 1  (divide by d(x, y), which is > 0 when x ≠ y)
**14. To Prune the Right Sub-Tree (Almost Metric)…**
- We need: ∀ oi ∈ right tree, d(q, oi) ≥ ε.  (1)
- We know d(oi, root) ≤ f · (d(q, oi) + d(q, root)), i.e. d(q, oi) ≥ d(oi, root)/f − d(q, root), which implies:
  d(q, oi) ≥ minDR/f − d(q, root).  (2)
- To guarantee (1), it is sufficient to have:
  minDR/f − d(q, root) ≥ ε.  (3)
- Summary: given q, compare it with the tree root. If d(q, root) is so small that (3) holds, then (1) holds and we can prune the right sub-tree.
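The relaxed test is a one-line predicate. This sketch assumes the sufficient condition minDR/f − d(q, root) ≥ ε, which follows from the almost-metric inequality d(oi, root) ≤ f · (d(q, oi) + d(q, root)); the function name and example numbers are invented.

```python
def can_prune_right(dq_root, min_dr, epsilon, f=1.0):
    """Sufficient condition to skip the right subtree under an almost-metric
    with relaxation factor f: minDR / f - d(q, root) >= epsilon.
    With f = 1 this reduces to the ordinary metric-space test of slide 10."""
    return min_dr / f - dq_root >= epsilon

can_prune_right(dq_root=10, min_dr=96, epsilon=20)         # -> True  (96 - 10 >= 20)
can_prune_right(dq_root=10, min_dr=96, epsilon=20, f=4.0)  # -> False (24 - 10 < 20)
```

Since f ≥ 1 (slide 13), dividing minDR by f only weakens the bound, so a larger f means fewer opportunities to prune, as the example shows.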