Distance-based Indexing for Metric Space & Almost-Metric Space
Donghui Zhang, Northeastern University
Problem Statement
Given a set S of objects and a metric distance function d(). The similarity search problem is defined as: for an arbitrary object q and a threshold ε, find
{ o | o ∈ S ∧ d(o, q) < ε }
Solution without index: for every o ∈ S, compute d(q, o). Not efficient: it costs one distance computation per object for every query.
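The index-free solution can be written in a few lines, which makes a useful baseline for checking any index against. This is a minimal sketch; the function name `linear_scan`, the integer data, and the absolute-difference distance are illustrative assumptions, not part of the slides.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def linear_scan(objects: Iterable[T], q: T, eps: float,
                d: Callable[[T, T], float]) -> List[T]:
    """Return every object o with d(o, q) < eps (one distance call per object)."""
    return [o for o in objects if d(o, q) < eps]


def abs_diff(a: float, b: float) -> float:
    """Illustrative metric on numbers (an assumption for this example)."""
    return abs(a - b)


print(linear_scan([1, 5, 9, 12, 20], q=10, eps=3, d=abs_diff))  # [9, 12]
```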
Metric Function
d(x, x) = 0;
d(x, y) > 0, where x ≠ y;
d(x, y) = d(y, x);
d(x, y) ≤ d(x, z) + d(y, z).
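The four conditions can be spot-checked on a finite sample. The sketch below is only a sanity check under assumptions: the helper name `is_metric_on` is hypothetical, object identity is tested with `==`, and passing on a sample does not prove that a function is a metric everywhere.

```python
from itertools import product


def is_metric_on(sample, d, tol=1e-12):
    """Check the four metric conditions on all pairs and triples of `sample`."""
    for x, y in product(sample, repeat=2):
        if x == y and abs(d(x, y)) > tol:
            return False  # violates d(x, x) = 0
        if x != y and d(x, y) <= 0:
            return False  # violates d(x, y) > 0 for x != y
        if abs(d(x, y) - d(y, x)) > tol:
            return False  # violates symmetry
    for x, y, z in product(sample, repeat=3):
        if d(x, y) > d(x, z) + d(y, z) + tol:
            return False  # violates the triangle inequality
    return True


points = [0, 1, 3, 7]
print(is_metric_on(points, lambda a, b: abs(a - b)))    # True
print(is_metric_on(points, lambda a, b: (a - b) ** 2))  # False
```

The squared difference fails only the last condition: for x = 0, y = 3, z = 1 it gives 9 on the left and 1 + 4 = 5 on the right.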
Spatial-Index Approach
If every object can be mapped to a location in space (e.g. 2-D point), there are existing solutions.
R-tree, Quad-tree, X-tree, …
Idea: break space hierarchically into partitions and store objects that are close to each other in the same partition; at query time, prune whole partitions if possible.
Spatial Indexes Do not Apply
In our problem, objects can be arbitrary and we only know the distance function.
E.g., objects can be pictures, dogs, and so on. How would one map a dog to a multi-dimensional point? Not clear. But suppose we are given a “magical” distance function.
VP-tree
Vantage point tree, by Peter N. Yianilos, “Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces”, Proc. ACM-SIAM Symposium on Discrete Algorithms, 1993.
Idea: build a binary search tree, where each node corresponds to an object; the root is randomly picked; of the n remaining objects, the n/2 closest to the root go in the left subtree and the rest go in the right subtree.
An Example
S = {o₁, …, o₁₀}.
Randomly pick o₁ as root.
Compute the distance between o₁ and each oᵢ, and sort in increasing order of distance:
o₃: 5, o₇: 6, o₆: 18, o₉: 34, o₁₀: 96, o₂: 102, o₈: 111, o₅: 300, o₄: 401
Build the tree recursively:
root o₁; left subtree {o₃, o₇, o₆, o₉} (distances up to 34); right subtree {o₁₀, o₂, o₈, o₅, o₄} (distances from 96).
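The first split of this example can be reproduced directly. The sketch assumes the figure's numbers pair up as d(o1, o3) = 5, d(o1, o7) = 6, d(o1, o6) = 18, d(o1, o9) = 34, d(o1, o10) = 96, d(o1, o2) = 102, d(o1, o8) = 111, d(o1, o5) = 300 and d(o1, o4) = 401, a pairing inferred from the order in which they appear. The helper name `split_by_distance` is hypothetical.

```python
# Distances d(o1, oi) as read from the example (pairing inferred from the figure).
dist_from_o1 = {"o2": 102, "o3": 5, "o4": 401, "o5": 300, "o6": 18,
                "o7": 6, "o8": 111, "o9": 34, "o10": 96}


def split_by_distance(dist):
    """Sort by distance to the root; the closest half goes to the left subtree."""
    ordered = sorted(dist, key=dist.get)
    half = len(ordered) // 2
    left, right = ordered[:half], ordered[half:]
    max_dl = dist[left[-1]]   # largest root distance in the left subtree
    min_dr = dist[right[0]]   # smallest root distance in the right subtree
    return left, right, max_dl, min_dr


left, right, max_dl, min_dr = split_by_distance(dist_from_o1)
print(left, max_dl)   # ['o3', 'o7', 'o6', 'o9'] 34
print(right, min_dr)  # ['o10', 'o2', 'o8', 'o5', 'o4'] 96
```

Only the first level is shown, because the example gives distances from o1 alone; building the lower levels would need the pairwise distances among the other objects.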
Query Processing
Given object q, compute d(q, root). Intuitively, if it’s small, search the left tree; otherwise, search the right tree.
Let maxDL = max{ d(root, oᵢ) | oᵢ ∈ left tree }
(stored in the index).
Under what circumstance can we prune the left sub-tree?
To Prune the Left Sub-Tree…
We need: ∀ oᵢ ∈ left tree, d(q, oᵢ) ≥ ε. (1)
We know: d(q, oᵢ) + d(oᵢ, root) ≥ d(q, root), or
d(q, oᵢ) ≥ d(q, root) − d(oᵢ, root), which implies:
d(q, oᵢ) ≥ d(q, root) − maxDL. (2)
To guarantee (1), it’s sufficient to have:
d(q, root) − maxDL ≥ ε. (3)
Summary: given q, compare with the tree root. If d(q, root) is so large that (3) is true, we know (1) is true and we can prune the left sub-tree.
To Prune the Right Sub-Tree…
Similarly, we define
minDR = min{ d(root, oᵢ) | oᵢ ∈ right tree }.
Given q, compare with the tree root. If d(q, root) is so small that minDR − d(q, root) ≥ ε is true, we can prune the right sub-tree.
Note: these prunings are done at each level of the tree.
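The two pruning tests are simple comparisons on stored values. The sketch below writes them as predicates and tries them with the boundary values 34 and 96 from the example; the function names, the query distances, and the thresholds are assumptions chosen for illustration.

```python
def can_prune_left(d_q_root, max_dl, eps):
    """True when every left-subtree object is at distance >= eps from q."""
    return d_q_root - max_dl >= eps


def can_prune_right(d_q_root, min_dr, eps):
    """True when every right-subtree object is at distance >= eps from q."""
    return min_dr - d_q_root >= eps


max_dl, min_dr = 34, 96

# q close to the root: the right subtree can be skipped.
print(can_prune_left(20, max_dl, 10), can_prune_right(20, min_dr, 10))   # False True
# q far from the root: the left subtree can be skipped.
print(can_prune_left(150, max_dl, 10), can_prune_right(150, min_dr, 10)) # True False
# q in the gap between 34 and 96 with a small threshold: both can be skipped.
print(can_prune_left(60, max_dl, 10), can_prune_right(60, min_dr, 10))   # True True
# Same q with a large threshold: neither can be skipped.
print(can_prune_left(60, max_dl, 40), can_prune_right(60, min_dr, 40))   # False False
```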
Can we always prune?
No.
If d(q, root) − maxDL < ε, cannot prune left;
If minDR − d(q, root) < ε, cannot prune right;
Combine together:
If minDR − ε < d(q, root) < maxDL + ε, we have to examine both sub-trees.
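Putting construction and pruning together gives a complete range search. What follows is a minimal sketch in the spirit of the idea described here, not Yianilos's exact algorithm: the class and function names, the integer data, and the absolute-difference metric are all assumptions. Each node stores maxDL and minDR for its own subtrees, so both tests are applied at every level.

```python
import random
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Node:
    obj: Any
    max_dl: float              # max distance from obj to its left subtree
    min_dr: float              # min distance from obj to its right subtree
    left: Optional["Node"]
    right: Optional["Node"]


def build(objects, d, rng):
    """Pick a random root, sort the rest by distance, split in half, recurse."""
    if not objects:
        return None
    rest = list(objects)
    root = rest.pop(rng.randrange(len(rest)))
    rest.sort(key=lambda o: d(root, o))
    half = len(rest) // 2
    near, far = rest[:half], rest[half:]
    max_dl = d(root, near[-1]) if near else 0.0
    min_dr = d(root, far[0]) if far else float("inf")
    return Node(root, max_dl, min_dr, build(near, d, rng), build(far, d, rng))


def range_search(node, q, eps, d, out):
    """Append to `out` every object o in the subtree with d(o, q) < eps."""
    if node is None:
        return out
    d_q_root = d(q, node.obj)
    if d_q_root < eps:
        out.append(node.obj)
    if d_q_root - node.max_dl < eps:    # left subtree cannot be pruned
        range_search(node.left, q, eps, d, out)
    if node.min_dr - d_q_root < eps:    # right subtree cannot be pruned
        range_search(node.right, q, eps, d, out)
    return out


def abs_diff(a, b):
    """Illustrative metric on numbers (an assumption for this example)."""
    return abs(a - b)


data = list(range(0, 100, 7))
tree = build(data, abs_diff, random.Random(0))
print(sorted(range_search(tree, 40, 10, abs_diff, [])))  # [35, 42, 49]
```

The tree shape depends on the random seed, but the answer does not: any correct pruning rule must return the same set as the index-free scan.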
Almost Metric
Almost Metric was introduced in the paper “Distance Based Indexing for String Proximity Search”, ICDE’03.
It is similar to metric, with the difference that the condition d ( x,y ) ≤ d ( x,z )+ d ( y,z ) is changed to d ( x,y ) ≤ f * ( d ( x,z )+ d ( y,z ) ) for some constant f .
Can the VP-tree be used in an almost metric space?
A thought on f
Must be: f ≥ 1. Why?
d(x, y) ≤ f * (d(x, z) + d(y, z))
d(x, y) ≤ f * (d(x, y) + d(y, y))   (let z = y)
d(x, y) ≤ f * d(x, y)   (since d(y, y) = 0)
f ≥ 1   (since d(x, y) > 0 for x ≠ y)
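To get a feel for f, one can measure the largest ratio d(x, y) / (d(x, z) + d(y, z)) over a sample. This is a sketch under assumptions: the helper name `smallest_f_on` is hypothetical, and the squared difference is used only as a convenient example of a distance that breaks the triangle inequality; it is not the string distance from the ICDE'03 paper. A value measured on a sample is only a lower bound on the constant needed for the whole space.

```python
from itertools import permutations


def smallest_f_on(sample, d):
    """Largest d(x, y) / (d(x, z) + d(y, z)) over distinct triples of `sample`."""
    return max(d(x, y) / (d(x, z) + d(y, z))
               for x, y, z in permutations(sample, 3))


points = [0, 1, 2, 5, 9]
print(smallest_f_on(points, lambda a, b: abs(a - b)))    # 1.0
print(smallest_f_on(points, lambda a, b: (a - b) ** 2))  # 2.0
```

For the squared difference, f = 2 is enough on the whole real line, because (a + b)² ≤ 2(a² + b²).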
To Prune the Right Sub-Tree…
We need: ∀ oᵢ ∈ right tree, d(q, oᵢ) ≥ ε. (1)
We know: d(oᵢ, root) ≤ f * (d(q, oᵢ) + d(q, root)), or
d(q, oᵢ) ≥ d(oᵢ, root) / f − d(q, root), which implies:
d(q, oᵢ) ≥ minDR / f − d(q, root). (2)
To guarantee (1), it’s sufficient to have:
minDR / f − d(q, root) ≥ ε. (3)
Summary: given q, compare with the tree root. If d(q, root) is so small that (3) is true, we know (1) is true and we can prune the right sub-tree. With f = 1 this reduces to the metric condition.
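In code, the relaxed inequality changes the pruning tests only by a division by f. Dividing d(oᵢ, root) ≤ f * (d(q, oᵢ) + d(q, root)) by f gives d(q, oᵢ) ≥ d(oᵢ, root) / f − d(q, root), so the right-subtree bound uses minDR / f. The left-subtree test below is derived by the same argument applied to d(q, root) ≤ f * (d(q, oᵢ) + d(oᵢ, root)); it is not spelled out in the slides and should be read as an assumption. Function names and numbers are illustrative.

```python
def can_prune_right_almost(d_q_root, min_dr, eps, f):
    """Right-subtree test when d(x, y) <= f * (d(x, z) + d(y, z))."""
    return min_dr / f - d_q_root >= eps


def can_prune_left_almost(d_q_root, max_dl, eps, f):
    """Left-subtree test, derived by the same argument (an assumption)."""
    return d_q_root / f - max_dl >= eps


# With f = 1 the tests coincide with the metric ones.
print(can_prune_right_almost(60, min_dr=96, eps=10, f=1))  # True
# A larger f weakens the bound, so less can be pruned.
print(can_prune_right_almost(60, min_dr=96, eps=10, f=2))  # False
print(can_prune_right_almost(30, min_dr=96, eps=10, f=2))  # True
print(can_prune_left_almost(150, max_dl=34, eps=10, f=2))  # True
print(can_prune_left_almost(80, max_dl=34, eps=10, f=2))   # False
```

So the VP-tree structure can stay the same in an almost-metric space; only the pruning thresholds become more conservative as f grows.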