  1. 1. A Quantitative analysis and performance study for similarity search methods in high-dimensional spaces<br />Group 4<br />SeokhwanEom,<br />Jungyeol Lee, <br />Rina You,<br />Kilho Lee,<br />
  2. 2. Contents<br />Introduction<br />Observations<br />Analysis of NN-search<br />VA-file<br />Conclusion<br />2<br />Presenter: SeokhwanEom<br />
  3. 3. The Similarity Search Paradigm<br />Presenter: SeokhwanEom<br />( Reference : What’s wrong with high-dimensional similarity search?, S. Blott, VLDB 2008 )<br />3<br />
  4. 4. The Similarity Search Paradigm <br />Presenter: SeokhwanEom<br />Locate closest point to query object, i.e. its nearest neighbor(NN)<br />( Reference : What’s wrong with high-dimensional similarity search?, S. Blott, VLDB 2008 )<br />4<br />
  5. 5. The conventional approach<br />Space-partitioning methods<br />- Gridfile [Nievergelt:1984]<br />- K-D-B tree [Robinson:1981]<br />- Quad tree [Finkel:1974]<br />Data-partitioning index trees<br />Unfortunately, as the number of dimensions increases, their performance degrades<br />- the curse of dimensionality<br />5<br />Presenter: SeokhwanEom<br />
  6. 6. Contribution<br />Assumptions : initially uniformly-distributed data within the unit hypercube, with independent dimensions<br />Establish lower bounds on the average performance of NN-search for space-partitioning, data-partitioning, and clustering structures.<br />Show formally that any partitioning scheme and clustering technique must degenerate to a sequential scan through all their blocks if the number of dimensions is sufficiently large.<br />Present performance results which support their analysis, and demonstrate that the VA-file offers the best performance in practice whenever the number of dimensions is larger than 6. <br />6<br />Presenter: SeokhwanEom<br />
  7. 7. Presenter: SeokhwanEom<br />The Difficulties of High Dimensionality<br />Observation 1 (Number of partitions)<br />A simple partitioning scheme :<br /> split the data space in each dimension into two halves.<br />This seems reasonable with low dimensions.<br />But with d = 100 there are 2^100 ≈ 10^30 partitions;<br /> even with 10^6 points, almost all partitions are empty (roughly 10^24 partitions per point).<br />7<br />
  8. 8. Presenter: SeokhwanEom<br />The Difficulties of High Dimensionality<br />Observation 2 (Data space is sparsely populated)<br /> Consider a hyper-cube range query with side length s = 0.95.<br />At d = 100, the probability of a hit is s^d = 0.95^100 ≈ 0.59%.<br />Data space Ω = [0,1]^d<br />Figure: target region, a hyper-cube of side s inside Ω<br />8<br />
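The collapse of the hit probability s^d can be reproduced directly (a sketch of ours, assuming the uniform-data model from the slides):

```python
# Observation 2: a hyper-cube range query of side s contains a uniform
# point with probability s^d, which shrinks fast even for s close to 1.
s = 0.95
for d in (1, 10, 100):
    print(d, s**d)  # d=100 gives roughly 0.0059
```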
  9. 9. Presenter: SeokhwanEom<br />The Difficulties of High Dimensionality<br />Observation 3 (Spherical range queries)<br /> The probability that an arbitrary point R lies within the largest spherical query.<br />Figure: Largest range query entirely within the data space.<br />Table: Probability that a point lies within the largest range query inside Ω, and the expected database size<br />9<br />
  10. 10. Presenter: SeokhwanEom<br />The Difficulties of High Dimensionality<br />Observation 4 (Exponentially growing DB size)<br /> The size a data set would have to have such that, on average, at least one point falls into the largest inscribed sphere (for even d):<br />Table: Probability that a point lies within the largest range query inside Ω, and the expected database size<br />10<br />
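The expected database size is the reciprocal of the inscribed sphere's volume. The following sketch (our illustration; the function name and test dimensions are ours) makes the exponential growth concrete:

```python
import math

def inscribed_sphere_volume(d, r=0.5):
    """Volume of the largest sphere fully inside the unit hypercube [0,1]^d."""
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)

# Expected DB size so that, on average, one point falls into the sphere:
for d in (2, 10, 20):
    print(d, round(1 / inscribed_sphere_volume(d)))
```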
  11. 11. Presenter: SeokhwanEom<br />The Difficulties of High Dimensionality<br />Observation 5 (Expected NN-distance)<br />The probability that the NN-distance is at most r (i.e. the probability that the NN of query point Q is contained in sp^d(Q, r)):<br />The expected NN-distance for a query point Q :<br />The expected NN-distance E[nndist] for any query point in the data space :<br />11<br />
  12. 12. Presenter: SeokhwanEom<br />The Difficulties of High Dimensionality<br />Observation 5 (Expected NN-distance)<br />The NN-distance grows steadily with d<br />Beyond trivially-small data sets D, NN-distances decrease only marginally as the size of D increases.<br />12<br />
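The growth of the NN-distance with d can be observed empirically with a small Monte Carlo experiment (our sketch, assuming uniform data in the unit cube; the sample sizes are chosen only to keep the run fast):

```python
import math
import random

def mean_nn_distance(d, n=500, trials=5, seed=0):
    """Monte Carlo estimate of the expected NN-distance in [0,1]^d."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        data = [[rng.random() for _ in range(d)] for _ in range(n)]
        query = [rng.random() for _ in range(d)]
        total += min(math.dist(query, p) for p in data)
    return total / trials

for d in (2, 10, 50):
    print(d, round(mean_nn_distance(d), 3))  # grows steadily with d
```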
  13. 13. Analysis of NN-Search<br />Presenter: Jungyeol Lee<br />The complexity of any partitioning and clustering scheme converges to O(N) with increasing dimensionality<br />General Cost Model<br />Space-Partitioning Methods<br />Data-Partitioning Methods<br />General Partitioning and Clustering Schemes<br />13<br />
  14. 14. General Cost Model<br />‘Cost’ of a query:<br />the number of blocks which must be accessed<br />Optimal NN search algorithm: <br />Blocks visited during the search<br /> = blocks whose MBRs1) intersect the NN-sphere<br />Presenter: Jungyeol Lee<br />1) MBR: Minimum Bounding Region<br />14<br />
  15. 15. General Cost Model <br />The number of blocks visited<br /> = the number of blocks whose MBRs<br /> intersect the NN-sphere<br />Transform the spherical query into a point query<br />Minkowski sum of each MBR with the NN-sphere<br />Presenter: Jungyeol Lee<br />15<br />
  16. 16. <ul><li>Transform the spherical query into a point query
  17. 17. Probability that the i-th block must be visited</li></ul>General Cost Model <br />16<br />Presenter: Jungyeol Lee<br />
  18. 18. Space-Partitioning Methods <br />Dividing the data space regardless of clusters<br />If each dimension is split once, <br /> the total # of partitions is 2^d, with a correspondingly large space overhead<br />To reduce the space overhead, only d' dimensions are split such that, on average, enough points are assigned to each partition<br />17<br />Presenter: Jungyeol Lee<br />
  19. 19. Space-Partitioning Methods <br />Presenter: Jungyeol Lee<br />The NN-distance reaches the maximum distance from the query point to any point in the data space<br /> at some dimensionality<br />From that dimensionality on, the Minkowski sum covers the entire data space <br /> the visit probability converges to 1 - the same as a sequential scan<br />18<br />
  20. 20. Space-Partitioning Methods <br />Fig. 7 Comparison of with <br />19<br />Presenter: Jungyeol Lee<br />
  21. 21. Data-Partitioning Methods<br />Data-partitioning methods partition the data space hierarchically <br />In order to reduce the search cost from O(N) to O(log N)<br />Impracticability of existing methods for NN-search in HDVSs: <br />A sequential scan out-performed these more sophisticated hierarchical methods.<br />20<br />Presenter: Rina You<br />
  22. 22. Rectangular MBRs<br />Index methods use hyper-cubes to bound the region of a block.<br />Splitting a node results in two new, equally-full partitions of the data space.<br />d’ dimensions are split at high dimensionality<br />21<br />Presenter: Rina You<br />
  23. 23. Rectangular MBRs<br />rectangular MBR <br />d’ sides with a length of 1/2<br />d - d’ sides with a length of 1.<br />the probability of visiting a block during<br /> an NN-search is<br />the volume of the part of the extended box lying inside the data space<br />22<br />Presenter: Rina You<br />
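Under the slides' model, this visit probability can be sketched numerically: extend the box by the NN-distance r on each side, clip to the unit cube, and take the volume. This is our own hedged approximation (function name and sample values are ours):

```python
def visit_probability(d_split, r):
    """Clipped volume of an MBR with d_split sides of length 1/2 (the other
    sides span the whole data space), extended by the NN-distance r per side."""
    side = min(1.0, 0.5 + 2 * r)  # extension clipped to the unit cube
    return side ** d_split

print(visit_probability(20, 0.10))  # small: pruning still helps
print(visit_probability(20, 0.25))  # 1.0: every block must be visited
```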
  24. 24. Rectangular MBRs<br />the probability of accessing a block during a NN-search <br />different database sizes and different values of d’<br />23<br />Presenter: Rina You<br />
  25. 25. Spherical MBRs<br />Another group of index structures<br />MBRs in the form of hyper-spheres.<br />Each block of optimal structure consists of <br />the center point C <br />m - 1 nearest neighbors <br />MBR can be described by <br />24<br />Presenter: Rina You<br />
  26. 26. Spherical MBRs<br />The probability of accessing a block during the search.<br />MBRs in the form of hyper-spheres : <br />use a Minkowski sum<br />The probability that a block must be visited during an NN-search<br />25<br />Presenter: Rina You<br />
  27. 27. Spherical MBRs<br />another lower bound for this probability<br /> replace by <br />If increases, does not decrease.<br />26<br />Presenter: Rina You<br />
  28. 28. Spherical MBRs<br />The probability of accessing a block<br /> during the search<br />average the above probability over all center points :<br />27<br />Presenter: Rina You<br />
  29. 29. Spherical MBRs<br />Presenter: Rina You<br />percentage of blocks visited increases rapidly with the dimensionality<br />sequential scan will perform better in practice<br />28<br />
  30. 30. General Partitioning and Clustering Schemes<br />Presenter: Rina You<br />No partitioning or clustering scheme can offer efficient NN-search <br />if the number of dimensions becomes large.<br />The complexity of these methods :<br />a large portion (up to 100%) of the data blocks must be read <br />in order to determine the nearest neighbor.<br />29<br />
  31. 31. General Partitioning and Clustering Schemes<br />Presenter: Rina You<br />Basic assumptions:<br />1. A cluster is a geometrical form (MBR) that covers all cluster points<br />2. Each cluster contains at least two points<br />3. The MBR of a cluster is convex.<br />30<br />
  32. 32. General Partitioning and Clustering Schemes<br />Average probability of accessing a cluster during an NN-search<br />31<br />Presenter: Rina You<br />
  33. 33. General Partitioning and Clustering Schemes<br />Presenter: Rina You<br />Lower bound the average probability of accessing a line cluster.<br />Pick two arbitrary data points<br />(each cluster contains at least two points); <br /> the line between them is contained in the MBR,<br /> since the MBR is convex.<br />Lower bound the volume of the extended<br /> MBR : <br />32<br />
  34. 34. General Partitioning and Clustering Schemes<br />Presenter: Rina You<br />Lower bound the distance between <br /> Ai and Bi :<br />Points on the surface of the NN-sphere of Ai have the minimal Minkowski sum for line(Ai, Bi)<br />Line(Ai, Pi) is the optimal line cluster for point Ai<br />if Pi is a point on the surface of the NN-sphere of Ai.<br />33<br />
  35. 35. General Partitioning and Clustering Schemes<br />Lower bound the average probability<br /> of accessing a line cluster<br />Calculate the average volume of the Minkowski sums over all possible pairs A and P(A) in the data space<br />34<br />Presenter: Rina You<br />
  36. 36. General Partitioning and Clustering Schemes<br />Presenter: Rina You<br />Conclusion 1 (Performance)<br />For any clustering and partitioning method, a simple sequential scan performs better <br /> if the number of dimensions exceeds some d.<br />Conclusion 2 (Complexity)<br />The complexity of any clustering and partitioning method tends towards O(N) <br /> as dimensionality increases.<br />35<br />
  37. 37. General Partitioning and Clustering Schemes<br />Presenter: Rina You<br />Conclusion 3 (Degeneration)<br />All blocks are accessed <br /> if the number of dimensions exceeds some d<br />36<br />
  38. 38. The VA-file<br />Accelerates that unavoidable scan by using object approximations to compress the vector data.<br />Reduces the amount of data that must be read during similarity searches.<br />Compressing vector data<br />The filtering step<br />Accessing the data<br />37<br />Presenter: Kilho Lee<br />
  39. 39. The VA-file Compressing vector data<br />Presenter: Kilho Lee<br /><ul><li> For each dimension i, a small number of bits (bi) is assigned
  40. 40. Let b be the sum of all bi’s,
  41. 41. The data space is divided into 2^b cells </li></ul>38<br />
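The quantization step can be sketched in a few lines. This is our own minimal illustration of the idea (coordinates assumed normalized to [0,1); the function name is ours):

```python
def approximate(vector, bits):
    """Replace each coordinate in [0,1) by its uniform grid-cell index,
    using bits[i] bits for dimension i (2^bits[i] cells per dimension)."""
    cells = [1 << b for b in bits]
    return [min(int(x * c), c - 1) for x, c in zip(vector, cells)]

print(approximate([0.12, 0.87, 0.50], bits=[2, 2, 3]))  # -> [0, 3, 4]
```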
  42. 42. The VA-file  Filtering step<br />Presenter: Kilho Lee<br /><ul><li> When searching for the nearest neighbor, the entire approximation file </li></ul> is scanned and upper and lower bounds on the distance to the query are computed<br /><ul><li> Let δ be the smallest upper bound found so far.
  43. 43. If an approximation's lower bound exceeds δ, it is filtered out.</li></ul>( Reference : What’s wrong with high-dimensional similarity search?, S. Blott, VLDB 2008 )<br />39<br />
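The filtering logic can be sketched as a single pass over precomputed bounds, followed by a re-check against the final δ. This is our hedged illustration of the rule stated above (the function name and example bounds are ours):

```python
def filter_candidates(bounds):
    """bounds: list of (lower, upper) distance bounds per approximation.
    Keep only approximations whose lower bound does not exceed delta,
    the smallest upper bound seen."""
    delta = float("inf")
    candidates = []
    for i, (lo, hi) in enumerate(bounds):
        if lo <= delta:              # cannot be pruned yet
            candidates.append(i)
            delta = min(delta, hi)   # tighten the pruning threshold
    # re-check earlier survivors against the final, tightest delta
    return [i for i in candidates if bounds[i][0] <= delta]

print(filter_candidates([(0.1, 0.4), (0.5, 0.9), (0.2, 0.3)]))  # -> [0, 2]
```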
  44. 44. The VA-file  Filtering step<br />Presenter: Kilho Lee<br /><ul><li>  After the filtering step, fewer than 0.1% of the vectors remain.</li></ul>40<br />
  45. 45. The VA-file Accessing the vector <br />Presenter: Kilho Lee<br /><ul><li> After the filtering step, a small set of candidates remain.
  46. 46. Candidates are sorted by their lower bounds
  47. 47. If a lower bound is encountered that exceeds the nearest distance seen </li></ul> so far, the VA-file method stops.<br />41<br />
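The access phase described above can be sketched as follows (our illustration; the function name, arguments, and example data are ours):

```python
import math

def nearest(query, vectors, lower_bounds):
    """Visit candidates in ascending lower-bound order; stop as soon as the
    next lower bound already exceeds the best exact distance found."""
    order = sorted(range(len(vectors)), key=lambda i: lower_bounds[i])
    best_dist, best_id = math.inf, None
    for i in order:
        if lower_bounds[i] > best_dist:
            break                        # no later candidate can be closer
        d = math.dist(query, vectors[i])
        if d < best_dist:
            best_dist, best_id = d, i
    return best_id, best_dist

print(nearest([0.0, 0.0], [[0.3, 0.4], [0.0, 0.1]], [0.2, 0.05]))  # -> (1, 0.1)
```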
  48. 48. The VA-file  Accessing the vector<br />Presenter: Kilho Lee<br /><ul><li> Fewer than 1% of the vector blocks are visited.
  49. 49. In the case d = 50, bi = 6, N = 500,000, only 20 vectors are accessed. </li></ul>42<br />
  50. 50. Performance <br />Presenter: Kilho Lee<br /><ul><li>Figure depicts the percentage of blocks visited.</li></ul>43<br />
  51. 51. Conclusion<br />Presenter: Kilho Lee<br /><ul><li>Conventional indexing methods are out-performed by a </li></ul> simple sequential scan at moderate dimensionality (d = 10)<br /><ul><li>At moderate and high dimensionality (d ≥ 6), the VA-file method</li></ul> can out-perform any other method.<br />44<br />
