- 1. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces
  Group 4: Seokhwan Eom, Jungyeol Lee, Rina You, Kilho Lee
- 2. Contents
  Introduction
  Observations
  Analysis of NN-search
  VA-file
  Conclusion
  (Presenter: Seokhwan Eom)
- 3. The Similarity Search Paradigm
  (Presenter: Seokhwan Eom)
  (Reference: What's wrong with high-dimensional similarity search?, S. Blott, VLDB 2008)
- 4. The Similarity Search Paradigm
  Locate the closest point to the query object, i.e. its nearest neighbor (NN).
  (Presenter: Seokhwan Eom)
  (Reference: What's wrong with high-dimensional similarity search?, S. Blott, VLDB 2008)
- 5. The Conventional Approach
  Space-partitioning methods:
  - Grid file [Nievergelt:1984]
  - K-D-B tree [Robinson:1981]
  - Quad tree [Finkel:1974]
  Data-partitioning index trees
  Unfortunately, as the number of dimensions increases, their performance degrades: the curse of dimensionality.
  (Presenter: Seokhwan Eom)
- 6. Contribution
  Assumptions: uniformly distributed data within the unit hypercube, with independent dimensions.
  Establish lower bounds on the average performance of NN-search for space-partitioning, data-partitioning, and clustering structures.
  Show formally that any partitioning scheme and clustering technique must degenerate to a sequential scan through all blocks if the number of dimensions is sufficiently large.
  Present performance results which support the analysis, and demonstrate that the VA-file offers the best performance in practice whenever the number of dimensions is larger than 6.
  (Presenter: Seokhwan Eom)
- 7. The Difficulties of High Dimensionality
  Observation 1 (Number of partitions)
  A simple partitioning scheme: split the data space in each dimension into two halves.
  This seems reasonable in low dimensions, but with d = 100 there are 2^100 ≈ 10^30 partitions; even with 10^6 points, almost all of the partitions are empty.
  (Presenter: Seokhwan Eom)
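A quick sketch of Observation 1, using the parameters from the slide (d = 100, N = 10^6): the number of partitions dwarfs any realistic database size, so nearly every partition must be empty.

```python
# Observation 1 sketch: splitting each of d dimensions once yields 2**d
# partitions; with N uniform points, at most N of them can be non-empty.
d = 100
N = 10**6
partitions = 2**d
print(f"partitions: {partitions:.3e}")                    # ~1.27e+30
print(f"occupied fraction at most: {N / partitions:.3e}")  # almost all empty
```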
- 8. The Difficulties of High Dimensionality
  Observation 2 (Data space is sparsely populated)
  Consider a hyper-cube range query with side s = 0.95 over the data space Ω = [0,1]^d.
  At d = 100, the target region covers only s^d = 0.95^100 ≈ 0.59% of the data space.
  (Presenter: Seokhwan Eom)
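The sparsity claim can be checked directly (a sketch; the d values are chosen for illustration):

```python
# Observation 2 sketch: a hyper-cube range query with side s selects, on
# average, the fraction s**d of a uniformly populated unit data space.
s = 0.95
for d in (2, 10, 100):
    print(f"d={d:3d}: selectivity {s**d:.4%}")
```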
- 9. The Difficulties of High Dimensionality
  Observation 3 (Spherical range queries)
  The probability that an arbitrary point lies within the largest spherical query entirely contained in the data space shrinks rapidly with d.
  Figure: largest range query entirely within the data space.
  Table: probability that a point lies within the largest range query inside Ω, and the expected database size.
  (Presenter: Seokhwan Eom)
- 10. The Difficulties of High Dimensionality
  Observation 4 (Exponentially growing DB size)
  The size which a data set would have to have such that, on average, at least one point falls into the sphere grows exponentially with d (it is the reciprocal of the sphere's volume).
  (Presenter: Seokhwan Eom)
- 11. The Difficulties of High Dimensionality
  Observation 5 (Expected NN-distance)
  The probability that the NN-distance is at most r, i.e. that the NN of a query point Q is contained in sp^d(Q, r):
  P[nndist ≤ r] = 1 - (1 - vol(sp^d(Q, r) ∩ Ω))^N
  The expected NN-distance for a query point Q follows by integrating this distribution over r, and the expected NN-distance E[nndist] for any query point by averaging over the data space.
  (Presenter: Seokhwan Eom)
- 12. The Difficulties of High Dimensionality
  Observation 5 (Expected NN-distance), continued
  The NN-distance grows steadily with d.
  Beyond trivially small data sets D, NN-distances decrease only marginally as the size of D increases.
  (Presenter: Seokhwan Eom)
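Observation 5 can be illustrated with a small Monte Carlo experiment (a sketch with arbitrary sample sizes, not the paper's closed-form analysis):

```python
import math
import random

def expected_nn_distance(d, n_points=1000, n_queries=50, seed=0):
    """Estimate the expected distance from a uniform query point in [0,1]^d
    to its nearest neighbor among n_points uniform data points."""
    rng = random.Random(seed)
    data = [[rng.random() for _ in range(d)] for _ in range(n_points)]
    total = 0.0
    for _ in range(n_queries):
        q = [rng.random() for _ in range(d)]
        total += min(math.dist(q, p) for p in data)
    return total / n_queries

# The NN-distance grows steadily with d:
for d in (2, 10, 100):
    print(f"d={d:3d}: E[nndist] ~ {expected_nn_distance(d):.3f}")
```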
- 13. Analysis of NN-Search
  The complexity of any partitioning and clustering scheme converges to O(N) with increasing dimensionality.
  General cost model
  Space-partitioning methods
  Data-partitioning methods
  General partitioning and clustering schemes
  (Presenter: Jungyeol Lee)
- 14. General Cost Model
  'Cost' of a query: the number of blocks which must be accessed.
  Optimal NN-search algorithm: the blocks visited during the search are exactly the blocks whose MBRs (minimum bounding regions) intersect the NN-sphere.
  (Presenter: Jungyeol Lee)
- 15. General Cost Model
  Let the cost be the number of blocks visited, i.e. the number of blocks whose MBRs intersect the NN-sphere.
  Transform the spherical query into a point query via the Minkowski sum of each MBR with the NN-sphere.
  (Presenter: Jungyeol Lee)
- 16. General Cost Model
  Transform the spherical query into a point query.
  Probability that the i-th block must be visited: the volume of the part of its Minkowski sum lying inside the data space.
  (Presenter: Jungyeol Lee)
- 18. Space-Partitioning Methods
  Divide the space regardless of clusters.
  If each dimension is split once, the total number of partitions is 2^d, with a correspondingly large space overhead.
  To reduce the space overhead, only d' dimensions are split, such that, on average, a fixed number of points is assigned to each partition.
  (Presenter: Jungyeol Lee)
- 19. Space-Partitioning Methods
  The NN-distance eventually exceeds the maximum distance from the query point to any point in the data space at some dimensionality.
  From that dimensionality on, the Minkowski sum covers the entire data space, so the visit probability converges to 1, the same as a sequential scan.
  (Presenter: Jungyeol Lee)
- 20. Space-Partitioning Methods
  Fig. 7: comparison of … with …
  (Presenter: Jungyeol Lee)
- 21. Data-Partitioning Methods
  Data-partitioning methods partition the data space hierarchically in order to reduce the search cost from O(N) to O(log N).
  Impracticability of existing methods for NN-search in HDVSs: a sequential scan out-performed these more sophisticated hierarchical methods.
  (Presenter: Rina You)
- 22. Rectangular MBRs
  Index methods use hyper-cubes to bound the region of a block.
  Splitting a node results in two new, equally full partitions of the data space.
  At high dimensionality, only d' dimensions are split.
  (Presenter: Rina You)
- 23. Rectangular MBRs
  A rectangular MBR has d' sides with a length of 1/2 and d - d' sides with a length of 1.
  The probability of visiting a block during NN-search is the volume of that part of the extended box lying in the data space.
  (Presenter: Rina You)
- 24. Rectangular MBRs
  The probability of accessing a block during an NN-search, shown for different database sizes and different values of d'.
  (Presenter: Rina You)
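The visit probability for such a rectangular MBR can be estimated by Monte Carlo. This is a sketch with assumed parameters: the MBR is placed at the origin, and r stands in for the NN-distance.

```python
import random

def visit_prob(d, d_split, r, trials=20000, seed=0):
    """Estimate the probability that a uniform query point lies within
    distance r of the box [0, 1/2]^d_split x [0, 1]^(d - d_split), i.e. the
    volume of the clipped Minkowski sum of the MBR and the NN-sphere."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        q = [rng.random() for _ in range(d)]
        # Only split dimensions can contribute distance: unsplit sides span [0, 1].
        dist2 = sum(max(q[i] - 0.5, 0.0) ** 2 for i in range(d_split))
        hits += dist2 <= r * r
    return hits / trials

print(visit_prob(50, d_split=10, r=1.0))  # close to 1: nearly every block is visited
```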
- 25. Spherical MBRs
  Another group of index structures uses MBRs in the form of hyper-spheres.
  Each block of the optimal structure consists of a center point C and its m - 1 nearest neighbors, so the MBR can be described by C and a radius.
  (Presenter: Rina You)
- 26. Spherical MBRs
  The probability of accessing a block during the search: for MBRs in the form of hyper-spheres, again use a Minkowski sum.
  The probability that a block must be visited during an NN-search is the volume of the part of its Minkowski sum lying inside the data space.
  (Presenter: Rina You)
- 27. Spherical MBRs
  A further lower bound for this probability is obtained by replacing the block's radius with a smaller quantity; as the dimensionality increases, this bound does not decrease.
  (Presenter: Rina You)
- 28. Spherical MBRs
  The probability of accessing a block during the search: average the above probability over all center points.
  (Presenter: Rina You)
- 29. Spherical MBRs
  The percentage of blocks visited increases rapidly with the dimensionality, so a sequential scan will perform better in practice.
  (Presenter: Rina You)
- 30. General Partitioning and Clustering Schemes
  No partitioning or clustering scheme can offer efficient NN-search if the number of dimensions becomes large.
  A large portion (up to 100%) of the data blocks must be read in order to determine the nearest neighbor.
  (Presenter: Rina You)
- 31. General Partitioning and Clustering Schemes
  Basic assumptions:
  1. A cluster is a geometrical form (MBR) that covers all cluster points.
  2. Each cluster contains at least two points.
  3. The MBR of a cluster is convex.
  (Presenter: Rina You)
- 32. General Partitioning and Clustering Schemes
  Average probability of accessing a cluster during an NN-search.
  (Presenter: Rina You)
- 33. General Partitioning and Clustering Schemes
  Lower-bound the average probability of accessing a cluster by that of a line cluster:
  pick two arbitrary data points; since each cluster contains at least two points and its MBR is convex, the line between them is contained in the MBR.
  Lower-bound the volume of the extended MBR by the volume of the extended line.
  (Presenter: Rina You)
- 34. General Partitioning and Clustering Schemes
  Lower-bound the distance between the two cluster points:
  points on the surface of the NN-sphere of A_i have the minimal Minkowski sum for line(A_i, B_i), so line(A_i, P_i) is the optimal line cluster for point A_i if P_i is a point on the surface of the NN-sphere of A_i.
  (Presenter: Rina You)
- 35. General Partitioning and Clustering Schemes
  Lower-bound the average probability of accessing a line cluster:
  calculate the average volume of the Minkowski sums over all possible pairs A and P(A) in the data space.
  (Presenter: Rina You)
- 36. General Partitioning and Clustering Schemes
  Conclusion 1 (Performance): for any clustering and partitioning method, a simple sequential scan performs better if the number of dimensions exceeds some d.
  Conclusion 2 (Complexity): the complexity of any clustering and partitioning method tends towards O(N) as dimensionality increases.
  (Presenter: Rina You)
- 37. General Partitioning and Clustering Schemes
  Conclusion 3 (Degeneration): all blocks are accessed if the number of dimensions exceeds some d.
  (Presenter: Rina You)
- 38. The VA-File
  Accelerates the unavoidable scan by using object approximations to compress the vector data, reducing the amount of data that must be read during similarity searches.
  Compressing vector data
  The filtering step
  Accessing the data
  (Presenter: Kilho Lee)
- 39. The VA-File: Compressing Vector Data
  For each dimension i, a small number of bits (b_i) is assigned.
  Let b be the sum of all the b_i.
  The data space is divided into 2^b cells, and each vector is approximated by the cell that contains it.
  (Presenter: Kilho Lee)
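A minimal sketch of the approximation step, assuming coordinates in [0, 1) and a uniform (equally spaced) grid; the VA-file also permits non-uniform bucket boundaries.

```python
def approximate(vector, bits_per_dim):
    """Map each coordinate in [0, 1) to its cell index on a 2**b_i-level
    uniform grid; the tuple of indices is the vector's approximation."""
    return tuple(int(x * (1 << b)) for x, b in zip(vector, bits_per_dim))

# With b_i = 4 bits per dimension, b = 12 and the space has 2**12 cells:
print(approximate([0.12, 0.80, 0.47], [4, 4, 4]))  # → (1, 12, 7)
```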
- 42. The VA-File: Filtering Step
  When searching for the nearest neighbor, the entire approximation file is scanned, and upper and lower bounds on the distance to the query are derived from each approximation.
  Let δ be the smallest upper bound found so far.
  If an approximation's lower bound exceeds δ, its vector is filtered out.
  (Reference: What's wrong with high-dimensional similarity search?, S. Blott, VLDB 2008)
  (Presenter: Kilho Lee)
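The filtering step can be sketched as follows, assuming the uniform-grid approximation and Euclidean distance; `cell_bounds` is an assumed helper that derives the lower/upper bounds from a cell's nearest face and farthest corner.

```python
import math

def cell_bounds(query, approx, bits):
    """Lower/upper bounds on the distance from query to any vector whose
    approximation cell is `approx` (uniform-grid sketch)."""
    lo2 = hi2 = 0.0
    for q, a, b in zip(query, approx, bits):
        cell_lo, cell_hi = a / (1 << b), (a + 1) / (1 << b)
        lo = max(cell_lo - q, q - cell_hi, 0.0)  # distance to nearest cell face
        hi = max(q - cell_lo, cell_hi - q)       # distance to farthest cell corner
        lo2 += lo * lo
        hi2 += hi * hi
    return math.sqrt(lo2), math.sqrt(hi2)

def filter_step(query, approximations, bits):
    """Keep only cells whose lower bound does not exceed the smallest
    upper bound (delta) seen over the whole approximation file."""
    delta = math.inf
    candidates = []
    for idx, approx in enumerate(approximations):
        lo, hi = cell_bounds(query, approx, bits)
        if lo <= delta:
            candidates.append((lo, idx))
            delta = min(delta, hi)
    return [idx for lo, idx in candidates if lo <= delta]

# The query's own cell survives; a far-away cell is filtered out:
print(filter_step([0.1, 0.1], [(0, 0), (3, 3)], [2, 2]))  # → [0]
```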
- 44. The VA-File: Filtering Step
  After the filtering step, fewer than 0.1% of the vectors remain.
  (Presenter: Kilho Lee)
- 45. The VA-File: Accessing the Vectors
  After the filtering step, a small set of candidates remains.
  Candidates are sorted by lower bound and visited in that order.
  If a lower bound is encountered that exceeds the nearest distance seen so far, the VA-file method stops.
  (Presenter: Kilho Lee)
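The candidate-access loop described above, as a sketch; `fetch_vector` is an assumed callback standing in for reading the full vector from disk.

```python
import math

def access_step(query, candidates, fetch_vector):
    """Visit candidates (lower_bound, id) in order of increasing lower bound;
    stop once a lower bound exceeds the best exact distance found so far."""
    best_dist, best_id = math.inf, None
    for lower_bound, vid in sorted(candidates):
        if lower_bound > best_dist:
            break  # no remaining candidate can be closer
        dist = math.dist(query, fetch_vector(vid))
        if dist < best_dist:
            best_dist, best_id = dist, vid
    return best_id, best_dist

# Toy example with an in-memory "vector file":
vectors = {0: [0.1, 0.2], 1: [0.9, 0.9], 2: [0.15, 0.25]}
candidates = [(0.0, 0), (0.7, 1), (0.03, 2)]
print(access_step([0.12, 0.22], candidates, vectors.__getitem__)[0])  # → 0
```

Because the candidates are visited in lower-bound order, the loop can stop at the first bound that exceeds the best exact distance: no later candidate can improve on it.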
- 48. The VA-File: Accessing the Vectors
  Fewer than 1% of the vector blocks are visited.
  In the case d = 50, b_i = 6, N = 500,000, only 20 vectors are accessed.
  (Presenter: Kilho Lee)
- 50. Performance
  The figure depicts the percentage of blocks visited.
  (Presenter: Kilho Lee)
- 51. Conclusion
  Conventional indexing methods are out-performed by a simple sequential scan at moderate dimensionality (d = 10).
  At moderate and high dimensionality (d ≥ 6), the VA-file method can out-perform any other method.
  (Presenter: Kilho Lee)
