1.
A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. Group 4: Seokhwan Eom, Jungyeol Lee, Rina You, Kilho Lee
3.
The Similarity Search Paradigm Presenter: Seokhwan Eom (Reference: What’s wrong with high-dimensional similarity search?, S. Blott, VLDB 2008) 3
4.
The Similarity Search Paradigm Presenter: Seokhwan Eom Locate the closest point to the query object, i.e. its nearest neighbor (NN) (Reference: What’s wrong with high-dimensional similarity search?, S. Blott, VLDB 2008) 4
5.
The conventional approach Space-partitioning methods - Grid file [Nievergelt:1984] - K-D-B tree [Robinson:1981] - Quad tree [Finkel:1974] Data-partitioning index trees Unfortunately, as the number of dimensions increases, their performance degrades: the curse of dimensionality. 5 Presenter: Seokhwan Eom
6.
Contribution Assumptions: uniformly-distributed data within the unit hypercube, with independent dimensions. Establish lower bounds on the average performance of NN-search for space-partitioning, data-partitioning, and clustering structures. Show formally that any partitioning scheme and clustering technique must degenerate to a sequential scan through all of its blocks if the number of dimensions is sufficiently large. Present performance results which support the analysis, and demonstrate that the VA-file offers the best performance in practice whenever the number of dimensions is larger than 6. 6 Presenter: Seokhwan Eom
7.
Presenter: Seokhwan Eom The Difficulties of High Dimensionality Observation 1 (Number of partitions) A simple partitioning scheme: split the data space in each dimension into two halves. This seems reasonable in low dimensions, but with d = 100 there are 2^100 ≈ 10^30 partitions; even with 10^6 points, almost all partitions are empty (at most one partition in 10^24 can be occupied). 7
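As a quick sanity check, the counts in Observation 1 follow directly from the slide's d = 100, 10^6-point example (a small sketch):

```python
import math

d = 100                      # dimensionality
n_points = 10 ** 6           # database size from the slide
n_partitions = 2 ** d        # one binary split per dimension

# 2^100 is roughly 1.3e30 partitions
assert math.isclose(n_partitions, 1.27e30, rel_tol=0.01)

# Even if every point landed in its own partition, the occupied
# fraction is at most 1e6 / 2^100 -- about one cell in 10^24.
occupied_fraction = n_points / n_partitions
print(f"occupied fraction <= {occupied_fraction:.2e}")
```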
8.
Presenter: Seokhwan Eom The Difficulties of High Dimensionality Observation 2 (Data space is sparsely populated) Consider a hyper-cube range query with side length s = 0.95 in the data space Ω = [0,1]^d. The probability that a uniformly-distributed point falls inside the query is s^d; at d = 100, this is 0.95^100 ≈ 0.59%. [Figure: target region of side s inside Ω] 8
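The selectivity of the "huge" hyper-cube query collapses as d grows; a one-liner makes the point:

```python
# Probability that a uniform point in [0,1]^d falls inside a
# hyper-cube range query of side s = 0.95: it is simply s**d.
s = 0.95
for d in (10, 50, 100):
    print(f"d={d:3d}  P = {s ** d:.4f}")
# At d = 100 the query covers well under 1% of the data space.
```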
9.
Presenter: Seokhwan Eom The Difficulties of High Dimensionality Observation 3 (Spherical range queries) The probability that an arbitrary point R lies within the largest spherical query fitting entirely inside Ω equals the volume of a d-ball of radius 0.5: Vol(sp^d(0.5)) = π^(d/2) · 0.5^d / Γ(d/2 + 1), which vanishes rapidly as d grows. Figure: largest range query entirely within the data space. Table: probability that a point lies within the largest range query inside Ω, and the expected database size 9
10.
Presenter: Seokhwan Eom The Difficulties of High Dimensionality Observation 4 (Exponentially growing DB size) The size a data set would have to have such that, on average, at least one point falls into the sphere (for even d): N(d) = 1 / Vol(sp^d(0.5)) = (d/2)! · 2^d / π^(d/2). Table: probability that a point lies within the largest range query inside Ω, and the expected database size 10
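Observations 3 and 4 can be checked numerically with the standard d-ball volume formula (a sketch; the specific d values are illustrative):

```python
import math

def ball_volume(d, r=0.5):
    """Volume of a d-dimensional ball of radius r."""
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)

# Expected database size so that, on average, one point falls
# into the largest sphere inside [0,1]^d: N(d) = 1 / volume.
for d in (2, 10, 20, 40):
    print(f"d={d:2d}  N ~ {1 / ball_volume(d):.3e}")
```

Already at d = 20 the required database size exceeds 10^7 points, and it keeps growing super-exponentially.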
11.
Presenter: Seokhwan Eom The Difficulties of High Dimensionality Observation 5 (Expected NN-distance) The probability that the NN-distance is at most r (i.e. the probability that the NN of query point Q is contained in sp^d(Q, r)): P[nndist(Q) ≤ r] = 1 − (1 − Vol(sp^d(Q, r) ∩ Ω))^N. The expected NN-distance for a query point Q: E[nndist(Q)] = ∫_0^√d (1 − P[nndist(Q) ≤ r]) dr. The expected NN-distance E[nndist] for any query point in the data space: the average of E[nndist(Q)] over all Q ∈ Ω. 11
12.
Presenter: Seokhwan Eom The Difficulties of High Dimensionality Observation 5 (Expected NN-distance) The NN-distance grows steadily with d. Beyond trivially-small data sets D, NN-distances decrease only marginally as the size of D increases. 12
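The growth of the expected NN-distance with d is easy to reproduce by Monte-Carlo simulation (a sketch; the sample sizes are illustrative, not from the slides):

```python
import math
import random

def mean_nn_distance(d, n_points=2000, n_queries=50, seed=0):
    """Monte-Carlo estimate of E[nndist] for uniform data in [0,1]^d."""
    rng = random.Random(seed)
    data = [[rng.random() for _ in range(d)] for _ in range(n_points)]
    total = 0.0
    for _ in range(n_queries):
        q = [rng.random() for _ in range(d)]
        total += min(math.dist(q, p) for p in data)  # exact NN-distance
    return total / n_queries

for d in (2, 10, 50):
    print(f"d={d:2d}  E[nndist] ~ {mean_nn_distance(d):.3f}")
```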
13.
Analysis of NN-Search Presenter: Jungyeol Lee The complexity of any partitioning and clustering scheme converges to O(N) with increasing dimensionality. General Cost Model Space-Partitioning Methods Data-Partitioning Methods General Partitioning and Clustering Schemes 13
14.
General Cost Model ‘Cost’ of a query: the number of blocks which must be accessed. Optimal NN-search algorithm: blocks visited during the search = blocks whose MBR 1) intersects the NN-sphere. Presenter: Jungyeol Lee 1) MBR: Minimum Bounding Region 14
15.
General Cost Model Let M be the number of blocks visited; M = the number of blocks whose MBR intersects the NN-sphere sp^d(Q, nndist). The Minkowski sum of an MBR and the NN-sphere transforms the spherical query into a point query: a block must be visited iff Q lies inside its MBR enlarged by the NN-distance. Presenter: Jungyeol Lee 15
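The Minkowski-sum trick for rectangular MBRs can be sketched in a few lines: a sphere intersects a box exactly when the query point lies in the box enlarged by the sphere's radius, i.e. when the point-to-box distance is at most that radius (function name and example boxes are illustrative):

```python
import math

def sphere_intersects_box(q, lo, hi, r):
    """sp(q, r) intersects the box [lo, hi] iff the distance from q
    to the box is <= r -- equivalently, iff q lies inside the
    Minkowski sum of the box and a sphere of radius r."""
    gap2 = sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi))
    return math.sqrt(gap2) <= r

# Query at the origin, box [1,2]^2: the nearest box corner is (1,1),
# at distance sqrt(2) ~ 1.414.
print(sphere_intersects_box([0, 0], [1, 1], [2, 2], 1.5))  # True
print(sphere_intersects_box([0, 0], [1, 1], [2, 2], 1.0))  # False
```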
18.
Space-Partitioning Methods Dividing the space regardless of clusters. If each of the d dimensions is split once, the total number of partitions is 2^d, and the space overhead grows exponentially with d. To reduce the space overhead, only d' ≈ log2(N) dimensions are split, such that, on average, at least one point is assigned to each partition. 17 Presenter: Jungyeol Lee
19.
Space-Partitioning Methods Presenter: Jungyeol Lee Let r_max denote the maximum distance from a query point to any point in the data space. At some dimensionality the expected NN-distance reaches this order; from that dimensionality on, the Minkowski sum of each partition covers the entire data space, the probability of visiting a partition converges to 1, and the search is the same as a sequential scan. 18
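The degeneration argument above can be simulated directly: build the grid of cells over the d' split dimensions, and count how many cells lie within the NN-distance of a random query point (i.e. whose Minkowski-enlarged region contains it). The parameter values are illustrative assumptions; an NN-distance of order 2 is realistic at d = 100 per Observation 5:

```python
import itertools
import math
import random

def visited_fraction(d_split, d, nn_dist):
    """Fraction of grid cells whose Minkowski sum with the NN-sphere
    contains a random query point, i.e. cells within nn_dist of q.
    Only d_split of the d dimensions are split (once, at 0.5)."""
    rng = random.Random(1)
    q = [rng.random() for _ in range(d)]
    visited = total = 0
    for cell in itertools.product((0, 1), repeat=d_split):
        lo = [c * 0.5 for c in cell] + [0.0] * (d - d_split)
        hi = [c * 0.5 + 0.5 for c in cell] + [1.0] * (d - d_split)
        gap2 = sum(max(l - x, 0.0, x - h) ** 2
                   for x, l, h in zip(q, lo, hi))
        visited += math.sqrt(gap2) <= nn_dist
        total += 1
    return visited / total

# A tight NN-sphere touches few cells; a realistic high-dimensional
# NN-distance forces every single cell to be read.
print(visited_fraction(10, 100, nn_dist=0.1))
print(visited_fraction(10, 100, nn_dist=2.0))  # 1.0: full scan
```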
20.
Space-Partitioning Methods Fig. 7: Comparison 19 Presenter: Jungyeol Lee
21.
Data-Partitioning Methods Data-partitioning methods partition the data space hierarchically, in order to reduce the search cost from O(N) to O(log N). Impracticability of existing methods for NN-search in HDVSs: a sequential scan out-performed these more sophisticated hierarchical methods. 20 Presenter: Rina You
22.
Rectangular MBRs Index methods use hyper-cubes to bound the region of a block. Splitting a node results in two new, equally-full partitions of the data space. At high dimensionality, only d' of the d dimensions have been split. 21 Presenter: Rina You
23.
Rectangular MBRs A rectangular MBR has d' sides with a length of 1/2 and d − d' sides with a length of 1. The probability of visiting a block during an NN-search: the volume of that part of the extended box lying within the data space. 22 Presenter: Rina You
24.
Rectangular MBRs The probability of accessing a block during an NN-search, for different database sizes and different values of d'. 23 Presenter: Rina You
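The access probability of such a rectangular MBR can be estimated by Monte-Carlo sampling: draw a random query point and test whether it lies within the NN-distance of the box (the box anchoring and the NN-distance values are illustrative assumptions):

```python
import math
import random

def access_probability(d, d_split, r, n_trials=5000, seed=2):
    """Monte-Carlo estimate of the probability that a random query
    point lies within distance r of a rectangular MBR anchored at
    the origin, with d_split sides of length 1/2 and the remaining
    d - d_split sides of length 1 (i.e. the block must be visited)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        gap2 = 0.0
        for i in range(d):
            x = rng.random()
            hi = 0.5 if i < d_split else 1.0
            gap2 += max(x - hi, 0.0) ** 2   # box spans [0, hi] per dim
        hits += math.sqrt(gap2) <= r
    return hits / n_trials

# At d = 100 with d' = 20 halved sides, an NN-distance of 1.5 makes
# the block visit nearly certain; only tiny radii avoid the access.
print(access_probability(100, 20, 1.5))
print(access_probability(100, 20, 0.3))
```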
25.
Spherical MBRs Another group of index structures uses MBRs in the form of hyper-spheres. Each block of the optimal structure consists of a center point C and its m − 1 nearest neighbors; the MBR can be described by sp^d(C, r), where r is the distance from C to its (m − 1)-th nearest neighbor. 24 Presenter: Rina You
26.
Spherical MBRs The probability of accessing a block during the search: for MBRs in the form of hyper-spheres, use a Minkowski sum, which is again a sphere. The probability that a block must be visited during an NN-search is Vol(sp^d(C, r + nndist) ∩ Ω). 25 Presenter: Rina You
27.
Spherical MBRs Another lower bound for this probability is obtained by replacing the radius with a smaller one; as the dimensionality increases, this bound does not decrease. 26 Presenter: Rina You
28.
Spherical MBRs The probability of accessing a block during the search: average the above probability over all center points C. 27 Presenter: Rina You
29.
Spherical MBRs Presenter: Rina You The percentage of blocks visited increases rapidly with the dimensionality, so a sequential scan will perform better in practice. 28
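This behavior is easy to reproduce: build the "optimal" spherical blocks described above (a seed point plus its m − 1 nearest unassigned neighbors) and count blocks whose sphere lies within the NN-distance of a random query. A sketch under illustrative sample sizes:

```python
import math
import random

def fraction_visited(d, n_points=400, m=20, n_queries=20, seed=3):
    """Fraction of spherical-MBR blocks an exact NN-search must visit,
    for uniform data in [0,1]^d. Blocks are built greedily: a seed
    point plus its m-1 nearest unassigned neighbors."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(d)] for _ in range(n_points)]
    unassigned = list(range(n_points))
    blocks = []                       # (center, radius) pairs
    while unassigned:
        c = pts[unassigned[0]]
        unassigned.sort(key=lambda i: math.dist(c, pts[i]))
        members, unassigned = unassigned[:m], unassigned[m:]
        blocks.append((c, max(math.dist(c, pts[i]) for i in members)))
    visited = 0
    for _ in range(n_queries):
        q = [rng.random() for _ in range(d)]
        nndist = min(math.dist(q, p) for p in pts)
        # sp(q, nndist) intersects sp(c, r) iff dist(q, c) <= r + nndist
        visited += sum(math.dist(q, c) <= r + nndist for c, r in blocks)
    return visited / (n_queries * len(blocks))

for d in (2, 20, 100):
    print(f"d={d:3d}  visited ~ {fraction_visited(d):.2f}")
```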
30.
General Partitioning and Clustering Schemes Presenter: Rina You No partitioning or clustering scheme can offer efficient NN-search if the number of dimensions becomes large. The complexity of such methods is O(N): a large portion (up to 100%) of the data blocks must be read in order to determine the nearest neighbor. 29
31.
General Partitioning and Clustering Schemes Presenter: Rina You Basic assumptions: 1. A cluster is a geometrical form (MBR) that covers all cluster points 2. Each cluster contains at least two points 3. The MBR of a cluster is convex. 30
32.
General Partitioning and Clustering Schemes Average probability of accessing a cluster during an NN-search 31 Presenter: Rina You
33.
General Partitioning and Clustering Schemes Presenter: Rina You Lower-bound the average probability of accessing a cluster by a line cluster: pick two arbitrary data points A_i and B_i; since each cluster contains at least two points and its MBR is convex, line(A_i, B_i) is contained in the MBR. This lower-bounds the volume of the extended MBR. 32
34.
General Partitioning and Clustering Schemes Presenter: Rina You Lower-bound the distance between A_i and B_i: points on the surface of the NN-sphere of A_i yield the minimal Minkowski sum for line(A_i, B_i), so line(A_i, P_i) is the optimal line cluster for point A_i if P_i is a point on the surface of the NN-sphere of A_i. 33
35.
General Partitioning and Clustering Schemes Lower-bound the average probability of accessing a line cluster: calculate the average volume of the Minkowski sums over all possible pairs A and P(A) in the data space. 34 Presenter: Rina You
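The Minkowski sum of a line segment with a d-ball is a "capsule": a cylinder of the segment's length plus two hemispherical caps. Comparing its volume with a single point's Minkowski sum (just the ball) shows how little even the smallest admissible cluster can help, and that the gap grows with d. A sketch with an illustrative segment length:

```python
import math

def ball_volume(d, r):
    """Volume of a d-dimensional ball of radius r."""
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)

def capsule_volume(d, r, l):
    """Minkowski sum of a line segment of length l with a d-ball of
    radius r: a cylinder of length l plus two hemispherical caps."""
    return ball_volume(d, r) + l * ball_volume(d - 1, r)

# How much larger is a segment's Minkowski sum than a point's?
for d in (10, 100):
    ratio = capsule_volume(d, 0.5, 1.0) / ball_volume(d, 0.5)
    print(f"d={d:3d}  capsule / ball = {ratio:.2f}")
```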
36.
General Partitioning and Clustering Schemes Presenter: Rina You Conclusion 1 (Performance) For any clustering and partitioning method, a simple sequential scan performs better if the number of dimensions exceeds some d. Conclusion 2 (Complexity) The complexity of any clustering and partitioning method tends towards O(N) as dimensionality increases. 35
37.
General Partitioning and Clustering Schemes Presenter: Rina You Conclusion 3 (Degeneration) All blocks are accessed if the number of dimensions exceeds some d 36
38.
The VA-file Accelerates the unavoidable scan by using object approximations to compress the vector data, reducing the amount of data that must be read during similarity searches. Compressing vector data The filtering step Accessing the data 37 Presenter: Kilho Lee
39.
The VA-file Compressing vector data Presenter: Kilho Lee
For each dimension i, a small number of bits (b_i) is assigned
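A minimal, self-contained sketch of the VA-file idea, under simplifying assumptions: a uniform grid with the same bit budget in every dimension (the real VA-file assigns b_i bits per dimension and runs a separate filtering and access phase), and all function names are illustrative:

```python
import math
import random

def build_va_file(data, bits=4):
    """Approximate each coordinate by its cell index on a uniform
    2**bits grid -- this is the compressed 'approximation' file."""
    scale = 2 ** bits
    return [[min(int(x * scale), scale - 1) for x in p] for p in data]

def nn_search(query, data, approx, bits=4):
    """Exact NN via the VA-file idea: rank candidates by a cell-based
    lower bound, read full vectors only while the bound can still win."""
    scale = 2 ** bits

    def lower_bound(cells):
        # Distance from the query to the cell (a box) -- never larger
        # than the distance to the actual vector inside it.
        gap2 = 0.0
        for x, c in zip(query, cells):
            lo, hi = c / scale, (c + 1) / scale
            gap2 += max(lo - x, 0.0, x - hi) ** 2
        return math.sqrt(gap2)

    best_d, best_i = float("inf"), -1
    for i in sorted(range(len(data)), key=lambda j: lower_bound(approx[j])):
        if lower_bound(approx[i]) >= best_d:
            break                       # no later candidate can be closer
        dist = math.dist(query, data[i])  # 'accessing the data'
        if dist < best_d:
            best_d, best_i = dist, i
    return best_i, best_d

rng = random.Random(0)
data = [[rng.random() for _ in range(8)] for _ in range(200)]
q = [rng.random() for _ in range(8)]
idx, best = nn_search(q, data, build_va_file(data))
assert best == min(math.dist(q, p) for p in data)   # exact answer
```

Because the lower bounds are valid, the early break never discards the true nearest neighbor; the approximations only reduce how many full vectors are read.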