A Quantitative analysis and performance study for similarity search methods in high-dimensional spaces

Transcript

  • 1. A Quantitative analysis and performance study for similarity search methods in high-dimensional spaces
    Group 4
    Seokhwan Eom,
    Jungyeol Lee,
    Rina You,
    Kilho Lee
  • 2. Contents
    Introduction
    Observations
    Analysis of NN-search
    VA-file
    Conclusion
    2
    Presenter: Seokhwan Eom
  • 3. The Similarity Search Paradigm
    Presenter: Seokhwan Eom
    ( Reference : What’s wrong with high-dimensional similarity search?, S. Blott, VLDB 2008 )
    3
  • 4. The Similarity Search Paradigm
    Presenter: Seokhwan Eom
    Locate closest point to query object, i.e. its nearest neighbor(NN)
    ( Reference : What’s wrong with high-dimensional similarity search?, S. Blott, VLDB 2008 )
    4
  • 5. The conventional approach
    Space-partitioning methods
    - Gridfile [Nievergelt:1984]
    - K-D-B tree [Robinson:1981]
    - Quad tree [Finkel:1974]
    Data-partitioning index trees
    Unfortunately,
    As the number of dimensions increases, their performance degrades:
    - the curse of dimensionality
    5
    Presenter: Seokhwan Eom
  • 6. Contribution
    Assumptions : initially uniformly-distributed data within unit hypercube with independent dimensions
    Establish lower bounds on the average performance of NN-search for space- and data-partitioning, and clustering structures.
    Show formally that any partitioning scheme and clustering technique must degenerate to a sequential scan through all of their blocks if the number of dimensions is sufficiently large.
    Present performance results which support this analysis, and demonstrate that the VA-file offers the best performance in practice whenever the number of dimensions is larger than around 6.
    6
    Presenter: Seokhwan Eom
  • 7. Presenter: Seokhwan Eom
    The Difficulties of High Dimensionality
    Observation 1 (Number of partitions)
    A simple partitioning scheme :
    split the data space in each dimension into two halves.
    This seems reasonable with low dimensions.
    But with d = 100 there are 2^100 ≈ 10^30 partitions;
    even with 10^6 points, almost all of the partitions are empty (at most a 10^-24 fraction can be occupied).
    7
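As a sanity check on this observation, the arithmetic can be reproduced directly (a small sketch; the constants d = 100 and 10^6 points are from the slide):

```python
# Observation 1 sketch: split each of d = 100 dimensions once.
d = 100
partitions = 2 ** d                       # number of cells after one split per dimension
points = 10 ** 6
occupied_fraction = points / partitions   # at most this fraction of cells is non-empty
print(f"{partitions:.2e} partitions, occupied fraction <= {occupied_fraction:.1e}")
```

Even a million points can occupy only a vanishing fraction of the cells, so almost every partition is empty.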
  • 8. Presenter: Seokhwan Eom
    The Difficulties of High Dimensionality
    Observation 2 (Data space is sparsely populated)
    Consider a hyper-cube range query with side length s = 0.95 in the data space Ω = [0,1]^d.
    At d = 100, the probability that a uniformly distributed point falls in the target region is s^d = 0.95^100 ≈ 0.59%.
    [Figure: query cube of side s inside the data space Ω]
    8
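The selectivity of such a query can be checked directly (a sketch; s = 0.95 and d = 100 as on the slide):

```python
# Observation 2 sketch: selectivity of a huge hyper-cube range query.
s, d = 0.95, 100
hit_probability = s ** d          # chance a uniform point falls in the query cube
print(f"{hit_probability:.4f}")   # roughly 0.0059, i.e. ~0.59%
```

A query covering 95% of each axis still selects well under 1% of a uniform data set at d = 100.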
  • 9. Presenter: Seokhwan Eom
    The Difficulties of High Dimensionality
    Observation 3 (Spherical range queries)
    The probability that an arbitrary point lies within the largest spherical query (the sphere of radius 1/2 inscribed in the data space) vanishes rapidly as d grows.
    Figure: Largest range query entirely within the data space.
    Table: Probability that a point lies within the largest range query inside Ω, and the expected database size.
    9
  • 10. Presenter: Seokhwan Eom
    The Difficulties of High Dimensionality
    Observation 4 (Exponentially growing DB size)
    The size which a data set would have to have such that, on average, at least one point falls into the sphere (for even d): N(d) = (d/2)! · 2^d / π^(d/2), which grows exponentially with d.
    Table: Probability that a point lies within the largest range query inside Ω, and the expected database size
    10
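The table's expected database size can be sketched from the volume of the largest inscribed sphere, assuming that sphere has radius 1/2 and using the even-d ball-volume formula π^(d/2) r^d / (d/2)!:

```python
from math import pi, factorial

def largest_sphere_volume(d):
    # Volume of the radius-1/2 ball inscribed in [0,1]^d (even d only):
    # pi^(d/2) * (1/2)^d / (d/2)!
    assert d % 2 == 0
    return pi ** (d // 2) * 0.5 ** d / factorial(d // 2)

for d in (2, 10, 20, 40):
    vol = largest_sphere_volume(d)
    # Expected DB size so that, on average, one point falls into the sphere
    print(d, f"{1 / vol:.3e}")
```

Already at d = 40 the database would need astronomically many uniform points before the largest possible spherical query is expected to contain even one of them.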
  • 11. Presenter: Seokhwan Eom
    The Difficulties of High Dimensionality
    Observation 5 (Expected NN-distance)
    The probability that the NN-distance is at most r (i.e. the probability that the NN of query point Q is contained in sp^d(Q, r)):
    P[nndist ≤ r] = 1 - (1 - vol(sp^d(Q, r) ∩ Ω))^N
    The expected NN-distance for a query point Q:
    E[nndist](Q) = ∫₀^∞ P[nndist > r] dr
    The expected NN-distance E[nndist] for any query point in the data space:
    E[nndist] = ∫_Ω E[nndist](Q) dQ
    11
  • 12. Presenter: Seokhwan Eom
    The Difficulties of High Dimensionality
    Observation 5 (Expected NN-distance)
    The NN-distance grows steadily with d
    Beyond trivially-small data sets D, NN-distances decrease only marginally as the size of D increases.
    12
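The growth of the NN-distance with d can be illustrated by a small Monte Carlo sketch (the sample sizes here are arbitrary illustrative choices, not from the paper):

```python
import math, random

def mean_nn_distance(d, n_points=500, n_queries=20, seed=0):
    # Monte Carlo estimate of E[nndist] for uniform data in [0,1]^d.
    rng = random.Random(seed)
    data = [[rng.random() for _ in range(d)] for _ in range(n_points)]
    total = 0.0
    for _ in range(n_queries):
        q = [rng.random() for _ in range(d)]
        total += min(math.dist(q, p) for p in data)   # exact NN by linear scan
    return total / n_queries

for d in (2, 10, 50):
    print(d, round(mean_nn_distance(d), 3))
```

The estimate rises steadily with d, matching the slide's claim that NN-spheres become large in high dimensions.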
  • 13. Analysis of NN-Search
    Presenter: Jungyeol Lee
    The complexity of any partitioning and clustering scheme converges to O(N) with increasing dimensionality
    General Cost Model
    Space-Partitioning Methods
    Data-Partitioning Methods
    General Partitioning and Clustering Schemes
    13
  • 14. General Cost Model
    ‘Cost’ of a query:
    the number of blocks which must be accessed
    Optimal NN search algorithm:
    Blocks visited during the search
    = blocks whose MBR1) intersects the NN-sphere
    Presenter: Jungyeol Lee
    1) MBR: Minimum Bounding Region
    14
  • 15. General Cost Model
    The number of blocks visited
    = the number of blocks whose MBR intersects the NN-sphere
    Transform the spherical query into a point query via the Minkowski sum of each MBR with the NN-sphere
    Presenter: Jungyeol Lee
    15
  • 16. Transform the spherical query into a point query
  • 17. Probability that the i-th block must be visited
    General Cost Model
    16
    Presenter: Jungyeol Lee
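This cost model can be made concrete with a Monte Carlo sketch: a block with a rectangular MBR must be visited exactly when the query point lies inside the Minkowski sum of the box and the NN-sphere, i.e. within distance r of the box (the box shape and radii below are hypothetical):

```python
import math, random

def visit_probability(box_lo, box_hi, r, trials=50_000, seed=1):
    # Fraction of uniform queries lying in the Minkowski sum of the box
    # and a sphere of radius r, i.e. with dist(query, box) <= r.
    rng = random.Random(seed)
    d = len(box_lo)
    hits = 0
    for _ in range(trials):
        q = [rng.random() for _ in range(d)]
        # Squared distance from q to the box (0 if q is inside it)
        dist2 = sum(max(box_lo[i] - q[i], 0.0, q[i] - box_hi[i]) ** 2
                    for i in range(d))
        hits += math.sqrt(dist2) <= r
    return hits / trials

# A half-sized box in 5 dimensions: visit probability grows with the NN-radius r
for r in (0.1, 0.3, 0.5):
    print(r, visit_probability([0.0] * 5, [0.5] * 5, r))
```

Since the expected NN-radius grows with d (Observation 5), the Minkowski sums of all blocks eventually cover the whole data space and every block is visited.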
  • 18. Space-Partitioning Methods
    Space is divided without regard to how the data clusters
    If each dimension is split once,
    the total # of partitions is 2^d, with an exponential space overhead
    To reduce the space overhead, only d’ dimensions are split such that, on average, m points are assigned to a partition
    17
    Presenter: Jungyeol Lee
  • 19. Space-Partitioning Methods
    Presenter: Jungyeol Lee
    Consider the maximum distance from a query point to any point in the data space;
    at some dimensionality the expected NN-distance reaches this maximum.
    From that dimensionality on, the Minkowski sum covers the entire data space,
    and the probability of visiting each block converges to 1, the same as a sequential scan.
    18
  • 20. Space-Partitioning Methods
    Fig. 7 Comparison of with
    19
    Presenter: Jungyeol Lee
  • 21. Data-Partitioning Methods
    Data-partitioning methods partition the data space hierarchically
    in order to reduce the search cost from O(N) to O(log N)
    Impracticability of existing methods for NN-search in HDVSs:
    A sequential scan out-performed these more sophisticated hierarchical methods.
    20
    Presenter: Rina You
  • 22. Rectangular MBRs
    Index methods use hyper-rectangles to bound the region of a block.
    Splitting a node results in two new, equally-full partitions of the data space.
    d’ dimensions are split at high dimensionality
    21
    Presenter: Rina You
  • 23. Rectangular MBRs
    A rectangular MBR has
    d’ sides with a length of 1/2 and
    d - d’ sides with a length of 1.
    The probability of visiting a block during
    NN-search
    = the volume of that part of the Minkowski-extended box lying inside the data space
    22
    Presenter: Rina You
  • 24. Rectangular MBRs
    the probability of accessing a block during a NN-search
    different database sizes and different values of d’
    23
    Presenter: Rina You
  • 25. Spherical MBRs
    Another group of index structures
    MBRs in the form of hyper-spheres.
    Each block of the optimal structure consists of
    the center point C and its
    m - 1 nearest neighbors;
    the MBR can be described by the sphere sp(C, r), where r is the distance from C to its (m - 1)-th nearest neighbor.
    24
    Presenter: Rina You
  • 26. Spherical MBRs
    The probability of accessing a block during the search.
    MBRs in the form of hyper-spheres :
    use a Minkowski sum
    The probability that a block must be visited during a NN-search
    25
    Presenter: Rina You
  • 27. Spherical MBRs
    another lower bound for this probability
    replace by
    If increases, does not decrease.
    26
    Presenter: Rina You
  • 28. Spherical MBRs
    The probability of accessing a block
    during the search
    average the above probability over all center points :
    27
    Presenter: Rina You
  • 29. Spherical MBRs
    Presenter: Rina You
    percentage of blocks visited increases rapidly with the dimensionality
    sequential scan will perform better in practice
    28
  • 30. General Partitioning and Clustering Schemes
    Presenter: Rina You
    No partitioning or clustering scheme can offer efficient NN-search
    if the number of dimensions becomes large.
    The complexity of these methods tends towards O(N):
    a large portion (up to 100%) of the data blocks must be read
    in order to determine the nearest neighbor.
    29
  • 31. General Partitioning and Clustering Schemes
    Presenter: Rina You
    Basic assumptions:
    1. A cluster is a geometrical form (MBR) that covers all cluster points
    2. Each cluster contains at least two points
    3. The MBR of a cluster is convex.
    30
  • 32. General Partitioning and Clustering Schemes
    Average probability of accessing a cluster during an NN-search
    31
    Presenter: Rina You
  • 33. General Partitioning and Clustering Schemes
    Presenter: Rina You
    Lower-bound the average probability of accessing a line cluster:
    Pick two arbitrary data points in the same cluster
    (each cluster contains at least two points);
    the line segment between them is contained in the MBR,
    since the MBR is convex.
    Lower-bound the volume of the extended line segment
    (its Minkowski sum with the NN-sphere):
    32
  • 34. General Partitioning and Clustering Schemes
    Presenter: Rina You
    Lower-bound the distance between
    the two cluster points:
    points on the surface of the NN-sphere of Ai have the minimal Minkowski sum for line(Ai, Bi);
    line(Ai, Pi) is the optimal line cluster for point Ai
    if Pi is a point on the surface of the NN-sphere of Ai.
    33
  • 35. General Partitioning and Clustering Schemes
    Lower-bound the average probability
    of accessing a line cluster:
    calculate the average volume of the Minkowski sums over all possible pairs A and P(A) in the data space
    34
    Presenter: Rina You
  • 36. General Partitioning and Clustering Schemes
    Presenter: Rina You
    Conclusion 1 (Performance)
    For any clustering and partitioning method, a simple sequential scan performs better
    if the number of dimensions exceeds some d.
    Conclusion 2 (Complexity)
    The complexity of any clustering and partitioning method tends towards O(N)
    as dimensionality increases.
    35
  • 37. General Partitioning and Clustering Schemes
    Presenter: Rina You
    Conclusion 3 (Degeneration)
    All blocks are accessed
    if the number of dimensions exceeds some d
    36
  • 38. The VA-file
    Accelerates the unavoidable sequential scan by using object approximations to compress the vector data.
    Reduces the amount of data that must be read during similarity searches.
    Compressing vector data
    The filtering step
    Accessing the data
    37
    Presenter: Kilho Lee
  • 39. The VA-file Compressing vector data
    Presenter: Kilho Lee
    • For each dimension i, a small number of bits (bi) is assigned
  • 40. Let b be the sum of all bi’s
  • 41. The data space is divided into 2^b rectangular cells
    38
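A minimal sketch of this quantization step, assuming coordinates normalized to [0,1] and a uniform grid per dimension (the actual VA-file also supports non-uniform partition points):

```python
def va_approximation(vector, bits_per_dim):
    # Map each coordinate in [0, 1] to its grid-cell index on a 2^b_i grid.
    approx = []
    for x, b in zip(vector, bits_per_dim):
        cells = 1 << b                        # 2^b_i cells along this dimension
        approx.append(min(int(x * cells), cells - 1))
    return approx

print(va_approximation([0.0, 0.5, 0.999], [2, 2, 2]))  # -> [0, 2, 3]
```

With bi = 2 bits per dimension each coordinate is mapped to one of 4 cells, so a d-dimensional vector is approximated in 2d bits instead of d full-precision floats.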
  • 42. The VA-file Filtering step
    Presenter: Kilho Lee
    • When searching for the nearest neighbor, the entire approximation file
    is scanned, and upper and lower bounds on the distance to the query are computed.
    • Let δ be the smallest upper bound found so far.
  • 43. If an approximation’s lower bound exceeds δ, it is filtered out.
    ( Reference : What’s wrong with high-dimensional similarity search?, S. Blott, VLDB 2008 )
    39
  • 44. The VA-file Filtering step
    Presenter: Kilho Lee
    • After the filtering step, less than 0.1% of the vectors remain.
    40
  • 45. The VA-file Accessing the vector
    Presenter: Kilho Lee
    • After the filtering step, a small set of candidates remain.
  • 46. Candidates are sorted by lower bound
    • 47. If a lower bound is encountered that exceeds the nearest distance seen
    so far, the VA-file method stops.
    41
  • 48. The VA-file Accessing the vector
    Presenter: Kilho Lee
    • Less than 1% of the vector blocks are visited.
  • 49. In the case d = 50, bi = 6, N = 500,000, only 20 vectors are accessed.
    42
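The filtering and accessing phases can be sketched end-to-end as follows (assumptions: data in [0,1]^d, a uniform grid, Euclidean distance; `fetch_vector` is a hypothetical stand-in for reading the full vector from disk):

```python
import math

def cell_bounds(q, approx, bits_per_dim):
    # Lower/upper bounds on the distance from query q to any vector in the cell.
    lo2 = hi2 = 0.0
    for qi, cell, b in zip(q, approx, bits_per_dim):
        step = 1.0 / (1 << b)
        c_lo, c_hi = cell * step, (cell + 1) * step
        lo = max(c_lo - qi, 0.0, qi - c_hi)        # 0 if qi lies inside the cell
        hi = max(abs(qi - c_lo), abs(qi - c_hi))   # farthest cell boundary
        lo2 += lo * lo
        hi2 += hi * hi
    return math.sqrt(lo2), math.sqrt(hi2)

def va_nn(q, approximations, fetch_vector, bits_per_dim):
    # Phase 1 (filtering): scan all approximations; keep a cell only if its
    # lower bound does not exceed delta, the smallest upper bound seen so far.
    delta = math.inf
    candidates = []
    for idx, a in enumerate(approximations):
        lo, hi = cell_bounds(q, a, bits_per_dim)
        if lo <= delta:
            candidates.append((lo, idx))
        delta = min(delta, hi)
    # Phase 2 (accessing): visit candidates in order of lower bound; stop as
    # soon as a lower bound exceeds the best exact distance found so far.
    best_dist, best_idx = math.inf, None
    for lo, idx in sorted(candidates):
        if lo > best_dist:
            break
        dist = math.dist(q, fetch_vector(idx))
        if dist < best_dist:
            best_dist, best_idx = dist, idx
    return best_idx, best_dist

# Hypothetical toy data; cell indices precomputed on a 2^4 grid per dimension.
data = [[0.1, 0.2], [0.8, 0.9], [0.5, 0.5]]
approx = [[1, 3], [12, 14], [8, 8]]
print(va_nn([0.52, 0.48], approx, lambda i: data[i], [4, 4]))
```

The filtering step needs only the compact approximations; the expensive full-vector reads happen solely in the second phase, and the sort-plus-early-stop keeps those reads to a handful of candidates.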
  • 50. Performance
    Presenter: Kilho Lee
    • Figure depicts the percentage of blocks visited.
    43
  • 51. Conclusion
    Presenter: Kilho Lee
    • Conventional indexing methods are out-performed by a
    simple sequential scan at moderate dimensionality (d ≈ 10).
    • At moderate and high dimensionality (d ≥ 6), the VA-file method
    can out-perform any other method.
    44