Combinatorial Algorithms
   for Nearest Neighbors, Near-Duplicates
           and Small-World Design


                   ...
Similarity Search: an Example
 Input: Set of objects
 Task: Preprocess it




              Query: New object
            ...
Similarity Search


Search space: object domain U, similarity
function σ
Input: database S = {p1 , . . . , pn } ⊆ U
Query:...
Nearest Neighbors in Theory
              Orchard’s Algorithm k-d-B tree
     Sphere Rectangle Tree

 Geometric near-neigh...
Revision: Basic Assumptions

In theory:
    Triangle inequality
    Doubling dimension is o(log n)


Typical web dataset h...
Contribution
Navin Goyal, YL, Hinrich Schütze, WSDM 2008:
         Combinatorial framework: new approach to data
         ...
Outline
  1. Combinatorial Framework
  2. New Algorithms
  3. Combinatorial Nets
  4. Direc...
Part 1: Combinatorial Framework




Yury Lifshits, Shengyu Zhang   ()Combinatorial Nearest Neighbors   8 / 29
Comparison Oracle


         Dataset p1 , . . . , pn

         Objects and distance (or similarity)
         function are ...
Disorder Inequality
Sort all objects by their similarity to p:
                rank_p(r)

    p                           ...
Combinatorial Framework
                                              =
                    Comparison oracle
            ...
Combinatorial Framework: FAQ



         Disorder of a metric space? Disorder of R^k?
         In what cases diso...
Disorder vs. Others


         If the expansion rate is c, the disorder constant is
         at most c^2

         Doubling dimensi...
Combinatorial Framework: Pro & Contra



Advantages:
         Does not require triangle inequality for distances
         ...
Part 2: New Algorithms
Nearest Neighbor Search

Assume S ∪ {q} has disorder constant D

Theorem
There is a deterministic and exact algorithm
for ...
Near-Duplicates
Assume the comparison oracle can also tell us
whether σ(x, y) > T for some similarity
threshold T

Theorem
Al...
Visibility Graph

Theorem
For any dataset S with disorder D there exists
a visibility graph:
         poly(D) n log^2 n cons...
[Figure: step-by-step greedy routing toward the target q through p1, p2, p3, ...]
Part 3: Combinatorial Nets
Combinatorial Ball

                         B(x, r) = {y : rank_x(y) < r}

In other words, it is a subset of dataset S: t...
Combinatorial Net
A subset R ⊆ S is called a combinatorial
r-net iff the following two properties hold:
Covering: ∀y ∈ S,...
Basic Data Structure
Combinatorial nets:
                                                                   n
 For every 0...
Fast Net Construction




Theorem
Combinatorial nets can be constructed in
O(D^7 n log^2 n) time

Yury Lifshits, Shengyu Zha...
Up’n’Down Trick

Assume you have a 2r-net for the dataset

To compute an r-ball around some object p:
         Take a cente...
Part 4: Directions for Further Research
Future of Combinatorial Framework
         Other problems in combinatorial framework:
                  Low-distortion emb...
Summary
         Combinatorial framework:
         comparison oracle + disorder inequality
         New algorithms:
      ...
Links
http://yury.name

http://simsearch.yury.name
Tutorial, bibliography, people, links, open problems


       Yury Lifs...
Combinatorial Algorithms for Nearest Neighbors, Near-Duplicates and Small World Design - Yury Lifshits - SODA 2009

  1. Combinatorial Algorithms for Nearest Neighbors, Near-Duplicates and Small-World Design. Yury Lifshits (Yahoo! Research), Shengyu Zhang (The Chinese University of Hong Kong). SODA 2009 (running title: Combinatorial Nearest Neighbors, 29 pages).
  4. Similarity Search: an Example. Input: set of objects. Task: preprocess it. Query: new object. Task: find the most similar one in the dataset.
  5. Similarity Search. Search space: object domain U, similarity function σ. Input: database S = {p_1, ..., p_n} ⊆ U. Query: q ∈ U. Task: find argmax_{p_i ∈ S} σ(p_i, q).
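The task on slide 5 can be illustrated with the trivial baseline that the talk aims to beat: a full scan that evaluates σ against every database object. This is only a sketch; the names `most_similar` and `jaccard`, and the choice of Jaccard similarity on character sets, are assumptions made for illustration and do not come from the slides.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def most_similar(database: Sequence[T], query: T,
                 sigma: Callable[[T, T], float]) -> T:
    """Return the database object p maximizing sigma(p, query) by a full scan."""
    if not database:
        raise ValueError("database must be non-empty")
    return max(database, key=lambda p: sigma(p, query))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Illustrative similarity function: |a ∩ b| / |a ∪ b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy database of "documents" represented as sets of characters.
docs = [frozenset("abc"), frozenset("bcd"), frozenset("xyz")]
best = most_similar(docs, frozenset("bcde"), jaccard)
```

The scan costs n similarity evaluations per query, which is the cost that preprocessing is meant to reduce.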
  6. Nearest Neighbors in Theory: Orchard’s Algorithm, k-d-B tree, Sphere Rectangle Tree, Geometric near-neighbor access tree, Excluded middle vantage point forest, mvp-tree, Fixed-height fixed-queries tree, Vantage-point tree, AESA, LAESA, R*-tree, Burkhard-Keller tree, BBD tree, Navigating Nets, Voronoi tree, Balanced aspect ratio tree, M-tree, vps-tree, Metric tree, Locality-Sensitive Hashing, SS-tree, R-tree, Spatial approximation tree, mb-tree, Cover tree, Multi-vantage point tree, Bisector tree, Generalized hyperplane tree, Hybrid tree, Slim tree, k-d tree, X-tree, Spill Tree, Fixed queries tree, Balltree, Quadtree, Octree, Post-office tree.
  9. Revision: Basic Assumptions. In theory: triangle inequality; doubling dimension is o(log n). Typical web dataset has separation effect: for almost all i, j: 1/2 ≤ d(p_i, p_j) ≤ 1. Classic methods fail: branch and bound algorithms visit every object; doubling dimension is at least (log n)/2.
  11. Contribution. Navin Goyal, YL, Hinrich Schütze, WSDM 2008: combinatorial framework (a new approach to data mining problems that does not require triangle inequality); nearest neighbor algorithm. This work: better nearest neighbor search; detecting near-duplicates; navigability design for small worlds.
  12. Outline. 1: Combinatorial Framework. 2: New Algorithms. 3: Combinatorial Nets. 4: Directions for Further Research.
  13. Part 1: Combinatorial Framework
  14. Comparison Oracle. Dataset p_1, ..., p_n. Objects and distance (or similarity) function are NOT given. Instead, there is a comparison oracle answering queries of the form: who is closer to A: B or C?
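A comparison oracle as described on slide 14 might be modelled as below. This is a minimal sketch: the class name `ComparisonOracle`, the tie-breaking rule, and the call counter are assumptions added for illustration. The hidden similarity function is used only inside the oracle, so client code such as `nearest_by_oracle` never sees numeric distances.

```python
class ComparisonOracle:
    """Answers 'who is closer to a: b or c?' while hiding the similarity function."""

    def __init__(self, sigma):
        self._sigma = sigma   # hidden from callers; used only to answer queries
        self.calls = 0        # number of oracle queries issued so far

    def closer(self, a, b, c):
        """Return whichever of b, c is more similar to a (b wins ties)."""
        self.calls += 1
        return b if self._sigma(a, b) >= self._sigma(a, c) else c


def nearest_by_oracle(oracle, query, database):
    """Baseline nearest neighbor search using n - 1 comparison queries."""
    best = database[0]
    for p in database[1:]:
        best = oracle.closer(query, best, p)
    return best


# Toy example: points on a line, similarity = negative absolute difference.
oracle = ComparisonOracle(lambda x, y: -abs(x - y))
nn = nearest_by_oracle(oracle, 7, [1, 4, 9, 20])
```

Counting oracle calls is a convenient cost measure in this model, since comparisons are the only access to the data.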
  17. Disorder Inequality. Sort all objects by their similarity to p, which gives rank_p(r) and rank_p(s). Then sort by similarity to r, which gives rank_r(s). The dataset has disorder D if ∀p, r, s: rank_r(s) ≤ D(rank_p(r) + rank_p(s)).
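For a small dataset, the smallest D satisfying the disorder inequality can be found by brute force over all triples. The sketch below assumes that every object has rank 0 with respect to itself (consistent with the ball definition later in the talk, where B(x, r) contains x and its r − 1 nearest neighbors) and that ties are broken by index; the helper names `rank_table` and `disorder_constant` are hypothetical.

```python
from itertools import product


def rank_table(points, dist):
    """ranks[x][y] = position of y when all points are sorted by distance to x.

    Assumed convention: x itself gets rank 0, its nearest neighbor rank 1, etc.
    """
    n = len(points)
    ranks = []
    for x in range(n):
        order = sorted(range(n),
                       key=lambda y: (dist(points[x], points[y]), y != x))
        row = [0] * n
        for position, y in enumerate(order):
            row[y] = position
        ranks.append(row)
    return ranks


def disorder_constant(ranks):
    """Smallest D with rank_r(s) <= D * (rank_p(r) + rank_p(s)) for all p, r, s."""
    n = len(ranks)
    worst = 0.0
    for p, r, s in product(range(n), repeat=3):
        denom = ranks[p][r] + ranks[p][s]
        if denom > 0:  # denom == 0 only when p == r == s, where rank_r(s) == 0
            worst = max(worst, ranks[r][s] / denom)
    return worst


points = [0.0, 1.0, 3.0, 7.0]
ranks = rank_table(points, lambda a, b: abs(a - b))
D = disorder_constant(ranks)
```

The triple loop costs O(n^3) and is meant only for inspecting toy datasets, not as part of any algorithm from the talk.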
  18. Combinatorial Framework = comparison oracle (who is closer to A: B or C?) + disorder inequality rank_r(s) ≤ D(rank_p(r) + rank_p(s)).
  19. Combinatorial Framework: FAQ. Disorder of a metric space? Disorder of R^k? In what cases is disorder relatively small? Experimental values of D for some practical datasets?
  20. Disorder vs. Others. If the expansion rate is c, the disorder constant is at most c^2. Doubling dimension and disorder dimension are incomparable. The disorder inequality implies a combinatorial form of the “doubling effect”.
  22. Combinatorial Framework: Pro & Contra. Advantages: does not require triangle inequality for distances; applicable to any data model and any similarity function; requires only comparative training information. Limitation: worst-case form of disorder inequality.
  23. Part 2: New Algorithms
  24. Nearest Neighbor Search. Assume S ∪ {q} has disorder constant D. Theorem: there is a deterministic and exact algorithm for nearest neighbor search with preprocessing O(D^7 n log^2 n) and search O(D^4 log n).
  25. Near-Duplicates. Assume the comparison oracle can also tell us whether σ(x, y) > T for some similarity threshold T. Theorem: all pairs with similarity over T can be found deterministically in time poly(D)(n log^2 n + |Output|).
  26. Visibility Graph. Theorem: for any dataset S with disorder D there exists a visibility graph with poly(D) n log^2 n construction time and O(D^4 log n) out-degrees, in which naïve greedy routing deterministically reaches the exact nearest neighbor of the given target q in at most log n steps.
  32. [Figure: step-by-step greedy routing from p1 through p2, p3 and p4 toward the target q]
  33. Part 3: Combinatorial Nets
  34. Combinatorial Ball. B(x, r) = {y : rank_x(y) < r}. In other words, it is a subset of the dataset S: the object x itself and its r − 1 nearest neighbors. [Figure: the ball B(x, 10) around x]
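The ball definition translates directly into a sort-and-truncate. The sketch below uses an explicit distance function for brevity (in the framework itself only oracle comparisons would be available), and the function name `combinatorial_ball` is an assumption.

```python
def combinatorial_ball(x, r, dataset, dist):
    """B(x, r): x itself plus its r - 1 nearest neighbors, i.e. all y with rank_x(y) < r."""
    # The secondary key puts x first even if another object is at distance 0.
    ordered = sorted(dataset, key=lambda y: (dist(x, y), y != x))
    return ordered[:r]


dataset = [0, 1, 3, 7, 8]
ball = combinatorial_ball(3, 3, dataset, lambda a, b: abs(a - b))
```

Note that the ball is defined by rank, not by radius, so it always contains exactly r objects whenever the dataset has at least r of them.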
  36. Combinatorial Net. A subset R ⊆ S is called a combinatorial r-net iff the following two properties hold. Covering: ∀y ∈ S, ∃x ∈ R s.t. rank_x(y) < r. Separation: ∀x_i, x_j ∈ R, rank_{x_i}(x_j) ≥ r OR rank_{x_j}(x_i) ≥ r. How to construct a combinatorial net? What upper bound on its size can we guarantee?
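One straightforward way to obtain a set satisfying both properties is a greedy pass: add an object as a new center whenever no existing center covers it. This is a naive quadratic sketch, not the fast construction from the talk, and it makes no claim about the size of the resulting net. The helper names and the rank convention (each object has rank 0 with respect to itself, ties broken by index) are assumptions.

```python
def rank_table(points, dist):
    """ranks[x][y] = position of y when points are sorted by distance to x (x gets 0)."""
    n = len(points)
    ranks = []
    for x in range(n):
        order = sorted(range(n),
                       key=lambda y: (dist(points[x], points[y]), y != x))
        row = [0] * n
        for position, y in enumerate(order):
            row[y] = position
        ranks.append(row)
    return ranks


def greedy_net(ranks, r):
    """Scan objects in index order; open a new center for each uncovered object."""
    centers = []
    for y in range(len(ranks)):
        if all(ranks[x][y] >= r for x in centers):
            centers.append(y)
    return centers


def is_net(ranks, centers, r):
    """Check the covering and separation properties of a combinatorial r-net."""
    n = len(ranks)
    covering = all(any(ranks[x][y] < r for x in centers) for y in range(n))
    separation = all(ranks[a][b] >= r or ranks[b][a] >= r
                     for a in centers for b in centers if a != b)
    return covering and separation


points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
ranks = rank_table(points, lambda a, b: abs(a - b))
net = greedy_net(ranks, 3)
```

Separation holds because a later center y is only added when every earlier center x has rank_x(y) ≥ r; covering holds because each object is either covered or becomes a center itself.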
  38. Basic Data Structure. Combinatorial nets: for every 0 ≤ i ≤ log n, construct an (n/2^i)-net. Pointers, pointers, pointers: direct & inverted indices (links between centers and members of their balls); cousin links (for every center keep pointers to close centers on the same level); navigation links (for every center keep pointers to close centers on the next level).
  39. Fast Net Construction. Theorem: combinatorial nets can be constructed in O(D^7 n log^2 n) time.
  40. Up’n’Down Trick. Assume you have a 2r-net for the dataset. To compute an r-ball around some object p: (1) take a center p′ of a 2r-ball that covers p; (2) take all centers of 2r-balls near p′; (3) for each of them, write down all members of their 2r-balls; (4) sort all written objects with respect to p and keep the r most similar ones.
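The four steps of the Up’n’Down trick can be sketched as follows. The slide does not say how "near" is defined in step 2, so the rank cutoff `cousin_radius` is a hypothetical parameter, and the result is only guaranteed to equal the true r-ball when that cutoff is large enough. A precomputed rank table stands in for the comparison oracle, and all function names are assumptions.

```python
def rank_table(points, dist):
    """ranks[x][y] = position of y when points are sorted by distance to x (x gets 0)."""
    n = len(points)
    ranks = []
    for x in range(n):
        order = sorted(range(n),
                       key=lambda y: (dist(points[x], points[y]), y != x))
        row = [0] * n
        for position, y in enumerate(order):
            row[y] = position
        ranks.append(row)
    return ranks


def up_n_down_ball(p, r, ranks, centers, cousin_radius):
    """Compute the r-ball around p from the centers of a 2r-net."""
    n = len(ranks)
    # Step 1 (up): a center whose 2r-ball covers p.
    parent = next(c for c in centers if ranks[c][p] < 2 * r)
    # Step 2: centers near the parent; the cutoff is an assumed parameter.
    nearby = [c for c in centers if ranks[parent][c] < cousin_radius]
    # Step 3 (down): write down all members of the nearby 2r-balls.
    candidates = {y for c in nearby for y in range(n) if ranks[c][y] < 2 * r}
    # Step 4: sort the candidates with respect to p and keep the r most similar.
    return sorted(candidates, key=lambda y: ranks[p][y])[:r]


points = [0.0, 1.0, 3.0, 10.0, 11.0, 13.0]
ranks = rank_table(points, lambda a, b: abs(a - b))
centers = [0, 4]  # a 4-net of this toy dataset (a 2r-net for r = 2)
ball = up_n_down_ball(3, 2, ranks, centers, cousin_radius=5)
too_narrow = up_n_down_ball(3, 2, ranks, centers, cousin_radius=1)
```

In this toy run, the wide cutoff recovers the true 2-ball of object 3, while the narrow cutoff inspects only the parent's ball and returns a wrong second member, which shows why step 2 cannot be skipped.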
  41. Part 4: Directions for Further Research
  42. Future of Combinatorial Framework. Other problems in the combinatorial framework: low-distortion embeddings, closest pairs, community discovery, linear arrangement, distance labelling, dimensionality reduction. What if the disorder inequality has exceptions? Insertions, deletions, changing metric. Experiments & implementation. Unification challenge: disorder + doubling = ?
  44. Summary. Combinatorial framework: comparison oracle + disorder inequality. New algorithms: nearest neighbor search; deterministic detection of near-duplicates; navigability design. Thanks for your attention! Questions?
  45. Links. http://yury.name and http://simsearch.yury.name (tutorial, bibliography, people, links, open problems). References:
      Yury Lifshits and Shengyu Zhang. Combinatorial Algorithms for Nearest Neighbors, Near-Duplicates and Small-World Design. http://yury.name/papers/lifshits2008similarity.pdf
      Navin Goyal, Yury Lifshits, Hinrich Schütze. Disorder Inequality: A Combinatorial Approach to Nearest Neighbor Search. http://yury.name/papers/goyal2008disorder.pdf
      Benjamin Hoffmann, Yury Lifshits, Dirk Nowotka. Maximal Intersection Queries in Randomized Graph Models. http://yury.name/papers/hoffmann2007maximal.pdf
