Combinatorial Algorithms for Nearest Neighbors, Near-Duplicates and Small-World Design

Yury Lifshits
Yahoo! Research

Shengyu Zhang
The Chinese University of Hong Kong

SODA 2009

Yury Lifshits, Shengyu Zhang   ()Combinatorial Nearest Neighbors   1 / 29
Similarity Search: an Example
    Input: Set of objects
    Task: Preprocess it

    Query: New object
    Task: Find the most similar one in the dataset

    [Figure: the object most similar to the query is marked "Most similar"]
Similarity Search


Search space: object domain $U$, similarity function $\sigma$
Input: database $S = \{p_1, \ldots, p_n\} \subseteq U$
Query: $q \in U$
Task: find $\arg\max_{p_i \in S} \sigma(p_i, q)$
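
As a baseline for the task above, a brute-force scan evaluates the similarity function against every database object. This is only a minimal sketch: the names `most_similar` and `toy_sigma` are illustrative assumptions, not part of the talk.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def most_similar(database: Sequence[T], query: T,
                 sigma: Callable[[T, T], float]) -> T:
    """Return the database object maximizing sigma(p, query) by linear scan."""
    if not database:
        raise ValueError("database must be non-empty")
    return max(database, key=lambda p: sigma(p, query))


# Toy similarity (an assumption for illustration): negative absolute difference.
def toy_sigma(p: float, q: float) -> float:
    return -abs(p - q)
```

The scan costs n similarity evaluations per query, which is the cost that preprocessing is meant to avoid.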




Nearest Neighbors in Theory
    Orchard's Algorithm, k-d-B tree, Sphere Rectangle Tree,
    Geometric near-neighbor access tree, Excluded middle vantage point forest,
    mvp-tree, Fixed-height fixed-queries tree, Vantage-point tree, AESA, LAESA,
    R*-tree, Burkhard-Keller tree, BBD tree, Navigating Nets, Voronoi tree,
    Balanced aspect ratio tree, M-tree, vp^s-tree, Metric tree,
    Locality-Sensitive Hashing, SS-tree, R-tree, Spatial approximation tree,
    mb-tree, Cover tree, Multi-vantage point tree, Bisector tree,
    Generalized hyperplane tree, Hybrid tree, Slim tree, k-d tree, X-tree,
    Spill Tree, Fixed queries tree, Balltree, Quadtree, Octree, Post-office tree
Revision: Basic Assumptions

In theory:
    Triangle inequality
    Doubling dimension is $o(\log n)$

Typical web dataset has separation effect:
    For almost all $i, j$: $1/2 \le d(p_i, p_j) \le 1$

Classic methods fail:
    Branch and bound algorithms visit every object
    Doubling dimension is at least $\log n / 2$
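
One way to see why pruning fails, sketched under two assumptions that are not stated on the slide: the query $q$ satisfies the same separation bounds with respect to the dataset, and the method prunes with the standard pivot-based lower bound (the pivot $v$ is our notation).

```latex
% Pivot-based lower bound obtained from the triangle inequality:
\[
  d(q, p) \;\ge\; |d(q, v) - d(p, v)| .
\]
% With all of these distances in [1/2, 1]:
\[
  |d(q, v) - d(p, v)| \;\le\; 1 - \tfrac{1}{2} \;=\; \tfrac{1}{2} \;\le\; \min_i d(q, p_i) .
\]
```

The lower bound never exceeds the distance to the true nearest neighbor, so no candidate $p$ can be discarded and every object is visited.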


Contribution
Navin Goyal, YL, Hinrich Schütze, WSDM 2008:
         Combinatorial framework: new approach to data
         mining problems that does not require triangle
         inequality
         Nearest neighbor algorithm

This work:
         Better nearest neighbor search
         Detecting near-duplicates
         Navigability design for small worlds


Outline
    1  Combinatorial Framework
    2  New Algorithms
    3  Combinatorial Nets
    4  Directions for Further Research
1  Combinatorial Framework
Comparison Oracle


         Dataset p1 , . . . , pn

         Objects and distance (or similarity)
         function are NOT given

         Instead, there is a comparison oracle
         answering queries of the form:

                    Who is closer to A: B or C?
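
The oracle model above can be sketched as follows. The class and function names are hypothetical, and the hidden distance passed to the constructor exists only to simulate the oracle: algorithms in this framework would call `closer` and never see numeric values.

```python
from typing import Callable, Hashable, Sequence


class ComparisonOracle:
    """Answers "who is closer to a: b or c?" without exposing distances."""

    def __init__(self, distance: Callable[[Hashable, Hashable], float]):
        # The distance stays hidden inside the oracle (simulation only).
        self._distance = distance
        self.calls = 0

    def closer(self, a, b, c):
        """Return b if b is at least as close to a as c is, otherwise c."""
        self.calls += 1
        return b if self._distance(a, b) <= self._distance(a, c) else c


def nearest_by_comparisons(oracle: ComparisonOracle, query, dataset: Sequence):
    """Linear-scan nearest neighbor using len(dataset) - 1 oracle calls."""
    best = dataset[0]
    for candidate in dataset[1:]:
        best = oracle.closer(query, best, candidate)
    return best
```

Counting `calls` makes the cost measure explicit: running time in this model is naturally expressed in oracle queries.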



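Disorder Inequality
Sort all objects by their similarity to $p$; this gives the ranks $\mathrm{rank}_p(r)$ and $\mathrm{rank}_p(s)$.
Then sort by similarity to $r$; this gives $\mathrm{rank}_r(s)$.

[Figure: objects $r$ and $s$ on the ordering by similarity to $p$, then $s$ on the ordering by similarity to $r$]

Dataset has disorder $D$ if
    $\forall p, r, s:\ \mathrm{rank}_r(s) \le D\,(\mathrm{rank}_p(r) + \mathrm{rank}_p(s))$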
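
For a small dataset the smallest valid $D$ can be found by brute force over all triples. The sketch below makes two assumptions that the slides do not fix: $\mathrm{rank}_p(p) = 0$, and ties are broken by index. The helper names are hypothetical, and the cubic running time makes this suitable only for experiments.

```python
from fractions import Fraction
from itertools import product
from typing import Callable, List, Sequence


def rank_table(points: Sequence, dist: Callable) -> List[List[int]]:
    """ranks[i][j] = rank of points[j] in the ordering by distance to points[i].

    Assumed convention: rank_p(p) = 0, ties broken by index.
    """
    n = len(points)
    ranks = [[0] * n for _ in range(n)]
    for i in range(n):
        # The point itself comes first; the others follow by distance, then index.
        order = sorted(range(n),
                       key=lambda j: (j != i, dist(points[i], points[j]), j))
        for position, j in enumerate(order):
            ranks[i][j] = position
    return ranks


def disorder_constant(points: Sequence, dist: Callable) -> Fraction:
    """Smallest D with rank_r(s) <= D * (rank_p(r) + rank_p(s)) for all p, r, s."""
    ranks = rank_table(points, dist)
    n = len(points)
    worst = Fraction(0)
    for p, r, s in product(range(n), repeat=3):
        denominator = ranks[p][r] + ranks[p][s]
        if denominator == 0:
            continue  # happens only for r == s == p, where rank_r(s) is 0 too
        worst = max(worst, Fraction(ranks[r][s], denominator))
    return worst
```

Using `Fraction` keeps the ratio exact, so the result can be compared with integers without rounding concerns.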
Combinatorial Framework
    =
Comparison oracle: Who is closer to A: B or C?
    +
Disorder inequality: $\mathrm{rank}_r(s) \le D\,(\mathrm{rank}_p(r) + \mathrm{rank}_p(s))$
Combinatorial Framework: FAQ



         Disorder of a metric space? Disorder of $\mathbb{R}^k$?
         In what cases is disorder relatively small?

         Experimental values of $D$ for some practical datasets?
Disorder vs. Others


         If expansion rate is $c$, disorder constant is at most $c^2$

         Doubling dimension and disorder dimension are incomparable

         Disorder inequality implies combinatorial form of “doubling effect”
Combinatorial Framework: Pro & Contra



Advantages:
         Does not require triangle inequality for distances
         Applicable to any data model and any similarity function
         Requires only comparative training information

Limitation: worst-case form of disorder inequality
2  New Algorithms
Nearest Neighbor Search

Assume $S \cup \{q\}$ has disorder constant $D$

Theorem
There is a deterministic and exact algorithm for nearest neighbor search:
         Preprocessing: $O(D^7 n \log^2 n)$
         Search: $O(D^4 \log n)$
Near-Duplicates
Assume the comparison oracle can also tell us whether $\sigma(x, y) > T$ for some similarity threshold $T$

Theorem
All pairs with similarity over $T$ can be found deterministically in time

    $\mathrm{poly}(D)\,(n \log^2 n + |\mathrm{Output}|)$
Visibility Graph

Theorem
For any dataset $S$ with disorder $D$ there exists a visibility graph:
         $\mathrm{poly}(D)\, n \log^2 n$ construction time
         $O(D^4 \log n)$ out-degrees
         Naïve greedy routing deterministically reaches the exact nearest neighbor of the given target $q$ in at most $\log n$ steps
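
Greedy routing itself is simple to state in code. The sketch below runs on an arbitrary directed graph given as an adjacency dictionary; it does not reproduce the visibility-graph construction from the theorem, so on a graph without that structure it may stop at a local optimum. The function name and the `closer_to_target` callback (the comparison oracle with its first argument fixed to the target) are assumptions for illustration.

```python
from typing import Callable, Dict, Hashable, List


def greedy_route(graph: Dict[Hashable, List[Hashable]],
                 start: Hashable,
                 closer_to_target: Callable[[Hashable, Hashable], Hashable]
                 ) -> List[Hashable]:
    """Follow out-edges greedily and return the list of visited nodes.

    closer_to_target(a, b) returns whichever of a, b is closer to the target;
    it is assumed to return a on ties, so every move is a strict improvement.
    """
    path = [start]
    current = start
    while True:
        best = current
        for neighbor in graph.get(current, []):
            best = closer_to_target(best, neighbor)
        if best == current:
            return path  # no out-neighbor is closer to the target
        current = best
        path.append(current)
```

Each step costs one oracle call per out-edge, which is why the theorem bounds both the out-degree and the number of steps.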



[Figure: greedy routing through $p_1, p_2, p_3, p_4$ towards the target $q$]
3  Combinatorial Nets
Combinatorial Ball

    $B(x, r) = \{y : \mathrm{rank}_x(y) < r\}$

In other words, it is a subset of the dataset $S$: the object $x$ itself and its $r - 1$ nearest neighbors

[Figure: the ball $B(x, 10)$ around $x$]
Combinatorial Net
A subset $R \subseteq S$ is called a combinatorial $r$-net iff the following two properties hold:
Covering: $\forall y \in S,\ \exists x \in R$ s.t. $\mathrm{rank}_x(y) < r$
Separation: $\forall x_i \ne x_j \in R$: $\mathrm{rank}_{x_i}(x_j) \ge r$ OR $\mathrm{rank}_{x_j}(x_i) \ge r$

How to construct a combinatorial net?
What upper bound on its size can we guarantee?
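
To make the two properties concrete, here is a checker together with a naive greedy construction. This is a sketch, not the fast construction from the talk: it uses a quadratic number of rank lookups, assumes ranks are available through a hypothetical `rank(x, y)` callback with $\mathrm{rank}_x(x) = 0$, and says nothing about the size of the resulting net.

```python
from typing import Callable, Hashable, List, Sequence

# rank(x, y) stands for rank_x(y); the convention rank_x(x) = 0 is assumed.
Rank = Callable[[Hashable, Hashable], int]


def is_combinatorial_net(net: Sequence, dataset: Sequence,
                         rank: Rank, r: int) -> bool:
    """Check the covering and separation properties of a combinatorial r-net."""
    covering = all(any(rank(x, y) < r for x in net) for y in dataset)
    separation = all(
        rank(a, b) >= r or rank(b, a) >= r
        for i, a in enumerate(net) for b in net[i + 1:]
    )
    return covering and separation


def greedy_net(dataset: Sequence, rank: Rank, r: int) -> List:
    """Naive construction: scan the dataset once, adding centers greedily."""
    net: List = []
    for y in dataset:
        # Skip y if some chosen center x and y lie inside each other's r-balls;
        # in that case x already covers y.
        if not any(rank(x, y) < r and rank(y, x) < r for x in net):
            net.append(y)
    return net
```

For $r \ge 1$ the greedy output satisfies both properties: a skipped object is covered by the center that blocked it, an added object covers itself, and an object is added only if it is separated from every earlier center.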
Basic Data Structure
Combinatorial nets:
    For every $0 \le i \le \log n$, construct an $(n/2^i)$-net

Pointers, pointers, pointers:
         Direct & inverted indices: links between centers and members of their balls
         Cousin links: for every center keep pointers to close centers on the same level
         Navigation links: for every center keep pointers to close centers on the next level
Fast Net Construction




Theorem
Combinatorial nets can be constructed in $O(D^7 n \log^2 n)$ time
Up’n’Down Trick

Assume you have a $2r$-net for the dataset

To compute an $r$-ball around some object $p$:
    1    Take a center $p'$ of a $2r$-ball that covers $p$
    2    Take all centers of $2r$-balls near $p'$
    3    For all of them, write down all members of their $2r$-balls
    4    Sort all written objects with respect to $p$ and keep the $r$ most similar ones
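
The four steps can be sketched as below. The three lookup dictionaries stand in for the direct/inverted indices and cousin links of the data structure; their names and shapes are assumptions, as is the `closer(a, b, c)` oracle signature (returns whichever of b, c is closer to a). How far "nearby" must reach for the result to be exact depends on the disorder inequality and is not modeled here.

```python
from functools import cmp_to_key
from typing import Callable, Dict, Hashable, List


def r_ball_from_2r_net(p: Hashable,
                       r: int,
                       covering_center: Dict[Hashable, Hashable],
                       nearby_centers: Dict[Hashable, List[Hashable]],
                       ball_members: Dict[Hashable, List[Hashable]],
                       closer: Callable[[Hashable, Hashable, Hashable], Hashable]
                       ) -> List[Hashable]:
    """Up'n'Down: go up to a 2r-center, collect nearby 2r-balls, come back down."""
    center = covering_center[p]                           # step 1
    centers = [center] + nearby_centers.get(center, [])  # step 2
    candidates = {p}                                      # p is in its own ball
    for c in centers:                                     # step 3
        candidates.update(ball_members[c])

    def by_closeness_to_p(a, b):
        if a == b:
            return 0
        return -1 if closer(p, a, b) == a else 1

    # Step 4: sort with oracle comparisons only and keep the r closest objects.
    ordered = sorted(candidates, key=cmp_to_key(by_closeness_to_p))
    return ordered[:r]
```

Sorting through `cmp_to_key` keeps the routine inside the comparison-oracle model: no numeric distance is ever read. The order among exactly tied objects is unspecified.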



4  Directions for Further Research
Future of Combinatorial Framework
         Other problems in combinatorial framework:
                  Low-distortion embeddings
                  Closest pairs
                  Community discovery
                  Linear arrangement
                  Distance labelling
                  Dimensionality reduction
         What if disorder inequality has exceptions?
         Insertions, deletions, changing metric
         Experiments & implementation
         Unification challenge: disorder + doubling = ?
Summary
         Combinatorial framework: comparison oracle + disorder inequality
         New algorithms:
                  Nearest neighbor search
                  Deterministic detection of near-duplicates
                  Navigability design

Thanks for your attention! Questions?
Links
http://yury.name

http://simsearch.yury.name
Tutorial, bibliography, people, links, open problems


       Yury Lifshits and Shengyu Zhang
       Combinatorial Algorithms for Nearest Neighbors, Near-Duplicates and Small-World Design
       http://yury.name/papers/lifshits2008similarity.pdf

       Navin Goyal, Yury Lifshits, Hinrich Schütze
       Disorder Inequality: A Combinatorial Approach to Nearest Neighbor Search
       http://yury.name/papers/goyal2008disorder.pdf

       Benjamin Hoffmann, Yury Lifshits, Dirk Nowotka
       Maximal Intersection Queries in Randomized Graph Models
       http://yury.name/papers/hoffmann2007maximal.pdf

More Related Content

Similar to Cobinatorial Algorithms for Nearest Neighbors, Near-Duplicates and Small World Design - Yury Lifshits - SODA 2009

Prolog Cpt114 - Week 2
Prolog Cpt114 - Week 2Prolog Cpt114 - Week 2
Prolog Cpt114 - Week 2
a_akhavan
 
Introduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisIntroduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic Analysis
NYC Predictive Analytics
 
CVPR2010: higher order models in computer vision: Part 1, 2
CVPR2010: higher order models in computer vision: Part 1, 2CVPR2010: higher order models in computer vision: Part 1, 2
CVPR2010: higher order models in computer vision: Part 1, 2
zukun
 

Similar to Cobinatorial Algorithms for Nearest Neighbors, Near-Duplicates and Small World Design - Yury Lifshits - SODA 2009 (20)

Simulation Informatics; Analyzing Large Scientific Datasets
Simulation Informatics; Analyzing Large Scientific DatasetsSimulation Informatics; Analyzing Large Scientific Datasets
Simulation Informatics; Analyzing Large Scientific Datasets
 
Drugs and Electrons
Drugs and ElectronsDrugs and Electrons
Drugs and Electrons
 
Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...
Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...
Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...
 
Class 11: Deeper List Procedures
Class 11: Deeper List ProceduresClass 11: Deeper List Procedures
Class 11: Deeper List Procedures
 
Pertemuan 5_Relation Matriks_01 (17)
Pertemuan 5_Relation Matriks_01 (17)Pertemuan 5_Relation Matriks_01 (17)
Pertemuan 5_Relation Matriks_01 (17)
 
Computing Local and Global Centrality
Computing Local and Global CentralityComputing Local and Global Centrality
Computing Local and Global Centrality
 
A Primer on Entity Resolution
A Primer on Entity ResolutionA Primer on Entity Resolution
A Primer on Entity Resolution
 
Prolog Cpt114 - Week 2
Prolog Cpt114 - Week 2Prolog Cpt114 - Week 2
Prolog Cpt114 - Week 2
 
Fast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreFast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and more
 
Introduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisIntroduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic Analysis
 
CVPR2010: higher order models in computer vision: Part 1, 2
CVPR2010: higher order models in computer vision: Part 1, 2CVPR2010: higher order models in computer vision: Part 1, 2
CVPR2010: higher order models in computer vision: Part 1, 2
 
Slides4
Slides4Slides4
Slides4
 
Anti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutAnti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCut
 
Spectral graph theory
Spectral graph theorySpectral graph theory
Spectral graph theory
 
[ICDE 2012] On Top-k Structural Similarity Search
[ICDE 2012] On Top-k Structural Similarity Search[ICDE 2012] On Top-k Structural Similarity Search
[ICDE 2012] On Top-k Structural Similarity Search
 
Similarity Search in High Dimensions via Hashing
Similarity Search in High Dimensions via HashingSimilarity Search in High Dimensions via Hashing
Similarity Search in High Dimensions via Hashing
 
MUMS Opening Workshop - Emulators for models and Complexity Reduction - Akil ...
MUMS Opening Workshop - Emulators for models and Complexity Reduction - Akil ...MUMS Opening Workshop - Emulators for models and Complexity Reduction - Akil ...
MUMS Opening Workshop - Emulators for models and Complexity Reduction - Akil ...
 
pres_coconat
pres_coconatpres_coconat
pres_coconat
 
Chapter 10 link prediction
Chapter 10 link predictionChapter 10 link prediction
Chapter 10 link prediction
 
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks
 

More from Yury Lifshits

Екатерина Карпова: Опыт организации Ночи Музеев
Екатерина Карпова: Опыт организации Ночи МузеевЕкатерина Карпова: Опыт организации Ночи Музеев
Екатерина Карпова: Опыт организации Ночи Музеев
Yury Lifshits
 
Evolution of Two Sided Markets - Yury Lifshits - WSDM 2010
Evolution of  Two Sided Markets - Yury Lifshits - WSDM 2010Evolution of  Two Sided Markets - Yury Lifshits - WSDM 2010
Evolution of Two Sided Markets - Yury Lifshits - WSDM 2010
Yury Lifshits
 
Business-Consumer Networks: Concept and Challenges
Business-Consumer Networks: Concept and ChallengesBusiness-Consumer Networks: Concept and Challenges
Business-Consumer Networks: Concept and Challenges
Yury Lifshits
 

More from Yury Lifshits (17)

Osh — Curiosity Learning on Mobile
Osh — Curiosity Learning on MobileOsh — Curiosity Learning on Mobile
Osh — Curiosity Learning on Mobile
 
Maester — The First Platform for Lead Generation on Mobile
Maester — The First Platform for Lead Generation on MobileMaester — The First Platform for Lead Generation on Mobile
Maester — The First Platform for Lead Generation on Mobile
 
Earlydays: План построения успешной компании
Earlydays: План построения успешной компанииEarlydays: План построения успешной компании
Earlydays: План построения успешной компании
 
Екатерина Карпова: Опыт организации Ночи Музеев
Екатерина Карпова: Опыт организации Ночи МузеевЕкатерина Карпова: Опыт организации Ночи Музеев
Екатерина Карпова: Опыт организации Ночи Музеев
 
Evolution of Two Sided Markets - Yury Lifshits - WSDM 2010
Evolution of  Two Sided Markets - Yury Lifshits - WSDM 2010Evolution of  Two Sided Markets - Yury Lifshits - WSDM 2010
Evolution of Two Sided Markets - Yury Lifshits - WSDM 2010
 
New Advertising
New AdvertisingNew Advertising
New Advertising
 
Interfaces of Attention: What if people will outsource management of their at...
Interfaces of Attention: What if people will outsource management of their at...Interfaces of Attention: What if people will outsource management of their at...
Interfaces of Attention: What if people will outsource management of their at...
 
Data Cloud - Yury Lifshits - Yahoo! Research
Data Cloud - Yury Lifshits - Yahoo! ResearchData Cloud - Yury Lifshits - Yahoo! Research
Data Cloud - Yury Lifshits - Yahoo! Research
 
Future of Search | Yury Lifshits, Yahoo! Research
Future of Search | Yury Lifshits, Yahoo! ResearchFuture of Search | Yury Lifshits, Yahoo! Research
Future of Search | Yury Lifshits, Yahoo! Research
 
"Economics of Web Design" by Yury Lifshits
"Economics of Web Design" by Yury Lifshits"Economics of Web Design" by Yury Lifshits
"Economics of Web Design" by Yury Lifshits
 
Social Design (Stanford version)
Social Design (Stanford version)Social Design (Stanford version)
Social Design (Stanford version)
 
Social Design
Social DesignSocial Design
Social Design
 
The Architecture of the Web
The Architecture of the WebThe Architecture of the Web
The Architecture of the Web
 
Reputation Systems I
Reputation Systems IReputation Systems I
Reputation Systems I
 
Reputation Systems II
Reputation Systems IIReputation Systems II
Reputation Systems II
 
Business-Consumer Networks: Concept and Challenges
Business-Consumer Networks: Concept and ChallengesBusiness-Consumer Networks: Concept and Challenges
Business-Consumer Networks: Concept and Challenges
 
Business-Consumer Networks. Project Proposal by Yury Lifshits
Business-Consumer Networks. Project Proposal by Yury LifshitsBusiness-Consumer Networks. Project Proposal by Yury Lifshits
Business-Consumer Networks. Project Proposal by Yury Lifshits
 

Recently uploaded

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Recently uploaded (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

Cobinatorial Algorithms for Nearest Neighbors, Near-Duplicates and Small World Design - Yury Lifshits - SODA 2009

  • 1. Combinatorial Algorithms for Nearest Neighbors, Near-Duplicates and Small-World Design Yury Lifshits Yahoo! Research Shengyu Zhang The Chinese University of Hong Kong SODA 2009 Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 1 / 29
  • 2. Similarity Search: an Example Input: Set of objects Task: Preprocess it
  • 3. Similarity Search: an Example Input: Set of objects Task: Preprocess it Query: New object Task: Find the most similar one in the dataset
  • 4. Similarity Search: an Example Input: Set of objects Task: Preprocess it Most similar Query: New object Task: Find the most similar one in the dataset Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 2 / 29
  • 5. Similarity Search Search space: object domain U, similarity function σ Input: database S = {p1 , . . . , pn } ⊆ U Query: q ∈ U Task: find argmax σ(pi , q) Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 3 / 29
  • 6. Nearest Neighbors in Theory Orchard’s Algorithm k-d-B tree Sphere Rectangle Tree Geometric near-neighbor access tree Excluded middle vantage point forest mvp-tree Fixed-height Vantage-point fixed-queries tree AESA tree LAESA R∗ -tree Burkhard-Keller tree BBD tree Navigating Nets Voronoi tree Balanced aspect ratio M-tree tree vps -tree Metric tree Locality-Sensitive Hashing SS-tree R-tree Spatial approximation tree mb-tree Cover Multi-vantage point tree Bisector tree tree Generalized hyperplane tree Hybrid tree Slim tree k-d tree X-tree Spill Tree Fixed queries tree Balltree Quadtree Octree Post-office tree Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 4 / 29
  • 7. Revision: Basic Assumptions In theory: Triangle inequality Doubling dimension is o(log n) Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 5 / 29
  • 8. Revision: Basic Assumptions In theory: Triangle inequality Doubling dimension is o(log n) Typical web dataset has separation effect 1/ 2 ≤ d(pi , pj ) ≤ 1 For almost all i, j : Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 5 / 29
  • 9. Revision: Basic Assumptions In theory: Triangle inequality Doubling dimension is o(log n) Typical web dataset has separation effect 1/ 2 ≤ d(pi , pj ) ≤ 1 For almost all i, j : Classic methods fail: Branch and bound algorithms visit every object Doubling dimension is at least log n/ 2 Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 5 / 29
  • 10. Contribution Navin Goyal, YL, Hinrich Schütze, WSDM 2008: Combinatorial framework: new approach to data mining problems that does not require triangle inequality Nearest neighbor algorithm Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 6 / 29
  • 11. Contribution Navin Goyal, YL, Hinrich Schütze, WSDM 2008: Combinatorial framework: new approach to data mining problems that does not require triangle inequality Nearest neighbor algorithm This work: Better nearest neighbor search Detecting near-duplicates Navigability design for small worlds Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 6 / 29
  • 12. Outline Combinatorial Framework 1 New Algorithms 2 Combinatorial Nets 3 Directions for Further Research 4 Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 7 / 29
  • 13. 1 Combinatorial Framework Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 8 / 29
  • 14. Comparison Oracle Dataset p1 , . . . , pn Objects and distance (or similarity) function are NOT given Instead, there is a comparison oracle answering queries of the form: Who is closer to A: B or C? Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 9 / 29
  • 15-17. Disorder Inequality. Sort all objects by their similarity to p; this gives the ranks rank_p(r) and rank_p(s). Then sort them by similarity to r; this gives rank_r(s). The dataset has disorder D if ∀ p, r, s: rank_r(s) ≤ D(rank_p(r) + rank_p(s)).
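The disorder inequality above can be checked by brute force on a small dataset. The sketch below is an assumption-laden illustration, not the authors' code: the helper names `rank_table` and `disorder_constant` are hypothetical, points are assumed distinct, the rank of a point with respect to itself is taken to be 0, and ties are broken by input order.

```python
from itertools import product


def rank_table(points, dist):
    """ranks[p][x] = position of x when all points are sorted by distance to p."""
    ranks = {}
    for p in points:
        ordered = sorted(points, key=lambda x: dist(p, x))
        ranks[p] = {x: i for i, x in enumerate(ordered)}
    return ranks


def disorder_constant(points, dist):
    """Smallest D with rank_r(s) <= D * (rank_p(r) + rank_p(s)) for all p, r, s.

    Brute force over all triples, so only suitable for tiny datasets.
    """
    ranks = rank_table(points, dist)
    worst = 0.0
    for p, r, s in product(points, repeat=3):
        denom = ranks[p][r] + ranks[p][s]
        if denom == 0:          # happens only when r == s == p; both sides are 0
            continue
        worst = max(worst, ranks[r][s] / denom)
    return worst


# Three points on a line: the triple p = 3, r = 1, s = 3 gives
# rank_1(3) = 2 and rank_3(1) + rank_3(3) = 1, so D must be at least 2.
D = disorder_constant([0, 1, 3], lambda a, b: abs(a - b))
```

Taking r = p shows that D is at least 1 for any dataset with two or more points, since the inequality then reads rank_p(s) ≤ D · rank_p(s).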
  • 18. Combinatorial Framework = comparison oracle (who is closer to A, B or C?) + disorder inequality (rank_r(s) ≤ D(rank_p(r) + rank_p(s))).
  • 19. Combinatorial Framework: FAQ. Disorder of a metric space? Disorder of R^k? In what cases is disorder relatively small? Experimental values of D for some practical datasets?
  • 20. Disorder vs. Others. If the expansion rate is c, the disorder constant is at most c^2. Doubling dimension and disorder dimension are incomparable. The disorder inequality implies a combinatorial form of the “doubling effect”.
  • 21-22. Combinatorial Framework: Pro & Contra. Advantages: does not require the triangle inequality for distances; applicable to any data model and any similarity function; requires only comparative training information. Limitation: worst-case form of the disorder inequality.
  • 23. Part 2: New Algorithms
  • 24. Nearest Neighbor Search. Assume S ∪ {q} has disorder constant D. Theorem: there is a deterministic and exact algorithm for nearest neighbor search with preprocessing O(D^7 n log^2 n) and search O(D^4 log n).
  • 25. Near-Duplicates. Assume the comparison oracle can also tell us whether σ(x, y) > T for some similarity threshold T. Theorem: all pairs with similarity over T can be found deterministically in time poly(D)(n log^2 n + |Output|).
  • 26. Visibility Graph. Theorem: for any dataset S with disorder D there exists a visibility graph with poly(D) n log^2 n construction time and O(D^4 log n) out-degrees, on which naïve greedy routing deterministically reaches the exact nearest neighbor of the given target q in at most log n steps.
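Greedy routing itself is simple to sketch; the hard part of the theorem is constructing a graph on which it succeeds. The Python sketch below shows only the routing loop. The function `greedy_route`, the strict comparison `closer`, and the toy graph (16 points on a line with links at offsets 1, 2, 4, 8) are assumptions made for illustration; the toy graph is not the visibility graph from the talk and carries none of its guarantees.

```python
def greedy_route(neighbors, closer, start, q):
    """Repeatedly move to the out-neighbor closest to q; stop when none improves.

    closer(q, b, c) must return True iff b is strictly closer to q than c,
    so every move strictly improves and the loop terminates.
    """
    path = [start]
    current = start
    while True:
        best = current
        for v in neighbors[current]:
            if closer(q, v, best):
                best = v
        if best == current:      # no neighbor is closer to q: stop here
            return path
        current = best
        path.append(current)


# Toy graph (assumption): points 0..15 on a line, links at offsets 1, 2, 4, 8.
n = 16
neighbors = {
    i: [j for step in (1, 2, 4, 8) for j in (i - step, i + step) if 0 <= j < n]
    for i in range(n)
}


def closer(q, b, c):
    """Simulated comparison oracle for points on the real line."""
    return abs(q - b) < abs(q - c)


path = greedy_route(neighbors, closer, start=0, q=11.3)
# path == [0, 8, 12, 11]; the walk ends at 11, the nearest point to 11.3
```

On this toy graph the offset-1 links guarantee that the walk cannot stop before reaching the nearest point; the log n step bound in the theorem relies on the authors' construction instead.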
  • 27-32. [Animation: the target q and the points p_1, p_2, p_3, p_4, revealed step by step.]
  • 33. Part 3: Combinatorial Nets
  • 34. Combinatorial Ball. B(x, r) = {y : rank_x(y) < r}. In other words, it is a subset of the dataset S: the object x itself and its r − 1 nearest neighbors. [Figure: the ball B(x, 10) around x.]
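A combinatorial ball can be computed with comparisons alone, by sorting the dataset around x. The sketch below is illustrative: the function `ball` and the boolean oracle `closer` are hypothetical names, the oracle is simulated with a distance on the real line, and distances to x are assumed to have no ties.

```python
from functools import cmp_to_key


def ball(dataset, x, r, closer):
    """B(x, r) = {y : rank_x(y) < r}: x itself and its r - 1 nearest neighbors.

    closer(a, b, c) returns True iff b is at least as close to a as c is.
    Only such comparisons are used; no distance value is read here.
    """
    def compare(b, c):
        if b == c:
            return 0
        return -1 if closer(x, b, c) else 1

    ordered = sorted(dataset, key=cmp_to_key(compare))
    return ordered[:r]          # ranks 0 .. r - 1


def closer(a, b, c):
    """Simulated comparison oracle for points on the real line."""
    return abs(a - b) <= abs(a - c)


b3 = ball([0, 2, 3, 7, 11], x=3, r=3, closer=closer)
# b3 == [3, 2, 0]
```

Sorting costs O(n log n) oracle queries per ball, which is why the talk builds nets instead of computing balls from scratch.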
  • 35-36. Combinatorial Net. A subset R ⊆ S is called a combinatorial r-net iff the following two properties hold. Covering: ∀ y ∈ S, ∃ x ∈ R s.t. rank_x(y) < r. Separation: ∀ x_i, x_j ∈ R, rank_{x_i}(x_j) ≥ r OR rank_{x_j}(x_i) ≥ r. How to construct a combinatorial net? What upper bound on its size can we guarantee?
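One naive way to obtain a set satisfying both properties is a greedy pass: add a point as a new center whenever no existing center covers it. The Python sketch below illustrates this under stated assumptions; it is a simple quadratic-time baseline, not the fast construction announced in the talk, and the names `rank_table`, `greedy_net`, and `is_net` are hypothetical. Points are assumed distinct, and ties are broken by input order.

```python
def rank_table(points, dist):
    """ranks[p][x] = position of x when all points are sorted by distance to p."""
    return {
        p: {x: i for i, x in enumerate(sorted(points, key=lambda x: dist(p, x)))}
        for p in points
    }


def greedy_net(points, dist, r):
    """Add y as a center iff no earlier center x has rank_x(y) < r."""
    ranks = rank_table(points, dist)
    centers = []
    for y in points:
        if all(ranks[x][y] >= r for x in centers):
            centers.append(y)
    return centers


def is_net(points, dist, r, centers):
    """Check the covering and separation properties directly."""
    ranks = rank_table(points, dist)
    covering = all(any(ranks[x][y] < r for x in centers) for y in points)
    separation = all(
        ranks[a][b] >= r or ranks[b][a] >= r
        for a in centers for b in centers if a != b
    )
    return covering and separation


points = list(range(10))
centers = greedy_net(points, lambda a, b: abs(a - b), r=3)
ok = is_net(points, lambda a, b: abs(a - b), 3, centers)
```

Covering holds because every point is either covered when visited or becomes a center (rank 0 with respect to itself). Separation holds because a later center was not covered by any earlier one, which gives one side of the OR.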
  • 37-38. Basic Data Structure. Combinatorial nets: for every 0 ≤ i ≤ log n, construct an (n/2^i)-net. Pointers, pointers, pointers. Direct & inverted indices: links between centers and members of their balls. Cousin links: for every center, keep pointers to close centers on the same level. Navigation links: for every center, keep pointers to close centers on the next level.
  • 39. Fast Net Construction. Theorem: combinatorial nets can be constructed in O(D^7 n log^2 n) time.
  • 40. Up’n’Down Trick. Assume you have a 2r-net for the dataset. To compute an r-ball around some object p: 1. Take a center p′ of a 2r-ball that covers p. 2. Take all centers of 2r-balls near p′. 3. For all of them, write down all members of their 2r-balls. 4. Sort all written objects with respect to p and keep the r most similar ones.
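The four steps above can be sketched in Python as follows. Several things here are assumptions made for illustration: the slide does not say how many centers count as "nearby", so the sketch takes a hypothetical cutoff `k` (the number of closest centers consulted); the 2r-net is built by a naive greedy pass rather than the authors' fast construction; and the function names are invented. With `k` equal to the number of centers, the candidate set is the whole dataset and the result is exact. For smaller `k`, exactness would have to come from the disorder bound, which this sketch does not verify.

```python
def sort_by_distance(points, p, dist):
    """Points ordered from most to least similar to p (ties keep input order)."""
    return sorted(points, key=lambda x: dist(p, x))


def build_net(points, dist, radius):
    """Naive greedy net: maps each center to the members of its ball."""
    balls = {}
    for y in points:
        if not any(y in members for members in balls.values()):
            balls[y] = sort_by_distance(points, y, dist)[:radius]
    return balls


def r_ball_via_2r_net(p, r, balls_2r, dist, k):
    # Step 1: a center p' whose 2r-ball covers p.
    p_prime = next(c for c, members in balls_2r.items() if p in members)
    # Step 2: the k centers closest to p' (k is an assumed cutoff).
    nearby = sort_by_distance(list(balls_2r), p_prime, dist)[:k]
    # Step 3: write down all members of their 2r-balls.
    candidates = {y for c in nearby for y in balls_2r[c]}
    # Step 4: sort the candidates with respect to p, keep the r most similar.
    return sort_by_distance(sorted(candidates), p, dist)[:r]


def dist(a, b):
    return abs(a - b)


points = list(range(12))
r = 2
balls_2r = build_net(points, dist, 2 * r)
result = r_ball_via_2r_net(5, r, balls_2r, dist, k=len(balls_2r))
```

The benefit of the trick is that step 4 sorts only the members of a few 2r-balls instead of the whole dataset, provided a small `k` suffices.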
  • 41. Part 4: Directions for Further Research
  • 42. Future of Combinatorial Framework. Other problems in the combinatorial framework: low-distortion embeddings, closest pairs, community discovery, linear arrangement, distance labelling, dimensionality reduction. What if the disorder inequality has exceptions? Insertions, deletions, changing metric. Experiments & implementation. Unification challenge: disorder + doubling = ?
  • 43-44. Summary. Combinatorial framework: comparison oracle + disorder inequality. New algorithms: nearest neighbor search, deterministic detection of near-duplicates, navigability design. Thanks for your attention! Questions?
  • 45. Links. http://yury.name and http://simsearch.yury.name (tutorial, bibliography, people, links, open problems); Yury Lifshits and Shengyu Zhang, Combinatorial Algorithms for Nearest Neighbors, Near-Duplicates and Small-World Design, http://yury.name/papers/lifshits2008similarity.pdf ; Navin Goyal, Yury Lifshits, Hinrich Schütze, Disorder Inequality: A Combinatorial Approach to Nearest Neighbor Search, http://yury.name/papers/goyal2008disorder.pdf ; Benjamin Hoffmann, Yury Lifshits, Dirk Nowotka, Maximal Intersection Queries in Randomized Graph Models, http://yury.name/papers/hoffmann2007maximal.pdf