Sketching, Sampling, and other Sublinear Algorithms 1 (Lecture by Alex Andoni)


We will learn about modern algorithmic techniques for handling large datasets, often by using imprecise but concise representations of the data, such as a sketch or a sample. The lectures will cluster around three themes:

  • Nearest Neighbor Search (similarity search): given a set of objects (e.g., images), construct a data structure so that later, given a query object, one can efficiently find the most similar object in the database.
  • Streaming framework: we must solve a problem on a large collection of items that we stream through only once (i.e., the algorithm's memory footprint is much smaller than the dataset itself). For example, how can a router with 1Mb of memory estimate the number of distinct IPs it sees in a multi-gigabyte stream of real-time traffic?
  • Parallel framework: we look at problems where neither the data nor the output fits on a single machine. For example, given a set of 2D points, how can we compute the minimum spanning tree over a cluster of machines?

The focus will be on techniques such as sketching, dimensionality reduction, sampling, hashing, and others.
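
To make the router question from the streaming theme concrete, here is a minimal sketch of one possible distinct-count estimator, the "k minimum values" (KMV) technique. It is offered as an illustration only and is not necessarily the method covered in the lectures. The class name `KMVSketch`, the helper `hash01`, the choice of SHA-1 as the hash, and the parameter `k` are all assumptions made for this example.

```python
import hashlib
import heapq


def hash01(item: str) -> float:
    """Map an item to a pseudo-random value in [0, 1) (assumption: SHA-1 is 'random enough')."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSketch:
    """Keeps only the k smallest distinct hash values seen in the stream."""

    def __init__(self, k: int = 1024):
        self.k = k
        self._heap = []        # negated hashes, so the largest kept hash sits at index 0
        self._members = set()  # hashes currently kept, used to ignore repeats

    def add(self, item: str) -> None:
        h = hash01(item)
        if h in self._members:
            return  # a repeat of an item already represented in the sketch
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct hashes: count is exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


if __name__ == "__main__":
    sketch = KMVSketch(k=1024)
    for i in range(100_000):
        # Hypothetical traffic: 20,000 distinct addresses, each seen five times.
        sketch.add(f"10.0.{(i % 20_000) // 256}.{i % 256}")
    print(round(sketch.estimate()))
```

The sketch stores at most `k` hash values regardless of stream length, and its relative error shrinks roughly like 1/sqrt(k), so `k` trades memory for accuracy.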

Published in: Education, Technology
  • Draw L hash tables

    1. Sketching, Sampling and other Sublinear Algorithms: Nearest Neighbor Search. Alex Andoni (MSR SVC)
    2. Nearest Neighbor Search (NNS)
    3. Motivation. Generic setup: points model objects (e.g. images); distance models a (dis)similarity measure. Application areas: machine learning (k-NN rule); speech/image/video/music recognition, vector quantization, bioinformatics, etc. Distance can be: Hamming, Euclidean, edit distance, Earth-mover distance, etc. Primitive for other problems: find the similar pairs in a set D, clustering, … [Figure: example bit strings 000000 011100 010100 000100 010100 011111 and 000000 001100 000100 000100 110100 111111]
    4. Lecture Plan: 1. Locality-Sensitive Hashing; 2. LSH as a Sketch; 3. Towards Embeddings
    5. 2D case
    6. High-dimensional case [Table columns: Algorithm, Query time, Space; rows: Full indexing, No indexing (linear scan)]
    7. Approximate NNS [Figure labels: q, r, p, cr]
    8. Heuristic for Exact NNS [Figure labels: q, r, p, cr]
    9. Approximation Algorithms for NNS. A vast literature. Milder dependence on dimension: [Arya-Mount’93], [Clarkson’94], [Arya-Mount-Netanyahu-Silverman-Wu’98], [Kleinberg’97], [Har-Peled’02], … [Aiger-Kaplan-Sharir’13]. Little to no dependence on dimension: [Indyk-Motwani’98], [Kushilevitz-Ostrovsky-Rabani’98], [Indyk’98, ‘01], [Gionis-Indyk-Motwani’99], [Charikar’02], [Datar-Immorlica-Indyk-Mirrokni’04], [Chakrabarti-Regev’04], [Panigrahy’06], [Ailon-Chazelle’06], [A-Indyk’06], … [A-Indyk-Nguyen-Razenshteyn’??]
    10. Locality-Sensitive Hashing [Indyk-Motwani’98] [Figure labels: q, p, “not-so-small”]
    11. Locality sensitive hash functions
    12. Formal description
    13. Analysis of LSH Scheme
    14. Analysis: Correctness
    15. Analysis: Runtime
    16. LSH in the wild: safety not guaranteed; fewer false positives; fewer tables
    17. LSH Zoo. Example texts: “To be or not to be” and “To sketch or not to sketch”. Count vectors: …21102… and …01122… (word labels: be, to, or, not, sketch). Bit vectors: …11101… and …01111…. Sets: {be,not,or,to} and {not,or,to,sketch}
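
The formulas on the LSH slides did not survive in this transcript, so the following is only a hedged sketch of the general idea, assuming the standard bit-sampling construction for Hamming distance: each of L hash tables keys a point by k randomly chosen coordinates, and a query inspects only the buckets it falls into. The class name `BitSamplingLSH`, the parameter values, and the toy data (bit strings borrowed from the Motivation slide) are assumptions for illustration, not the lecture's exact scheme.

```python
import random
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Number of positions where two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


class BitSamplingLSH:
    """L hash tables; table i keys each point by k sampled coordinates."""

    def __init__(self, dim: int, k: int, L: int, seed: int = 0):
        rng = random.Random(seed)
        # One tuple of k coordinate indices per table (sampled with replacement).
        self.projections = [
            tuple(rng.randrange(dim) for _ in range(k)) for _ in range(L)
        ]
        self.tables = [defaultdict(list) for _ in range(L)]

    @staticmethod
    def _key(point: str, coords) -> tuple:
        return tuple(point[i] for i in coords)

    def insert(self, point: str) -> None:
        for coords, table in zip(self.projections, self.tables):
            table[self._key(point, coords)].append(point)

    def query(self, q: str, cr: int):
        """Return some stored point within Hamming distance cr of q, or None."""
        for coords, table in zip(self.projections, self.tables):
            for p in table.get(self._key(q, coords), ()):
                if hamming(p, q) <= cr:
                    return p
        return None


if __name__ == "__main__":
    points = ["000000", "011100", "010100", "000100", "110100", "111111"]
    index = BitSamplingLSH(dim=6, k=3, L=4, seed=1)
    for p in points:
        index.insert(p)
    print(index.query("011101", cr=2))
```

A query can return `None` even when a close point exists, because that point may not share a bucket with the query in any table. Increasing L lowers that miss probability at the cost of more space, while increasing k makes buckets smaller and reduces the number of far points that have to be checked.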