Sketching, Sampling, and other Sublinear Algorithms 1 (Lecture by Alex Andoni)

We will learn about modern algorithmic techniques for handling large datasets, often by using imprecise but concise representations of the data, such as a sketch or a sample. The lectures will cluster around three themes:

Nearest Neighbor Search (similarity search): the general problem is, given a set of objects (e.g., images), to construct a data structure so that later, given a query object, one can efficiently find the most similar object in the database.
Streaming framework: we are required to solve a certain problem on a large collection of items that one streams through once (i.e., the algorithm's memory footprint is much smaller than the dataset itself). For example, how can a router with 1Mb of memory estimate the number of different IPs it sees in a multi-gigabyte stream of real-time traffic?
Parallel framework: we look at problems where neither the data nor the output fits on a single machine. For example, given a set of 2D points, how can we compute the minimum spanning tree over a cluster of machines?
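
To make the router example from the streaming theme concrete, here is a minimal sketch of one possible approach: a k-minimum-values (KMV) sketch, which remembers only the k smallest hash values seen so far and estimates the number of distinct items from the k-th smallest one. This is an illustration chosen for brevity, not necessarily the method presented in the lectures; the class name `KMVSketch`, the parameter `k`, and the use of SHA-256 as the hash are all assumptions.

```python
import hashlib
import heapq


class KMVSketch:
    """K-minimum-values sketch: stores only the k smallest hash values seen."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # max-heap (negated values) of the k smallest hashes
        self._members = set()  # hashes currently stored, used to skip duplicates

    @staticmethod
    def _hash01(item):
        # Map an item to a pseudo-random float in [0, 1] via SHA-256 (an assumption;
        # any well-mixing hash function would do).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        h = self._hash01(item)
        if h in self._members:
            return  # this hash is already stored; repeated items change nothing
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest stored one: swap them.
            evicted = -heapq.heapreplace(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return len(self._heap)  # fewer than k distinct hashes: count them directly
        kth_smallest = -self._heap[0]
        # If n distinct hashes are uniform in [0, 1], the k-th smallest is about k / n.
        return int((self.k - 1) / kth_smallest)


def count_distinct(stream, k=256):
    """Single pass over the stream using O(k) memory."""
    sketch = KMVSketch(k)
    for item in stream:
        sketch.add(item)
    return sketch.estimate()
```

Memory use depends only on `k`, not on the length of the stream; a larger `k` costs more memory and gives a tighter estimate (the relative error shrinks roughly like 1/sqrt(k)).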

The focus will be on techniques such as sketching, dimensionality reduction, sampling, hashing, and others.
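
As an illustration of how hashing can serve the nearest neighbor theme, the following is a minimal sketch of bit-sampling locality-sensitive hashing for bit strings under Hamming distance, using `L` hash tables that each hash a point by `k` randomly chosen coordinates. The class name `BitSamplingLSH`, the default values of `k` and `L`, and the final re-ranking of candidates by exact distance are assumptions made for this example, not details taken from the lecture.

```python
import random
from collections import defaultdict


def hamming(a, b):
    """Number of positions where two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


class BitSamplingLSH:
    def __init__(self, dim, k=4, L=8, seed=0):
        rng = random.Random(seed)
        # Each of the L tables hashes a point by k randomly chosen coordinates.
        self.projections = [
            [rng.randrange(dim) for _ in range(k)] for _ in range(L)
        ]
        self.tables = [defaultdict(list) for _ in range(L)]

    @staticmethod
    def _key(point, coords):
        return tuple(point[i] for i in coords)

    def insert(self, point):
        for coords, table in zip(self.projections, self.tables):
            table[self._key(point, coords)].append(point)

    def query(self, q):
        # Collect every point that shares a bucket with q in at least one table.
        candidates = set()
        for coords, table in zip(self.projections, self.tables):
            candidates.update(table.get(self._key(q, coords), []))
        if not candidates:
            return None  # no collision in any table: safety is not guaranteed
        # Re-rank the (hopefully few) candidates by exact Hamming distance.
        return min(candidates, key=lambda p: hamming(p, q))


points = ["000000", "011100", "010100", "000100", "011111", "111111"]
index = BitSamplingLSH(dim=6, k=3, L=6)
for p in points:
    index.insert(p)
nearest = index.query("011101")  # a near neighbor if one collides, else None
```

Close points agree on most coordinates, so they are likely to land in the same bucket in at least one table, while far points rarely do. Increasing `k` reduces false positives per table; increasing `L` raises the chance that a true near neighbor is found.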

Published in: Education, Technology

  • Slide note: draw L hash tables
  • Slide transcript

    1. Sketching, Sampling and other Sublinear Algorithms: Nearest Neighbor Search. Alex Andoni (MSR SVC)
    2. Nearest Neighbor Search (NNS)
    3. Motivation
       • Generic setup: points model objects (e.g. images); distance models a (dis)similarity measure
       • Application areas: machine learning (k-NN rule); speech/image/video/music recognition, vector quantization, bioinformatics, etc.
       • Distance can be: Hamming, Euclidean, edit distance, Earth-mover distance, etc.
       • Primitive for other problems: find the similar pairs in a set D, clustering, …
       • [figure: example bit strings 000000 011100 010100 000100 010100 011111 000000 001100 000100 000100 110100 111111]
    4. Lecture Plan: 1. Locality-Sensitive Hashing; 2. LSH as a Sketch; 3. Towards Embeddings
    5. 2D case
    6. High-dimensional case [table with columns Algorithm, Query time, Space; rows: Full indexing; No indexing – linear scan]
    7. Approximate NNS [figure labels: q, r, p, cr]
    8. Heuristic for Exact NNS [figure labels: q, r, p, cr]
    9. Approximation Algorithms for NNS. A vast literature:
       • milder dependence on dimension: [Arya-Mount’93], [Clarkson’94], [Arya-Mount-Netanyahu-Silverman-Wu’98], [Kleinberg’97], [Har-Peled’02], … [Aiger-Kaplan-Sharir’13]
       • little to no dependence on dimension: [Indyk-Motwani’98], [Kushilevitz-Ostrovsky-Rabani’98], [Indyk’98, ‘01], [Gionis-Indyk-Motwani’99], [Charikar’02], [Datar-Immorlica-Indyk-Mirrokni’04], [Chakrabarti-Regev’04], [Panigrahy’06], [Ailon-Chazelle’06], [A-Indyk’06], … [A-Indyk-Nguyen-Razenshteyn’??]
    10. Locality-Sensitive Hashing [Indyk-Motwani’98] [figure labels: q, p, 1, “not-so-small”]
    11. Locality sensitive hash functions
    12. Formal description
    13. Analysis of LSH Scheme
    14. Analysis: Correctness
    15. Analysis: Runtime
    16. LSH in the wild: safety not guaranteed; fewer false positives; fewer tables
    17. LSH Zoo
       • Example texts: "To be or not to be" and "To sketch or not to sketch"
       • Count vectors (over the words be, to, or, not, sketch): …21102… and …01122…
       • Binary vectors: …11101… and …01111…
       • Sets: {be,not,or,to} and {not,or,to,sketch}
       • [figure residue: 1 1 not not be to]
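
The set representation on the last slide, {be,not,or,to} versus {not,or,to,sketch}, lends itself to a small worked example. The following minimal sketch assumes the intended scheme is min-wise hashing (MinHash), where each hash function keeps the element of the set with the smallest hash value; the transcript does not confirm this, and the function names, the SHA-256-based hash, and `num_hashes` are hypothetical choices for illustration.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two non-empty sets."""
    return len(a & b) / len(a | b)


def _hash(seed, token):
    # One hash function per seed, built from SHA-256 (an assumption).
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def minhash_signature(tokens, num_hashes=200):
    """For each hash function, keep the token with the smallest hash value.

    Assumes `tokens` is a non-empty set.
    """
    return [
        min(tokens, key=lambda t: _hash(seed, t)) for seed in range(num_hashes)
    ]


def estimate_similarity(sig_a, sig_b):
    """Fraction of hash functions on which the two signatures agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = set("to be or not to be".split())             # {be, not, or, to}
b = set("to sketch or not to sketch".split())     # {not, or, to, sketch}
exact = jaccard(a, b)                             # 3 shared words out of 5 total
approx = estimate_similarity(minhash_signature(a), minhash_signature(b))
```

For a single random hash function, the two sets pick the same minimum element with probability equal to their Jaccard similarity, which is 3/5 here, so the fraction of agreeing signature positions approaches that value as `num_hashes` grows.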
