Sketching, Sampling, and other Sublinear Algorithms 2 (Lecture by Alex Andoni)


We will learn about modern algorithmic techniques for handling large datasets, often by using imprecise but concise representations of the data, such as a sketch or a sample. The lectures will cluster around three themes:

Nearest Neighbor Search (similarity search): the general problem is, given a set of objects (e.g., images), to construct a data structure so that later, given a query object, one can efficiently find the most similar object from the database.
Streaming framework: we are required to solve a certain problem on a large collection of items that we stream through once (i.e., the algorithm's memory footprint is much smaller than the dataset itself). For example, how can a router with 1 MB of memory estimate the number of distinct IP addresses it sees in a multi-gigabyte stream of real-time traffic?
Parallel framework: we look at problems where neither the data nor the output fits on a single machine. For example, given a set of 2D points, how can we compute the minimum spanning tree over a cluster of machines?
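
The router example above can be made concrete with a distinct-count sketch. Below is a minimal, illustrative Flajolet-Martin style estimator (the function names and parameter choices are my own, not from the lecture); a production system would use a refined variant such as HyperLogLog, which adds bias corrections:

```python
import hashlib
import random

def trailing_zeros(x: int) -> int:
    """Number of trailing zero bits of a positive integer."""
    return (x & -x).bit_length() - 1

def distinct_estimate(stream, num_hashes=32):
    """Crude Flajolet-Martin style sketch for counting distinct items:
    per hash function, remember only the largest number of trailing
    zeros seen -- a few bytes of state, regardless of stream length."""
    maxima = [0] * num_hashes
    for item in stream:
        for i in range(num_hashes):
            digest = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            value = int.from_bytes(digest, "big") | (1 << 63)  # never zero
            maxima[i] = max(maxima[i], trailing_zeros(value))
    # 2^(average max-exponent) lands within a small constant factor of
    # the true count; LogLog/HyperLogLog add corrections on top of this
    return 2 ** (sum(maxima) / num_hashes)

random.seed(0)
stream = [random.randrange(500) for _ in range(5000)]
true_count = len(set(stream))  # almost surely 500
```

The key point is the memory footprint: 32 small integers, independent of how many items stream by.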

The focus will be on techniques such as sketching, dimensionality reduction, sampling, hashing, and others.


  1. Sketching, Sampling and other Sublinear Algorithms: Euclidean space, dimension reduction and NNS. Alex Andoni (MSR SVC)
  2. A Sketching Problem: given two bit strings (e.g., 010110 vs. 010101), decide whether they are similar. Running example: "To be or not to be" vs. "To sketch or not to sketch".
  3. Sketch from LSH [Broder'97]: a sketch for the Jaccard coefficient (MinHash).
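
The [Broder'97] sketch for the Jaccard coefficient is MinHash. A minimal sketch of the idea (helper names are illustrative, not from the slides): two sets' signatures agree in a coordinate exactly as often as their Jaccard similarity, so the agreement rate estimates it.

```python
import random

def minhash_signature(items, num_hashes=500, seed=0):
    """MinHash sketch [Broder'97]: for each of num_hashes salted hash
    functions, keep the minimum hash value over the set's elements."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash((salt, x)) for x in items) for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of coordinates where two signatures agree: an unbiased
    estimate of |A intersect B| / |A union B|."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

A = set("to be or not to be".split())          # {to, be, or, not}
B = set("to sketch or not to sketch".split())  # {to, sketch, or, not}
# true Jaccard: |{to, or, not}| / |{to, be, or, not, sketch}| = 3/5
sig_a = minhash_signature(A)
sig_b = minhash_signature(B)
```

Both signatures must use the same seed: agreement is only meaningful when the coordinates come from the same salted hash functions.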
  4. General Theory: embeddings. An embedding f reduces a problem P under a hard metric to P under a simpler metric. Example metrics: Euclidean distance (ℓ2), Hamming distance, edit distance between two strings, Earth-Mover (transportation) distance. Example problems: computing the distance between two points, diameter/closest pair of a set S, clustering, MST, etc., and Nearest Neighbor Search.
  5. Embeddings: landscape
  6. Dimension Reduction
  7. Main intuition
  8. 1D embedding
  9. 1D embedding (continued)
  10. Full Dimension Reduction
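
The dimension-reduction slides concern the Johnson-Lindenstrauss phenomenon: projecting through a random Gaussian matrix onto k dimensions preserves pairwise Euclidean distances up to a 1 ± O(1/√k) factor. A small self-contained illustration (function names are mine; stdlib only, no NumPy):

```python
import math
import random

def project(points, k, seed=0):
    """Johnson-Lindenstrauss style projection: multiply each point by
    a random k x d Gaussian matrix with entries of variance 1/k, so
    that squared lengths are preserved in expectation."""
    rng = random.Random(seed)
    d = len(points[0])
    G = [[rng.gauss(0, 1) / math.sqrt(k) for _ in range(d)] for _ in range(k)]
    return [[sum(row[i] * p[i] for i in range(d)) for row in G] for p in points]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

rng = random.Random(1)
d, k = 1000, 200
p = [rng.gauss(0, 1) for _ in range(d)]
q = [rng.gauss(0, 1) for _ in range(d)]
pp, qq = project([p, q], k)
# the distance ratio concentrates around 1 (error roughly O(1/sqrt(k)))
ratio = dist(pp, qq) / dist(p, q)
```

Note the projection is oblivious to the data: the same random G works, with high probability, for all pairs among n points once k = O(ε⁻² log n).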
  11. Concentration
  12. Dimension Reduction: wrap-up
  13. NNS for Euclidean space [Datar-Immorlica-Indyk-Mirrokni'04]
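
The [Datar-Immorlica-Indyk-Mirrokni'04] scheme hashes a point by projecting it onto a random Gaussian direction and bucketing the line into width-w cells: h(x) = ⌊(a·x + b)/w⌋. A minimal illustration of one such hash (the parameter values below are arbitrary, chosen only for the demonstration):

```python
import math
import random

def make_lsh(d, w=4.0, seed=0):
    """One hash from the 2-stable LSH family of
    [Datar-Immorlica-Indyk-Mirrokni'04]: h(x) = floor((a.x + b) / w),
    with a a random Gaussian vector and b uniform in [0, w)."""
    rng = random.Random(seed)
    a = [rng.gauss(0, 1) for _ in range(d)]
    b = rng.uniform(0, w)
    return lambda x: math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / w)

rng = random.Random(1)
d = 50
hashes = [make_lsh(d, seed=s) for s in range(200)]
x = [rng.gauss(0, 1) for _ in range(d)]
near = [xi + rng.gauss(0, 0.1) for xi in x]  # small perturbation of x
far = [rng.gauss(0, 1) for _ in range(d)]    # an unrelated random point

# nearby points fall in the same bucket far more often than distant ones
near_collisions = sum(h(x) == h(near) for h in hashes)
far_collisions = sum(h(x) == h(far) for h in hashes)
```

The collision-probability gap between near and far points is exactly what the NNS data structure amplifies by concatenating hashes and using several tables.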
  14. Near-Optimal LSH [A-Indyk'06]: replace the regular grid with a grid of balls; a point p can hit empty space, so take more such grids until p falls inside a ball. Directly, this needs (too) many grids of balls, so start by projecting to dimension t. The analysis gives a tradeoff in the choice of the reduced dimension t: the number of hash tables versus the time to hash, t^O(t). Total query time: d·n^(1/c² + o(1)).
  15. Open question: design a random space partition for which log(1/Pr[a needle of length 1 is not cut]) is smaller than log(1/Pr[a needle of length c is not cut]) by a factor of at least c².
  16. Time-Space Trade-offs:
      Space     Time      Comment                   Reference
      high      low       one hash table lookup!    [KOR'98, IM'98, Pan'06]
      medium    medium                              [DIIM'04, AI'06]
      low       high                                [Ind'01, Pan'06]
      Lower bounds: space n^o(1/ε²) requires ω(1) memory lookups [AIP'06]; space n^(1+o(1/c²)) requires ω(1) memory lookups [PTW'08, PTW'10].
  17. NNS beyond LSH
  18. Finale