Sketching, Sampling, and other Sublinear Algorithms 2 (Lecture by Alex Andoni)

We will learn about modern algorithmic techniques for handling large datasets, often by using imprecise but concise representations of the data, such as a sketch or a sample. The lectures will cluster around three themes:

Nearest Neighbor Search (similarity search): the general problem is, given a set of objects (e.g., images), to construct a data structure so that later, given a query object, one can efficiently find the most similar object in the database.
Streaming framework: we are required to solve a certain problem on a large collection of items that one streams through once (i.e., the algorithm's memory footprint is much smaller than the dataset itself). For example, how can a router with 1Mb of memory estimate the number of distinct IPs it sees in a multi-gigabyte stream of real-time traffic?
Parallel framework: we look at problems where neither the data nor the output fits on a single machine. For example, given a set of 2D points, how can we compute the minimum spanning tree over a cluster of machines?

The focus will be on techniques such as sketching, dimensionality reduction, sampling, hashing, and others.
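
The router question in the streaming theme can be made concrete with a small distinct-count sketch. The following is a minimal illustration, not necessarily the estimator covered in the lectures: it keeps only the k smallest hash values seen so far (a "k minimum values" sketch). The names `hash01` and `estimate_distinct`, the use of SHA-1, and the default k = 256 are assumptions made for this example.

```python
import hashlib
import heapq


def hash01(item: str) -> float:
    """Map an item to a pseudo-random point in [0, 1) via SHA-1 (an illustrative choice)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_distinct(stream, k: int = 256) -> float:
    """Estimate the number of distinct items while storing only k hash values."""
    heap = []        # max-heap (values negated) of the k smallest hashes seen so far
    members = set()  # the same hashes, for constant-time duplicate checks
    for item in stream:
        h = hash01(item)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash beats the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct hashes: the count is exact
    # With n distinct items, the k-th smallest hash sits near k / n.
    return (k - 1) / (-heap[0])


# Hypothetical traffic: 5,000 distinct addresses, each seen four times.
traffic = [f"10.0.{i // 256}.{i % 256}" for i in range(5000)] * 4
print(round(estimate_distinct(traffic)))  # close to 5000, up to sampling error
```

Memory use depends only on k, not on the length of the stream; a larger k gives a tighter estimate (relative error roughly 1/sqrt(k)).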

  1. Sketching, Sampling and other Sublinear Algorithms. Euclidean space: dimension reduction and NNS. Alex Andoni (MSR SVC)
  2. A Sketching Problem. Documents: "To be or not to be", "To sketch or not to sketch"; similar? Sketches: 010110, 010101; similar?
  3. Sketch from LSH. [Broder’97]: for Jaccard coefficient
  4. General Theory: embeddings
      • Metrics: Euclidean distance ($\ell_2$), Hamming distance, edit distance between two strings, Earth-Mover (transportation) distance
      • Problems: compute distance between two points; diameter/close-pair of set S; clustering, MST, etc.; Nearest Neighbor Search
      • Embedding f: reduce problem <P under hard metric> to <P under simpler metric>
  5. Embeddings: landscape
  6. Dimension Reduction
  7. Main intuition
  8. 1D embedding
  9. 1D embedding
  10. Full Dimension Reduction
  11. Concentration
  12. Dimension Reduction: wrap-up
  13. NNS for Euclidean space [Datar-Immorlica-Indyk-Mirrokni’04]
  14. Near-Optimal LSH [A-Indyk’06]
      • Regular grid → grid of balls
      • p can hit empty space, so take more such grids until p is in a ball
      • Need (too) many grids of balls
      • Start by projecting in dimension t
      • Analysis gives
      • Choice of reduced dimension t? Tradeoff between # hash tables, n , and time to hash, $t^{O(t)}$
      • Total query time: $d\,n^{1/c^2+o(1)}$
      • [Figure: 2D illustration with point p; $\mathbb{R}^t$]
  15. Open question: $\Pr[\text{needle of length } 1 \text{ is not cut}] \ge \left(\Pr[\text{needle of length } c \text{ is not cut}]\right)^{1/c^2}$
  16. Time-Space Trade-offs
      • Table columns: Space, Time, Comment, Reference
      • Regimes (space / query time): low / high, medium / medium, high / low
      • References: [AI’06], [KOR’98, IM’98, Pan’06], [Ind’01, Pan’06], [DIIM’04, AI’06], [IM’98]
      • One hash table lookup!
      • $n^{o(1/\epsilon^2)}$, $\omega(1)$ memory lookups [AIP’06]
      • $n^{1+o(1/c^2)}$, $\omega(1)$ memory lookups [PTW’08, PTW’10]
  17. NNS beyond LSH
  18. Finale
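
Slide 3 attributes a sketch for the Jaccard coefficient to [Broder’97]. The following is a minimal MinHash sketch in that spirit, applied to the two sentences from slide 2. It is an illustration under assumptions, not the slide's own construction: seeded SHA-1 hashes stand in for random permutations, and the helper names and the choice of 128 hash functions are made up for this example.

```python
import hashlib


def _hash(seed: int, token: str) -> int:
    """Seeded 64-bit hash of a token (SHA-1 stands in for a random permutation)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_sketch(tokens, num_hashes: int = 128):
    """Sketch a non-empty set as its minimum hash under each of num_hashes hash functions."""
    items = set(tokens)
    return [min(_hash(seed, t) for t in items) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b) -> float:
    """Fraction of coordinates where the sketches agree; estimates |A ∩ B| / |A ∪ B|."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


a = "to be or not to be".split()
b = "to sketch or not to sketch".split()
# The word sets share 3 of 5 distinct words, so the true Jaccard coefficient is 0.6.
print(estimate_jaccard(minhash_sketch(a), minhash_sketch(b)))
```

Each coordinate agrees with probability equal to the Jaccard coefficient, so the estimate tightens as `num_hashes` grows, while the sketch size stays independent of the document length.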

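
Slides 6 to 12 cover dimension reduction, but their bodies did not survive in the transcript. As a rough illustration, the sketch below assumes the standard Gaussian random projection: multiply by a k × d matrix of independent N(0, 1) entries and scale by 1/sqrt(k). The function names and the sizes d = 1000 and k = 200 are assumptions for this example, and the lecture's construction may differ in detail.

```python
import math
import random


def make_projection(d: int, k: int, seed: int = 0):
    """Draw a k x d matrix of independent N(0, 1) entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]


def project(matrix, x):
    """Map x to k dimensions; the 1/sqrt(k) scaling preserves squared length in expectation."""
    k = len(matrix)
    return [sum(g * xi for g, xi in zip(row, x)) / math.sqrt(k) for row in matrix]


def distance(x, y) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


d, k = 1000, 200
rng = random.Random(1)
x = [rng.random() for _ in range(d)]
y = [rng.random() for _ in range(d)]
G = make_projection(d, k)

# Ratio of projected distance to original distance; concentrates around 1 as k grows.
ratio = distance(project(G, x), project(G, y)) / distance(x, y)
print(ratio)
```

Because the map is linear, the same matrix `G` must be used for every point; the distortion depends on k but not on the original dimension d.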