Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Successfully reported this slideshow.

Like this presentation? Why not share!

- Sketching, Sampling, and other Subl... by Anton Konushin 851 views
- Lec 2-2 by Atner Yegorov 330 views
- Parallel Algorithms for Geometric G... by Grigory Yaroslavtsev 841 views
- ECCV2010: feature learning for imag... by zukun 1899 views
- Sketching, Sampling, and other Subl... by Anton Konushin 1358 views
- A Friendly Guide To Sparse Coding by Shao-Chuan Wang 11797 views

1,264 views

Published on

Published in:
Technology

No Downloads

Total views

1,264

On SlideShare

0

From Embeds

0

Number of Embeds

401

Shares

0

Downloads

26

Comments

0

Likes

2

No embeds

No notes for slide

- 1. Sketching, Sampling and other Sublinear Algorithms: Streaming Alex Andoni (MSR SVC)
- 2. A scenario IP Frequency 131.107.65.14 3 18.9.22.69 2 80.97.56.20 2 131.107.65.14 131.107.65.14 18.9.22.69 18.9.22.69 80.97.56.20 80.97.56.20 IP Frequency 131.107.65.14 3 18.9.22.69 2 80.97.56.20 2 128.112.128.81 9 127.0.0.1 8 257.2.5.7 0 7.8.20.13 1 Challenge: compute something on the table, using small space. 131.107.65.14Example of “something”: • # distinct IPs • max frequency • other statistics…
- 3. Sublinear: a panacea? Sub-linear space algorithm for solving Travelling Salesperson Problem? Sorry, perhaps a different lecture Hard to solve sublinearly even very simple problems: Ex: what is the count of distinct IPs seen Will settle for: Approximate algorithms: 1+ approximation true answer ≤ output ≤ (1+ ) * (true answer) Randomized: above holds with probability 95% Quick and dirty way to get a sense of the data IP Frequency 131.107.65.14 3 18.9.22.69 2 80.97.56.20 2 128.112.128.81 9 127.0.0.1 8 257.2.5.7 0 8.3.20.12 1
- 4. Streaming data Data through a router Data stored on a hard drive, or streamed remotely More efficient to do a linear scan on a hard drive Working memory is the (smaller) main memory 2 2
- 5. Application areas Data can come from: Network logs, sensor data Real time data Search queries, served ads Databases (query planning) …
- 6. Problem 1: # distinct elements 2 5 7 5 5 i Frequency 2 1 5 3 7 1
- 7. Distinct Elements: idea 1 Algorithm DISTINCT: Initialize: minHash=1 hash function h into [0,1] Process(int i): if (h(i) < minHash) minHash = h(index); Output: 1/minHash-1 275 [Flajolet-Martin‟85, Alon-Matias-Szegedy‟96]
- 8. Distinct Elements: idea 2 ZEROS(x) x=0.0000001100101 Algorithm DISTINCT: Initialize: minHash=1 hash function h into [0,1] Process(int i): if (h(i) < minHash) minHash = h(index); Output: 1/minHash-1 Algorithm DISTINCT: Initialize: minHash2=0 hash function h into [0,1] Process(int i): if (h(i) < 1/2^minHash2) minHash2 = ZEROS(h(index)); Output: 2^minHash2
- 9. Problem 2: max count Problem: compute the maximum frequency of an element in the stream Bad news: Hard to distinguish whether an element repeated (max = 1 vs 2) Good news: Can find “heavy hitters” elements with frequency > total frequency / s using space proportional to s IP Frequency 2 1 5 3 7 1 2 5 7 5 5 heavy hitters
- 10. Heavy Hitters: CountMin 1 1 1 2 5 7 5 5 1 1 2 1 1 1 2 1 2 1 1 1 2 2 1 3 1 2 1 3 2 1 4 1 3 1 Algorithm CountMin: Initialize(r, L): array Sketch[L][w] L hash functions h[L], into {0,…w-1} Process(int i): for(j=0; j<L; j++) Sketch[j][ h[j](i) ] += 1; Output: foreach i in PossibleIP { freq[i] = int.MaxValue; for(j=0; j<L; j++) freq[i] = min(freq[i], Sketch[j][h[j](i)]); } // freq[] is the frequency estimate 11 [Charikar-Chen-FarachColton‟04, Cormode-Muthukrishnan‟05]
- 11. Heavy Hitters: analysis 5 3 2 1 4 1 3 1 3 Algorithm CountMin: Initialize(r, L): array Sketch[L][w] L hash functions h[L], into {0,…w-1} Process(int i): for(j=0; j<L; j++) Sketch[j][ h[j](i) ] += 1; Output: foreach i in PossibleIP { freq[i] = int.MaxValue; for(j=0; j<L; j++) freq[i] = min(freq[i], Sketch[j][h[j](i)]); } // freq[] is the frequency estimate
- 12. Problem 3: Moments IP 2 1 5 3 7 2 1 9 4 1 81 16
- 13.
- 14. Scenario 2: distributed traffic Two sketches should be sufficient to compute something on the difference or sum IP Frequency 131.107.65.1 4 1 18.9.22.69 1 35.8.10.140 1 IP Frequenc y 131.107.65.14 1 18.9.22.69 2 131.107.65.14 18.9.22.69 18.9.22.69 35.8.10.140
- 15. Common primitive: estimate sum a1 a2 a3 a4 a1 a3
- 16. Precision Sampling Framework a1 a2 a3 a4 u1 u2 u3 u4 1 2 3 4
- 17. Formalization Sum Estimator Adversary
- 18. Precision Sampling Lemma Goal: estimate ∑ai from { i} satisfying |ai- i|<ui. Precision Sampling Lemma: can get, with 90% success: O(1) additive error and 1.5 multiplicative error: S – O(1) < < 1.5*S + O(1) with average cost equal to O(log n) Example: distinguish Σai=3 vs Σai=0 Consider two extreme cases: if three ai=1: enough to have crude approx for all (ui=0.1) if all ai=3/n: only few with good approx ui=1/n, and the rest with ui=1 ε 1+ε S – ε < (1+ ε)S + ε O(ε-3 log n) [A-Krauthgamer-Onak‟11]
- 19. Precision Sampling Algorithm Precision Sampling Lemma: can get, with 90% success: O(1) additive error and 1.5 multiplicative error: S – O(1) < < 1.5*S + O(1) with average cost equal to O(log n) Algorithm: Choose each ui [0,1] i.i.d. Estimator: S = count number of i„s s.t. i / ui > 6 (up to a normalization constant) Proof of correctness: we use only i which are 1.5-approximation to ai E[S] ≈ ∑ Pr[ai / ui > 6] = ∑ ai/6. E[1/ui] = O(log n) w.h.p. function of [ i /ui - 4/ε]+ and ui‟s concrete distrib. = minimum of O(ε-3) u.r.v. O(ε-3 log n) ε 1+ε S – ε < (1+ ε)S + ε
- 20. x1 x2 x3 x4 x5 x6 y1+ y3 y4 y2+ y5+ y6 x= H=
- 21. Streaming++ LOTS of work in the area: Surveys Muthukrishnan: http://algo.research.googlepages.com/eight.ps McGregor: http://people.cs.umass.edu/~mcgregor/papers/08- graphmining.pdf Chakrabarti: http://www.cs.dartmouth.edu/~ac/Teach/CS49- Fall11/Notes/lecnotes.pdf Open problems: http://sublinear.info Examples: Moments, sampling Median estimation, longest increasing sequence Graph algorithms E.g., dynamic graph connectivity [AGG‟12, KKM‟13,…] Numerical algorithms (e.g., regression, SVD approximation) Fastest (sparse) regression […CW‟13,MM‟13,KN‟13,LMP‟13] related to Compressed Sensing

No public clipboards found for this slide

Be the first to comment