Sketching, Sampling and other Sublinear
Algorithms:
Streaming
Alex Andoni
(MSR SVC)
A scenario
IP Frequency
131.107.65.14 3
18.9.22.69 2
80.97.56.20 2
131.107.65.14
131.107.65.14
18.9.22.69
18.9.22.69
80.97...
Sublinear: a panacea?
 Sub-linear space algorithm for solving Travelling
Salesperson Problem?
 Sorry, perhaps a differen...
Streaming data
 Data through a router
 Data stored on a hard drive, or streamed remotely
 More efficient to do a linear...
Application areas
 Data can come from:
 Network logs, sensor data
 Real time data
 Search queries, served ads
 Databa...
Problem 1: # distinct elements

2 5 7 5 5
i Frequency
2 1
5 3
7 1
Distinct Elements: idea 1

Algorithm DISTINCT:
Initialize:
minHash=1
hash function h into [0,1]
Process(int i):
if (h(i) ...
Distinct Elements: idea 2

ZEROS(x)
x=0.0000001100101
Algorithm DISTINCT:
Initialize:
minHash=1
hash function h into [0,1...
Problem 2: max count
 Problem: compute the maximum frequency of an
element in the stream
 Bad news:
 Hard to distinguis...
Heavy Hitters: CountMin
1
1
1
2 5 7 5 5
1 1
2
1 1
1 2
1 2
1 1 1
2 2
1 3
1 2 1
3 2
1 4
1 3 1
Algorithm CountMin:
Initialize...
Heavy Hitters: analysis

5
3 2
1 4
1 3 1
3
Algorithm CountMin:
Initialize(r, L):
array Sketch[L][w]
L hash functions h[L]...
Problem 3: Moments

IP
2 1
5 3
7 2
1
9
4
1
81
16

Scenario 2: distributed traffic

Two sketches should be sufficient to compute
something on the difference or sum
IP Frequ...
Common primitive: estimate sum

a1 a2 a3 a4
a1
a3
Precision Sampling Framework

a1 a2 a3 a4
u1 u2 u3 u4
1 2 3 4
Formalization
Sum Estimator Adversary

Precision Sampling Lemma
 Goal: estimate ∑ai from { i} satisfying |ai- i|<ui.
 Precision Sampling Lemma: can get, with 9...
Precision Sampling Algorithm
 Precision Sampling Lemma: can get, with 90%
success:
 O(1) additive error and 1.5 multipli...

x1 x2 x3 x4 x5 x6
y1+
y3
y4 y2+
y5+
y6
x=
H=
Streaming++
 LOTS of work in the area:
 Surveys
 Muthukrishnan: http://algo.research.googlepages.com/eight.ps
 McGrego...
Sketching, Sampling, and other Sublinear Algorithms 3 (Lecture by Alex Andoni)
Upcoming SlideShare
Loading in …5
×

Sketching, Sampling, and other Sublinear Algorithms 3 (Lecture by Alex Andoni)

930
-1

Published on

Streaming framework: we are required to solve a certain problem on a large collection of items that one streams through once (i.e., algorithm's memory footprint is much smaller than the dataset itself). For example, how can a router with 1Mb memory estimate the number of different IPs it sees in a multi-gigabytes long real-time traffic?

Published in: Technology
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
930
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
10
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide
  • pictures
  • Sketching, Sampling, and other Sublinear Algorithms 3 (Lecture by Alex Andoni)

    1. 1. Sketching, Sampling and other Sublinear Algorithms: Streaming Alex Andoni (MSR SVC)
    2. 2. A scenario IP Frequency 131.107.65.14 3 18.9.22.69 2 80.97.56.20 2 131.107.65.14 131.107.65.14 18.9.22.69 18.9.22.69 80.97.56.20 80.97.56.20 IP Frequency 131.107.65.14 3 18.9.22.69 2 80.97.56.20 2 128.112.128.81 9 127.0.0.1 8 257.2.5.7 0 7.8.20.13 1 Challenge: compute something on the table, using small space. 131.107.65.14Example of “something”: • # distinct IPs • max frequency • other statistics…
    3. 3. Sublinear: a panacea?  Sub-linear space algorithm for solving Travelling Salesperson Problem?  Sorry, perhaps a different lecture  Hard to solve sublinearly even very simple problems:  Ex: what is the count of distinct IPs seen  Will settle for:  Approximate algorithms: 1+ approximation true answer ≤ output ≤ (1+ ) * (true answer)  Randomized: above holds with probability 95%  Quick and dirty way to get a sense of the data IP Frequency 131.107.65.14 3 18.9.22.69 2 80.97.56.20 2 128.112.128.81 9 127.0.0.1 8 257.2.5.7 0 8.3.20.12 1
    4. 4. Streaming data  Data through a router  Data stored on a hard drive, or streamed remotely  More efficient to do a linear scan on a hard drive  Working memory is the (smaller) main memory 2 2
    5. 5. Application areas  Data can come from:  Network logs, sensor data  Real time data  Search queries, served ads  Databases (query planning)  …
    6. 6. Problem 1: # distinct elements  2 5 7 5 5 i Frequency 2 1 5 3 7 1
    7. 7. Distinct Elements: idea 1  Algorithm DISTINCT: Initialize: minHash=1 hash function h into [0,1] Process(int i): if (h(i) < minHash) minHash = h(index); Output: 1/minHash-1 275 [Flajolet-Martin‟85, Alon-Matias-Szegedy‟96]
    8. 8. Distinct Elements: idea 2  ZEROS(x) x=0.0000001100101 Algorithm DISTINCT: Initialize: minHash=1 hash function h into [0,1] Process(int i): if (h(i) < minHash) minHash = h(index); Output: 1/minHash-1 Algorithm DISTINCT: Initialize: minHash2=0 hash function h into [0,1] Process(int i): if (h(i) < 1/2^minHash2) minHash2 = ZEROS(h(index)); Output: 2^minHash2
    9. 9. Problem 2: max count  Problem: compute the maximum frequency of an element in the stream  Bad news:  Hard to distinguish whether an element repeated (max = 1 vs 2)  Good news:  Can find “heavy hitters”  elements with frequency > total frequency / s  using space proportional to s IP Frequency 2 1 5 3 7 1 2 5 7 5 5 heavy hitters
    10. 10. Heavy Hitters: CountMin 1 1 1 2 5 7 5 5 1 1 2 1 1 1 2 1 2 1 1 1 2 2 1 3 1 2 1 3 2 1 4 1 3 1 Algorithm CountMin: Initialize(r, L): array Sketch[L][w] L hash functions h[L], into {0,…w-1} Process(int i): for(j=0; j<L; j++) Sketch[j][ h[j](i) ] += 1; Output: foreach i in PossibleIP { freq[i] = int.MaxValue; for(j=0; j<L; j++) freq[i] = min(freq[i], Sketch[j][h[j](i)]); } // freq[] is the frequency estimate 11 [Charikar-Chen-FarachColton‟04, Cormode-Muthukrishnan‟05]
    11. 11. Heavy Hitters: analysis  5 3 2 1 4 1 3 1 3 Algorithm CountMin: Initialize(r, L): array Sketch[L][w] L hash functions h[L], into {0,…w-1} Process(int i): for(j=0; j<L; j++) Sketch[j][ h[j](i) ] += 1; Output: foreach i in PossibleIP { freq[i] = int.MaxValue; for(j=0; j<L; j++) freq[i] = min(freq[i], Sketch[j][h[j](i)]); } // freq[] is the frequency estimate
    12. 12. Problem 3: Moments  IP 2 1 5 3 7 2 1 9 4 1 81 16
    13. 13.
    14. 14. Scenario 2: distributed traffic  Two sketches should be sufficient to compute something on the difference or sum IP Frequency 131.107.65.1 4 1 18.9.22.69 1 35.8.10.140 1 IP Frequenc y 131.107.65.14 1 18.9.22.69 2 131.107.65.14 18.9.22.69 18.9.22.69 35.8.10.140
    15. 15. Common primitive: estimate sum  a1 a2 a3 a4 a1 a3
    16. 16. Precision Sampling Framework  a1 a2 a3 a4 u1 u2 u3 u4 1 2 3 4
    17. 17. Formalization Sum Estimator Adversary 
    18. 18. Precision Sampling Lemma  Goal: estimate ∑ai from { i} satisfying |ai- i|<ui.  Precision Sampling Lemma: can get, with 90% success:  O(1) additive error and 1.5 multiplicative error: S – O(1) < < 1.5*S + O(1)  with average cost equal to O(log n)  Example: distinguish Σai=3 vs Σai=0  Consider two extreme cases:  if three ai=1: enough to have crude approx for all (ui=0.1) if all ai=3/n: only few with good approx ui=1/n, and the rest with ui=1 ε 1+ε S – ε < (1+ ε)S + ε O(ε-3 log n) [A-Krauthgamer-Onak‟11]
    19. 19. Precision Sampling Algorithm  Precision Sampling Lemma: can get, with 90% success:  O(1) additive error and 1.5 multiplicative error: S – O(1) < < 1.5*S + O(1)  with average cost equal to O(log n)  Algorithm:  Choose each ui [0,1] i.i.d.  Estimator: S = count number of i„s s.t. i / ui > 6 (up to a normalization constant)  Proof of correctness:  we use only i which are 1.5-approximation to ai  E[S] ≈ ∑ Pr[ai / ui > 6] = ∑ ai/6.  E[1/ui] = O(log n) w.h.p. function of [ i /ui - 4/ε]+ and ui‟s concrete distrib. = minimum of O(ε-3) u.r.v. O(ε-3 log n) ε 1+ε S – ε < (1+ ε)S + ε
    20. 20.  x1 x2 x3 x4 x5 x6 y1+ y3 y4 y2+ y5+ y6 x= H=
    21. 21. Streaming++  LOTS of work in the area:  Surveys  Muthukrishnan: http://algo.research.googlepages.com/eight.ps  McGregor: http://people.cs.umass.edu/~mcgregor/papers/08- graphmining.pdf  Chakrabarti: http://www.cs.dartmouth.edu/~ac/Teach/CS49- Fall11/Notes/lecnotes.pdf  Open problems: http://sublinear.info  Examples:  Moments, sampling  Median estimation, longest increasing sequence  Graph algorithms  E.g., dynamic graph connectivity [AGG‟12, KKM‟13,…]  Numerical algorithms (e.g., regression, SVD approximation)  Fastest (sparse) regression […CW‟13,MM‟13,KN‟13,LMP‟13]  related to Compressed Sensing
    1. A particular slide catching your eye?

      Clipping is a handy way to collect important slides you want to go back to later.

    ×