Upcoming SlideShare
Loading in...5
×

Like this presentation? Why not share!

Sketching, Sampling, and other Sublinear Algorithms 3 (Lecture by Alex Andoni)

on Aug 07, 2013

• 920 views

Streaming framework: we are required to solve a certain problem on a large collection of items that one streams through once (i.e., algorithm's memory footprint is much smaller than the dataset ...

Streaming framework: we are required to solve a certain problem on a large collection of items that one streams through once (i.e., algorithm's memory footprint is much smaller than the dataset itself). For example, how can a router with 1Mb memory estimate the number of different IPs it sees in a multi-gigabytes long real-time traffic?

Views

Total Views
920
Views on SlideShare
547
Embed Views
373

Likes
2
Downloads
4
Comments
0

2 Embeds373

 http://almada2013.ru 372 http://cloud2014.cs.msu.ru 1

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

• Comment goes here.
Are you sure you want to
Your message goes here
Edit your comment
• pictures

Sketching, Sampling, and other Sublinear Algorithms 3 (Lecture by Alex Andoni)Presentation Transcript

• Sketching, Sampling and other Sublinear Algorithms: Streaming Alex Andoni (MSR SVC)
• A scenario IP Frequency 131.107.65.14 3 18.9.22.69 2 80.97.56.20 2 131.107.65.14 131.107.65.14 18.9.22.69 18.9.22.69 80.97.56.20 80.97.56.20 IP Frequency 131.107.65.14 3 18.9.22.69 2 80.97.56.20 2 128.112.128.81 9 127.0.0.1 8 257.2.5.7 0 7.8.20.13 1 Challenge: compute something on the table, using small space. 131.107.65.14Example of “something”: • # distinct IPs • max frequency • other statistics…
• Sublinear: a panacea?  Sub-linear space algorithm for solving Travelling Salesperson Problem?  Sorry, perhaps a different lecture  Hard to solve sublinearly even very simple problems:  Ex: what is the count of distinct IPs seen  Will settle for:  Approximate algorithms: 1+ approximation true answer ≤ output ≤ (1+ ) * (true answer)  Randomized: above holds with probability 95%  Quick and dirty way to get a sense of the data IP Frequency 131.107.65.14 3 18.9.22.69 2 80.97.56.20 2 128.112.128.81 9 127.0.0.1 8 257.2.5.7 0 8.3.20.12 1
• Streaming data  Data through a router  Data stored on a hard drive, or streamed remotely  More efficient to do a linear scan on a hard drive  Working memory is the (smaller) main memory 2 2
• Application areas  Data can come from:  Network logs, sensor data  Real time data  Search queries, served ads  Databases (query planning)  …
• Problem 1: # distinct elements  2 5 7 5 5 i Frequency 2 1 5 3 7 1
• Distinct Elements: idea 1  Algorithm DISTINCT: Initialize: minHash=1 hash function h into [0,1] Process(int i): if (h(i) < minHash) minHash = h(index); Output: 1/minHash-1 275 [Flajolet-Martin‟85, Alon-Matias-Szegedy‟96]
• Distinct Elements: idea 2  ZEROS(x) x=0.0000001100101 Algorithm DISTINCT: Initialize: minHash=1 hash function h into [0,1] Process(int i): if (h(i) < minHash) minHash = h(index); Output: 1/minHash-1 Algorithm DISTINCT: Initialize: minHash2=0 hash function h into [0,1] Process(int i): if (h(i) < 1/2^minHash2) minHash2 = ZEROS(h(index)); Output: 2^minHash2
• Problem 2: max count  Problem: compute the maximum frequency of an element in the stream  Bad news:  Hard to distinguish whether an element repeated (max = 1 vs 2)  Good news:  Can find “heavy hitters”  elements with frequency > total frequency / s  using space proportional to s IP Frequency 2 1 5 3 7 1 2 5 7 5 5 heavy hitters
• Heavy Hitters: CountMin 1 1 1 2 5 7 5 5 1 1 2 1 1 1 2 1 2 1 1 1 2 2 1 3 1 2 1 3 2 1 4 1 3 1 Algorithm CountMin: Initialize(r, L): array Sketch[L][w] L hash functions h[L], into {0,…w-1} Process(int i): for(j=0; j<L; j++) Sketch[j][ h[j](i) ] += 1; Output: foreach i in PossibleIP { freq[i] = int.MaxValue; for(j=0; j<L; j++) freq[i] = min(freq[i], Sketch[j][h[j](i)]); } // freq[] is the frequency estimate 11 [Charikar-Chen-FarachColton‟04, Cormode-Muthukrishnan‟05]
• Heavy Hitters: analysis  5 3 2 1 4 1 3 1 3 Algorithm CountMin: Initialize(r, L): array Sketch[L][w] L hash functions h[L], into {0,…w-1} Process(int i): for(j=0; j<L; j++) Sketch[j][ h[j](i) ] += 1; Output: foreach i in PossibleIP { freq[i] = int.MaxValue; for(j=0; j<L; j++) freq[i] = min(freq[i], Sketch[j][h[j](i)]); } // freq[] is the frequency estimate
• Problem 3: Moments  IP 2 1 5 3 7 2 1 9 4 1 81 16
• Scenario 2: distributed traffic  Two sketches should be sufficient to compute something on the difference or sum IP Frequency 131.107.65.1 4 1 18.9.22.69 1 35.8.10.140 1 IP Frequenc y 131.107.65.14 1 18.9.22.69 2 131.107.65.14 18.9.22.69 18.9.22.69 35.8.10.140
• Common primitive: estimate sum  a1 a2 a3 a4 a1 a3
• Precision Sampling Framework  a1 a2 a3 a4 u1 u2 u3 u4 1 2 3 4
• Formalization Sum Estimator Adversary 
• Precision Sampling Lemma  Goal: estimate ∑ai from { i} satisfying |ai- i|<ui.  Precision Sampling Lemma: can get, with 90% success:  O(1) additive error and 1.5 multiplicative error: S – O(1) < < 1.5*S + O(1)  with average cost equal to O(log n)  Example: distinguish Σai=3 vs Σai=0  Consider two extreme cases:  if three ai=1: enough to have crude approx for all (ui=0.1) if all ai=3/n: only few with good approx ui=1/n, and the rest with ui=1 ε 1+ε S – ε < (1+ ε)S + ε O(ε-3 log n) [A-Krauthgamer-Onak‟11]
• Precision Sampling Algorithm  Precision Sampling Lemma: can get, with 90% success:  O(1) additive error and 1.5 multiplicative error: S – O(1) < < 1.5*S + O(1)  with average cost equal to O(log n)  Algorithm:  Choose each ui [0,1] i.i.d.  Estimator: S = count number of i„s s.t. i / ui > 6 (up to a normalization constant)  Proof of correctness:  we use only i which are 1.5-approximation to ai  E[S] ≈ ∑ Pr[ai / ui > 6] = ∑ ai/6.  E[1/ui] = O(log n) w.h.p. function of [ i /ui - 4/ε]+ and ui‟s concrete distrib. = minimum of O(ε-3) u.r.v. O(ε-3 log n) ε 1+ε S – ε < (1+ ε)S + ε
•  x1 x2 x3 x4 x5 x6 y1+ y3 y4 y2+ y5+ y6 x= H=
• Streaming++  LOTS of work in the area:  Surveys  Muthukrishnan: http://algo.research.googlepages.com/eight.ps  McGregor: http://people.cs.umass.edu/~mcgregor/papers/08- graphmining.pdf  Chakrabarti: http://www.cs.dartmouth.edu/~ac/Teach/CS49- Fall11/Notes/lecnotes.pdf  Open problems: http://sublinear.info  Examples:  Moments, sampling  Median estimation, longest increasing sequence  Graph algorithms  E.g., dynamic graph connectivity [AGG‟12, KKM‟13,…]  Numerical algorithms (e.g., regression, SVD approximation)  Fastest (sparse) regression […CW‟13,MM‟13,KN‟13,LMP‟13]  related to Compressed Sensing