Lecture 4
Barna Saha
AT&T-Labs Research
September 19, 2013
  1. Lecture 4. Barna Saha, AT&T-Labs Research. September 19, 2013.
  2. Outline: Heavy Hitter Continued; Frequency Moment Estimation; Dimensionality Reduction.
  3. Heavy Hitter. Heavy Hitter Problem: for $0 < \epsilon < \phi < 1$, find a set of elements $S$ that includes every $i$ with $f_i > \phi m$ and contains no element with frequency $\le (\phi - \epsilon)m$. Count-Min sketch guarantees $f_i \le \hat{f}_i \le f_i + \epsilon m$ with probability $\ge 1 - \delta$, in space $\frac{e}{\epsilon} \log \frac{1}{(\phi - \epsilon)\delta}$. Insert only: maintain a min-heap of size $k = \frac{1}{\phi - \epsilon}$; when an item arrives, estimate its frequency and, if the estimate is above $\phi m$, include the item in the heap. If the heap size exceeds $k$, discard the minimum-frequency element of the heap.
  4. Heavy Hitter. Turnstile model: maintain dyadic intervals over a binary search tree and maintain $\log n$ Count-Min sketches, one for each level, each using space $\frac{e}{\epsilon} \log \frac{2 \log n}{\delta(\phi - \epsilon)}$. At every level there are at most $\frac{1}{\phi}$ heavy hitters. Estimate the frequency of the children of the heavy hitter nodes until the leaf level is reached. Return all leaves with estimated frequency above $\phi m$. Analysis: at most $\frac{2}{\phi - \epsilon}$ nodes are examined at every level. Each examined node that passes the threshold has true frequency $> (\phi - \epsilon)m$ with probability at least $1 - \frac{\delta(\phi - \epsilon)}{2 \log n}$. By the union bound, all such true frequencies are above $(\phi - \epsilon)m$ with probability at least $1 - \delta$.
  5. $\ell_2$ frequency estimation. $|f_i - \hat{f}_i| \le \epsilon \sqrt{f_1^2 + f_2^2 + \dots + f_n^2}$ [Count-sketch]. $F_2 = f_1^2 + f_2^2 + \dots + f_n^2$. How do we estimate $F_2$ in small space?
  6. AMS-F2 Estimation. $H = \{h : [n] \to \{+1, -1\}\}$ is a family of four-wise independent hash functions. Maintain $Z_j = Z_j + a\,h_j(i)$ on arrival of $(i, a)$, for $j = 1, \dots, t$ with $t = \frac{c}{\epsilon^2}$. Return $Y = \frac{1}{t} \sum_{j=1}^{t} Z_j^2$.
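  7. Analysis. $Z_j = \sum_{i=1}^{n} f_i h_j(i)$. $E[Z_j] = 0$ and $E[Z_j^2] = F_2$. $\mathrm{Var}[Z_j^2] = E[Z_j^4] - (E[Z_j^2])^2 \le 4F_2^2$. $E[Y] = F_2$ and $\mathrm{Var}[Y] = \frac{1}{t^2} \sum_{j=1}^{t} \mathrm{Var}(Z_j^2) \le \frac{4\epsilon^2}{c} F_2^2$. By Chebyshev's inequality, $\Pr[|Y - E[Y]| > \epsilon F_2] \le \frac{4}{c}$.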
  8. Boosting by Median. Keep $Y_1, Y_2, \dots, Y_s$ with $s = O(\log \frac{1}{\delta})$. Return $A = \mathrm{median}(Y_1, Y_2, \dots, Y_s)$. By the Chernoff bound, $\Pr[|A - F_2| > \epsilon F_2] < \delta$.
  9. Linear Sketch. The algorithm maintains a linear sketch $[Z_1, Z_2, \dots, Z_t]^T = Rx$, where $R$ is a $t \times n$ random matrix with entries in $\{+1, -1\}$. Use $Y = \|Rx\|_2^2$ to estimate $t\|x\|_2^2$, with $t = O(\frac{1}{\epsilon^2})$. A streaming algorithm operating in the sketch model can be viewed as a dimensionality reduction technique.
  10. Dimensionality Reduction. A streaming algorithm operating in the sketch model can be viewed as a dimensionality reduction technique. The stream $S$ is a point in $n$-dimensional space, and we want to compute $\ell_2(S)$. The sketch operator can be viewed as an approximate embedding of $\ell_2^n$ into a sketch space $C$ such that (1) each point in $C$ can be described using only a small number (say $m$) of numbers, so $C \subset \mathbb{R}^m$, and (2) the value of $\ell_2(S)$ is approximately equal to $F(C(S))$. Here $F(Y_1, Y_2, \dots, Y_t) = \mathrm{median}(Y_1, Y_2, \dots, Y_t)$.
  11. Dimensionality Reduction. $F(Y_1, Y_2, \dots, Y_t) = \mathrm{median}(Y_1, Y_2, \dots, Y_t)$. Disadvantage: $F$ is not a norm, so performing any nontrivial operation in the sketch space (e.g. clustering, similarity search, regression) becomes difficult. Can we embed from $\ell_2^n$ to $\ell_2^m$, $m \ll n$, approximately preserving distances? Johnson-Lindenstrauss Lemma.
  12. Interlude to Normal Distribution. Normal distribution $N(0, 1)$: range $(-\infty, \infty)$; density $f(x) = e^{-x^2/2} / \sqrt{2\pi}$; mean $= 0$, variance $= 1$. Basic facts: if $X$ and $Y$ are independent random variables with normal distributions, then so is $X + Y$. If $X$ and $Y$ are independent with mean 0, then $E[(X + Y)^2] = E[X^2] + E[Y^2]$. $E[cX] = c\,E[X]$ and $\mathrm{Var}[cX] = c^2\,\mathrm{Var}[X]$.
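  13. A Different Linear Sketch. Instead of $\pm 1$, let the $r_i$ be i.i.d. random variables drawn from $N(0, 1)$. Consider $Z = \sum_i r_i x_i$. Since the cross terms vanish by independence and zero mean, $E[Z^2] = E[(\sum_i r_i x_i)^2] = \sum_i E[r_i^2] x_i^2 = \sum_i \mathrm{Var}[r_i]\, x_i^2 = \sum_i x_i^2 = \|x\|_2^2$. As before, we maintain $Z = [Z_1, Z_2, \dots, Z_t]$ and define $Y = \|Z\|_2^2$, so $E[Y] = t\|x\|_2^2$. We show that there exists a constant $C > 0$ such that, for small enough $\epsilon > 0$, $\Pr[|Y - t\|x\|_2^2| > \epsilon t\|x\|_2^2] \le e^{-C\epsilon^2 t}$ (JL lemma). Set $t = O(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$.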
  14. Johnson Lindenstrauss Lemma. Lemma: for any $0 < \epsilon < 1$ and any integer $m$, let $t$ be a positive integer such that $t > \frac{4 \ln m}{\epsilon^2/2 - \epsilon^3/3}$. Then for any set $V$ of $m$ points in $\mathbb{R}^n$, there is a map $f : \mathbb{R}^n \to \mathbb{R}^t$ such that for all $u, v \in V$, $(1 - \epsilon)\|u - v\|_2^2 \le \|f(u) - f(v)\|_2^2 \le (1 + \epsilon)\|u - v\|_2^2$. Furthermore, this map can be found in randomized polynomial time.
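
The insert-only heavy hitter procedure on the Heavy Hitter slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the lecture's own code: items are taken to be non-negative integers, the class name `CountMin`, the function `heavy_hitters`, the hash family `((a*x + b) mod P) mod width`, and the default `delta` are all choices made here, and a plain dictionary pruned at its minimum stands in for the min-heap.

```python
import math
import random

P = (1 << 61) - 1  # large prime for the (assumed) hash family


class CountMin:
    """Count-Min sketch with width ceil(e/eps) and depth ceil(ln(1/delta))."""

    def __init__(self, eps, delta, seed=0):
        rng = random.Random(seed)
        self.width = math.ceil(math.e / eps)
        self.depth = max(1, math.ceil(math.log(1 / delta)))
        # One (a, b) pair per row; h(x) = ((a*x + b) mod P) mod width.
        self.hashes = [(rng.randrange(1, P), rng.randrange(P))
                       for _ in range(self.depth)]
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _cells(self, item):
        for row, (a, b) in enumerate(self.hashes):
            yield row, ((a * item + b) % P) % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # With non-negative updates this never underestimates the frequency.
        return min(self.table[row][col] for row, col in self._cells(item))


def heavy_hitters(stream, phi, eps, delta=0.01, seed=0):
    """Return items whose estimated frequency exceeds phi * m."""
    sketch = CountMin(eps, delta, seed)
    k = math.ceil(1 / (phi - eps))   # candidate budget, as on the slide
    candidates = {}                  # item -> last estimate (stands in for the heap)
    m = 0
    for item in stream:
        m += 1
        sketch.add(item)
        est = sketch.estimate(item)
        if est > phi * m:
            candidates[item] = est
            if len(candidates) > k:
                # Discard the candidate with the smallest stored estimate.
                del candidates[min(candidates, key=candidates.get)]
    # Re-check surviving candidates against the final stream length.
    return sorted(i for i in candidates if sketch.estimate(i) > phi * m)


if __name__ == "__main__":
    demo = []
    for i in range(500):
        demo += [7, 100 + i]         # item 7 makes up half the stream
    print(heavy_hitters(demo, phi=0.2, eps=0.1))
```

The final re-check against `phi * m` is an addition made here so that candidates admitted early, when `m` was small, are not reported at the end.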
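
The AMS-F2 estimator and the median boosting step (slides 6 to 8) might look like the sketch below. It is illustrative only: the class name `AMSF2` and its parameters are hypothetical, and the four-wise independent sign hash is approximated here by the low bit of a random degree-3 polynomial over a prime field, which is an assumption of this example rather than something the slides specify.

```python
import random
from statistics import median

P = (1 << 61) - 1  # prime modulus for the polynomial hash (assumed choice)


class AMSF2:
    """s independent groups of t counters; estimate = median of group means."""

    def __init__(self, t, s, seed=0):
        rng = random.Random(seed)
        self.t, self.s = t, s
        # Four random coefficients per counter define a degree-3 polynomial.
        self.coef = [[rng.randrange(P) for _ in range(4)]
                     for _ in range(s * t)]
        self.z = [0] * (s * t)

    def _sign(self, j, i):
        a, b, c, d = self.coef[j]
        v = (((a * i + b) * i + c) * i + d) % P
        return 1 if v & 1 else -1    # low bit mapped to +1 / -1

    def update(self, i, a=1):
        """Process stream element (i, a): Z_j += a * h_j(i) for every j."""
        for j in range(len(self.z)):
            self.z[j] += a * self._sign(j, i)

    def estimate(self):
        # Y_g = (1/t) * sum of Z_j^2 over group g; return median of the Y_g.
        ys = []
        for g in range(self.s):
            group = self.z[g * self.t:(g + 1) * self.t]
            ys.append(sum(z * z for z in group) / self.t)
        return median(ys)


if __name__ == "__main__":
    sk = AMSF2(t=64, s=5, seed=2)
    freqs = {i: (i % 7) + 1 for i in range(100)}
    for i, f in freqs.items():
        sk.update(i, f)
    print("true F2:", sum(f * f for f in freqs.values()))
    print("estimate:", sk.estimate())
```

Because every counter is a linear function of the frequency vector, negative values of `a` (deletions) are handled by the same `update` call.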
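
The Gaussian linear sketch of slide 13 and the dimension bound of the Johnson-Lindenstrauss lemma can be tried out with a small experiment along these lines. The function names `jl_dimension`, `gaussian_map`, and `sq_dist` are hypothetical, and the map is scaled by $1/\sqrt{t}$ so that squared lengths are preserved in expectation (the slides instead compare $\|Rx\|_2^2$ with $t\|x\|_2^2$, which is the same statement up to that scaling).

```python
import math
import random


def jl_dimension(m, eps):
    """Smallest integer t with t > 4 ln(m) / (eps^2/2 - eps^3/3)."""
    bound = 4 * math.log(m) / (eps ** 2 / 2 - eps ** 3 / 3)
    return math.floor(bound) + 1


def gaussian_map(n, t, seed=0):
    """Return f: R^n -> R^t, f(x) = R x / sqrt(t), R with i.i.d. N(0,1) entries."""
    rng = random.Random(seed)
    R = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(t)]
    scale = 1 / math.sqrt(t)

    def f(x):
        return [scale * sum(r * xi for r, xi in zip(row, x)) for row in R]

    return f


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


if __name__ == "__main__":
    n, t = 300, 200
    rng = random.Random(1)
    points = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(4)]
    f = gaussian_map(n, t, seed=2)
    images = [f(p) for p in points]
    for a in range(len(points)):
        for b in range(a + 1, len(points)):
            ratio = sq_dist(images[a], images[b]) / sq_dist(points[a], points[b])
            print(a, b, round(ratio, 3))  # ratios should cluster around 1
```

The printed ratios are random; the lemma says that once `t` exceeds `jl_dimension(m, eps)`, a random map of this kind keeps all of them within $[1 - \epsilon, 1 + \epsilon]$ with good probability.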