Upcoming SlideShare
×

# Lec 4-slides

156

Published on

0 Likes
Statistics
Notes
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

• Be the first to like this

Views
Total Views
156
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
2
0
Likes
0
Embeds 0
No embeds

No notes for slide

### Lec 4-slides

1. 1. Lecture 4 Barna Saha AT&T-Labs Research September 19, 2013
2. 2. Outline Heavy Hitter Continued Frequency Moment Estimation Dimensionality Reduction
3. 3. Heavy Hitter Heavy Hitter Problem: For 0 < < φ < 1 ﬁnd a set of elements S including all i such that fi > φm and there is no element in S with frequency ≤ (φ − )m. Count-Min sketch guarantees: fi ≤ ˆfi ≤ fi + m with probability ≥ 1 − δ in space e log 1 (φ− )δ . Insert only: Maintain a min-heap of size k = 1 φ− , when an item arrives estimate frequency and if above φm include it in the heap. If heap size more than k, discard the minimum frequency element in the heap.
4. 4. Heavy Hitter Turnstile model: Maintain dyadic intervals over binary search tree and maintain log n count-min sketch with using space e log 2 log n δ(φ− ) one for each level. At every level at most 1 φ heavy hitters. Estimate frequency of children of the heavy hitter nodes until leaf-level is reached. Return all the leaves with estimated frequency above φm. Analysis At most 2 φ− nodes at every level is examined. Each true frequency > (φ − )m with probability at least 1 − δ(φ− ) 2 log n . By union bound all true frequencies are above (φ − )m with probability at least 1 − δ.
5. 5. l2 frequency estimation |fi − ˆfi | ≤ ± f 2 1 + f 2 2 + ....f 2 n [Count-sketch] F2 = f 2 1 + f 2 2 + ....f 2 n How do we estimate F2 in small space ?
6. 6. AMS-F2 Estimation H = {h : [n] → {+1, −1}} four-wise independent hash functions Maintain Zj = Zj + ahj (i) on arrival of (i, a) for j = 1, ..., t = c 2 Return Y = 1 t t j=1 Z2 j
7. 7. Analysis Zj = n i=1 fi hj (i) E Zj = 0, E Z2 j = F2. Var Z2 j = E Z4 j − (E Zj )2 ≤ 4F2 2 . E Y = F2. Var Y = 1 t2 t j=1 Var(Z2 j ) = 4 2 c F2 2 By Chebyshev Inequality Pr |Y − E Y | > F2 ≤ 4 c
8. 8. Boosting by Median Keep Y1, Y2, ...Ys, s = O(log 1δ) Return A = median(Y1, Y2, .., Ys) By Chernoﬀ bound Pr |A − F2| > F2 < δ
9. 9. Linear Sketch Algorithm maintains a linear sketch [Z1, Z2, ...., Zt]x = Rx where R is a t × n random matrix with entries {+1, −1}. Use Y = ||Rx||2 2 to estimate t||x|2 2. t = O( 1 2 ). Streaming algorithm operating in the sketch model can be viewed as dimensionality reduction technique.
10. 10. Dimensionality Reduction Streaming algorithm operating in the sketch model can be viewed as dimensionality reduction technique. stream S: point in n dimensional space, want to compute l2(S) sketch operator can be viewed as an approximate embedding of ln 2 to sketch space C such that 1. Each point in C can be described using only small number (say m) of numbers so C ⊂ Rm and 2. value of l2(S) is approximately equal to F(C(S)). F(Y1, Y2, ..Yt) = median(Y1, Y2, .., Yt)
11. 11. Dimensionality Reduction F(Y1, Y2, ..Yt) = median(Y1, Y2, .., Yt) Disadvantage: F is not a norm–performing any nontrivial operations in the sketch space (e.g. clustering, similarity search, regression etc.) becomes diﬃcult. Can we embed from ln 2 to lm 2 , m << n approximately preserving the distance ? Johnson-Lindenstrauss Lemma
12. 12. Interlude to Normal Distribution Normal distribution N(0, 1): Range (−∞, ∞) Density f (x) = e−x2 / √ 2π Mean=0, Variance=1 Basic facts If X and Y are independent random variables with normal distribution then so is X + Y If X and Y are independent with mean 0 then E [X + Y ]2 = E X2 + E Y 2 E cX = cE X , Var cX = c2Var X
13. 13. A Diﬀerent Linear Sketch Instead of ±1 let ri be a i.i.d. random variable from N(0, 1). Consider Z = i ri xi E Z2 = E ( i ri xi )2 = i E r2 i x2 i = i Var ri x2 i = i x2 i = ||x||2 2. As before we maintain Z = [Z1, Z2, ..., Zt] and deﬁne Y = ||Z||2 2 E Y = t||x||2 2 We show that there exists constant C > 0 s.t. for small enough > 0 Pr |Y − t||x||2 2| > t||x||2 2 ≤ e−C 2t (JL lemma) set t = O( 1 2 log 1 δ )
14. 14. Johnson Lindenstrauss Lemma Lemma For any 0 < epsilon < 1 and any integer m, let t be a positive integer such that t > 4 ln m 2/2 + 3/3 Then for any set V of m points in Rn, there is a map f : Rn → Rt such that for all u and v ∈ V , (1 − )||u − v||2 2 ≤ ||f (u) − f (v)||2 2 ≤ (1 + )||u − v||2 2. Furthermore this map can be found in randomized polynomial time.
1. #### A particular slide catching your eye?

Clipping is a handy way to collect important slides you want to go back to later.