Bloom filter
    Bloom filter Presentation Transcript

    • Bloom Filter YHD Search Sharing 2013-04-23
    • Outline
        ● Interview questions
        ● Bloom Filter
            – Data structure
            – Probability of false positives
            – Set properties
        ● Application
            – Cache sharing: Squid
            – Speed up data access: HBase
            – ID mapping: Zoie
        ● Materials
    • Interview questions
        ● Crawler
            – Billions of web pages
            – How to keep track of crawled URLs?
        ● Straggler detection
            – You are manning the security desk of a large building
            – Everyone checks in or checks out with their ID
            – At the end of the day, identify the few stragglers left in the building
    • Data structure
        ● Init: a bit array of m bits, all set to 0
        ● Add an element
            – Apply k hash functions to get k array positions
            – Set the bits at all these positions to 1
        ● Query an element (test whether it is in the set)
            – Apply k hash functions to get k array positions
            – If any position is 0, the element is definitely not in the set
            – If all positions are 1, the element is probably in the set (false positives are possible)
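The add and query steps above can be sketched in Java as follows. This is a minimal illustration, not the implementation used by any of the systems in this deck. The class name `BloomFilterSketch` is hypothetical, and deriving the k positions by double hashing over two FNV-1a variants is an assumption made for brevity; the slides do not say which hash functions to use.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Minimal Bloom filter sketch: m bits, k probe positions per element. */
final class BloomFilterSketch {
    private final BitSet bits; // the m-bit array, initially all 0
    private final int m;       // number of bits
    private final int k;       // number of hash functions

    BloomFilterSketch(int m, int k) {
        if (m <= 0 || k <= 0) {
            throw new IllegalArgumentException("m and k must be positive");
        }
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    /** Add: set the bits at all k positions to 1. */
    void add(String element) {
        for (int i = 0; i < k; i++) {
            bits.set(position(element, i));
        }
    }

    /** Query: false means definitely absent, true means probably present. */
    boolean mightContain(String element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(element, i))) {
                return false;
            }
        }
        return true;
    }

    // Assumption: simulate k hash functions by double hashing, h1 + i * h2 (mod m).
    private int position(String element, int i) {
        byte[] data = element.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01000193) | 1; // force odd so the stride is never 0
        return Math.floorMod(h1 + i * h2, m);
    }

    // 32-bit FNV-1a style hash with a caller-supplied starting value.
    private static int fnv1a(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }
}
```

For the crawler question, a filter built as `new BloomFilterSketch(m, k)` would receive each crawled URL through `add`, and a `false` from `mightContain` would mean the URL has certainly not been crawled yet.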
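    • Probability of false positives
        ● 1 hash function, n elements x_1, …, x_n inserted into an m-bit array A
            [figure: bit array with two bits set to 1]
            – One insertion hits position i: $p(\mathrm{hash}[x_1] = i) = \frac{1}{m}$
            – Position i is still 0 after one insertion: $p(A[i] = 0 \mid \mathrm{hash}[x_1]) = 1 - \frac{1}{m}$
            – Position i is still 0 after n insertions: $p(A[i] = 0 \mid \mathrm{hash}[x_1, \dots, x_n]) = \left(1 - \frac{1}{m}\right)^n$
            – Position i is 1 after n insertions: $p(A[i] = 1 \mid \mathrm{hash}[x_1, \dots, x_n]) = 1 - \left(1 - \frac{1}{m}\right)^n$
            – Using $\lim_{x \to \infty} \left(1 - \frac{1}{x}\right)^{-x} = e$: $p(A[i] = 1) = 1 - \left(1 - \frac{1}{m}\right)^n \simeq 1 - e^{-n/m}$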
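    • Probability of false positives
        ● 1 hash function
            [figure: bit array with two bits set to 1]
            – For an element y that is not in the set S: $p(A[H(y)] = 1 \mid y \notin S) = \frac{\text{number of 1s}}{m}$, which is 2/23 in the pictured array
            – Given $p(A[i] = 1) = 1 - e^{-n/m}$, $E(\text{number of 1s}) = m \cdot (1 - e^{-n/m})$
            – So $p(A[H(y)] = 1 \mid y \notin S) = 1 - e^{-n/m}$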
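    • Probability of false positives
        ● K hash functions: repeat k times
            [figure: bit array with two bits set to 1]
            – Recap for 1 hash function: $p(A[H(y)] = 1 \mid y \notin S) = \frac{\text{number of 1s}}{m}$; given $p(A[i] = 1) = 1 - e^{-n/m}$ and $E(\text{number of 1s}) = m \cdot (1 - e^{-n/m})$, this is $1 - e^{-n/m}$
            – With k hash functions, all k probed positions must be 1: $p(A[H(y)] = 1 \mid y \notin S) = \left(\frac{\text{number of 1s}}{m}\right)^k$
            – Each insertion now sets k bits: $p(A[i] = 1) = 1 - \left(1 - \frac{1}{m}\right)^{kn} \simeq 1 - e^{-kn/m}$
            – $E(\text{number of 1s}) = m \cdot (1 - e^{-kn/m})$
            – So $p(A[H(y)] = 1 \mid y \notin S) = \left(1 - e^{-kn/m}\right)^k$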
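The two forms of the k-hash result can be compared numerically. The sketch below evaluates the expression before the limit is taken and the exponential approximation from the slide; the class and method names are hypothetical, and both formulas assume ideal, independent hash functions as the derivation does.

```java
/** False-positive formulas from the derivation, for m bits, n elements, k hashes. */
final class FalsePositiveRate {
    /** Before taking the limit: (1 - (1 - 1/m)^(k n))^k. */
    static double beforeLimit(long m, long n, int k) {
        double bitStillZero = Math.pow(1.0 - 1.0 / m, (double) k * n);
        return Math.pow(1.0 - bitStillZero, k);
    }

    /** Approximation from the slide: (1 - e^(-k n / m))^k. */
    static double approx(long m, long n, int k) {
        double bitStillZero = Math.exp(-((double) k * n) / m);
        return Math.pow(1.0 - bitStillZero, k);
    }
}
```

For m = 8000, n = 1000 and k = 5, both methods give roughly 0.0217, so the approximation is already very close at this array size.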
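    • Probability of false positives
        ● Minimal probability of false positives
            – $f = p(A[H(y)] = 1 \mid y \notin S) = \left(1 - e^{-kn/m}\right)^k = e^{k \ln(1 - e^{-kn/m})}$
            – Minimizing f is the same as minimizing $g = k \ln(1 - e^{-kn/m})$
            – Let $p = e^{-kn/m}$, so $k = -\frac{m}{n} \ln p$ and $g = -\frac{m}{n} \ln(p) \ln(1 - p)$
            – g is symmetric in p and 1 − p and is minimal at $p = \frac{1}{2}$, i.e. $k = \frac{m}{n} \ln 2$, giving $\min f = \left(\frac{1}{2}\right)^k$
            – $e^{-kn/m} = \frac{1}{2}$ is the probability that any specific bit is still 0: the optimal Bloom filter array is half full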
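The optimum can be turned into a small sizing helper. The sketch below is illustrative only, with hypothetical names. `optimalK` rounds k = (m/n) ln 2 to a usable integer, and `bitsFor` solves min f = (1/2)^k with k = (m/n) ln 2 for m, which gives m = −n ln f / (ln 2)^2; that rearrangement is not on the slide but follows directly from it.

```java
/** Sizing helpers derived from k = (m/n) ln 2 and min f = (1/2)^k. */
final class BloomSizing {
    /** Optimal number of hash functions, rounded to an integer and at least 1. */
    static int optimalK(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    /** Minimal false-positive rate at the real-valued optimum k = (m/n) ln 2. */
    static double minimalRate(long m, long n) {
        double k = (double) m / n * Math.log(2);
        return Math.pow(0.5, k);
    }

    /** Bits needed to hold n elements at target rate f: m = -n ln f / (ln 2)^2. */
    static long bitsFor(long n, double f) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(f) / (ln2 * ln2));
    }
}
```

With 8 bits per element, `optimalK(8000, 1000)` rounds 5.55 up to 6, and `bitsFor(1000, 0.01)` comes out at a little under 9.6 bits per element.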
    • Probability of false positives
        ● Examples: false-positive probability for a given m/n and k (optimal k = ln 2 · m/n)
            m/n   optimal k   k=1     k=2      k=3      k=4      k=5
            2     1.39        0.393   0.400
            3     2.08        0.283   0.237    0.253
            4     2.77        0.221   0.155    0.147    0.160
            5     3.46        0.181   0.109    0.092    0.092    0.101
            6     4.16        0.154   0.0804   0.0609   0.0561   0.0578
            7     4.85        0.133   0.0618   0.0423   0.0359   0.0347
            8     5.55        0.118   0.0489   0.0306   0.024    0.0217
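The table entries can be checked empirically. The following sketch is a simulation under the same idealized assumption as the derivation: every hash value is an independent, uniformly random position. A seeded `java.util.Random` stands in for the hash functions, and the class name and parameters are hypothetical.

```java
import java.util.BitSet;
import java.util.Random;

/** Monte Carlo estimate of the false-positive rate under ideal uniform hashing. */
final class FalsePositiveSimulation {
    static double estimate(int m, int n, int k, int trials, long seed) {
        Random rnd = new Random(seed);
        BitSet bits = new BitSet(m);
        // Insert n elements: each one sets k uniformly random positions.
        for (int i = 0; i < n * k; i++) {
            bits.set(rnd.nextInt(m));
        }
        int falsePositives = 0;
        // Probe with fresh elements: a false positive needs all k positions set.
        for (int t = 0; t < trials; t++) {
            boolean allSet = true;
            for (int j = 0; j < k; j++) {
                if (!bits.get(rnd.nextInt(m))) {
                    allSet = false;
                }
            }
            if (allSet) {
                falsePositives++;
            }
        }
        return (double) falsePositives / trials;
    }
}
```

A call such as `FalsePositiveSimulation.estimate(80000, 10000, 5, 100000, 42L)` corresponds to the m/n = 8, k = 5 cell and should land near the tabulated 0.0217, up to sampling noise.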
    • Set properties
        ● Union (bitwise OR)
            – Same as the Bloom filter created from scratch using the union of the two sets
        ● Intersection (bitwise AND)
            – The false-positive probability of the resulting Bloom filter is at most the false-positive probability of one of the constituent Bloom filters, but may be larger than that of the Bloom filter created from scratch using the intersection of the two sets
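Both operations reduce to bitwise operations on the underlying arrays, provided the two filters have the same size m and use the same hash functions. A minimal sketch using `java.util.BitSet` as the bit array (the class name `BloomSetOps` is hypothetical):

```java
import java.util.BitSet;

/** Set operations on two Bloom filter bit arrays of equal size and hashing. */
final class BloomSetOps {
    /** Union: bitwise OR, identical to a filter built from the union of the sets. */
    static BitSet union(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.or(b);
        return result;
    }

    /** Intersection: bitwise AND, which may keep bits that no common element set. */
    static BitSet intersection(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }
}
```

The looser bound for intersection is easy to see with k = 2: if x sets bits {1, 5} in one filter and y sets bits {5, 9} in the other, the sets have no common element, yet the AND still has bit 5 set, whereas a filter built from the empty intersection would be all zeros.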
    • Squid : Cache Digests
    • Squid : Cache Digests
        ● False positive:
            – Proxy A thinks Proxy B has URL U cached. A asks B for the cached U, B responds “no”, and A goes to the actual website.
    • HBase : architecture
    • HBase : HFile format ● (Not including Bloom Filter)
    • HBase : Query optimization
        ● Bloom Filter
            – Stored as metadata of the HFile
            – Used to determine whether a given key is in that store file
        ● Characteristics
            – Known total KV count (N), but the actual count can often be much lower
            – HFile.insert (and hence BloomFilter.add) commands are done in lexicographically increasing order
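    • Application : Zoie
        ● Long[] uidArray
            – Add element
                int h = (int) ((uid >>> 32) ^ uid) * MIXER;  // fold the 64-bit uid into 32 bits, then mix
                long bits = _filter[h & _mask];              // select one 64-bit word of the filter
                bits |= (1L << (h >>> 26));                  // first bit: index from the top 6 bits of h
                bits |= (1L << ((h >> 20) & 0x3F));          // second bit: index from the next 6 bits of h
                _filter[h & _mask] = bits;
            – Query element
                final int h = (int) ((uid >>> 32) ^ uid) * MIXER;
                final int p = h & _mask;
                // check the filter
                final long bits = _filter[p];
                if ((bits & (1L << (h >>> 26))) == 0 || (bits & (1L << ((h >> 20) & 0x3F))) == 0)
                    return -1;  // at least one bit is 0, so the uid is definitely absent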
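The two fragments above can be wrapped into a self-contained class for experimentation. This is a sketch under stated assumptions, not Zoie's actual class: the name `UidBloomSketch`, the value chosen for `MIXER`, and the power-of-two word count are all illustrative, since the slide does not show how `MIXER`, `_filter` or `_mask` are initialized.

```java
/** Sketch of the slide's filter: two bits per uid inside a single 64-bit word. */
final class UidBloomSketch {
    // Assumption: an arbitrary odd multiplier; the slide does not give MIXER's value.
    private static final int MIXER = 0x9E3779B1;
    private final long[] filter;
    private final int mask;

    /** Assumption: the word count is a power of two so that (h & mask) selects a word. */
    UidBloomSketch(int words) {
        if (words <= 0 || Integer.bitCount(words) != 1) {
            throw new IllegalArgumentException("words must be a positive power of two");
        }
        filter = new long[words];
        mask = words - 1;
    }

    // Same folding and mixing expression as on the slide.
    private static int hash(long uid) {
        return (int) ((uid >>> 32) ^ uid) * MIXER;
    }

    void add(long uid) {
        int h = hash(uid);
        long bits = filter[h & mask];
        bits |= 1L << (h >>> 26);         // index from the top 6 bits of h
        bits |= 1L << ((h >> 20) & 0x3F); // index from the next 6 bits of h
        filter[h & mask] = bits;
    }

    /** False means definitely absent; true means a full lookup is still needed. */
    boolean mightContain(long uid) {
        int h = hash(uid);
        long bits = filter[h & mask];
        return (bits & (1L << (h >>> 26))) != 0
            && (bits & (1L << ((h >> 20) & 0x3F))) != 0;
    }
}
```

Because both bits live in the same `long`, a query touches a single array element, which trades a somewhat higher false-positive rate for one memory access per lookup.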
    • Materials
        ● http://www.cs.jhu.edu/~fabian/courses/CS600.624/slides/bloomslides.pdf
        ● Google tech talk, bloom filtering
        ● http://www.slideshare.net/jaxlondon2012/intro-to-hbase-lars-george
        ● https://issues.apache.org/jira/secure/attachment/12444007/Bloom_Filters_in_HBase.pdf