Presentation on Hash Functions, Bloom Filters, and Hash-Oriented Storage

Published in:
Technology


- 1. Hash Functions FTW* Fast Hashing, Bloom Filters & Hash-Oriented Storage Sunny Gleason * For the win (see urbandictionary FTW[1]); this expression has nothing to do with hash functions
- 2. What’s in this Presentation • Hash Function Survey • Hash Performance • Bloom Filters • HashFile : Hash Storage
- 3. Hash Functions
     int getIntHash(byte[] data);   // 32-bit
     long getLongHash(byte[] data); // 64-bit
     int v1 = hash("foo".getBytes());
     int v2 = hash("goo".getBytes());
     int hash(byte[] value) { // a simple hash
       int h = 0;
       for (byte b : value) {
         h = (h << 5) ^ (h >>> 27) ^ b; // rotate left by 5, then mix in the byte
       }
       return Math.floorMod(h, PRIME); // keep the result non-negative
     }
- 4. Hash Functions • Goal: v1 has many bit differences from v2, even for similar inputs • Desirable Properties: • Uniform Distribution: few collisions • Very Fast Computation
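The "many bit differences" goal can be checked directly by XOR-ing two hash values and counting the set bits. The sketch below is illustrative only: the class and method names are hypothetical, and `rawHash` is the rotate-and-XOR loop from slide 3 without the final modulo step.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch; class and method names are not from the slides.
class BitDifferenceDemo {
    // Rotate-and-XOR hash in the style of slide 3, before the modulo step.
    static int rawHash(byte[] value) {
        int h = 0;
        for (byte b : value) {
            h = Integer.rotateLeft(h, 5) ^ b;
        }
        return h;
    }

    // Number of bit positions in which two 32-bit hash values differ.
    static int bitDifference(int a, int b) {
        return Integer.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        int v1 = rawHash("foo".getBytes(StandardCharsets.UTF_8));
        int v2 = rawHash("goo".getBytes(StandardCharsets.UTF_8));
        System.out.println("bits differing: " + bitDifference(v1, v2));
    }
}
```

For "foo" and "goo" this simple hash differs in only one bit, because the single-bit input difference is merely rotated, never spread. That is a small illustration of why the better-mixing functions surveyed in the following slides exist.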
- 5. Hash Applications Goal: O(1) access • Hash Table • Hash Set • Bloom Filter
- 6. Popular Hash Functions • FNV Hash • DJB Hash • Jenkins Hash • Murmur2 • New (Promising?): CrapWow • Awesome & Slow: SHA-1, MD5 etc.
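As a concrete instance of one entry on this list, here is a minimal sketch of the 32-bit FNV-1a variant, using the published offset basis and prime. The slides do not say which FNV variant was benchmarked, so the choice of FNV-1a and the class name are assumptions.

```java
import java.nio.charset.StandardCharsets;

// Minimal 32-bit FNV-1a sketch; the variant choice is an assumption.
class Fnv1aDemo {
    static final int OFFSET_BASIS = 0x811c9dc5; // standard 32-bit FNV offset basis
    static final int FNV_PRIME = 0x01000193;    // standard 32-bit FNV prime

    static int fnv1a32(byte[] data) {
        int h = OFFSET_BASIS;
        for (byte b : data) {
            h ^= (b & 0xff); // mix in the byte as an unsigned value
            h *= FNV_PRIME;  // int overflow gives multiplication mod 2^32
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] key = "foo".getBytes(StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(fnv1a32(key)));
    }
}
```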
- 7. Evaluating Hash Functions • Hash Function “Zoo” • Quality of: CRC32, DJB, Jenkins, FNV, Murmur2, SHA1 • Performance: [bar chart of hash throughput in MM ops/s; chart title, legend, and values not recoverable from the transcript]
- 8. A Strawman “Set” • N keys, K bytes per key • Allocate array of size K * N bytes • Utilize array storage as: • a heap or tree: O(lg N) insert/delete/remove • a hash: O(1) insert/delete/remove • What if we don’t have room for K*N bytes?
- 9. Bloom Filter • Key Point: give up on storing all the keys • Store r bits per key instead of K bytes • Allocate bit vector of size: M = r * N, where N is expected number of entries • Use multiple hash functions of key to determine which bits to set • Premise: if hash functions are well-distributed, few collisions, high accuracy
- 10. Bloom Filter
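The scheme described on slide 9 could be sketched as follows. This is not the g414-hash implementation: the class name is hypothetical, and deriving the k bit positions from two seeded FNV-1a hashes (a common "double hashing" shortcut) is an assumption made to keep the example short.

```java
import java.util.BitSet;

// Illustrative Bloom filter; not the implementation behind the slides.
class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;   // M = r * N
    private final int numHashes; // k, about 0.7 * r as on slide 11

    SimpleBloomFilter(int expectedEntries, int bitsPerKey) {
        // Assumes r * N fits in an int; a real filter would use long sizes.
        this.numBits = expectedEntries * bitsPerKey;
        this.numHashes = Math.max(1, (int) Math.round(0.7 * bitsPerKey));
        this.bits = new BitSet(numBits);
    }

    // Seeded 32-bit FNV-1a; the seeding scheme is an assumption.
    private static int fnv1a(byte[] data, int seed) {
        int h = 0x811c9dc5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // i-th bit position for a key: (h1 + i * h2) mod M.
    private int index(byte[] key, int i) {
        int h1 = fnv1a(key, 0);
        int h2 = fnv1a(key, 0x9e3779b9) | 1; // force an odd step
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    // false means definitely absent; true means probably present.
    boolean mightContain(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A filter built with `new SimpleBloomFilter(1000, 8)` never reports a false negative for a key that was added; false positives are possible at roughly the rate that slide 11 gives for 8 bits per key.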
- 11. Tuning Bloom Filters • Let r = M bits / N keys (r: number of bits per key) • Let k = 0.7 * r (k: number of hashes to use) • Let p = 0.6185 ** r (p: probability of a false positive) • Working backwards, we can use the desired false positive rate p to tune the data structure's space consumption: r = 8, p = 2.1e-2; r = 16, p = 4.5e-4; r = 24, p = 9.8e-6; r = 32, p = 2.1e-7; r = 40, p = 4.5e-9; r = 48, p = 9.6e-11
- 12. Bloom Filter Performance • 100MM entries, 8 bits/key: 833k ops/s • 100MM entries, 32 bits/key: 256k ops/s • 1BN entries, 8 bits/key: 714k ops/s • 1BN entries, 32 bits/key: 185k ops/s • Hypothesis: the difference between 100MM and 1BN is due to locality of memory access in the smaller bit vector
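The tuning rules above are easy to evaluate in code. The constants come from the standard Bloom filter analysis: the optimal k is r * ln 2 (about 0.7 * r), which gives p = (1/2)^k, or about 0.6185^r. The sketch below uses the slide's rounded constants; the class and method names are hypothetical.

```java
// Sketch of the slide's tuning formulas; names are illustrative.
class BloomTuning {
    // p = 0.6185 ** r
    static double falsePositiveRate(double bitsPerKey) {
        return Math.pow(0.6185, bitsPerKey);
    }

    // k = 0.7 * r, rounded to a whole number of hash functions (at least 1)
    static int numHashes(double bitsPerKey) {
        return Math.max(1, (int) Math.round(0.7 * bitsPerKey));
    }

    // Working backwards: bits per key needed for a target false positive rate.
    static double bitsPerKeyFor(double targetRate) {
        return Math.log(targetRate) / Math.log(0.6185);
    }

    public static void main(String[] args) {
        for (int r = 8; r <= 48; r += 8) {
            System.out.printf("r = %d, k = %d, p = %.1e%n",
                    r, numHashes(r), falsePositiveRate(r));
        }
    }
}
```

For example, a target false positive rate of 1% works out to a little under 10 bits per key.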
- 13. Hash-Oriented Storage • HashFile : 64-bit clone of djb’s constant db “CDB” • Plain ol’ Key/Value storage: add(byte[] k, byte[] v), byte[] lookup(byte[] k) • Constant aka “Immutable” Data Store create(), add(k, v) ... , build() ... before lookup(k) • Use properties of hash table to achieve O(1) disk seeks per lookup
- 14. HashFile Structure • Header (fixed width): table pointers; contains the offsets of the hash tables and the count of elements per table • Body (variable width): contains the concatenation of all keys and values (with data lengths) • Footer (fixed width): hash “tables” containing long hash values of keys alongside long offsets into the body
- 15. HashFile Diagram • HEADER: p1 s3 | p2 s4 | p3 s2 | p4 s1 • BODY: k1 v1 | k2 v2 | k3 v3 | k4 v4 | k5 v5 | k6 v6 | k7 v7 • FOOTER: hk7 o7 | hk3 o3 | hk4 o4 | hk1 o1 • Create: initialize empty header, start appending keys/values while recording offsets and hash values of keys • Build: take list of hash values and offsets and turn them into hash tables, backfill header with values • Lookup: compute hash(key), compute offset into table (hash modulo size of table), use table to find offset into body, return the value from body
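The create / build / lookup cycle can be sketched in memory. This is a deliberately simplified illustration, not the g414-hash file format: it keeps everything in arrays instead of on disk, uses one hash table rather than several, and assumes 64-bit FNV-1a as the hash and linear probing for collisions. All names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified in-memory illustration; NOT the actual HashFile layout.
class MiniHashFile {
    private final ByteArrayOutputStream body = new ByteArrayOutputStream();
    private final List<long[]> entries = new ArrayList<>(); // {hash, body offset}
    private long[] slotHash;   // "footer": hash value per slot
    private long[] slotOffset; // "footer": body offset + 1 per slot, 0 = empty
    private byte[] data;       // frozen body after build()

    static long hash(byte[] key) { // 64-bit FNV-1a, a stand-in hash
        long h = 0xcbf29ce484222325L;
        for (byte b : key) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Create phase: append lengths, key, and value; remember hash and offset.
    void add(byte[] k, byte[] v) {
        entries.add(new long[] { hash(k), body.size() });
        byte[] lengths = ByteBuffer.allocate(8).putInt(k.length).putInt(v.length).array();
        body.write(lengths, 0, lengths.length);
        body.write(k, 0, k.length);
        body.write(v, 0, v.length);
    }

    // Build phase: turn the recorded (hash, offset) pairs into a hash table.
    void build() {
        data = body.toByteArray();
        int size = Math.max(1, entries.size() * 2); // at most half full
        slotHash = new long[size];
        slotOffset = new long[size];
        for (long[] e : entries) {
            int i = (int) Long.remainderUnsigned(e[0], size);
            while (slotOffset[i] != 0) {
                i = (i + 1) % size; // linear probing
            }
            slotHash[i] = e[0];
            slotOffset[i] = e[1] + 1;
        }
    }

    // Lookup: hash modulo table size, probe, then read the record from the body.
    byte[] lookup(byte[] k) {
        long h = hash(k);
        int size = slotHash.length;
        int i = (int) Long.remainderUnsigned(h, size);
        while (slotOffset[i] != 0) {
            if (slotHash[i] == h) {
                int off = (int) (slotOffset[i] - 1);
                ByteBuffer buf = ByteBuffer.wrap(data, off, data.length - off);
                byte[] key = new byte[buf.getInt()];
                byte[] value = new byte[buf.getInt()];
                buf.get(key);
                if (Arrays.equals(key, k)) {
                    buf.get(value);
                    return value;
                }
            }
            i = (i + 1) % size;
        }
        return null; // key not present
    }
}
```

As on slide 13, all `add` calls must come before `build()`, and `lookup` is only valid afterwards. Storing the full hash next to each offset lets most mismatches be rejected without touching the body, which is what keeps the on-disk version to a small, fixed number of seeks.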
- 16. HashFile Performance • Spec: ≤ 2 disk seeks per lookup • Number of seeks independent of number of entries • X25E SSD: 1BN 8-byte keys, values (41GB): 650μs lookup w/ cold cache, up to 700x faster as filesystem cache warms, 0.9μs when in-memory • With 100MM entries (4GB), cold cache is ~600μs (from locality), 0.6μs warm
- 17. Conclusions • Be aware of different Hash Functions and their collision / performance tradeoffs • Bloom Filters are extremely useful for fast, large-scale set membership • HashFile provides excellent performance in cases where a static K/V store suffices
- 18. Future Work • Implement cWow hash in Java • Extend HashFile with configurable hash, pointer, and key/value lengths to conserve space (reduce 24 bytes-per-KV overhead) • Implement a read-write (non-constant) version of HashFile • Bloom Filter that spills to SSD
- 19. Thank You! ...Any questions? :)
- 20. References • GitHub Project: g414-hash (hash function, bloom filter, HashFile implementations) • Wikipedia: Hash Function, Bloom Filter • Non-Cryptographic Hash Function Zoo • DJB CDB, sg-cdb (java implementation)
