Hash Functions FTW

Presentation on Hash Functions, Bloom Filters, and Hash-Oriented Storage

  1. Hash Functions FTW*
     Fast Hashing, Bloom Filters & Hash-Oriented Storage
     Sunny Gleason
     * For the win (see urbandictionary FTW[1]); this expression has nothing to do with hash functions
  2. What’s in this Presentation
     • Hash Function Survey
     • Hash Performance
     • Bloom Filters
     • HashFile : Hash Storage
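  3. Hash Functions
     int getIntHash(byte[] data);   // 32-bit
     long getLongHash(byte[] data); // 64-bit

     int v1 = hash("foo".getBytes());
     int v2 = hash("goo".getBytes());

     int hash(byte[] value) { // a simple hash
       int h = 0;
       for (byte b : value) {
         // rotate left by 5; the unsigned shift keeps the sign bit from smearing
         h = (h << 5) ^ (h >>> 27) ^ b;
       }
       return Math.floorMod(h, PRIME); // non-negative even when h is negative
     }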
  4. Hash Functions
     • Goal : v1 has many bit differences from v2, even when the inputs differ only slightly
     • Desirable Properties:
       • Uniform Distribution - as few collisions as possible
       • Very Fast Computation
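The "many bit differences" goal can be measured directly by XOR-ing two hash values and counting the set bits. The sketch below is illustrative only: the class name `BitDiff` is hypothetical, and `String.hashCode()` stands in for the deck's `hash(...)` so that the example runs on its own.

```java
class BitDiff {
    // Number of bit positions in which two 32-bit hash values differ.
    static int bitDifference(int v1, int v2) {
        return Integer.bitCount(v1 ^ v2);
    }

    public static void main(String[] args) {
        // String.hashCode() is only a stand-in for hash("foo") / hash("goo").
        int v1 = "foo".hashCode();
        int v2 = "goo".hashCode();
        System.out.println(bitDifference(v1, v2) + " of 32 bits differ");
    }
}
```

For a well-mixed 32-bit hash, similar inputs should differ in roughly half of the bits (about 16) on average.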
  5. Hash Applications
     Goal: O(1) access
     • Hash Table
     • Hash Set
     • Bloom Filter
  6. Popular Hash Functions
     • FNV Hash
     • DJB Hash
     • Jenkins Hash
     • Murmur2
     • New (Promising?): CrapWow
     • Awesome & Slow: SHA-1, MD5 etc.
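Two of the functions named above are short enough to show in full. The following is a minimal sketch of 32-bit FNV-1a and of the Bernstein "djb2" variant (multiply by 33, add the byte); the class name `SimpleHashes` is hypothetical, and the deck does not say which FNV or DJB variants it benchmarked.

```java
import java.nio.charset.StandardCharsets;

class SimpleHashes {
    // FNV-1a, 32-bit: XOR each byte in, then multiply by the FNV prime.
    static int fnv1a32(byte[] data) {
        int h = 0x811C9DC5;        // FNV 32-bit offset basis
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= 0x01000193;       // FNV 32-bit prime (16777619)
        }
        return h;
    }

    // djb2: h = h * 33 + byte, starting from 5381.
    static int djb2(byte[] data) {
        int h = 5381;
        for (byte b : data) {
            h = h * 33 + (b & 0xFF);
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] key = "foo".getBytes(StandardCharsets.UTF_8);
        System.out.printf("fnv1a32=%08x djb2=%08x%n", fnv1a32(key), djb2(key));
    }
}
```

Both loops do one multiply and one XOR or add per input byte, which is why these functions are fast but mix bits less thoroughly than Jenkins or Murmur2.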
  7. Evaluating Hash Functions
     • Hash Function “Zoo”
     • Quality of: CRC32, DJB, Jenkins, FNV, Murmur2, SHA1
     • Performance: [bar chart of throughput in MM ops/s; series labels and values not recoverable]
  8. A Strawman “Set”
     • N keys, K bytes per key
     • Allocate array of size K * N bytes
     • Utilize array storage as:
       • a heap or tree: O(lg N) insert/delete/remove
       • a hash: O(1) insert/delete/remove
     • What if we don’t have room for K*N bytes?
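To make the K * N cost concrete, the arithmetic can be worked through for one illustrative case. The class name `SetSizing` and the chosen figures (1BN keys of 8 bytes each, compared with 8 bits per key) are assumptions picked to match numbers used elsewhere in the deck, not measurements.

```java
class SetSizing {
    // Bytes needed to store every key verbatim: K * N.
    static long exactSetBytes(long nKeys, long bytesPerKey) {
        return Math.multiplyExact(nKeys, bytesPerKey);
    }

    // Bytes needed for a bit vector of r bits per key, rounded up to whole bytes.
    static long bitVectorBytes(long nKeys, long bitsPerKey) {
        return (Math.multiplyExact(nKeys, bitsPerKey) + 7) / 8;
    }

    public static void main(String[] args) {
        long n = 1_000_000_000L; // illustrative: 1BN keys
        System.out.println("exact set:  " + exactSetBytes(n, 8) + " bytes");
        System.out.println("bit vector: " + bitVectorBytes(n, 8) + " bytes");
    }
}
```

With these inputs the exact set needs 8 GB for the keys alone, while an 8-bits-per-key bit vector needs 1 GB.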
  9. Bloom Filter
     • Key Point: give up on storing all the keys
     • Store r bits per key instead of K bytes
     • Allocate bit vector of size: M = r * N, where N is expected number of entries
     • Use multiple hash functions of key to determine which bits to set
     • Premise: if hash functions are well-distributed, few collisions, high accuracy
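The bullets above translate into a small amount of code. This is a minimal sketch, not the g414-hash implementation: the class name `BloomSketch` is hypothetical, the k probe positions are derived from two seeded FNV-1a hashes (h1 + i * h2) as a stand-in for "multiple hash functions", and the second seed is an arbitrary constant.

```java
import java.util.BitSet;

class BloomSketch {
    private final BitSet bits;
    private final int m; // total bits, M = r * N
    private final int k; // probes per key, about 0.7 * r

    BloomSketch(int expectedEntries, int bitsPerKey) {
        this.m = Math.multiplyExact(expectedEntries, bitsPerKey);
        this.k = Math.max(1, (int) Math.round(0.7 * bitsPerKey));
        this.bits = new BitSet(m);
    }

    // Seeded 32-bit FNV-1a; the seed replaces the usual offset basis.
    private static int fnv1a(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }

    // i-th probe position: h1 + i * h2, reduced to [0, m).
    private int index(byte[] key, int i) {
        int h1 = fnv1a(key, 0x811C9DC5);
        int h2 = fnv1a(key, 0x9747B28C) | 1; // arbitrary second seed, odd stride
        return Math.floorMod(h1 + i * h2, m);
    }

    void add(byte[] key) {
        for (int i = 0; i < k; i++) {
            bits.set(index(key, i));
        }
    }

    // false means definitely absent; true means possibly present.
    boolean mightContain(byte[] key) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A key that was added always tests positive, because its k bits are never cleared; false positives arise only when other keys happen to have set all k of a missing key's bits.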
  10. Bloom Filter
     [diagram]
  11. Tuning Bloom Filters
     Let r = M bits / N keys   (r: number of bits per key)
     Let k = 0.7 * r           (k: number of hashes to use)
     Let p = 0.6185 ** r       (p: probability of a false positive)
     Working backwards, we can use the desired false positive rate p to tune the space consumption of the data structure:
       r =  8, p = 2.1e-2
       r = 16, p = 4.5e-4
       r = 24, p = 9.8e-6
       r = 32, p = 2.1e-7
       r = 40, p = 4.5e-9
       r = 48, p = 9.6e-11
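The tuning rules can be evaluated in both directions with a few lines of arithmetic. In this sketch the class and method names are hypothetical; the constants 0.7 and 0.6185 are taken from the slide, and the inverse r = ln p / ln 0.6185 follows from p = 0.6185 ** r.

```java
class BloomTuning {
    // k = 0.7 * r, rounded to a whole number of hashes (at least one).
    static int numHashes(double bitsPerKey) {
        return Math.max(1, (int) Math.round(0.7 * bitsPerKey));
    }

    // p = 0.6185 ** r
    static double falsePositiveRate(double bitsPerKey) {
        return Math.pow(0.6185, bitsPerKey);
    }

    // Working backwards: r = ln p / ln 0.6185
    static double bitsPerKeyFor(double targetRate) {
        return Math.log(targetRate) / Math.log(0.6185);
    }

    public static void main(String[] args) {
        for (int r = 8; r <= 48; r += 8) {
            System.out.printf("r = %2d, k = %2d, p = %.1e%n",
                    r, numHashes(r), falsePositiveRate(r));
        }
        System.out.printf("p = 1e-3 needs r = %.1f bits/key%n", bitsPerKeyFor(1e-3));
    }
}
```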
  12. Bloom Filter Performance
       100MM entries,  8 bits/key : 833k ops/s
       100MM entries, 32 bits/key : 256k ops/s
       1BN entries,    8 bits/key : 714k ops/s
       1BN entries,   32 bits/key : 185k ops/s
     Hypothesis : difference between 100MM and 1BN is due to locality of memory access in smaller bit vector
  13. Hash-Oriented Storage
     • HashFile : 64-bit clone of djb’s constant db “CDB”
     • Plain ol’ Key/Value storage:
       add(byte[] k, byte[] v), byte[] lookup(byte[] k)
     • Constant aka “Immutable” Data Store:
       create(), add(k, v) ... , build() ... before lookup(k)
     • Use properties of hash table to achieve O(1) disk seeks per lookup
  14. HashFile Structure
     • Header (fixed width): table pointers, contains offsets of hash tables and count of elements per table
     • Body (variable width): contains concatenation of all keys and values (with data lengths)
     • Footer (fixed width): hash “tables” containing long hash values of keys alongside long offsets into body
  15. HashFile Diagram
     HEADER              BODY                           FOOTER
     p1s3p2s4p3s2p4s1    k1v1k2v2k3v3k4v4k5v5k6v6k7v7   hk7o7hk3o3hk4o4hk1o1
     • Create: initialize empty header, start appending keys/values while recording offsets and hash values of keys
     • Build: take list of hash values and offsets and turn them into hash tables, backfill header with values
     • Lookup: compute hash(key), compute offset into table (hash modulo size of table), use table to find offset into body, return the value from body
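The create / build / lookup cycle can be sketched in memory. This is a simplified illustration rather than the actual HashFile format: it keeps everything in byte arrays instead of on disk, uses a single open-addressing table instead of several tables located through a header, and the class name `HashFileSketch`, the 64-bit FNV-1a hash, and the record layout [keyLen][valueLen][key][value] are all assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HashFileSketch {
    private final ByteArrayOutputStream body = new ByteArrayOutputStream();
    private final List<long[]> pending = new ArrayList<>(); // {hash, body offset}
    private byte[] data;  // frozen body, set by build()
    private long[] table; // "footer": pairs of (hash, offset + 1); 0 marks an empty slot

    // Create phase: append the record and remember its hash and offset.
    void add(byte[] k, byte[] v) {
        pending.add(new long[] {hash(k), body.size()});
        byte[] lens = ByteBuffer.allocate(8).putInt(k.length).putInt(v.length).array();
        body.write(lens, 0, lens.length);
        body.write(k, 0, k.length);
        body.write(v, 0, v.length);
    }

    // Build phase: turn the recorded (hash, offset) pairs into a hash table.
    void build() {
        data = body.toByteArray();
        int slots = Math.max(2, pending.size() * 2); // load factor of at most 0.5
        table = new long[slots * 2];
        for (long[] e : pending) {
            int s = (int) Long.remainderUnsigned(e[0], slots);
            while (table[2 * s + 1] != 0) {
                s = (s + 1) % slots; // linear probing
            }
            table[2 * s] = e[0];
            table[2 * s + 1] = e[1] + 1;
        }
    }

    // Lookup: hash -> table slot -> body offset -> compare key -> value.
    byte[] lookup(byte[] k) {
        long h = hash(k);
        int slots = table.length / 2;
        int s = (int) Long.remainderUnsigned(h, slots);
        while (table[2 * s + 1] != 0) {
            if (table[2 * s] == h) {
                ByteBuffer buf = ByteBuffer.wrap(data);
                buf.position((int) (table[2 * s + 1] - 1));
                int kLen = buf.getInt();
                int vLen = buf.getInt();
                byte[] key = new byte[kLen];
                buf.get(key);
                if (Arrays.equals(key, k)) {
                    byte[] v = new byte[vLen];
                    buf.get(v);
                    return v;
                }
            }
            s = (s + 1) % slots;
        }
        return null; // key not present
    }

    // 64-bit FNV-1a, standing in for whichever long hash the real file uses.
    static long hash(byte[] d) {
        long h = 0xCBF29CE484222325L;
        for (byte b : d) {
            h ^= (b & 0xFF);
            h *= 0x100000001B3L;
        }
        return h;
    }
}
```

Storing the full hash next to each offset lets a lookup skip most non-matching slots without touching the body, which is what keeps the on-disk version close to one table read plus one body read.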
  16. HashFile Performance
     • Spec: ≤ 2 disk seeks per lookup
     • Number of seeks independent of number of entries
     • X25E SSD, 1BN entries with 8-byte keys and values (41GB): 650μs lookup w/ cold cache, up to 700x faster as filesystem cache warms, 0.9μs when in-memory
     • With 100MM entries (4GB), cold cache is ~600μs (from locality), 0.6μs warm
  17. Conclusions
     • Be aware of different Hash Functions and their collision / performance tradeoffs
     • Bloom Filters are extremely useful for fast, large-scale set membership
     • HashFile provides excellent performance in cases where a static K/V store suffices
  18. Future Work
     • Implement CrapWow (cWow) hash in Java
     • Extend HashFile with configurable hash, pointer, and key/value lengths to conserve space (reduce 24 bytes-per-KV overhead)
     • Implement a read-write (non-constant) version of HashFile
     • Bloom Filter that spills to SSD
  19. Thank You!
     ...Any questions? :)
  20. References
     • GitHub Project: g414-hash (hash function, bloom filter, HashFile implementations)
     • Wikipedia: Hash Function, Bloom Filter
     • Non-Cryptographic Hash Function Zoo
     • DJB CDB, sg-cdb (Java implementation)