Hash Functions FTW*
Fast Hashing, Bloom Filters & Hash-Oriented Storage

Sunny Gleason

* For the win (see urbandictionary FTW[1]); this expression has nothing to do with hash functions
What’s in this Presentation

• Hash Function Survey
• Hash Performance
• Bloom Filters
• HashFile : Hash Storage
Hash Functions
int getIntHash(byte[] data);   // 32-bit
long getLongHash(byte[] data); // 64-bit

int v1 = hash("foo".getBytes()); int v2 = hash("goo".getBytes());

int hash(byte[] value) { // a simple rotate-and-XOR hash
    int h = 0;
    for (byte b : value) { h = (h << 5) ^ (h >>> 27) ^ b; } // >>> avoids sign extension
    return (h & 0x7FFFFFFF) % PRIME; // mask keeps the modulus non-negative
}
Hash Functions

• Goal: v1 has many bit differences from v2 (the “avalanche” property; quick check below)
• Desirable Properties:
 • Uniform distribution: as few collisions as possible
 • Very fast computation
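
A quick way to check the avalanche goal, assuming the simple hash from the previous slide: XOR the two values and count the differing bits.

    int v1 = hash("foo".getBytes());
    int v2 = hash("goo".getBytes());
    // a well-mixing 32-bit hash flips roughly half of the 32 bits
    System.out.println(Integer.bitCount(v1 ^ v2) + " of 32 bits differ");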
Hash Applications
Goal: O(1) access
   • Hash Table
   • Hash Set
   • Bloom Filter
Popular Hash Functions
• FNV Hash
• DJB Hash
• Jenkins Hash
• Murmur2
• New (Promising?): CrapWow
• Awesome & Slow: SHA-1, MD5 etc.
Evaluating Hash Functions
• Hash Function “Zoo”
• Quality of: CRC32, DJB, Jenkins, FNV, Murmur2, SHA1
• Performance (MM ops/s): [bar chart: throughput of Jenkins, Murmur2, and FNV1 at several key sizes; measurement sketch below]
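
Throughput numbers like the chart’s can be approximated with a crude harness such as the sketch below (a real benchmark would use JMH; hash() is assumed to be the candidate function under test):

    // Crude throughput measurement in millions of hashes per second
    static double mmOpsPerSec(byte[] key, int iterations) {
        int sink = 0; // fold results together so the JIT cannot drop the loop
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) { sink ^= hash(key); }
        double seconds = (System.nanoTime() - start) / 1e9;
        if (sink == 0) System.out.print(""); // keep sink observable
        return (iterations / seconds) / 1e6;
    }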
A Strawman “Set”
• N keys, K bytes per key
• Allocate array of size K * N bytes
• Utilize array storage as:
 • a heap or tree: O(lg N) insert/delete/lookup
 • a hash: O(1) insert/delete/lookup (sketch below)
• What if we don’t have room for K*N bytes?
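
To make the strawman concrete, here is a minimal open-addressing sketch over a single flat K*N-byte array, assuming fixed-width K-byte keys (class and helper names are hypothetical; deletion would additionally need tombstones):

    import java.util.Arrays;

    // Strawman set: N fixed-width K-byte keys, open-addressed in one flat array
    class FlatByteSet {
        final int K, N;
        final byte[] slots;    // K * N bytes of key storage
        final boolean[] used;  // occupancy flags

        FlatByteSet(int k, int n) { K = k; N = n; slots = new byte[k * n]; used = new boolean[n]; }

        boolean insert(byte[] key) {
            for (int i = slotOf(key), probes = 0; probes < N; i = (i + 1) % N, probes++) {
                if (!used[i]) { // claim the first empty slot: expected O(1) probes
                    System.arraycopy(key, 0, slots, i * K, K);
                    used[i] = true;
                    return true;
                }
                if (matches(key, i)) return true; // already present
            }
            return false; // table full
        }

        boolean contains(byte[] key) {
            for (int i = slotOf(key), probes = 0; used[i] && probes < N; i = (i + 1) % N, probes++) {
                if (matches(key, i)) return true;
            }
            return false;
        }

        private int slotOf(byte[] key) { return (Arrays.hashCode(key) & 0x7FFFFFFF) % N; }

        private boolean matches(byte[] key, int i) {
            return Arrays.equals(key, 0, K, slots, i * K, i * K + K);
        }
    }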
Bloom Filter
• Key Point: give up on storing all the keys
• Store r bits per key instead of K bytes
• Allocate bit vector of size: M = r * N,
  where N is expected number of entries
• Use multiple hash functions of the key to
  determine which bits to set
• Premise: if hash functions are well-distributed,
  few collisions, high accuracy (sketch below)
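
A minimal sketch of such a filter, deriving the k probe positions from one 64-bit hash via the standard double-hashing trick (names are hypothetical, not the g414-hash API):

    import java.util.BitSet;

    // Minimal Bloom filter: M = r * N bits, k probes per key (int-sized for brevity)
    class SimpleBloomFilter {
        final BitSet bits;
        final int m, k;

        SimpleBloomFilter(int bitsPerKey, int expectedKeys) {
            m = bitsPerKey * expectedKeys;
            k = Math.max(1, (int) (0.7 * bitsPerKey)); // k ~= 0.7 * r hash functions
            bits = new BitSet(m);
        }

        void add(byte[] key) {
            long h = hash64(key);
            int h1 = (int) h, h2 = (int) (h >>> 32);
            for (int i = 0; i < k; i++) {
                bits.set(Math.floorMod(h1 + i * h2, m)); // i-th derived hash position
            }
        }

        boolean mightContain(byte[] key) {
            long h = hash64(key);
            int h1 = (int) h, h2 = (int) (h >>> 32);
            for (int i = 0; i < k; i++) {
                if (!bits.get(Math.floorMod(h1 + i * h2, m))) return false; // definitely absent
            }
            return true; // present, with false-positive probability p
        }

        // Stand-in 64-bit hash (FNV-1a); a production filter would use Murmur2 or similar
        static long hash64(byte[] data) {
            long h = 0xcbf29ce484222325L;
            for (byte b : data) { h = (h ^ (b & 0xff)) * 0x100000001b3L; }
            return h;
        }
    }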
Tuning Bloom Filters
Let r = M bits / N keys (r: num bits per key)
Let k = 0.7 * r      (k: num hashes to use; 0.7 ≈ ln 2, the optimum)
Let p = 0.6185 ** r (p: false-positive probability; 0.6185 ≈ 2^(-ln 2))

Working backwards, we can use the desired false
positive rate p to tune the data structure’s space
consumption:

r = 8, p = 2.1e-2      r = 16, p = 4.5e-4
r = 24, p = 9.8e-6     r = 32, p = 2.1e-7
r = 40, p = 4.5e-9     r = 48, p = 9.6e-11
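
These approximations are easy to verify; a small sketch that reproduces the table above (up to rounding):

    // Bloom filter sizing: given r bits per key, derive k and p
    static void printTuning() {
        for (int r = 8; r <= 48; r += 8) {
            int k = (int) Math.round(0.7 * r);  // optimal number of hashes
            double p = Math.pow(0.6185, r);     // false-positive rate
            System.out.printf("r = %d, k = %d, p = %.1e%n", r, k, p);
        }
    }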
Bloom Filter Performance
  100MM entries, 8bits/key :    833k ops/s
  100MM entries, 32bits/key :   256k ops/s
  1BN entries, 8bits/key :      714k ops/s
  1BN entries, 32bits/key :     185k ops/s

  Hypothesis: the difference between 100MM and
  1BN is due to better locality of memory access in
  the smaller bit vector
Hash-Oriented Storage
•   HashFile : a 64-bit clone of djb’s constant database “CDB”

•   Plain ol’ Key/Value storage:
     add(byte[] k, byte[] v), byte[] lookup(byte[] k)

•   Constant aka “Immutable” Data Store
     create(), add(k, v) ... , build() ... before lookup(k)

•   Use properties of the hash table to achieve
    O(1) disk seeks per lookup (usage sketch below)
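
A usage sketch of that write-once lifecycle, with a hypothetical path-based factory (the actual g414-hash API may differ):

    // Hypothetical write-once lifecycle: create -> add ... -> build -> lookup
    HashFile hf = HashFile.create("/tmp/data.hashfile");
    hf.add("key1".getBytes(), "value1".getBytes());
    hf.add("key2".getBytes(), "value2".getBytes());
    hf.build();                              // seals the store; no more add()

    byte[] v = hf.lookup("key1".getBytes()); // O(1) disk seeks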
HashFile Structure
• Header (fixed width): table pointers;
  contains offsets of the hash tables and count of
  elements per table
• Body (variable width): contains
  concatenation of all keys and values (with
  data lengths)
• Footer (fixed width): hash “tables”
  containing long hash values of keys
  alongside long offsets into body
HashFile Diagram
    HEADER                    BODY                      FOOTER
p1s3p2s4p3s2p4s1   k1v1k2v2k3v3k4v4k5v5k6v6k7v7   hk7o7hk3o3hk4o4hk1o1




       •   Create: initialize empty header, start appending
           keys/values while recording offsets and hash values
           of keys

       •   Build: take list of hash values and offsets and turn
           them into hash tables, backfill header with values

•   Lookup: compute hash(key), compute offset into
    table (hash modulo size of table), use table to find
    offset into body, return the value from body (sketch below)
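
In code, that lookup path might look like the sketch below; getLongHash, NUM_TABLES, and the header/footer read helpers are hypothetical stand-ins for the real I/O (java.util.Arrays assumed imported):

    // Hypothetical HashFile lookup: one footer probe sequence, O(1) disk seeks
    byte[] lookup(byte[] key) throws IOException {
        long h = getLongHash(key);
        int table = (int) Math.floorMod(h, (long) NUM_TABLES); // header: which table
        long tableOffset = headerTableOffset(table);           // footer location of that table
        long tableSize = headerTableCount(table);

        // linear-probe the footer table starting at hash mod table size
        for (long slot = Math.floorMod(h, tableSize), probes = 0;
             probes < tableSize; slot = (slot + 1) % tableSize, probes++) {
            long[] entry = readTableEntry(tableOffset, slot);  // {hash, body offset} (seek 1)
            if (entry[1] == 0) return null;                    // empty slot: key absent
            if (entry[0] == h) {
                byte[][] kv = readKeyValue(entry[1]);          // read from body (seek 2)
                if (Arrays.equals(kv[0], key)) return kv[1];   // confirm key, return value
            }
        }
        return null;
    }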
HashFile Performance
• Spec: ≤ 2 disk seeks per lookup
• Number of seeks independent of number
  of entries
• X25E SSD: 1BN 8-byte keys, values (41GB):
  650μs lookup w/ cold cache, up to 700x
  faster as filesystem cache warms, 0.9μs
  when in-memory
• With 100MM entries (4GB), cold cache is
  ~600μs (thanks to better locality), 0.6μs warm
Conclusions

• Be aware of different Hash Functions and
  their collision / performance tradeoffs
• Bloom Filters are extremely useful for fast,
  large-scale set membership
• HashFile provides excellent performance in
  cases where a static K/V store suffices
Future Work
• Implement the CrapWow (“cWow”) hash in Java
• Extend HashFile with configurable hash,
  pointer, and key/value lengths to conserve
  space (reduce 24 bytes-per-KV overhead)
• Implement a read-write (non-constant)
  version of HashFile
• Bloom Filter that spills to SSD
Thank You!
...Any questions? :)
References
• GitHub project: g414-hash (hash
  function, Bloom filter, and HashFile
  implementations)
• Wikipedia: Hash Function, Bloom Filter
• Non-Cryptographic Hash Function Zoo
• DJB’s CDB; sg-cdb (Java implementation)
