Hash Functions FTW*
Fast Hashing, Bloom Filters & Hash-Oriented Storage

Sunny Gleason

* For the win (see urbandictionary FTW[1]); this expression has nothing to do with hash functions
What’s in this Presentation

• Hash Function Survey
• Hash Performance
• Bloom Filters
• HashFile : Hash Storage
Hash Functions
int getIntHash(byte[] data);   // 32-bit
long getLongHash(byte[] data); // 64-bit

int v1 = hash("foo".getBytes()); int v2 = hash("goo".getBytes());

int hash(byte[] value) { // a simple rotate-and-XOR hash
    int h = 0;
    for (byte b : value) { h = (h << 5) ^ (h >>> 27) ^ b; } // >>> avoids sign extension
    return (h & 0x7FFFFFFF) % PRIME; // mask keeps the modulus non-negative
}
Hash Functions

• Goal: v1 has many bit differences from v2 (the “avalanche” property; quick check below)
• Desirable Properties:
 • Uniform distribution: as few collisions as possible
 • Very fast computation
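
A quick way to check the avalanche goal, assuming the simple hash from the previous slide: XOR the two values and count the differing bits.

    int v1 = hash("foo".getBytes());
    int v2 = hash("goo".getBytes());
    // a well-mixing 32-bit hash flips roughly half of the 32 bits
    System.out.println(Integer.bitCount(v1 ^ v2) + " of 32 bits differ");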
Hash Applications
Goal: O(1) access
   • Hash Table
   • Hash Set
   • Bloom Filter
Popular Hash Functions
• FNV Hash
• DJB Hash
• Jenkins Hash
• Murmur2
• New (Promising?): CrapWow
• Awesome & Slow: SHA-1, MD5 etc.
Evaluating Hash Functions
• Hash Function “Zoo”
• Quality of: CRC32, DJB, Jenkins, FNV, Murmur2, SHA1
• Performance (MM ops/s): [bar chart: throughput of Jenkins, Murmur2, and FNV1 at several key sizes; measurement sketch below]
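
Throughput numbers like the chart’s can be approximated with a crude harness such as the sketch below (a real benchmark would use JMH; hash() is assumed to be the candidate function under test):

    // Crude throughput measurement in millions of hashes per second
    static double mmOpsPerSec(byte[] key, int iterations) {
        int sink = 0; // fold results together so the JIT cannot drop the loop
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) { sink ^= hash(key); }
        double seconds = (System.nanoTime() - start) / 1e9;
        if (sink == 0) System.out.print(""); // keep sink observable
        return (iterations / seconds) / 1e6;
    }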
A Strawman “Set”
• N keys, K bytes per key
• Allocate array of size K * N bytes
• Utilize array storage as:
 • a heap or tree: O(lg N) insert/delete/lookup
 • a hash: O(1) insert/delete/lookup (sketch below)
• What if we don’t have room for K*N bytes?
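
To make the strawman concrete, here is a minimal open-addressing sketch over a single flat K*N-byte array, assuming fixed-width K-byte keys (class and helper names are hypothetical; deletion would additionally need tombstones):

    import java.util.Arrays;

    // Strawman set: N fixed-width K-byte keys, open-addressed in one flat array
    class FlatByteSet {
        final int K, N;
        final byte[] slots;    // K * N bytes of key storage
        final boolean[] used;  // occupancy flags

        FlatByteSet(int k, int n) { K = k; N = n; slots = new byte[k * n]; used = new boolean[n]; }

        boolean insert(byte[] key) {
            for (int i = slotOf(key), probes = 0; probes < N; i = (i + 1) % N, probes++) {
                if (!used[i]) { // claim the first empty slot: expected O(1) probes
                    System.arraycopy(key, 0, slots, i * K, K);
                    used[i] = true;
                    return true;
                }
                if (matches(key, i)) return true; // already present
            }
            return false; // table full
        }

        boolean contains(byte[] key) {
            for (int i = slotOf(key), probes = 0; used[i] && probes < N; i = (i + 1) % N, probes++) {
                if (matches(key, i)) return true;
            }
            return false;
        }

        private int slotOf(byte[] key) { return (Arrays.hashCode(key) & 0x7FFFFFFF) % N; }

        private boolean matches(byte[] key, int i) {
            return Arrays.equals(key, 0, K, slots, i * K, i * K + K);
        }
    }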
Bloom Filter
• Key Point: give up on storing all the keys
• Store r bits per key instead of K bytes
• Allocate bit vector of size: M = r * N,
  where N is expected number of entries
• Use multiple hash functions of the key to
  determine which bits to set
• Premise: if hash functions are well-distributed,
  few collisions, high accuracy (sketch below)
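
A minimal sketch of such a filter, deriving the k probe positions from one 64-bit hash via the standard double-hashing trick (names are hypothetical, not the g414-hash API):

    import java.util.BitSet;

    // Minimal Bloom filter: M = r * N bits, k probes per key (int-sized for brevity)
    class SimpleBloomFilter {
        final BitSet bits;
        final int m, k;

        SimpleBloomFilter(int bitsPerKey, int expectedKeys) {
            m = bitsPerKey * expectedKeys;
            k = Math.max(1, (int) (0.7 * bitsPerKey)); // k ~= 0.7 * r hash functions
            bits = new BitSet(m);
        }

        void add(byte[] key) {
            long h = hash64(key);
            int h1 = (int) h, h2 = (int) (h >>> 32);
            for (int i = 0; i < k; i++) {
                bits.set(Math.floorMod(h1 + i * h2, m)); // i-th derived hash position
            }
        }

        boolean mightContain(byte[] key) {
            long h = hash64(key);
            int h1 = (int) h, h2 = (int) (h >>> 32);
            for (int i = 0; i < k; i++) {
                if (!bits.get(Math.floorMod(h1 + i * h2, m))) return false; // definitely absent
            }
            return true; // present, with false-positive probability p
        }

        // Stand-in 64-bit hash (FNV-1a); a production filter would use Murmur2 or similar
        static long hash64(byte[] data) {
            long h = 0xcbf29ce484222325L;
            for (byte b : data) { h = (h ^ (b & 0xff)) * 0x100000001b3L; }
            return h;
        }
    }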
Tuning Bloom Filters
Let r = M bits / N keys (r: num bits per key)
Let k = 0.7 * r      (k: num hashes to use; 0.7 ≈ ln 2, the optimum)
Let p = 0.6185 ** r (p: false-positive probability; 0.6185 ≈ 2^(-ln 2))

Working backwards, we can use the desired false
positive rate p to tune the data structure’s space
consumption:

r = 8, p = 2.1e-2      r = 16, p = 4.5e-4
r = 24, p = 9.8e-6     r = 32, p = 2.1e-7
r = 40, p = 4.5e-9     r = 48, p = 9.6e-11
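
These approximations are easy to verify; a small sketch that reproduces the table above (up to rounding):

    // Bloom filter sizing: given r bits per key, derive k and p
    static void printTuning() {
        for (int r = 8; r <= 48; r += 8) {
            int k = (int) Math.round(0.7 * r);  // optimal number of hashes
            double p = Math.pow(0.6185, r);     // false-positive rate
            System.out.printf("r = %d, k = %d, p = %.1e%n", r, k, p);
        }
    }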
Bloom Filter Performance
  100MM entries, 8bits/key :    833k ops/s
  100MM entries, 32bits/key :   256k ops/s
  1BN entries, 8bits/key :      714k ops/s
  1BN entries, 32bits/key :     185k ops/s

  Hypothesis: the difference between 100MM and
  1BN is due to better locality of memory access in
  the smaller bit vector
Hash-Oriented Storage
•   HashFile : a 64-bit clone of djb’s constant database “CDB”

•   Plain ol’ Key/Value storage:
     add(byte[] k, byte[] v), byte[] lookup(byte[] k)

•   Constant aka “Immutable” Data Store
     create(), add(k, v) ... , build() ... before lookup(k)

•   Use properties of the hash table to achieve
    O(1) disk seeks per lookup (usage sketch below)
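
A usage sketch of that write-once lifecycle, with a hypothetical path-based factory (the actual g414-hash API may differ):

    // Hypothetical write-once lifecycle: create -> add ... -> build -> lookup
    HashFile hf = HashFile.create("/tmp/data.hashfile");
    hf.add("key1".getBytes(), "value1".getBytes());
    hf.add("key2".getBytes(), "value2".getBytes());
    hf.build();                              // seals the store; no more add()

    byte[] v = hf.lookup("key1".getBytes()); // O(1) disk seeks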
HashFile Structure
• Header (fixed width): table pointers;
  contains offsets of the hash tables and count of
  elements per table
• Body (variable width): contains
  concatenation of all keys and values (with
  data lengths)
• Footer (fixed width): hash “tables”
  containing long hash values of keys
  alongside long offsets into body
HashFile Diagram
    HEADER                    BODY                      FOOTER
p1s3p2s4p3s2p4s1   k1v1k2v2k3v3k4v4k5v5k6v6k7v7   hk7o7hk3o3hk4o4hk1o1




       •   Create: initialize empty header, start appending
           keys/values while recording offsets and hash values
           of keys

       •   Build: take list of hash values and offsets and turn
           them into hash tables, backfill header with values

•   Lookup: compute hash(key), compute offset into
    table (hash modulo size of table), use table to find
    offset into body, return the value from body (sketch below)
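
In code, that lookup path might look like the sketch below; getLongHash, NUM_TABLES, and the header/footer read helpers are hypothetical stand-ins for the real I/O (java.util.Arrays assumed imported):

    // Hypothetical HashFile lookup: one footer probe sequence, O(1) disk seeks
    byte[] lookup(byte[] key) throws IOException {
        long h = getLongHash(key);
        int table = (int) Math.floorMod(h, (long) NUM_TABLES); // header: which table
        long tableOffset = headerTableOffset(table);           // footer location of that table
        long tableSize = headerTableCount(table);

        // linear-probe the footer table starting at hash mod table size
        for (long slot = Math.floorMod(h, tableSize), probes = 0;
             probes < tableSize; slot = (slot + 1) % tableSize, probes++) {
            long[] entry = readTableEntry(tableOffset, slot);  // {hash, body offset} (seek 1)
            if (entry[1] == 0) return null;                    // empty slot: key absent
            if (entry[0] == h) {
                byte[][] kv = readKeyValue(entry[1]);          // read from body (seek 2)
                if (Arrays.equals(kv[0], key)) return kv[1];   // confirm key, return value
            }
        }
        return null;
    }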
HashFile Performance
• Spec: ≤ 2 disk seeks per lookup
• Number of seeks independent of number
  of entries
• X25E SSD: 1BN 8-byte keys, values (41GB):
  650μs lookup w/ cold cache, up to 700x
  faster as filesystem cache warms, 0.9μs
  when in-memory
• With 100MM entries (4GB), cold cache is
  ~600μs (thanks to better locality), 0.6μs warm
Conclusions

• Be aware of different Hash Functions and
  their collision / performance tradeoffs
• Bloom Filters are extremely useful for fast,
  large-scale set membership
• HashFile provides excellent performance in
  cases where a static K/V store suffices
Future Work
• Implement the CrapWow (“cWow”) hash in Java
• Extend HashFile with configurable hash,
  pointer, and key/value lengths to conserve
  space (reduce 24 bytes-per-KV overhead)
• Implement a read-write (non-constant)
  version of HashFile
• Bloom Filter that spills to SSD
Thank You!
...Any questions? :)
References
• GitHub project: g414-hash (hash
  function, Bloom filter, and HashFile
  implementations)
• Wikipedia: Hash Function, Bloom Filter
• Non-Cryptographic Hash Function Zoo
• DJB’s CDB; sg-cdb (Java implementation)
