Faster persistent data structures through hashing

                    Johan Tibell
               johan.tibell@gmail.com


                    2011-02-15
Motivating problem: Twitter data analysis



   “I’m computing a communication graph from Twitter data and
   then scan it daily to allocate social capital to nodes behaving in a
   good karmic manner. The graph is culled from 100 million tweets
   and has about 3 million nodes.”

   We need a data structure that is
       fast when used with string keys, and
       doesn’t use too much memory.
Persistent maps in Haskell




       Data.Map is the most commonly used map type.
       It’s implemented using size balanced trees.
       Keys can be of any type, as long as values of the type can be
       ordered.
Real world performance of Data.Map




      Good in theory: no more than O(log n) comparisons.
      Not great in practice: up to O(log n) comparisons!
      Many common types are expensive to compare, e.g. String,
      ByteString, and Text.
      Given a string of length k, we need O(k ∗ log n) comparisons
      to look up an entry.
Hash tables




      Hash tables perform well with string keys: O(k) amortized
      lookup time for strings of length k.
      However, we want persistent maps, not mutable hash tables.
Milan Straka’s idea: Patricia trees as sparse arrays




       We can use hashing without using hash tables!
       A Patricia tree implements a persistent, sparse array.
       Patricia trees are about twice as fast as size balanced trees,
       but only work with Int keys.
       Use hashing to derive an Int from an arbitrary key.
Implementation tricks




      Patricia trees implement a sparse, persistent array of size 2^32
      (or 2^64).
      Hashing using this many buckets makes collisions rare: for 2^24
      entries we expect about 32,000 single collisions.
      Linked lists are a perfectly adequate collision resolution
      strategy.
First attempt at an implementation

  -- Defined in the containers package.
  data IntMap a
      = Nil
      | Tip {-# UNPACK #-} !Key a
      | Bin {-# UNPACK #-} !Prefix
            {-# UNPACK #-} !Mask
            !(IntMap a) !(IntMap a)

   type Prefix = Int
   type Mask   = Int
   type Key    = Int

   newtype HashMap k v = HashMap (IntMap [(k, v)])
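To make the idea concrete, here is a hedged sketch of how insert and lookup might work for this first attempt. This is not the actual package code: hashStr is a toy stand-in for a real hash function, and real keys would be hashed through a Hashable class rather than being fixed to String.

```haskell
-- Sketch only: bucket key/value pairs in an IntMap by the key's hash,
-- with a plain association list per bucket for collisions.
import qualified Data.IntMap as IM

newtype HashMap v = HashMap (IM.IntMap [(String, v)])

-- Toy djb2-style hash; a stand-in, not what any real package uses.
hashStr :: String -> Int
hashStr = foldl (\h c -> 33 * h + fromEnum c) 5381

empty :: HashMap v
empty = HashMap IM.empty

-- Insert: find the bucket by hash, replace any existing binding for
-- the key, keep the other colliding pairs.
insert :: String -> v -> HashMap v -> HashMap v
insert k v (HashMap m) =
  HashMap (IM.insertWith (\new old -> new ++ filter ((/= k) . fst) old)
                         (hashStr k) [(k, v)] m)

-- Lookup: one IntMap lookup, then a scan of the (short) collision list.
lookupHM :: String -> HashMap v -> Maybe v
lookupHM k (HashMap m) = IM.lookup (hashStr k) m >>= lookup k
```

The key property: however long the keys are, only the pairs that land in the same bucket are ever compared with (==).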
A more memory efficient implementation

   data HashMap k v
       = Nil
       | Tip {-# UNPACK #-} !Hash
             {-# UNPACK #-} !(FL.FullList k v)
       | Bin {-# UNPACK #-} !Prefix
             {-# UNPACK #-} !Mask
             !(HashMap k v) !(HashMap k v)

   type Prefix = Int
   type Mask   = Int
   type Hash   = Int

   data FullList k v = FL !k !v !(List k v)
   data List k v = Nil | Cons !k !v !(List k v)
Reducing the memory footprint




       List k v uses 2 fewer words per key/value pair than [(k, v)].
       FullList can be unpacked into the Tip constructor as it's a
       product type, saving 2 more words.
      Always unpack word sized types, like Int, unless you really
      need them to be lazy.
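As an illustration of the unpacking advice (hypothetical types, shown only to demonstrate the pragma):

```haskell
-- Illustrative only: an UNPACKed strict Int field is stored inline in
-- the constructor, while a plain Int field is a pointer to a separate
-- heap object (possibly an unevaluated thunk).
data Packed = Packed {-# UNPACK #-} !Int

data Lazy = Lazy Int  -- keeps the field lazy, at the cost of a pointer

getPacked :: Packed -> Int
getPacked (Packed n) = n
```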
Benchmarks




  Keys: 2^12 random 8-byte ByteStrings

             Map    HashMap
    insert   1.00     0.43
   lookup    1.00     0.28

  delete performs like insert.
Can we do better?




      We still need to perform O(min(W , log n)) Int comparisons,
      where W is the number of bits in a word.
      The memory overhead per key/value pair is still quite high.
Borrowing from our neighbours




      Clojure uses a hash-array mapped trie (HAMT) data structure
      to implement persistent maps.
      Described in the paper “Ideal Hash Trees” by Bagwell (2001).
      Originally a mutable data structure implemented in C++.
      Clojure’s persistent version was created by Rich Hickey.
Hash-array mapped tries in Clojure




      Shallow tree with high branching factor.
      Each node, except the leaf nodes, contains an array of up to
      32 elements.
      5 bits of the hash are used to index the array at each level.
      A clever trick, using bit population count, is used to represent
      sparse arrays.
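The trick can be sketched as follows (a minimal illustration with made-up function names, not Clojure's code): a bitmap records which of the 32 logical slots of a node are occupied, and a slot's position in the packed physical array is the number of set bits below it.

```haskell
-- Minimal sketch of the population-count indexing trick.
import Data.Bits (popCount, (.&.), bit, shiftR)
import Data.Word (Word32)

-- 5 bits of the hash select one of 32 logical slots at each level.
slotFor :: Int -> Int -> Int
slotFor hashVal level = (hashVal `shiftR` (5 * level)) .&. 31

-- Physical index into the packed array: the number of occupied slots
-- (set bits in the bitmap) below the logical slot.
sparseIndex :: Word32 -> Int -> Int
sparseIndex bitmap slot = popCount (bitmap .&. (bit slot - 1))
```

For example, with bitmap 0b101101 (slots 0, 2, 3, and 5 occupied), logical slot 3 maps to physical index 2, because exactly two slots below it are occupied. The node then only stores an array of length popCount bitmap.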
The Haskell definition of a HAMT

   data HashMap k v
       = Empty
       | BitmapIndexed {-# UNPACK #-} !Bitmap
                       {-# UNPACK #-} !(Array (HashMap k v))
       | Leaf {-# UNPACK #-} !(Leaf k v)
       | Full {-# UNPACK #-} !(Array (HashMap k v))
       | Collision {-# UNPACK #-} !Hash
                   {-# UNPACK #-} !(Array (Leaf k v))

   type Bitmap = Word

   data Array a = Array !(Array# a)
                        {-# UNPACK #-} !Int
Making it fast




       The initial implementation, by Edward Z. Yang, was correct
       but didn't perform well.
       Improved performance by
           replacing use of Data.Vector by a specialized array type,
           paying careful attention to strictness, and
           using GHC’s new INLINABLE pragma.
Benchmarks




  Keys: 2^12 random 8-byte ByteStrings

             Map    HashMap    HashMap (HAMT)
    insert   1.00     0.43          1.21
   lookup    1.00     0.28          0.21
Where is all the time spent?




       Most time in insert is spent copying small arrays.
       Array copying is implemented using indexArray# and
       writeArray#, which results in poor performance.
       When cloning an array, we are forced to first fill the new array
       with dummy elements and then copy over the elements from
       the old array.
A better array copy




      Daniel Peebles has implemented a set of new primops for
      copying arrays in GHC.
      The first implementation showed a 20% performance
      improvement for insert.
      Copying arrays is still slow, so there might still be room for
      big improvements.
Other possible performance improvements



      Even faster array copying using SSE instructions, inline
      memory allocation, and CMM inlining.
      Use dedicated bit population count instruction on
      architectures where it’s available.
      Clojure uses a clever trick to unpack keys and values directly
      into the arrays; keys are stored at even positions and values at
      odd positions.
      GHC 7.2 will use 1 less word for Array.
Optimize common cases



      In many cases maps are created in one go from a sequence of
      key/value pairs.
      We can optimize for this case by repeatedly mutating the
      HAMT and freezing it when we're done.
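The mutate-then-freeze pattern can be sketched with standard ST arrays as a stand-in for the HAMT (an analogy only; the names are made up):

```haskell
-- Sketch of mutate-then-freeze: apply every update to a mutable
-- structure in place, then freeze it into an immutable value once at
-- the end. runSTArray performs the freeze without an extra copy.
import Data.Array (Array, elems)
import Data.Array.ST (newArray, writeArray, runSTArray)

buildFrozen :: Int -> [(Int, v)] -> Array Int (Maybe v)
buildFrozen size kvs = runSTArray $ do
  arr <- newArray (0, size - 1) Nothing   -- mutable while we build
  mapM_ (\(i, v) -> writeArray arr i (Just v)) kvs
  return arr                              -- frozen on exit
```

Because no intermediate immutable versions are shared with anyone, the in-place updates are unobservable, which is what makes this safe for fromList.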

  Keys: 2^12 random 8-byte ByteStrings

      fromList/pure     1.00
   fromList/mutating    0.50
Abstracting over collection types




       We will soon have two map types worth using (one ordered
       and one unordered).
       We want to write functions that work with both types, without
       an O(n) conversion cost.
       Use type families to abstract over the different concrete
       implementations.
Summary




     Hashing allows us to create more efficient data structures.
     There are new interesting data structures out there that have,
     or could have, an efficient persistent implementation.
