Faster persistent data structures through hashing

                    Johan Tibell
               johan.tibell@gmail.com


                    2011-02-15
Motivating problem: Twitter data analysis



   “I’m computing a communication graph from Twitter data and
   then scan it daily to allocate social capital to nodes behaving in a
   good karmic manner. The graph is culled from 100 million tweets
   and has about 3 million nodes.”

   We need a data structure that is
       fast when used with string keys, and
       doesn’t use too much memory.
Persistent maps in Haskell




       Data.Map is the most commonly used map type.
       It’s implemented using size balanced trees.
       Keys can be of any type, as long as values of the type can be
       ordered.
Real world performance of Data.Map




      Good in theory: no more than O(log n) comparisons.
      Not great in practice: up to O(log n) comparisons!
      Many common types are expensive to compare, e.g. String,
      ByteString, and Text.
      Given a string of length k, we need O(k ∗ log n) comparisons
      to look up an entry.
Hash tables




      Hash tables perform well with string keys: O(k) amortized
      lookup time for strings of length k.
      However, we want persistent maps, not mutable hash tables.
Milan Straka’s idea: Patricia trees as sparse arrays




       We can use hashing without using hash tables!
       A Patricia tree implements a persistent, sparse array.
       Patricia trees are about twice as fast as size balanced trees,
       but only work with Int keys.
       Use hashing to derive an Int from an arbitrary key.
Implementation tricks




      Patricia trees implement a sparse, persistent array of size 2^32
      (or 2^64).
      Hashing using this many buckets makes collisions rare: for 2^24
      entries we expect about 32,000 single collisions.
      Linked lists are a perfectly adequate collision resolution
      strategy.
First attempt at an implementation

  -- Defined in the containers package.
  data IntMap a
      = Nil
      | Tip {-# UNPACK #-} !Key a
      | Bin {-# UNPACK #-} !Prefix
            {-# UNPACK #-} !Mask
            !(IntMap a) !(IntMap a)

   type Prefix = Int
   type Mask   = Int
   type Key    = Int

   newtype HashMap k v = HashMap (IntMap [(k, v)])
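To make the idea concrete, here is a hedged sketch of how insert and lookup might work for this first attempt. This is not the actual package code: hashStr is a toy stand-in for a real hash function, and real keys would be hashed through a Hashable class rather than being fixed to String.

```haskell
-- Sketch only: bucket key/value pairs in an IntMap by the key's hash,
-- with a plain association list per bucket for collisions.
import qualified Data.IntMap as IM

newtype HashMap v = HashMap (IM.IntMap [(String, v)])

-- Toy djb2-style hash; a stand-in, not what any real package uses.
hashStr :: String -> Int
hashStr = foldl (\h c -> 33 * h + fromEnum c) 5381

empty :: HashMap v
empty = HashMap IM.empty

-- Insert: find the bucket by hash, replace any existing binding for
-- the key, keep the other colliding pairs.
insert :: String -> v -> HashMap v -> HashMap v
insert k v (HashMap m) =
  HashMap (IM.insertWith (\new old -> new ++ filter ((/= k) . fst) old)
                         (hashStr k) [(k, v)] m)

-- Lookup: one IntMap lookup, then a scan of the (short) collision list.
lookupHM :: String -> HashMap v -> Maybe v
lookupHM k (HashMap m) = IM.lookup (hashStr k) m >>= lookup k
```

The key property: however long the keys are, only the pairs that land in the same bucket are ever compared with (==).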
A more memory efficient implementation

   data HashMap k v
       = Nil
       | Tip {-# UNPACK #-} !Hash
             {-# UNPACK #-} !(FL.FullList k v)
       | Bin {-# UNPACK #-} !Prefix
             {-# UNPACK #-} !Mask
             !(HashMap k v) !(HashMap k v)

   type Prefix = Int
   type Mask   = Int
   type Hash   = Int

   data FullList k v = FL !k !v !(List k v)
   data List k v = Nil | Cons !k !v !(List k v)
Reducing the memory footprint




       List k v uses 2 fewer words per key/value pair than [(k, v)].
       FullList can be unpacked into the Tip constructor as it's a
       product type, saving 2 more words.
      Always unpack word sized types, like Int, unless you really
      need them to be lazy.
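As an illustration of the unpacking advice (hypothetical types, shown only to demonstrate the pragma):

```haskell
-- Illustrative only: an UNPACKed strict Int field is stored inline in
-- the constructor, while a plain Int field is a pointer to a separate
-- heap object (possibly an unevaluated thunk).
data Packed = Packed {-# UNPACK #-} !Int

data Lazy = Lazy Int  -- keeps the field lazy, at the cost of a pointer

getPacked :: Packed -> Int
getPacked (Packed n) = n
```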
Benchmarks




  Keys: 2^12 random 8-byte ByteStrings

             Map    HashMap
    insert   1.00     0.43
   lookup    1.00     0.28

  delete performs like insert.
Can we do better?




      We still need to perform O(min(W , log n)) Int comparisons,
      where W is the number of bits in a word.
      The memory overhead per key/value pair is still quite high.
Borrowing from our neighbours




      Clojure uses a hash-array mapped trie (HAMT) data structure
      to implement persistent maps.
      Described in the paper “Ideal Hash Trees” by Bagwell (2001).
      Originally a mutable data structure implemented in C++.
      Clojure’s persistent version was created by Rich Hickey.
Hash-array mapped tries in Clojure




      Shallow tree with high branching factor.
      Each node, except the leaf nodes, contains an array of up to
      32 elements.
      5 bits of the hash are used to index the array at each level.
      A clever trick, using bit population count, is used to represent
      sparse arrays.
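The trick can be sketched as follows (a minimal illustration with made-up function names, not Clojure's code): a bitmap records which of the 32 logical slots of a node are occupied, and a slot's position in the packed physical array is the number of set bits below it.

```haskell
-- Minimal sketch of the population-count indexing trick.
import Data.Bits (popCount, (.&.), bit, shiftR)
import Data.Word (Word32)

-- 5 bits of the hash select one of 32 logical slots at each level.
slotFor :: Int -> Int -> Int
slotFor hashVal level = (hashVal `shiftR` (5 * level)) .&. 31

-- Physical index into the packed array: the number of occupied slots
-- (set bits in the bitmap) below the logical slot.
sparseIndex :: Word32 -> Int -> Int
sparseIndex bitmap slot = popCount (bitmap .&. (bit slot - 1))
```

For example, with bitmap 0b101101 (slots 0, 2, 3, and 5 occupied), logical slot 3 maps to physical index 2, because exactly two slots below it are occupied. The node then only stores an array of length popCount bitmap.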
The Haskell definition of a HAMT

   data HashMap k v
       = Empty
       | BitmapIndexed {-# UNPACK #-} !Bitmap
                       {-# UNPACK #-} !(Array (HashMap k v))
       | Leaf {-# UNPACK #-} !(Leaf k v)
       | Full {-# UNPACK #-} !(Array (HashMap k v))
       | Collision {-# UNPACK #-} !Hash
                   {-# UNPACK #-} !(Array (Leaf k v))

   type Bitmap = Word

   data Array a = Array !(Array# a)
                        {-# UNPACK #-} !Int
Making it fast




       The initial implementation, by Edward Z. Yang, was correct
       but didn't perform well.
       Improved performance by
           replacing use of Data.Vector by a specialized array type,
           paying careful attention to strictness, and
           using GHC’s new INLINABLE pragma.
Benchmarks




  Keys: 2^12 random 8-byte ByteStrings

             Map    HashMap    HashMap (HAMT)
    insert   1.00     0.43          1.21
   lookup    1.00     0.28          0.21
Where is all the time spent?




       Most time in insert is spent copying small arrays.
       Array copying is implemented using indexArray# and
       writeArray#, which results in poor performance.
       When cloning an array, we are forced to first fill the new array
       with dummy elements and then copy over the elements from
       the old array.
A better array copy




      Daniel Peebles has implemented a set of new primops for
      copying arrays in GHC.
      The first implementation showed a 20% performance
      improvement for insert.
      Copying arrays is still slow, so there might still be room for
      big improvements.
Other possible performance improvements



      Even faster array copying using SSE instructions, inline
      memory allocation, and CMM inlining.
      Use dedicated bit population count instruction on
      architectures where it’s available.
      Clojure uses a clever trick to unpack keys and values directly
      into the arrays; keys are stored at even positions and values at
      odd positions.
      GHC 7.2 will use 1 less word for Array.
Optimize common cases



      In many cases maps are created in one go from a sequence of
      key/value pairs.
      We can optimize for this case by repeatedly mutating the
      HAMT and freezing it when we're done.
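The mutate-then-freeze pattern can be sketched with standard ST arrays as a stand-in for the HAMT (an analogy only; the names are made up):

```haskell
-- Sketch of mutate-then-freeze: apply every update to a mutable
-- structure in place, then freeze it into an immutable value once at
-- the end. runSTArray performs the freeze without an extra copy.
import Data.Array (Array, elems)
import Data.Array.ST (newArray, writeArray, runSTArray)

buildFrozen :: Int -> [(Int, v)] -> Array Int (Maybe v)
buildFrozen size kvs = runSTArray $ do
  arr <- newArray (0, size - 1) Nothing   -- mutable while we build
  mapM_ (\(i, v) -> writeArray arr i (Just v)) kvs
  return arr                              -- frozen on exit
```

Because no intermediate immutable versions are shared with anyone, the in-place updates are unobservable, which is what makes this safe for fromList.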

  Keys: 2^12 random 8-byte ByteStrings

      fromList/pure     1.00
   fromList/mutating    0.50
Abstracting over collection types




       We will soon have two map types worth using (one ordered
       and one unordered).
       We want to write functions that work with both types, without
       an O(n) conversion cost.
       Use type families to abstract over the different concrete
       implementations.
Summary




     Hashing allows us to create more efficient data structures.
     There are new interesting data structures out there that have,
     or could have, an efficient persistent implementation.
