Persistent Data Structures by @aradzie

  1. Persistent Data Structures. Living in a world where nothing changes but everything evolves, or: a complete idiot's guide to immutability.
  2. Java vs Haskell
      ● Java: warm, soft and cute; imperative; object oriented; just like good old Basic, but with classes
      ● Haskell: a strange, unfamiliar alien; purely functional; everything is different; shocking news, it's not like Basic!
  3. Haskell does not have variables! Imagine a dialect of Java where everything is final by default:

      class LinkedList {
        class Node {
          final Node next, prev;
          final Object value;
        }
        final Node head, tail;
        void add(final Object v) {
          for (final Node n = head; n != null; n = n.next) { ... }
        }
      }

      All fields, parameters and variables are automatically immutable, the final is implied everywhere, and there is no way to get rid of it.
  4. Haskell does not have variables! (The same code, met with audience reactions: "But it doesn't make sense!" "It won't work!" "It does for me!") All fields, parameters and variables are automatically immutable; the final is implied everywhere.
  5. What is a variable? var·y /ˈve(ə)rē/ (vary, varied, varying)
      ● verb (used with object): to change or alter, as in form, appearance, character, or substance
      ● verb (used without object): to undergo change in appearance, form, substance, character, etc.
      ● synonyms: modify, mutate
  6. "Variables" in Haskell
      ● Must be assigned once declared. YES: int a = 1; NO: int a;
      ● Cannot be reassigned. YES: final int a = 1; NO: a = 2;
      These are mathematical variables, not imperative ones!
  7. When everything is immutable, there is no notion of time:
      ● Functions take old values and produce new values; nothing is changed in place
      ● It does not matter when a function was called, it only matters what arguments it was called with
      There is no notion of identity:
      ● Everything is a value; complex data structures are values too
      ● There is no way to tell if a == b, only if a.equals(b)
      ● In other words, values are never identical to each other, but may be equal
  8. I want my linked list! Basic terminology:
      ● Ephemeral data structure: everything that is not persistent. Most Java data structures (lists, sets, etc.) are ephemeral.
      ● Persistent data structure: an immutable data structure with history. No in-place modifications; operations on it create new versions, and older versions are always available. That. Is. Simple.
      ● The persistence property has nothing to do with persistent storage, like disks! That is a completely different story.
  9. I want my linked list!
      ● In imperative languages like Java, most data structures are ephemeral by default. Designing persistent data structures is somewhat awkward and not always efficient.
      ● In purely functional languages like Haskell, all data structures are automatically persistent! There is just no other way to make data structures.
  10. History of updates. Making an update to a persistent data structure instance always creates a new instance that contains the update; the current version is left unmodified.
  11. Why should I bother? Is it fun? Hell yeah! But is it practical? Let's see!
  12. The free lunch is over! "The biggest sea change in software development since the OO revolution is knocking at the door, and its name is Concurrency." (Herb Sutter) Even commodity hardware (my laptop) is multicore now. The need for writing correct multi-threaded code is constantly increasing.
  13. Concurrent data structures are hard! Want a concurrent ephemeral linked list? Here are some implementation strategies:
      ● Coarse-grained synchronization
      ● Fine-grained synchronization
      ● Optimistic synchronization
      ● Lazy synchronization
      All lock-based: no composition, deadlocks, etc.
      ● Non-blocking synchronization, in different flavors
      And if you need the size of the list, you are in trouble!
  14. Concurrent data structures are hard!
      ● Making mutable concurrent data structures requires inter-thread coordination within these structures
      ● Locks and atomic references all over the place
      ● Decades of research by academia, with many attempts
      ● Sophisticated algorithms that are hard to reason about, test and prove
      ● Several different ways to solve the same problems, each with its own pros and cons
  15. Concurrent data structures are hard! (The same slide, with a question from the audience: yes, but are persistent data structures actually simpler?)
  16. Just give up mutability!
      ● Persistent data structures are easy to reason about in a concurrent environment
      ● Their behavior does not depend on how many threads are trying to "modify" them at once
      ● Therefore persistent data structures are very easy to test and debug
  17. The whole picture
      ● Persistent data structures alone are not sufficient. They are an essential part of the picture, but not the whole answer to concurrency.
      ● Inter-thread coordination is needed. Threads still need to know what the other threads are doing, to agree on a common outcome.
      ● But it can be added "outside", which gives us complete separation of concerns.
  18. The whole picture. Solving the concurrency challenge in a modern language:
      ● The Scala way: persistent data structures with message passing
      ● The Clojure way: persistent data structures with software transactional memory
      ● The two will likely be mixed in the future
  19. Last few words on concurrency
      ● Persistent data structures are slower than ephemeral ones in sequential use
      ● But not that much slower!
      ● We can forgive that, since they give you more functionality; ephemeral data structures are simply less capable
      ● And in the multiprocessor era, it is better to make things scalable rather than merely fast
  20. Efficient persistent data structures. We want persistent data structures to be space and time efficient:
      ● Structural sharing: reuse as many fragments of the previous version as possible
      ● Path copying: copy as few pieces as possible
      ● Maybe, just maybe, lazy evaluation (where available): we don't want nasty pathological cases
  21. A case study
      ● Let's make some persistent data structures in Java
      ● All these structures consist of classes with only final fields ("Why are you looking at me?!")
      ● With good amortized asymptotic complexity in most cases
  22. Our plan. Let's start with some trivial examples:
      ● Stack
      ● Queue
      ● Tree
      Then proceed with more advanced structures:
      ● Hash Table
      ● Finger Tree
  23. Trivial Example — Persistent Stack. It's just a singly linked list of nodes:

      class Stack<T> {
        final T v;              (a)
        final Stack<T> next;    (b)

        Stack() {
          v = null;
          next = null;
        }

        Stack(T v, Stack<T> next) {
          this.v = v;
          this.next = next;
        }
        ...
      Source Code 1/2
  24. Trivial Example — Persistent Stack

      class Stack<T> {
        ...
        Stack<T> push(T v) {
          return new Stack<T>(v, this);    (a)
        }

        T peek() {
          if (next == null) throw new NoSuchElementException();
          return v;    (b)
        }

        Stack<T> pop() {
          if (next == null) throw new NoSuchElementException();
          return next;    (c)
        }
      Source Code 2/2
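The two slides above assemble into a compilable class. A minimal sketch; the isEmpty helper is not on the slides, it is assumed here for the demo:

```java
import java.util.NoSuchElementException;

// Demo first: pushing never touches older versions, it only wraps them.
class StackDemo {
    public static void main(String[] args) {
        Stack<Integer> empty = new Stack<>();
        Stack<Integer> one = empty.push(1);   // new version; `empty` is untouched
        Stack<Integer> two = one.push(2);     // shares the node holding 1 with `one`
        System.out.println(two.peek());       // 2
        System.out.println(two.pop() == one); // true: pop just returns the shared tail
        System.out.println(empty.isEmpty());  // true
    }
}

class Stack<T> {
    final T v;
    final Stack<T> next;

    Stack() { v = null; next = null; }  // the empty sentinel
    Stack(T v, Stack<T> next) { this.v = v; this.next = next; }

    boolean isEmpty() { return next == null; }

    Stack<T> push(T v) { return new Stack<>(v, this); }

    T peek() {
        if (next == null) throw new NoSuchElementException();
        return v;
    }

    Stack<T> pop() {
        if (next == null) throw new NoSuchElementException();
        return next;
    }
}
```

Note that pop does no copying at all: every older version is literally the tail of the newer one, which is the structural sharing shown on the next slide.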
  25. Trivial Example — Persistent Stack. Structural sharing in a persistent stack.
  26. Trivial Example — Persistent Stack. Looks familiar? The versions tree!
  27. Trivial Example — Persistent Stack. Also known as a spaghetti stack or cactus stack.
  28. Persistent Queue. It's just two stacks combined:
      ● Back stack to enqueue items
      ● Front stack to dequeue items
      When the front stack is empty, reverse the back stack and use it as the front stack.
  29. Persistent Queue

      class Queue<T> {
        // back stack - push elements here
        final Stack<T> b;    (a)
        // front stack - pop elements from here
        final Stack<T> f;    (b)

        Queue() {
          b = f = new Stack<T>();
        }

        Queue(Stack<T> b, Stack<T> f) {
          this.b = b;
          this.f = f;
        }

        boolean isEmpty() {
          return f.isEmpty();    (c)
        }
        ...
      Source Code 1/3
  30. Persistent Queue

      class Queue<T> {
        ...
        static <T> Queue<T> check(Stack<T> b, Stack<T> f) {
          if (f.isEmpty())
            return new Queue<T>(f, b.reverse());    (a)
          else
            return new Queue<T>(b, f);              (b)
        }

        Queue<T> push(T v) {
          return check(b.push(v), f);
        }

        Queue<T> pop() {
          if (isEmpty()) {
            throw new NoSuchElementException();
          }
          return check(b, f.pop());
        }
      Source Code 2/3
  31. Persistent Queue

      class Queue<T> {
        ...
        T peek() {
          if (isEmpty()) {
            throw new NoSuchElementException();
          }
          return f.peek();
        }

      class Stack<T> {
        ...
        Stack<T> reverse() {
          if (isEmpty() || next.isEmpty()) return this;
          Stack<T> r = new Stack<T>();
          for (Stack<T> s = this; !s.isEmpty(); s = s.pop()) {
            r = r.push(s.peek());
          }
          return r;
        }
      Source Code 3/3
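Assembled into one compilable sketch, the three queue slides look like this. The empty-queue guards live on the Queue here, so this Stack's peek and pop are left unguarded; isEmpty and reverse are the helpers the queue relies on:

```java
import java.util.NoSuchElementException;

// Demo first: FIFO order falls out of the two-stack trick.
class QueueDemo {
    public static void main(String[] args) {
        Queue<Integer> q = new Queue<Integer>().push(1).push(2).push(3);
        System.out.println(q.peek());       // 1: first in, first out
        System.out.println(q.pop().peek()); // 2: popping may trigger one reversal
        System.out.println(q.peek());       // still 1: the old version is unchanged
    }
}

class Queue<T> {
    final Stack<T> b; // back stack: push here
    final Stack<T> f; // front stack: pop from here

    Queue() { b = f = new Stack<>(); }
    Queue(Stack<T> b, Stack<T> f) { this.b = b; this.f = f; }

    boolean isEmpty() { return f.isEmpty(); }

    // Restore the invariant: the front stack is empty only if the queue is empty.
    static <T> Queue<T> check(Stack<T> b, Stack<T> f) {
        return f.isEmpty() ? new Queue<>(f, b.reverse()) : new Queue<>(b, f);
    }

    Queue<T> push(T v) { return check(b.push(v), f); }

    T peek() {
        if (isEmpty()) throw new NoSuchElementException();
        return f.peek();
    }

    Queue<T> pop() {
        if (isEmpty()) throw new NoSuchElementException();
        return check(b, f.pop());
    }
}

class Stack<T> {
    final T v;
    final Stack<T> next;

    Stack() { v = null; next = null; }
    Stack(T v, Stack<T> next) { this.v = v; this.next = next; }

    boolean isEmpty() { return next == null; }
    Stack<T> push(T v) { return new Stack<>(v, this); }
    T peek() { return v; }
    Stack<T> pop() { return next; }

    Stack<T> reverse() {
        if (isEmpty() || next.isEmpty()) return this;
        Stack<T> r = new Stack<>();
        for (Stack<T> s = this; !s.isEmpty(); s = s.pop()) r = r.push(s.peek());
        return r;
    }
}
```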
  32. Persistent Queue. Structural sharing in a persistent queue.
  33. Persistent Queue. Beware pathological cases!
      ● What if the front stack is empty, but the back stack is full?
      ● And we are going to pop from the same queue version N times?
      ● Then we get N back stack reversals!
      ● Lazy evaluation to the rescue: use lazy streams instead of strict stacks
  34. Persistent Queue. But there is a better way to design a queue! The monoidally annotated 2-3 finger tree is a versatile data structure that can be used to build efficient lists, deques, priority queues, interval trees, ropes, etc. It is more complex; we will take a look at it later.
  35. Persistent Tree
      ● It is trivial to convert any ephemeral tree to a persistent one by means of path copying
      ● It works for binary trees, 2-3 trees, B-trees, etc.
      ● The shape of the tree is not affected, only the mutating algorithms
      ● In a balanced binary tree at most log N nodes need to be copied — quite efficient
      ● The secret to all persistent data structures is that they are all trees! (Yes, lists and hash tables are trees too)
  36. Persistent Tree
  37. Simple Persistent Binary Tree

      class SimpleBinaryTree<K extends Comparable<K>, V> {
        static class Node<K, V> {
          final K key;            (a)
          final V value;          (b)
          final Node<K, V> l, r;  (c)

          Node(K key, V value, Node<K, V> l, Node<K, V> r) {
            this.key = key;
            this.value = value;
            this.l = l;
            this.r = r;
          }
        }
        ...
      Source Code 1/2
  38. Simple Persistent Binary Tree

      class SimpleBinaryTree<K extends Comparable<K>, V> {
        ...
        static <K extends Comparable<K>, V>
        Node<K, V> insert(Node<K, V> n, K key, V value) {
          if (n == null) {
            return new Node<>(key, value, null, null);    (a)
          }
          int cmp = key.compareTo(n.key);                 (b)
          if (cmp < 0) {
            return new Node<>(n.key, n.value,             (c)
                insert(n.l, key, value), n.r);
          }
          if (cmp > 0) {
            return new Node<>(n.key, n.value,             (d)
                n.l, insert(n.r, key, value));
          }
          return new Node<>(key, value, n.l, n.r);        (e)
        }
      Source Code 2/2
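Path copying in action: a runnable sketch of the insert above, specialized to String keys and Integer values to keep it short. The get helper is not on the slides, it is added for the demo:

```java
// Path copying: insert rebuilds only the root-to-insertion-point path,
// reusing every untouched subtree of the previous version as-is.
class TreeDemo {
    static final class Node {
        final String key;
        final Integer value;
        final Node l, r;
        Node(String key, Integer value, Node l, Node r) {
            this.key = key; this.value = value; this.l = l; this.r = r;
        }
    }

    static Node insert(Node n, String key, Integer value) {
        if (n == null) return new Node(key, value, null, null);
        int cmp = key.compareTo(n.key);
        if (cmp < 0) return new Node(n.key, n.value, insert(n.l, key, value), n.r);
        if (cmp > 0) return new Node(n.key, n.value, n.l, insert(n.r, key, value));
        return new Node(key, value, n.l, n.r); // replace value, reuse both subtrees
    }

    static Integer get(Node n, String key) {
        while (n != null) {
            int cmp = key.compareTo(n.key);
            if (cmp == 0) return n.value;
            n = cmp < 0 ? n.l : n.r;
        }
        return null;
    }

    public static void main(String[] args) {
        Node v1 = insert(insert(insert(null, "m", 1), "c", 2), "t", 3);
        Node v2 = insert(v1, "c", 99);    // copies only the path root -> "c"
        System.out.println(get(v1, "c")); // 2: the old version is unchanged
        System.out.println(get(v2, "c")); // 99
        System.out.println(v1.r == v2.r); // true: the untouched subtree is shared
    }
}
```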
  39. Persistent Tree. Multiple definitions of persistence:
      ● An immutable data structure with history
      ● Committed to persistent storage
      Append-only databases and file systems:
      ● CouchDB uses an append-only B-tree
      ● RethinkDB makes an append-only variant of MySQL
      ● ZFS and BTRFS implement copy-on-write transactions and snapshots
      Nothing is new under the moon!
  40. Persistent Map

      interface Map<K, V> {
        // get value for a key, or null if not found
        V get(K key);
        // make key/value association
        Map<K, V> put(K key, V value);
        // remove key/value association
        Map<K, V> remove(K key);
      }

      Remember, no in-place updates. Mutations create new instances.
  41. Persistent Map. Implementation strategy:
      ● Persistent red-black tree for ordered keys; time complexity O(log n)
      ● Persistent hash table for hashable keys; time complexity effectively O(1)
  42. Persistent Hash Table. But how do we implement it? Copying the whole table would be too expensive!
  43. Persistent Hash Table. Here's the idea: partition the hash table into smaller pieces, organized as a persistent tree. Nice idea, but how do we navigate in such a tree?
  44. Prefix Tree (Trie). Search is guided by the individual letters of a string key. A hash code is just a string of digits!
  45. Persistent Hash Table in a Prefix Tree. Represent 32-bit hash codes as strings of 5-bit symbols:

      hashCode = 0xCAFEBABE
      level    6    5      4      3      2      1      0
      bits     11   00101  01111  11101  01110  10101  11110
      symbol   3    5      15     29     14     21     30
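The table above can be reproduced with a single shift-and-mask, the same expression the find method uses on a later slide. A small sketch:

```java
// Extract the 5-bit symbol of a hash code at a given trie level.
class SymbolDemo {
    static int symbol(int hashCode, int level) {
        // Take bits [level*5 .. level*5+4]; 31 is the mask 0b11111.
        return (hashCode >>> (level * 5)) & 31;
    }

    public static void main(String[] args) {
        int hash = 0xCAFEBABE;
        for (int level = 6; level >= 0; level--) {
            System.out.print(symbol(hash, level) + " "); // prints: 3 5 15 29 14 21 30
        }
        System.out.println();
    }
}
```

The topmost "level" holds only the 2 leftover bits of the 32-bit hash, which is why its symbol is at most 3.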
  46. Persistent Hash Table. hashCode = ... xxxxx xxxxx xxxxx xxxxx. Each item is either a key/value pair or a subtree.
  47. Persistent Hash Table

      class PersistentHashMap<K, V> {
        abstract static class Item<K, V> {}

        static class Node<K, V> extends Item<K, V> {
          final Item<K, V>[] children; // 32 slots    (a)
        }

        static class Entry<K, V> extends Item<K, V> {
          final int hashCode;          (b)
          final K key;                 (c)
          final V value;               (d)
          final Entry<K, V> next;      (e)
        }
      Source Code 1/2
  48. Persistent Hash Table

      class PersistentHashMap<K, V> {
        V get(K key) {
          return root.find(key.hashCode(), key, 0);    (a)
        }

        static class Node<K, V> extends Item<K, V> {
          V find(int hashCode, K key, int level) {
            int index = (hashCode >>> (level * 5)) & 31;    (b)
            Item<K, V> item = children[index];              (c)
            if (item instanceof Node) {                     (d)
              return ((Node<K, V>) item)                    (e)
                  .find(hashCode, key, level + 1);
            }
            if (item instanceof Entry) {                    (f)
              return ((Entry<K, V>) item)                   (g)
                  .find(hashCode, key);
            }
            return null;
          }
        }
      Source Code 2/2
  49. Persistent Hash Table. Do not waste space!

      class PersistentHashMap<K, V> {
        class Node<K, V> {
          final Item<K, V>[] children; // 32 slots    (a)
        }
      }

      ● Most of the children would be null on deeper levels
      ● The number of arrays grows exponentially as we go deeper
      ● We need a way to compact the tree
      ● Simply get rid of the nulls in the arrays!
  50. Persistent Hash Table

      class Node<K, V> {
        final int mask;                                  (a)
        final Item<K, V>[] children; // bitCount(mask) slots    (b)
      }

      ● The mask is a 32-bit integer whose bits are set to 1 only for those array elements that are not null
      ● The array stores only the non-null elements; its size is the number of 1 bits in the mask, varying from 2 to 32 elements
      ● The overhead for a null array element is just one bit. Quite good!
  51. Persistent Hash Table
      ● To test whether the array has an element at index i, simply test if the ith bit in the mask is 1:

        if ((mask & (1 << i)) != 0) { ...

      ● To get the offset of the ith element in the array, count the number of 1 bits below i in the mask:

        int offset = bitCount(mask & ((1 << i) - 1));
        if (children[offset] instanceof ...
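The two expressions above are runnable as-is with Integer.bitCount; a small sketch with a made-up mask for illustration:

```java
// Bitmap compaction arithmetic: the mask has a 1 bit for every occupied
// logical slot; the physical offset of slot i is the popcount of lower set bits.
class BitmapDemo {
    static boolean present(int mask, int i) {
        return (mask & (1 << i)) != 0;
    }

    static int offset(int mask, int i) {
        return Integer.bitCount(mask & ((1 << i) - 1));
    }

    public static void main(String[] args) {
        // Suppose logical slots 1, 4 and 9 are occupied out of 32.
        int mask = (1 << 1) | (1 << 4) | (1 << 9);
        System.out.println(present(mask, 4)); // true
        System.out.println(present(mask, 5)); // false
        System.out.println(offset(mask, 4));  // 1: one occupied slot (1) below index 4
        System.out.println(offset(mask, 9));  // 2: slots 1 and 4 sit below index 9
    }
}
```

On most CPUs Integer.bitCount compiles down to a single popcount instruction, so the compaction costs almost nothing at lookup time.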
  52. Persistent List

      interface Seq<T> {
        T head();                    // get first element
        Seq<T> tail();               // get list without first element
        Seq<T> cons(T v);            // prepend element at head
        Seq<T> snoc(T v);            // append element at tail
        Seq<T> concat(Seq<T> that);  // join two lists
        int size();                  // get number of elements
        T get(int index);            // get Nth element
        Seq<T> set(int index, T v);  // set Nth element
      }

      Remember, no in-place updates. Mutations create new instances.
  53. Persistent List
      ● There are quite a few ways to implement persistent lists
      ● But we will not be studying them
      ● Instead, we will turn our attention to finger trees
      ● Soon it will be clear why
  54. Finger Trees
      ● An incredibly elegant, simple and efficient data structure
      ● Oh so very versatile: the functional programmer's Swiss Army knife
      ● A basic data structure for building random access sequences, deques, priority queues, ropes, interval trees, etc.
      ● Let's define it in stages
  55. Persistent leafy 2-3 trees. Let's begin with a simple data structure, the leafy 2-3 tree:
      ● Every intermediate node has either two or three children
      ● All values are stored in leaves
      ● Perfectly balanced: all leaves are at the same level
  56. Persistent leafy 2-3 trees
  57. Persistent leafy 2-3 trees. Leaves contain the interesting values, but what is stored in the nodes?
  58. Annotated leafy 2-3 trees
      ● There must be a way to find interesting values in a tree
      ● We need to guide the search from the root of the tree to its leaves
      ● Let's add special annotations to nodes
      ● And use these annotations to find values
  59. Size annotated leafy 2-3 trees
      ● Each intermediate node is annotated with the size of the subtree rooted at that node
      ● This makes it trivial to find any leaf by its index
      ● Starting from the root, test whether the index falls in the range of the left (middle) or right subtree, and recurse into that subtree until a leaf is found
  60. Size annotated leafy 2-3 trees. Looks like a random access list!
  61. Priority annotated leafy 2-3 trees
      ● Each intermediate node is annotated with the highest priority of any element in its subtree
      ● This makes it trivial to find the value with the highest priority
      ● Starting from the root, find the subtree with the highest priority and descend recursively into it, until a leaf is found
  62. Priority annotated leafy 2-3 trees. Looks like a priority queue!
  63. Monoids
      ● One interface to unify size, priority (and more!) annotations on trees
      ● A set of values with a "zero" element 0 and a binary associative operation ⊕
      ● Monoid laws:
        0 ⊕ a = a
        a ⊕ 0 = a
        a ⊕ (b ⊕ c) = (a ⊕ b) ⊕ c
  64. Monoid examples
      ● Strings with the empty string and concatenation:
        "" + "a" = "a", "a" + "" = "a"
        "a" + ("b" + "c") = ("a" + "b") + "c"
      ● Integers with zero and addition:
        0 + 1 = 1, 1 + 0 = 1
        1 + (2 + 3) = (1 + 2) + 3
      ● Integers with one and multiplication:
        1 * 2 = 2, 2 * 1 = 2
        2 * (3 * 4) = (2 * 3) * 4
      ● And many, many more! (Monoids are everywhere)
  65. Monoid interface

      interface Monoid<T extends Monoid<T>> {
        T unit();
        T combine(T that);
      }

      // Illustrative only: java.lang.String cannot actually implement a new interface
      class String implements Monoid<String> {
        ...
        String unit() {
          return "";             (a)
        }
        String combine(String that) {
          return this + that;    (b)
        }
      }
  66. Size monoid

      class Size implements Monoid<Size> {
        final int size;    (a)

        Size(int size) {
          this.size = size;
        }

        Size unit() {
          return new Size(0);    (b)
        }

        Size combine(Size that) {
          return new Size(this.size + that.size);    (c)
        }
      }
  67. Priority monoid

      class Priority implements Monoid<Priority> {
        final int priority;    (a)

        Priority(int priority) {
          this.priority = priority;
        }

        Priority unit() {
          return new Priority(Integer.MAX_VALUE);    (b)
        }

        Priority combine(Priority that) {
          return new Priority(
              Math.min(this.priority, that.priority));    (c)
        }
      }
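Slides 65 through 67 assemble into one runnable file. The method bodies are as on the slides; public modifiers are added so the classes actually implement the interface:

```java
// Demo first: the Size and Priority measures below obey the monoid laws.
class MonoidDemo {
    public static void main(String[] args) {
        Size a = new Size(2), b = new Size(3);
        System.out.println(a.combine(b).size);            // 5
        System.out.println(a.combine(a.unit()).size);     // 2: unit is neutral

        Priority p = new Priority(7), q = new Priority(4);
        System.out.println(p.combine(q).priority);        // 4: min wins
        System.out.println(p.combine(p.unit()).priority); // 7: MAX_VALUE is neutral for min
    }
}

interface Monoid<T extends Monoid<T>> {
    T unit();
    T combine(T that);
}

final class Size implements Monoid<Size> {
    final int size;
    Size(int size) { this.size = size; }
    public Size unit() { return new Size(0); }
    public Size combine(Size that) { return new Size(this.size + that.size); }
}

final class Priority implements Monoid<Priority> {
    final int priority;
    Priority(int priority) { this.priority = priority; }
    public Priority unit() { return new Priority(Integer.MAX_VALUE); }
    public Priority combine(Priority that) {
        return new Priority(Math.min(this.priority, that.priority));
    }
}
```

The same Monoid interface will annotate the finger tree later: swapping Size for Priority turns the same tree from a random access list into a priority queue.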
  68. But where do we get monoids from?
      ● Monoids have the nice property of composability
      ● We can get more monoids by combining existing ones
      ● But where do we get the initial monoids to begin with?
      ● We need a way to measure values!
      ● Those measures must be monoids, obviously:

        interface Measured<M extends Monoid> {
          M measure();
        }
  69. Let's make a sketch of an annotated tree

      /** <V> is the type of values,
          <M> is the type of monoidal measures of values */
      class Tree<M extends Monoid, V extends Measured<M>>
          implements Measured<M> {                     (a)

        abstract class Leaf<M, V> extends Tree<M, V> {
          final V value;                               (b)
          override abstract M measure();               (c)
        }

        class Node<M, V> extends Tree<M, V> {
          final Tree<M, V> left, right;                (d)
          final M m;                                   (e)

          Node(Tree<M, V> l, Tree<M, V> r) {
            left = l;
            right = r;
            m = l.measure().combine(r.measure());      (f)
          }

          override final M measure() {
            return m;                                  (g)
          }
        }
      }
      Pseudocode!
  70. Let's make a sketch of an annotated tree

      ...
      class Leaf<V> extends Tree<Size, V> {
        final V value;
        override final Size measure() {
          return new Size(1);                          (a)
        }
      }

      ...
      class Leaf<V> extends Tree<Priority, V> {
        final V value;
        override final Priority measure() {
          return new Priority(value.priority());       (b)
        }
      }
      Pseudocode!
  71. But that is not a finger tree yet!
  72. Finger Tree ... is just an annotated tree of annotated 2-3 trees!
  73. Finger Tree. Digits, 2-3 trees, fingers and nested levels.
  74. Finger Tree. A little bit of Haskell would not hurt:

      data Node v a = Node2 v a a | Node3 v a a a

      data Digit a = One a | Two a a | Three a a a | Four a a a a

      data FingerTree v a
        = Empty
        | Single a
        | Deep v (Digit a)                     (a)
                 (FingerTree v (Node v a))     (b)
                 (Digit a)                     (c)
  75. Finger Tree

      class FingerTree<M extends Monoid<M>, T extends Measured<M>>
          implements Measured<M> {

        class Empty<M extends Monoid<M>, T extends Measured<M>>
            extends FingerTree<M, T> {}

        class Single<M extends Monoid<M>, T extends Measured<M>>
            extends FingerTree<M, T> {
          final T v;    (a)
          final M m;    (b)

        class Deep<M extends Monoid<M>, T extends Measured<M>>
            extends FingerTree<M, T> {
          final Digit<M, T> prefix;                  (c)
          final FingerTree<M, Node<M, T>> middle;    (d)
          final Digit<M, T> suffix;                  (e)
          final M m;                                 (f)
      Source Code 1/3
  76. Finger Tree

      class Digit<M extends Monoid<M>, T extends Measured<M>>
          implements Measured<M> {
        final M m;    (a)

        class One<M extends Monoid<M>, T extends Measured<M>>
            extends Digit<M, T> {
          final T a;    (b)

        class Two<M extends Monoid<M>, T extends Measured<M>>
            extends Digit<M, T> {
          final T a, b;    (c)

        class Three<M extends Monoid<M>, T extends Measured<M>>
            extends Digit<M, T> {
          final T a, b, c;    (d)

        class Four<M extends Monoid<M>, T extends Measured<M>>
            extends Digit<M, T> {
          final T a, b, c, d;    (e)
      Source Code 2/3
  77. Finger Tree

      class Node<M extends Monoid<M>, T extends Measured<M>>
          implements Measured<M> {
        final M m;    (a)

        class Node2<M extends Monoid<M>, T extends Measured<M>>
            extends Node<M, T> {
          final T a, b;    (b)

        class Node3<M extends Monoid<M>, T extends Measured<M>>
            extends Node<M, T> {
          final T a, b, c;    (c)
      Source Code 3/3
  78. Finger Tree Interface. Basic operations:
      ● cons, snoc — prepend/append an element
      ● concat — join two trees
      ● split — find prefix, element and suffix using a predicate
      Beyond the scope of this presentation, sorry.
  79. Finger Tree Performance. Amortized bounds (ℓ is the sequence length, ℓ1 and ℓ2 the lengths of the arguments to concat):

                     Finger Tree           2-3 Tree    List
      cons, snoc     O(1)                  O(log n)    O(1)/O(n)
      head, last     O(1)                  O(log n)    O(1)/O(n)
      concat         O(log min(ℓ1, ℓ2))    O(log n)    O(n)
      split          O(log min(n, ℓ-n))    O(log n)    O(n)
      index          O(log min(n, ℓ-n))    O(log n)    O(n)
  80. Thanks! Questions?
