A deep dive into Clojure's data structures - EuroClojure 2015

Owner/Partner at Rebel Guru
Jun. 25, 2015

  1. What Lies Beneath: A Deep Dive into Clojure’s Data Structures. Mohit Thatte, EuroClojure 2015, Barcelona. @mohitthatte @pastafari
  2. A DAY IN THE LIFE Image: User:Joonspoon Wikimedia Commons
  3. TOWERS OF ABSTRACTION: Programs that use Maps; Map API; Map Implementation; Primitives (JVM, et al.)
  4. “Any sufficiently advanced data structure is indistinguishable from magic” - Me With apologies to Arthur Clarke
  5. IMMUTABILITY IS GOOD
  6. PERFORMANCE IS NECESSARY
  7. By U.S. Navy photo [Public domain], via Wikimedia Commons IMMUTABILITY PERF
  8. Image: Maj. Gen. William Anders, Apollo 8
  9. “… functional programming’s stricture against destructive updates (assignments) is a staggering handicap, tantamount to confiscating a master chef’s knives.” - Chris Okasaki
  10. ABSTRACT DATA TYPE: a NAME (QUEUE), an INTERFACE (enqueue: add an element to the end; head: first element; tail: remaining elements), and INVARIANTS
  11. THE CHALLENGE Correct Performant Immutable X
  12. CHALLENGE ACCEPTED
  13. KEY IDEAS: Structural Sharing, Structural Bootstrapping, Hybrid Structures
  14. STRUCTURAL SHARING: v holds :a :b :c :d :e; (assoc v 2 :zz) creates new cells :a :b :zz and shares the remaining cells with v
  15. STRUCTURAL SHARING in a tree: v holds :c :a :d :f :m :e :b; (assoc v 4 :zz) creates new nodes :d :f :zz along the path to the change
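A minimal Java sketch of the structural sharing idea in the two slides above, using a plain immutable linked list. The class `SharedList` and its methods are hypothetical names for illustration and are not Clojure's implementation. Replacing an element copies only the cells in front of it; everything after it is shared with the old list.

```java
// Hypothetical illustration of structural sharing; not Clojure's code.
final class SharedList {
    final Object head;
    final SharedList tail; // null marks the end of the list

    SharedList(Object head, SharedList tail) {
        this.head = head;
        this.tail = tail;
    }

    // Build a list from the given elements, last cell first.
    static SharedList of(Object... xs) {
        SharedList result = null;
        for (int i = xs.length - 1; i >= 0; i--) {
            result = new SharedList(xs[i], result);
        }
        return result;
    }

    // Return a new list with the element at index replaced.
    // Cells before index are copied; cells after it are shared.
    SharedList assoc(int index, Object value) {
        if (index == 0) {
            return new SharedList(value, tail);
        }
        if (tail == null) {
            throw new IndexOutOfBoundsException("index past end of list");
        }
        return new SharedList(head, tail.assoc(index - 1, value));
    }

    // Walk index cells from the front and return that element.
    Object nth(int index) {
        SharedList cell = this;
        for (int i = 0; i < index; i++) {
            cell = cell.tail;
        }
        return cell.head;
    }
}
```

With `v = SharedList.of(":a", ":b", ":c", ":d", ":e")` and `w = v.assoc(2, ":zz")`, the old list still reads `:c` at index 2, and the cell holding `:d` is the same object in both lists.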
  16. Image: Alan Levine
  17. STRUCTURAL DECOMPOSITION Image: Alan Chia (Lego Color Bricks)
  18. HYBRID STRUCTURES
  19. LET’S DIVE IN!
  20. CLOJURE DATA STRUCTURES. Lists, '(1 2 3): code manipulation. Vectors, [1 2 3]: all things sequential. Maps, {:a 1 :b 2}: structured data. Sets, #{a e i o u}: ermm, sets.
  21. MAPS
  22. THE MAP INTERFACE. GET: get the value for a given key. ASSOC: add a key, value pair to the map. DISSOC: remove a key, value pair from the map. MERGE: merge two maps together.
  23. WHAT MAKES A GOOD MAP? Constant-time operations, independent of the number of keys. Efficient space utilization, even with mutation. Objects as keys, Objects as values.
  24. IDEAS
  25. ARRAYS IDEA #1
  26. KEY VALUE PAIRS in an array: :a 1 :b 2 :c 3
  27. NOT A GREAT MAP! Time complexity: O(n). Space efficiency: NO. Objects as keys: YES.
  28. HOW DO WE DO BETTER?
  29. Image: www.pooktre.com TREES TO THE RESCUE
  30. Ramon Llull, Catalunya c. 1250 Arbol de ciencia
  31. IDEA #2 BINARY SEARCH TREE
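  32. [Figure: a binary search tree rooted at key 13, each node holding a key and a value]
  33. [Figure: an unbalanced binary search tree whose keys 13, 17, 25, 27 form a single chain]
  34. NOT A GREAT MAP! Time complexity: worst case O(n). Space efficiency: POSSIBLY. Objects as keys: YES.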
  35. How do we keep our trees in ‘balance’?
  36. IDEA #3 BALANCED BINARY SEARCH TREES
  37. RED BLACK TREES: ALWAYS BALANCED, 100% MONEY BACK GUARANTEE. Guibas, Sedgewick 1978
  38. RED BLACK TREES. Every node is colored red or black. Root is black. No red node can have a red child. Every path from root to an empty node contains the same number of black nodes.
  39. RED BLACK TREES, Okasaki ’96
  40. A PRETTY GOOD MAP! Time complexity: O(log_2 N). Space efficiency: YES. Objects as keys: YES.
  41. Clojure’s sorted-maps are Red Black Trees
  42. CONSTRAINTS: keys must be comparable. Keys are compared at every node, and this can be expensive.
  43. IDEA #4 TRIE - SEARCH BY DIGIT
  44. [Figure: the word “tap” stored in a trie, one symbol per level: LEVEL 0, LEVEL 1, LEVEL 2]
  45. FINITE STATE MACHINE: Symbols #{a..z}; Nodes, Edges; next(node, symbol)
  46. TRIE IMPLEMENTATIONS
  47. LOOKUP TABLES: associate each symbol with an offset, e.g. a=0, b=1, …; next = lookup(node, offset)
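A rough Java sketch of a trie built on lookup tables, assuming lowercase symbols a..z as in the slides. The class `LookupTrie` is a hypothetical name, and the trie is mutable purely to keep the example short. Each node pre-allocates 26 slots, most of which stay null.

```java
// Hypothetical sketch of a lookup-table trie over the symbols a..z.
final class LookupTrie {
    // One slot per symbol; unused slots stay null.
    private final LookupTrie[] next = new LookupTrie[26];
    private boolean terminal; // true if a stored word ends at this node

    // Associate each symbol with an offset: a=0, b=1, ...
    private static int offset(char symbol) {
        if (symbol < 'a' || symbol > 'z') {
            throw new IllegalArgumentException("symbol outside a..z: " + symbol);
        }
        return symbol - 'a';
    }

    // Store a word, creating nodes along the way where needed.
    void add(String word) {
        LookupTrie node = this;
        for (char c : word.toCharArray()) {
            int off = offset(c);
            if (node.next[off] == null) {
                node.next[off] = new LookupTrie();
            }
            node = node.next[off];
        }
        node.terminal = true;
    }

    // Walk one level per symbol: next = lookup(node, offset).
    boolean contains(String word) {
        LookupTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.next[offset(c)];
            if (node == null) {
                return false;
            }
        }
        return node.terminal;
    }
}
```

Storing just “tap” allocates three 26-slot tables for three symbols, which is the space cost the following slides set out to remove.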
  48. Fast and space efficient trie searches, Bagwell 2000 ADD
  49. NOT A GREAT MAP! Time complexity: O(log_m N). Space efficiency: NO. Objects as keys: NO.
  50. How do we avoid null nodes?
  51. IDEA #4: BST + TRIE = TST. Bentley, Sedgewick 1998
  52. Fast and space efficient trie searches, Bagwell 2000 ADD
  53. A DECENT MAP. Time complexity: ~O(log_2 N). Space efficiency: YES. Objects as keys: NO.
  54. No null nodes, but can we do better than log_2 N?
  55. CHALLENGE ACCEPTED
  56. Fast and space efficient trie searches, Bagwell 2000 Array Mapped Trie IDEA #5
  57. Use bitmaps to determine presence or absence of symbol
  58. Let’s say we have 16 symbols, 0…15
  59. USING BITMAPS. Bitmap, bit 15 down to bit 0: 0100 0100 1110 0000. Does the symbol with offset 6 exist? Build a mask, mask = 1 << offset, which is 0000 0000 0100 0000, then take the bitwise AND, bitmap & mask. A non-zero result means the symbol is present.
  60. There’s an array alongside that only contains entries for the 1’s. NOT pre-allocated.
  61. What offset in the dynamic array should I look at?
  62. Image: Martin Fisch, flickr.com USE THE 1’S AS TALLY MARKS
  63. Bitmap 0100 0100 1110 0000; array slots 0 to 4: MapEntry, MapEntry, SubTrie Pointer, MapEntry, MapEntry
  64. USING BITMAPS. Bitmap, bit 15 down to bit 0: 0100 0100 1110 0000. Where in the array is the entry for ‘6’? Count the tally marks to the ‘right’ of the offset. How do I create a mask to do that? mask = (1 << 6) - 1, which is 0000 0000 0011 1111. Then bitmap & mask is 0000 0000 0010 0000, and Integer.bitCount(bitmap & mask) gives the array position, here 1.
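The two bitmap questions above (is a symbol present, and where does its entry sit in the compact array) can be sketched in Java as follows. The class and method names are hypothetical; only the mask expressions and `Integer.bitCount` come from the slides.

```java
// Hypothetical helper for the bitmap arithmetic of an array mapped trie node.
final class BitmapOps {
    // Is there an entry for the symbol with this offset?
    static boolean present(int bitmap, int offset) {
        int mask = 1 << offset;
        return (bitmap & mask) != 0;
    }

    // Position of the entry in the compact array: count the 1 bits
    // to the right of (that is, below) the offset.
    static int index(int bitmap, int offset) {
        int mask = (1 << offset) - 1;
        return Integer.bitCount(bitmap & mask);
    }
}
```

For the slide's bitmap `0b0100_0100_1110_0000`, offset 6 is present and its entry sits at array position 1, because only bit 5 is set below it.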
  65. What happens if I insert a new map entry?
  66. Bitmap 0110 0100 1110 0000; array slots 0 to 4: MapEntry, MapEntry, MapEntry, MapEntry, MapEntry
  67. Bitmap 0110 0100 1110 0000; array slots 0 to 5: MapEntry, MapEntry, SubTrie Pointer, MapEntry, MapEntry, MapEntry
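One way to picture the insert is a node that returns a fresh copy of itself with one more bit set and one more array slot. This is a sketch under assumptions: `BitmapNode` and `insert` are hypothetical names, entries are plain Objects, and inserting at an offset that is already present is simply rejected.

```java
// Hypothetical sketch of inserting into a bitmap-indexed node by copying.
final class BitmapNode {
    final int bitmap;
    final Object[] entries; // one slot per 1 bit, in bit order

    BitmapNode(int bitmap, Object[] entries) {
        this.bitmap = bitmap;
        this.entries = entries;
    }

    // Return a new node that also holds entry at the given offset.
    // The original node and its array are left untouched.
    BitmapNode insert(int offset, Object entry) {
        int bit = 1 << offset;
        if ((bitmap & bit) != 0) {
            throw new IllegalArgumentException("offset already present");
        }
        // Array position = number of 1 bits below the new bit.
        int idx = Integer.bitCount(bitmap & (bit - 1));
        Object[] grown = new Object[entries.length + 1];
        System.arraycopy(entries, 0, grown, 0, idx);
        grown[idx] = entry;
        System.arraycopy(entries, idx, grown, idx + 1, entries.length - idx);
        return new BitmapNode(bitmap | bit, grown);
    }
}
```

Inserting at offset 13 into the bitmap 0100 0100 1110 0000 yields 0110 0100 1110 0000, with the new entry at array position 4 and the former last entry moved to position 5.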
  68. A DECENT MAP. Time complexity: O(log_m N). Space efficiency: YES. Objects as keys: NO.
  69. How do we support arbitrary Objects as keys?
  70. IDEA #6: Hashing + AMT. Ideal Hash Trees, Bagwell 2001
  71. STEP 1: use a good hash function (hasheq) to generate an integer key, e.g. 0010 1101 1011 1110 1100 1111 1111 1001. Ideal Hash Trees, Bagwell 2001
  72. STEP 2: divide the 32-bit integer into ‘symbols’, 5 bits at a time, and use the ‘symbols’ to walk down an AMT. [Figure: a hash split into 5-bit groups, with example symbols 7, 20, 21, 3, 5]
  73. t bits per symbol give 2^t symbols
  74. Why 5 bits?
  75. BIT JUGGLING! Compute ‘symbols’ by shifting and masking: (hash >>> shift) & 0x01f. How to calculate the nth digit? Shift by 5*n and mask with 0x1f. Example hash: 00111000110010110100101010100101; mask: 11111.
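The shift-and-mask step can be written as a small Java helper. The names `HashSymbols` and `symbol` are hypothetical; the expression itself is the one on the slide, with level 0 reading the lowest 5 bits of the hash.

```java
// Hypothetical helper that extracts the 5-bit symbol for a trie level.
final class HashSymbols {
    static final int BITS = 5;    // bits per symbol
    static final int MASK = 0x1f; // five 1 bits

    // Symbol used at the given level: shift by 5 * level, then mask.
    static int symbol(int hash, int level) {
        int shift = BITS * level;
        return (hash >>> shift) & MASK;
    }
}
```

For the slide's example hash 00111000110010110100101010100101, levels 0 through 6 give the symbols 5, 21, 18, 22, 12, 28 and 0. The last level only has two bits left, which is why the unsigned shift `>>>` matters for negative hashes.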
  76. BEST COMMENT EVER (PersistentHashMap.java:19): “A persistent rendition of Phil Bagwell's Hash Array Mapped Trie. Uses path copying for persistence. HashCollision leaves vs. extended hashing. Node polymorphism vs. conditionals. No sub-tree pools or root-resizing. Any errors are my own.” Hickey R., Grand C., Emerick C., Miller A., Fingerhut A.
  77. NODE POLYMORPHISM. ArrayNode: 32 wide, pointers to sub-tries. BitmapIndexedNode: bitmap + dynamic array. HashCollisionNode: array for things that collide.
  78. EXAMPLE (let [h (zipmap (range 1e6) (range 1e6))] (get h 123456))
  79. [Figure: the key's hash split into 5-bit symbols and walked level by level] ArrayNode (shift = 0), ArrayNode (shift = 5), ArrayNode (shift = 10), BitmapIndexedNode (shift = 15), … and then follow the AMT down
  80. A GOOD MAP. Time complexity: O(log_32 N). Space efficiency: YES. Objects as keys: YES.
  81. HAMT: key compared only once. Bit juggling for great performance! ~6 hops to a leaf node.
  82. REGULAR HASH TABLE? Needs root resizing. Not amenable to structural sharing.
  83. UPDATES? Search for the key, clone leaf nodes and path to root
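A sketch of path copying in Java. To keep it short it uses a plain binary search tree with int keys rather than a HAMT, so it shows the update technique only, not Clojure's node layout. `PathCopyTree` and its methods are hypothetical names.

```java
// Hypothetical sketch of path copying on an immutable binary search tree.
final class PathCopyTree {
    final int key;
    final String value;
    final PathCopyTree left;
    final PathCopyTree right;

    PathCopyTree(int key, String value, PathCopyTree left, PathCopyTree right) {
        this.key = key;
        this.value = value;
        this.left = left;
        this.right = right;
    }

    // Return a new root. Only the nodes on the search path are cloned;
    // every subtree off that path is shared with the old tree.
    static PathCopyTree assoc(PathCopyTree node, int key, String value) {
        if (node == null) {
            return new PathCopyTree(key, value, null, null);
        }
        if (key < node.key) {
            return new PathCopyTree(node.key, node.value, assoc(node.left, key, value), node.right);
        }
        if (key > node.key) {
            return new PathCopyTree(node.key, node.value, node.left, assoc(node.right, key, value));
        }
        return new PathCopyTree(key, value, node.left, node.right);
    }

    // Ordinary search; returns null when the key is absent.
    static String get(PathCopyTree node, int key) {
        while (node != null) {
            if (key < node.key) {
                node = node.left;
            } else if (key > node.key) {
                node = node.right;
            } else {
                return node.value;
            }
        }
        return null;
    }
}
```

After updating key 17 in a tree rooted at 13, the old root still sees the old value, and the left subtree is the same object under both roots.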
  84. VECTORS
  85. INTUITION: ArrayNodes all the way. Break the ‘index’ into digits and walk down the levels. (let [arr (vec (range 1e6))] (nth arr 123456))
  86. 123456 breaks into the 5-bit digits 3, 24, 18, 0: ArrayNode (shift = 15) uses 3, ArrayNode (shift = 10) uses 24, ArrayNode (shift = 5) uses 18, ArrayNode (shift = 0) uses 0
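Breaking an index into digits is the same shift-and-mask trick, applied from the root's shift downwards. A small Java sketch, with the hypothetical names `IndexDigits` and `digits`, and assuming the root of a million-element vector sits at shift = 15 as on the slide:

```java
// Hypothetical helper that splits a vector index into 5-bit digits.
final class IndexDigits {
    // Digits of index, root level first, for a trie whose root uses rootShift.
    static int[] digits(int index, int rootShift) {
        int[] out = new int[rootShift / 5 + 1];
        int i = 0;
        for (int shift = rootShift; shift >= 0; shift -= 5) {
            out[i++] = (index >>> shift) & 0x1f;
        }
        return out;
    }
}
```

`IndexDigits.digits(123456, 15)` yields 3, 24, 18, 0, since 3·32768 + 24·1024 + 18·32 + 0 = 123456.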
  87. THE TAIL OPTIMIZATION. PersistentVector fields: count, shift, root, tail
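The slide only lists the fields, so the following is an assumption about how they fit together: the tail holds the most recent elements (up to 32) outside the tree, so an index lives in the tail once it reaches the tail's starting offset. The arithmetic is modelled on my reading of `tailoff` in Clojure's PersistentVector.java; `TailMath` and its method names are hypothetical.

```java
// Hypothetical sketch of the tail arithmetic; an assumption, not a quote
// of Clojure's source.
final class TailMath {
    // Index of the first element stored in the tail for a vector of
    // the given count.
    static int tailOffset(int count) {
        if (count < 32) {
            return 0; // everything still fits in the tail
        }
        return ((count - 1) >>> 5) << 5; // round (count - 1) down to a multiple of 32
    }

    // True if the element at index is read from the tail, not the tree.
    static boolean inTail(int index, int count) {
        return index >= tailOffset(count);
    }
}
```

Under this sketch, a vector of 1,000,000 elements keeps indices 999,968 and above in the tail, while index 123456 is found by walking the tree.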
  88. RIGHT TOOL FOR THE JOB By Schnobby (Own work) [CC BY-SA 3.0], via Wikimedia Commons
  89. HashMaps do not merge efficiently
  90. MAP CATENATION: data.int-map (Zach Tellman); Okasaki & Gill’s “Fast Mergeable Integer Maps”
  91. Vectors do not concat efficiently. Vectors do not subvec efficiently.
  92. VECTOR CATENATION: core.rrb-vector (Michał Marczyk), based on Bagwell and Rompf, “RRB-Trees: Efficient Immutable Vectors”. Logarithmic catenation and slicing. TODO: benchmarks
  93. CTRIES: Michał Marczyk, tomorrow at 0850
  94. 1959 de la Briandais, Fredkin: Trie; 1960 Windley, Booth, Colin, Hibbard: Binary Search Trees; 1962 Adelson-Velsky, Landis: AVL Trees; 1978 Guibas, Sedgewick: Red Black Trees; 1985 Sleator, Tarjan: Splay Trees; 1996 Okasaki: Purely Functional Data Structures; 1998 Sedgewick: Ternary Search Trees; 2000 Phil Bagwell: AMT; 2001 Phil Bagwell: HAMT; 2007 Rich Hickey: Clojure!
  95. Reading List: Ideal Hash Trees, Bagwell, 2001; Fast and Space Efficient Trie Searches, Bagwell, 2000; Fast Mergeable Integer Maps, Okasaki & Gill, 1998; The World’s Fastest Scrabble Program, Appel & Jacobson, 1988; File Searching Using Variable Length Keys, de la Briandais, 1959; Purely Functional Data Structures, Okasaki, 1996
  96. Polymatheia: Jean Niklas L’Orange
  97. QUESTIONS? Ask Michał or Zach or Jean Niklas :)
  98. THANK YOU