What Lies Beneath
Mohit Thatte
EUROCLOJURE 2015
Barcelona
A Deep Dive into
Clojure’s data structures
@mohitthatte
@pastafa...
A DAY IN THE LIFE
Image: User:Joonspoon Wikimedia Commons
Programs that use Maps
Map API
Map Implementation
Primitives (JVM, et al)
TOWERS OF ABSTRACTION
“Any sufficiently advanced data structure
is indistinguishable from magic”
- Me
With apologies to Arthur Clarke
IMMUTABILITY
IS GOOD
PERFORMANCE IS
NECESSARY
By U.S. Navy photo [Public domain], via Wikimedia Commons
IMMUTABILITY
PERF
Image: Maj. Gen. William Anders, Apollo 8
“… functional programming’s stricture
against destructive updates (assignments)
is a staggering handicap, tantamount to
co...
ABSTRACT DATA TYPE
enqueue add an element to the end
head first element
tail remaining elements
QUEUE
INTERFACE INVARIANTS...
THE CHALLENGE
Correct
Performant
Immutable
X
CHALLENGE ACCEPTED
Structural Sharing
KEY IDEAS
Structural Bootstrapping
Hybrid Structures
STRUCTURAL SHARING
:a :b :c :d :e
(assoc v 2 :zz)
:a :b :zz
STRUCTURAL SHARING
:c
:a
:d
:f
:m
(assoc v 4 :zz)
:e:b
:d
:f
:zz
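The structural sharing pictured on these slides can be sketched in a few lines of Java. This is a simplified illustration, not Clojure's implementation: `SharedTree` is a hypothetical fixed-size binary tree (Clojure's vectors use 32-way nodes), and its `assoc` copies only the nodes on the path to the changed leaf.

```java
// Hypothetical sketch: an immutable binary tree whose leaf count is a power of two.
final class SharedTree {
    // A node is either a leaf (value set) or a branch (left and right set).
    final SharedTree left, right;
    final String value;

    SharedTree(String value) { this.left = null; this.right = null; this.value = value; }
    SharedTree(SharedTree left, SharedTree right) { this.left = left; this.right = right; this.value = null; }

    // Build a complete tree over items[lo, hi); the range length must be a power of two.
    static SharedTree build(String[] items, int lo, int hi) {
        if (hi - lo == 1) return new SharedTree(items[lo]);
        int mid = (lo + hi) / 2;
        return new SharedTree(build(items, lo, mid), build(items, mid, hi));
    }

    // Return a new tree with one leaf replaced; size is the number of leaves under this node.
    // Only the nodes on the path to the leaf are copied, the sibling subtrees are reused.
    SharedTree assoc(int index, int size, String newValue) {
        if (size == 1) return new SharedTree(newValue);
        int half = size / 2;
        return index < half
            ? new SharedTree(left.assoc(index, half, newValue), right)
            : new SharedTree(left, right.assoc(index - half, half, newValue));
    }

    String get(int index, int size) {
        if (size == 1) return value;
        int half = size / 2;
        return index < half ? left.get(index, half) : right.get(index - half, half);
    }

    public static void main(String[] args) {
        String[] items = {":a", ":b", ":c", ":d", ":e", ":f", ":g", ":h"};
        SharedTree v = build(items, 0, 8);
        SharedTree v2 = v.assoc(4, 8, ":zz");
        System.out.println(v.get(4, 8));        // the old version is unchanged
        System.out.println(v2.get(4, 8));       // the new version sees the update
        System.out.println(v.left == v2.left);  // the untouched half is shared, not copied
    }
}
```

For a tree with N leaves, an update of this kind allocates about log2(N) new nodes and leaves every other node shared between the two versions.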
Image: Alan Levine
STRUCTURAL
DECOMPOSITION
Image: Alan Chia (Lego Color Bricks)
HYBRID STRUCTURES
LET'S DIVE IN!
'(1 2 3) Lists: Code manipulation
[1 2 3] Vectors: All things sequential
{:a 1 :b 2} Maps: Structured Data
#{a e i o u} Se...
MAPS
GET GET value for given key
ASSOC ADD key,value to map
DISSOC REMOVE key,value from map
MERGE MERGE two maps together
THE ...
WHAT MAKES A GOOD MAP?
Constant time operations
independent of number of keys
Efficient space utilization even with mutatio...
IDEAS
ARRAYS
IDEA #1
:a 1 :b 2 :c 3
KEY VALUE PAIRS
NOT A GREAT MAP!
Time complexity O(n)
Space efficiency NO
Objects as keys YES
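A minimal sketch of Idea #1, assuming a flat array that alternates keys and values. `ArrayMapSketch` and its methods are hypothetical names for illustration. Lookup scans every pair, which is where the O(n) comes from, and each immutable update copies the whole array.

```java
import java.util.Arrays;

// Hypothetical sketch: a map stored as a flat array [k0, v0, k1, v1, ...].
final class ArrayMapSketch {
    // Linear scan over the keys: O(n) in the number of entries.
    static Object get(Object[] kvs, Object key) {
        for (int i = 0; i < kvs.length; i += 2) {
            if (kvs[i].equals(key)) return kvs[i + 1];
        }
        return null; // key not present
    }

    // Immutable update: the input array is never modified, a full copy is returned.
    static Object[] assoc(Object[] kvs, Object key, Object value) {
        for (int i = 0; i < kvs.length; i += 2) {
            if (kvs[i].equals(key)) {
                Object[] copy = kvs.clone();
                copy[i + 1] = value;
                return copy;
            }
        }
        Object[] grown = Arrays.copyOf(kvs, kvs.length + 2);
        grown[kvs.length] = key;
        grown[kvs.length + 1] = value;
        return grown;
    }

    public static void main(String[] args) {
        Object[] m = {":a", 1, ":b", 2, ":c", 3};
        Object[] m2 = assoc(m, ":d", 4);
        System.out.println(get(m2, ":d")); // present in the new version
        System.out.println(get(m, ":d"));  // absent from the old version
    }
}
```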
HOW DO WE DO
BETTER?
Image: www.pooktre.com
TREES TO THE RESCUE
Ramon Llull,
Catalunya c. 1250
Arbol de ciencia
IDEA #2
BINARY SEARCH TREE
13 a
8 f 17
1 11q b
6 z
15 s
r
n25
t22 u27
13 a
17
m
r
25
u27
NOT A GREAT MAP!
Time complexity worst case O(n)
Space efficiency POSSIBLY
Objects as keys YES
How do we keep our
trees in ‘balance’?
IDEA #3
BALANCED
BINARY SEARCH TREES
RED BLACK TREES
ALWAYS BALANCED,
100 % MONEY BACK GUARANTEE
Guibas, Sedgewick 1978
RED BLACK TREES
Root is black
Every path from root to an empty node
contains the same number of black nodes
Every node is ...
RED BLACK TREES
Okasaki ‘96
A PRETTY GOOD MAP!
Time complexity O(log_2 N)
Space efficiency YES
Objects as keys YES
Clojure’s
sorted-maps are
Red Black Trees
CONSTRAINTS
KEYS MUST BE COMPARABLE
KEYS ARE COMPARED AT EVERY
NODE, THIS CAN BE EXPENSIVE
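The "keys must be comparable" constraint can be observed with any red-black tree map. The sketch below uses `java.util.TreeMap`, which the JDK documents as a red-black tree, purely as a stand-in: it is mutable and is not the class behind Clojure's sorted-map. `ComparableKeys` and `Point` are hypothetical names.

```java
import java.util.TreeMap;

// Hypothetical sketch: a red-black tree map needs an ordering on its keys.
final class ComparableKeys {
    // A key type that does not implement Comparable.
    static final class Point {
        final int x;
        Point(int x) { this.x = x; }
    }

    // Returns true if the tree map refuses a key it cannot compare.
    // Assumes a modern JDK (7 or later), where put() type-checks the first key too.
    static boolean rejectsNonComparableKey() {
        TreeMap<Object, String> m = new TreeMap<>();
        try {
            m.put(new Point(1), "p");
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> sorted = new TreeMap<>();
        sorted.put("b", 2);
        sorted.put("a", 1);
        sorted.put("c", 3);
        System.out.println(sorted.firstKey());         // smallest key by natural ordering
        System.out.println(rejectsNonComparableKey()); // Point has no ordering
    }
}
```

Every lookup in such a tree calls the comparison at each node on the path, which is the cost the slide warns about when comparisons are expensive.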
IDEA #4
TRIE - SEARCH BY DIGIT
tap
LEVEL 0
LEVEL 1
LEVEL 2
next(node, symbol)
FINITE STATE MACHINE
Symbols #{a..z}
Nodes, Edges
TRIE IMPLEMENTATIONS
Associate each symbol with
an offset, e.g. a=0, b=1, …
LOOKUP TABLES
next = lookup(node, offset)
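A minimal sketch of the lookup-table trie, assuming keys made only of the lowercase letters a..z. `LookupTrie` is a hypothetical name, and the node is mutable to keep the example short. Each node pre-allocates one slot per symbol, so `next = lookup(node, offset)` becomes a single array access, at the price of many null slots.

```java
// Hypothetical sketch: a trie node with one pre-allocated slot per symbol.
final class LookupTrie {
    static final int SYMBOLS = 26; // assumes keys use only 'a'..'z'

    final LookupTrie[] next = new LookupTrie[SYMBOLS]; // most slots stay null
    boolean terminal; // true if a key ends at this node

    // Map a symbol to its table offset: a=0, b=1, ...
    static int offset(char symbol) { return symbol - 'a'; }

    void add(String word) {
        LookupTrie node = this;
        for (char c : word.toCharArray()) {
            int o = offset(c);
            if (node.next[o] == null) node.next[o] = new LookupTrie();
            node = node.next[o];
        }
        node.terminal = true;
    }

    // One array lookup per symbol, so the cost depends on key length, not on key count.
    boolean contains(String word) {
        LookupTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.next[offset(c)];
            if (node == null) return false;
        }
        return node.terminal;
    }

    public static void main(String[] args) {
        LookupTrie root = new LookupTrie();
        root.add("tap");
        System.out.println(root.contains("tap")); // stored key
        System.out.println(root.contains("ta"));  // only a prefix, not a key
    }
}
```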
Fast and space efficient trie searches, Bagwell 2000
ADD
NOT A GREAT MAP!
Time complexity O(log_m N)
Space efficiency NO
Objects as keys NO
How do we avoid null
nodes?
IDEA #4
BST + TRIE = TST
Bentley, Sedgewick 1998
Fast and space efficient trie searches, Bagwell 2000
ADD
A DECENT MAP
Time complexity ~O(log_2 N)
Space efficiency YES
Objects as keys NO
No null nodes,
but can we do better
than log_2 N?
CHALLENGE ACCEPTED
Fast and space efficient trie searches, Bagwell 2000
Array Mapped Trie
IDEA #5
Use bitmaps to determine
presence or absence
of symbol
Let's say we have 16 symbols,
0…15
0 1 0 0 0 1 0 0 1 1 1 0 0 0 0 0
USING BITMAPS
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Does the symbol with offset 6 exist?
ma...
There’s an array alongside
that only contains entries
for the 1’s.
NOT pre-allocated.
What offset in the dynamic
array should I look at?
Image: Martin Fisch, flickr.com
USE THE 1’S AS TALLY MARKS
0 1 0 0 0 1 0 0 1 1 1 0 0 0 0 0
0 1 2 3 4
MapEntry MapEntry
SubTrie
Pointer
MapEntry MapEntry
0 1 0 0 0 1 0 0 1 1 1 0 0 0 0 0
USING BITMAPS
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Where in the array is the entry for ‘6...
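The two bitmap questions on these slides (does a symbol exist, and where is its entry in the compact array) can be sketched as follows, using the 16-bit bitmap from the slides. `BitmapIndex`, `contains` and `index` are hypothetical names; `Integer.bitCount` is the standard JDK population count.

```java
// Hypothetical sketch: presence test and array index for a bitmap-compressed node.
final class BitmapIndex {
    // Does the symbol with this offset exist? Bitwise AND with a one-bit mask.
    static boolean contains(int bitmap, int offset) {
        return (bitmap & (1 << offset)) != 0;
    }

    // Where in the compact array is the entry? Count the 1 bits to the right of offset.
    static int index(int bitmap, int offset) {
        int mask = (1 << offset) - 1; // all bits below offset set
        return Integer.bitCount(bitmap & mask);
    }

    public static void main(String[] args) {
        // Bits 15..0 as on the slide: 0 1 0 0 0 1 0 0 1 1 1 0 0 0 0 0
        int bitmap = 0b0100_0100_1110_0000;
        System.out.println(contains(bitmap, 6)); // bit 6 is set
        System.out.println(index(bitmap, 6));    // one set bit (offset 5) lies below it
    }
}
```

With this bitmap the set offsets are 5, 6, 7, 10 and 14, so they map to array positions 0 through 4 in that order.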
What happens if I insert a new
map entry?
0 1 1 0 0 1 0 0 1 1 1 0 0 0 0 0
0 1 2 3 4
MapEntry MapEntry MapEntry MapEntry MapEntry
0 1 1 0 0 1 0 0 1 1 1 0 0 0 0 0
0 1 2 3 4 5
Map
Entry
Map
Entry
SubTrie
Pointer
Map
Entry
Map
Entry
Map
Entry
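Inserting a new entry, as on the two slides above, sets one more bit and opens a slot at the tallied position. The sketch below is a simplified, hypothetical `BitmapNode` (not Clojure's BitmapIndexedNode): it returns a fresh node and leaves the original untouched.

```java
// Hypothetical sketch: immutable insert into a bitmap-compressed node.
final class BitmapNode {
    final int bitmap;       // one bit per possible symbol
    final Object[] entries; // one slot per set bit, in offset order

    BitmapNode(int bitmap, Object[] entries) {
        this.bitmap = bitmap;
        this.entries = entries;
    }

    BitmapNode insert(int offset, Object entry) {
        int bit = 1 << offset;
        int idx = Integer.bitCount(bitmap & (bit - 1)); // tally marks to the right
        if ((bitmap & bit) != 0) {
            // Symbol already present: copy the array and replace the slot.
            Object[] copy = entries.clone();
            copy[idx] = entry;
            return new BitmapNode(bitmap, copy);
        }
        // Symbol absent: grow by one and shift the later entries up.
        Object[] grown = new Object[entries.length + 1];
        System.arraycopy(entries, 0, grown, 0, idx);
        grown[idx] = entry;
        System.arraycopy(entries, idx, grown, idx + 1, entries.length - idx);
        return new BitmapNode(bitmap | bit, grown);
    }

    public static void main(String[] args) {
        Object[] entries = {"e5", "e6", "e7", "e10", "e14"};
        BitmapNode before = new BitmapNode(0b0100_0100_1110_0000, entries);
        BitmapNode after = before.insert(13, "e13");
        System.out.println(Integer.toBinaryString(after.bitmap)); // bit 13 is now set
        System.out.println(after.entries.length);                 // one more slot than before
    }
}
```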
A DECENT MAP
Time complexity O(log_m N)
Space efficiency YES
Objects as keys NO
How do we support
arbitrary
Objects as keys?
Ideal hash trees, Bagwell 2001
Hashing + AMT
IDEA #6
Ideal hash trees, Bagwell 2001
Use a good hash function
to generate an integer key.
STEP 1
0010 1101 1011 1110 1100 1111 1...
STEP 2
72021 35
Divide the 32 bit integer into ‘symbols’
5 bits at a time.
00101 001111010010101 000110100101
11
Use the ‘...
t bits per symbol
give
2^t symbols
Why 5 bits?
BIT JUGGLING!
Compute ‘symbols’ by shifting and masking
00111000110010110100101010100101
00 00000 00000 00000 00000 00000 ...
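The shift-and-mask step can be sketched directly from the slide's expression `(hash >>> shift) & 0x01f`. `HashSymbols` and its method names are hypothetical. The example hash is the 32-bit pattern shown on this slide, written in hexadecimal.

```java
// Hypothetical sketch: split a 32-bit hash into 5-bit symbols.
final class HashSymbols {
    static final int BITS = 5;    // bits per symbol, so 2^5 = 32 symbols
    static final int MASK = 0x1f; // lowest five bits set

    // The nth symbol: shift right by 5*n (unsigned), then keep the low 5 bits.
    static int symbol(int hash, int level) {
        return (hash >>> (BITS * level)) & MASK;
    }

    // All symbols of a hash: six full 5-bit groups plus the top 2 bits.
    static int[] symbols(int hash) {
        int[] out = new int[7];
        for (int level = 0; level < out.length; level++) {
            out[level] = symbol(hash, level);
        }
        return out;
    }

    public static void main(String[] args) {
        int hash = 0x38CB4AA5; // 00111000110010110100101010100101
        for (int s : symbols(hash)) {
            System.out.print(s + " "); // symbols from level 0 upwards
        }
        System.out.println();
    }
}
```

The unsigned shift `>>>` matters for negative hash codes: an arithmetic `>>` would drag the sign bit into the upper levels, although the mask would still hide it for all but the last, partial group.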
BEST COMMENT EVER.
A persistent rendition of Phil Bagwell's
Hash Array Mapped Trie
Hickey R., Grand C., Emerick C., Miller...
NODE POLYMORPHISM
ArrayNode - 32 wide pointers to sub-tries
BitmapIndexedNode - bitmap + dynamic array
HashCollisionNode -...
EXAMPLE
(let [h (zipmap (range 1e6)
(range 1e6))]
(get h 123456))
10111 111001100101001 00010
28259 223
0101100000
110
shift = 0
ArrayNode
ArrayNode
shift = 5
ArrayNode
shift = 10
BitmapIn...
A GOOD MAP
Time complexity O(log_32 N)
Space efficiency YES
Objects as keys YES
Key compared only once
Bit juggling for great performance!
HAMT
~6 hops to a leaf node
NEED ROOT RESIZING
NOT AMENABLE TO
STRUCTURAL SHARING
REGULAR HASH TABLE?
UPDATES?
Search for the key,
clone leaf nodes and path to root
VECTORS
ArrayNodes all the way.
Break ‘index’ into digits and walk down levels.
INTUITION
(let [arr (vec (range 1e6))]
(nth arr 1...
030 182400
shift = 15
ArrayNode
ArrayNode
shift = 10
ArrayNode
shift = 5
ArrayNode
shift = 0
00011 00000100101100000000000...
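The index walk for `(nth arr 123456)` can be sketched as a plain 32-way tree of arrays. This is a simplified, hypothetical `VectorTrie` that ignores the tail optimization and is not Clojure's PersistentVector. For 1e6 elements the root sits at shift 15, and index 123456 decomposes into the 5-bit digits 3, 24, 18 and 0.

```java
// Hypothetical sketch: look up an index in a 32-way tree by 5-bit digits.
final class VectorTrie {
    static final int BITS = 5;
    static final int WIDTH = 1 << BITS; // 32 children per node
    static final int MASK = WIDTH - 1;  // 0x1f

    // Build a node whose children each cover (1 << shift) indices, stopping at count.
    // Leaves (shift == 0) simply store their own index as the value.
    static Object[] build(int start, int count, int shift) {
        Object[] node = new Object[WIDTH];
        for (int i = 0; i < WIDTH; i++) {
            int childStart = start + (i << shift);
            if (childStart >= count) break;
            node[i] = (shift == 0)
                ? (Object) Integer.valueOf(childStart)
                : build(childStart, count, shift - BITS);
        }
        return node;
    }

    // Walk down one level per digit: shift 15, 10, 5, then index the leaf array.
    static Object nth(Object[] root, int shift, int index) {
        Object[] node = root;
        for (int level = shift; level > 0; level -= BITS) {
            node = (Object[]) node[(index >>> level) & MASK];
        }
        return node[index & MASK];
    }

    public static void main(String[] args) {
        Object[] root = build(0, 1_000_000, 15);
        System.out.println(nth(root, 15, 123456));    // value stored at that index
        System.out.println((123456 >>> 15) & MASK);   // first digit of the walk
    }
}
```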
THE TAIL OPTIMIZATION
PersistentVector
count shift root tail
RIGHT TOOL
FOR THE JOB
By Schnobby (Own work) [CC BY-SA 3.0], via Wikimedia Commons
HashMaps do not
merge efficiently
data.int-map
MAP CATENATION
Okasaki & Gill’s “Fast Mergeable int maps”
Zach Tellman
Vectors do not
concat efficiently
Vectors do not
subvec efficiently
VECTOR CATENATION
Based on Bagwell and Rompf,
“RRB-Trees: Efficient Immutable Vectors”
logarithmic catenation and slicing
M...
CTRIES
Michał Marczyk
Tomorrow at 0850
1959 de la Briandais, Fredkin Trie
1960 Windley, Booth, Colin, Hibbard Binary Search Trees
1962 Adelson-Velsky, Landis AVL Trees
1...
Reading List
Ideal Hash Trees, Bagwell 2001
Fast and space efficient trie searches, Bagwell 2000
Fast Mergeable Integer Maps, Oka...
Polymatheia: Jean Niklas L’Orange
QUESTIONS?
Ask Michal or Zach or Jean Niklas :)
THANK YOU
A deep dive into Clojure's data structures - EuroClojure 2015

Immutable, persistent data structures are at the heart of Clojure's philosophy. It is instructive to see how these are implemented, to appreciate the trade-offs between persistence and performance. Let's explore the key ideas that led to effective, practical implementations of these data structures. There will be animations that should help clarify key concepts!


  1. 1. What Lies Beneath Mohit Thatte EUROCLOJURE 2015 Barcelona A Deep Dive into Clojure’s data structures @mohitthatte @pastafari
  2. 2. A DAY IN THE LIFE Image: User:Joonspoon Wikimedia Commons
  3. 3. Programs that use Maps Map API Map Implementation Primitives (JVM, et al) TOWERS OF ABSTRACTION
  4. 4. “Any sufficiently advanced data structure is indistinguishable from magic” - Me With apologies to Arthur Clarke
  5. 5. IMMUTABILITY IS GOOD
  6. 6. PERFORMANCE IS NECESSARY
  7. 7. By U.S. Navy photo [Public domain], via Wikimedia Commons IMMUTABILITY PERF
  8. 8. Image: Maj. Gen. William Anders, Apollo 8
  9. 9. “… functional programming’s stricture against destructive updates (assignments) is a staggering handicap, tantamount to confiscating a master chef’s knives.” - Chris Okasaki
  10. 10. ABSTRACT DATA TYPE enqueue add an element to the end head first element tail remaining elements QUEUE INTERFACE INVARIANTS NAME
  11. 11. THE CHALLENGE Correct Performant Immutable X
  12. 12. CHALLENGE ACCEPTED
  13. 13. Structural Sharing KEY IDEAS Structural Bootstrapping Hybrid Structures
  14. 14. STRUCTURAL SHARING :a :b :c :d :e (assoc v 2 :zz) :a :b :zz
  15. 15. STRUCTURAL SHARING :c :a :d :f :m (assoc v 4 :zz) :e:b :d :f :zz
  16. 16. Image: Alan Levine
  17. 17. STRUCTURAL DECOMPOSITION Image: Alan Chia (Lego Color Bricks)
  18. 18. HYBRID STRUCTURES
  19. 19. LET'S DIVE IN!
  20. 20. '(1 2 3) Lists: Code manipulation [1 2 3] Vectors: All things sequential {:a 1 :b 2} Maps: Structured Data #{a e i o u} Sets: Ermm, Sets CLOJURE DATA STRUCTURES
  21. 21. MAPS
  22. 22. GET GET value for given key ASSOC ADD key,value to map DISSOC REMOVE key,value from map MERGE MERGE two maps together THE MAP INTERFACE
  23. 23. WHAT MAKES A GOOD MAP? Constant time operations independent of number of keys Efficient space utilization even with mutation Objects as keys, Objects as values
  24. 24. IDEAS
  25. 25. ARRAYS IDEA #1
  26. 26. :a 1 :b 2 :c 3 KEY VALUE PAIRS
  27. 27. NOT A GREAT MAP! Time complexity O(n) Space efficiency NO Objects as keys YES
  28. 28. HOW DO WE DO BETTER?
  29. 29. Image: www.pooktre.com TREES TO THE RESCUE
  30. 30. Ramon Llull, Catalunya c. 1250 Arbol de ciencia
  31. 31. IDEA #2 BINARY SEARCH TREE
  32. 32. 13 a 8 f 17 1 11q b 6 z 15 s r n25 t22 u27
  33. 33. 13 a 17 m r 25 u27
  34. 34. NOT A GREAT MAP! Time complexity worst case O(n) Space efficiency POSSIBLY Objects as keys YES
  35. 35. How do we keep our trees in ‘balance’?
  36. 36. IDEA #3 BALANCED BINARY SEARCH TREES
  37. 37. RED BLACK TREES ALWAYS BALANCED, 100 % MONEY BACK GUARANTEE Guibas, Sedgewick 1978
  38. 38. RED BLACK TREES Root is black Every path from root to an empty node contains the same number of black nodes Every node is colored red or black No red node can have a red child
  39. 39. RED BLACK TREES Okasaki ‘96
  40. 40. A PRETTY GOOD MAP! Time complexity O(log_2 N) Space efficiency YES Objects as keys YES
  41. 41. Clojure’s sorted-maps are Red Black Trees
  42. 42. CONSTRAINTS KEYS MUST BE COMPARABLE KEYS ARE COMPARED AT EVERY NODE, THIS CAN BE EXPENSIVE
  43. 43. IDEA #4 TRIE - SEARCH BY DIGIT
  44. 44. tap LEVEL 0 LEVEL 1 LEVEL 2
  45. 45. next(node, symbol) FINITE STATE MACHINE Symbols #{a..z} Nodes, Edges
  46. 46. TRIE IMPLEMENTATIONS
  47. 47. Associate each symbol with an offset, e.g. a=0, b=1, … LOOKUP TABLES next = lookup(node, offset)
  48. 48. Fast and space efficient trie searches, Bagwell 2000 ADD
  49. 49. NOT A GREAT MAP! Time complexity O(log_m N) Space efficiency NO Objects as keys NO
  50. 50. How do we avoid null nodes?
  51. 51. IDEA #4 BST + TRIE = TST Bentley, Sedgewick 1998
  52. 52. Fast and space efficient trie searches, Bagwell 2000 ADD
  53. 53. A DECENT MAP Time complexity ~O(log_2 N) Space efficiency YES Objects as keys NO
  54. 54. No null nodes, but can we do better than log_2 N?
  55. 55. CHALLENGE ACCEPTED
  56. 56. Fast and space efficient trie searches, Bagwell 2000 Array Mapped Trie IDEA #5
  57. 57. Use bitmaps to determine presence or absence of symbol
  58. 58. Let's say we have 16 symbols, 0…15
  59. 59. 0 1 0 0 0 1 0 0 1 1 1 0 0 0 0 0 USING BITMAPS 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 Does the symbol with offset 6 exist? mask = 1 << offset bitmap & mask 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 bitwise AND with a mask
  60. 60. There’s an array alongside that only contains entries for the 1’s. NOT pre-allocated.
  61. 61. What offset in the dynamic array should I look at?
  62. 62. Image: Martin Fisch, flickr.com USE THE 1’S AS TALLY MARKS
  63. 63. 0 1 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 1 2 3 4 MapEntry MapEntry SubTrie Pointer MapEntry MapEntry
  64. 64. 0 1 0 0 0 1 0 0 1 1 1 0 0 0 0 0 USING BITMAPS 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 Where in the array is the entry for ‘6’? Integer.bitCount(bitmap & mask) 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 Count tally marks to the ‘right’ of offset mask = (1 << 6 ) - 1 How do I create a mask to do that? 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
  65. 65. What happens if I insert a new map entry?
  66. 66. 0 1 1 0 0 1 0 0 1 1 1 0 0 0 0 0 0 1 2 3 4 MapEntry MapEntry MapEntry MapEntry MapEntry
  67. 67. 0 1 1 0 0 1 0 0 1 1 1 0 0 0 0 0 0 1 2 3 4 5 Map Entry Map Entry SubTrie Pointer Map Entry Map Entry Map Entry
  68. 68. A DECENT MAP Time complexity O(log_m N) Space efficiency YES Objects as keys NO
  69. 69. How do we support arbitrary Objects as keys?
  70. 70. Ideal hash trees, Bagwell 2001 Hashing + AMT IDEA #6
  71. 71. Ideal hash trees, Bagwell 2001 Use a good hash function to generate an integer key. STEP 1 0010 1101 1011 1110 1100 1111 1111 1001 hasheq
  72. 72. STEP 2 72021 35 Divide the 32 bit integer into ‘symbols’ 5 bits at a time. 00101 001111010010101 000110100101 11 Use the ‘symbols’ to walk down an AMT
  73. 73. t bits per symbol give 2^t symbols
  74. 74. Why 5 bits?
  75. 75. BIT JUGGLING! Compute ‘symbols’ by shifting and masking 00111000110010110100101010100101 00 00000 00000 00000 00000 00000 11111 (hash >>> shift) & 0x01f How to calculate nth digit? Shift by 5*n and mask with 0x1f
  76. 76. BEST COMMENT EVER. A persistent rendition of Phil Bagwell's Hash Array Mapped Trie Hickey R., Grand C., Emerick C., Miller A., Fingerhut A. Uses path copying for persistence HashCollision leaves vs. extended hashing Node polymorphism vs. conditionals No sub-tree pools or root-resizing Any errors are my own PersistentHashMap.java:19
  77. 77. NODE POLYMORPHISM ArrayNode - 32 wide pointers to sub-tries BitmapIndexedNode - bitmap + dynamic array HashCollisionNode - array for things that collide
  78. 78. EXAMPLE (let [h (zipmap (range 1e6) (range 1e6))] (get h 123456))
  79. 79. 10111 111001100101001 00010 28259 223 0101100000 110 shift = 0 ArrayNode ArrayNode shift = 5 ArrayNode shift = 10 BitmapIndexedNode shift = 15 … and then follow the AMT down
  80. 80. A GOOD MAP Time complexity O(log_32 N) Space efficiency YES Objects as keys YES
  81. 81. Key compared only once Bit juggling for great performance! HAMT ~6 hops to a leaf node
  82. 82. NEED ROOT RESIZING NOT AMENABLE TO STRUCTURAL SHARING REGULAR HASH TABLE?
  83. 83. UPDATES? Search for the key, clone leaf nodes and path to root
  84. 84. VECTORS
  85. 85. ArrayNodes all the way. Break ‘index’ into digits and walk down levels. INTUITION (let [arr (vec (range 1e6))] (nth arr 123456))
  86. 86. 030 182400 shift = 15 ArrayNode ArrayNode shift = 10 ArrayNode shift = 5 ArrayNode shift = 0 00011 000001001011000000000000000000 123456
  87. 87. THE TAIL OPTIMIZATION PersistentVector count shift root tail
  88. 88. RIGHT TOOL FOR THE JOB By Schnobby (Own work) [CC BY-SA 3.0], via Wikimedia Commons
  89. 89. HashMaps do not merge efficiently
  90. 90. data.int-map MAP CATENATION Okasaki & Gill’s “Fast Mergeable int maps” Zach Tellman
  91. 91. Vectors do not concat efficiently Vectors do not subvec efficiently
  92. 92. VECTOR CATENATION Based on Bagwell and Rompf, “RRB-Trees: Efficient Immutable Vectors” logarithmic catenation and slicing Michał Marczyk core.rrb-vector TODO: benchmarks
  93. 93. CTRIES Michał Marczyk Tomorrow at 0850
  94. 94. 1959 de la Briandais, Fredkin Trie 1960 Windley, Booth, Colin, Hibbard Binary Search Trees 1962 Adelson-Velsky, Landis AVL Trees 1978 Guibas, Sedgewick Red Black Trees 1985 Sleator, Tarjan Splay Trees 1996 Okasaki Purely Functional Data Structures 1998 Sedgewick Ternary Search Trees 2000 Phil Bagwell AMT 2001 Phil Bagwell HAMT 2007 Rich Hickey Clojure!
  95. 95. Reading List Ideal Hash Trees, Bagwell 2001 Fast and space efficient trie searches, Bagwell 2000 Fast Mergeable Integer Maps, Okasaki & Gill, 1998 The world's fastest scrabble program, Appel & Jacobson, 1988 File searching using variable length keys, de la Briandais, 1959 Purely Functional Data Structures, Okasaki 1996
  96. 96. Polymatheia: Jean Niklas L’Orange
  97. 97. QUESTIONS? Ask Michal or Zach or Jean Niklas :)
  98. 98. THANK YOU
