Exotic Functional Data Structures: Hitchhiker Trees
David Greenberg
9/17/16
Strange Loop
Who am I?
Functional
Data Structures
What are they, anyway?
Functional Data Structures
Immutable
7 + 1 = 8
But 7 is still 7
Functional Data Structures
x = [1, 2, 3]
y = x      # y is a second name for the same list
y += [4]   # mutates that shared list in place
if x == y:
    print("I'm a sad panda")
How to fix this?
x = [1, 2, 3]
y = x[:]   # slice copy: y is a separate list
y += [4]   # only the copy changes
if x != y:
    print("I'm a happy panda")
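The same result can be had without remembering to copy, if the value itself is immutable. A minimal sketch using Python's built-in tuple (a stand-in chosen here for illustration, not something from the slides):

```python
x = (1, 2, 3)
y = x            # both names refer to the same tuple
y += (4,)        # builds a brand-new tuple; x cannot be changed

if x != y:
    print("I'm a happy panda")
```

Like 7 + 1 = 8 leaving 7 alone, the "update" produces a new value and the old one stays valid.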
A List of Fruit
Mutation in an
Immutable World
What are
pointers?
(besides hard)
Pointers!
Pointers and Sharing
Doing Better with Pointers
Linked List
Editing the Linked List
Worst Case Performance
Philosophy of Identity
Q: When isn’t an apple an apple?
A: When "apple points to orange points to banana" isn’t "apple points to orange points to mango".
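A minimal sketch of the immutable linked list of fruit, assuming a hypothetical `Node` type and helper names that are not from the talk. Prepending shares the whole old list; replacing an element copies every node in front of it, so the worst case (editing the last element) copies the entire list.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    value: str
    next: Optional["Node"]


def from_list(values):
    """Build a linked list from a Python list, last element first."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


def to_list(node):
    """Walk the pointers and collect the values."""
    out = []
    while node is not None:
        out.append(node.value)
        node = node.next
    return out


def replace_at(node, index, value):
    """Return a new list with position `index` replaced.

    Nodes before `index` are copied; nodes after it are shared.
    """
    if node is None:
        raise IndexError(index)
    if index == 0:
        return Node(value, node.next)
    return Node(node.value, replace_at(node.next, index - 1, value))


fruit = from_list(["apple", "orange", "banana"])
with_kiwi = Node("kiwi", fruit)           # prepend: one new node
edited = replace_at(fruit, 2, "mango")    # worst case: all three nodes copied
```

Here `edited` starts with a different "apple" node than `fruit` does, because the apple that points to an orange that points to a mango is a different list.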
Trees
Binary Search Trees
Lookups are log_2(n)
Elements per Level:
1 = 2^0
2 = 2^1
4 = 2^2
Big O Analysis
We Care About the Dominating Factor
Performance
Analysis/Algebra
We have L levels
Lookups cost L
Only the last level matters
There are 2^(L-1) elements on the last level
Thus: n ≈ 2^(L-1)
log_2(n) ≈ L
Functional updates
Path Copying
Path Copying
Updates still log_2(n)
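Path copying can be sketched on a plain binary search tree. This is an illustration under simplifying assumptions: the `Tree` type and function names are hypothetical, and no rebalancing is done, so the log_2(n) cost only holds while the tree happens to stay balanced.

```python
from typing import NamedTuple, Optional


class Tree(NamedTuple):
    key: int
    left: Optional["Tree"] = None
    right: Optional["Tree"] = None


def insert(node, key):
    """Return a new root; only nodes on the search path are copied."""
    if node is None:
        return Tree(key)
    if key < node.key:
        return node._replace(left=insert(node.left, key))
    if key > node.key:
        return node._replace(right=insert(node.right, key))
    return node  # key already present, nothing to copy


def contains(node, key):
    """Ordinary BST lookup: one comparison per level."""
    while node is not None:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False


old = None
for k in [8, 4, 12, 2, 6, 10, 14]:
    old = insert(old, k)

new = insert(old, 5)   # copies the nodes 8, 4 and 6; shares the rest
```

Both `old` and `new` remain usable afterwards, and the whole right half of the tree is the same object in each.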
Properties of Trees
Balanced
How do we maintain this?
How to order the values
Sort them
Trie
Changing Our
Cost Model
Where did the 2 come from in log_2(n)?
More children
Fat nodes with ~B children
Going Wide
B Trees are Optimal for
Reads
Lower Bound of log_B(n) for sorted lookups
Controlling the base of the logarithm is awesome
log_2(1000) ≈ 9.97
log_5(1000) ≈ 4.29
log_100(1000) = 1.5
Going wide gives big constant speedups for free
Under our I/O cost model
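The effect of the base can be checked directly. A small sketch, with `levels` as a hypothetical helper name, that computes log_B(n) for the three branching factors above:

```python
import math


def levels(n, b):
    """log base b of n: roughly how many levels a lookup has to descend."""
    return math.log(n, b)


n = 1000
for b in (2, 5, 100):
    print(f"B = {b:>3}: log_B({n}) = {levels(n, b):.2f}")
```

In the I/O cost model each level is one node read, so a wider node directly means fewer reads.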
B Tree Bookkeeping
Not as simple as a Binary Search Tree
Separate Node Types
Index & Data Nodes
B+ Tree
Reduce B to fit more levels on screen
Introducing
Fractal Trees
Fractal Trees
We can insert
faster
log_B(n) is only for sorted lookups
Appending to a Log
Constant time to append
Already know the next index where we need to insert
A B C D E
Fractal Trees
Fractal Insertion
Inserting 0
Walking Through Insertions
Inserting -1
Walking Through Insertions
Inserting 28
Walking Through Insertions
Inserting 29
Walking Through Insertions
Inserting -2
Walking Through Insertions
Inserting 11.5
Walking Through Insertions
Inserting 100
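The walkthrough can be approximated in code. This is a heavily simplified sketch, not the structure from the slides: the `BufferedNode` class, the starting keys, and the buffer size of 2 are all assumptions, nodes never split, and a flush pushes every pending key down at once. It does show the core idea: an insert is an append to a buffer, and leaves are only touched when a buffer overflows.

```python
import bisect


class BufferedNode:
    """Index node with a write buffer (fixed pivots, no splitting)."""

    def __init__(self, pivots, children, buffer_size):
        self.pivots = pivots            # sorted keys separating the children
        self.children = children        # BufferedNode, or sorted list for a leaf
        self.buffer_size = buffer_size
        self.buffer = []                # pending inserts, oldest first

    def insert(self, key):
        self.buffer.append(key)         # cheap: no descent to a leaf
        if len(self.buffer) > self.buffer_size:
            self.flush()

    def flush(self):
        """Push every pending key one level down, updating in place."""
        pending, self.buffer = self.buffer, []
        for key in pending:
            child = self.children[bisect.bisect_right(self.pivots, key)]
            if isinstance(child, BufferedNode):
                child.insert(key)
            else:
                bisect.insort(child, key)


root = BufferedNode(
    pivots=[10, 20],
    children=[[1, 5], [10, 15], [20, 25]],
    buffer_size=2,
)
for key in [0, -1, 28, 29, -2, 11.5, 100]:
    root.insert(key)
```

After the seven inserts, only two flushes have touched the leaves, and 100 is still waiting in the root buffer.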
What about
Reads?
Looking up 20
Find the Path
Project Pending
Operations
Broken for Scans
Only Project Values Within
Range
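A sketch of range-limited projection, assuming a hypothetical `project` function and half-open key ranges `[lo, hi)` per leaf. Pending inserts from the buffers along the path are merged into a copy of the leaf, but only if they fall inside that leaf's range.

```python
import bisect


def project(leaf_values, path_buffers, lo, hi):
    """Return the leaf contents a reader should see.

    leaf_values:  sorted keys stored in the leaf
    path_buffers: pending inserts buffered on each node above the leaf
    lo, hi:       key range [lo, hi) this leaf is responsible for
    """
    merged = list(leaf_values)          # copy: the stored leaf is not modified
    for buffer in path_buffers:
        for key in buffer:
            if lo <= key < hi:          # skip keys that belong to other leaves
                i = bisect.bisect_left(merged, key)
                if i == len(merged) or merged[i] != key:
                    merged.insert(i, key)
    return merged


leaf = [10, 15]
root_buffer = [29, -2, 11.5]
view = project(leaf, [root_buffer], lo=10, hi=20)
```

Without the range check, a scan would see -2 and 29 projected into every leaf it visits, which is where the duplicated and out-of-order values come from.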
Hitchhiker vs Fractal
Path Copying or Not!
Fractal Trees update in-place
Path Copying or Not!
Hitchhiker Trees use path-copying
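For contrast with in-place updates, here is a path-copying sketch of the same buffered idea. It is an illustration only, not the project's actual code: the `Node` type, `BUFFER_SIZE`, and the starting keys are assumptions, and node splitting is left out.

```python
import bisect
from typing import NamedTuple, Tuple


class Node(NamedTuple):
    pivots: Tuple[float, ...]
    children: tuple                     # each child: a Node, or a tuple of keys (leaf)
    buffer: Tuple[float, ...] = ()


BUFFER_SIZE = 2


def insert(node, key):
    """Return a new root; the old root stays valid and unchanged."""
    buffer = node.buffer + (key,)
    if len(buffer) <= BUFFER_SIZE:
        return node._replace(buffer=buffer)      # copy just this node
    children = list(node.children)
    for k in buffer:                             # buffer overflow: flush down
        i = bisect.bisect_right(node.pivots, k)
        child = children[i]
        if isinstance(child, Node):
            children[i] = insert(child, k)
        else:
            children[i] = tuple(sorted(set(child) | {k}))
    return Node(node.pivots, tuple(children), ())


v0 = Node(pivots=(10, 20), children=((1, 5), (10, 15), (20, 25)))
v1 = insert(v0, 0)
v2 = insert(v1, -1)
v3 = insert(v2, 28)    # overflow: flushes 0, -1 and 28 into new leaves
```

Every version from `v0` to `v3` is still readable, and the middle leaf, which no insert touched, is shared between all of them.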
Flush Control
                  Total I/O   I/O per Flush   Avg I/O per Insert
B+ Tree           21          3               3
Fractal Tree      12          1 to 4          1.7
Hitchhiker Tree   5           5               0.7
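The averages follow from dividing total I/O by the seven inserts in the walkthrough (0, -1, 28, 29, -2, 11.5, 100). A quick check, with variable names chosen here for illustration:

```python
inserts = 7
total_io = {"B+ Tree": 21, "Fractal Tree": 12, "Hitchhiker Tree": 5}

# Average I/O operations per insert, rounded to one decimal place.
avg_io = {name: round(io / inserts, 1) for name, io in total_io.items()}

for name, avg in avg_io.items():
    print(f"{name}: {avg}")
```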
Real Branching Factors
B+ Trees have fan out of 1000-2000
Hitchhiker Trees have fan out of 100-200
But Hitchhiker Tree buffers hold 900-1000 elements!
I want to try it!
On Github
Datacrypt is Pluggable
Backend Storage
I/O Management
Serialization
Sorting Algorithm
Works with Redis
Called the Outboard API
Outboard
Looks like a hash map
Data stored off-heap in Redis
Functional data structures mean free snapshots
After a VM restart, just reconnect to Redis
Lifetime of in-memory data doesn’t need to be tied to lifetime of runtime memory
What’ll we build next?
Q&A
Thanks to:
Andy Chambers for JDBC Backend & GC Improvements
Casey Marshall for S3 Backend
(Prefix) Tries
(Hash) Array Mapped Tries
We add the fat node trick from B trees
We hash keys first for even distribution
No need to store full hash: prefix is enough
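A sketch of how a hash array mapped trie picks a child at each level, assuming 32-way fat nodes and using `zlib.crc32` purely as a stand-in 32-bit hash; real implementations use their own hash functions. Each level consumes the next 5 bits of the hash, so a key's position is decided by only as much of the hash as the tree is deep.

```python
import zlib

BITS = 5                       # 2**5 = 32 children per fat node
MASK = (1 << BITS) - 1


def hash_path(key, depth):
    """Child index to follow at each of the first `depth` levels."""
    h = zlib.crc32(key.encode("utf-8"))
    return [(h >> (BITS * level)) & MASK for level in range(depth)]


path = hash_path("apple", 3)   # three indexes, each in range(32)
```

Going one level deeper only appends another index; the earlier part of the path does not change.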


Editor's Notes

  • #3  Author, engineer, now consultant working on Mesos & distributed systems. Book signing at lunch today!
  • #6 Unfortunately, we’ll be a sad panda
  • #7 By copying the list, we get to be a happy panda
  • #9  explain the color scheme
  • #11 Introduce concept
  • #13 Remember this example? Let’s improve it
  • #15  something about segmented lists & their tradeoffs
  • #18  We’re going to talk about Binary Search Trees, B trees, B+ trees, Fractal trees, and Hitchhiker trees
  • #19  Note: sorted. Binary = 2 children per node
  • #25  CLRS book for algorithm examples. Tries are out of scope for this talk, but they’re how Scala, Clojure, and Elixir implement maps. ^^^ Cool hashing tricks, if we have time at the end
  • #27  B stands for “branching factor”
  • #48  Even a fractal tree needs a functional data structure for projection (hypothetical)
  • #49  If we scan, we get out of order & duplicated values
  • #50  Moral: be careful about what gets projected where
  • #54  We just wrote out 7 inserts in 5 iops
  • #63 We can encode data by prefixes on the trie