4.
Latch-free
Key Features
The latch-free approach ensures that a thread never yields in the face of a conflict,
which increases the level of concurrency on multi-core CPUs.
State changes are made using atomic compare-and-swap (CaS) instructions.
5.
Mapping Table
Key Features
To implement its latch-free design, the Bw-Tree modifies data with CaS instructions.
The Mapping Table is a helper structure that maps a logical page ID (PID) to a physical
pointer. Since inter-node links in the Bw-Tree are PIDs, a single CaS on the table
atomically updates every reference to a tree node.
[Figure: Mapping Table (PID → physical pointer) with entries for nodes L, P, R, S]
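The mapping-table indirection can be illustrated with a minimal, single-threaded Python sketch. The class and method names (`MappingTable`, `install`, `cas`) are illustrative, not from any Bw-Tree codebase, and a lock stands in for the hardware CaS instruction:

```python
import threading

class MappingTable:
    """Maps a logical page ID (PID) to a physical node pointer.

    A lock emulates the atomicity of the hardware CaS instruction in
    this sketch; a real Bw-Tree uses a lock-free atomic CAS.
    """
    def __init__(self):
        self._table = {}
        self._lock = threading.Lock()

    def install(self, pid, node):
        with self._lock:
            self._table[pid] = node

    def get(self, pid):
        return self._table[pid]

    def cas(self, pid, expected, new):
        # Atomically swing the PID's pointer iff it still equals `expected`.
        with self._lock:
            if self._table.get(pid) is expected:
                self._table[pid] = new
                return True
            return False

table = MappingTable()
old = ["K1", "K3"]                 # stand-in for a physical node
table.install("L", old)
new = ["K1", "K2", "K3"]
assert table.cas("L", old, new)        # succeeds: pointer unchanged
assert not table.cas("L", old, new)    # fails: someone already swapped it
```

Because every link in the tree is a PID, swinging this one table slot is enough to redirect all inbound references to a node at once.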
6.
Delta Updates
Key Features
Good multi-core processor performance depends on high CPU cache hit ratios.
Instead of updating memory in place (which invalidates CPU caches), the Bw-Tree
prepends changes to a node as delta records.
[Figure: base node (K1,V1)…(K8,V8) with a delta chain of Δinsert [K4, V4] and Δdelete [K5, V5]]
9.
Mapping Table
Implementation
Every node has two inbound pointers, one from its parent and one from its left sibling,
so it is hard to update both pointers atomically.
The Bw-Tree uses the mapping table to translate a PID into a physical pointer.
This way, no matter how many inbound links a node has, a modification
can be made atomically with a single CaS instruction.
Logical pointers: parent-child and left-right sibling pointers (PIDs).
Physical pointers: mapping table → node, and pointers within a logical
node (e.g. the pointer used by a merge SMO).
11.
Logical Node
Base Node
inner base node: sorted (key, nodeID) array
leaf base node : sorted (key, value) array
Implementation
Metadata
low-key       : the smallest key stored at the logical node
high-key      : the smallest key of the logical node's right sibling
right sibling : the ID of the logical node's right sibling
size          : the number of items in the logical node
depth         : the number of records in the logical node's delta chain
offset        : the location of the inserted or deleted item in the base node
LSN           : used to enforce the WAL protocol
[Figure: logical node = metadata + base node (K1,V1)…(K8,V8) + delta chain (Δinsert [K4, V4], Δdelete [K5, V5]); inbound pointers target the chain head]
12.
Logical Node
Base Node & Delta Chain
inner base node: sorted (key, nodeID) array
leaf base node : sorted (key, value) array
The Delta Chain is a singly linked list that holds the history of modifications to the node.
All inbound pointers point to the head of the Delta Chain.
Implementation
(metadata fields and figure as on slide 11)
13.
Delta Updates
Implementation
All page state changes are made by creating a delta record and prepending it
to the existing page state.
[Figure: delta chain (Δinsert [K4, V4] → Δdelete [K5, V5]) prepended to base node (K1,V1)…(K8,V8)]
14.
Delta Updates
Implementation
All page state changes are made by creating a delta record and prepending it
to the existing page state.
Only one delta update can succeed per CaS instruction;
if the CaS fails, the thread retries.
[Figure: delta chain (Δinsert [K4, V4] → Δdelete [K5, V5]) prepended to base node (K1,V1)…(K8,V8)]
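The prepend-with-retry loop can be sketched in Python. This is a single-threaded illustration: `Cell.cas` is a plain compare-and-set standing in for the atomic CaS on the mapping-table slot, and the `Delta`/`Cell` names are made up for this sketch:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Delta:
    op: str                          # "insert" or "delete"
    key: str
    value: Optional[str]
    next: object = None              # next delta record, or the base node

class Cell:
    """One mapping-table slot for a PID."""
    def __init__(self, head):
        self.head = head
    def cas(self, expected, new):
        # Succeeds only if no other thread swapped the head in between.
        if self.head is expected:
            self.head = new
            return True
        return False

def prepend_delta(cell: Cell, delta: Delta) -> None:
    # Retry loop: re-read the head and re-link the delta until the CaS lands.
    while True:
        head = cell.head
        delta.next = head
        if cell.cas(head, delta):
            return

base = {"K1": "V1", "K3": "V3"}          # stand-in for the base node
cell = Cell(base)
prepend_delta(cell, Delta("insert", "K4", "V4"))
prepend_delta(cell, Delta("delete", "K5", None))
assert cell.head.op == "delete" and cell.head.next.op == "insert"
```

When two threads race (slides 16–18), both build a delta pointing at the same head; the CaS lets exactly one land, and the loser rebuilds its link and retries.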
16.
Delta Updates
Implementation
[Figure: mapping-table pointer to the chain head; threads t1 and t2 each hold a new delta record (Δinsert [K2, V2] and Δdelete [K8, V8]) pointing at the head]
1. t1 & t2 each create a delta record and prepend it to the head.
17.
Delta Updates
Implementation
[Figure: t1's CaS on the mapping table succeeds; t2's CaS fails]
1. t1 & t2 each create a delta record and prepend it to the head.
2. Both issue a CaS instruction on the mapping table;
only t1 succeeds.
18.
Delta Updates
Implementation
[Figure: t2 aborts its delta record and retries]
1. t1 & t2 each create a delta record and prepend it to the head.
2. Both issue a CaS instruction on the mapping table;
only t1 succeeds.
3. t2 aborts its delta record and retries the whole process.
20.
Page Search
Implementation
1. The thread traverses the delta chain looking for the search key.
2. If the key is present in the delta chain and the delta record is
I. an insert or update:
the search succeeds and returns the record
II. a delete:
the search fails
[Figure: delta chain Δupdate [K8, V'8] → Δinsert [K4, V4] → Δdelete [K5, V5] in front of base node (K1,V1)…(K8,V8)]
21.
Page Search
Implementation
1. The thread traverses the delta chain looking for the search key.
2. If the key is present in the delta chain and the delta record is
I. an insert or update:
the search succeeds and returns the record
II. a delete:
the search fails
3. Otherwise, the thread performs a binary search on the base node.
[Figure: delta chain Δupdate [K8, V'8] → Δinsert [K4, V4] → Δdelete [K5, V5] in front of base node (K1,V1)…(K8,V8)]
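The three search steps can be sketched in Python. The `Delta` record and `search` function are illustrative names for this sketch only; the newest delta for a key shadows everything behind it:

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Optional

@dataclass
class Delta:
    op: str                      # "insert", "update", or "delete"
    key: str
    value: Optional[str]
    next: object = None          # next delta, or None at the chain's end

def search(head, base_keys, base_vals, key):
    # 1. Walk the delta chain first: the newest record for `key` wins.
    node = head
    while isinstance(node, Delta):
        if node.key == key:
            if node.op in ("insert", "update"):
                return node.value        # found in the chain
            return None                  # Δdelete shadows the base node
        node = node.next
    # 3. Fall back to binary search on the sorted base node.
    i = bisect_left(base_keys, key)
    if i < len(base_keys) and base_keys[i] == key:
        return base_vals[i]
    return None

keys, vals = ["K1", "K3", "K5", "K8"], ["V1", "V3", "V5", "V8"]
chain = Delta("update", "K8", "V'8",
        Delta("insert", "K4", "V4",
        Delta("delete", "K5", None, None)))
assert search(chain, keys, vals, "K4") == "V4"    # from Δinsert
assert search(chain, keys, vals, "K8") == "V'8"   # Δupdate shadows base
assert search(chain, keys, vals, "K5") is None    # Δdelete shadows base
assert search(chain, keys, vals, "K1") == "V1"    # binary search on base
```

Note that the chain is walked head-first, so a key that was updated and later deleted correctly reports the delete.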
23.
Consolidation
Implementation
As the delta chain grows, the overhead of traversing the tree grows with it.
[Figure: ① a new base node (K1,V1)…(K8,V8) is created beside the old node and its delta chain; the mapping table (PID → Ptr) still points at the old chain]
1. The thread creates a new base node.
24.
Consolidation
Implementation
As the delta chain grows, the overhead of traversing the tree grows with it.
[Figure: ② the Δinsert and Δdelete records are applied to the new base node]
1. The thread creates a new base node.
2. It applies the delta chain to the new base node.
25.
Consolidation
Implementation
1. The thread creates a new base node.
2. It applies the delta chain to the new base node.
3. It swings the mapping table's physical pointer to the new node with a CaS,
and reclaims the old logical node once it is safe to do so (using the
epoch mechanism).
[Figure: ③ CaS swings the PID's pointer to the new base node (K1,V1) (K3,V3) (K4,V4) (K8,V8)]
As the delta chain grows, the overhead of traversing the tree grows with it.
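The three consolidation steps can be sketched as follows. This is a single-threaded Python illustration: `Cell.cas` is a plain compare-and-set standing in for the mapping-table CaS, the base node is modeled as a dict, and all names are invented for the sketch:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Delta:
    op: str
    key: str
    value: Optional[str]
    next: object = None

class Cell:
    def __init__(self, head):
        self.head = head
    def cas(self, expected, new):
        if self.head is expected:
            self.head = new
            return True
        return False

def consolidate(cell):
    head = cell.head                       # snapshot of the current chain
    # Collect the newest action per key while walking toward the base node.
    seen, node = {}, head
    while isinstance(node, Delta):
        seen.setdefault(node.key, node)    # first (newest) record wins
        node = node.next
    base = node                            # the old base node (a dict here)
    merged = dict(base)                    # step 1: new base node
    for key, d in seen.items():            # step 2: apply the delta chain
        if d.op == "delete":
            merged.pop(key, None)
        else:
            merged[key] = d.value
    # Step 3: swing the pointer; fail (caller retries) if the chain moved.
    return cell.cas(head, merged)

cell = Cell({"K1": "V1", "K3": "V3", "K5": "V5", "K8": "V8"})
cell.head = Delta("delete", "K5", None,
            Delta("insert", "K4", "V4", cell.head))
assert consolidate(cell)
assert cell.head == {"K1": "V1", "K3": "V3", "K4": "V4", "K8": "V8"}
```

If another thread prepends a delta between the snapshot and the final CaS, the CaS fails and the consolidation is simply retried, so no update is ever lost.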
26.
Range Search
Implementation
[Figure: pages P (K1–K8), Q (K10–K16), R (K17–K22); the cursor starts at the minimum key]
A range scan is specified by a key range.
A cursor records how far the scan has progressed.
27.
Range Search
Implementation
[Figure: pages P (K1–K8), Q (K10–K16), R (K17–K22); the cursor starts at the minimum key]
A range scan is specified by a key range.
A cursor records how far the scan has progressed.
1. Construct a vector of records that could be part of the scan.
28.
Range Search
Implementation
[Figure: pages P (K1–K8), Q (K10–K16), R (K17–K22); the cursor starts at the minimum key]
A range scan is specified by a key range.
A cursor records how far the scan has progressed.
1. Construct a vector of records that could be part of the scan.
2. The "next record" operation is atomic, but the entire scan is not.
29.
Range Search
Implementation
[Figure: pages P (K1–K8), Q (K10–K16), R (K17–K22); the cursor marks scan progress]
If an update has occurred, the record vector must be reconstructed,
so each step checks whether an update has affected the
remaining subrange of the record vector.
30.
Garbage Collection
Implementation
1. In a latch-free environment, readers can be active on a page even while it is being updated.
2. An old page must not be deallocated while other threads are still accessing it.
3. The "epoch" mechanism protects objects from premature deallocation.
There are two schemes for garbage collection.
32.
Garbage Collection
Centralized scheme
Implementation
1. This is the final state of one epoch: no threads remain enrolled
in the epoch, ∴ its nodes are ready to be reclaimed.
2. A new epoch object is appended at fixed intervals.
[Figure: ② epoch list — Epoch E1 (count = 0) → Epoch E2 (count = 0) → NULL]
33.
Garbage Collection
Centralized scheme
Implementation
1. This is the final state of one epoch: no threads remain enrolled
in the epoch, ∴ its nodes are ready to be reclaimed.
2. A new epoch object is appended at fixed intervals.
3. A thread must enroll itself in the current epoch object
before accessing the tree.
[Figure: ③ epoch list — Epoch E1 (count = 0) → Epoch E2 (count = 2) → NULL]
34.
Garbage Collection
Centralized scheme
Implementation
[Figure: ④ epoch list — Epoch E1 (count = 0) → Epoch E2 (count = 2); a CaS adds a node to E2's garbage list]
1. This is the final state of one epoch: no threads remain enrolled
in the epoch, ∴ its nodes are ready to be reclaimed.
2. A new epoch object is appended at fixed intervals.
3. A thread must enroll itself in the current epoch object
before accessing the tree.
4. t2 adds a new node to the epoch's garbage list, but nodes are not
reclaimed until all enrolled threads exit.
38.
Garbage Collection
Decentralized scheme
Implementation
Each worker thread maintains a private epoch (e_local) and a
linked list of objects marked for deletion, each tagged with an epoch (e_delete).
[Figure: global epoch e_global = 103; thread t1 with e_local = 102, thread t2 with e_local = 103; garbage tagged e_delete = 98, 100, 103]
39.
Garbage Collection
Decentralized scheme
Implementation
1. A thread copies e_global to its e_local at the beginning of each new
operation.
[Figure: ① e_global = 102; t1 copies it (e_local = 102); t2 remains at e_local = 100; garbage tagged e_delete = 99]
40.
Garbage Collection
Decentralized scheme
Implementation
1. A thread copies e_global to its e_local at the beginning of each new
operation.
2. When garbage is created, it is tagged with the latest e_global.
[Figure: ② e_global = 102; t1 at e_local = 102 with garbage tagged e_delete = 99 and e_delete = 102; t2 at e_local = 100]
41.
Garbage Collection
Decentralized scheme
Implementation
1. A thread copies e_global to its e_local at the beginning of each new
operation.
2. When garbage is created, it is tagged with the latest e_global.
3. At the end of the operation, the thread copies e_global to its e_local again.
[Figure: ③ e_global is periodically incremented to 103; t1 copies it (e_local = 103); garbage tagged e_delete = 99 and e_delete = 102; t2 at e_local = 100]
42.
Garbage Collection
Decentralized scheme
Implementation
1. A thread copies e_global to its e_local at the beginning of each new
operation.
2. When garbage is created, it is tagged with the latest e_global.
3. At the end of the operation, the thread copies e_global to its e_local again.
4. Garbage collection: a thread retrieves e_local from every other thread and
reclaims any garbage whose e_delete tag is less than the minimum e_local.
[Figure: ④ e_global = 103; t1 (e_local = 103) holds garbage tagged e_delete = 99 and e_delete = 102; t2 is at e_local = 100, so the minimum e_local is 100 and only the e_delete = 99 entry is reclaimed]
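The four steps of the decentralized scheme can be sketched in Python. This is a single-process illustration, not a concurrent implementation: thread state is modeled as dict entries, and the `EpochGC` name and its methods are invented for the sketch. The numbers mirror the example on the slides:

```python
class EpochGC:
    """Sketch of the decentralized epoch scheme."""
    def __init__(self):
        self.e_global = 0
        self.e_local = {}          # thread id -> epoch at operation start
        self.garbage = []          # (e_delete tag, object) pairs

    def begin_op(self, tid):
        # Steps 1 and 3: copy e_global into the thread's e_local.
        self.e_local[tid] = self.e_global

    def retire(self, obj):
        # Step 2: tag newly created garbage with the latest e_global.
        self.garbage.append((self.e_global, obj))

    def collect(self):
        # Step 4: reclaim anything strictly older than the minimum e_local.
        min_local = min(self.e_local.values(), default=self.e_global)
        reclaimed = [o for (e, o) in self.garbage if e < min_local]
        self.garbage = [(e, o) for (e, o) in self.garbage if e >= min_local]
        return reclaimed

gc = EpochGC()
gc.e_global = 99
gc.retire("old-node-A")            # tagged e_delete = 99
gc.e_global = 100
gc.begin_op("t2")                  # t2's e_local = 100
gc.e_global = 102
gc.retire("old-node-B")            # tagged e_delete = 102
gc.e_global = 103
gc.begin_op("t1")                  # t1's e_local = 103
assert gc.collect() == ["old-node-A"]     # min e_local = 100 > 99
```

Because t2 started its operation at epoch 100, any garbage tagged 100 or later might still be visible to it, so only the epoch-99 node is safe to reclaim.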
44.
Node Split
SMO
[Figure: ①–② mapping table (L, P, R, S); the new node S sits beside node L; no CaS is involved yet]
① An accessor thread notices that a node's size ≥ the split threshold.
② It creates a new base node S holding the consolidated upper half of node L.
No CaS is needed:
node S is still invisible to other threads, and node L still has all the records.
45.
Node Split
SMO
③ Add a Δsplit record to the original node L (installed with a CaS).
The Δsplit record holds the separator key Ks and a logical pointer to the new node;
it also informs other threads that an SMO is in progress.
[Figure: ③ CaS installs the Δsplit on node L in the mapping table (L, P, R, S)]
46.
Node Split
SMO
③ Add a Δsplit record to the original node L (installed with a CaS).
The Δsplit record holds the separator key Ks and a logical pointer to the new node;
it also informs other threads that an SMO is in progress.
[Figure: the Δsplit on node L — the tree is now in the "half split" state]
47.
Node Split
SMO
④ Add a Δseparator record to the parent node P (installed with a CaS).
The Δseparator record holds the separator keys KL and KS and a logical
pointer to the new node.
[Figure: ④ CaS installs the Δseparator on node P above the Δsplit on node L]
48.
Node Split
SMO
What if the parent node has been merged into another node?
[Figure: mapping table (L, P, R, S) with the Δsplit on L and the Δseparator pending]
49.
Node Split
SMO
What if the parent node has been merged into another node?
The epoch mechanism guarantees that we will see the Δremove record of the parent node,
∴ we can find the appropriate left sibling of the parent and insert the Δseparator there.
[Figure: mapping table (L, P, R, S) with the Δsplit on L and the Δseparator on the parent's left sibling]
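Steps ①–③ (up to the "half split" state) can be sketched in Python. A plain dict stands in for the mapping table, and the `Node`/`SplitDelta`/`half_split` names are invented for this sketch; a real Bw-Tree installs the Δsplit with a CaS:

```python
from dataclasses import dataclass

@dataclass
class Node:
    records: dict

@dataclass
class SplitDelta:
    separator: str       # Ks: keys >= Ks now live in the new sibling
    side_pid: str        # logical pointer (PID) to the new node
    next: object = None  # rest of node L's old state

mapping = {}             # PID -> physical pointer (dict stands in for the table)

def half_split(pid_l, new_pid):
    l = mapping[pid_l]
    keys = sorted(l.records)
    ks = keys[len(keys) // 2]                      # separator key Ks
    # ② new node S: consolidated upper half, invisible until installed
    upper = {k: v for k, v in l.records.items() if k >= ks}
    mapping[new_pid] = Node(upper)
    # ③ Δsplit prepended to L (a CaS in a real Bw-Tree)
    mapping[pid_l] = SplitDelta(ks, new_pid, next=l)
    return ks

mapping["L"] = Node({"K1": "V1", "K3": "V3", "K5": "V5", "K8": "V8"})
ks = half_split("L", "S")
assert ks == "K5"
assert mapping["S"].records == {"K5": "V5", "K8": "V8"}
assert isinstance(mapping["L"], SplitDelta)
```

Readers that find the Δsplit on L redirect lookups for keys ≥ Ks to node S via its PID; step ④ then makes the split visible in the parent.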
50.
Node Merge
SMO
[Figure: ① mapping table (L, P, R, S); a Δremove is about to be placed on node R]
① An accessor thread notices that a node's size ≤ the merge threshold.
A thread is only allowed to merge a node into its left sibling.
51.
Node Merge
SMO
② Add a Δremove record to node R.
This stops all further use of node R:
a thread encountering a Δremove reads or updates the contents
previously contained in R by going to the left sibling.
[Figure: ② the Δremove installed on node R in the mapping table (L, P, R, S)]
52.
Node Merge
SMO
③ Add a Δmerge record to the left sibling, node L (installed with a CaS).
The Δmerge record physically points to node R;
after the Δmerge, L and R are treated as parts of the same logical node.
The Δmerge record contains a merge key (copied from R's low-key).
[Figure: ③ CaS installs the Δmerge on node L, which now physically points to node R; L and R form logical node L]
53.
Node Merge
SMO
④ Add a Δseparator record to the parent node P (installed with a CaS).
This record indicates that R is deleted and gives L's new key range
(low-key from L, high-key from R).
⑤ Delete the Δremove record and node R's PID,
but only after the epoch mechanism says it is safe to do so.
[Figure: ④–⑤ the Δseparator on node P; the Δremove and R's mapping-table entry are later reclaimed]
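Steps ② and ③ can be sketched the same way as the split. Again a dict stands in for the mapping table, the delta classes are invented names, and the CaS installs are shown as plain assignments:

```python
from dataclasses import dataclass

@dataclass
class Node:
    records: dict

@dataclass
class RemoveDelta:
    next: object             # node R's old state

@dataclass
class MergeDelta:
    merge_key: str           # copied from R's low-key
    right: object            # physical pointer to node R's contents
    next: object             # node L's old state

mapping = {
    "L": Node({"K1": "V1", "K3": "V3"}),
    "R": Node({"K5": "V5", "K8": "V8"}),
}

def merge_into_left(pid_l, pid_r):
    r = mapping[pid_r]
    # ② Δremove on R stops all further use of R
    mapping[pid_r] = RemoveDelta(next=r)
    # ③ Δmerge on L physically points at R's contents (a CaS in a real tree)
    mapping[pid_l] = MergeDelta(min(r.records), r, next=mapping[pid_l])

def logical_records(pid):
    # After the Δmerge, readers see L and R as one logical node.
    node, out = mapping[pid], {}
    while isinstance(node, MergeDelta):
        out.update(node.right.records)
        node = node.next
    out.update(node.records)
    return out

merge_into_left("L", "R")
assert logical_records("L") == {"K1": "V1", "K3": "V3",
                                "K5": "V5", "K8": "V8"}
```

Step ④ then publishes the widened key range in the parent, and step ⑤ reclaims R's PID once the epoch mechanism says no reader can still be inside R.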
55.
Possible Optimizations
Optimization
• Retry a failed CaS not from the root but from the parent.
• Don't abandon a failed consolidation:
prepend the newly inserted delta to the already-consolidated node and retry.
• Consolidate periodically.
• Replace CaS with TaS.
56.
References
• Justin J. Levandoski, David B. Lomet, and Sudipta Sengupta. 2013.
The Bw-Tree: A B-tree for New Hardware Platforms.
• Ziqi Wang, Andrew Pavlo, Hyeontaek Lim, Viktor Leis,
Huanchen Zhang, Michael Kaminsky, and David G. Andersen. 2018.
Building a Bw-Tree Takes More Than Just Buzz Words.