The Bw-Tree
JunKyu Kang
DaeIn Lee
2
Table of Contents
Key Features, Implementation, SMO (structure modification operations), Optimization
4
Latch-free
Key Features
A latch-free approach ensures that a thread never blocks or yields in the face of conflict,
which raises the level of concurrency on multi-core CPUs.
State changes are made with atomic compare-and-swap (CaS) instructions.
5
Mapping Table
Key Features
To implement the latch-free design, the Bw-Tree modifies data with CaS instructions.
The mapping table maps logical page IDs (PIDs) to physical page pointers, and the links
between Bw-Tree nodes are stored as PIDs. A single update to a mapping table entry therefore
atomically redirects every reference to that tree node.
[Figure: mapping table with columns PID and Ptr and entries L, P, R, S, pointing to Node L, Node P, Node R and Node S; a Δremove record is also shown.]
6
Delta Updates
Key Features
Good multi-core processor performance depends on high CPU cache hit ratios.
Instead of updating memory in place (which invalidates cache lines), the Bw-Tree
prepends delta records that describe the changes.
[Figure: base node with keys K1 K3 K5 K8 and values V1 V3 V5 V8; the delta records Δdelete [K5, V5] and Δinsert [K4, V4] form its Delta Chain.]
7
High Performance
Key Features
[Figure: the base node (K1 K3 K5 K8 / V1 V3 V5 V8) with its Delta Chain (Δdelete [K5, V5], Δinsert [K4, V4]), a "+" sign, and "= High Performance".]
9
Mapping Table
Implementation
Every node has two inbound pointers: one from its parent and one from its left sibling.
Updating both pointers atomically is hard.
The Bw-Tree uses the mapping table to translate a PID into a physical pointer.
This way, no matter how many inbound pointers a node has, it can be
modified atomically with a single CaS instruction.
Logical pointers: parent-child and left-right sibling pointers.
Physical pointers: mapping table to node, and pointers within a logical
node (e.g. the pointer used by the merge SMO).
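The indirection described above can be sketched in a few lines. This is only an illustrative model, not the authors' implementation: the class name `MappingTable` and its methods are hypothetical, and because Python exposes no CaS instruction, a lock stands in for the atomicity that the hardware instruction would provide.

```python
import threading


class MappingTable:
    """Toy mapping table: PID -> physical node reference.

    Assumption of this sketch: a lock simulates the atomicity of a single
    hardware CaS. A real Bw-Tree uses the CPU instruction and no latch.
    """

    def __init__(self):
        self._slots = {}
        self._guard = threading.Lock()

    def load(self, pid):
        return self._slots.get(pid)

    def cas(self, pid, expected, new):
        """Install `new` only if the slot still holds `expected`."""
        with self._guard:
            if self._slots.get(pid) is expected:
                self._slots[pid] = new
                return True
            return False


table = MappingTable()
old_node = {"keys": ["K1", "K3"]}
new_node = {"keys": ["K1", "K2", "K3"]}
table.cas("L", None, old_node)            # first install of PID "L"

# The parent and the left sibling both store the PID, not a raw pointer.
parent_child_link = "L"
left_sibling_link = "L"

swapped = table.cas("L", old_node, new_node)      # one CaS on one slot
stale_swap = table.cas("L", old_node, new_node)   # fails: slot has moved on
```

After the single successful `cas`, both inbound links resolve to the new node, which is the property the slide relies on.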
12
Logical Node
Implementation
Base Node
inner base node: sorted (key, nodeID) array
leaf base node: sorted (key, value) array
Delta Chain
A singly linked list that holds the history of modifications.
All inbound pointers point to the head of the Delta Chain.
Metadata
low-key: the smallest key stored at the logical node
high-key: the smallest key of the logical node's right sibling
right sibling: the ID of the logical node's right sibling
size: the number of items in the logical node
depth: the number of records in the logical node's Delta Chain
offset: the location of the inserted or deleted item in the base node
LSN: used to enforce the WAL protocol
[Figure: the inbound pointer points at the head of the Delta Chain (Δinsert [K4, V4], Δdelete [K5, V5]), which ends at the Base Node (K1 K3 K5 K8 / V1 V3 V5 V8); each delta record and the base node carries its own metadata.]
14
Delta Updates
Implementation
All page state changes are made by creating a delta record and prepending it
to the existing page state.
When several threads attempt a delta update at once, only one CaS can succeed.
A thread whose CaS fails retries.
[Figure: Delta Chain (Δinsert [K4, V4], Δdelete [K5, V5]) on top of the Base Node (K1 K3 K5 K8 / V1 V3 V5 V8), each with metadata.]
18
Delta Updates
Implementation
1. t1 and t2 each create a delta record that points to the current head of the delta chain.
2. Both try a CaS instruction on the mapping table. Only t1 succeeds.
3. t2 discards its delta record and retries the whole process.
[Figure: the existing chain (Δinsert [K4, V4], Δdelete [K5, V5], base node K1 K3 K5 K8 / V1 V3 V5 V8) with two competing new records, Δinsert [K2, V2] and Δdelete [K8, V8], created by threads t1 and t2; the losing thread is marked "Abort & Retry".]
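The three steps above can be replayed in a small single-threaded model. This is a sketch under stated assumptions: `cas` is a plain function standing in for the atomic instruction, the names `Delta`, `prepend` and `mapping_table` are hypothetical, and which thread owns which record is an arbitrary choice made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Delta:
    op: str                # "insert" or "delete"
    key: str
    value: Optional[str]
    next: object           # previous head of the chain (a delta or the base node)


# PID -> head of the logical node; the base node is modelled as a tuple of keys.
mapping_table = {"P": ("K1", "K3", "K5", "K8")}


def cas(pid, expected, new):
    """Single-threaded stand-in for an atomic compare-and-swap (assumption)."""
    if mapping_table[pid] is expected:
        mapping_table[pid] = new
        return True
    return False


def prepend(pid, op, key, value=None):
    """Retry loop: re-read the head, build a delta on top of it, try CaS."""
    attempts = 0
    while True:
        attempts += 1
        head = mapping_table[pid]
        if cas(pid, head, Delta(op, key, value, head)):
            return attempts


# Step 1: both threads read the same head and build their records on it.
head_seen = mapping_table["P"]
d1 = Delta("insert", "K2", "V2", head_seen)    # built by t1
d2 = Delta("delete", "K8", None, head_seen)    # built by t2

# Step 2: both try CaS; only the first one can succeed.
t1_ok = cas("P", head_seen, d1)
t2_ok = cas("P", head_seen, d2)                # fails, the head is now d1

# Step 3: t2 drops d2 and retries the whole operation.
retry_attempts = prepend("P", "delete", "K8")
```

The failed record `d2` is simply discarded because it was never reachable from the mapping table, so no other thread can have seen it.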
21
Page Search
Implementation
1. The thread traverses the delta chain looking for the search key.
2. If the key is present in the delta chain and the delta record is
I. an insert or update: the search succeeds and returns the record
II. a delete: the search fails
3. If the key is not in the delta chain, the thread performs a binary search on the base node.
[Figure: delta chain with Δdelete [K5, V5], Δinsert [K4, V4] and Δupdate [K8, V'8] on top of the base node (K1 K3 K5 K8 / V1 V3 V5 V8), each with metadata.]
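The search procedure above can be sketched as follows. It is a minimal illustration, assuming the delta chain is a newest-first list of `(op, key, value)` tuples; the order of the three records is an assumption, and the function name `search` is hypothetical.

```python
from bisect import bisect_left

# Base node: sorted keys with a parallel array of values.
base_keys = ["K1", "K3", "K5", "K8"]
base_vals = ["V1", "V3", "V5", "V8"]

# Delta chain, newest record first (ordering assumed for this example).
delta_chain = [
    ("update", "K8", "V'8"),
    ("insert", "K4", "V4"),
    ("delete", "K5", None),
]


def search(key):
    """Return the value for `key`, or None if the key is absent."""
    # Steps 1 and 2: the newest delta that mentions the key decides the result.
    for op, k, v in delta_chain:
        if k == key:
            return v if op in ("insert", "update") else None
    # Step 3: fall back to binary search on the sorted base node.
    i = bisect_left(base_keys, key)
    if i < len(base_keys) and base_keys[i] == key:
        return base_vals[i]
    return None
```

With this data, `search("K4")` finds the inserted record, `search("K5")` fails because of the delete delta, and `search("K3")` is answered from the base node.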
25
Consolidation
Implementation
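The longer the delta chain, the higher the overhead of traversing the tree.
1. The thread creates a new base node.
2. It applies the delta chain to the new base node.
3. It changes the mapping table's physical pointer to the new node with CaS
and reclaims the old logical node once it is safe to do so (using the
epoch mechanism).
[Figure: the old logical node (base node K1 K3 K5 K8 / V1 V3 V5 V8 with Δdelete [K5, V5] and Δinsert [K4, V4]) and the new base node (K1 K3 K4 K8 / V1 V3 V4 V8); the mapping table entry (PID, Ptr) is switched to the new base node by CaS (③).]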
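A sketch of the three consolidation steps, under the same simplifying assumptions as before: the delta chain is a newest-first list of tuples, a plain identity check stands in for CaS, and the names `consolidate`, `mapping_table` and `retired` are hypothetical.

```python
def consolidate(base, deltas):
    """Return a new sorted base node with the delta chain applied.

    `base` is a list of (key, value) pairs; `deltas` is newest-first.
    """
    records = dict(base)                       # step 1: copy into a new node
    for op, key, value in reversed(deltas):    # step 2: replay oldest-first
        if op == "delete":
            records.pop(key, None)
        else:                                  # insert or update
            records[key] = value
    return sorted(records.items())


old_base = [("K1", "V1"), ("K3", "V3"), ("K5", "V5"), ("K8", "V8")]
old_chain = [("insert", "K4", "V4"), ("delete", "K5", None)]
old_node = (old_chain, old_base)
mapping_table = {"P": old_node}

new_base = consolidate(old_base, old_chain)
new_node = ([], new_base)

# Step 3: single-threaded stand-in for CaS on the mapping table slot.
retired = []
if mapping_table["P"] is old_node:
    mapping_table["P"] = new_node
    retired.append(old_node)   # freed later, once the epoch mechanism allows it
```

If the identity check failed, another thread would have changed the node in the meantime and the freshly built base node would have to be discarded or rebuilt.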
28
Range Search
Implementation
A range scan is specified by a key range.
A cursor records how far the scan has progressed.
1. Construct a vector of the records that could be part of the scan.
2. The "next record" operation is atomic, but the entire scan is not.
[Figure: three leaf pages (Page P, Page Q, Page R) holding keys K1 K3 K5 K8, K10 K13 K15 K16 and K17 K19 K20 K22 with their values; "Min key" and "cursor" mark positions in the scan.]
29
Range Search
Implementation
If an update has occurred, the record vector is reconstructed.
A check is therefore needed to see whether an update has affected the
subrange covered by the record vector.
[Figure: the same three leaf pages (Page P, Page Q, Page R) with "Min key" and "cursor".]
30
Garbage Collection
Implementation
1. The environment is latch-free, so readers can be active on a page while it is being updated.
2. An old page must not be deallocated while other threads are still accessing it.
3. An "epoch" mechanism protects objects from being deallocated too early.
There are two schemes for garbage collection.
34
Garbage Collection
Centralized scheme
Implementation
1. This is the final state of an epoch: no threads are enrolled in it,
so its garbage nodes are ready to be reclaimed.
2. A new epoch object is appended at fixed intervals.
3. A thread must enroll itself in the current epoch object
before accessing the tree.
4. t2 adds a new node to the garbage list. Nodes are not
reclaimed until all enrolled threads have exited.
[Figure: linked list of epoch objects, Epoch E1 (Count = 0) and Epoch E2 (Count = 2), terminated by NULL; a garbage node is attached with CaS.]
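The centralized scheme can be modelled roughly as below. This is a single-threaded sketch, not the actual implementation: the class and method names are hypothetical, the counter update would be an atomic increment in a real system, and the garbage list would be extended with CaS rather than `append`.

```python
class Epoch:
    def __init__(self):
        self.count = 0      # number of threads currently enrolled
        self.garbage = []   # nodes retired while this epoch was current


class EpochManager:
    def __init__(self):
        self.epochs = [Epoch()]

    def enter(self):
        epoch = self.epochs[-1]     # enroll in the current (newest) epoch
        epoch.count += 1            # atomic increment in a real implementation
        return epoch

    def leave(self, epoch):
        epoch.count -= 1

    def retire(self, node):
        self.epochs[-1].garbage.append(node)

    def advance(self):
        self.epochs.append(Epoch())     # done at fixed intervals

    def reclaim(self):
        """Free the garbage of old epochs, oldest first, while none is in use."""
        freed = []
        while len(self.epochs) > 1 and self.epochs[0].count == 0:
            freed.extend(self.epochs.pop(0).garbage)
        return freed


mgr = EpochManager()
e_t1 = mgr.enter()
e_t2 = mgr.enter()
mgr.retire("old node L")
mgr.advance()
before = mgr.reclaim()      # nothing: t1 and t2 are still enrolled
mgr.leave(e_t1)
mgr.leave(e_t2)
after = mgr.reclaim()       # the old epoch is empty, its garbage is freed
```

Every `enter` and `leave` writes to the same shared counter, which is the source of the contention discussed on the next slide.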
36
Garbage Collection
Centralized scheme
Implementation
Updating the global variable "Count" becomes a bottleneck
because of cache coherence traffic.
How can the use of global memory be avoided?
[Figure: Epoch E1 (Count = 0) and Epoch E2 (Count = 2), terminated by NULL, with a CaS on the shared list.]
38
Garbage Collection
Decentralized scheme
Implementation
Each worker thread maintains a private epoch (elocal) and a
linked list of objects that are marked for deletion (edelete).
[Figure: a Global Epoch (eglobal = 103) and local threads, each with its own elocal (the values 102, 103 and 100 are shown); one thread holds a list of garbage objects tagged edelete = 98, 100 and 103.]
42
Garbage Collection
Decentralized scheme
Implementation
1. A thread copies eglobal to its elocal at the beginning of a new operation.
2. When garbage is created, it is tagged with the latest eglobal as its edelete.
3. At the end of the operation, the thread copies eglobal to its elocal again.
4. The thread initiates garbage collection: it retrieves elocal from all other
threads and reclaims any object whose edelete is less than the minimum elocal.
[Figure: Global Epoch eglobal = 103, incremented periodically; local thread t1 has elocal = 103 and garbage tagged edelete = 99 and edelete = 102; local thread t2 has elocal = 100, so the minimum elocal is 100 and the object tagged 99 can be reclaimed (④ reclaim).]
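The four steps above can be sketched with the numbers from the figure. This is an illustrative model only: `GlobalEpoch` and `Worker` are hypothetical names, and the periodic increment of the global epoch is left out.

```python
class GlobalEpoch:
    def __init__(self, value):
        self.value = value      # incremented periodically in a real system


class Worker:
    def __init__(self, clock, e_local):
        self.clock = clock
        self.e_local = e_local
        self.garbage = []       # list of (e_delete, object) pairs

    def begin_op(self):
        self.e_local = self.clock.value                 # step 1

    def retire(self, obj):
        self.garbage.append((self.clock.value, obj))    # step 2

    def end_op(self):
        self.e_local = self.clock.value                 # step 3

    def collect(self, all_workers):
        # Step 4: objects retired before the slowest thread's epoch are safe.
        horizon = min(w.e_local for w in all_workers)
        freed = [obj for e, obj in self.garbage if e < horizon]
        self.garbage = [(e, obj) for e, obj in self.garbage if e >= horizon]
        return freed


clock = GlobalEpoch(103)
t1 = Worker(clock, e_local=103)
t1.garbage = [(99, "node A"), (102, "node B")]
t2 = Worker(clock, e_local=100)

freed = t1.collect([t1, t2])    # minimum elocal is 100
```

Only the object tagged 99 is freed. The one tagged 102 has to wait until t2 starts a new operation and refreshes its elocal, and no thread ever writes to a shared counter on the fast path.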
44
Node Split
SMO
① An accessor thread notices that the node size ≥ threshold.
② It creates a new base node S that holds the consolidated upper half of node L.
This step does not use CaS: node S is still invisible to other threads, and node L still has all the records.
[Figure: mapping table (PID, Ptr) with entries L, P, R, S; nodes L, R and P, and the new Node S.]
46
Node Split
SMO
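③ Add a Δsplit record to the original node L with CaS.
The Δsplit holds the separator key Ks and a logical pointer to the new node.
The Δsplit also informs other threads that an SMO is in progress.
This intermediate state is called a half split.
[Figure: mapping table (PID, Ptr) with entries L, P, R, S; a Δsplit record is installed on top of Node L with CaS and refers to Node S.]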
47
Node Split
SMO
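④ Add a Δseparator record to the parent node P with CaS.
The Δseparator holds the separator keys KL and KS and a logical
pointer to the new node.
[Figure: mapping table (PID, Ptr) with entries L, P, R, S; Node L carries the Δsplit, and a Δseparator is installed on top of Node P with CaS.]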
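Steps ② to ④ of the split can be sketched as follows. This is a simplified single-threaded model: nodes and delta records are plain dictionaries, `cas` stands in for the atomic instruction, integer keys replace K1, K3, and so on, and the field names (`sep_key`, `low`, `high`, `child`) are my own and not taken from the slides.

```python
mapping_table = {}


def cas(pid, expected, new):
    """Single-threaded stand-in for an atomic compare-and-swap (assumption)."""
    if mapping_table.get(pid) is expected:
        mapping_table[pid] = new
        return True
    return False


# Leaf L has reached the size threshold; P is its parent.
node_l = {"kind": "base", "right": "R",
          "records": [(1, "V1"), (3, "V3"), (5, "V5"), (8, "V8")]}
node_p = {"kind": "base", "children": [(1, "L"), (10, "R")]}
mapping_table["L"] = node_l
mapping_table["P"] = node_p

# Step 2: build S from the upper half of L and store it under a fresh PID.
mid = len(node_l["records"]) // 2
sep_key = node_l["records"][mid][0]                 # the separator key Ks
node_s = {"kind": "base", "right": node_l["right"],
          "records": node_l["records"][mid:]}
mapping_table["S"] = node_s     # plain store: no thread can reach S yet

# Step 3 (half split): the split delta redirects keys >= Ks to S.
delta_split = {"kind": "split", "sep_key": sep_key, "new_pid": "S",
               "next": node_l}
half_split_ok = cas("L", node_l, delta_split)

# Step 4: the separator delta tells the parent which key range S covers.
delta_sep = {"kind": "separator", "low": sep_key, "high": 10, "child": "S",
             "next": node_p}
parent_ok = cas("P", node_p, delta_sep)


def find_leaf(key):
    """Resolve the leaf PID for `key`, starting at L and honoring a split delta."""
    head = mapping_table["L"]
    if head["kind"] == "split" and key >= head["sep_key"]:
        return head["new_pid"]
    return "L"
```

Between steps 3 and 4 the tree is already correct: a reader that reaches L through the parent is redirected to S by the split delta, and the separator delta only shortens that path.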
49
Node Split
SMO
What if the parent node has been merged into another node?
The epoch mechanism guarantees that we will see the Δremove of the parent node,
so we can find the appropriate left sibling of the parent and insert the Δseparator there.
[Figure: mapping table (PID, Ptr) with entries L, P, R, S; Node L with its Δsplit, Node P with the Δseparator, Node R and Node S.]
50
Node Merge
SMO
① An accessor thread notices that the node size ≤ threshold.
A node is only allowed to merge with its left sibling.
[Figure: mapping table (PID, Ptr) with entries L, P, R, S; nodes L, R, P and S, with a Δremove record shown for the node to be merged.]
51
Node Merge
SMO
② Add a Δremove record to node R.
This stops all further use of node R.
A thread that encounters a Δremove must read or update
the contents previously held in R by going to the left sibling.
[Figure: mapping table (PID, Ptr) with entries L, P, R, S; the Δremove record is installed on top of Node R.]
52
Node Merge
SMO
③ Add a Δmerge record to the left sibling node L with CaS.
The Δmerge physically points to node R.
After the Δmerge, L and R are considered parts of the same logical node.
The Δmerge contains a merge key (copied from R's low-key).
[Figure: mapping table (PID, Ptr) with entries L, P, R, S; the Δmerge on top of Node L points to Node R, and both together form Logical Node L; Node R still carries its Δremove.]
53
Node Merge
SMO
④ Add a Δseparator record to the parent node P with CaS.
This record indicates that R is deleted and gives L's new key range
(low-key from L, high-key from R).
⑤ Delete the Δremove record and the PID for node R,
only after the epoch mechanism says it is safe to do so.
[Figure: mapping table (PID, Ptr) with entries L, P, R, S; Node L with its Δmerge, Node R with its Δremove, and the Δseparator installed on top of Node P.]
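The merge steps ② to ⑤ can be sketched in the same style. Again this is a simplified single-threaded model with hypothetical names: nodes and delta records are dictionaries, `cas` stands in for the atomic instruction, integer keys are used, and a third child `Q` of the parent is invented only to give R an upper bound.

```python
mapping_table = {}


def cas(pid, expected, new):
    """Single-threaded stand-in for an atomic compare-and-swap (assumption)."""
    if mapping_table.get(pid) is expected:
        mapping_table[pid] = new
        return True
    return False


node_l = {"kind": "base", "low": 1, "high": 10,
          "records": [(1, "V1"), (3, "V3")]}
node_r = {"kind": "base", "low": 10, "high": 17,
          "records": [(10, "V10")]}                 # below the size threshold
node_p = {"kind": "base", "children": [(1, "L"), (10, "R"), (17, "Q")]}
mapping_table.update({"L": node_l, "R": node_r, "P": node_p})

# Step 2: the remove delta stops all further use of R through its PID.
delta_remove = {"kind": "remove", "next": node_r}
removed_ok = cas("R", node_r, delta_remove)

# Step 3: the merge delta on L points physically at R's old contents.
delta_merge = {"kind": "merge", "merge_key": node_r["low"],
               "right": node_r, "next": node_l}
merged_ok = cas("L", node_l, delta_merge)

# Step 4: the parent learns that L now covers [L.low, R.high).
delta_sep = {"kind": "separator", "low": node_l["low"],
             "high": node_r["high"], "child": "L", "next": node_p}
parent_ok = cas("P", node_p, delta_sep)

# Step 5: R's PID and its remove delta are only queued here; they would be
# freed once the epoch mechanism reports that no thread can still see them.
pending_reclaim = [("R", delta_remove)]


def lookup(key):
    """Search logical node L, following the merge delta when present."""
    head = mapping_table["L"]
    if head["kind"] == "merge":
        side = head["right"] if key >= head["merge_key"] else head["next"]
    else:
        side = head
    return dict(side["records"]).get(key)
```

The merge key decides which physical half of the logical node a search continues in, so keys that used to live in R stay reachable through L as soon as step 3 succeeds.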
55
Possible Optimization
Optimization
• Retry a failed CaS from the parent rather than from the root.
• Don't abandon a failed consolidation:
add the newly inserted delta to the consolidated node that was already created, and retry.
• Consolidate periodically.
• Replace CaS with TaS (test-and-set).
More Related Content

Similar to Bw tree presentation

Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners Ravi Sony
 
A Deep Dive Into Understanding Apache Cassandra
A Deep Dive Into Understanding Apache CassandraA Deep Dive Into Understanding Apache Cassandra
A Deep Dive Into Understanding Apache CassandraDataStax Academy
 
Short.course.introduction.to.vhdl
Short.course.introduction.to.vhdlShort.course.introduction.to.vhdl
Short.course.introduction.to.vhdlRavi Sony
 
Renegotiating the boundary between database latency and consistency
Renegotiating the boundary between database latency  and consistencyRenegotiating the boundary between database latency  and consistency
Renegotiating the boundary between database latency and consistencyScyllaDB
 
A Consolidation Success Story by Karl Arao
A Consolidation Success Story by Karl AraoA Consolidation Success Story by Karl Arao
A Consolidation Success Story by Karl AraoEnkitec
 
TiDB for Big Data
TiDB for Big DataTiDB for Big Data
TiDB for Big DataPingCAP
 
VLSI lab manual Part A, VTU 7the sem KIT-tiptur
VLSI lab manual Part A, VTU 7the sem KIT-tipturVLSI lab manual Part A, VTU 7the sem KIT-tiptur
VLSI lab manual Part A, VTU 7the sem KIT-tipturPramod Kumar S
 
IRJET- Adding Support for Vector Instructions to 8051 Architecture
IRJET- Adding Support for Vector Instructions to 8051 ArchitectureIRJET- Adding Support for Vector Instructions to 8051 Architecture
IRJET- Adding Support for Vector Instructions to 8051 ArchitectureIRJET Journal
 
OOW 2013: Where did my CPU go
OOW 2013: Where did my CPU goOOW 2013: Where did my CPU go
OOW 2013: Where did my CPU goKristofferson A
 
A VHDL Implemetation of the Advanced Encryption Standard-Rijndael.pdf
A VHDL Implemetation of the Advanced Encryption Standard-Rijndael.pdfA VHDL Implemetation of the Advanced Encryption Standard-Rijndael.pdf
A VHDL Implemetation of the Advanced Encryption Standard-Rijndael.pdfRamRaja15
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAPEDB
 
Ecet 230 Success Begins / snaptutorial.com
Ecet 230 Success Begins / snaptutorial.comEcet 230 Success Begins / snaptutorial.com
Ecet 230 Success Begins / snaptutorial.comWilliamsTaylorzm
 
ECET 230 Massive Success--snaptutorial.com
ECET 230 Massive Success--snaptutorial.comECET 230 Massive Success--snaptutorial.com
ECET 230 Massive Success--snaptutorial.comsantricksapiens71
 
ECET 230 Technology levels--snaptutorial.com
ECET 230 Technology levels--snaptutorial.comECET 230 Technology levels--snaptutorial.com
ECET 230 Technology levels--snaptutorial.comsholingarjosh102
 
Ecet 230 Enthusiastic Study / snaptutorial.com
Ecet 230 Enthusiastic Study / snaptutorial.comEcet 230 Enthusiastic Study / snaptutorial.com
Ecet 230 Enthusiastic Study / snaptutorial.comStephenson39
 
Azure Day Rome Reloaded 2019 - Deconstructing Kubernetes using AKS
Azure Day Rome Reloaded 2019 - Deconstructing Kubernetes using AKSAzure Day Rome Reloaded 2019 - Deconstructing Kubernetes using AKS
Azure Day Rome Reloaded 2019 - Deconstructing Kubernetes using AKSazuredayit
 
MySQL Optimizer Overview
MySQL Optimizer OverviewMySQL Optimizer Overview
MySQL Optimizer OverviewOlav Sandstå
 

Similar to Bw tree presentation (20)

Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners
 
A Deep Dive Into Understanding Apache Cassandra
A Deep Dive Into Understanding Apache CassandraA Deep Dive Into Understanding Apache Cassandra
A Deep Dive Into Understanding Apache Cassandra
 
Short.course.introduction.to.vhdl
Short.course.introduction.to.vhdlShort.course.introduction.to.vhdl
Short.course.introduction.to.vhdl
 
Renegotiating the boundary between database latency and consistency
Renegotiating the boundary between database latency  and consistencyRenegotiating the boundary between database latency  and consistency
Renegotiating the boundary between database latency and consistency
 
A Consolidation Success Story by Karl Arao
A Consolidation Success Story by Karl AraoA Consolidation Success Story by Karl Arao
A Consolidation Success Story by Karl Arao
 
TiDB for Big Data
TiDB for Big DataTiDB for Big Data
TiDB for Big Data
 
VLSI lab manual Part A, VTU 7the sem KIT-tiptur
VLSI lab manual Part A, VTU 7the sem KIT-tipturVLSI lab manual Part A, VTU 7the sem KIT-tiptur
VLSI lab manual Part A, VTU 7the sem KIT-tiptur
 
IRJET- Adding Support for Vector Instructions to 8051 Architecture
IRJET- Adding Support for Vector Instructions to 8051 ArchitectureIRJET- Adding Support for Vector Instructions to 8051 Architecture
IRJET- Adding Support for Vector Instructions to 8051 Architecture
 
OOW 2013: Where did my CPU go
OOW 2013: Where did my CPU goOOW 2013: Where did my CPU go
OOW 2013: Where did my CPU go
 
A VHDL Implemetation of the Advanced Encryption Standard-Rijndael.pdf
A VHDL Implemetation of the Advanced Encryption Standard-Rijndael.pdfA VHDL Implemetation of the Advanced Encryption Standard-Rijndael.pdf
A VHDL Implemetation of the Advanced Encryption Standard-Rijndael.pdf
 
Design a pipeline
Design a pipelineDesign a pipeline
Design a pipeline
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAP
 
Ecet 230 Success Begins / snaptutorial.com
Ecet 230 Success Begins / snaptutorial.comEcet 230 Success Begins / snaptutorial.com
Ecet 230 Success Begins / snaptutorial.com
 
ECET 230 Massive Success--snaptutorial.com
ECET 230 Massive Success--snaptutorial.comECET 230 Massive Success--snaptutorial.com
ECET 230 Massive Success--snaptutorial.com
 
ECET 230 Technology levels--snaptutorial.com
ECET 230 Technology levels--snaptutorial.comECET 230 Technology levels--snaptutorial.com
ECET 230 Technology levels--snaptutorial.com
 
Ecet 230 Enthusiastic Study / snaptutorial.com
Ecet 230 Enthusiastic Study / snaptutorial.comEcet 230 Enthusiastic Study / snaptutorial.com
Ecet 230 Enthusiastic Study / snaptutorial.com
 
04 sequentialbasics 1
04 sequentialbasics 104 sequentialbasics 1
04 sequentialbasics 1
 
Azure Day Rome Reloaded 2019 - Deconstructing Kubernetes using AKS
Azure Day Rome Reloaded 2019 - Deconstructing Kubernetes using AKSAzure Day Rome Reloaded 2019 - Deconstructing Kubernetes using AKS
Azure Day Rome Reloaded 2019 - Deconstructing Kubernetes using AKS
 
MySQL Optimizer Overview
MySQL Optimizer OverviewMySQL Optimizer Overview
MySQL Optimizer Overview
 
TiReX: Tiled Regular eXpression matching architecture
TiReX: Tiled Regular eXpression matching architectureTiReX: Tiled Regular eXpression matching architecture
TiReX: Tiled Regular eXpression matching architecture
 

Recently uploaded

AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAlluxio, Inc.
 
Agnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in KrakówAgnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in Krakówbim.edu.pl
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAlluxio, Inc.
 
A Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationA Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationHelp Desk Migration
 
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?XfilesPro
 
CompTIA Security+ (Study Notes) for cs.pdf
CompTIA Security+ (Study Notes) for cs.pdfCompTIA Security+ (Study Notes) for cs.pdf
CompTIA Security+ (Study Notes) for cs.pdfFurqanuddin10
 
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...rajkumar669520
 
A Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data MigrationA Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data MigrationHelp Desk Migration
 
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product UpdatesGraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product UpdatesNeo4j
 
GraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysisGraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysisNeo4j
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Anthony Dahanne
 
Breaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdfBreaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdfMeon Technology
 
INGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by DesignINGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by DesignNeo4j
 
Benefits of Employee Monitoring Software
Benefits of  Employee Monitoring SoftwareBenefits of  Employee Monitoring Software
Benefits of Employee Monitoring SoftwareMera Monitor
 
Crafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationCrafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationWave PLM
 
10 Essential Software Testing Tools You Need to Know About.pdf
10 Essential Software Testing Tools You Need to Know About.pdf10 Essential Software Testing Tools You Need to Know About.pdf
10 Essential Software Testing Tools You Need to Know About.pdfkalichargn70th171
 
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with StrimziStrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzisteffenkarlsson2
 
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA -  Language Models and Knowledge Graphs: A Systems ApproachKLARNA -  Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems ApproachNeo4j
 

Recently uploaded (20)

AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning Framework
 
Agnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in KrakówAgnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in Kraków
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
 
5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand
 
A Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationA Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data Migration
 
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
 
CompTIA Security+ (Study Notes) for cs.pdf
CompTIA Security+ (Study Notes) for cs.pdfCompTIA Security+ (Study Notes) for cs.pdf
CompTIA Security+ (Study Notes) for cs.pdf
 
Top Mobile App Development Companies 2024
Top Mobile App Development Companies 2024Top Mobile App Development Companies 2024
Top Mobile App Development Companies 2024
 
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
 
A Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data MigrationA Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data Migration
 
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product UpdatesGraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
 
GraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysisGraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysis
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
 
Breaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdfBreaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdf
 
INGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by DesignINGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by Design
 
Benefits of Employee Monitoring Software
Benefits of  Employee Monitoring SoftwareBenefits of  Employee Monitoring Software
Benefits of Employee Monitoring Software
 
Crafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationCrafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM Integration
 
10 Essential Software Testing Tools You Need to Know About.pdf
10 Essential Software Testing Tools You Need to Know About.pdf10 Essential Software Testing Tools You Need to Know About.pdf
10 Essential Software Testing Tools You Need to Know About.pdf
 
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with StrimziStrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi
 
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA -  Language Models and Knowledge Graphs: A Systems ApproachKLARNA -  Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
 

Bw tree presentation

  • 2. 2 Key Features Implementation SMO Optimization Table of Contents
  • 3. 3 Key Features Implementation SMO Optimization Table of Contents
  • 4. 4 Latch-free Key Features Latch-free approach ensures a thread never yields in the face of conflict. Thus increase level of concurrency with multi-core CPUs. State changes are made using atomic CaS instructions.
  • 5. 5 Mapping Table Key Features In order to implement latch-free design, Bw-tree uses CaS instruction to modify data. Mapping Table is helping tool that maps logical page to physical page with inter-links in Bw-Tree. Thus making atomic updates of several references to a tree node possible. Δ Remove PID Ptr L P R S Node L Node R Node P Node S
  • 6. Delta Updates (Key Features): Good multi-core processor performance depends on high CPU cache hit ratios. Instead of updating memory in place (which invalidates cached copies), the Bw-Tree prepends changes as delta updates. [Figure: base node with keys K1, K3, K5, K8 and values V1, V3, V5, V8; the delta chain above it holds Δdelete [K5, V5] and Δinsert [K4, V4].]
  • 7. High Performance (Key Features): [Figure: the delta chain diagram from the previous slide, followed by "+ = High Performance".]
  • 8. Table of Contents: Key Features, Implementation, SMO, Optimization
  • 9. Mapping Table (Implementation): Every node has two inbound pointers, one from its parent and one from its left sibling, so it is hard to update both atomically. The Bw-Tree uses the mapping table to translate a PID into a physical pointer. This way, no matter how many inbound pointers a node has, a modification can be made atomically with a single CaS instruction. Logical pointers: parent-child and left-right sibling pointers. Physical pointers: the pointer from the mapping table to a node, and pointers within a logical node (for example, the pointer used by a merge SMO).
  • 10. Logical Node (Implementation): Base node. An inner base node is a sorted (key, nodeID) array; a leaf base node is a sorted (key, value) array. [Figure: the inbound pointer reaches the head of the delta chain (Δinsert [K4, V4], Δdelete [K5, V5]), which ends at the base node (K1, K3, K5, K8 with V1, V3, V5, V8); every delta record and the base node carry metadata.]
  • 11. Logical Node (Implementation): Metadata fields. low-key: the smallest key stored at the logical node. high-key: the smallest key of the logical node's right sibling. right sibling: the ID of the logical node's right sibling. size: the number of items in the logical node. depth: the number of records in the logical node's delta chain. offset: the location of the inserted or deleted item in the base node. LSN: used to enforce the WAL protocol.
  • 12. Logical Node (Implementation): Delta chain. A singly linked list that holds the history of modifications. All inbound pointers point to the head of the delta chain.
  • 13. Delta Updates (Implementation): All page state changes are made by creating a delta record and prepending it to the existing page state. [Figure: delta chain (Δinsert [K4, V4], Δdelete [K5, V5]) on top of the base node, each with metadata.]
  • 14. Delta Updates (Implementation): When several threads race, only one delta update can succeed with the CaS instruction. A thread whose CaS fails retries.
  • 15. Delta Updates (Implementation): [Figure: the mapping table pointer refers to the head of the delta chain.]
  • 16. Delta Updates (Implementation): 1. Threads t1 and t2 each create a delta record (Δinsert [K2, V2] and Δdelete [K8, V8]) and point it at the current head of the chain.
  • 17. Delta Updates (Implementation): 2. Both threads attempt a CaS on the mapping table. Only t1 succeeds.
  • 18. Delta Updates (Implementation): 3. t2 discards its delta record and retries the whole process.
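The race between t1 and t2 on slides 16 to 18 can be sketched as follows. This is a minimal illustration, not the authors' implementation: Python exposes no hardware CaS, so the hypothetical `MappingTable.compare_and_swap` emulates one with a lock, and the names `Delta`, `MappingTable`, and `prepend_delta` are assumptions made for this example.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Delta:
    """One delta record; `next` points at the previous head of the chain."""
    op: str                  # "insert" or "delete"
    key: int
    value: Optional[str]
    next: object             # an older Delta or the base node


class MappingTable:
    """PID -> physical pointer. The lock only emulates a hardware CaS."""

    def __init__(self):
        self._slots = {}
        self._lock = threading.Lock()

    def load(self, pid):
        return self._slots[pid]

    def store(self, pid, node):
        self._slots[pid] = node

    def compare_and_swap(self, pid, expected, new):
        with self._lock:
            if self._slots[pid] is expected:
                self._slots[pid] = new
                return True
            return False


def prepend_delta(table, pid, op, key, value=None):
    """Build a delta on top of the current head; retry until the CaS wins."""
    attempts = 0
    while True:
        attempts += 1
        head = table.load(pid)
        delta = Delta(op, key, value, head)
        if table.compare_and_swap(pid, head, delta):
            return attempts


table = MappingTable()
base = ((1, "V1"), (3, "V3"), (5, "V5"), (8, "V8"))    # leaf base node
table.store("L", base)

# Replay the race by hand: t1 and t2 both observed the same head.
head_seen = table.load("L")
d1 = Delta("insert", 2, "V2", head_seen)               # built by t1
d2 = Delta("delete", 8, None, head_seen)               # built by t2
t1_ok = table.compare_and_swap("L", head_seen, d1)     # t1 wins
t2_ok = table.compare_and_swap("L", head_seen, d2)     # t2 loses: head moved
t2_attempts = prepend_delta(table, "L", "delete", 8)   # t2 starts over
```

After the retry, the chain reads Δdelete [K8] → Δinsert [K2] → base node, and the record t2 built first (`d2`) is simply dropped.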
  • 19. Page Search (Implementation): 1. The thread traverses the delta chain looking for the search key. [Figure: delta chain Δupdate [K8, V'8], Δinsert [K4, V4], Δdelete [K5, V5] on top of the base node K1, K3, K5, K8.]
  • 20. Page Search (Implementation): 2. If the key is present in the delta chain, the outcome depends on the first matching delta record. I. Insert or update: the search succeeds and returns the record. II. Delete: the search fails.
  • 21. Page Search (Implementation): 3. If the key is not in the delta chain, the thread performs a binary search on the base node.
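The three search steps above can be sketched in a few lines. The record layouts `Delta` and `Base` are hypothetical, chosen only for this illustration, and the example chain mirrors the one drawn on the slides.

```python
from bisect import bisect_left
from collections import namedtuple

# Hypothetical record layouts, for illustration only.
Delta = namedtuple("Delta", "op key value next")
Base = namedtuple("Base", "keys values")


def search(head, key):
    """Return the value for `key`, or None if the key is absent or deleted."""
    node = head
    # Steps 1 and 2: walk the delta chain from newest to oldest record.
    while isinstance(node, Delta):
        if node.key == key:
            if node.op in ("insert", "update"):
                return node.value      # newest record for the key wins
            return None                # a delete delta ends the search
        node = node.next
    # Step 3: the key is not in the chain, so binary-search the base node.
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None


base = Base(keys=[1, 3, 5, 8], values=["V1", "V3", "V5", "V8"])
chain = Delta("delete", 5, None, base)
chain = Delta("insert", 4, "V4", chain)
chain = Delta("update", 8, "V8'", chain)
```

With this chain, looking up key 8 returns the updated value from the delta record, key 5 is reported as missing even though the base node still holds it, and key 3 falls through to the binary search.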
  • 22. Consolidation (Implementation): The longer the delta chain, the higher the overhead of traversing the tree. [Figure: mapping table entry (PID, Ptr) pointing to a delta chain (Δinsert [K4, V4], Δdelete [K5, V5]) over the base node K1, K3, K5, K8.]
  • 23. Consolidation (Implementation): 1. The thread creates a new base node.
  • 24. Consolidation (Implementation): 2. It applies the delta chain to the new base node.
  • 25. Consolidation (Implementation): 3. It changes the mapping table's physical pointer to the new node with CaS, and the old logical node is reclaimed once that is safe (using the epoch mechanism). [Figure: new base node K1, K3, K4, K8 with V1, V3, V4, V8.]
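A rough sketch of the consolidation steps, under stated assumptions: `Delta`, `Base`, `consolidate`, and `cas` are hypothetical names, the mapping table is a plain dict, and `cas` is a single-threaded stand-in for the atomic instruction, not a real one.

```python
from collections import namedtuple

# Hypothetical record layouts, for illustration only.
Delta = namedtuple("Delta", "op key value next")
Base = namedtuple("Base", "keys values")


def consolidate(head):
    """Steps 1 and 2: fold a delta chain into a fresh, sorted base node."""
    deltas = []
    node = head
    while isinstance(node, Delta):
        deltas.append(node)
        node = node.next
    records = dict(zip(node.keys, node.values))   # copy of the old base node
    for d in reversed(deltas):                    # replay oldest delta first
        if d.op == "delete":
            records.pop(d.key, None)
        else:                                     # insert or update
            records[d.key] = d.value
    keys = sorted(records)
    return Base(keys, [records[k] for k in keys])


def cas(table, pid, expected, new):
    """Stand-in for an atomic CaS on one mapping table slot (not atomic)."""
    if table[pid] is expected:
        table[pid] = new
        return True
    return False


old_base = Base([1, 3, 5, 8], ["V1", "V3", "V5", "V8"])
head = Delta("insert", 4, "V4", Delta("delete", 5, None, old_base))
table = {"L": head}

new_base = consolidate(head)
installed = cas(table, "L", head, new_base)       # step 3
# The old chain is only queued here; it is freed once the epoch allows it.
garbage = [head] if installed else []
```

The old base node is never modified, so a reader that is still walking the old chain sees a consistent page until the epoch mechanism allows the chain to be freed.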
  • 26. Range Search (Implementation): A range scan is specified by a key range. A cursor records how far the scan has progressed. [Figure: pages P (K1, K3, K5, K8), Q (K10, K13, K15, K16), and R (K17, K19, K20, K22) with their values; a cursor marks the minimum key.]
  • 27. Range Search (Implementation): 1. Construct a vector of the records that could be part of the scan.
  • 28. Range Search (Implementation): 2. The "next record" operation is atomic, but the scan as a whole is not.
  • 29. Range Search (Implementation): The scan therefore has to check whether an update has affected the subrange held in the record vector. If an update has occurred, the record vector is reconstructed.
  • 30. Garbage Collection (Implementation): 1. The environment is latch-free, so readers can be active on a page even while it is being updated. 2. An old page must not be deallocated while other threads are still accessing it. 3. An "epoch" mechanism protects objects from being deallocated too early. There are two garbage collection schemes.
  • 31. Garbage Collection, centralized scheme (Implementation): 1. This is the final state of an epoch: no threads are enrolled in it, so its garbage nodes are ready to be reclaimed. [Figure: epoch E1 with Count = 0 and a list of garbage nodes ending in NULL.]
  • 32. Garbage Collection, centralized scheme (Implementation): 2. A new epoch object is appended at fixed intervals. [Figure: epoch E1 (Count = 0) followed by epoch E2 (Count = 0).]
  • 33. Garbage Collection, centralized scheme (Implementation): 3. A thread must enroll itself in the current epoch object before accessing the tree. [Figure: epoch E2 now has Count = 2.]
  • 34. Garbage Collection, centralized scheme (Implementation): 4. t2 adds a new node to the garbage list with CaS, but nodes are not reclaimed until all threads have exited the epoch.
  • 35. Garbage Collection, centralized scheme (Implementation): Updating the global variable "Count" becomes a bottleneck because of cache coherence traffic.
  • 36. Garbage Collection, centralized scheme (Implementation): How can the use of global memory be avoided?
  • 37. Garbage Collection, decentralized scheme (Implementation): Each worker thread maintains a private epoch (e_local). [Figure: global epoch e_global = 103; thread t1 has e_local = 102; thread t2 has e_local = 103.]
  • 38. Garbage Collection, decentralized scheme (Implementation): Each worker thread also maintains a linked list of objects that are marked for deletion, each tagged with an epoch (e_delete). [Figure: garbage entries with e_delete = 98, 100, and 103.]
  • 39. Garbage Collection, decentralized scheme (Implementation): 1. A thread copies e_global to its e_local at the beginning of a new operation. [Figure: e_global = 102; t1 has e_local = 102 and one garbage entry with e_delete = 99; t2 has e_local = 100.]
  • 40. Garbage Collection, decentralized scheme (Implementation): 2. When garbage is created, it is tagged with the latest e_global as its e_delete. [Figure: t1 adds a garbage entry with e_delete = 102.]
  • 41. Garbage Collection, decentralized scheme (Implementation): 3. At the end of the operation, the thread copies e_global to its e_local again. e_global is incremented periodically. [Figure: e_global = 103; t1 now has e_local = 103.]
  • 42. Garbage Collection, decentralized scheme (Implementation): 4. Initiate garbage collection. The thread retrieves e_local from all other threads and reclaims every object whose e_delete is less than the minimum e_local. [Figure: t2 still has e_local = 100, so the minimum e_local is 100; the entry with e_delete = 99 is reclaimed and the entry with e_delete = 102 is kept.]
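The decentralized scheme on slides 37 to 42 can be sketched as below. All class and method names (`EpochGC`, `Worker`, `begin_op`, `retire`, `end_op`, `collect`) are hypothetical, the code is single-threaded, and it only shows the bookkeeping; a real implementation would need the memory-ordering guarantees that this sketch ignores. The scenario reproduces the epoch values drawn on the slides.

```python
class EpochGC:
    """Holds the global epoch and the list of registered workers."""

    def __init__(self, start=0):
        self.e_global = start
        self.workers = []

    def advance(self):
        """Called periodically to move the global epoch forward."""
        self.e_global += 1


class Worker:
    """One worker thread's private epoch and private garbage list."""

    def __init__(self, name, gc):
        self.name = name
        self.gc = gc
        self.e_local = gc.e_global
        self.garbage = []                     # list of (e_delete, object)
        gc.workers.append(self)

    def begin_op(self):
        self.e_local = self.gc.e_global       # step 1

    def retire(self, obj):
        self.garbage.append((self.gc.e_global, obj))   # step 2: tag e_delete

    def end_op(self):
        self.e_local = self.gc.e_global       # step 3

    def collect(self):
        # Step 4. The minimum is taken over every worker, this one included,
        # which is at least as conservative as looking only at the others.
        min_local = min(w.e_local for w in self.gc.workers)
        freed = [obj for e, obj in self.garbage if e < min_local]
        self.garbage = [(e, obj) for e, obj in self.garbage if e >= min_local]
        return freed


gc = EpochGC(start=99)
t1, t2 = Worker("t1", gc), Worker("t2", gc)

t1.begin_op(); t1.retire("old-A"); t1.end_op()   # e_delete = 99
gc.advance()                                     # e_global = 100
t2.begin_op()                                    # t2 stays at e_local = 100
gc.advance(); gc.advance()                       # e_global = 102
t1.begin_op(); t1.retire("old-B")                # e_delete = 102
gc.advance()                                     # e_global = 103
t1.end_op()                                      # t1 e_local = 103
freed = t1.collect()                             # minimum e_local is 100
```

Only the object retired at epoch 99 is freed. The one retired at epoch 102 has to wait until t2 finishes its operation and refreshes its e_local.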
  • 43. Table of Contents: Key Features, Implementation, SMO, Optimization
  • 44. Node Split (SMO): ① An accessor thread notices that the node size ≥ threshold. ② It creates a new base Node S that holds the consolidated upper half of Node L. This step is not a CaS: Node S is still invisible to other threads, and Node L still holds all the records. [Figure: mapping table (PID, Ptr) with entries L, P, R, S pointing to Nodes L, P, R, S.]
  • 45. Node Split (SMO): ③ Add Δsplit to the original Node L with CaS. Δsplit holds the separator key K_S and a logical pointer to the new node. Δsplit also informs other threads that an SMO is in progress.
  • 46. Node Split (SMO): The state after step ③ is called a half split.
  • 47. Node Split (SMO): ④ Add Δseparator to the parent Node P with CaS. Δseparator holds the separator keys K_L and K_S and a logical pointer to the new node.
  • 48. Node Split (SMO): What if the parent node has been merged into another node?
  • 49. Node Split (SMO): The epoch mechanism guarantees that we see the Δremove of the parent node, so we can find the appropriate left sibling of the parent and insert Δseparator there.
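The four split steps can be sketched as follows. This is an illustration under several assumptions: the record layouts and the names `split`, `lookup`, and `cas` are hypothetical, the mapping table is a plain dict, `cas` is a single-threaded stand-in for the atomic instruction, Node L is assumed to be consolidated already, and the merged-parent case from slides 48 and 49 is not handled.

```python
from collections import namedtuple

# Hypothetical record layouts, for illustration only.
Base = namedtuple("Base", "keys values high_key right")
SplitDelta = namedtuple("SplitDelta", "sep_key new_pid next")
SeparatorDelta = namedtuple("SeparatorDelta", "low_key high_key child_pid next")


def cas(table, pid, expected, new):
    """Stand-in for an atomic CaS on one mapping table slot (not atomic)."""
    if table[pid] is expected:
        table[pid] = new
        return True
    return False


def split(table, pid, parent_pid, new_pid):
    old = table[pid]                        # assumed to be a consolidated base
    mid = len(old.keys) // 2
    sep = old.keys[mid]
    # Step 2: build S from the upper half. A plain store is enough because
    # no other thread can reach S yet.
    table[new_pid] = Base(old.keys[mid:], old.values[mid:],
                          old.high_key, old.right)
    # Step 3 (half split): publish the split delta on L with one CaS.
    if not cas(table, pid, old, SplitDelta(sep, new_pid, old)):
        del table[new_pid]                  # lost the race; caller may retry
        return False
    # Step 4: post the separator delta on the parent, retrying on conflict.
    while True:
        parent = table[parent_pid]
        delta = SeparatorDelta(sep, old.high_key, new_pid, parent)
        if cas(table, parent_pid, parent, delta):
            return True


def lookup(table, pid, key):
    """Leaf lookup that follows a split delta to the new sibling."""
    node = table[pid]
    while isinstance(node, SplitDelta):
        if key >= node.sep_key:
            return lookup(table, node.new_pid, key)
        node = node.next
    if key in node.keys:
        return node.values[node.keys.index(key)]
    return None


table = {
    "L": Base([1, 3, 5, 8], ["V1", "V3", "V5", "V8"], 10, "R"),
    "P": Base([1, 10], ["L", "R"], None, None),
}
done = split(table, "L", "P", "S")
```

Because readers follow Δsplit to Node S for keys at or above the separator, lookups stay correct even before the parent learns about S, which is why the half split is a valid intermediate state.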
  • 50. Node Merge (SMO): ① An accessor thread notices that the node size ≤ threshold. A thread is only allowed to merge a node with its left sibling. [Figure: mapping table (PID, Ptr) with entries L, P, R, S pointing to Nodes L, P, R, S.]
  • 51. Node Merge (SMO): ② Add Δremove to Node R. This stops all further use of Node R. A thread that encounters a Δremove must go to the left sibling to read or update the contents previously held in R.
  • 52. Node Merge (SMO): ③ Add Δmerge to the left sibling Node L with CaS. Δmerge physically points to Node R. After Δmerge, L and R are considered part of the same logical node. Δmerge contains a merge key (copied from R's low-key).
  • 53. Node Merge (SMO): ④ Add Δseparator to the parent Node P with CaS. This record indicates that R is deleted and gives L's new key range (low-key from L, high-key from R). ⑤ Delete the Δremove and the PID for Node R only after the epoch mechanism says it is safe to do so.
  • 54. Table of Contents: Key Features, Implementation, SMO, Optimization
  • 55. Possible Optimization (Optimization): Retry a failed CaS from the parent instead of from the root. Do not abandon a failed consolidation; add the newly inserted delta to the consolidated node that was created and retry. Consolidate periodically. Change CaS to TaS.
  • 56. References: Justin J. Levandoski, David B. Lomet, and Sudipta Sengupta. 2013. The Bw-Tree: A B-tree for New Hardware Platforms. Ziqi Wang, Andrew Pavlo, Hyeontaek Lim, Viktor Leis, Huanchen Zhang, Michael Kaminsky, and David G. Andersen. 2018. Building a Bw-Tree Takes More Than Just Buzz Words.