Database storage engine internals.pptx

Demystifying data structures
and algorithms adopted by
database storage engine
Adewumi Sunkanmi D.
Demystifying data
structures and
algorithms used by
database storage engine
Adewumi Sunkanmi D.
Senior Software Engineer at Acronis
working on Advanced Automation, one
of the cloud services offered by Acronis
Cyber Cloud.
Outline
1. Overview of a three-tier application
2. Criteria for selecting the best database for an application
3. Overview of database architecture
4. Types for database storage engines and their tradeoffs
5. Q/A
client
POST
GET
client
server
POST
GET
WRITE
READ
server
client
Database
POST
GET
WRITE
READ
server
client
Database
Which database should we use🤔?
POST
GET
WRITE
READ
server
client
Database
Which database should we use🤔?
POST
GET
WRITE
READ
server
client
Database
Which database should we use🤔?
POST
GET
WRITE
READ
server
client
Database
Which database should we use🤔?
POST
GET
WRITE
READ
server
client
Database
Which database should we use🤔?
POST
GET
WRITE
READ
server
client
Database
Which database should we use🤔?
BigTable
POST
GET
WRITE
READ
server
client
Database
Which database should we use🤔?
BigTable
Neo4J
How do we select
the best database
for an application?
1. Structure of the data we want to store;
structured, unstructured, graph?
How do we select
the best database
for an application?
1. Structure of the data we want to store;
structured, unstructured, graph?
1. Scalability of the database
How do we select
the best database
for an application?
1. Structure of the data we want to store;
structured, unstructured, graph?
1. Scalability of the database
- Horizontal or Vertical scaling
How do we select
the best database
for an application?
1. Structure of the data we want to store;
structured, unstructured, graph?
1. Scalability of the database
- Horizontal or Vertical scaling
- Sharding(Partition data across nodes)
How do we select
the best database
for an application?
1. Structure of the data we want to store;
structured, unstructured, graph?
1. Scalability of the database
- Horizontal or Vertical scaling
- Sharding(Partition data across nodes)
- Replication(Copies of data on multiple nodes)
How do we select
the best database
for an application?
1. Structure of the data we want to store;
structured, unstructured, graph?
1. Scalability of the database
- Horizontal or Vertical scaling
- Sharding(Partition data across nodes)
- Replication(Copies of data on multiple nodes)
3. Support and familiarity of developers with database
How do we select
the best database
for an application?
1. Structure of the data we want to store;
structured, unstructured, graph?
1. Scalability of the database
- Horizontal or Vertical scaling
- Sharding(Partition data across nodes)
- Replication(Copies of data on multiple nodes)
3. Support and familiarity of developers with database
4. Rate of write and read and how EXACTLY are these
operations handled at the hardware level?
https://www.oreilly.com/library/view/high-performance-mysql/9781449332471/ch01.html
SELECT
COLS FROM WHERE
COL_ID students >
score 70
firstname lastname
“SELECT firstname, lastname FROM students WHERE score > 70;”
https://www.oreilly.com/library/view/high-performance-mysql/9781449332471/ch01.html
https://www.oreilly.com/library/view/high-performance-mysql/9781449332471/ch01.html
Disk
Types of storage engines
- Log Structured Merge (LSM) Tree
- Page Oriented (B-Tree)
https://www.cs.umb.edu/~poneil/lsmtree.pdf
Log Structured Merge Tree Storage
Engine
The LMS tree is an immutable disk resident data
structure and it is optimized for sequential writes while
maintaining the acceptable read performance.
Log Structured Merge Tree Storage
Engine
Three components
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 177
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 177
write
ben: 300
Memtable
e.g Red black
tree in RAM
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 177
write
ben: 300
Memtable
e.g Red black
tree in RAM
josh: 500
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 177
write
bin: 220
Memtable
e.g Red black
tree in RAM
ben: 300 josh: 500
Threshold reached!
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 177
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
SSD/HDD file (SSTable file)
T1
ben: 300
bin: 220
josh: 500
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 177
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
SSD/HDD file (SSTable file)
40MB
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 177
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
SSD/HDD file (SSTable file)
40MB
10MB
10MB
10MB
10MB
alexandar : 10
andreas : 50
…….
erik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 177
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
SSD/HDD file (SSTable file)
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 177
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
SSD/HDD file (segment file)
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
Find(apa)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
SSD/HDD file (SSTable file)
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 300 mia: 220
write
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
Step 3
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
Step 3
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
Step 3
How do we handle
update?
Since we return from the most
recent memtable or segment file, we
just insert the key with the new
value,
Ben will be returned from T2 not T1
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
Step 3
How do we handle
delete?
Insert the key with a delete marker
called tombstone, since this will be
the most recent, we can tell it has
been deleted, e.g
ben->null
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
Step 3
But now we have
duplicates, space
wastage :(
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
Step 3
But now we have
duplicates, space
wastage :(
Yes, but compaction will help
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
Step 3
But now we have
duplicates, space
wastage :(
Yes, but compaction will help
Compaction
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
Step 3
What if we don’t find
the key, we search all
the SSTable files?
Compaction
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine Key present?: Strict NO if not
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
Step 3
What if we don’t find
the key, we search all
the SSTable files?
Compaction
Optimtimize reads with Bloom Filters
Maybe or Maybe
not(99% accurate)
https://brilliant.org/wiki/bloom-filter/
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
write
bin: 220
Memtable
e.g Red Black
tree in RAM
ben: 300 josh: 500
flush
ben: 300
bin: 220
josh: 500
T1
40MB
10MB
10MB
10MB
400
alexandar : 10
andreas : 50
…….
arik : 500
erling : 200
……..
jan : 11
johan : 300
……..
robert: 499
roy: 200
………
In-memory index
Key Byte offset
alexandar 0
arik 303
jan
robert 500
10MB
eve: 173
ben: 177 mia: 220
write
ben: 177
eve: 173
mia: 200
flush T2
Read(ben)
Step 1
Step 2
Step 3
What if power failure
happens before data
is flushed to disk?
Compaction
1. Persist write in an append only log file before
writing to in-memory table. WAL
2. Recreate memtable from last Log Sequence
Number.
SSD/HDD file (SSTable file)
Log Structured Merge Tree Storage
Engine
Where is LSM tree Storage engine
used?
1. Apache Cassandra
2. WiredTiger
3. InfluxDB
4. Yugabyte DB
5. ScyllaDB
6. CockroachDB
7. Google’s BigTable
8. RocksDB
Types of storage engines
- Log Structured Merge (LSM) Tree
- Page Oriented (B-Tree)
https://carlosproal.com/ir/papers/p121-comer.pdf
https://carlosproal.com/ir/papers/p121-comer.pdf
B-Trees
B trees are page-oriented indexing structures
https://carlosproal.com/ir/papers/p121-comer.pdf
B-Trees
Important notes on B-tree
1. Store key value pairs (sorted by key)
2. Self balancing
3. Often used for indexing
4. Mutable data structure(in place update)
5. Each node is a fixed size block/page 4KB
6. Can only read or write one page at a time
https://carlosproal.com/ir/papers/p121-comer.pdf
Anatomy of B-Tree
12 30
13 23 25 27
50 120
31 37 42 49 52 60 69 90
69 70 78 85
Key < 10
Key [120, inf)
key [12, 30) key [30, 50) key [50, 120)
key [69, 90)
val val val
B-Trees
https://sqlbak.com/academy/database-page
A database page
https://carlosproal.com/ir/papers/p121-comer.pdf
12 30
13 23 25 27
50 120
31 37 42 49 52 60 69 90
69 70 78 85
Key < 10
Key [120, inf)
key [12, 30) key [30, 50) key [50, 120)
key [69, 90)
val val val
NOTE: Leaf Page contains both
the key and value
Anatomy of B-Trees
https://carlosproal.com/ir/papers/p121-comer.pdf
12 30
13 23 25 27
50 120
31 37 42 49 52 60 69 90
Key < 10
Key [120, inf)
key [12, 30) key [30, 50) key [50, 120)
69 70 78 85
key [69, 90)
val val val
Branching factor = 5
Depth= 3
Anatomy of B-Tree
https://carlosproal.com/ir/papers/p121-comer.pdf
12 30
13 23 25 27
50 120
31 37 42 49 52 60 69 90
Key < 10
Key [120, inf)
key [12, 30) key [30, 50) key [50, 120)
69 70 78 85
key [69, 90)
val val val
READ(78)
Anatomy of B-Tree
https://carlosproal.com/ir/papers/p121-comer.pdf
12 30
13 23 25 27
50 120
31 37 42 49 52 60 69 90
Key < 10
Key [120, inf)
key [12, 30) key [30, 50) key [50, 120)
69 70 78 85
key [69, 90)
val val val
READ(78)
Anatomy of B-Tree
https://carlosproal.com/ir/papers/p121-comer.pdf
12 30
13 23 25 27
50 120
31 37 42 49 52 60 69 90
Key < 10
Key [120, inf)
key [12, 30) key [30, 50) key [50, 120)
69 70 78 85
key [69, 90)
val val val
READ(78)
Anatomy of B-Tree
https://carlosproal.com/ir/papers/p121-comer.pdf
12 30
13 23 25 27
50 120
31 37 42 49 52 60 69 90
Key < 10
Key [120, inf)
key [12, 30) key [30, 50) key [50, 120)
69 70 78 85
key [69, 90)
val val val
found!
READ(78)
Anatomy of B-Tree
https://carlosproal.com/ir/papers/p121-comer.pdf
12 30
13 23 25 27
50 120
31 37 42 49 52 60 69 90
Key < 10
Key [120, inf)
key [12, 30) key [30, 50) key [50, 120)
69 70 78 85
key [69, 90)
val val val
found!
READ(78)
Anatomy of B-Tree
Searching for a key is faster because we are not scaning
all keys but only keys within range, takes O(log n)
Where n is the total number of keys
https://carlosproal.com/ir/papers/p121-comer.pdf
60 69 90
val val val val
70 78 85
INSERT(87)
87
69 val val val val
70 78 85 86 val
Branching factor - 5
https://carlosproal.com/ir/papers/p121-comer.pdf
60 69 90
Branching
factor
exceeded! > 5
Create new
page
val val val val
70 78 85 87
69 val val val val
70 78 85 86 val 87 val
Branching factor - 5
INSERT(87)
https://carlosproal.com/ir/papers/p121-comer.pdf
60 69 90
69 70 78
val val val 85 86 87
val val val
INSERT(87)
Branching
factor
exceeded! > 5
Create new
page
https://carlosproal.com/ir/papers/p121-comer.pdf
60 69 85
69 70 78
val val val 85 86 87
val val val
90
Add 85 to parent page
INSERT(87)
https://carlosproal.com/ir/papers/p121-comer.pdf
60 69 85
69 70 78
val val val 85 86 87
val val val
90
Add 85 to parent page
What if the parent page is full?
Split it
INSERT(87)
https://carlosproal.com/ir/papers/p121-comer.pdf
60 69 85
69 70 78
val val val 85 86 87
val val val
90
Add 85 to parent page
How does update work?
1. Find the leaf page with key
2. Edit the row
3. Overwrite the page
INSERT(87)
LSM trees Vs B-Trees storage engine
LSM Tree B-Tree
Optimized for write Optimized for read
Compressed better(No
Fragmentation)
Fragmentation wastes space
There can be duplicates before
compaction
Each key exist exactly in one
place
Strong transaction support
Spikes in write can cause slow
compaction due to many
SSTable files. Can cause Out
of Memory Error(OOM)
Space optimization in B-tree
Primary index(primary key index)
Secondary index
Space optimization in B-tree
Secondary index
Primary index(primary key index)
Leaf page contains both key and value Leaf page contains both key and value
DUPLICATE !
Space optimization in B-tree
Secondary index
Primary index(primary key index)
Store value offset(smaller in size)
Store value offset (smaller in size)
val1
val2
val3
val4
val5
…
Heap File
Space optimization in B-tree
Secondary index
Primary index(primary key index)
Store value offset(smaller in size)
Store value offset (smaller in size)
val1
val2
val3
val4
val5
…
Heap File
Store value offset(smaller in size)
Extra Disk I/O
So you can store important
columns in leaf page and less
important columns in heap file
@gifted_dl
@gifted_dl
Adewumi Sunkanmi D.
1 of 79

Recommended

IO Dubi Lebel by
IO Dubi LebelIO Dubi Lebel
IO Dubi Lebelsqlserver.co.il
1.2K views104 slides
5 steps to faster web sites & HTML5 games - updated for DDDscot by
5 steps to faster web sites & HTML5 games - updated for DDDscot5 steps to faster web sites & HTML5 games - updated for DDDscot
5 steps to faster web sites & HTML5 games - updated for DDDscotMichael Ewins
1.1K views73 slides
SQL Server On SANs by
SQL Server On SANsSQL Server On SANs
SQL Server On SANsQuest Software
1K views46 slides
Using Oracle Database with Amazon Web Services by
Using Oracle Database with Amazon Web ServicesUsing Oracle Database with Amazon Web Services
Using Oracle Database with Amazon Web Servicesguest484c12
5.7K views79 slides
Redis — memcached on steroids by
Redis — memcached on steroidsRedis — memcached on steroids
Redis — memcached on steroidsRobert Lehmann
1.8K views91 slides
All about Storage - Series 2 Defining Data by
All about Storage - Series 2 Defining DataAll about Storage - Series 2 Defining Data
All about Storage - Series 2 Defining DataDAGEOP LTD
128 views30 slides

More Related Content

Similar to Database storage engine internals.pptx

Data Replication Options in AWS (ARC302) | AWS re:Invent 2013 by
Data Replication Options in AWS (ARC302) | AWS re:Invent 2013Data Replication Options in AWS (ARC302) | AWS re:Invent 2013
Data Replication Options in AWS (ARC302) | AWS re:Invent 2013Amazon Web Services
12.7K views78 slides
How to Get a Game Changing Performance Advantage with Intel SSDs and Aerospike by
How to Get a Game Changing Performance Advantage with Intel SSDs and AerospikeHow to Get a Game Changing Performance Advantage with Intel SSDs and Aerospike
How to Get a Game Changing Performance Advantage with Intel SSDs and AerospikeAerospike, Inc.
1.5K views16 slides
5 Steps to Faster Web Sites and HTML5 Games by
5 Steps to Faster Web Sites and HTML5 Games5 Steps to Faster Web Sites and HTML5 Games
5 Steps to Faster Web Sites and HTML5 GamesMichael Ewins
1.3K views62 slides
Amazed by AWS Series #4 by
Amazed by AWS Series #4Amazed by AWS Series #4
Amazed by AWS Series #4Amazon Web Services Korea
1.7K views102 slides
Building the Perfect SharePoint 2010 Farm - Sharing the Point South America by
Building the Perfect SharePoint 2010 Farm - Sharing the Point South AmericaBuilding the Perfect SharePoint 2010 Farm - Sharing the Point South America
Building the Perfect SharePoint 2010 Farm - Sharing the Point South AmericaMichael Noel
699 views35 slides
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the... by
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...Knut Relbe-Moe [MVP, MCT]
390 views50 slides

Similar to Database storage engine internals.pptx(20)

Data Replication Options in AWS (ARC302) | AWS re:Invent 2013 by Amazon Web Services
Data Replication Options in AWS (ARC302) | AWS re:Invent 2013Data Replication Options in AWS (ARC302) | AWS re:Invent 2013
Data Replication Options in AWS (ARC302) | AWS re:Invent 2013
Amazon Web Services12.7K views
How to Get a Game Changing Performance Advantage with Intel SSDs and Aerospike by Aerospike, Inc.
How to Get a Game Changing Performance Advantage with Intel SSDs and AerospikeHow to Get a Game Changing Performance Advantage with Intel SSDs and Aerospike
How to Get a Game Changing Performance Advantage with Intel SSDs and Aerospike
Aerospike, Inc. 1.5K views
5 Steps to Faster Web Sites and HTML5 Games by Michael Ewins
5 Steps to Faster Web Sites and HTML5 Games5 Steps to Faster Web Sites and HTML5 Games
5 Steps to Faster Web Sites and HTML5 Games
Michael Ewins1.3K views
Building the Perfect SharePoint 2010 Farm - Sharing the Point South America by Michael Noel
Building the Perfect SharePoint 2010 Farm - Sharing the Point South AmericaBuilding the Perfect SharePoint 2010 Farm - Sharing the Point South America
Building the Perfect SharePoint 2010 Farm - Sharing the Point South America
Michael Noel699 views
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the... by Knut Relbe-Moe [MVP, MCT]
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...
HPCC Systems vs Hadoop by Fujio Turner
HPCC Systems vs HadoopHPCC Systems vs Hadoop
HPCC Systems vs Hadoop
Fujio Turner3.4K views
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features by Amazon Web Services
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
Amazon Web Services30.9K views
Site Performance - From Pinto to Ferrari by Joseph Scott
Site Performance - From Pinto to FerrariSite Performance - From Pinto to Ferrari
Site Performance - From Pinto to Ferrari
Joseph Scott3.5K views
MySpace Data Architecture June 2009 by Mark Ginnebaugh
MySpace Data Architecture June 2009MySpace Data Architecture June 2009
MySpace Data Architecture June 2009
Mark Ginnebaugh2.3K views
Virtualization and SAN Basics for DBAs by Quest Software
Virtualization and SAN Basics for DBAsVirtualization and SAN Basics for DBAs
Virtualization and SAN Basics for DBAs
Quest Software901 views
Designing Information Structures For Performance And Reliability by bryanrandol
Designing Information Structures For Performance And ReliabilityDesigning Information Structures For Performance And Reliability
Designing Information Structures For Performance And Reliability
bryanrandol608 views
Maa wp-10g-racprimaryracphysicalsta-131940 by gopalchsamanta
Maa wp-10g-racprimaryracphysicalsta-131940Maa wp-10g-racprimaryracphysicalsta-131940
Maa wp-10g-racprimaryracphysicalsta-131940
gopalchsamanta177 views
Sql server backup internals by Hamid J. Fard
Sql server backup internalsSql server backup internals
Sql server backup internals
Hamid J. Fard861 views
ora_sothea by thysothea
ora_sotheaora_sothea
ora_sothea
thysothea709 views
Understanding Solid State Disk and the Oracle Database Flash Cache (older ver... by Guy Harrison
Understanding Solid State Disk and the Oracle Database Flash Cache (older ver...Understanding Solid State Disk and the Oracle Database Flash Cache (older ver...
Understanding Solid State Disk and the Oracle Database Flash Cache (older ver...
Guy Harrison3.3K views
The care and feeding of a MySQL database by Dave Stokes
The care and feeding of a MySQL databaseThe care and feeding of a MySQL database
The care and feeding of a MySQL database
Dave Stokes879 views
AWS March 2016 Webinar Series - Managed Database Services on Amazon Web Services by Amazon Web Services
AWS March 2016 Webinar Series - Managed Database Services on Amazon Web ServicesAWS March 2016 Webinar Series - Managed Database Services on Amazon Web Services
AWS March 2016 Webinar Series - Managed Database Services on Amazon Web Services
Amazon Web Services1.4K views
Experiences with Oracle SPARC S7-2 Server by JomaSoft
Experiences with Oracle SPARC S7-2 ServerExperiences with Oracle SPARC S7-2 Server
Experiences with Oracle SPARC S7-2 Server
JomaSoft273 views
Best practices for using flash in hyperscale software storage architectures by Eric Carter
Best practices for using flash in hyperscale software storage architecturesBest practices for using flash in hyperscale software storage architectures
Best practices for using flash in hyperscale software storage architectures
Eric Carter477 views

Recently uploaded

nintendo_64.pptx by
nintendo_64.pptxnintendo_64.pptx
nintendo_64.pptxpaiga02016
5 views7 slides
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx by
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptxanimuscrm
15 views19 slides
Programming Field by
Programming FieldProgramming Field
Programming Fieldthehardtechnology
5 views9 slides
ShortStory_qlora.pptx by
ShortStory_qlora.pptxShortStory_qlora.pptx
ShortStory_qlora.pptxpranathikrishna22
5 views10 slides
Keep by
KeepKeep
KeepGeniusee
77 views10 slides
FIMA 2023 Neo4j & FS - Entity Resolution.pptx by
FIMA 2023 Neo4j & FS - Entity Resolution.pptxFIMA 2023 Neo4j & FS - Entity Resolution.pptx
FIMA 2023 Neo4j & FS - Entity Resolution.pptxNeo4j
8 views26 slides

Recently uploaded(20)

2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx by animuscrm
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx
animuscrm15 views
FIMA 2023 Neo4j & FS - Entity Resolution.pptx by Neo4j
FIMA 2023 Neo4j & FS - Entity Resolution.pptxFIMA 2023 Neo4j & FS - Entity Resolution.pptx
FIMA 2023 Neo4j & FS - Entity Resolution.pptx
Neo4j8 views
Navigating container technology for enhanced security by Niklas Saari by Metosin Oy
Navigating container technology for enhanced security by Niklas SaariNavigating container technology for enhanced security by Niklas Saari
Navigating container technology for enhanced security by Niklas Saari
Metosin Oy14 views
predicting-m3-devopsconMunich-2023.pptx by Tier1 app
predicting-m3-devopsconMunich-2023.pptxpredicting-m3-devopsconMunich-2023.pptx
predicting-m3-devopsconMunich-2023.pptx
Tier1 app7 views
Bootstrapping vs Venture Capital.pptx by Zeljko Svedic
Bootstrapping vs Venture Capital.pptxBootstrapping vs Venture Capital.pptx
Bootstrapping vs Venture Capital.pptx
Zeljko Svedic12 views
Generic or specific? Making sensible software design decisions by Bert Jan Schrijver
Generic or specific? Making sensible software design decisionsGeneric or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisions
Fleet Management Software in India by Fleetable
Fleet Management Software in India Fleet Management Software in India
Fleet Management Software in India
Fleetable11 views
Copilot Prompting Toolkit_All Resources.pdf by Riccardo Zamana
Copilot Prompting Toolkit_All Resources.pdfCopilot Prompting Toolkit_All Resources.pdf
Copilot Prompting Toolkit_All Resources.pdf
Riccardo Zamana10 views
Ports-and-Adapters Architecture for Embedded HMI by Burkhard Stubert
Ports-and-Adapters Architecture for Embedded HMIPorts-and-Adapters Architecture for Embedded HMI
Ports-and-Adapters Architecture for Embedded HMI
Burkhard Stubert21 views
Dev-Cloud Conference 2023 - Continuous Deployment Showdown: Traditionelles CI... by Marc Müller
Dev-Cloud Conference 2023 - Continuous Deployment Showdown: Traditionelles CI...Dev-Cloud Conference 2023 - Continuous Deployment Showdown: Traditionelles CI...
Dev-Cloud Conference 2023 - Continuous Deployment Showdown: Traditionelles CI...
Marc Müller41 views
Introduction to Git Source Control by John Valentino
Introduction to Git Source ControlIntroduction to Git Source Control
Introduction to Git Source Control
John Valentino5 views
Dev-HRE-Ops - Addressing the _Last Mile DevOps Challenge_ in Highly Regulated... by TomHalpin9
Dev-HRE-Ops - Addressing the _Last Mile DevOps Challenge_ in Highly Regulated...Dev-HRE-Ops - Addressing the _Last Mile DevOps Challenge_ in Highly Regulated...
Dev-HRE-Ops - Addressing the _Last Mile DevOps Challenge_ in Highly Regulated...
TomHalpin96 views

Database storage engine internals.pptx

  • 1. Demystifying data structures and algorithms adopted by database storage engine Adewumi Sunkanmi D.
  • 2. Demystifying data structures and algorithms used by database storage engine
  • 3. Adewumi Sunkanmi D. Senior Software Engineer at Acronis working on Advanced Automation, one of the cloud services offered by Acronis Cyber Cloud.
  • 4. Outline 1. Overview of a three-tier application 2. Criteria for selecting the best database for an application 3. Overview of database architecture 4. Types for database storage engines and their tradeoffs 5. Q/A
  • 15. How do we select the best database for an application? 1. Structure of the data we want to store; structured, unstructured, graph?
  • 16. How do we select the best database for an application? 1. Structure of the data we want to store; structured, unstructured, graph? 1. Scalability of the database
  • 17. How do we select the best database for an application? 1. Structure of the data we want to store; structured, unstructured, graph? 1. Scalability of the database - Horizontal or Vertical scaling
  • 18. How do we select the best database for an application? 1. Structure of the data we want to store; structured, unstructured, graph? 1. Scalability of the database - Horizontal or Vertical scaling - Sharding(Partition data across nodes)
  • 19. How do we select the best database for an application? 1. Structure of the data we want to store; structured, unstructured, graph? 1. Scalability of the database - Horizontal or Vertical scaling - Sharding(Partition data across nodes) - Replication(Copies of data on multiple nodes)
  • 20. How do we select the best database for an application? 1. Structure of the data we want to store; structured, unstructured, graph? 1. Scalability of the database - Horizontal or Vertical scaling - Sharding(Partition data across nodes) - Replication(Copies of data on multiple nodes) 3. Support and familiarity of developers with database
  • 21. How do we select the best database for an application? 1. Structure of the data we want to store; structured, unstructured, graph? 1. Scalability of the database - Horizontal or Vertical scaling - Sharding(Partition data across nodes) - Replication(Copies of data on multiple nodes) 3. Support and familiarity of developers with database 4. Rate of write and read and how EXACTLY are these operations handled at the hardware level?
  • 23. SELECT COLS FROM WHERE COL_ID students > score 70 firstname lastname “SELECT firstname, lastname FROM students WHERE score > 70;”
  • 26. Types of storage engines - Log Structured Merge (LSM) Tree - Page Oriented (B-Tree)
  • 28. Log Structured Merge Tree Storage Engine The LMS tree is an immutable disk resident data structure and it is optimized for sequential writes while maintaining the acceptable read performance.
  • 29. Log Structured Merge Tree Storage Engine Three components 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table)
  • 30. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177
  • 31. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177 write ben: 300 Memtable e.g Red black tree in RAM
  • 32. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177 write ben: 300 Memtable e.g Red black tree in RAM josh: 500
  • 33. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177 write bin: 220 Memtable e.g Red black tree in RAM ben: 300 josh: 500 Threshold reached!
  • 34. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 SSD/HDD file (SSTable file) T1 ben: 300 bin: 220 josh: 500
  • 35. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 SSD/HDD file (SSTable file) 40MB
  • 36. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 SSD/HDD file (SSTable file) 40MB 10MB 10MB 10MB 10MB alexandar : 10 andreas : 50 ……. erik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ………
  • 37. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 SSD/HDD file (SSTable file) 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB
  • 38. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 177 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 SSD/HDD file (segment file) 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB Find(apa)
  • 39. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 SSD/HDD file (SSTable file) 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 300 mia: 220 write
  • 40. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 SSD/HDD file (SSTable file)
  • 41. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) SSD/HDD file (SSTable file)
  • 42. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 SSD/HDD file (SSTable file)
  • 43. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 SSD/HDD file (SSTable file)
  • 44. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 Step 3 SSD/HDD file (SSTable file)
  • 45. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 Step 3 SSD/HDD file (SSTable file)
  • 46. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 Step 3 How do we handle update? Since we return from the most recent memtable or segment file, we just insert the key with the new value, Ben will be returned from T2 not T1 SSD/HDD file (SSTable file)
  • 47. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 Step 3 How do we handle delete? Insert the key with a delete marker called tombstone, since this will be the most recent, we can tell it has been deleted, e.g ben->null SSD/HDD file (SSTable file)
  • 48. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 Step 3 But now we have duplicates, space wastage :( SSD/HDD file (SSTable file)
  • 49. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 Step 3 But now we have duplicates, space wastage :( Yes, but compaction will help SSD/HDD file (SSTable file)
  • 50. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 Step 3 But now we have duplicates, space wastage :( Yes, but compaction will help Compaction SSD/HDD file (SSTable file)
  • 51. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 Step 3 What if we don’t find the key, we search all the SSTable files? Compaction SSD/HDD file (SSTable file)
  • 52. Log Structured Merge Tree Storage Engine Key present?: Strict NO if not 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 Step 3 What if we don’t find the key, we search all the SSTable files? Compaction Optimtimize reads with Bloom Filters Maybe or Maybe not(99% accurate) https://brilliant.org/wiki/bloom-filter/ SSD/HDD file (SSTable file)
  • 53. Log Structured Merge Tree Storage Engine 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table) ben 300 josh 500 bin 220 ben 177 mia 220 eve 173 write bin: 220 Memtable e.g Red Black tree in RAM ben: 300 josh: 500 flush ben: 300 bin: 220 josh: 500 T1 40MB 10MB 10MB 10MB 400 alexandar : 10 andreas : 50 ……. arik : 500 erling : 200 …….. jan : 11 johan : 300 …….. robert: 499 roy: 200 ……… In-memory index Key Byte offset alexandar 0 arik 303 jan robert 500 10MB eve: 173 ben: 177 mia: 220 write ben: 177 eve: 173 mia: 200 flush T2 Read(ben) Step 1 Step 2 Step 3 What if power failure happens before data is flushed to disk? Compaction 1. Persist write in an append only log file before writing to in-memory table. WAL 2. Recreate memtable from last Log Sequence Number. SSD/HDD file (SSTable file)
  • 54. Log Structured Merge Tree Storage Engine Where is LSM tree Storage engine used? 1. Apache Cassandra 2. WiredTiger 3. InfluxDB 4. Yugabyte DB 5. ScyllaDB 6. CockroachDB 7. Google’s BigTable 8. RocksDB
  • 55. Types of storage engines - Log Structured Merge (LSM) Tree - Page Oriented (B-Tree)
  • 58. https://carlosproal.com/ir/papers/p121-comer.pdf B-Trees Important notes on B-tree 1. Store key value pairs (sorted by key) 2. Self balancing 3. Often used for indexing 4. Mutable data structure(in place update) 5. Each node is a fixed size block/page 4KB 6. Can only read or write one page at a time
  • 59. https://carlosproal.com/ir/papers/p121-comer.pdf Anatomy of B-Tree 12 30 13 23 25 27 50 120 31 37 42 49 52 60 69 90 69 70 78 85 Key < 10 Key [120, inf) key [12, 30) key [30, 50) key [50, 120) key [69, 90) val val val
  • 61. https://carlosproal.com/ir/papers/p121-comer.pdf 12 30 13 23 25 27 50 120 31 37 42 49 52 60 69 90 69 70 78 85 Key < 10 Key [120, inf) key [12, 30) key [30, 50) key [50, 120) key [69, 90) val val val NOTE: Leaf Page contains both the key and value Anatomy of B-Trees
  • 62. https://carlosproal.com/ir/papers/p121-comer.pdf 12 30 13 23 25 27 50 120 31 37 42 49 52 60 69 90 Key < 10 Key [120, inf) key [12, 30) key [30, 50) key [50, 120) 69 70 78 85 key [69, 90) val val val Branching factor = 5 Depth= 3 Anatomy of B-Tree
  • 63. https://carlosproal.com/ir/papers/p121-comer.pdf 12 30 13 23 25 27 50 120 31 37 42 49 52 60 69 90 Key < 10 Key [120, inf) key [12, 30) key [30, 50) key [50, 120) 69 70 78 85 key [69, 90) val val val READ(78) Anatomy of B-Tree
  • 64. https://carlosproal.com/ir/papers/p121-comer.pdf 12 30 13 23 25 27 50 120 31 37 42 49 52 60 69 90 Key < 10 Key [120, inf) key [12, 30) key [30, 50) key [50, 120) 69 70 78 85 key [69, 90) val val val READ(78) Anatomy of B-Tree
  • 65. https://carlosproal.com/ir/papers/p121-comer.pdf 12 30 13 23 25 27 50 120 31 37 42 49 52 60 69 90 Key < 10 Key [120, inf) key [12, 30) key [30, 50) key [50, 120) 69 70 78 85 key [69, 90) val val val READ(78) Anatomy of B-Tree
  • 66. https://carlosproal.com/ir/papers/p121-comer.pdf 12 30 13 23 25 27 50 120 31 37 42 49 52 60 69 90 Key < 10 Key [120, inf) key [12, 30) key [30, 50) key [50, 120) 69 70 78 85 key [69, 90) val val val found! READ(78) Anatomy of B-Tree
  • 67. https://carlosproal.com/ir/papers/p121-comer.pdf 12 30 13 23 25 27 50 120 31 37 42 49 52 60 69 90 Key < 10 Key [120, inf) key [12, 30) key [30, 50) key [50, 120) 69 70 78 85 key [69, 90) val val val found! READ(78) Anatomy of B-Tree Searching for a key is faster because we are not scaning all keys but only keys within range, takes O(log n) Where n is the total number of keys
  • 68. https://carlosproal.com/ir/papers/p121-comer.pdf 60 69 90 val val val val 70 78 85 INSERT(87) 87 69 val val val val 70 78 85 86 val Branching factor - 5
  • 69. https://carlosproal.com/ir/papers/p121-comer.pdf 60 69 90 Branching factor exceeded! > 5 Create new page val val val val 70 78 85 87 69 val val val val 70 78 85 86 val 87 val Branching factor - 5 INSERT(87)
  • 70. https://carlosproal.com/ir/papers/p121-comer.pdf 60 69 90 69 70 78 val val val 85 86 87 val val val INSERT(87) Branching factor exceeded! > 5 Create new page
  • 71. https://carlosproal.com/ir/papers/p121-comer.pdf 60 69 85 69 70 78 val val val 85 86 87 val val val 90 Add 85 to parent page INSERT(87)
  • 72. https://carlosproal.com/ir/papers/p121-comer.pdf 60 69 85 69 70 78 val val val 85 86 87 val val val 90 Add 85 to parent page What if the parent page is full? Split it INSERT(87)
  • 73. https://carlosproal.com/ir/papers/p121-comer.pdf 60 69 85 69 70 78 val val val 85 86 87 val val val 90 Add 85 to parent page How does update work? 1. Find the leaf page with key 2. Edit the row 3. Overwrite the page INSERT(87)
  • 74. LSM trees Vs B-Trees storage engine LSM Tree B-Tree Optimized for write Optimized for read Compressed better(No Fragmentation) Fragmentation wastes space There can be duplicates before compaction Each key exist exactly in one place Strong transaction support Spikes in write can cause slow compaction due to many SSTable files. Can cause Out of Memory Error(OOM)
  • 75. Space optimization in B-tree Primary index(primary key index) Secondary index
  • 76. Space optimization in B-tree Secondary index Primary index(primary key index) Leaf page contains both key and value Leaf page contains both key and value DUPLICATE !
  • 77. Space optimization in B-tree Secondary index Primary index(primary key index) Store value offset(smaller in size) Store value offset (smaller in size) val1 val2 val3 val4 val5 … Heap File
  • 78. Space optimization in B-tree Secondary index Primary index(primary key index) Store value offset(smaller in size) Store value offset (smaller in size) val1 val2 val3 val4 val5 … Heap File Store value offset(smaller in size) Extra Disk I/O So you can store important columns in leaf page and less important columns in heap file