안미진
RocksDB
Embedded Key-Value Store for Flash and RAM
Contents
1. RocksDB Introduction
2. RocksDB Architecture
3. LSM DB
4. RocksDB Compaction
Overview
RocksDB Introduction
• Open source based on LevelDB 1.5, written in C++
• Key-Value persistent store
• Embedded Library
• Pluggable database
• Optimized for fast storage (flash or RAM)
• Optimized for server workloads
• Get(), Put(), Delete()
A persistent key-value store for fast storage environments
Three Basic Constructs
of RocksDB
• Memtable
– in-memory data structure
– A buffer, temporarily host the incoming writes
• Logfile
– Sequentially-written file
– On storage
Three Basic Constructs
of RocksDB
• SSTable(=SSTfile)
– Sorted Static Table on storage
– A file which contains a set of arbitrary, sorted key-value
pairs inside
– Organized in levels
– Immutable in its life time
– sorted data → to facilitate easy lookup of keys
– Storage of the entire database
SSTable-BlockBasedTable
of RocksDB
Data Data Data …
Meta
(filter)
Meta
(stats)
…
Meta
index
Data
index
Footer
The default SSTable format in RocksDB
SSTable & Memtable
of RocksDB
• On-disk SSTable indexes are always loaded into memory
• All writes go directly to the Memtable index
• Reads check the Memtable first → the SSTable indexes
• Periodically, the Memtable is flushed to disk as an SSTable
• Periodically, on-disk SSTables are merged
→ update/delete records will overwrite/remove the older
data
Simplified RocksDB
Memory
Storage
Memtable
SSTable 1
SSTable 2
Key Offset
Key Offset
… …
Index
Key Value Key Value … …
Simplified SSTable file
RocksDB Architecture
Active
Memtable
Read-Only
Memtable
Memory
Log
Log
SSTSSTSST
SSTSSTSST
Persistent Storage
Write Request
Read Request LSM Files
CompactionFlush
Switch Switch
RocksDB Architecture
Active
Memtable
Read-Only
Memtable
Memory
Log
Log
SSTSSTSST
SSTSSTSST
Persistent Storage
Write Request
Read Request LSM Files
CompactionFlush
Switch Switch
RocksDB Architecture
Active
Memtable
Read-Only
Memtable
Memory
Log
Log
SSTSSTSST
SSTSSTSST
Persistent Storage
Write Request
Read Request LSM Files
CompactionFlush
Switch Switch
RocksDB Architecture
Active
Memtable
Read-Only
Memtable
Memory
Log
Log
SSTSSTSST
SSTSSTSST
Persistent Storage
Write Request
LSM Files
CompactionFlush
Switch Switch
Read Request
RocksDB Architecture
Active
Memtable
Read-Only
Memtable
Memory
Log
Log
SSTSSTSST
SSTSSTSST
Persistent Storage
Write Request
LSM Files
CompactionFlush
Switch Switch
Read Request
RocksDB Architecture
Active
Memtable
(4MB)
Immutable
Memtable
Memory
Disk
Write
Level 0
(4 SSTfile)
Level 1
(10MB)
Level 2
(100MB)
. . .
. . . . . .
Info Log
MANIFEST
CURRENT
Compaction
Log
SSTfile
(2MB)
Log-Structured Merge Tree
• LSM-tree
– N-level merge trees
– Splitting a logical tree into several physical pieces
– So that the most-recently-updated portion of data is in a tree in
memory
– Transform random writes into sequential writes using logfile &
in-memory store(Memtable)
Log-Structured Merge DB
to minimize “random writes”
Write RequestRead Request
Read Write data
in RAM
Read Only data in RAM
on disk
Periodic
Compaction
Transaction Log
Log-Structured Merge DB
to minimize “random writes”
① Data Write(Insert, Update)
• New puts are written to memory(Memtable) & logfile
sequentially
• Memtable is filled up → flushed to a SSTable on disk
• Operated in memory, no disk access → faster than B+ tree
② Data Read
• Memtable → SSTable
• Maintain all the SSTable indexes in memory
RocksDB Compaction
Multi-threaded compactions
• Background Multi-thread
→ periodically do the “compaction”
→ parallel compactions on different parts of the database
can occur simultaneously
• Merge SSTfiles to a bigger SSTfile
• Remove multiple copies of the same key
– Duplicate or overwritten keys
• Process deletions of keys
• Supports two different styles of compaction
– Tunable compaction to trade-off
RocksDB Compaction
Storage
SSTable 1
SSTable 2
SSTable 3
SSTable 4
SSTable 5
1. Level Style Compaction
• RocksDB default compaction style
• Stores data in multiple levels in the database
• More recent data → L0
The oldest data → Lmax
• Files in L0
- overlapping keys, sorted by flush time
Files in L1 and higher
- non-overlapping keys, sorted by key
• Each level is 10 times larger than the previous one
Inherited from LevelDB
Level Style Compaction
Compaction process
cache
log
level1
level2
level3
level0
① Pick one file from level N
② Compact it with all its overlapping
files from level N+1
③ Replace them with new files in
level N+1
Level Style Compaction
Compaction example
5 bytes
6 bytes
10 bytes 10 bytes
11 bytes
10 bytes
Level-0
Level-1
Level-2
Stage 1 Stage 2 Stage 3
Two compactions by Level Style Compaction
Level 0 → Level 1 Compaction
• Level 0 → overlapping keys
• Compaction includes all files from L1
• All files from L1 are compacted with L0
• L0 → L1 compaction completion
L1 → L2 compaction start
• Single thread compaction → not good throughput
• Solution : Making the size of L0 similar to size of L1
Tricky Compaction
2. Universal Style Compaction
• For write-heavy workloads
→ Level Style Compaction may be bottlenecked on
disk throughput
• Stores all files in L0
• All files are arranged in time order
• Temporarily increase size amplification by a factor of
two
• Intended to decrease write amplification
• But, increase space amplification
Universal Style Compaction
① Pick up a few files that are chronologically adjacent to one
another
② Merge them
③ Replace them with a new file in level 0
Compaction process
Universal Style Compaction
• size_ratio
- Percentage flexibility while comparing file size
- Default : 1
• min_merge_width
- The minimum number of files in a single compaction
- Default : 2
• max_merge_width
- The maximum number of files in a single compaction
- Default : UINT_MAX
Compaction options
Universal Style Compaction
Compaction example
5 bytes
6 bytes
10 bytes 10 bytes
Stage 1 Stage 2
Single compaction by Universal Style Compaction
Level-0

RocksDB detail

Editor's Notes

  • #4 Facebook 사에서 수행한 방대한 데이터를 빠른 저장장치(SSD)에서 동작할 데이터베이스 소프트웨어 개발 프로젝트 / 대용량 데이터 처리에 대체로 적합하다 / C++ 라이브러리 / 기존의 관계형 데이터베이스와는 달리 KEY-VALUE 저장방식 / 기존의 관계형 데이터베이스에서의 테이블 간의 관계 설정이나 Join 연산과 같은 개념과 기능을 대폭 축소 / LevelDB와 비교하면 플래쉬 스토리지를 고속의 액세스 성능을 풀 활용할 수 있기 때문에 랜덤 읽기/쓰기나 대량 업로드 전반에 걸쳐서 고속화 할 수 있다. 랜덤 쓰기와 대량 업로드에서는 10배, 랜덤 읽기에서는 30%의 고속화를 실현하고 있다.
  • #15 MANIFEST files will be formatted as a log all changes cause a state change (add or delete) will be appended to the log. A MANIFEST file lists the set of sorted tables that make up each level Informational messages are printed to files named LOG and LOG.old. CURRENT is a latest manifest file name of the text file