RocksDB is an embedded key-value store written in C++ and optimized for fast storage environments like flash or RAM. It uses a log-structured merge tree to store data by writing new data sequentially to an in-memory log and memtable, periodically flushing the memtable to disk in sorted SSTables. It reads from the memtable and SSTables, and performs background compaction to merge SSTables and remove overwritten data. RocksDB supports two compaction styles - level style, which stores SSTables in multiple levels sorted by age, and universal style, which stores all SSTables in level 0 sorted by time.
Overview of RocksDB as an embedded key-value store optimized for fast storage solutions.
Key constructs of RocksDB include Memtable (in-memory), Logfile (sequential), and SSTable (storage). Each structure aids efficiency and data organization.
Explains the architecture layout of RocksDB with active and read-only Memtables, logs, and SSTables, emphasizing persistent storage.
Detailed discussion on the Log-Structured Merge Tree (LSM-tree) which optimizes write operations by minimizing random writes through periodic compactions. Multi-threaded compaction strategies including Level Style and Universal Style compactions, highlighting their roles in managing SSTables and optimizing database performance.
RocksDB Introduction
• Opensource based on LevelDB 1.5, written in C++
• Key-Value persistent store
• Embedded Library
• Pluggable database
• Optimized for fast storage (flash or RAM)
• Optimized for server workloads
• Get(), Put(), Delete()
A persistent key-value store for fast storage environments
4.
Three Basic Constructs
ofRocksDB
• Memtable
– in-memory data structure
– A buffer, temporarily host the incoming writes
• Logfile
– Sequentially-written file
– On storage
5.
Three Basic Constructs
ofRocksDB
• SSTable(=SSTfile)
– Sorted Static Table on storage
– A file which contains a set of arbitrary, sorted key-value
pairs inside
– Organized in levels
– Immutable in its life time
– sorted data → to facilitate easy lookup of keys
– Storage of the entire database
SSTable & Memtable
ofRocksDB
• On-disk SSTable indexes are always loaded into memory
• All writes go directly to the Memtable index
• Reads check the Memtable first → the SSTable indexes
• Periodically, the Memtable is flushed to disk as an SSTable
• Periodically, on-disk SSTables are merged
→ update/delete records will overwrite/remove the older
data
Log-Structured Merge Tree
•LSM-tree
– N-level merge trees
– Splitting a logical tree into several physical pieces
– So that the most-recently-updated portion of data is in a tree in
memory
– Transform random writes into sequential writes using logfile &
in-memory store(Memtable)
16.
Log-Structured Merge DB
tominimize “random writes”
Write RequestRead Request
Read Write data
in RAM
Read Only data in RAM
on disk
Periodic
Compaction
Transaction Log
17.
Log-Structured Merge DB
tominimize “random writes”
① Data Write(Insert, Update)
• New puts are written to memory(Memtable) & logfile
sequentially
• Memtable is filled up → flushed to a SSTable on disk
• Operated in memory, no disk access → faster than B+ tree
② Data Read
• Memtable → SSTable
• Maintain all the SSTable indexes in memory
18.
RocksDB Compaction
Multi-threaded compactions
•Background Multi-thread
→ periodically do the “compaction”
→ parallel compactions on different parts of the database
can occur simultaneously
• Merge SSTfiles to a bigger SSTfile
• Remove multiple copies of the same key
– Duplicate or overwritten keys
• Process deletions of keys
• Supports two different styles of compaction
– Tunable compaction to trade-off
1. Level StyleCompaction
• RocksDB default compaction style
• Stores data in multiple levels in the database
• More recent data → L0
The oldest data → Lmax
• Files in L0
- overlapping keys, sorted by flush time
Files in L1 and higher
- non-overlapping keys, sorted by key
• Each level is 10 times larger than the previous one
Inherited from LevelDB
21.
Level Style Compaction
Compactionprocess
cache
log
level1
level2
level3
level0
① Pick one file from level N
② Compact it with all its overlapping
files from level N+1
③ Replace them with new files in
level N+1
Level 0 →Level 1 Compaction
• Level 0 → overlapping keys
• Compaction includes all files from L1
• All files from L1 are compacted with L0
• L0 → L1 compaction completion
L1 → L2 compaction start
• Single thread compaction → not good throughput
• Solution : Making the size of L0 similar to size of L1
Tricky Compaction
24.
2. Universal StyleCompaction
• For write-heavy workloads
→ Level Style Compaction may be bottlenecked on
disk throughput
• Stores all files in L0
• All files are arranged in time order
• Temporarily increase size amplification by a factor of
two
• Intended to decrease write amplification
• But, increase space amplification
25.
Universal Style Compaction
①Pick up a few files that are chronologically adjacent to one
another
② Merge them
③ Replace them with a new file in level 0
Compaction process
26.
Universal Style Compaction
•size_ratio
- Percentage flexibility while comparing file size
- Default : 1
• min_merge_width
- The minimum number of files in a single compaction
- Default : 2
• max_merge_width
- The maximum number of files in a single compaction
- Default : UINT_MAX
Compaction options
#4 Facebook 사에서 수행한 방대한 데이터를 빠른 저장장치(SSD)에서 동작할 데이터베이스 소프트웨어 개발 프로젝트 / 대용량 데이터 처리에 대체로 적합하다 / C++ 라이브러리 / 기존의 관계형 데이터베이스와는 달리 KEY-VALUE 저장방식 / 기존의 관계형 데이터베이스에서의 테이블 간의 관계 설정이나 Join 연산과 같은 개념과 기능을 대폭 축소 / LevelDB와 비교하면 플래쉬 스토리지를 고속의 액세스 성능을 풀 활용할 수 있기 때문에 랜덤 읽기/쓰기나 대량 업로드 전반에 걸쳐서 고속화 할 수 있다. 랜덤 쓰기와 대량 업로드에서는 10배, 랜덤 읽기에서는 30%의 고속화를 실현하고 있다.
#15 MANIFEST files will be formatted as a log all changes cause a state change (add or delete) will be appended to the log. A MANIFEST file lists the set of sorted tables that make up each level
Informational messages are printed to files named LOG and LOG.old.
CURRENT is a latest manifest file name of the text file