In this presentation, we introduce liblightnvm, a user space library that manages provisioning and I/O submission for physical flash.
We argue that liblightnvm can benefit I/O-intensive applications by providing predictable latency and by reducing device write amplification, thus prolonging the device's endurance. We also show how to integrate liblightnvm with RocksDB.
3. 3
Embedded Flash Translation Layers (FTLs)
• Have enabled the wide adoption of SSDs by exposing a block I/O interface to the application
• However, they are a roadblock for I/O-intensive workloads:
  - They hardwire design decisions about data placement, over-provisioning, scheduling, garbage collection, and wear leveling -> assumptions about the application workload
  - They introduce redundancies, missed optimizations, and underutilization of resources
  - Predictable latencies cannot be guaranteed (high 99th percentiles)
  - They introduce unavoidable write amplification on the device side (GC)
[Diagram: the traditional stack. I/O-intensive applications, standard libraries, metadata management, and app-specific optimizations live in user space; the kernel provides the page cache, FS-specific logic, and the block I/O interface; the device-side FTL holds the L2P translation table and handles GC, over-provisioning, wear-levelling, scheduling, etc.]
➡ I/O-intensive applications use expensive data structures and employ host resources to align their I/O patterns to (unreachable) flash constraints (e.g., LSM trees)
4. 4
Open-Channel SSDs
[Diagram: the LightNVM stack. In the Linux kernel, LightNVM-compatible device drivers (NVMe, null-blk) feed a general media manager; kernel targets (e.g., RRPC) expose block I/O to file systems, key-value stores, object stores, and traditional apps, while liblightnvm gives I/O apps direct user-space access; a vendor full-stack FTL (CNEX) is also possible. The Open-Channel SSD itself retains metadata/state management, an ECC engine, and a XOR/RAID engine.]
• Host manages:
  - Data placement
  - Garbage collection
  - I/O scheduling
  - Over-provisioning
  - Wear-levelling
• Device information:
  - SSD offload engines & responsibilities
  - SSD geometry:
    • NAND media geometry
    • Channels, timings, etc.
    • Bad blocks
    • PPA format
[Diagram labels: common data structures (e.g., append-only, key-value) are built on decoupled interfaces (get_block/put_block) defined by the LightNVM spec; traditional interfaces remain available through a full-stack FTL, with HW offload engines underneath.]
Open-Channel SSDs share control responsibilities with the host in order to implement and maintain features that typical SSDs implement strictly in the device firmware.
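As a concrete illustration of this split, the sketch below mimics the get_block/put_block provisioning interface: the host asks a media manager for free blocks, places data itself, and returns blocks for reuse. All names are illustrative stand-ins for the LightNVM concepts, not liblightnvm's actual API.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct VBlock {
  uint64_t ppa;    // physical address of the block's first page
  uint32_t nppas;  // number of physical pages in the block
};

// Toy media manager: hands out free blocks and takes back used ones.
// A real media manager would also track bad blocks and wear-levelling.
class MediaManager {
 public:
  MediaManager(uint32_t nblocks, uint32_t pages_per_block) {
    for (uint32_t i = 0; i < nblocks; ++i)
      free_.push_back({uint64_t(i) * pages_per_block, pages_per_block});
  }
  bool get_block(VBlock *out) {        // provision one free block
    if (free_.empty()) return false;
    *out = free_.back();
    free_.pop_back();
    return true;
  }
  void put_block(const VBlock &blk) {  // return a block for erase + reuse
    free_.push_back(blk);
  }
 private:
  std::vector<VBlock> free_;
};

int main() {
  MediaManager mm(/*nblocks=*/1024, /*pages_per_block=*/256);
  VBlock blk;
  if (mm.get_block(&blk)) {
    // The host now owns data placement: it writes pages sequentially
    // inside the block and keeps its own logical-to-physical mapping.
    printf("got block: ppa=%llu, %u pages\n",
           (unsigned long long)blk.ppa, blk.nppas);
    mm.put_block(blk);  // fully invalidated -> recycled by the manager
  }
  return 0;
}
```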
5. 5
Why Open-Channel SSDs
• Optimize high-performance I/O applications for fast flash memories
  - Control data placement: use physical flash blocks directly
  - Remove device-side garbage collection (GC): reuse LSM compaction on the host
• Achieve predictable latency: no 99th-percentile outliers (measured in seconds)
• Avoid the write amplification introduced by the FTL: control device endurance with host software
• Improve the steady state of the device (and start from it)
  - Minimize over-provisioning (normal SSDs reserve 10-30% of capacity for performance)
[Plot: IOPS over time, showing the drop from fresh-device performance to steady state.]
7. 7
RocksDB: Using Open-Channel SSDs
• Sstables
  - P1: Fit flash block sizes in L0 and further levels (merges + compactions); see the sizing sketch below
    • No need for GC on the SSD side: RocksDB merging acts as GC (less write and space amplification)
  - P2: Keep block metadata to reconstruct an sstable in case of a host crash
• WAL (Write-Ahead Log) and MANIFEST
  - P3: Fit block sizes (same as for sstables)
  - P4: Keep block metadata to reconstruct the log in case of a host crash
• Other metadata
  - P5: Keep superblock metadata and allow the database to be recovered
  - P6: Keep other metadata to account for flash constraints (e.g., partial pages, bad pages, bad blocks)
• Process
  - P7: Upstream to vanilla RocksDB
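The block-fitting rule behind P1 and P3 amounts to rounding a file's target size to a whole number of flash blocks, so that merges, compactions, and log rotation never leave a partially owned block. A minimal sketch, with made-up geometry numbers:

```cpp
#include <cstdint>
#include <cstdio>

constexpr uint64_t kFlashPageSize  = 16 * 1024;  // bytes per flash page
constexpr uint64_t kPagesPerBlock  = 256;        // pages per flash block
constexpr uint64_t kFlashBlockSize = kFlashPageSize * kPagesPerBlock;

// Largest multiple of the flash block size not exceeding the wanted size
// (at least one block, since a file always owns whole blocks).
uint64_t FitToFlashBlocks(uint64_t wanted_bytes) {
  uint64_t blocks = wanted_bytes / kFlashBlockSize;
  if (blocks == 0) blocks = 1;
  return blocks * kFlashBlockSize;
}

int main() {
  uint64_t target = 64ull * 1024 * 1024;  // e.g., an L0 sstable target size
  uint64_t fitted = FitToFlashBlocks(target);
  printf("sstable target: %llu -> %llu bytes (%llu blocks)\n",
         (unsigned long long)target, (unsigned long long)fitted,
         (unsigned long long)(fitted / kFlashBlockSize));
  return 0;
}
```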
8. 8
RocksDB: Data Placement
[Diagram: the memtable arena (write_buffer_size) is carved into arena blocks (kArenaBlockSize); the optimal arena block size is ~1/10 of write_buffer_size.]
• WAL and MANIFEST are reused in future instances until replaced
  - P3: Ensure that the WAL/MANIFEST replacement size fills up most of the last block
• Sstable sizes follow a heuristic: MemTable::ShouldFlushNow()
  - P1: kArenaBlockSize = sizeof(flash block)
  - The heuristic is conservative in terms of over-allocation
    • A few lost pages are better than allocating a new block
  - The flash block size becomes a "static" DB tuning parameter that is used to optimize "dynamic" ones (see the sketch below)
[Diagram: a DFlash file laid out over Flash Block 0, Flash Block 1, …; each block spans nppas * FLASH_PAGE_SIZE bytes from offset 0 and is identified by a global id (gid: 128, 273, 481, …); the gap between EOF and the end of the last block is the file's space amplification (for the life of the file).]
➡ Optimize RocksDB bottom up (from storage backend to LSM)
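A minimal sketch of the conservative flush heuristic described above, assuming kArenaBlockSize has been set to the flash block size: flush when the next arena allocation would force provisioning a new flash block that the nearly full write buffer could not fill. This mirrors the spirit of MemTable::ShouldFlushNow(), not RocksDB's exact code; all constants are invented.

```cpp
#include <cstdint>
#include <cstdio>

constexpr uint64_t kFlashBlockSize  = 4ull * 1024 * 1024;  // assumed geometry
constexpr uint64_t kArenaBlockSize  = kFlashBlockSize;     // P1: arena block = flash block
constexpr uint64_t kWriteBufferSize = 10 * kArenaBlockSize;

bool ShouldFlushNow(uint64_t bytes_allocated, uint64_t next_alloc_bytes) {
  // Blocks owned now vs. blocks needed after the next allocation.
  uint64_t used   = (bytes_allocated + kArenaBlockSize - 1) / kArenaBlockSize;
  uint64_t after  = bytes_allocated + next_alloc_bytes;
  uint64_t needed = (after + kArenaBlockSize - 1) / kArenaBlockSize;
  // Flush when a new flash block would be provisioned near the budget
  // limit: a few lost tail pages beat owning a mostly empty block.
  bool needs_new_block = needed > used;
  bool near_budget     = after > kWriteBufferSize - kArenaBlockSize / 10;
  return needs_new_block && near_budget;
}

int main() {
  // ~39.8 MiB allocated; a 512 KiB allocation would cross into an 11th
  // block with almost no budget left -> flush instead of over-allocating.
  printf("flush? %d\n", ShouldFlushNow(39ull * 1024 * 1024 + 800 * 1024,
                                       512 * 1024));
  return 0;
}
```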
9. 9
RocksDB: Crash Recovery
[Diagram: layout of a flash block. The first valid page holds block-starting metadata, intermediate pages hold RocksDB data, and the last page holds block-ending metadata; each page's out-of-bound (OOB) area carries additional state.]
- Blocks can be checked for integrity
- A new DB instance can append; padding is maintained in the OOB area (P6)
- Closing a block updates bad-page and bad-block information (P6)
[Diagram: each metadata record consists of an OPCODE followed by encoded metadata. Metadata types: Log, Current, Metadata, Sstable. Private (Env) metadata is enough to recover the database in a new instance; private (DFlash) metadata records the vblocks forming the DFlash file. On the Open-Channel SSD, the block manager (BM) tags each block with an ownerID and returns the owner's block list on recovery (steps 1-3 below).]
- A “file” can be reconstructed from individual blocks (P2, P4, P5, P6):
  1. Metadata for the blocks forming a file is stored in the MANIFEST
  2. On recovery, LightNVM provides an application with all of its valid blocks
  3. Each block stores enough metadata to reconstruct a file
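The sketch below shows one way the per-block metadata could drive this recovery flow: starting metadata identifies the owning file and the block's position in it, ending metadata accounts for valid data and bad pages, and recovery groups and orders the returned blocks per file. The record layout is invented for clarity, not the prototype's actual format.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

enum class FileType : uint8_t { kLog, kCurrent, kMetadata, kSstable };

struct BlockStartMeta {      // stored in the block's first valid page
  uint64_t file_id;          // which "file" this block belongs to
  uint32_t seq_in_file;      // position of the block inside the file
  FileType type;
};

struct BlockEndMeta {        // stored in the last page when the block closes
  uint64_t valid_bytes;      // bytes of RocksDB data in this block
  uint32_t bad_page_bitmap;  // P6: bad/partial page accounting
};

struct RecoveredBlock {
  BlockStartMeta start;
  BlockEndMeta end;
  uint64_t ppa;              // where the block lives on the device
};

// Steps 2 and 3: given all valid blocks returned for this application,
// group them by file and order each file's blocks by sequence number.
std::map<uint64_t, std::vector<RecoveredBlock>>
ReconstructFiles(std::vector<RecoveredBlock> blocks) {
  std::map<uint64_t, std::vector<RecoveredBlock>> files;
  for (const auto &b : blocks) files[b.start.file_id].push_back(b);
  for (auto &kv : files)
    std::sort(kv.second.begin(), kv.second.end(),
              [](const RecoveredBlock &a, const RecoveredBlock &b) {
                return a.start.seq_in_file < b.start.seq_in_file;
              });
  return files;
}

int main() {
  std::vector<RecoveredBlock> blocks;
  blocks.push_back({{42, 1, FileType::kSstable}, {4096, 0}, 512});
  blocks.push_back({{42, 0, FileType::kSstable}, {8192, 0}, 256});
  auto files = ReconstructFiles(blocks);
  printf("file 42 rebuilt from %zu blocks\n", files[42].size());
  return 0;
}
```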
11. 11
Architecture: RocksDB + liblightnvm
• The LSM is the FTL
  - Tree nodes (files) control data placement on physical flash
  - Sstables, the WAL, and the MANIFEST live on the Open-Channel SSD; the rest stays in the FS
  - Garbage collection takes place during LSM compaction
• LightNVM manages the flash
  - liblightnvm takes care of flash block provisioning; Env DFlash works directly with PPAs (see the sketch below)
  - The media manager handles wear-levelling (get/put block)
[Diagram: RocksDB's LSM over two Env backends. Env DFlash implements DFWritableFile, DFSequentialFile, and DFRandomAccessFile on top of an append API; a DFlash file is a set of blocks whose placement is controlled by the LSM, and sstables, the WAL, and the MANIFEST take the fast I/O path through liblightnvm and the kernel media manager (which tracks free, used, and bad blocks) down to the Open-Channel SSD. Env Posix implements PxWritableFile, PxSequentialFile, and PxRandomAccessFile, keeping CURRENT, LOCK, IDENTITY, and LOG (info) on a file system on a traditional SSD. Device features feed the Env options, optimizations, and init code.]
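A minimal sketch of how such a DFlash file can sit behind RocksDB's Env abstraction: Append() streams data into the pages of provisioned flash blocks, grabbing a new block when the current one fills up. Only the rocksdb::WritableFile interface is RocksDB's; the FlashBlock type and the provisioning/write hooks are illustrative placeholders, not liblightnvm's actual API.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

#include <rocksdb/env.h>

struct FlashBlock { uint64_t ppa; uint64_t capacity; uint64_t written; };

// Placeholder hooks where the real backend would call into liblightnvm.
static FlashBlock ProvisionBlock() {                      // cf. get_block
  static uint64_t next_ppa = 0;
  return {next_ppa++ * 256, 256 * 16384, 0};
}
static rocksdb::Status WritePages(const FlashBlock &, const char *, size_t) {
  return rocksdb::Status::OK();                           // PPA-based write
}

class DFWritableFile : public rocksdb::WritableFile {
 public:
  rocksdb::Status Append(const rocksdb::Slice &data) override {
    size_t off = 0;
    while (off < data.size()) {
      if (blocks_.empty() ||
          blocks_.back().written == blocks_.back().capacity)
        blocks_.push_back(ProvisionBlock());  // LSM-controlled placement
      FlashBlock &blk = blocks_.back();
      size_t n = std::min<uint64_t>(data.size() - off,
                                    blk.capacity - blk.written);
      rocksdb::Status s = WritePages(blk, data.data() + off, n);
      if (!s.ok()) return s;
      blk.written += n;
      off += n;
    }
    return rocksdb::Status::OK();
  }
  // Closing would persist block-ending metadata; syncing would flush any
  // partial page plus its OOB state.
  rocksdb::Status Close() override { return rocksdb::Status::OK(); }
  rocksdb::Status Flush() override { return rocksdb::Status::OK(); }
  rocksdb::Status Sync() override { return rocksdb::Status::OK(); }

 private:
  std::vector<FlashBlock> blocks_;  // the vblocks forming this DFlash file
};
```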
12. 12
Architecture: Optimizing RocksDB
[Diagram: RocksDB (an I/O app) with its active memtable, read-only memtable, L0 sstables, and merging & compaction writes the WAL, MANIFEST, and sst1, sst2, …, sstn through the DFlash storage backend and liblightnvm, spreading read and write paths across LUN0…LUNm; traditional apps go through a file system and a kernel target (e.g., RRPC), over the general media manager and the LightNVM-compatible device drivers (NVMe, null-blk), down to the Open-Channel SSD with its metadata/state management, ECC engine, and XOR/RAID engine.]
• RocksDB is in full control:
  - Do not mix reads and writes
  - A different VLUN per I/O path
  - Different VLUN types
  - Enabling I/O scheduling
  - A block pool in DFlash (prefetching); see the sketch below
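A minimal sketch of the per-path block pool idea, under the assumption that each I/O path (e.g., WAL writes vs. compaction writes) draws from its own virtual LUN so reads and writes never mix on a parallel unit, and that a small pool of pre-provisioned blocks hides provisioning latency. Names and watermarks are illustrative.

```cpp
#include <cstdint>
#include <deque>

struct VBlock { uint64_t ppa; uint32_t lun; };

// Placeholder for the provisioning call into the media manager.
static VBlock GetBlockFromLun(uint32_t lun) {
  static uint64_t next_ppa = 0;
  return {next_ppa++, lun};
}

class BlockPool {
 public:
  BlockPool(uint32_t lun, size_t low_watermark)
      : lun_(lun), low_watermark_(low_watermark) { Refill(); }

  VBlock Acquire() {                // fast path: block already prefetched
    VBlock b = pool_.front();
    pool_.pop_front();
    if (pool_.size() < low_watermark_) Refill();  // prefetch ahead of use
    return b;
  }

 private:
  void Refill() {
    while (pool_.size() < 2 * low_watermark_)
      pool_.push_back(GetBlockFromLun(lun_));
  }
  uint32_t lun_;
  size_t low_watermark_;
  std::deque<VBlock> pool_;
};

int main() {
  BlockPool wal_pool(/*lun=*/0, /*low_watermark=*/4);  // WAL write path
  BlockPool sst_pool(/*lun=*/1, /*low_watermark=*/8);  // compaction path
  VBlock b = wal_pool.Acquire();                       // no get_block stall
  (void)b; (void)sst_pool;
  return 0;
}
```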
15. 15
Status & Ongoing work
• Status:
  - LightNVM has been accepted into the Linux kernel
  - Working prototype of RocksDB with the DFlash storage backend on LightNVM
  - Implemented liblightnvm append-only support + I/O interface (first iteration)
• Ongoing:
  - Performance testing and tuning on the Westlake ASIC
  - Move the RocksDB DFlash storage backend logic into liblightnvm
  - Improve I/O submission
    • Submit I/Os directly to the media manager
    • Support libaio to enable async I/O through the traditional stack
  - Exploit device parallelism within RocksDB's LSM
16. 16
Tools & Libraries
lnvm – Open-Channel SSD administration
https://github.com/OpenChannelSSD/lightnvm-adm
liblightnvm – LightNVM application support
https://github.com/OpenChannelSSD/liblightnvm
Test tools
https://github.com/OpenChannelSSD/lightnvm-hw
QEMU NVMe with Open-Channel SSD support
https://github.com/OpenChannelSSD/qemu-nvme