In this presentation, we introduce liblightnvm, a user space library that manages provisioning and I/O submission for physical flash.
We argue that liblightnvm can benefit I/O-intensive applications by providing predictable latency and by reducing device write amplification, thus prolonging the device's endurance. We also show how to integrate liblightnvm with RocksDB.
3. 3
Embedded Flash Translation Layers (FTLs)
• Have enabled the wide adoption of SSDs by exposing a block I/O interface to the application
• However, they are a roadblock for I/O-intensive workloads:
  - They hardwire design decisions about data placement, over-provisioning, scheduling, garbage collection, and wear leveling -> assumptions about the application workload
  - They introduce redundancies, missed optimizations, and underutilization of resources
  - Predictable latencies cannot be guaranteed (high 99th percentiles)
  - They introduce unavoidable write amplification on the device side (GC)
[Diagram: the traditional stack. I/O-intensive applications, standard libraries, metadata management, and app-specific optimizations live in user space; the kernel provides the page cache, FS-specific logic, and the block I/O interface; the device-side FTL holds the L2P translation table and handles GC, over-provisioning, wear-levelling, scheduling, etc.]
➡ I/O-intensive applications use expensive data structures and employ host resources to align their I/O patterns to (unreachable) flash constraints (e.g., LSM trees)
4. 4
Open-Channel SSDs
[Diagram: the LightNVM stack. In the Linux kernel, LightNVM-compatible device drivers (NVMe, null-blk) feed a general media manager; kernel targets (e.g., RRPC) expose block I/O to file systems, key-value stores, object stores, and traditional apps, while liblightnvm gives I/O apps direct user-space access; a vendor full-stack FTL (CNEX) is also possible. The Open-Channel SSD itself retains metadata/state management, an ECC engine, and a XOR/RAID engine.]
• Host manages:
  - Data placement
  - Garbage collection
  - I/O scheduling
  - Over-provisioning
  - Wear-levelling
• Device information:
  - SSD offload engines & responsibilities
  - SSD geometry:
    • NAND media geometry
    • Channels, timings, etc.
    • Bad blocks
    • PPA format
[Diagram labels: common data structures (e.g., append-only, key-value) are built on decoupled interfaces (get_block/put_block) defined by the LightNVM spec; traditional interfaces remain available through a full-stack FTL, with HW offload engines underneath.]
Open-Channel SSDs share control responsibilities with the host in order to implement and maintain features that typical SSDs implement strictly in the device firmware.
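As a concrete illustration of this split, the sketch below mimics the get_block/put_block provisioning interface: the host asks a media manager for free blocks, places data itself, and returns blocks for reuse. All names are illustrative stand-ins for the LightNVM concepts, not liblightnvm's actual API.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct VBlock {
  uint64_t ppa;    // physical address of the block's first page
  uint32_t nppas;  // number of physical pages in the block
};

// Toy media manager: hands out free blocks and takes back used ones.
// A real media manager would also track bad blocks and wear-levelling.
class MediaManager {
 public:
  MediaManager(uint32_t nblocks, uint32_t pages_per_block) {
    for (uint32_t i = 0; i < nblocks; ++i)
      free_.push_back({uint64_t(i) * pages_per_block, pages_per_block});
  }
  bool get_block(VBlock *out) {        // provision one free block
    if (free_.empty()) return false;
    *out = free_.back();
    free_.pop_back();
    return true;
  }
  void put_block(const VBlock &blk) {  // return a block for erase + reuse
    free_.push_back(blk);
  }
 private:
  std::vector<VBlock> free_;
};

int main() {
  MediaManager mm(/*nblocks=*/1024, /*pages_per_block=*/256);
  VBlock blk;
  if (mm.get_block(&blk)) {
    // The host now owns data placement: it writes pages sequentially
    // inside the block and keeps its own logical-to-physical mapping.
    printf("got block: ppa=%llu, %u pages\n",
           (unsigned long long)blk.ppa, blk.nppas);
    mm.put_block(blk);  // fully invalidated -> recycled by the manager
  }
  return 0;
}
```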
5. 5
Why Open-Channel SSDs
• Optimize high-performance I/O applications for fast flash memories
  - Control data placement: use physical flash blocks directly
  - Remove device-side garbage collection (GC): reuse LSM compaction on the host
• Achieve predictable latency: no 99th-percentile outliers (measured in seconds)
• Avoid the write amplification introduced by the FTL: control device endurance with host software
• Improve the steady state of the device (and start from it)
  - Minimize over-provisioning (normal SSDs reserve 10-30% of capacity for performance)
[Plot: IOPS over time, showing the drop from fresh-device performance to steady state.]
7. 7
RocksDB: Using Open-Channel SSDs
• Sstables
  - P1: Fit flash block sizes in L0 and further levels (merges + compactions); see the sizing sketch below
    • No need for GC on the SSD side: RocksDB merging acts as GC (less write and space amplification)
  - P2: Keep block metadata to reconstruct an sstable in case of a host crash
• WAL (Write-Ahead Log) and MANIFEST
  - P3: Fit block sizes (same as for sstables)
  - P4: Keep block metadata to reconstruct the log in case of a host crash
• Other metadata
  - P5: Keep superblock metadata and allow the database to be recovered
  - P6: Keep other metadata to account for flash constraints (e.g., partial pages, bad pages, bad blocks)
• Process
  - P7: Upstream to vanilla RocksDB
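The block-fitting rule behind P1 and P3 amounts to rounding a file's target size to a whole number of flash blocks, so that merges, compactions, and log rotation never leave a partially owned block. A minimal sketch, with made-up geometry numbers:

```cpp
#include <cstdint>
#include <cstdio>

constexpr uint64_t kFlashPageSize  = 16 * 1024;  // bytes per flash page
constexpr uint64_t kPagesPerBlock  = 256;        // pages per flash block
constexpr uint64_t kFlashBlockSize = kFlashPageSize * kPagesPerBlock;

// Largest multiple of the flash block size not exceeding the wanted size
// (at least one block, since a file always owns whole blocks).
uint64_t FitToFlashBlocks(uint64_t wanted_bytes) {
  uint64_t blocks = wanted_bytes / kFlashBlockSize;
  if (blocks == 0) blocks = 1;
  return blocks * kFlashBlockSize;
}

int main() {
  uint64_t target = 64ull * 1024 * 1024;  // e.g., an L0 sstable target size
  uint64_t fitted = FitToFlashBlocks(target);
  printf("sstable target: %llu -> %llu bytes (%llu blocks)\n",
         (unsigned long long)target, (unsigned long long)fitted,
         (unsigned long long)(fitted / kFlashBlockSize));
  return 0;
}
```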
8. 8
RocksDB: Data Placement
[Diagram: the memtable arena (write_buffer_size) is carved into arena blocks (kArenaBlockSize); the optimal arena block size is ~1/10 of write_buffer_size.]
• WAL and MANIFEST are reused in future instances until replaced
  - P3: Ensure that the WAL/MANIFEST replacement size fills up most of the last block
• Sstable sizes follow a heuristic: MemTable::ShouldFlushNow()
  - P1: kArenaBlockSize = sizeof(flash block)
  - The heuristic is conservative in terms of over-allocation
    • A few lost pages are better than allocating a new block
  - The flash block size becomes a "static" DB tuning parameter that is used to optimize "dynamic" ones (see the sketch below)
[Diagram: a DFlash file laid out over Flash Block 0, Flash Block 1, …; each block spans nppas * FLASH_PAGE_SIZE bytes from offset 0 and is identified by a global id (gid: 128, 273, 481, …); the gap between EOF and the end of the last block is the file's space amplification (for the life of the file).]
➡ Optimize RocksDB bottom up (from storage backend to LSM)
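A minimal sketch of the conservative flush heuristic described above, assuming kArenaBlockSize has been set to the flash block size: flush when the next arena allocation would force provisioning a new flash block that the nearly full write buffer could not fill. This mirrors the spirit of MemTable::ShouldFlushNow(), not RocksDB's exact code; all constants are invented.

```cpp
#include <cstdint>
#include <cstdio>

constexpr uint64_t kFlashBlockSize  = 4ull * 1024 * 1024;  // assumed geometry
constexpr uint64_t kArenaBlockSize  = kFlashBlockSize;     // P1: arena block = flash block
constexpr uint64_t kWriteBufferSize = 10 * kArenaBlockSize;

bool ShouldFlushNow(uint64_t bytes_allocated, uint64_t next_alloc_bytes) {
  // Blocks owned now vs. blocks needed after the next allocation.
  uint64_t used   = (bytes_allocated + kArenaBlockSize - 1) / kArenaBlockSize;
  uint64_t after  = bytes_allocated + next_alloc_bytes;
  uint64_t needed = (after + kArenaBlockSize - 1) / kArenaBlockSize;
  // Flush when a new flash block would be provisioned near the budget
  // limit: a few lost tail pages beat owning a mostly empty block.
  bool needs_new_block = needed > used;
  bool near_budget     = after > kWriteBufferSize - kArenaBlockSize / 10;
  return needs_new_block && near_budget;
}

int main() {
  // ~39.8 MiB allocated; a 512 KiB allocation would cross into an 11th
  // block with almost no budget left -> flush instead of over-allocating.
  printf("flush? %d\n", ShouldFlushNow(39ull * 1024 * 1024 + 800 * 1024,
                                       512 * 1024));
  return 0;
}
```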
9. 9
RocksDB: Crash Recovery
[Diagram: layout of a flash block. The first valid page holds block-starting metadata, intermediate pages hold RocksDB data, and the last page holds block-ending metadata; each page's out-of-bound (OOB) area carries additional state.]
- Blocks can be checked for integrity
- A new DB instance can append; padding is maintained in the OOB area (P6)
- Closing a block updates bad-page and bad-block information (P6)
[Diagram: each metadata record consists of an OPCODE followed by encoded metadata. Metadata types: Log, Current, Metadata, Sstable. Private (Env) metadata is enough to recover the database in a new instance; private (DFlash) metadata records the vblocks forming the DFlash file. On the Open-Channel SSD, the block manager (BM) tags each block with an ownerID and returns the owner's block list on recovery (steps 1-3 below).]
- A “file” can be reconstructed from individual blocks (P2, P4, P5, P6):
  1. Metadata for the blocks forming a file is stored in the MANIFEST
  2. On recovery, LightNVM provides an application with all of its valid blocks
  3. Each block stores enough metadata to reconstruct a file
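The sketch below shows one way the per-block metadata could drive this recovery flow: starting metadata identifies the owning file and the block's position in it, ending metadata accounts for valid data and bad pages, and recovery groups and orders the returned blocks per file. The record layout is invented for clarity, not the prototype's actual format.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

enum class FileType : uint8_t { kLog, kCurrent, kMetadata, kSstable };

struct BlockStartMeta {      // stored in the block's first valid page
  uint64_t file_id;          // which "file" this block belongs to
  uint32_t seq_in_file;      // position of the block inside the file
  FileType type;
};

struct BlockEndMeta {        // stored in the last page when the block closes
  uint64_t valid_bytes;      // bytes of RocksDB data in this block
  uint32_t bad_page_bitmap;  // P6: bad/partial page accounting
};

struct RecoveredBlock {
  BlockStartMeta start;
  BlockEndMeta end;
  uint64_t ppa;              // where the block lives on the device
};

// Steps 2 and 3: given all valid blocks returned for this application,
// group them by file and order each file's blocks by sequence number.
std::map<uint64_t, std::vector<RecoveredBlock>>
ReconstructFiles(std::vector<RecoveredBlock> blocks) {
  std::map<uint64_t, std::vector<RecoveredBlock>> files;
  for (const auto &b : blocks) files[b.start.file_id].push_back(b);
  for (auto &kv : files)
    std::sort(kv.second.begin(), kv.second.end(),
              [](const RecoveredBlock &a, const RecoveredBlock &b) {
                return a.start.seq_in_file < b.start.seq_in_file;
              });
  return files;
}

int main() {
  std::vector<RecoveredBlock> blocks;
  blocks.push_back({{42, 1, FileType::kSstable}, {4096, 0}, 512});
  blocks.push_back({{42, 0, FileType::kSstable}, {8192, 0}, 256});
  auto files = ReconstructFiles(blocks);
  printf("file 42 rebuilt from %zu blocks\n", files[42].size());
  return 0;
}
```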
11. 11
Architecture: RocksDB + liblightnvm
• The LSM is the FTL
  - Tree nodes (files) control data placement on physical flash
  - Sstables, the WAL, and the MANIFEST live on the Open-Channel SSD; the rest stays in the FS
  - Garbage collection takes place during LSM compaction
• LightNVM manages the flash
  - liblightnvm takes care of flash block provisioning; Env DFlash works directly with PPAs (see the sketch below)
  - The media manager handles wear-levelling (get/put block)
[Diagram: RocksDB's LSM over two Env backends. Env DFlash implements DFWritableFile, DFSequentialFile, and DFRandomAccessFile on top of an append API; a DFlash file is a set of blocks whose placement is controlled by the LSM, and sstables, the WAL, and the MANIFEST take the fast I/O path through liblightnvm and the kernel media manager (which tracks free, used, and bad blocks) down to the Open-Channel SSD. Env Posix implements PxWritableFile, PxSequentialFile, and PxRandomAccessFile, keeping CURRENT, LOCK, IDENTITY, and LOG (info) on a file system on a traditional SSD. Device features feed the Env options, optimizations, and init code.]
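A minimal sketch of how such a DFlash file can sit behind RocksDB's Env abstraction: Append() streams data into the pages of provisioned flash blocks, grabbing a new block when the current one fills up. Only the rocksdb::WritableFile interface is RocksDB's; the FlashBlock type and the provisioning/write hooks are illustrative placeholders, not liblightnvm's actual API.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

#include <rocksdb/env.h>

struct FlashBlock { uint64_t ppa; uint64_t capacity; uint64_t written; };

// Placeholder hooks where the real backend would call into liblightnvm.
static FlashBlock ProvisionBlock() {                      // cf. get_block
  static uint64_t next_ppa = 0;
  return {next_ppa++ * 256, 256 * 16384, 0};
}
static rocksdb::Status WritePages(const FlashBlock &, const char *, size_t) {
  return rocksdb::Status::OK();                           // PPA-based write
}

class DFWritableFile : public rocksdb::WritableFile {
 public:
  rocksdb::Status Append(const rocksdb::Slice &data) override {
    size_t off = 0;
    while (off < data.size()) {
      if (blocks_.empty() ||
          blocks_.back().written == blocks_.back().capacity)
        blocks_.push_back(ProvisionBlock());  // LSM-controlled placement
      FlashBlock &blk = blocks_.back();
      size_t n = std::min<uint64_t>(data.size() - off,
                                    blk.capacity - blk.written);
      rocksdb::Status s = WritePages(blk, data.data() + off, n);
      if (!s.ok()) return s;
      blk.written += n;
      off += n;
    }
    return rocksdb::Status::OK();
  }
  // Closing would persist block-ending metadata; syncing would flush any
  // partial page plus its OOB state.
  rocksdb::Status Close() override { return rocksdb::Status::OK(); }
  rocksdb::Status Flush() override { return rocksdb::Status::OK(); }
  rocksdb::Status Sync() override { return rocksdb::Status::OK(); }

 private:
  std::vector<FlashBlock> blocks_;  // the vblocks forming this DFlash file
};
```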
12. 12
Architecture: Optimizing RocksDB
[Diagram: RocksDB (an I/O app) with its active memtable, read-only memtable, L0 sstables, and merging & compaction writes the WAL, MANIFEST, and sst1, sst2, …, sstn through the DFlash storage backend and liblightnvm, spreading read and write paths across LUN0…LUNm; traditional apps go through a file system and a kernel target (e.g., RRPC), over the general media manager and the LightNVM-compatible device drivers (NVMe, null-blk), down to the Open-Channel SSD with its metadata/state management, ECC engine, and XOR/RAID engine.]
• RocksDB is in full control:
  - Do not mix reads and writes
  - A different VLUN per I/O path
  - Different VLUN types
  - Enabling I/O scheduling
  - A block pool in DFlash (prefetching); see the sketch below
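A minimal sketch of the per-path block pool idea, under the assumption that each I/O path (e.g., WAL writes vs. compaction writes) draws from its own virtual LUN so reads and writes never mix on a parallel unit, and that a small pool of pre-provisioned blocks hides provisioning latency. Names and watermarks are illustrative.

```cpp
#include <cstdint>
#include <deque>

struct VBlock { uint64_t ppa; uint32_t lun; };

// Placeholder for the provisioning call into the media manager.
static VBlock GetBlockFromLun(uint32_t lun) {
  static uint64_t next_ppa = 0;
  return {next_ppa++, lun};
}

class BlockPool {
 public:
  BlockPool(uint32_t lun, size_t low_watermark)
      : lun_(lun), low_watermark_(low_watermark) { Refill(); }

  VBlock Acquire() {                // fast path: block already prefetched
    VBlock b = pool_.front();
    pool_.pop_front();
    if (pool_.size() < low_watermark_) Refill();  // prefetch ahead of use
    return b;
  }

 private:
  void Refill() {
    while (pool_.size() < 2 * low_watermark_)
      pool_.push_back(GetBlockFromLun(lun_));
  }
  uint32_t lun_;
  size_t low_watermark_;
  std::deque<VBlock> pool_;
};

int main() {
  BlockPool wal_pool(/*lun=*/0, /*low_watermark=*/4);  // WAL write path
  BlockPool sst_pool(/*lun=*/1, /*low_watermark=*/8);  // compaction path
  VBlock b = wal_pool.Acquire();                       // no get_block stall
  (void)b; (void)sst_pool;
  return 0;
}
```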
15. 15
Status & Ongoing work
• Status:
  - LightNVM has been accepted into the Linux kernel
  - Working prototype of RocksDB with the DFlash storage backend on LightNVM
  - Implemented liblightnvm append-only support + I/O interface (first iteration)
• Ongoing:
  - Performance testing and tuning on the Westlake ASIC
  - Move the RocksDB DFlash storage backend logic into liblightnvm
  - Improve I/O submission
    • Submit I/Os directly to the media manager
    • Support libaio to enable async I/O through the traditional stack
  - Exploit device parallelism within RocksDB's LSM
16. 16
Tools & Libraries
lnvm – Open-Channel SSD administration
https://github.com/OpenChannelSSD/lightnvm-adm
liblightnvm – LightNVM application support
https://github.com/OpenChannelSSD/liblightnvm
Test tools
https://github.com/OpenChannelSSD/lightnvm-hw
QEMU NVMe with Open-Channel SSD support
https://github.com/OpenChannelSSD/qemu-nvme