1
RocksDB on Open-Channel SSDs
Javier González <javier@cnexlabs.com>
RocksDB Annual Meetup'15 - Facebook
2
Solid State Drives (SSDs)
High throughput + Low latency
Parallelism + Controller
3
Embedded Flash Translation Layers (FTLs)
• Have enabled wide adoption by exposing a block I/O interface to the application
• However, they are a roadblock for I/O-intensive workloads
- Hardwire design decisions about data placement, over-provisioning, scheduling, garbage collection, and wear leveling -> assumptions about the application workload
- Introduce redundancies, missed optimizations, and underutilization of resources
- Cannot guarantee predictable latencies (99th percentiles)
- Introduce unavoidable write amplification on the device side (GC)
[Figure: traditional I/O stack. User space: I/O-intensive applications with app-specific optimizations, metadata management, and standard libraries. Kernel space: page cache, FS-specific logic, and block I/O interface. Device: FTL with L2P translation table, GC, over-provisioning, wear-levelling, scheduling, etc.]
➡ I/O applications use expensive data structures (e.g., LSM trees) and spend host resources to align their I/O patterns to flash constraints that they cannot reach
4
Open-Channel SSDs
[Figure: LightNVM stack. User space: traditional apps and I/O apps (file system, key-value, object store, ...), liblightnvm. Linux kernel: targets (e.g., RRPC), general media manager (framework, management), LightNVM-compatible device drivers (NVMe, null-blk). Open-Channel SSDs: CNEX FTL, metadata state management, XOR/RAID engine, ECC engine.]
• Host Manages
- Data Placement
- Garbage Collection
- I/O Scheduling
- Over-provisioning
- Wear-levelling
• Device Information
- SSD offload engines & responsibilities
- SSD geometry:
• NAND media geometry
• Channels, timings, etc.
• Bad blocks
• PPA Format
[Figure labels: common data structures (e.g., append-only, key-value); decoupled interfaces (get_block/put_block); traditional interfaces; full stack FTL; HW offload engines; LightNVM spec.]
Open-Channel SSDs share control responsibilities with the host in order to implement and maintain features that typical SSDs implement strictly in the device firmware.
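The PPA (physical page address) format that the device reports can be pictured as a bit-packed address built from the SSD geometry. The sketch below is only an illustration: the field widths, struct ppa_fmt, struct ppa, ppa_pack() and ppa_unpack() are hypothetical names and do not come from the LightNVM specification.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry description: bits used by each address field.
 * A real device reports its own layout; these names are illustrative. */
struct ppa_fmt {
    uint8_t ch_bits;   /* channel */
    uint8_t lun_bits;  /* LUN within a channel */
    uint8_t blk_bits;  /* block within a LUN */
    uint8_t pg_bits;   /* page within a block */
};

struct ppa {
    uint32_t ch, lun, blk, pg;
};

/* Pack the fields as [ch | lun | blk | pg], page in the lowest bits.
 * The caller must keep every field within its declared width. */
static uint64_t ppa_pack(const struct ppa_fmt *f, struct ppa a)
{
    assert(f->ch_bits + f->lun_bits + f->blk_bits + f->pg_bits <= 64);
    uint64_t v = a.ch;
    v = (v << f->lun_bits) | a.lun;
    v = (v << f->blk_bits) | a.blk;
    v = (v << f->pg_bits) | a.pg;
    return v;
}

/* Reverse of ppa_pack(): peel the fields off from the lowest bits up. */
static struct ppa ppa_unpack(const struct ppa_fmt *f, uint64_t v)
{
    struct ppa a;
    a.pg = (uint32_t)(v & ((1ULL << f->pg_bits) - 1));
    v >>= f->pg_bits;
    a.blk = (uint32_t)(v & ((1ULL << f->blk_bits) - 1));
    v >>= f->blk_bits;
    a.lun = (uint32_t)(v & ((1ULL << f->lun_bits) - 1));
    v >>= f->lun_bits;
    a.ch = (uint32_t)(v & ((1ULL << f->ch_bits) - 1));
    return a;
}
```

With such a layout the host can address a specific channel and LUN directly, which is what makes host-side data placement and I/O scheduling possible.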
5
Why Open-Channel SSDs
• Optimize high-performance I/O applications for fast Flash memories
- Control data placement - use physical flash blocks directly
- Remove device garbage collection (GC) - reuse LSM compaction on the host
• Achieve predictable latency - avoid 99th-percentile latencies measured in seconds
• Avoid write amplification introduced by the FTL - control device endurance with host software
• Improve the steady state of the device (and start from it)
- Minimize over-provisioning (normal SSDs employ 10-30% over-provisioning for performance)
[Figure: IOPS over time]
6
RocksDB: Using Open-Channel SSDs
[Figure: application on top of an Open-Channel SSD media manager. Provisioning interface: get_block() / put_block() fill a provisioning buffer (Block0, Block1, … BlockN) used by the application logic. Block device: normal I/O, addressed at blockN->bppa * PAGE_SIZE.]
struct nvm_tgt_type M_dflash = {
    […]
    .make_rq = df_make_rq,
    .end_io = df_end_io,
    […]
};
[Figure: user space / kernel space split. I/O engines: sync, psync, libaio, posixaio, … Physical flash layout (CH0, CH1, …; Lun0 … LunN): NAND-specific, managed by the controller. Managed flash (vblocks of type 1, type 2, type 3): exploits parallelism, serves application needs. FTL responsibilities (liblightnvm): data placement, I/O scheduling, over-provisioning, garbage collection, wear-leveling.]
struct vblock {
    uint64_t id;
    uint64_t owner_id;
    uint64_t nppas;
    uint64_t ppa_bitmap;
    sector_t bppa;
    uint32_t vlun_id;
    uint8_t flags;
};
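The expression blockN->bppa * PAGE_SIZE on this slide suggests how normal I/O on the block device can be addressed once a vblock has been provisioned. A minimal sketch, assuming that bppa is the first page address of the block and nppas its page count; the 4096-byte page size, the user-space sector_t typedef, and the helper vblock_page_offset() are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;        /* assumption: kernel type redefined for user space */
#define FLASH_PAGE_SIZE 4096ULL   /* assumption: illustrative page size */

struct vblock {
    uint64_t id;
    uint64_t owner_id;
    uint64_t nppas;       /* assumed meaning: number of pages in the block */
    uint64_t ppa_bitmap;
    sector_t bppa;        /* assumed meaning: first page address of the block */
    uint32_t vlun_id;
    uint8_t  flags;
};

/* Byte offset of page `page` inside `vb`, or -1 if the page is out of range. */
static int64_t vblock_page_offset(const struct vblock *vb, uint64_t page)
{
    assert(vb != NULL);
    if (page >= vb->nppas)
        return -1;
    return (int64_t)((vb->bppa + page) * FLASH_PAGE_SIZE);
}
```

The returned offset could then be handed to any of the usual I/O engines (sync, psync, libaio, posixaio) as the file offset on the block device.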
7
RocksDB: Using Open-Channel SSDs
• Sstables
- P1: Fit block sizes in L0 and further levels (merges + compactions)
• No need for GC on the SSD side - RocksDB merging acts as GC (less write and space amplification)
- P2: Keep block metadata to reconstruct an sstable in case of host crash
• WAL (Write-Ahead Log) and MANIFEST
- P3: Fit block sizes (same as in sstables)
- P4: Keep block metadata to reconstruct the log in case of host crash
• Other Metadata
- P5: Keep superblock metadata so that the database can be recovered
- P6: Keep other metadata to account for flash constraints (e.g., partial pages, bad pages, bad blocks)
• Process
- P7: Upstream to vanilla RocksDB
8
RocksDB: Data Placement
Arena block (kArenaSize)
(Optimal size ~ 1/10 of write_buffer_size)
write_buffer_size
• WAL and MANIFEST are reused in future instances until replaced
- P3: Ensure that WAL and MANIFEST replace size fills up most of last block
• Sstable sizes follow a heuristic - MemTable::ShouldFlushNow()
- P1: kArenaBlockSize = sizeof(block)
- Conservative heuristic in terms of overallocation
• Few lost pages better than allocating a new block
- Flash block size becomes a “static” DB tuning parameter that is used to optimize “dynamic” ones
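The trade-off behind P1 (a few lost pages versus allocating a new flash block) comes down to simple arithmetic on the flash block size, nppas * FLASH_PAGE_SIZE. The sketch below is an illustration only: blocks_needed(), lost_pages() and flush_before_new_block() are hypothetical helpers, and the flush rule is not the logic of MemTable::ShouldFlushNow().

```c
#include <assert.h>
#include <stdint.h>

/* Flash blocks needed to hold `file_bytes`, rounding up. */
static uint64_t blocks_needed(uint64_t file_bytes, uint64_t block_bytes)
{
    assert(block_bytes > 0);
    return (file_bytes + block_bytes - 1) / block_bytes;
}

/* Whole pages left unused at the tail of the last block. */
static uint64_t lost_pages(uint64_t file_bytes, uint64_t nppas,
                           uint64_t page_bytes)
{
    uint64_t block_bytes = nppas * page_bytes;
    uint64_t allocated = blocks_needed(file_bytes, block_bytes) * block_bytes;
    return (allocated - file_bytes) / page_bytes;
}

/* Hypothetical rule: flush when one more arena block would no longer
 * fit in the flash block that the memtable is currently filling. */
static int flush_before_new_block(uint64_t mem_bytes, uint64_t arena_bytes,
                                  uint64_t block_bytes)
{
    assert(block_bytes > 0);
    if (mem_bytes == 0)
        return 0;
    uint64_t used = mem_bytes % block_bytes;
    if (used == 0)
        return 1;                      /* last block is exactly full */
    return block_bytes - used < arena_bytes;
}
```

For example, with 256 pages of 4096 bytes per block (1 MiB), a file that is 8192 bytes short of 3 MiB occupies three blocks and loses two pages.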
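[Figure: a file laid out from offset 0 across Flash Block 0, Flash Block 1, … (gid: 128, gid: 273, gid: 481), each block holding nppas * FLASH_PAGE_SIZE bytes. The space between EOF and the end of the last block is space amplification for the life of the file.]
➡ Optimize RocksDB bottom up (from storage backend to LSM)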
9
RocksDB: Crash Recovery
[Figure: flash block layout. First valid page: block starting metadata. Intermediate pages: RocksDB data. Last page: block ending metadata. Every page has an out-of-bound (OOB) area.]
- Blocks can be checked for integrity
- New DB instance can append;
padding is maintained in OOB (P6)
- Closing a block updates bad page
& bad block information (P6)
[Figure: recovery flow. The MANIFEST holds a sequence of OPCODE / ENCODED METADATA entries. Metadata types: Log, Current, Metadata, Sstable. Private (Env): enough metadata to recover the database in a new instance. Private (DFlash): vblocks forming the DFlash file. On the Open-Channel SSD, the BM returns the list of blocks tagged with the ownerID. Steps 1, 2, and 3 are described below.]
- A “file” can be reconstructed from individual blocks (P2, P4, P5, P6)
- 1. Metadata for the blocks forming a file is stored in MANIFEST
- 2. On recovery, LightNVM provides an application with all its valid blocks
- 3. Each block stores enough metadata to reconstruct a file
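The three recovery steps above can be sketched as a filter-and-sort over the blocks handed back to the application. This is an illustration under stated assumptions: struct blk_meta, its fields (file_id, seq) and rebuild_file() are hypothetical; the slides only state that each block carries an owner ID and enough metadata to rebuild the file.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-block metadata, as it might be read back from the
 * starting-metadata page of each block. */
struct blk_meta {
    uint64_t block_id;
    uint64_t owner_id;   /* application that owns the block */
    uint64_t file_id;    /* file the block belongs to (assumed field) */
    uint32_t seq;        /* position of the block in the file (assumed field) */
};

static int by_seq(const void *a, const void *b)
{
    const struct blk_meta *x = a, *y = b;
    return (x->seq > y->seq) - (x->seq < y->seq);
}

/* Copy into `out` the blocks of `file_id` owned by `owner_id`, ordered by
 * their position in the file. `out` must have room for `n` entries.
 * Returns the number of blocks that form the file. */
static size_t rebuild_file(const struct blk_meta *all, size_t n,
                           uint64_t owner_id, uint64_t file_id,
                           struct blk_meta *out)
{
    size_t k = 0;
    assert(all != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        if (all[i].owner_id == owner_id && all[i].file_id == file_id)
            out[k++] = all[i];
    qsort(out, k, sizeof(*out), by_seq);
    return k;
}
```

The block list recorded in the MANIFEST (step 1) could then be compared against this result to detect missing or unexpected blocks.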
10
Work Upstream
11
Architecture: RocksDB + liblightnvm
• LSM is the FTL
- Tree nodes (files) control data placement on physical flash
- Sstables, WAL, and MANIFEST on Open-Channel SSD - rest in FS
- Garbage collection takes place during LSM compaction
• LightNVM manages flash
- liblightnvm takes care of flash block provisioning; Env DFlash works directly with PPAs
- Media manager handles wear-levelling (get/put block)
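The split "Sstables, WAL, and MANIFEST on Open-Channel SSD - rest in FS" amounts to routing each file to one of two storage backends. A minimal sketch, assuming RocksDB's usual file names (*.sst for sstables, *.log for the WAL, MANIFEST-* for the manifest); enum backend, has_suffix() and route_file() are hypothetical names, not part of RocksDB or liblightnvm.

```c
#include <assert.h>
#include <string.h>

enum backend { BACKEND_DFLASH, BACKEND_POSIX };

/* Return 1 if `s` ends with `suffix`, 0 otherwise. */
static int has_suffix(const char *s, const char *suffix)
{
    size_t n = strlen(s), m = strlen(suffix);
    return n >= m && strcmp(s + n - m, suffix) == 0;
}

/* Route by file name: sstables, WAL, and MANIFEST go to the DFlash
 * backend; everything else (CURRENT, LOCK, IDENTITY, LOG, ...) stays
 * in the file system. The name patterns are an assumption. */
static enum backend route_file(const char *name)
{
    assert(name != NULL);
    if (has_suffix(name, ".sst") || has_suffix(name, ".log") ||
        strncmp(name, "MANIFEST-", 9) == 0)
        return BACKEND_DFLASH;
    return BACKEND_POSIX;
}
```

Keeping the small, rarely written files in the file system leaves the flash blocks to the append-only files whose placement the LSM controls.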
[Figure: RocksDB LSM on two storage backends. Env DFlash (DFWritableFile, DFSequentialFile, DFRandomAccessFile) stores sst, WAL, and MANIFEST as DFlash files (blocks) through the liblightnvm append API (append-only FS, I/O device) and the kernel media manager on the Open-Channel SSD (free blocks, used blocks, bad blocks, fast I/O path); data placement is controlled by the LSM. Env Posix (PxWritableFile, PxSequentialFile, PxRandomAccessFile) stores CURRENT, LOCK, IDENTITY, and LOG (Info) through the file system on a traditional SSD. Device features feed Env options and optimizations via init code.]
12
Architecture: Optimizing RocksDB
[Figure: RocksDB (I/O apps) in user space, with the active memtable, read-only memtable, L0 sstables, and merging & compaction issuing separate read and write streams through the DFlash storage backend and liblightnvm to WAL, MANIFEST, and sst1, sst2, … sstn, spread over LUN0 … LUNm. Linux kernel: targets (e.g., RRPC), file system for traditional apps, general media manager, LightNVM-compatible device drivers (NVMe, null-blk). Open-Channel SSDs: metadata state management, XOR/RAID engine, ECC engine.]
• RocksDB is in full control:
- Do not mix R/W
- Different VLUN per IO path
- Different VLUN types
- Enabling I/O scheduling
- Block pool in DFlash (prefetching)
[Figure: Read/Write Performance (MB/s). X axis: LUNs active (1 to 63). Y axis: MB/s (0 to 1600). Series: Write KB/s, Read KB/s.]
13
Evaluation: CNEX WestLake FPGA SDK overview
FPGA Prototype Platform before ASIC:
- PCIe G3x4 or PCIe G2x8
- 4x10GE NVMoE
- 40 bit DDR3
- 16 CH NAND
* Not real performance results - FPGA prototype, not ASIC
14
Evaluation: RocksDB on CNEX WestLake FPGA SDK
RocksDB (make release) | Entry keys (4 threads) | Writes (1 LUN) | Reads (1 LUN) | Writes (2 LUNs) | Reads (2 LUNs) | Writes (64 LUNs) | Reads (64 LUNs)
RocksDB DFLASH | 10000 keys | 23 MB/s | 40 MB/s | 35 MB/s | X | X | X
RocksDB DFLASH | 100000 keys | 26 MB/s | 40 MB/s | 33 MB/s | X | X | X
RocksDB DFLASH | 1000000 keys | 25 MB/s | 40 MB/s | 32 MB/s | X | X | X
Raw DFLASH (with fio) | - | 32 MB/s | 64 MB/s | 64 MB/s | 128 MB/s | 920 MB/s | 1.3 GB/s
15
Status & Ongoing work
• Status:
- LightNVM has been accepted in the Linux Kernel
- Working prototype of RocksDB with DFlash storage backend on LightNVM
- Implemented liblightnvm append-only support + IO interface (first iteration)
• Ongoing:
- Performance testing and tuning on WestLake ASIC
- Move RocksDB DFlash storage backend logic to liblightnvm
- Improve I/O submission
• Submit I/Os directly to the media manager
• Support libaio to enable async I/O through the traditional stack
- Exploit device parallelism within RocksDB’s LSM
16
Tools & Libraries
lnvm – Open-Channel SSDs administration
https://github.com/OpenChannelSSD/lightnvm-adm
liblightnvm – LightNVM application support
https://github.com/OpenChannelSSD/liblightnvm
Test tools
https://github.com/OpenChannelSSD/lightnvm-hw
QEMU NVMe with Open-Channel SSD Support
https://github.com/OpenChannelSSD/qemu-nvme
17
RocksDB on Open-Channel SSDs
Javier González <javier@cnexlabs.com>
RocksDB Annual Meetup'15 - Facebook
Open-Channel SSD Project: https://github.com/OpenChannelSSD
LightNVM: https://github.com/OpenChannelSSD/linux
liblightnvm: https://github.com/OpenChannelSSD/liblightnvm
RocksDB: https://github.com/OpenChannelSSD/rocksdb
Interface Specification: http://goo.gl/BYTjLI
Documentation: http://openchannelssd.readthedocs.org

The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 

Recently uploaded (20)

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 

RocksDB meetup

  • 2. Solid State Drives (SSDs)
      - High throughput + low latency
      - Parallelism + controller
  • 3. Embedded Flash Translation Layers (FTLs)
      • Have enabled wide adoption by exposing a block I/O interface to the application
      • However, they are a roadblock for I/O-intensive workloads
        - Hardwire design decisions about data placement, over-provisioning, scheduling, garbage collection, and wear leveling -> assumptions on application workload
        - Introduce redundancies, missed optimizations, and underutilization of resources
        - Predictable latencies cannot be guaranteed (99th percentiles)
        - Introduce unavoidable write amplification on the device side (GC)
      ➡ I/O applications use expensive data structures and employ host resources to align their I/O patterns to (unreachable) flash constraints (e.g., LSM tree)
      [Diagram labels: I/O Intensive Applications (App-specific opt., Metadata Mgmt., Standard Libraries); User Space / Kernel Space; Page cache; FS-specific logic; Block I/O interface; Device FTL (L2P translation table; GC, over-provisioning, wear-levelling, sched., etc.)]
  • 4. Open-Channel SSDs
      • Host manages
        - Data placement
        - Garbage collection
        - I/O scheduling
        - Over-provisioning
        - Wear-levelling
      • Device information
        - SSD offload engines & responsibilities
        - SSD geometry: NAND media geometry; channels, timings, etc.; bad blocks; PPA format
      Open-Channel SSDs share control responsibilities with the host in order to implement and maintain features that typical SSDs implement strictly in the device firmware
      [Diagram labels: User-space (I/O Apps, Apps, liblightnvm; File-System, Key-Value, Object-Store, ...); Linux Kernel (Target (e.g. RRPC), CNEX FTL, General Media Manager, Framework Management, LightNVM Compatible Device Drivers (NVMe, null-blk)); Open-Channel SSDs (Metadata State Mgmt., XOR/RAID Engine, ECC Engine); Common Data Structures (e.g., append-only, key-value); Decoupled Interfaces (get_block/put_block); LightNVM spec.; Traditional Interfaces; Full stack FTL; HW offload engines]
  • 5. Why Open-Channel SSDs
      • Optimize high-performance I/O applications for fast flash memories
        - Control data placement: use physical flash blocks directly
        - Remove device garbage collection (GC): reuse LSM compaction on the host
      • Achieve predictable latency: no 99th percentiles (measured in seconds)
      • Avoid write amplification introduced by the FTL: control device endurance with host software
      • Improve the steady state of the device (and start from it)
        - Minimize over-provisioning (normal SSDs employ 10-30% over-provisioning for performance)
      [Figure: IOPS over time]
  • 6. RocksDB: Using Open-Channel SSDs
      • Physical flash layout (CH0, CH1, …; Lun0 … LunN)
        - NAND-specific
        - Managed by controller
      • Managed flash (vblocks of type 1, type 2, type 3)
        - Exploit parallelism
        - Serve application needs
      • Host responsibilities: data placement, I/O scheduling, over-provisioning, garbage collection, wear-leveling
      • Normal I/O to a block starts at blockN->bppa * PAGE_SIZE
      • Kernel target registration:
          struct nvm_tgt_type M_dflash = {
              […]
              .make_rq = df_make_rq,
              .end_io = df_end_io,
              […]
          };
      • Block descriptor exposed through liblightnvm:
          struct vblock {
              uint64_t id;
              uint64_t owner_id;
              uint64_t nppas;
              uint64_t ppa_bitmap;
              sector_t bppa;
              uint32_t vlun_id;
              uint8_t flags;
          };
      [Diagram labels: USER SPACE (Application, Application Logic, Provisioning buffer with Block0, Block1, … BlockN, liblightnvm; I/O engines sync, psync, libaio, posixaio, …); KERNEL SPACE (Provisioning interface, Block device, FTL, Media Manager, get_block() / put_block()); Open-Channel SSD]
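
The "Normal I/O" label on this slide implies that an application reaches a vblock through the block device at the byte offset `bppa * PAGE_SIZE`. The following is a minimal sketch of that arithmetic, not the liblightnvm implementation. It assumes a 4 KiB page (`FLASH_PAGE_SIZE` is an illustrative constant), replaces the kernel type `sector_t` with `uint64_t`, and assumes that the pages of a block are addressed contiguously from `bppa`; the helper name `vblock_page_offset` is hypothetical.

```c
#include <stdint.h>

/* Assumed page size for illustration; a real device reports its geometry. */
#define FLASH_PAGE_SIZE 4096u

/* Trimmed copy of the slide's struct vblock; sector_t is replaced by
 * uint64_t so that the sketch compiles in user space. */
struct vblock {
    uint64_t id;
    uint64_t owner_id;
    uint64_t nppas;   /* number of physical pages in the block */
    uint64_t bppa;    /* first physical page address of the block */
};

/* Byte offset of page `page` of the block on the block device.
 * Returns UINT64_MAX when the page index lies outside the block.
 * Assumption: pages of one block are contiguous starting at bppa. */
static uint64_t vblock_page_offset(const struct vblock *vb, uint64_t page)
{
    if (page >= vb->nppas)
        return UINT64_MAX;
    return (vb->bppa + page) * (uint64_t)FLASH_PAGE_SIZE;
}
```

With `page == 0` the helper reduces to the slide's expression; the bounds check keeps a caller from writing past the block it was handed by the provisioning interface.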
  • 7. RocksDB: Using Open-Channel SSDs
      • Sstables
        - P1: Fit block sizes in L0 and further levels (merges + compactions)
          • No need for GC on the SSD side: RocksDB merging acts as GC (less write and space amplification)
        - P2: Keep block metadata to reconstruct an sstable in case of host crash
      • WAL (Write-Ahead Log) and MANIFEST
        - P3: Fit block sizes (same as in sstables)
        - P4: Keep block metadata to reconstruct the log in case of host crash
      • Other metadata
        - P5: Keep superblock metadata so that the database can be recovered
        - P6: Keep other metadata to account for flash constraints (e.g., partial pages, bad pages, bad blocks)
      • Process
        - P7: Upstream to vanilla RocksDB
  • 8. RocksDB: Data Placement
      • WAL and MANIFEST are reused in future instances until replaced
        - P3: Ensure that the WAL and MANIFEST replace size fills up most of the last block
      • Sstable sizes follow a heuristic: MemTable::ShouldFlushNow() (P1)
        - kArenaBlockSize = sizeof(block)
        - Conservative heuristic in terms of overallocation
          • A few lost pages are better than allocating a new block
        - Flash block size becomes a "static" DB tuning parameter that is used to optimize "dynamic" ones
      ➡ Optimize RocksDB bottom up (from storage backend to LSM)
      [Diagram labels: write_buffer_size divided into arena blocks (kArenaSize; optimal size ~ 1/10 of write_buffer_size); file laid out from offset 0 to EOF over Flash Block 0, Flash Block 1, … (gid: 128, gid: 273, gid: 481), each block nppas * FLASH_PAGE_SIZE bytes; space amplification (life of file) after EOF]
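
The trade-off on this slide (a few lost pages versus allocating another block) can be made concrete with a little arithmetic. The sketch below assumes a flash block holds `nppas * FLASH_PAGE_SIZE` bytes, as labelled in the diagram; the function names are hypothetical and the code is an illustration rather than RocksDB's flush heuristic.

```c
#include <stdint.h>

/* Number of flash blocks needed to store file_bytes (ceiling division).
 * Returns 0 for an empty file or an invalid block size. */
static uint64_t blocks_needed(uint64_t file_bytes, uint64_t block_bytes)
{
    if (block_bytes == 0 || file_bytes == 0)
        return 0;
    return (file_bytes + block_bytes - 1) / block_bytes;
}

/* Bytes left unused between EOF and the end of the last flash block.
 * This is the space amplification paid for the lifetime of the file. */
static uint64_t lost_bytes(uint64_t file_bytes, uint64_t block_bytes)
{
    return blocks_needed(file_bytes, block_bytes) * block_bytes - file_bytes;
}
```

For example, with 256 pages of 4 KiB per block (1 MiB blocks, an assumed geometry), a 2.5 MiB sstable occupies three blocks and leaves 512 KiB unused, which is why the slide treats the flash block size as a tuning input for file sizes.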
  • 9. RocksDB: Crash Recovery
      - Blocks can be checked for integrity
      - A new DB instance can append; padding is maintained in OOB (P6)
      - Closing a block updates bad page & bad block information (P6)
      - A "file" can be reconstructed from individual blocks (P2, P4, P5, P6)
        1. Metadata for the blocks forming a file is stored in MANIFEST
        2. On recovery, LightNVM provides an application with all its valid blocks
        3. Each block stores enough metadata to reconstruct a file
      [Diagram labels, flash block: First Valid Page, Intermediate Page, Last Page; Block Starting Metadata, Block Ending Metadata, RocksDB data, Out of Bound]
      [Diagram labels, MANIFEST: records of OPCODE + ENCODED METADATA; metadata types Log, Current, Metadata, Sstable; Private (Env): enough metadata to recover the database in a new instance; Private (DFlash): vblocks forming the DFlash File]
      [Diagram labels, recovery: Open-Channel SSD, BM, Block (ownerID) …, Block List, steps 1 2 3]
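
The three recovery steps above can be sketched as follows. This is an illustration under stated assumptions, not the DFlash recovery code: `struct block_meta`, its `seq` field (the position of a block within its file), and `recover_file` are hypothetical names, and the block list stands in for whatever LightNVM hands back on recovery.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-block metadata, as it might be read back from a block. */
struct block_meta {
    uint64_t block_id;
    uint64_t owner_id;   /* file that the block belongs to */
    uint32_t seq;        /* assumed field: position of the block in the file */
};

/* qsort comparator: ascending by position within the file. */
static int cmp_seq(const void *a, const void *b)
{
    const struct block_meta *x = a, *y = b;
    return (x->seq > y->seq) - (x->seq < y->seq);
}

/* Collects the blocks owned by owner_id into out (capacity >= n), orders
 * them by seq, and checks that positions 0..count-1 are all present.
 * Returns the number of blocks, or -1 if a block is missing or duplicated. */
static int recover_file(const struct block_meta *all, size_t n,
                        uint64_t owner_id, struct block_meta *out)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (all[i].owner_id == owner_id)
            out[count++] = all[i];

    qsort(out, count, sizeof(*out), cmp_seq);

    for (size_t i = 0; i < count; i++)
        if (out[i].seq != i)
            return -1;
    return (int)count;
}
```

The gap check is the simplest form of the integrity check mentioned on the slide; a real implementation would also validate the per-block starting and ending metadata.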
  • 11. Architecture: RocksDB + liblightnvm
      • LSM is the FTL
        - Tree nodes (files) control data placement on physical flash
        - Sstables, WAL, and MANIFEST on the Open-Channel SSD; the rest in the FS
        - Garbage collection takes place during LSM compaction
      • LightNVM manages flash
        - liblightnvm takes care of flash block provisioning; Env DFlash works directly with PPAs
        - Media manager handles wear-levelling (get/put block)
      [Diagram labels: RocksDB LSM (Env Options, Optimizations, Init Code, Device Features); Env DFlash (DFWritableFile, DFSequentialFile, DFRandomAccessFile; DFlash File (blocks); Append API; Free Blocks, Used Blocks, Bad Blocks; sst, WAL, Manifest; Fast I/O Path; data placement controlled by LSM); Env Posix (PxWritableFile, PxSequentialFile, PxRandomAccessFile; CURRENT, LOCK, IDENTITY, LOG (Info); File System I/O; Traditional SSD); liblightnvm; USER SPACE / KERNEL SPACE; Append-only FS, I/O device, Media Manager, Open-Channel SSD]
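
The split described above (sstables, WAL, and MANIFEST on the Open-Channel SSD, everything else on a file system) amounts to routing by file type. The sketch below illustrates that routing using RocksDB's usual file naming (`*.sst`, `*.log` for the WAL, `MANIFEST-*`); the enum and function names are hypothetical, and the actual Env DFlash code may decide differently.

```c
#include <string.h>

enum backend { BACKEND_POSIX, BACKEND_DFLASH };

/* Returns 1 if s ends with suffix, 0 otherwise. */
static int has_suffix(const char *s, const char *suffix)
{
    size_t ls = strlen(s), lx = strlen(suffix);
    return ls >= lx && strcmp(s + ls - lx, suffix) == 0;
}

/* Picks a storage backend from a RocksDB file name (no directory part).
 * Sstables, WAL files, and MANIFEST go to flash; CURRENT, LOCK, IDENTITY,
 * and the info LOG stay on the POSIX file system. */
static enum backend choose_backend(const char *fname)
{
    if (has_suffix(fname, ".sst") ||
        has_suffix(fname, ".log") ||
        strncmp(fname, "MANIFEST-", 9) == 0)
        return BACKEND_DFLASH;
    return BACKEND_POSIX;
}
```

Keeping the small, rarely written files on a conventional file system means only append-only, block-sized data has to follow flash constraints.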
  • 12. Architecture: Optimizing RocksDB
      • RocksDB is in full control:
        - Do not mix R/W
        - Different VLUN per I/O path
        - Different VLUN types
        - Enabling I/O scheduling
        - Block pool in DFlash (prefetching)
      [Diagram labels: User-space (RocksDB (I/O Apps): active memtable, read-only memtable, merging & compaction, L0 sst, WAL, MANIFEST, sst1, sst2, … sstn; DFlash Storage Backend; liblightnvm; Traditional Apps); Linux Kernel (File-System, Target (e.g. RRPC), General Media Manager, LightNVM Compatible Device Drivers (NVMe, null-blk)); Open-Channel SSDs (Metadata State Mgmt., XOR/RAID Engine, ECC Engine; LUN0 … LUN7, LUNm, each marked R, W, or R/W)]
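
One way to read "different VLUN per I/O path" is a static partition of the available LUNs among RocksDB's I/O paths, so that, for instance, WAL appends never queue behind compaction writes. The sketch below shows such a partition; the set of paths, the equal-share policy, and all names are assumptions made for illustration and do not come from the DFlash backend.

```c
/* Hypothetical I/O paths that each get their own group of LUNs. */
enum io_path { IO_WAL, IO_MANIFEST, IO_FLUSH, IO_COMPACTION, IO_PATH_COUNT };

struct vlun_range {
    int first;   /* index of the first LUN in the range, -1 if none */
    int count;   /* number of LUNs in the range */
};

/* Splits nluns LUNs into equal, non-overlapping ranges, one per I/O path.
 * Leftover LUNs go to the last path (compaction). Returns {-1, 0} when
 * there are fewer LUNs than paths or the path is invalid. */
static struct vlun_range range_for_path(enum io_path path, int nluns)
{
    struct vlun_range r = { -1, 0 };
    int p = (int)path;
    int share = nluns / IO_PATH_COUNT;

    if (share == 0 || p < 0 || p >= IO_PATH_COUNT)
        return r;

    r.first = p * share;
    r.count = share;
    if (p == IO_PATH_COUNT - 1)
        r.count += nluns % IO_PATH_COUNT;
    return r;
}
```

With the eight LUNs drawn on the slide, each of the four assumed paths would own two LUNs; a scheduler could then stripe writes within a range without mixing reads and writes across paths.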
  • 13. Evaluation: CNEX WestLake FPGA
      • SDK overview: FPGA prototype platform before ASIC
        - PCIe G3x4 or PCIe G2x8
        - 4x10GE NVMoE
        - 40 bit DDR3
        - 16 CH NAND
      * Not real performance results: FPGA prototype, not ASIC
      [Figure: Read/Write Performance (MB/s) against LUNs active (1 to 63); y-axis 0 to 1600 MB/s; series labelled Write KB/s and Read KB/s]
  • 15. Status & Ongoing work
      • Status:
        - LightNVM has been accepted in the Linux Kernel
        - Working prototype of RocksDB with DFlash storage backend on LightNVM
        - Implemented liblightnvm append-only support + I/O interface (first iteration)
      • Ongoing:
        - Performance testing and tuning on the Westlake ASIC
        - Move RocksDB DFlash storage backend logic to liblightnvm
        - Improve I/O submission
          • Submit I/Os directly to the media manager
          • Support libaio to enable async I/O through the traditional stack
        - Exploit device parallelism within RocksDB's LSM
  • 16. Tools & Libraries
      - lnvm (Open-Channel SSDs administration): https://github.com/OpenChannelSSD/lightnvm-adm
      - liblightnvm (LightNVM application support): https://github.com/OpenChannelSSD/liblightnvm
      - Test tools: https://github.com/OpenChannelSSD/lightnvm-hw
      - QEMU NVMe with Open-Channel SSD support: https://github.com/OpenChannelSSD/qemu-nvme
  • 17. RocksDB on Open-Channel SSDs
      Javier González <javier@cnexlabs.com>, RocksDB Annual Meetup'15 - Facebook
      - Open-Channel SSD Project: https://github.com/OpenChannelSSD
      - LightNVM: https://github.com/OpenChannelSSD/linux
      - liblightnvm: https://github.com/OpenChannelSSD/liblightnvm
      - RocksDB: https://github.com/OpenChannelSSD/rocksdb
      - Interface Specification: http://goo.gl/BYTjLI
      - Documentation: http://openchannelssd.readthedocs.org