1
Towards Application Driven Storage
Optimizing RocksDB for Open-Channel SSDs
Javier González <javier@cnexlabs.com>
LinuxCon Europe 2015
Contributors: Matias Bjørling and Florin Petriuc
Application Driven Storage: What is it?
2
[Figure: today's I/O stack: RocksDB, app-specific optimizations, metadata management, and standard libraries in user space; page cache, block I/O interface, and FS-specific logic in kernel space]
Application Driven Storage: What is it?
• Application Driven Storage
  - Avoid multiple (redundant) translation layers
  - Leverage optimization opportunities
  - Minimize overhead when manipulating persistent data
  - Make better decisions regarding latency, resource utilization, and data movement (compared to best-effort techniques today)
3
[Figure: the same I/O stack, contrasting the generic kernel path with application-driven optimizations: Generic <> Optimized]
➡ Motivation: give the tools to the applications that know how to manage their own storage
Application Driven Storage Today
• Arrakis (https://arrakis.cs.washington.edu)
  - Remove the OS kernel entirely from normal application execution
• Samsung multi-stream
  - Let the SSD know where "I/O streams" originate so it can make better decisions
• Fusion-io
  - Dedicated I/O stack to support a specific type of hardware
• Open-Channel SSDs
  - Expose SSD characteristics to the host and give it full control over its storage
4
Traditional Solid State Drives
• Flash complexity is abstracted away from the host by an embedded Flash Translation Layer (FTL)
  - Maps logical addresses (LBAs) to physical addresses (PPAs)
  - Deals with flash constraints (next slide)
  - Has enabled adoption by making SSDs compliant with the existing I/O stack
5
[Figure: SSD internals: high throughput + low latency, delivered by parallelism + the controller]
[Figure: a flash block as an array of pages (Page 0 ... Page n-1), each with data, an out-of-band (OOB) area, and a state]
Flash memory 101
6
• Flash constraints:
  - Write at a page granularity
    • Page state: valid, invalid, or erased
  - Write sequentially within a block
  - Write only to an erased page
    • The page then becomes valid
  - Updates are written to a new page
    • Old pages become invalid - hence the need for GC
  - Read at a page granularity (sequential/random reads)
  - Erase at a block granularity (all pages in the block)
    • Garbage collection (GC):
      • Move valid pages to a new block
      • Erase valid and invalid pages -> erased state
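As a minimal sketch of these constraints (not LightNVM code; the geometry value is illustrative), the model below programs pages sequentially, invalidates old copies on update, and implements GC as copy-the-valid-pages-then-erase:

#include <array>

enum class PageState { Erased, Valid, Invalid };

struct FlashBlock {
    static constexpr int kPagesPerBlock = 256;      // illustrative geometry
    std::array<PageState, kPagesPerBlock> pages;
    int next_page = 0;                              // sequential write cursor

    FlashBlock() { pages.fill(PageState::Erased); } // a fresh block is erased

    // Program the next erased page in order; returns its index, or -1 if the
    // block is full and a new block must be provisioned.
    int program() {
        if (next_page == kPagesPerBlock) return -1;
        pages[next_page] = PageState::Valid;
        return next_page++;
    }

    // An update written elsewhere invalidates the old copy of the data.
    void invalidate(int page) { pages[page] = PageState::Invalid; }

    // Erase works only on the whole block: every page returns to Erased.
    void erase() {
        pages.fill(PageState::Erased);
        next_page = 0;
    }
};

// Garbage collection: copy the still-valid pages into a fresh block, then
// erase the victim so its pages can be written again.
void garbage_collect(FlashBlock& victim, FlashBlock& destination) {
    for (PageState st : victim.pages)
        if (st == PageState::Valid) destination.program();
    victim.erase();
}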
• Open-Channel SSDs share control responsibilities with the host in order to implement and maintain features that typical SSDs implement strictly in the device firmware
Open-Channel SSDs: Overview
7
• Host-based FTL manages:
  - Data placement
  - I/O scheduling
  - Over-provisioning
  - Garbage collection
  - Wear-leveling
• Host needs to know:
  - SSD features & responsibilities
  - SSD geometry (a hypothetical descriptor is sketched after the figure below)
    • NAND media idiosyncrasies
    • Die geometry (blocks & pages)
    • Channels, timings, etc.
    • Bad blocks & ECC
[Figure: the host manages physical flash (Application Driven Storage); physical flash is exposed to the host through read, write, and erase]
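To make the geometry bullet concrete, here is a hypothetical sketch of the kind of descriptor a host-side FTL would need to obtain from the device before placing data; the field names are illustrative and do not match the LightNVM identify structures.

#include <cstdint>

struct OcssdGeometry {
    uint16_t num_channels;        // parallel channels
    uint16_t luns_per_channel;    // dies (LUNs) behind each channel
    uint32_t blocks_per_lun;      // erase blocks per die
    uint32_t pages_per_block;     // pages per erase block
    uint32_t page_size;           // bytes per flash page (program/read unit)
    uint32_t oob_size;            // out-of-band bytes per page (metadata/ECC)
    uint32_t t_read_us;           // nominal NAND timings
    uint32_t t_prog_us;
    uint32_t t_erase_us;
};

// Raw capacity the host FTL can place data into; bad blocks are subtracted
// separately using the bad-block table reported by the device.
inline uint64_t usable_bytes(const OcssdGeometry& g) {
    return static_cast<uint64_t>(g.num_channels) * g.luns_per_channel *
           g.blocks_per_lun * g.pages_per_block * g.page_size;
}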
Open-Channel SSDs: LightNVM
8
[Figure: the LightNVM stack: user-space consumers (key-value/object/FS/block, file system) use targets (block target, direct flash target, vendor-specific target) on top of a block manager (generic, vendor-specific, ...); Open-Channel SSDs (NVMe, PCI-e, RapidIO, ...) expose the raw NAND geometry, the block manager exposes a managed geometry, and the device may offload engines (block copy, metadata state mgmt., bad block state mgmt., XOR/ECC, error handling, GC)]
LightNVM Framework
• Targets
  - Expose physical media to user space
• Block Managers
  - Manage physical SSD characteristics
  - Even out wear across all flash
• Open-Channel SSD
  - Responsibility
  - Offload engines
LightNVM's DFlash: Application FTL
9
➡ DFlash is the LightNVM target supporting application FTLs
[Figure: the DFlash target: in user space, the application keeps a provisioning buffer of blocks (Block0 ... BlockN) obtained through the provisioning interface (get_block() / put_block() / erase_block()) and issues normal I/O (sync, psync, libaio, posixaio, ...) against the block device at blockN->bppa * PAGE_SIZE; in kernel space, the DFlash target and the block manager sit on the Open-Channel SSD. The physical flash layout (channels CH0 ... CHN, LUNs Lun0 ... LunN) is NAND-specific and managed by the controller; managed flash is exposed as vblocks of different types to exploit parallelism and serve application needs. The application FTL handles data placement, I/O scheduling, over-provisioning, garbage collection, and wear-leveling.]

struct nvm_tgt_type tt_dflash = {
  [...]
  .make_rq = df_make_rq,
  .end_io  = df_end_io,
  [...]
};

struct vblock {
  uint64_t id;
  uint64_t owner_id;
  uint64_t nppas;
  uint64_t ppa_bitmap;
  sector_t bppa;
  uint32_t vlun_id;
  uint8_t  flags;
};
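The provisioning interface and the vblock layout lend themselves to a small host-side illustration. The following is a hedged user-space sketch, not LightNVM code: get_block()/put_block() appear as hypothetical wrappers for the DFlash target calls, and kPageSize is an assumed page size; the point is the blockN->bppa * PAGE_SIZE addressing rule from the figure.

#include <cstdint>
#include <cstddef>

// Mirror of the kernel vblock shown above (sector_t replaced by uint64_t for
// a user-space sketch).
struct vblock {
    uint64_t id;
    uint64_t owner_id;
    uint64_t nppas;        // number of physical pages in the block
    uint64_t ppa_bitmap;   // valid-page bitmap
    uint64_t bppa;         // base physical page address
    uint32_t vlun_id;
    uint8_t  flags;
};

// Hypothetical wrappers around the DFlash provisioning calls (declarations
// only; the real requests go through the target's provisioning interface).
bool get_block(uint32_t vlun_id, vblock* out);
void put_block(const vblock& blk);

constexpr size_t kPageSize = 4096;   // assumed flash page size

// Byte offset, on the target's block device, of page `idx` inside `blk`.
inline uint64_t page_offset(const vblock& blk, uint64_t idx) {
    return (blk.bppa + idx) * kPageSize;
}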
Open-Channel SSDs: Challenges
1. Which classes of applications would benefit most from being able to manage physical flash?
   - Modify the storage backend (i.e., no POSIX)
   - Probably no file system, page cache, block I/O interface, etc.
2. Which changes do we need to make to these applications?
   - Make them work on Open-Channel SSDs
   - Optimize them to take advantage of directly using physical flash (e.g., data structures, file abstractions, algorithms)
3. Which interfaces would (i) make the transition simpler, and (ii) simultaneously cover different classes of applications?
10
➡ A new paradigm that we need to explore in the whole I/O stack
RocksDB: Overview
11
• Embedded key-value persistent store
• Based on a Log-Structured Merge tree
• Optimized for fast storage
• Server workloads
• Fork from LevelDB
• Open source:
  - https://github.com/facebook/rocksdb
• RocksDB is not:
  - Not distributed
  - No failover
  - Not highly available
RocksDB reference: The Story of RocksDB, Dhruba Borthakur and Haobo Xu (link)
The Log-Structured Merge-Tree, Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil. Acta Informatica, 1996.
RocksDB: Overview
12
[Figure: the RocksDB LSM: write requests go to the log (WAL) and the active memtable; on a switch, the active memtable becomes a read-only memtable and is flushed to L0 sstables; compaction merges sstables into lower levels; read requests consult the memtables, bloom filters, and sstables]
Problem: RocksDB Storage Backend
13
[Figure: RocksDB storage layout: the LSM logic sits on a storage backend (Posix, HDFS, Win) that persists user data (sstables), the DB log (WALs), and metadata (MANIFESTs plus CURRENT), along with LOCK, IDENTITY, and the info LOG]
• Storage backend decoupled from the LSM
  - WritableFile(): sequential writes -> the only way to write to secondary storage
  - SequentialFile(): sequential reads, used primarily for sstable user data and recovery
  - RandomAccessFile(): random reads, used primarily for metadata (e.g., CRC checks)
    (a simplified sketch of these three roles follows at the end of this slide)
• Persisted files:
  - Sstable: persistent memtable
  - DB Log: Write-Ahead Log (WAL)
  - MANIFEST: file metadata
  - IDENTITY: instance ID
  - LOCK: use_existing_db
  - CURRENT: superblock
  - Info Log: logging & debugging
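A simplified mirror of the three file roles the backend must provide is sketched below; these are not the real rocksdb::Env signatures, only an outline of the responsibilities listed above.

#include <cstddef>
#include <cstdint>
#include <string>

// Append-only writes: the only way to write to secondary storage.
class DFWritableFile {
 public:
    virtual ~DFWritableFile() = default;
    virtual bool Append(const std::string& data) = 0;  // buffered, page-aligned
    virtual bool Sync() = 0;                            // persist complete flash pages
};

// Sequential reads: sstable scans and WAL/MANIFEST replay during recovery.
class DFSequentialFile {
 public:
    virtual ~DFSequentialFile() = default;
    virtual bool Read(size_t n, std::string* result) = 0;
};

// Random reads: metadata lookups, e.g. per-block CRC checks.
class DFRandomAccessFile {
 public:
    virtual ~DFRandomAccessFile() = default;
    virtual bool Read(uint64_t offset, size_t n, std::string* result) const = 0;
};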
RocksDB: LSM using Physical Flash
• Objective: fully optimize RocksDB for flash memories
  - Control data placement:
    • User data in sstables stays close on the physical media (same block, adjacent blocks)
    • Same for the WAL and MANIFEST
  - Exploit parallelism:
    • Define virtual blocks based on file write patterns in the storage backend
    • Get blocks from different LUNs based on RocksDB's LSM write patterns
  - Schedule GC and minimize over-provisioning:
    • Use LSM sstable merging strategies to minimize (and ideally remove) the need for GC and over-provisioning on the SSD
  - Control I/O scheduling:
    • Prioritize I/Os based on the LSM's persistence needs, e.g., L0 and the WAL have higher priority than levels used for compacted data, to maximize persistence in case of power loss (see the priority sketch after this slide)
14
➡ Implement an FTL optimized for RocksDB, which can be reused for similar applications (e.g., LevelDB, Cassandra, MongoDB)
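One way to read the I/O-scheduling bullet above is as a static priority mapping; the sketch below is purely illustrative (the enum and values are assumptions, not RocksDB or LightNVM code) and only encodes the rule "WAL and L0 before compaction writes".

enum class IoSource { kWal, kL0Flush, kMetadata, kCompaction };

// Lower value = scheduled first by a (hypothetical) DFlash I/O scheduler.
inline int io_priority(IoSource src, int target_level = 0) {
    switch (src) {
        case IoSource::kWal:        return 0;  // must reach flash before the write is acked
        case IoSource::kL0Flush:    return 1;  // memtable contents not yet persisted
        case IoSource::kMetadata:   return 2;  // MANIFEST updates
        case IoSource::kCompaction: return 3 + target_level;  // data already durable elsewhere
    }
    return 10;  // unreachable; keeps compilers quiet
}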
RocksDB + DFlash: Challenges
• Sstables (persistent memtables)
  - P1: Fit block sizes in L0 and further levels (merges + compactions)
    • No need for GC on the SSD side - RocksDB merging acts as GC (less write and space amplification)
  - P2: Keep block metadata to reconstruct sstables in case of a host crash
• WAL (Write-Ahead Log) and MANIFEST
  - P3: Fit block sizes (same as for sstables)
  - P4: Keep block metadata to reconstruct the log in case of a host crash
• Other metadata
  - P5: Keep superblock metadata and allow the database to be recovered
  - P6: Keep other metadata to account for flash constraints (e.g., partial pages, bad pages, bad blocks)
• Process
  - P7: Follow the RocksDB architecture - an upstreamable solution
15
[Figure: a DFlash file spanning Flash Block 0 and Flash Block 1 (each nppas * PAGE_SIZE bytes, starting at offset 0) and holding arena blocks (kArenaSize, optimal size ~1/10 of write_buffer_size) for files gid: 128, 273, 481; the space between EOF and the end of the last block is space amplification for the life of the file]
P1, P3: Match flash block size
16
• WAL and MANIFEST are reused in future instances until replaced
  - P3: Ensure that the WAL and MANIFEST replacement size fills up most of the last block
• Sstable sizes follow a heuristic - MemTable::ShouldFlushNow()
  - P1:
    - kArenaBlockSize = sizeof(block)
    - Conservative heuristic in terms of overallocation
      • A few lost pages is better than allocating a new block
    - The flash block size becomes a "static" DB tuning parameter that is used to optimize "dynamic" ones (see the worked example below)
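A worked example of P1/P3 under an assumed geometry (4 KB pages, 256 pages per block; both values are illustrative) shows how the static flash block size can drive the dynamic RocksDB knobs:

#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t page_size        = 4096;                         // assumed flash page
    const uint64_t pages_per_block  = 256;                          // assumed geometry
    const uint64_t flash_block_size = page_size * pages_per_block;  // 1 MiB here

    // P1: size arena blocks to the flash block, so the flush heuristic
    // (MemTable::ShouldFlushNow) overallocates at most a few pages instead
    // of provisioning a whole extra flash block.
    const uint64_t arena_block_size = flash_block_size;

    // RocksDB's guideline of arena block ~ 1/10 of the write buffer then
    // pins the write buffer to a whole number of flash blocks.
    const uint64_t write_buffer_size = 10 * arena_block_size;

    std::printf("flash block: %llu KiB, arena block: %llu KiB, write buffer: %llu MiB\n",
                (unsigned long long)(flash_block_size >> 10),
                (unsigned long long)(arena_block_size >> 10),
                (unsigned long long)(write_buffer_size >> 20));
    return 0;
}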
➡ Optimize RocksDB bottom up (from storage backend to LSM)
P2, P4, P6: Block Metadata
17
[Figure: flash block layout: block starting metadata in the first valid page, RocksDB data in the intermediate pages, block ending metadata in the last page; each page has an out-of-bound (OOB) area]
• Blocks can be checked for integrity
• A new DB instance can append; padding is maintained in the OOB area (P6)
• Closing a block updates bad page & bad block information (P6)
struct vblock_init_meta {
    char filename[100];       // RocksDB file GID
    uint64_t owner_id;        // Application owning the block
    size_t pos;               // relative position in block
};

struct vpage_meta {
    size_t valid_bytes;       // Valid bytes from offset 0
    uint8_t flags;            // State of the page
};

struct vblock_close_meta {
    size_t written_bytes;     // Payload size
    size_t ppa_bitmap;        // Updated valid page bitmap
    size_t crc;               // CRC of the whole block
    unsigned long next_id;    // Next block ID (0 if last)
    uint8_t flags;            // Vblock flags
};
P2, P4, P6: Crash Recovery
18
• A DFlash file can be reconstructed from individual blocks (P2, P4)
  1. Metadata for the blocks forming a DFlash file is stored in the MANIFEST
     • The last WAL is not guaranteed to reach the MANIFEST -> RECOVERY metadata for DFlash
  2. On recovery, LightNVM provides an application with all its valid blocks
  3. Each block stores enough metadata to reconstruct a DFlash file
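A hedged sketch of steps 2-3: the structure mirrors the per-block starting metadata from the previous slide, while the grouping logic itself is hypothetical.

#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// What recovery learns about each valid block it gets back from LightNVM
// (mirrors vblock_init_meta on the block-metadata slide).
struct RecoveredBlock {
    std::string filename;   // RocksDB file GID
    size_t pos;             // relative position of this block in the file
    uint64_t block_id;
};

// Regroup an unordered dump of valid blocks into per-file, ordered block lists.
std::map<std::string, std::vector<uint64_t>>
reconstruct_files(const std::vector<RecoveredBlock>& blocks) {
    std::map<std::string, std::map<size_t, uint64_t>> by_file;
    for (const RecoveredBlock& b : blocks)
        by_file[b.filename][b.pos] = b.block_id;   // pos orders blocks within a file

    std::map<std::string, std::vector<uint64_t>> files;
    for (const auto& file : by_file)
        for (const auto& entry : file.second)
            files[file.first].push_back(entry.second);
    return files;
}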
[Figure: recovery flow: (1) the MANIFEST holds OPCODE + ENCODED METADATA records (metadata types: log, current, metadata, sstable); the private Env metadata is enough to recover the database in a new instance, and the private DFlash metadata lists the vblocks forming each DFlash file; (2) the block manager on the Open-Channel SSD returns the list of blocks tagged with the application's owner ID; (3) the blocks themselves complete the reconstruction]
19
• CURRENT is used to store the RocksDB "superblock"
  - It points to the current MANIFEST, which is used to reconstruct the DB when creating a new instance. We append the block metadata that points to the blocks forming the current MANIFEST (P5)
P5: Superblock
[Figure: CURRENT (stored via the Posix backend) points to the current MANIFEST (stored via DFlash); normal recovery then replays the MANIFEST's OPCODE + ENCODED METADATA records (log, current, metadata, sstable), whose private Env and DFlash parts recover the database and the vblocks forming each DFlash file]
P7: Work upstream
20
RocksDB + DFlash: Prototype (1/2)
21
• Optimize RocksDB for flash storage
  - Implement a user-space append-only FS that deals with flash constraints
    • Append-only: updates are re-written and old data invalidated -> the LSM understands this logic
    • Page cache implemented in user space; use direct I/O
    • Only "sync" complete pages, and prefer closed blocks (see the append sketch after this slide)
    • In case of a write failure, write to a new block (or mark the bad page and retry)
  - Implement RocksDB's file classes for DFlash:
    • WritableFile(): sequential writes -> the only way to write to secondary storage
    • SequentialFile(): used primarily for sstable user data and recovery
    • RandomAccessFile(): used primarily for metadata (e.g., CRC checks)
• Use the flash block as the central piece for storage optimizations
  - The Open-Channel SSD fabric is configured first
    • Define the block size, across LUNs and channels, to exploit parallelism
    • Define different types of LUNs with different block features
  - RocksDB is configured with standard parameters (e.g., write buffer, cache)
    • The DFlash backend tunes these parameters based on the type of LUN and block
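A minimal sketch of the append-only write path described above, assuming a page-sized user-space buffer; submit_page() is a hypothetical stand-in for the Direct I/O page write to a vblock, and the bad-page/new-block error handling is omitted.

#include <algorithm>
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

class AppendOnlyWriter {
 public:
    // submit_page() stands in for the real Direct I/O page write.
    using PageWriter = std::function<bool(const char* page, size_t len)>;

    AppendOnlyWriter(size_t page_size, PageWriter submit_page)
        : page_size_(page_size), submit_(std::move(submit_page)), buf_(page_size) {}

    // Append arbitrary-sized data: full pages are written out immediately,
    // the tail stays buffered in user space ("only sync complete pages").
    bool Append(const char* data, size_t len) {
        while (len > 0) {
            size_t take = std::min(len, page_size_ - used_);
            std::memcpy(buf_.data() + used_, data, take);
            used_ += take; data += take; len -= take;
            if (used_ == page_size_) {
                if (!submit_(buf_.data(), page_size_)) return false;
                used_ = 0;
            }
        }
        return true;
    }

 private:
    size_t page_size_;
    PageWriter submit_;
    std::vector<char> buf_;
    size_t used_ = 0;
};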
RocksDB + DFlash: Prototype (2/2)
22
• Use LSM merging strategies as perfect garbage collection (GC)
  - All blocks in a DFlash file are either active or inactive -> no need for GC in the SSD
  - Reduce over-provisioning significantly (~5%)
  - Predictable latency -> the SSD is in steady state from the beginning
• Reuse RocksDB concepts and abstractions as much as possible
  - Store private metadata in the MANIFEST
  - Store the superblock in CURRENT
  - Minimize the amount of "visible" metadata; use OOB, the root FS, etc.
• Separate persistent (meta)data into "fast" and "static"
  - Fast data is all user data (i.e., sstables) and the WAL
  - Fast metadata follows user data rates (i.e., MANIFEST)
  - Static metadata is written once and seldom updated
    • CURRENT: superblock for the MANIFEST
    • LOCK and IDENTITY
Architecture: RocksDB with DFlash
23
[Figure: RocksDB with DFlash: the LSM logic sits on the DFlash storage backend; sstables (user data), the WAL (DB log), and MANIFEST metadata go to the Open-Channel SSD (fast storage), while CURRENT, LOCK, IDENTITY, and the info LOG stay on a Posix FS (static metadata); Env DFlash provides the file classes DFWritableFile(), DFSequentialFile(), DFRandomAccessFile() for persist operations, receives flash characteristics via Env options at init, tunes the environment, and keeps private metadata for close and crash handling]
Architecture: DFlash + LightNVM
24
[Figure: DFlash + LightNVM: in user space, the RocksDB LSM uses Env DFlash (DFWritableFile, DFSequentialFile, DFRandomAccessFile; free/used/bad block lists, sector calculations, data placement controlled by the LSM) for sstables, the WAL, and the MANIFEST, and Env Posix (PxWritableFile, PxSequentialFile, PxRandomAccessFile) for CURRENT, LOCK, IDENTITY, and the info LOG on a traditional SSD with a file system; in kernel space, DFlash files obtain blocks via get_block()/put_block() from the block manager (BM) on the Open-Channel SSD, whose device features are passed up as Env options by the init code]
Architecture: DFlash + LightNVM
25
• The LSM is the FTL
  - The DFlash target and the RocksDB storage backend take care of provisioning flash blocks
• Optimized critical I/O path
  - Sstables, the WAL, and the MANIFEST are stored on the Open-Channel SSD, where we can provide QoS
• Enable a RocksDB distributed architecture
  - The BM abstracts the storage fabric (e.g., NVM) and can potentially provide blocks from different drives -> a single address space
26
QEMU Evaluation: Writes

RocksDB, make release; writes with 4 threads; page-aligned write buffer. (Bare metal: ~180 MB/s.)

  ENTRY KEYS        DFLASH (1 LUN)    POSIX
  10000 keys        70 MB/s           25 MB/s
  100000 keys       40 MB/s           25 MB/s
  1000000 keys      25 MB/s           20 MB/s

- The DFlash write page cache + direct I/O + flash-page-aligned write buffer is better optimized than best-effort techniques and top-down optimizations (RocksDB parameters). The write buffer is required by RocksDB due to small WAL writes.
- If we manually tune buffer sizes with Posix, we obtain similar results. However, it requires lots of experimentation for each configuration.
27
QEMU Evaluation: Reads

RocksDB, make release; reads, no page cache support.

  ENTRY KEYS        DFLASH (1 LUN)    POSIX
  10000 keys        5 MB/s            300 MB/s
  100000 keys       5 MB/s            500 MB/s
  1000000 keys      5 MB/s            570 MB/s

- Without a DFLASH page cache we need to issue an I/O for each read!
  • Sequential 20-byte reads within the same page each issue a separate PAGE_SIZE I/O
28
QEMU Evaluation: "Fixing" Reads

RocksDB, make release; reads, with a simple page cache for reads.

  ENTRY KEYS        DFLASH (1 LUN)    DFLASH (1 LUN, + simple page cache)    POSIX
  10000 keys        5 MB/s            160 MB/s                               300 MB/s
  100000 keys       5 MB/s            280 MB/s                               500 MB/s
  1000000 keys      5 MB/s            300 MB/s                               570 MB/s

- Posix + buffered I/O using Linux's page cache is still better, but we have confirmed our hypothesis.

➡ A user-space page cache is a necessary optimization when the generic OS page cache is bypassed. Other databases use this technique (e.g., Oracle, MySQL).
QEMU Evaluation: Insights
29
• The Posix backend and the DFlash backend (with 1 LUN) should achieve very similar throughput for reads/writes when using the same page cache and write buffer optimizations
• But...
  - DFlash allows buffer and cache sizes to be optimized based on flash characteristics
  - DFlash knows which file class is calling, so we can do prefetching for sequential reads (DFSequentialFile) at block granularity (see the sketch below)
  - DFlash is designed to implement a flash-optimized page cache using direct I/O
• If the Open-Channel SSD exposes several LUNs, we can exploit parallelism between DFlash and RocksDB's LSM write/read patterns
  - How many LUNs there are, and how they and their characteristics are organized, is controller specific
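A sketch of the block-granularity prefetching idea for DFSequentialFile; read_block() is a hypothetical stand-in for the DFlash read of a whole flash block, and the block size comes from whatever geometry the device reports.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

class BlockPrefetcher {
 public:
    // read_block() stands in for the DFlash read of one whole flash block.
    using BlockReader = std::function<bool(uint64_t block_idx, char* dst)>;

    BlockPrefetcher(size_t block_size, BlockReader read_block)
        : block_size_(block_size), read_block_(std::move(read_block)), cache_(block_size) {}

    // Sequential read: the cached block is refilled only when the cursor
    // crosses a block boundary; small reads (e.g. 20 bytes) are served from
    // memory instead of each issuing a PAGE_SIZE I/O.
    bool Read(char* dst, size_t n) {
        while (n > 0) {
            uint64_t blk = offset_ / block_size_;
            if (blk != cached_block_) {
                if (!read_block_(blk, cache_.data())) return false;
                cached_block_ = blk;
            }
            size_t in_block = offset_ % block_size_;
            size_t take = std::min<size_t>(n, block_size_ - in_block);
            std::memcpy(dst, cache_.data() + in_block, take);
            offset_ += take; dst += take; n -= take;
        }
        return true;
    }

 private:
    size_t block_size_;
    BlockReader read_block_;
    std::vector<char> cache_;
    uint64_t cached_block_ = UINT64_MAX;
    uint64_t offset_ = 0;
};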
CNEX WestLake SDK: Overview
30
[Plot: read/write performance (MB/s) as a function of active LUNs (1 to 63)]

FPGA prototype platform before ASIC:
- PCIe G3x4 or PCI G2x8
- 4x10GE NVMoE
- 40-bit DDR3
- 16-CH NAND

* Not real performance results - FPGA prototype, not ASIC
CNEX WestLake Evaluation
31
RocksDB, make release; entry keys with 4 threads.

                           WRITES     READS      WRITES      READS       WRITES       READS
                           (1 LUN)    (1 LUN)    (8 LUNs)    (8 LUNs)    (64 LUNs)    (64 LUNs)
  RocksDB DFLASH
    10000 keys             21 MB/s    40 MB/s    X           X           X            X
    100000 keys            21 MB/s    40 MB/s    X           X           X            X
    1000000 keys           21 MB/s    40 MB/s    X           X           X            X
  Raw DFLASH (with fio)    32 MB/s    64 MB/s    190 MB/s    180 MB/s    920 MB/s     1.3 GB/s
• We focus on a single I/O stream for the first prototype -> 1 LUN
CNEX WestLake Evaluation: Insights
32
• RocksDB checks sstable integrity on writes (intermittent reads)
  - We pay the price of not having an optimized page cache on writes as well
  - Reads and writes are mixed on one single LUN
• Ongoing work: exploit parallelism in RocksDB's I/O patterns
[Figure: mapping RocksDB's I/O streams onto the device: the active and read-only memtables, WAL, MANIFEST, and sstables SST1 ... SSTN produce distinct read/write streams during merging & compaction; DFlash obtains blocks from the block manager (get_block()/put_block()) and maps each stream to its own virtual LUN backed by physical LUNs (LUN0 ... LUN9, channels CH0 ... CHN) on the WestLake-based Open-Channel SSD]

- Do not mix reads and writes
- Different VLUN per path
- Different VLUN types
- Enabling I/O scheduling
- Block pool in DFlash (prefetching)
CNEX WestLake Evaluation: Insights
33
• Also, on any Open-Channel SSD:
  - DFlash will not take a performance hit when the SSD triggers GC; RocksDB does GC when merging sstables in LSM levels > L0
    • SSD steady state is improved (and reached from the beginning)
    • We achieve predictable latency

[Plot: IOPS over time, illustrating steady-state behavior]
Status and ongoing work
• Status:
  - get_block/put_block interface (first iteration) through the DFlash target
  - RocksDB DFlash target that plugs into LightNVM. Source code to test is available (RocksDB, kernel, and QEMU). Working on WestLake upstreaming
• Ongoing:
  - Implement functions to increase the synergy between the LSM and the storage backend (i.e., tune the write buffer based on block size) -> upstreaming
  - Support libaio to enable async I/O in the DFlash storage backend
    • Need to deal with RocksDB design decisions (e.g., Get() assumes sync I/O)
  - Exploit device parallelism within RocksDB internal structures
  - Define different types of virtual LUNs and expose them to the application
  - Other optimizations: double buffering, aligned memory in the LSM, etc.
34
  - Move RocksDB DFlash's logic to liblightnvm -> an append-only FS for flash
Conclusions
• Application Driven Storage
  - Demo working on real hardware:
    • RocksDB -> LightNVM -> WestLake-powered SSD
  - QEMU support for testing and development
  - More Open-Channel SSDs coming soon
• RocksDB
  - DFlash dedicated backend -> an append-only FS optimized for flash
  - Sets the basis for moving to a distributed architecture while guaranteeing performance constraints (especially in terms of latency)
  - Intention to upstream the whole DFlash storage backend
• LightNVM
  - Framework to support Open-Channel SSDs in Linux
  - DFlash target to support application FTLs
35
36
Towards Application Driven Storage
Optimizing RocksDB on Open-Channel SSDs with LightNVM
LinuxCon  Europe  2015
Questions?
• Open-Channel SSD Project: https://github.com/OpenChannelSSD
• LightNVM: https://github.com/OpenChannelSSD/linux
• RocksDB: https://github.com/OpenChannelSSD/rocksdb
Javier González <javier@cnexlabs.com>

More Related Content

What's hot

Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup) Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup)
Roopa Tangirala
 
When is MyRocks good?
When is MyRocks good? When is MyRocks good?
When is MyRocks good?
Alkin Tezuysal
 
TiDB Introduction
TiDB IntroductionTiDB Introduction
TiDB Introduction
Morgan Tocker
 
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of FacebookTech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
The Hive
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
DataWorks Summit/Hadoop Summit
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
Alexey Grishchenko
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Databricks
 
RedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ TwitterRedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ Twitter
Redis Labs
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
Xiang Fu
 
Stability Patterns for Microservices
Stability Patterns for MicroservicesStability Patterns for Microservices
Stability Patterns for Microservices
pflueras
 
Rocks db state store in structured streaming
Rocks db state store in structured streamingRocks db state store in structured streaming
Rocks db state store in structured streaming
Balaji Mohanam
 
Linux Memory Management with CMA (Contiguous Memory Allocator)
Linux Memory Management with CMA (Contiguous Memory Allocator)Linux Memory Management with CMA (Contiguous Memory Allocator)
Linux Memory Management with CMA (Contiguous Memory Allocator)
Pankaj Suryawanshi
 
VictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - PreviewVictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - Preview
VictoriaMetrics
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
Till Rohrmann
 
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Databricks
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
DataStax Academy
 
Apache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceApache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage Service
Sijie Guo
 
Sizing MongoDB Clusters
Sizing MongoDB Clusters Sizing MongoDB Clusters
Sizing MongoDB Clusters
MongoDB
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted MalaskaTop 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Spark Summit
 

What's hot (20)

Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup) Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup)
 
When is MyRocks good?
When is MyRocks good? When is MyRocks good?
When is MyRocks good?
 
TiDB Introduction
TiDB IntroductionTiDB Introduction
TiDB Introduction
 
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of FacebookTech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
RedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ TwitterRedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ Twitter
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
Stability Patterns for Microservices
Stability Patterns for MicroservicesStability Patterns for Microservices
Stability Patterns for Microservices
 
Rocks db state store in structured streaming
Rocks db state store in structured streamingRocks db state store in structured streaming
Rocks db state store in structured streaming
 
Linux Memory Management with CMA (Contiguous Memory Allocator)
Linux Memory Management with CMA (Contiguous Memory Allocator)Linux Memory Management with CMA (Contiguous Memory Allocator)
Linux Memory Management with CMA (Contiguous Memory Allocator)
 
VictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - PreviewVictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - Preview
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
Apache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceApache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage Service
 
Sizing MongoDB Clusters
Sizing MongoDB Clusters Sizing MongoDB Clusters
Sizing MongoDB Clusters
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted MalaskaTop 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
 

Viewers also liked

Towards Application Driven Storage
Towards Application Driven StorageTowards Application Driven Storage
Towards Application Driven Storage
Javier González
 
퓨전아이오 - Io memory 소개
퓨전아이오 - Io memory 소개퓨전아이오 - Io memory 소개
퓨전아이오 - Io memory 소개
silverfox2580
 
ARM server, The Cy7 Introduction by Aaron Joue, Ambedded Technology
ARM server, The Cy7 Introduction by Aaron Joue, Ambedded TechnologyARM server, The Cy7 Introduction by Aaron Joue, Ambedded Technology
ARM server, The Cy7 Introduction by Aaron Joue, Ambedded Technology
Aaron Joue
 
Fusion-io Memory Flash for Microsoft SQL Server 2012
Fusion-io Memory Flash for Microsoft SQL Server 2012Fusion-io Memory Flash for Microsoft SQL Server 2012
Fusion-io Memory Flash for Microsoft SQL Server 2012
Mark Ginnebaugh
 
Energy Saving ARM Server Cluster Born for Distributed Storage & Computing
Energy Saving ARM Server Cluster Born for Distributed Storage & ComputingEnergy Saving ARM Server Cluster Born for Distributed Storage & Computing
Energy Saving ARM Server Cluster Born for Distributed Storage & Computing
Aaron Joue
 
How Ceph performs on ARM Microserver Cluster
How Ceph performs on ARM Microserver ClusterHow Ceph performs on ARM Microserver Cluster
How Ceph performs on ARM Microserver Cluster
Aaron Joue
 
FOG COMPUTING
FOG COMPUTINGFOG COMPUTING
FOG COMPUTING
Saisharan Amaravadhi
 

Viewers also liked (8)

Towards Application Driven Storage
Towards Application Driven StorageTowards Application Driven Storage
Towards Application Driven Storage
 
퓨전아이오 - Io memory 소개
퓨전아이오 - Io memory 소개퓨전아이오 - Io memory 소개
퓨전아이오 - Io memory 소개
 
ARM server, The Cy7 Introduction by Aaron Joue, Ambedded Technology
ARM server, The Cy7 Introduction by Aaron Joue, Ambedded TechnologyARM server, The Cy7 Introduction by Aaron Joue, Ambedded Technology
ARM server, The Cy7 Introduction by Aaron Joue, Ambedded Technology
 
Fusion-io Memory Flash for Microsoft SQL Server 2012
Fusion-io Memory Flash for Microsoft SQL Server 2012Fusion-io Memory Flash for Microsoft SQL Server 2012
Fusion-io Memory Flash for Microsoft SQL Server 2012
 
Energy Saving ARM Server Cluster Born for Distributed Storage & Computing
Energy Saving ARM Server Cluster Born for Distributed Storage & ComputingEnergy Saving ARM Server Cluster Born for Distributed Storage & Computing
Energy Saving ARM Server Cluster Born for Distributed Storage & Computing
 
How Ceph performs on ARM Microserver Cluster
How Ceph performs on ARM Microserver ClusterHow Ceph performs on ARM Microserver Cluster
How Ceph performs on ARM Microserver Cluster
 
Fusion io
Fusion ioFusion io
Fusion io
 
FOG COMPUTING
FOG COMPUTINGFOG COMPUTING
FOG COMPUTING
 

Similar to Optimizing RocksDB for Open-Channel SSDs

RocksDB meetup
RocksDB meetupRocksDB meetup
RocksDB meetup
Javier González
 
FlashSQL 소개 & TechTalk
FlashSQL 소개 & TechTalkFlashSQL 소개 & TechTalk
FlashSQL 소개 & TechTalk
I Goo Lee
 
IMCSummit 2015 - Day 2 General Session - Flash-Extending In-Memory Computing
IMCSummit 2015 - Day 2 General Session - Flash-Extending In-Memory ComputingIMCSummit 2015 - Day 2 General Session - Flash-Extending In-Memory Computing
IMCSummit 2015 - Day 2 General Session - Flash-Extending In-Memory Computing
In-Memory Computing Summit
 
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
Glenn K. Lockwood
 
Deploying ssd in the data center 2014
Deploying ssd in the data center 2014Deploying ssd in the data center 2014
Deploying ssd in the data center 2014
Howard Marks
 
Zoned Storage
Zoned StorageZoned Storage
Zoned Storage
singh.gurjeet
 
Storage and performance- Batch processing, Whiptail
Storage and performance- Batch processing, WhiptailStorage and performance- Batch processing, Whiptail
Storage and performance- Batch processing, WhiptailInternet World
 
Migrating to XtraDB Cluster
Migrating to XtraDB ClusterMigrating to XtraDB Cluster
Migrating to XtraDB Cluster
percona2013
 
IMCSummit 2015 - Day 2 IT Business Track - Drive IMC Efficiency with Flash E...
IMCSummit 2015 - Day 2  IT Business Track - Drive IMC Efficiency with Flash E...IMCSummit 2015 - Day 2  IT Business Track - Drive IMC Efficiency with Flash E...
IMCSummit 2015 - Day 2 IT Business Track - Drive IMC Efficiency with Flash E...
In-Memory Computing Summit
 
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Amazon Web Services
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
Uwe Printz
 
Accelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheAccelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket Cache
Nicolas Poggi
 
Database as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance PlatformDatabase as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance Platform
Maris Elsins
 
An Efficient Backup and Replication of Storage
An Efficient Backup and Replication of StorageAn Efficient Backup and Replication of Storage
An Efficient Backup and Replication of Storage
Takashi Hoshino
 
Oracle real application_cluster
Oracle real application_clusterOracle real application_cluster
Oracle real application_cluster
Prabhat gangwar
 
Using oracle12c pluggable databases to archive
Using oracle12c pluggable databases to archiveUsing oracle12c pluggable databases to archive
Using oracle12c pluggable databases to archive
Secure-24
 
The Hive Think Tank: Rocking the Database World with RocksDB
The Hive Think Tank:  Rocking the Database World with RocksDBThe Hive Think Tank:  Rocking the Database World with RocksDB
The Hive Think Tank: Rocking the Database World with RocksDB
The Hive
 
OpenStack Cinder, Implementation Today and New Trends for Tomorrow
OpenStack Cinder, Implementation Today and New Trends for TomorrowOpenStack Cinder, Implementation Today and New Trends for Tomorrow
OpenStack Cinder, Implementation Today and New Trends for Tomorrow
Ed Balduf
 
Migrating to XtraDB Cluster
Migrating to XtraDB ClusterMigrating to XtraDB Cluster
Migrating to XtraDB Cluster
percona2013
 

Similar to Optimizing RocksDB for Open-Channel SSDs (20)

RocksDB meetup
RocksDB meetupRocksDB meetup
RocksDB meetup
 
FlashSQL 소개 & TechTalk
FlashSQL 소개 & TechTalkFlashSQL 소개 & TechTalk
FlashSQL 소개 & TechTalk
 
CLFS 2010
CLFS 2010CLFS 2010
CLFS 2010
 
IMCSummit 2015 - Day 2 General Session - Flash-Extending In-Memory Computing
IMCSummit 2015 - Day 2 General Session - Flash-Extending In-Memory ComputingIMCSummit 2015 - Day 2 General Session - Flash-Extending In-Memory Computing
IMCSummit 2015 - Day 2 General Session - Flash-Extending In-Memory Computing
 
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
 
Deploying ssd in the data center 2014
Deploying ssd in the data center 2014Deploying ssd in the data center 2014
Deploying ssd in the data center 2014
 
Zoned Storage
Zoned StorageZoned Storage
Zoned Storage
 
Storage and performance- Batch processing, Whiptail
Storage and performance- Batch processing, WhiptailStorage and performance- Batch processing, Whiptail
Storage and performance- Batch processing, Whiptail
 
Migrating to XtraDB Cluster
Migrating to XtraDB ClusterMigrating to XtraDB Cluster
Migrating to XtraDB Cluster
 
IMCSummit 2015 - Day 2 IT Business Track - Drive IMC Efficiency with Flash E...
IMCSummit 2015 - Day 2  IT Business Track - Drive IMC Efficiency with Flash E...IMCSummit 2015 - Day 2  IT Business Track - Drive IMC Efficiency with Flash E...
IMCSummit 2015 - Day 2 IT Business Track - Drive IMC Efficiency with Flash E...
 
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Accelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheAccelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket Cache
 
Database as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance PlatformDatabase as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance Platform
 
An Efficient Backup and Replication of Storage
An Efficient Backup and Replication of StorageAn Efficient Backup and Replication of Storage
An Efficient Backup and Replication of Storage
 
Oracle real application_cluster
Oracle real application_clusterOracle real application_cluster
Oracle real application_cluster
 
Using oracle12c pluggable databases to archive
Using oracle12c pluggable databases to archiveUsing oracle12c pluggable databases to archive
Using oracle12c pluggable databases to archive
 
The Hive Think Tank: Rocking the Database World with RocksDB
The Hive Think Tank:  Rocking the Database World with RocksDBThe Hive Think Tank:  Rocking the Database World with RocksDB
The Hive Think Tank: Rocking the Database World with RocksDB
 
OpenStack Cinder, Implementation Today and New Trends for Tomorrow
OpenStack Cinder, Implementation Today and New Trends for TomorrowOpenStack Cinder, Implementation Today and New Trends for Tomorrow
OpenStack Cinder, Implementation Today and New Trends for Tomorrow
 
Migrating to XtraDB Cluster
Migrating to XtraDB ClusterMigrating to XtraDB Cluster
Migrating to XtraDB Cluster
 

Recently uploaded

FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 

Optimizing RocksDB for Open-Channel SSDs

  • 1. 1 Towards  Application  Driven  Storage Optimizing  RocksDB  for  Open-­‐Channel  SSDs Javier  González  <javier@cnexlabs.com> LinuxCon  Europe  2015 Contributors:  Matias  Bjørling  and  Florin  Petriuc
  • 2. Application  Driven  Storage:  What  is  it? 2 RocksDB Metadata  Mgmt. Standard  Libraries User  SpaceKernel  Space App-­‐specific  opt. Page  cache Block  I/O  interface FS-­‐specific  logic
  • 3. Application  Driven  Storage:  What  is  it? • Application  Driven  Storage   - Avoid  multiple  (redundant)   translation  layers     - Leverage  optimization   opportunities     - Minimize  overhead  when   manipulating  persistent  data   - Make  better  decisions  regarding   latency,  resource  utilization,  and   data  movement  (compared  to   best-­‐effort  techniques  today) 3 RocksDB Metadata  Mgmt. Standard  Libraries User  SpaceKernel  Space App-­‐specific  opt. Page  cache Block  I/O  interface FS-­‐specific  logic Generic  <>  Optimized ➡ Motivation:  Give  the  tools  to  the  applications   that  know  how  to  manage  their  own  storage
  • 4. Application  Driven  Storage  Today • Arrakis  (https://arrakis.cs.washington.edu)   - Remove  the  OS  kernel  entirely  from  normal  application  execution   • Samsung  multi  stream   - Let  the  SSD  know  from  where  “I/O  streams”  emerge  to  make  better   decisions   • Fusion  I/O   - Dedicated  I/O  stack  to  support  a  specific  type  of  hardware   • Open-­‐Channel  SSDs   - Expose  SSD  characteristics  to  the  host  and  give  it  full  control  over  its   storage 4
  • 5. Traditional  Solid  State  Drives • Flash  complexity  is  abstracted  away  form  the  host  by  an   embedded  Flash  Translation  Layer  (FTL)   - Maps  logical  addresses  (LBAs)  to  physical  addresses  (PPAs)   - Deals  with  flash  constrains  (next  slide)   - Has  enabled  adoption  by  making  SSDs  compliant  with  the  existing  I/O  stack 5 High  throughput  +  Low  latency Parallelism  +  Controller
  • 6. Page 0 Page 1 Page 2 Page n -1 … StateOOBData Flash  memory  101 6 • Flash  constrains:   - Write  at  a  page  granularity   • Page  state:  Valid,  invalid,  erased   - Write  sequentially  in  a  block   - Write  always  to  an  erased  page   • Page  becomes  valid   - Updates  are  written  to  a  new  page   • Old  pages  become  invalid  -­‐  need  for  GC   - Read  at  a  page  granularity  (seq./random  reads)   - Erase  at  a  block  granularity  (all  pages  in  block)   • Garbage  collection  (GC):   • Move  valid  pages  to  new  block   • Erase  valid  and  invalid  pages  -­‐>  erased  state
  • 7. Open-Channel SSDs: Overview • Open-Channel SSDs share control responsibilities with the host in order to implement and maintain features that typical SSDs implement strictly in the SSD device firmware • Host-based FTL manages: - Data placement - I/O scheduling - Over-provisioning - Garbage collection - Wear-leveling • Host needs to know: - SSD features & responsibilities - SSD geometry • NAND media idiosyncrasies • Die geometry (blocks & pages) • Channels, timings, etc. • Bad blocks & ECC [Figure: host manages physical flash; application driven storage - physical flash exposed to the host (read, write, erase)]
  • 8. Open-Channel SSDs: LightNVM [Figure: the LightNVM framework - key-value/object/FS/block consumers on top of a block target, a direct flash target, and vendor-specific targets; block managers (generic, vendor-specific, ...) translate raw NAND geometry into managed geometry; Open-Channel SSDs (NVMe, PCI-e, RapidIO, ...) provide offload engines such as block copy, metadata and bad-block state mgmt., XOR, ECC, error handling, and GC; the stack spans kernel and user space, hardware and software] • Targets - Expose physical media to user space • Block Managers - Manage physical SSD characteristics - Even out wear-leveling across all flash • Open-Channel SSD - Responsibility - Offload engines
  • 9. LightNVM's DFlash: Application FTL ➡ DFlash is the LightNVM target supporting application FTLs [Figure: the application FTL (data placement, I/O scheduling, over-provisioning, garbage collection, wear-leveling) runs in user space; it provisions flash blocks from the block manager through the DFlash target's provisioning interface (get_block(), put_block(), erase_block()) into a provisioning buffer (Block0 … BlockN), and issues normal I/O through the block device at blockN->bppa * PAGE_SIZE via sync, psync, libaio, posixaio, etc.; the physical flash layout (channels/LUNs, NAND-specific, managed by the controller) is exposed as managed vblocks of different types to exploit parallelism and serve application needs] The target is registered in the kernel roughly as: struct nvm_tgt_type _dflash = { [...] .make_rq = df_make_rq, .end_io = df_end_io, [...] }; and a provisioned block is described as: struct vblock { uint64_t id; uint64_t owner_id; uint64_t nppas; uint64_t ppa_bitmap; sector_t bppa; uint32_t vlun_id; uint8_t flags; };
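To make the addressing on this slide concrete, here is a rough user-space sketch of how an application FTL could turn a logical offset inside a DFlash file into a device byte offset using the vblock fields above. get_block()/put_block() are the provisioning calls named on the slide; dflash_get_block(), the uint64_t rendition of bppa, and the 4 KB page size are assumptions for illustration:

    #include <cstdint>
    #include <vector>

    // User-space view of the vblock on the slide (bppa = base physical page address).
    struct vblock {
      uint64_t id;
      uint64_t owner_id;
      uint64_t nppas;        // pages in this block
      uint64_t ppa_bitmap;
      uint64_t bppa;
      uint32_t vlun_id;
      uint8_t  flags;
    };

    // Hypothetical wrapper around the provisioning interface; the real path
    // goes through the DFlash target (get_block()/put_block()).
    vblock dflash_get_block(int lun_id);

    constexpr uint64_t kPageSize = 4096;   // assumption; the device reports the real value

    // A DFlash file is an ordered list of provisioned blocks. A logical file
    // offset maps to (block, offset inside the block); the device byte offset
    // is bppa * PAGE_SIZE plus the in-block offset, as the slide indicates.
    uint64_t device_offset(const std::vector<vblock>& blocks, uint64_t file_offset) {
      for (const vblock& b : blocks) {
        const uint64_t block_bytes = b.nppas * kPageSize;
        if (file_offset < block_bytes)
          return b.bppa * kPageSize + file_offset;
        file_offset -= block_bytes;
      }
      return UINT64_MAX;   // past the end of the provisioned blocks
    }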
  • 10. Open-Channel SSDs: Challenges 1. Which classes of applications would benefit most from being able to manage physical flash? - Modify the storage backend (i.e., no posix) - Probably no file system, page cache, block I/O interface, etc. 2. Which changes do we need to make to these applications? - Make them work on Open-Channel SSDs - Optimize them to take advantage of directly using physical flash (e.g., data structures, file abstractions, algorithms). 3. Which interfaces would (i) make the transition simpler, and (ii) simultaneously cover different classes of applications? ➡ A new paradigm that we need to explore across the whole I/O stack
  • 11. RocksDB: Overview • Embedded key-value persistent store • Based on a Log-Structured Merge Tree • Optimized for fast storage • Server workloads • Fork of LevelDB • Open source: - https://github.com/facebook/rocksdb • RocksDB is not: - Not distributed - No failover - Not highly available References: The Story of RocksDB, Dhruba Borthakur and Haobo Xu (link); The Log-Structured Merge-Tree, Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil. Acta Informatica, 1996.
  • 12. RocksDB: Overview [Figure: LSM structure - write requests go to the log and the active memtable; the active memtable is switched to a read-only memtable and flushed to sstables, which are merged by compaction; read requests consult the memtables and sstables with the help of bloom filters]
  • 13. Problem: RocksDB Storage Backend [Figure: RocksDB's on-disk layout - user data (sstables, DB log/WALs) and metadata (MANIFEST, CURRENT, LOCK, IDENTITY, info LOG) sit below the LSM logic, behind a pluggable storage backend (Posix, HDFS, Win)] • Storage backend decoupled from the LSM - WritableFile(): sequential writes -> the only way to write to secondary storage - SequentialFile() -> sequential reads. Used primarily for sstable user data and recovery - RandomAccessFile() -> random reads. Used primarily for metadata (e.g., CRC checks) • On-disk files - Sstable: persistent memtable - DB Log: Write-Ahead Log (WAL) - MANIFEST: file metadata - IDENTITY: instance ID - LOCK: use_existing_db - CURRENT: superblock - Info Log: logging & debugging
  • 14. RocksDB: LSM using Physical Flash • Objective: fully optimize RocksDB for flash memories - Control data placement: • User data in sstables is close in the physical media (same block, adjacent blocks) • Same for the WAL and MANIFEST - Exploit parallelism: • Define virtual blocks based on file write patterns in the storage backend • Get blocks from different LUNs based on RocksDB's LSM write patterns - Schedule GC and minimize over-provisioning • Use LSM sstable merging strategies to minimize (and ideally remove) the need for GC and over-provisioning on the SSD - Control I/O scheduling • Prioritize I/Os based on the LSM's persistence needs (e.g., L0 and the WAL have higher priority than levels used for compacted data, to maximize persistence in case of power loss) ➡ Implement an FTL optimized for RocksDB, which can be reused for similar applications (e.g., LevelDB, Cassandra, MongoDB)
  • 15. RocksDB + DFlash: Challenges • Sstables (persistent memtables) - P1: Fit block sizes in L0 and further levels (merges + compactions) • No need for GC on the SSD side - RocksDB merging acts as GC (less write and space amplification) - P2: Keep block metadata to reconstruct an sstable in case of a host crash • WAL (Write-Ahead Log) and MANIFEST - P3: Fit block sizes (same as for sstables) - P4: Keep block metadata to reconstruct the log in case of a host crash • Other metadata - P5: Keep superblock metadata and allow the database to be recovered - P6: Keep other metadata to account for flash constraints (e.g., partial pages, bad pages, bad blocks) • Process - P7: Follow the RocksDB architecture - an upstreamable solution
  • 16. P1, P3: Match flash block size [Figure: a DFlash file - write_buffer_size is carved into arena blocks (kArenaSize, optimal size ~1/10 of write_buffer_size) that map onto Flash Block 0, Flash Block 1, ... of nppas * PAGE_SIZE each; the gap between the file's EOF and the end of its last block is space amplification for the life of the file] • WAL and MANIFEST are reused in future instances until replaced - P3: Ensure that the WAL and MANIFEST replacement size fills up most of the last block • Sstable sizes follow a heuristic - MemTable::ShouldFlushNow() P1: - kArenaBlockSize = sizeof(block) - Conservative heuristic in terms of over-allocation • A few lost pages is better than allocating a new block - The flash block size becomes a “static” DB tuning parameter that is used to optimize “dynamic” ones ➡ Optimize RocksDB bottom up (from the storage backend to the LSM)
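As a rough illustration of P1/P3, the block-size matching can be expressed through existing RocksDB tuning knobs (arena_block_size and write_buffer_size are real options; by default the arena block is roughly a tenth of the write buffer, as the slide notes). The 16 MB flash block and the factor of 8 are assumptions; the real backend derives them from the device geometry at init time:

    #include <cstddef>
    #include <rocksdb/options.h>

    // Sketch: make the arena allocation unit and the memtable size line up
    // with the flash block reported by the Open-Channel SSD, so a flushed
    // sstable fills whole blocks and wastes at most a few tail pages.
    rocksdb::Options TuneForFlashBlock(size_t flash_block_bytes /* e.g. 16 << 20 */) {
      rocksdb::Options options;
      // kArenaBlockSize on the slide: allocate memtable memory in units of
      // one flash block instead of the default fraction of the write buffer.
      options.arena_block_size = flash_block_bytes;
      // Keep the memtable an integer number of flash blocks; a conservative
      // heuristic over-allocates a few pages rather than an extra block.
      options.write_buffer_size = 8 * flash_block_bytes;
      return options;
    }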
  • 17. P2, P4, P6: Block Metadata [Figure: layout of a flash block - the first valid page carries the block starting metadata, intermediate pages carry RocksDB data, and the last page carries the block ending metadata; per-page information lives in the out-of-bound (OOB) area] • Blocks can be checked for integrity • A new DB instance can append; padding is maintained in the OOB area (P6) • Closing a block updates bad page & bad block information (P6) The per-block and per-page metadata on the slide:
    struct vblock_init_meta {
        char filename[100];       // RocksDB file GID
        uint64_t owner_id;        // Application owning the block
        size_t pos;               // Relative position in block
    };
    struct vpage_meta {
        size_t valid_bytes;       // Valid bytes from offset 0
        uint8_t flags;            // State of the page
    };
    struct vblock_close_meta {
        size_t written_bytes;     // Payload size
        size_t ppa_bitmap;        // Updated valid page bitmap
        size_t crc;               // CRC of the whole block
        unsigned long next_id;    // Next block ID (0 if last)
        uint8_t flags;            // Vblock flags
    };
  • 18. P2, P4, P6: Crash Recovery • A DFlash file can be reconstructed from its individual blocks (P2, P4) 1. Metadata for the blocks forming a DFlash file is stored in the MANIFEST • The last WAL is not guaranteed to reach the MANIFEST -> RECOVERY metadata for DFlash 2. On recovery, LightNVM provides an application with all of its valid blocks 3. Each block stores enough metadata to reconstruct a DFlash file [Figure: the MANIFEST encodes opcode + metadata records per file type (log, current, metadata, sstable) - private Env metadata is enough to recover the database in a new instance, and private DFlash metadata lists the vblocks forming each DFlash file; during recovery the block manager returns the list of blocks (with their ownerID) from the Open-Channel SSD]
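A sketch of the three recovery steps above, under stated assumptions: list_valid_blocks() and read_init_meta() are hypothetical helpers standing in for the LightNVM/DFlash calls, and pos from vblock_init_meta is interpreted here as the block's relative position inside its file:

    #include <cstddef>
    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    struct RecoveredFile {
      std::vector<uint64_t> block_ids;   // ordered by the per-block position
    };

    // Step 2: LightNVM hands back all blocks owned by this application.
    // Step 3: each block's starting metadata identifies its file and position.
    struct InitMeta { std::string filename; uint64_t owner_id; size_t pos; };
    std::vector<uint64_t> list_valid_blocks(uint64_t owner_id);
    InitMeta read_init_meta(uint64_t block_id);

    // Rebuild every DFlash file from its blocks after a host crash.  The last
    // WAL may never have reached the MANIFEST, which is why the per-block
    // metadata alone must be enough to reconstruct the file (P2, P4).
    std::map<std::string, RecoveredFile> RecoverDFlashFiles(uint64_t owner_id) {
      std::map<std::string, RecoveredFile> files;
      for (uint64_t block : list_valid_blocks(owner_id)) {
        InitMeta meta = read_init_meta(block);
        RecoveredFile& f = files[meta.filename];
        if (f.block_ids.size() <= meta.pos) f.block_ids.resize(meta.pos + 1);
        f.block_ids[meta.pos] = block;   // place the block at its position in the file
      }
      return files;
    }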
  • 19. P5: Superblock • CURRENT is used to store the RocksDB “superblock” - It points to the current MANIFEST, which is used to reconstruct the DB when creating a new instance. We append the block metadata that points to the blocks forming the current MANIFEST (P5) [Figure: normal recovery path - CURRENT (on Posix) points to the MANIFEST (on DFlash), whose opcode + metadata records per file type are enough to recover the database in a new instance]
  • 21. RocksDB + DFlash: Prototype (1/2) • Optimize RocksDB for flash storage - Implement a user-space append-only FS that deals with flash constraints • Append-only: updates are re-written and old data invalidated -> the LSM understands this logic • Page cache implemented in user space; use direct I/O • Only “sync” complete pages, and prefer closed blocks • In case of a write failure, write to a new block (or mark the bad page and retry) - Implement RocksDB's file classes for DFlash (see the sketch after this slide): • WritableFile(): sequential writes -> the only way to write to secondary storage • SequentialFile() -> used primarily for sstable user data and recovery • RandomAccessFile() -> used primarily for metadata (e.g., CRC checks) • Use the flash block as the central piece for storage optimizations - The Open-Channel SSD fabric is configured first • Define the block size - across LUNs and channels to exploit parallelism • Define different types of LUNs with different block features - RocksDB is configured with standard parameters (e.g., write buffer, cache) • The DFlash backend tunes these parameters based on the type of LUN and block
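A minimal sketch of the WritableFile() side of this. Append/Flush/Sync/Close come from RocksDB's Env API of that era; DFlashFile and the Buffer/WriteFullPages/PadAndCloseBlock helpers are hypothetical names for the buffering described above, not the actual implementation:

    #include <rocksdb/env.h>
    #include <rocksdb/slice.h>
    #include <rocksdb/status.h>

    // Hypothetical user-space handle over a set of provisioned flash blocks;
    // it stages appends in a page-aligned buffer and issues direct I/O per page.
    class DFlashFile;

    class DFlashWritableFile : public rocksdb::WritableFile {
     public:
      explicit DFlashWritableFile(DFlashFile* file) : file_(file) {}

      // Sequential appends are the only way data reaches secondary storage.
      rocksdb::Status Append(const rocksdb::Slice& data) override {
        return Buffer(data);                   // stage in the page-aligned buffer
      }
      rocksdb::Status Flush() override {
        return rocksdb::Status::OK();          // data is already staged
      }
      rocksdb::Status Sync() override {
        return WriteFullPages();               // only "sync" complete flash pages
      }
      rocksdb::Status Close() override {
        return PadAndCloseBlock();             // pad the last page, write close metadata
      }

     private:
      rocksdb::Status Buffer(const rocksdb::Slice& data);
      rocksdb::Status WriteFullPages();
      rocksdb::Status PadAndCloseBlock();
      DFlashFile* file_;
    };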
  • 22. RocksDB  +  DFlash:  Prototype  (2/2) 22 • Use  LSM  merging  strategies  as  perfect  garbage  collection  (GC)   - All  blocks  in  a  DFlash  file  are  either  active  or  inactive  -­‐>  no  need  to  GC  in  SSD   - Reduce  over-­‐provisioning  significantly  (~5%)   - Predictable  latency  -­‐>  SSD  is  in  stable  state  from  the  beginning   • Reuse  RocksDB  concepts  and  abstractions  as  much  as  possible   - Store  private  metadata  in  MANIFEST   - Store  superblock  in  CURRENT   - Minimize  the  amount  of  “visible”  metadata  -­‐  use  OOB,  Root  FS,  etc.   • Separate  persistent  (meta)data  between  “fast”  and  “static”   - Fast  data  is  all  user  data  (i.e.,  sstables)  and  the  WAL   - Fast  metadata  that  follows  user  data  rates  (i.e.,  MANIFEST)   - Static  metadata  is  written  once  and  seldom  updated   • CURRENT:  Superblock  for  MANIFEST   • LOCK  and  IDENTITY
  • 23. Architecture: RocksDB with DFlash [Figure: the LSM logic sits on top of the DFlash storage backend - fast storage (sstables, DB log/WALs, MANIFEST metadata) goes to the Open-Channel SSD, while static metadata (CURRENT, LOCK, IDENTITY, info LOG) stays on a Posix FS; Env DFlash provides the DFlash file classes (DFWritableFile(), DFRandomAccessFile(), DFSequentialFile()) for persist operations, environment tuning from the Env options (flash characteristics on init), and private metadata for close and crash handling]
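The fast/static split on this slide can be sketched as an Env that dispatches by file type. EnvWrapper and NewWritableFile are part of RocksDB's Env API; DFlashRoutingEnv, the second Env instance, and the IsFastFile() suffix matching are illustrative assumptions (the prototype keys the decision off the file type known to the backend, not the name):

    #include <memory>
    #include <string>
    #include <rocksdb/env.h>

    // Route sstables, the WAL, and the MANIFEST to the DFlash backend; keep
    // CURRENT, LOCK, IDENTITY, and the info LOG on a regular Posix file system.
    class DFlashRoutingEnv : public rocksdb::EnvWrapper {
     public:
      DFlashRoutingEnv(rocksdb::Env* posix_env, rocksdb::Env* dflash_env)
          : rocksdb::EnvWrapper(posix_env), dflash_(dflash_env) {}

      rocksdb::Status NewWritableFile(const std::string& fname,
                                      std::unique_ptr<rocksdb::WritableFile>* result,
                                      const rocksdb::EnvOptions& options) override {
        if (IsFastFile(fname))                    // .sst, .log (WAL), MANIFEST-*
          return dflash_->NewWritableFile(fname, result, options);
        return target()->NewWritableFile(fname, result, options);
      }

     private:
      // Illustrative classifier for the sketch only.
      static bool IsFastFile(const std::string& fname) {
        return fname.find(".sst") != std::string::npos ||
               fname.find(".log") != std::string::npos ||
               fname.find("MANIFEST") != std::string::npos;
      }
      rocksdb::Env* dflash_;
    };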
  • 24. Architecture: DFlash + LightNVM [Figure: full stack - the RocksDB LSM uses Env DFlash (DFWritableFile, DFSequentialFile, DFRandomAccessFile) for sstables, the WAL, and the MANIFEST, with data placement controlled by the LSM; Env DFlash tracks free/used/bad blocks and sector calculations, provisions blocks via get_block()/put_block() through the DFlash target and block manager (BM) in the kernel, and drives the I/O path through the block device to the Open-Channel SSD; Env Posix (PxWritableFile, PxSequentialFile, PxRandomAccessFile) keeps CURRENT, LOCK, IDENTITY, and the info LOG on a file system over a traditional SSD; device features feed the Env options, optimizations, and init code]
  • 25. Architecture: DFlash + LightNVM • The LSM is the FTL - The DFlash target and the RocksDB storage backend take care of provisioning flash blocks • Optimized critical I/O path - Sstables, the WAL, and the MANIFEST are stored on the Open-Channel SSD, where we can provide QoS • Enable a distributed RocksDB architecture - The BM abstracts the storage fabric (e.g., NVM) and can potentially provide blocks from different drives -> a single address space [Figure: same stack diagram as the previous slide]
  • 26. QEMU Evaluation: Writes (RocksDB make release, 4 threads, page-aligned write buffer)
    Entry keys       DFLASH (1 LUN)   POSIX
    10000 keys       70 MB/s          25 MB/s
    100000 keys      40 MB/s          25 MB/s
    1000000 keys     25 MB/s          20 MB/s
    (Bare metal: ~180 MB/s)
- The DFlash write page cache + direct I/O + a flash-page-aligned write buffer is better optimized than best-effort techniques and top-down optimizations (RocksDB parameters). The write buffer is required by RocksDB due to small WAL writes - If we manually tune buffer sizes with Posix, we obtain similar results; however, it requires a lot of experimentation for each configuration
  • 27. QEMU Evaluation: Reads (RocksDB make release, no page cache support in DFlash)
    Entry keys       DFLASH (1 LUN)   POSIX
    10000 keys       5 MB/s           300 MB/s
    100000 keys      5 MB/s           500 MB/s
    1000000 keys     5 MB/s           570 MB/s
- Without a DFlash page cache we need to issue an I/O for each read! • Sequential 20-byte reads within the same page would issue separate PAGE_SIZE I/Os
  • 28. QEMU Evaluation: “Fixing” Reads (RocksDB make release, simple page cache for reads)
    Entry keys       DFLASH (1 LUN)   DFLASH (1 LUN, + simple page cache)   POSIX
    10000 keys       5 MB/s           160 MB/s                              300 MB/s
    100000 keys      5 MB/s           280 MB/s                              500 MB/s
    1000000 keys     5 MB/s           300 MB/s                              570 MB/s
- Posix + buffered I/O using Linux's page cache is still better, but we have confirmed our hypothesis ➡ A user-space page cache is a necessary optimization when the generic OS page cache is bypassed. Other databases use this technique (e.g., Oracle, MySQL)
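A rough sketch of the kind of "simple page cache" used for the middle column, assuming page-granularity reads through a hypothetical dflash_read_page() and no eviction policy; the point is only to avoid one device I/O per small sequential read:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <unordered_map>
    #include <vector>

    constexpr size_t kPageSize = 4096;   // assumption for the sketch

    // Hypothetical direct-I/O read of one full flash page at a page-aligned offset.
    bool dflash_read_page(uint64_t page_offset, char* dst);

    class SimplePageCache {
     public:
      // Serve small sequential reads from cached pages; fetch each page once.
      bool Read(uint64_t offset, size_t n, char* dst) {
        while (n > 0) {
          const uint64_t page = offset / kPageSize;
          const size_t in_page = offset % kPageSize;
          const size_t chunk = std::min(n, kPageSize - in_page);
          auto it = pages_.find(page);
          if (it == pages_.end()) {
            std::vector<char> buf(kPageSize);
            if (!dflash_read_page(page * kPageSize, buf.data())) return false;
            it = pages_.emplace(page, std::move(buf)).first;
          }
          std::memcpy(dst, it->second.data() + in_page, chunk);
          dst += chunk; offset += chunk; n -= chunk;
        }
        return true;
      }

     private:
      std::unordered_map<uint64_t, std::vector<char>> pages_;   // no eviction (sketch only)
    };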
  • 29. QEMU Evaluation: Insights • The Posix backend and the DFlash backend (with 1 LUN) should achieve very similar throughput for reads/writes when using the same page cache and write buffer optimizations • But… - DFlash makes it possible to optimize buffer and cache sizes based on flash characteristics - DFlash knows which file class is calling - we can do prefetching for sequential reads (DFSequentialFile) at block granularity - DFlash is designed to implement a flash-optimized page cache using direct I/O • If the Open-Channel SSD exposes several LUNs, we can exploit parallelism within DFlash and RocksDB's LSM write/read patterns - How many LUNs are exposed and how they are organized is controller specific
  • 30. CNEX WestLake SDK: Overview [Chart: read/write performance (MB/s) vs. number of active LUNs (1-63)*] FPGA prototype platform before ASIC: - PCIe G3x4 or PCIe G2x8 - 4x10GE NVMoE - 40-bit DDR3 - 16 CH NAND * Not real performance results - FPGA prototype, not ASIC
  • 31. CNEX WestLake Evaluation (RocksDB make release, 4 threads)
                                    Writes (1 LUN)   Reads (1 LUN)   Writes (8 LUNs)   Reads (8 LUNs)   Writes (64 LUNs)   Reads (64 LUNs)
    RocksDB DFLASH, 10000 keys      21 MB/s          40 MB/s         X                 X                 X                  X
    RocksDB DFLASH, 100000 keys     21 MB/s          40 MB/s         X                 X                 X                  X
    RocksDB DFLASH, 1000000 keys    21 MB/s          40 MB/s         X                 X                 X                  X
    Raw DFLASH (with fio)           32 MB/s          64 MB/s         190 MB/s          180 MB/s          920 MB/s           1.3 GB/s
• We focus on a single I/O stream for the first prototype -> 1 LUN
  • 32. CNEX WestLake Evaluation: Insights • RocksDB checks sstable integrity on writes (intermittent reads) - We pay the price of not having an optimized page cache on writes as well - Reads and writes are mixed on one single LUN • Ongoing work: exploit parallelism in RocksDB's I/O patterns [Figure: each RocksDB stream (active/read-only memtables, WAL, MANIFEST, sstable merging & compaction) generates its own read/write pattern; DFlash maps them onto virtual LUNs backed by the physical LUNs/channels of the WestLake-based Open-Channel SSD via get_block()/put_block()] - Do not mix R/W - Different VLUN per path - Different VLUN types - Enables I/O scheduling - Block pool in DFlash (prefetching)
  • 33. CNEX WestLake Evaluation: Insights • Also, on any Open-Channel SSD - DFlash will not take a performance hit when the SSD triggers GC - RocksDB does GC when merging sstables in LSM levels > L0 • The SSD steady state is improved (and reached from the beginning) • We achieve predictable latency [Chart: IOPS over time]
  • 34. Status and ongoing work • Status: - get_block/put_block interface (first iteration) through the DFlash target - RocksDB DFlash target that plugs into LightNVM. Source code to test is available (RocksDB, kernel, and QEMU). Working on WestLake upstreaming • Ongoing: - Implement functions to increase the synergy between the LSM and the storage backend (i.e., tune the write buffer based on the block size) -> upstreaming - Support libaio to enable async I/O in the DFlash storage backend • Need to deal with RocksDB design decisions (e.g., Get() assumes sync I/O) - Exploit device parallelism within RocksDB's internal structures - Define different types of virtual LUNs and expose them to the application - Other optimizations: double buffering, aligned memory in the LSM, etc. - Move RocksDB DFlash's logic to liblightnvm -> an append-only FS for flash
  • 35. Conclusions • Application Driven Storage - Demo working on real hardware: • (RocksDB -> LightNVM -> WestLake-powered SSD) - QEMU support for testing and development - More Open-Channel SSDs coming soon • RocksDB - DFlash dedicated backend -> an append-only FS optimized for flash - Sets the basis for moving to a distributed architecture while guaranteeing performance constraints (especially in terms of latency) - Intention to upstream the whole DFlash storage backend • LightNVM - Framework to support Open-Channel SSDs in Linux - DFlash target to support application FTLs
  • 36. 36 Towards  Application  Driven  Storage Optimizing  RocksDB  on  Open-­‐Channel  SSDs  with  LightNVM LinuxCon  Europe  2015 Questions? • Open-­‐Channel  SSD  Project:  https://github.com/OpenChannelSSD     • LightNVM:  https://github.com/OpenChannelSSD/linux     • RocksDB:  https://github.com/OpenChannelSSD/rocksdb   Javier  González  <javier@cnexlabs.com>