BLUESTORE: A NEW, FASTER STORAGE BACKEND FOR CEPH
Patrick McGarry
Ceph Days APAC Roadshow
2016
OUTLINE
● Ceph background and context
  – FileStore, and why POSIX failed us
  – NewStore – a hybrid approach
● BlueStore – a new Ceph OSD backend
  – Metadata
  – Data
● Performance
● Status and availability
● Summary
MOTIVATION
CEPH
● Object, block, and file storage in a single cluster
● All components scale horizontally
● No single point of failure
● Hardware agnostic, commodity hardware
● Self-manage whenever possible
● Open source (LGPL)
● “A Scalable, High-Performance Distributed File System”
● “performance, reliability, and scalability”
CEPH COMPONENTS
RGW (OBJECT)
A web services gateway for object storage, compatible with S3 and Swift
LIBRADOS
A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
RADOS
A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
RBD (BLOCK)
A reliable, fully-distributed block device with cloud platform integration
CEPHFS (FILE)
A distributed file system with POSIX semantics and scale-out metadata management
OBJECT STORAGE DAEMONS (OSDS)
[Diagram: several OSDs, each sitting on a local file system (xfs, btrfs, ext4) on its own disk, alongside a small cluster of monitors (M)]
OBJECT STORAGE DAEMONS (OSDS)
[Same diagram, highlighting that the layer between each OSD and its local file system (xfs, btrfs, ext4) is FileStore]
OBJECTSTORE AND DATA MODEL
● ObjectStore
  – abstract interface for storing local data
  – EBOFS, FileStore
● EBOFS
  – a user-space extent-based object file system
  – deprecated in favor of FileStore on btrfs in 2009
● Object – “file”
  – data (file-like byte stream)
  – attributes (small key/value)
  – omap (unbounded key/value)
● Collection – “directory”
  – placement group shard (slice of the RADOS pool)
  – sharded by 32-bit hash value
● All writes are transactions (sketched below)
  – Atomic + Consistent + Durable
  – Isolation provided by OSD
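To make the data model concrete, here is a minimal sketch of a transaction that bundles several mutations which must commit atomically. The types below are simplified stand-ins invented for illustration, not the real Ceph ObjectStore classes.

// Hypothetical, simplified model of the ObjectStore data model above.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Op {
    enum class Type { Write, SetAttrs, OmapSetKeys };
    Type type;
    std::string collection;                 // placement group shard, e.g. "0.6_head"
    std::string oid;                        // object name
    uint64_t offset;                        // byte offset (Write only)
    std::string data;                       // payload (Write only)
    std::map<std::string, std::string> kv;  // attrs or omap keys
};

// A transaction is an ordered list of ops the backend must apply
// atomically, consistently and durably; isolation comes from the OSD.
struct Transaction {
    std::vector<Op> ops;
    void write(std::string c, std::string o, uint64_t off, std::string bytes) {
        ops.push_back({Op::Type::Write, std::move(c), std::move(o), off, std::move(bytes), {}});
    }
    void setattrs(std::string c, std::string o, std::map<std::string, std::string> attrs) {
        ops.push_back({Op::Type::SetAttrs, std::move(c), std::move(o), 0, "", std::move(attrs)});
    }
    void omap_setkeys(std::string c, std::string o, std::map<std::string, std::string> kv) {
        ops.push_back({Op::Type::OmapSetKeys, std::move(c), std::move(o), 0, "", std::move(kv)});
    }
};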
FILESTORE
● FileStore
  – PG = collection = directory
  – object = file
● Leveldb
  – large xattr spillover
  – object omap (key/value) data
● Originally just for development...
  – later, the only supported backend (on XFS)
● /var/lib/ceph/osd/ceph-123/
  – current/
    ● meta/
      – osdmap123
      – osdmap124
    ● 0.1_head/
      – object1
      – object12
    ● 0.7_head/
      – object3
      – object5
    ● 0.a_head/
      – object4
      – object6
    ● db/
      – <leveldb files>
POSIX FAILS: TRANSACTIONS
● OSD carefully manages consistency of its data
● All writes are transactions
  – we need A + C + D; OSD provides I
● Most are simple
  – write some bytes to object (file)
  – update object attribute (file xattr)
  – append to update log (leveldb insert)
● ...but others are arbitrarily large/complex

[
  {
    "op_name": "write",
    "collection": "0.6_head",
    "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
    "length": 4194304,
    "offset": 0,
    "bufferlist length": 4194304
  },
  {
    "op_name": "setattrs",
    "collection": "0.6_head",
    "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
    "attr_lens": {
      "_": 269,
      "snapset": 31
    }
  },
  {
    "op_name": "omap_setkeys",
    "collection": "0.6_head",
    "oid": "#0:60000000::::head#",
    "attr_lens": {
      "0000000005.00000000000000000006": 178,
      "_info": 847
    }
  }
]
POSIX FAILS: TRANSACTIONS
● Btrfs transaction hooks

  /* trans start and trans end are dangerous, and only for
   * use by applications that know how to avoid the
   * resulting deadlocks
   */
  #define BTRFS_IOC_TRANS_START _IO(BTRFS_IOCTL_MAGIC, 6)
  #define BTRFS_IOC_TRANS_END _IO(BTRFS_IOCTL_MAGIC, 7)

● Writeback ordering

  #define BTRFS_MOUNT_FLUSHONCOMMIT (1 << 7)

● What if we hit an error? ceph-osd process dies?

  #define BTRFS_MOUNT_WEDGEONTRANSABORT (1 << …)

  – There is no rollback...
POSIX FAILS: TRANSACTIONS
● Write-ahead journal (sketched below)
  – serialize and journal every ObjectStore::Transaction
  – then write it to the file system
● Btrfs parallel journaling
  – periodic sync takes a snapshot, then trim old journal entries
  – on OSD restart: roll back and replay journal against last snapshot
● XFS/ext4 write-ahead journaling
  – periodic sync, then trim old journal entries
  – on restart, replay entire journal
  – lots of ugly hackery to deal with events that aren't idempotent
    ● e.g., renames, collection delete + create, …
● full data journal → we double write everything → ~halve disk throughput
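The write-ahead journal above boils down to a simple discipline: make the serialized transaction durable first, then apply it. A minimal sketch follows (illustrative only; FileStore's real journal is far more involved, and the helper names are invented):

// Sketch of full-data write-ahead journaling: every byte is written twice,
// once to the journal and once to the file system, which is why disk
// throughput is roughly halved.
#include <fstream>
#include <functional>
#include <string>

void commit(const std::string& serialized_txn,
            std::ofstream& journal,
            const std::function<void()>& apply_to_fs) {
    journal << serialized_txn;   // 1) append the serialized transaction
    journal.flush();             // 2) make it durable (a real journal would fsync here)
    apply_to_fs();               // 3) apply the same mutations to the file system
}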
POSIX FAILS: ENUMERATION
● Ceph objects are distributed by a 32-bit hash
● Enumeration is in hash order
  – scrubbing
  – “backfill” (data rebalancing, recovery)
  – enumeration via librados client API
● POSIX readdir is not well-ordered
● Need O(1) “split” for a given shard/range
● Build directory tree by hash-value prefix (see the sketch below)
  – split any directory when size > ~100 files
  – merge when size < ~50 files
  – read entire directory, sort in-memory

…
DIR_A/
DIR_A/A03224D3_qwer
DIR_A/A247233E_zxcv
…
DIR_B/
DIR_B/DIR_8/
DIR_B/DIR_8/B823032D_foo
DIR_B/DIR_8/B8474342_bar
DIR_B/DIR_9/
DIR_B/DIR_9/B924273B_baz
DIR_B/DIR_A/
DIR_B/DIR_A/BA4328D2_asdf
…
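A rough sketch of the hash-prefix scheme behind the listing above; the function name and the fixed depth are illustrative (real FileStore splits and merges directories dynamically):

// Build a FileStore-style directory path from the leading hex nibbles of an
// object's 32-bit hash, e.g. depth 2 and hash 0xB823032D -> "DIR_B/DIR_8/".
#include <cstdint>
#include <cstdio>
#include <string>

std::string dir_for_hash(uint32_t hash, int depth) {
    char hex[9];
    std::snprintf(hex, sizeof(hex), "%08X", hash);
    std::string path;
    for (int i = 0; i < depth; ++i) {
        path += "DIR_";
        path += hex[i];
        path += '/';
    }
    return path;
}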
NEWSTORE
NEW OBJECTSTORE GOALS
● More natural transaction atomicity
● Avoid double writes
● Efficient object enumeration
● Efficient clone operation
● Efficient splice (“move these bytes from object X to object Y”)
● Efficient IO pattern for HDDs, SSDs, NVMe
● Minimal locking, maximum parallelism (between PGs)
● Full data and metadata checksums
● Inline compression
NEWSTORE – WE MANAGE NAMESPACE
● POSIX has the wrong metadata model for us
● Ordered key/value is a perfect match (key-encoding sketch below)
  – well-defined object name sort order
  – efficient enumeration and random lookup
● NewStore = rocksdb + object files
  – /var/lib/ceph/osd/ceph-123/
    ● db/
      – <rocksdb, leveldb, whatever>
    ● blobs.1/
      – 0
      – 1
      – ...
    ● blobs.2/
      – 100000
      – 100001
      – ...

[Diagram: three OSDs running NewStore, with RocksDB for metadata and object files on HDD and/or SSD devices]
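A sketch of why an ordered key/value store fits: if the object name is encoded into a key that sorts by (pool, hash, name), then enumeration in hash order is just an ordered iteration. The encoding below is hypothetical, not NewStore's or BlueStore's actual key format:

// Fixed-width hex fields make lexicographic key order in the KV store match
// the (pool, hash, name) order the OSD needs for enumeration.
#include <cstdint>
#include <cstdio>
#include <string>

std::string object_key(uint64_t pool, uint32_t hash, const std::string& name) {
    char prefix[32];
    std::snprintf(prefix, sizeof(prefix), "O_%016llx_%08x_",
                  static_cast<unsigned long long>(pool), hash);
    return prefix + name;   // iterate keys with this prefix for ordered enumeration
}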
NEWSTORE FAIL: CONSISTENCY OVERHEAD
● RocksDB has a write-ahead log (“journal”)
● XFS/ext4(/btrfs) have their own journal (tree-log)
● Journal-on-journal has high overhead
  – each journal manages half of overall consistency, but incurs the same overhead
● write(2) + fsync(2) to new blobs.2/10302
  – 1 write + flush to block device
  – 1 write + flush to XFS/ext4 journal
● write(2) + fsync(2) on RocksDB log
  – 1 write + flush to block device
  – 1 write + flush to XFS/ext4 journal
NEWSTORE FAIL: ATOMICITY NEEDS WAL
● We can't overwrite a POSIX file as part of an atomic transaction
  – (we must preserve old data until the transaction commits)
● Writing overwrite data to a new file means many files for each object
● Write-ahead logging
  – put overwrite data in “WAL” records in RocksDB
  – commit atomically with the transaction
  – then overwrite the original file data
  – ...but then we're back to a double-write for overwrites
● Performance sucks again
● Overwrites dominate RBD block workloads
BLUESTORE
BLUESTORE
● BlueStore = Block + NewStore
  – consume raw block device(s)
  – key/value database (RocksDB) for metadata
  – data written directly to block device
  – pluggable block Allocator (policy)
● We must share the block device with RocksDB
  – implement our own rocksdb::Env (simplified sketch below)
  – implement tiny “file system” BlueFS
  – make BlueStore and BlueFS share device(s)

[Diagram: ObjectStore → BlueStore; data goes straight to the BlockDevice(s); metadata goes to RocksDB via BlueRocksEnv and BlueFS; an Allocator manages free space]
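The role of BlueRocksEnv is simply to route RocksDB's file I/O into BlueFS, which in turn writes to the raw device it shares with BlueStore. The sketch below models that layering with invented, heavily simplified interfaces; the real rocksdb::Env and BlueFS APIs are much larger.

// Simplified model of the stack: RocksDB -> (BlueRocksEnv-like shim) ->
// (BlueFS-like mini file system) -> raw block device.
#include <cstdint>
#include <string>

struct BlockDevice {                      // shared with the BlueStore data path
    virtual void write(uint64_t off, const std::string& buf) = 0;
    virtual ~BlockDevice() = default;
};

struct SimpleBlueFS {                     // stand-in for BlueFS
    explicit SimpleBlueFS(BlockDevice* dev) : dev_(dev) {}
    void append(const std::string& /*file*/, const std::string& buf) {
        dev_->write(next_off_, buf);      // real BlueFS also journals (file -> extent) metadata
        next_off_ += buf.size();
    }
    BlockDevice* dev_;
    uint64_t next_off_ = 0;
};

struct SimpleBlueRocksEnv {               // stand-in for BlueRocksEnv
    explicit SimpleBlueRocksEnv(SimpleBlueFS* fs) : fs_(fs) {}
    void append_to_log(const std::string& name, const std::string& record) {
        fs_->append(name, record);        // RocksDB file ops are passed straight to BlueFS
    }
    SimpleBlueFS* fs_;
};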
ROCKSDB: BLUEROCKSENV + BLUEFS
● class BlueRocksEnv : public rocksdb::EnvWrapper
  – passes file IO operations to BlueFS
● BlueFS is a super-simple “file system”
  – all metadata loaded in RAM on start/mount
  – no need to store block free list
  – coarse allocation unit (1 MB blocks)
  – all metadata is written to a journal
  – journal rewritten/compacted when it gets large

[Diagram: on-disk layout of superblock, journal extents, and data extents; journal records: file 10, file 11, file 12, file 12, file 13, rm file 12, file 13, ...]

● Map “directories” to different block devices
  – db.wal/ – on NVRAM, NVMe, SSD
  – db/ – level0 and hot SSTs on SSD
  – db.slow/ – cold SSTs on HDD
● BlueStore periodically balances free space
ROCKSDB: JOURNAL RECYCLING
● rocksdb LogReader only understands two modes
  – read until end of file (need accurate file size)
  – read all valid records, then ignore zeros at end (need zeroed tail)
● writing to “fresh” log “files” means > 1 IO for a log append
● modified upstream rocksdb to re-use previous log files
  – now resembles “normal” journaling behavior over a circular buffer
● works with vanilla RocksDB on files and on BlueFS
MULTI-DEVICE SUPPORT
● Single device
  – HDD or SSD
    ● rocksdb
    ● object data
● Two devices
  – 128MB of SSD or NVRAM
    ● rocksdb WAL
  – big device
    ● everything else
● Two devices
  – a few GB of SSD
    ● rocksdb WAL
    ● rocksdb (warm data)
  – big device
    ● rocksdb (cold data)
    ● object data
● Three devices
  – 128MB NVRAM
    ● rocksdb WAL
  – a few GB SSD
    ● rocksdb (warm data)
  – big device
    ● rocksdb (cold data)
    ● object data
METADATA
BLUESTORE METADATA
● Partition the key namespace for different metadata (see the sketch below)
  – S* – “superblock” metadata for the entire store
  – B* – block allocation metadata (free block bitmap)
  – T* – stats (bytes used, compressed, etc.)
  – C* – collection name → cnode_t
  – O* – object name → onode_t or bnode_t
  – L* – write-ahead log entries, promises of future IO
  – M* – omap (user key/value data, stored in objects)
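A sketch of how such prefix partitioning looks in practice; the prefixes follow the slide, but the encoding of the rest of each key is simplified and the helper names are invented:

// One leading byte selects the namespace; everything after it sorts within
// that namespace, so a prefix scan enumerates one kind of metadata.
#include <cstdint>
#include <string>

std::string onode_key(const std::string& encoded_object) { return "O" + encoded_object; }
std::string cnode_key(const std::string& encoded_coll)   { return "C" + encoded_coll; }
std::string wal_key(uint64_t seq)                        { return "L" + std::to_string(seq); }
std::string omap_key(uint64_t omap_head, const std::string& user_key) {
    return "M" + std::to_string(omap_head) + "." + user_key;
}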
CNODE
● Collection metadata
  – Interval of object namespace

  struct spg_t {
    uint64_t pool;
    uint32_t hash;
    shard_id_t shard;
  };
  struct bluestore_cnode_t {
    uint32_t bits;
  };

     shard pool hash       name
  C<NOSHARD,12,3d3e0000> “12.e3d3” = <19>

     shard pool hash       name snap   gen
  O<NOSHARD,12,3d3d880e,foo,NOSNAP,NOGEN> = …
  O<NOSHARD,12,3d3d9223,bar,NOSNAP,NOGEN> = …
  O<NOSHARD,12,3d3e02c2,baz,NOSNAP,NOGEN> = …
  O<NOSHARD,12,3d3e125d,zip,NOSNAP,NOGEN> = …
  O<NOSHARD,12,3d3e1d41,dee,NOSNAP,NOGEN> = …
  O<NOSHARD,12,3d3e3832,dah,NOSNAP,NOGEN> = …

● Nice properties
  – Ordered enumeration of objects
  – We can “split” collections by adjusting cnode metadata only (membership sketch below)
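A simplified membership test implied by bluestore_cnode_t::bits: a collection owns every object whose hash agrees with the collection's hash on `bits` bits. (The real code also bit-reverses hashes for key ordering; this sketch ignores that detail.)

#include <cstdint>

bool collection_contains(uint32_t coll_hash, uint32_t bits, uint32_t obj_hash) {
    uint32_t mask = bits >= 32 ? 0xffffffffu : ((1u << bits) - 1);
    return (coll_hash & mask) == (obj_hash & mask);
}

// Splitting is then metadata-only: increase `bits` by one and the objects
// partition themselves into the two resulting hash ranges.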
ONODE
● Per-object metadata
  – Lives directly in key/value pair
  – Serializes to 100s of bytes
● Size in bytes
● Inline attributes (user attr data)
● Data pointers (user byte data; lookup sketch below)
  – lextent_t → (blob, offset, length)
  – blob → (disk extents, csums, ...)
● Omap prefix/ID (user k/v data)

  struct bluestore_onode_t {
    uint64_t size;
    map<string,bufferptr> attrs;
    map<uint64_t,bluestore_lextent_t> extent_map;
    uint64_t omap_head;
  };
  struct bluestore_blob_t {
    vector<bluestore_pextent_t> extents;
    uint32_t compressed_length;
    bluestore_extent_ref_map_t ref_map;
    uint8_t csum_type, csum_order;
    bufferptr csum_data;
  };
  struct bluestore_pextent_t {
    uint64_t offset;
    uint64_t length;
  };
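A sketch of how a read resolves a logical offset through the onode's extent map; the types mirror the structs above but are simplified, and the helper itself is invented for illustration:

#include <cstdint>
#include <map>

struct Lextent { int64_t blob_id; uint64_t blob_off; uint64_t length; };

// logical offset -> lextent, as in bluestore_onode_t::extent_map
using ExtentMap = std::map<uint64_t, Lextent>;

// Find which blob (and offset within it) holds logical offset `off`.
bool map_offset(const ExtentMap& em, uint64_t off, Lextent* out, uint64_t* delta) {
    auto it = em.upper_bound(off);
    if (it == em.begin()) return false;                   // before the first extent
    --it;                                                 // extent starting at or before off
    uint64_t start = it->first;
    if (off >= start + it->second.length) return false;   // falls into a hole
    *out = it->second;
    *delta = off - start;                                 // read at blob_off + delta in that blob
    return true;
}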
BNODE
● Blob metadata
  – Usually blobs are stored in the onode
  – Sometimes we share blocks between objects (usually clones/snaps)
  – We need to reference count those extents
  – We still want to split collections and repartition extent metadata by hash

     shard pool hash       name snap   gen
  O<NOSHARD,12,3d3d9223,bar,NOSNAP,NOGEN> = onode
  O<NOSHARD,12,3d3e02c2> = bnode
  O<NOSHARD,12,3d3e02c2,baz,NOSNAP,NOGEN> = onode
  O<NOSHARD,12,3d3e125d> = bnode
  O<NOSHARD,12,3d3e125d,zip,NOSNAP,NOGEN> = onode
  O<NOSHARD,12,3d3e1d41,dee,NOSNAP,NOGEN> = onode
  O<NOSHARD,12,3d3e3832,dah,NOSNAP,NOGEN> = onode

● The onode value includes, and the bnode value is,
  map<int64_t,bluestore_blob_t> blob_map;
● lextent blob ids
  – > 0 → blob in onode
  – < 0 → blob in bnode
CHECKSUMS
● We scrub... periodically
  – window before we detect error
  – we may read bad data
  – we may not be sure which copy is bad
● We want to validate the checksum on every read
● Must store more metadata in the blobs (sizing helper below)
  – 32-bit csum metadata for a 4MB object and 4KB blocks = 4KB
  – larger csum blocks
    ● csum_order > 12
  – smaller csums
    ● crc32c_8 or 16
● IO hints
  – seq read + write → big chunks
  – compression → big chunks
● Per-pool policy
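The checksum overhead above is simple arithmetic: the csum block size is 1 << csum_order, and each block costs csum_size bytes of metadata. A small helper (names are illustrative) reproduces the slide's example, 4 MB at csum_order 12 with 4-byte crc32c giving 4 KB of checksum data:

#include <cstdint>

uint64_t csum_bytes(uint64_t object_bytes, unsigned csum_order, unsigned csum_size = 4) {
    uint64_t block  = 1ull << csum_order;                 // e.g. order 12 -> 4 KB blocks
    uint64_t blocks = (object_bytes + block - 1) / block; // round up
    return blocks * csum_size;                            // 4 MB / 4 KB * 4 B = 4 KB
}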
INLINE COMPRESSION
● 3x replication is expensive
  – Any scale-out cluster is expensive
● Lots of stored data is (highly) compressible
● Need largish extents to get compression benefit (64 KB, 128 KB)
  – may need to support small (over)writes
  – overwrites occlude/obscure compressed blobs
  – compacted (rewritten) when > N layers deep

[Diagram: a compressed blob spanning an object from start to end, with later uncompressed overwrites occluding parts of it; legend: allocated, written, written (compressed), uncompressed blob]
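As one illustration of the policy trade-off, a compressed blob is only worth keeping if it still saves space after rounding to allocation units. The following is a hypothetical sketch of that kind of decision, not BlueStore's actual heuristic:

#include <cstdint>

bool keep_compressed(uint64_t raw_len, uint64_t compressed_len, uint64_t alloc_unit) {
    auto round_up = [alloc_unit](uint64_t v) {
        return (v + alloc_unit - 1) / alloc_unit * alloc_unit;
    };
    // require at least one allocation unit of savings after rounding
    return round_up(compressed_len) + alloc_unit <= round_up(raw_len);
}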
DATA PATH
DATA PATH BASICS
Terms
● Sequencer
  – An independent, totally ordered queue of transactions
  – One per PG
● TransContext
  – State describing an executing transaction

Two ways to write (see the sketch below)
● New allocation
  – Any write larger than min_alloc_size goes to a new, unused extent on disk
  – Once that IO completes, we commit the transaction
● WAL (write-ahead-logged)
  – Commit temporary promise to (over)write data with transaction
    ● includes data!
  – Do async overwrite
  – Then clean up the temporary k/v pair
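A rough sketch of the decision between the two write paths; the function and its single threshold are illustrative, and BlueStore's real logic also weighs alignment and how existing blobs are overwritten:

#include <cstdint>

enum class WriteMode { NewAllocation, WriteAheadLog };

// Big writes go to fresh, unused extents and commit once the data IO lands;
// small (over)writes are logged in the KV transaction and applied afterwards.
WriteMode choose_write_mode(uint64_t length, uint64_t min_alloc_size) {
    return length >= min_alloc_size ? WriteMode::NewAllocation
                                    : WriteMode::WriteAheadLog;
}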
TRANSCONTEXT STATE MACHINE
[State-machine diagram. States: PREPARE, AIO_WAIT, KV_QUEUED, KV_COMMITTING, WAL_QUEUED, WAL_AIO_WAIT, WAL_CLEANUP, WAL_CLEANUP_COMMITTING, FINISH. A TransContext is prepared, waits for its AIO, is queued and committed to the key/value store, then either finishes or passes through the WAL stages before finishing. Annotations: “Initiate some AIO”, “Wait for next TransContext(s) in Sequencer to be ready”, “Wait for next commit batch”, Sequencer queue.]
CACHING
● OnodeSpace per collection
  – in-memory ghobject_t → Onode map of decoded onodes
● BufferSpace for in-memory blobs
  – may contain cached on-disk data
● Both buffers and onodes have lifecycles linked to a Cache
  – LRUCache – trivial LRU
  – TwoQCache – implements 2Q cache replacement algorithm (default)
● Cache is sharded for parallelism (see the sketch below)
  – Collection → shard mapping matches OSD's op_wq
  – same CPU context that processes client requests will touch the LRU/2Q lists
  – IO completion execution not yet sharded – TODO?
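A sketch of the sharding idea: hash the collection to a fixed shard so the CPU context that handles a PG's requests also owns that shard's LRU/2Q lists. The hash choice and shard count here are illustrative:

#include <cstddef>
#include <functional>
#include <string>

size_t cache_shard_for(const std::string& collection_id, size_t num_shards) {
    return std::hash<std::string>{}(collection_id) % num_shards;
}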
BLOCK FREE LIST
● FreelistManager
  – persist list of free extents to key/value store
  – prepare incremental updates for allocate or release
● Initial implementation
  – extent-based
      <offset> = <length>
  – kept in-memory copy
  – enforces an ordering on commits; freelist updates had to pass through a single thread/lock
      del 1600=100000
      put 1700=0fff00
  – small initial memory footprint, very expensive when fragmented
● New bitmap-based approach (see the XOR sketch below)
      <offset> = <region bitmap>
  – where region is N blocks
    ● 128 blocks = 8 bytes
  – use k/v merge operator to XOR allocation or release
      merge 10=0000000011
      merge 20=1110000000
  – RocksDB log-structured-merge tree coalesces keys during compaction
  – no in-memory state
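A sketch of the XOR merge-operator trick: allocate and release both submit a mask with 1s for the affected blocks, and the KV store folds the operands together during compaction, so BlueStore keeps no in-memory freelist state. (The 64-bit value here stands in for one region's bitmap.)

#include <cstdint>

uint64_t apply_xor_merge(uint64_t stored_bitmap, uint64_t operand) {
    return stored_bitmap ^ operand;   // flips the allocation bits in the region
}

// e.g. allocate blocks 0-1 of a region, later free block 1:
//   uint64_t v1 = apply_xor_merge(0b00, 0b11);  // -> 0b11 (both in use)
//   uint64_t v2 = apply_xor_merge(v1,   0b10);  // -> 0b01 (block 1 free again)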
BLOCK ALLOCATOR
● Allocator
  – abstract interface to allocate blocks
● StupidAllocator
  – extent-based
  – bin free extents by size (powers of 2)
  – choose sufficiently large extent closest to hint
  – highly variable memory usage
    ● btree of free extents
  – implemented, works
  – based on ancient ebofs policy
● BitmapAllocator
  – hierarchy of indexes
    ● L1: 2 bits = 2^6 blocks
    ● L2: 2 bits = 2^12 blocks
    ● ...
    ● 00 = all free, 11 = all used, 01 = mix
  – fixed memory consumption
    ● ~35 MB RAM per TB
SMR HDD
● Let's support them natively!
● 256MB zones/bands
  – must be written sequentially, but not all at once
  – libzbc supports ZAC and ZBC HDDs
  – host-managed or host-aware
● SMRAllocator
  – write pointer per zone
  – used + free counters per zone
  – Bonus: almost no memory!
● IO ordering
  – must ensure allocated writes reach disk in order
● Cleaning
  – store k/v hints: zone offset → object hash
  – pick emptiest closed zone, scan hints, move objects that are still there
  – opportunistically rewrite objects we read if the zone is flagged for cleaning soon
PERFORMANCE
HDD: SEQUENTIAL WRITE
[Chart: Ceph 10.1.0 Bluestore vs Filestore Sequential Writes; throughput (MB/s) vs IO size; series: FS HDD, BS HDD]

HDD: RANDOM WRITE
[Charts: Ceph 10.1.0 Bluestore vs Filestore Random Writes; throughput (MB/s) and IOPS vs IO size; series: FS HDD, BS HDD]

HDD: SEQUENTIAL READ
[Chart: Ceph 10.1.0 Bluestore vs Filestore Sequential Reads; throughput (MB/s) vs IO size; series: FS HDD, BS HDD]

HDD: RANDOM READ
[Charts: Ceph 10.1.0 Bluestore vs Filestore Random Reads; throughput (MB/s) and IOPS vs IO size; series: FS HDD, BS HDD]
SSD AND NVME?
● NVMe journal
  – random writes ~2x faster
  – some testing anomalies (problem with test rig kernel?)
● SSD only
  – similar to HDD result
  – small-write benefit is more pronounced
● NVMe only
  – more testing anomalies on test rig.. WIP
STATUS
STATUS
● Done
  – fully functional IO path with checksums and compression
  – fsck
  – bitmap-based allocator and freelist
● Current efforts
  – optimize metadata encoding efficiency
  – performance tuning
  – ZetaScale key/value db as RocksDB alternative
  – bounds on compressed blob occlusion
● Soon
  – per-pool properties that map to compression, checksum, IO hints
  – more performance optimization
  – native SMR HDD support
  – SPDK (kernel bypass for NVMe devices)
AVAILABILITY
● Experimental backend in Jewel v10.2.z (just released)
  – enable experimental unrecoverable data corrupting features = bluestore rocksdb
  – ceph-disk --bluestore DEV
    ● no multi-device magic provisioning just yet
  – predates checksums and compression
● Current master
  – new disk format
  – checksums
  – compression
● The goal...
  – stable in Kraken (Fall '16)
  – default in Luminous (Spring '17)
SUMMARY
● Ceph is great
● POSIX was a poor choice for storing objects
● RocksDB rocks and was easy to embed
● Our new BlueStore backend is awesome
● Full data checksums and inline compression!
THANK YOU!
Patrick McGarry
Dir Ceph Community
pmcgarry@redhat.com
@scuttlemonkey
More Related Content

What's hot

Intorduce to Ceph
Intorduce to CephIntorduce to Ceph
Intorduce to Ceph
kao kuo-tung
 
Ceph - A distributed storage system
Ceph - A distributed storage systemCeph - A distributed storage system
Ceph - A distributed storage system
Italo Santos
 
librados
libradoslibrados
librados
Patrick McGarry
 
HKG15-401: Ceph and Software Defined Storage on ARM servers
HKG15-401: Ceph and Software Defined Storage on ARM serversHKG15-401: Ceph and Software Defined Storage on ARM servers
HKG15-401: Ceph and Software Defined Storage on ARM servers
Linaro
 
Hadoop over rgw
Hadoop over rgwHadoop over rgw
Hadoop over rgw
zhouyuan
 
Unified readonly cache for ceph
Unified readonly cache for cephUnified readonly cache for ceph
Unified readonly cache for ceph
zhouyuan
 
Ceph and RocksDB
Ceph and RocksDBCeph and RocksDB
Ceph and RocksDB
Sage Weil
 
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on CephBuild an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Rongze Zhu
 
Community Update at OpenStack Summit Boston
Community Update at OpenStack Summit BostonCommunity Update at OpenStack Summit Boston
Community Update at OpenStack Summit Boston
Sage Weil
 
CephFS update February 2016
CephFS update February 2016CephFS update February 2016
CephFS update February 2016
John Spray
 
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud StorageCeph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
Sage Weil
 
Quick-and-Easy Deployment of a Ceph Storage Cluster with SLES
Quick-and-Easy Deployment of a Ceph Storage Cluster with SLESQuick-and-Easy Deployment of a Ceph Storage Cluster with SLES
Quick-and-Easy Deployment of a Ceph Storage Cluster with SLESJan Kalcic
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Odinot Stanislas
 
Introduction to redis - version 2
Introduction to redis - version 2Introduction to redis - version 2
Introduction to redis - version 2
Dvir Volk
 
Redis modules 101
Redis modules 101Redis modules 101
Redis modules 101
Dvir Volk
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
Sage Weil
 
Cephfs jewel mds performance benchmark
Cephfs jewel mds performance benchmarkCephfs jewel mds performance benchmark
Cephfs jewel mds performance benchmark
Xiaoxi Chen
 
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
Glenn K. Lockwood
 

What's hot (18)

Intorduce to Ceph
Intorduce to CephIntorduce to Ceph
Intorduce to Ceph
 
Ceph - A distributed storage system
Ceph - A distributed storage systemCeph - A distributed storage system
Ceph - A distributed storage system
 
librados
libradoslibrados
librados
 
HKG15-401: Ceph and Software Defined Storage on ARM servers
HKG15-401: Ceph and Software Defined Storage on ARM serversHKG15-401: Ceph and Software Defined Storage on ARM servers
HKG15-401: Ceph and Software Defined Storage on ARM servers
 
Hadoop over rgw
Hadoop over rgwHadoop over rgw
Hadoop over rgw
 
Unified readonly cache for ceph
Unified readonly cache for cephUnified readonly cache for ceph
Unified readonly cache for ceph
 
Ceph and RocksDB
Ceph and RocksDBCeph and RocksDB
Ceph and RocksDB
 
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on CephBuild an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
 
Community Update at OpenStack Summit Boston
Community Update at OpenStack Summit BostonCommunity Update at OpenStack Summit Boston
Community Update at OpenStack Summit Boston
 
CephFS update February 2016
CephFS update February 2016CephFS update February 2016
CephFS update February 2016
 
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud StorageCeph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
 
Quick-and-Easy Deployment of a Ceph Storage Cluster with SLES
Quick-and-Easy Deployment of a Ceph Storage Cluster with SLESQuick-and-Easy Deployment of a Ceph Storage Cluster with SLES
Quick-and-Easy Deployment of a Ceph Storage Cluster with SLES
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
 
Introduction to redis - version 2
Introduction to redis - version 2Introduction to redis - version 2
Introduction to redis - version 2
 
Redis modules 101
Redis modules 101Redis modules 101
Redis modules 101
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
 
Cephfs jewel mds performance benchmark
Cephfs jewel mds performance benchmarkCephfs jewel mds performance benchmark
Cephfs jewel mds performance benchmark
 
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
 

Viewers also liked

Ceph Day Taipei - Ceph Tiering with High Performance Architecture
Ceph Day Taipei - Ceph Tiering with High Performance Architecture Ceph Day Taipei - Ceph Tiering with High Performance Architecture
Ceph Day Taipei - Ceph Tiering with High Performance Architecture
Ceph Community
 
Ceph Day Shanghai - Ceph Performance Tools
Ceph Day Shanghai - Ceph Performance Tools Ceph Day Shanghai - Ceph Performance Tools
Ceph Day Shanghai - Ceph Performance Tools
Ceph Community
 
iSCSI Target Support for Ceph
iSCSI Target Support for Ceph iSCSI Target Support for Ceph
iSCSI Target Support for Ceph
Ceph Community
 
Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community
 
Ceph Day Seoul - Community Update
Ceph Day Seoul - Community UpdateCeph Day Seoul - Community Update
Ceph Day Seoul - Community Update
Ceph Community
 
Ceph Day Tokyo - Bit-Isle's 3 years footprint with Ceph
Ceph Day Tokyo - Bit-Isle's 3 years footprint with Ceph Ceph Day Tokyo - Bit-Isle's 3 years footprint with Ceph
Ceph Day Tokyo - Bit-Isle's 3 years footprint with Ceph
Ceph Community
 
Ceph Day Tokyo - High Performance Layered Architecture
Ceph Day Tokyo - High Performance Layered Architecture  Ceph Day Tokyo - High Performance Layered Architecture
Ceph Day Tokyo - High Performance Layered Architecture
Ceph Community
 
Ceph Day Tokyo - Bring Ceph to Enterprise
Ceph Day Tokyo - Bring Ceph to Enterprise Ceph Day Tokyo - Bring Ceph to Enterprise
Ceph Day Tokyo - Bring Ceph to Enterprise
Ceph Community
 
Ceph Day Tokyo - Delivering cost effective, high performance Ceph cluster
Ceph Day Tokyo - Delivering cost effective, high performance Ceph clusterCeph Day Tokyo - Delivering cost effective, high performance Ceph cluster
Ceph Day Tokyo - Delivering cost effective, high performance Ceph cluster
Ceph Community
 
Ceph Day Tokyo - Ceph Community Update
Ceph Day Tokyo - Ceph Community Update Ceph Day Tokyo - Ceph Community Update
Ceph Day Tokyo - Ceph Community Update
Ceph Community
 
Ceph Day Tokyo -- Ceph on All-Flash Storage
Ceph Day Tokyo -- Ceph on All-Flash StorageCeph Day Tokyo -- Ceph on All-Flash Storage
Ceph Day Tokyo -- Ceph on All-Flash Storage
Ceph Community
 
Ceph Day Tokyo - Ceph on ARM: Scaleable and Efficient
Ceph Day Tokyo - Ceph on ARM: Scaleable and Efficient Ceph Day Tokyo - Ceph on ARM: Scaleable and Efficient
Ceph Day Tokyo - Ceph on ARM: Scaleable and Efficient
Ceph Community
 
Ceph Day Shanghai - Ceph in Ctrip
Ceph Day Shanghai - Ceph in CtripCeph Day Shanghai - Ceph in Ctrip
Ceph Day Shanghai - Ceph in Ctrip
Ceph Community
 
Ceph Day Taipei - Bring Ceph to Enterprise
Ceph Day Taipei - Bring Ceph to EnterpriseCeph Day Taipei - Bring Ceph to Enterprise
Ceph Day Taipei - Bring Ceph to Enterprise
Ceph Community
 
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
Ceph Community
 
librados
libradoslibrados
librados
Ceph Community
 
Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance
Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance
Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance
Ceph Community
 
Ceph Day Shanghai - Recovery Erasure Coding and Cache Tiering
Ceph Day Shanghai - Recovery Erasure Coding and Cache TieringCeph Day Shanghai - Recovery Erasure Coding and Cache Tiering
Ceph Day Shanghai - Recovery Erasure Coding and Cache Tiering
Ceph Community
 
Ceph on 64-bit ARM with X-Gene
Ceph on 64-bit ARM with X-GeneCeph on 64-bit ARM with X-Gene
Ceph on 64-bit ARM with X-Gene
Ceph Community
 
Ceph Day Taipei - Community Update
Ceph Day Taipei - Community Update Ceph Day Taipei - Community Update
Ceph Day Taipei - Community Update
Ceph Community
 

Viewers also liked (20)

Ceph Day Taipei - Ceph Tiering with High Performance Architecture
Ceph Day Taipei - Ceph Tiering with High Performance Architecture Ceph Day Taipei - Ceph Tiering with High Performance Architecture
Ceph Day Taipei - Ceph Tiering with High Performance Architecture
 
Ceph Day Shanghai - Ceph Performance Tools
Ceph Day Shanghai - Ceph Performance Tools Ceph Day Shanghai - Ceph Performance Tools
Ceph Day Shanghai - Ceph Performance Tools
 
iSCSI Target Support for Ceph
iSCSI Target Support for Ceph iSCSI Target Support for Ceph
iSCSI Target Support for Ceph
 
Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph
 
Ceph Day Seoul - Community Update
Ceph Day Seoul - Community UpdateCeph Day Seoul - Community Update
Ceph Day Seoul - Community Update
 
Ceph Day Tokyo - Bit-Isle's 3 years footprint with Ceph
Ceph Day Tokyo - Bit-Isle's 3 years footprint with Ceph Ceph Day Tokyo - Bit-Isle's 3 years footprint with Ceph
Ceph Day Tokyo - Bit-Isle's 3 years footprint with Ceph
 
Ceph Day Tokyo - High Performance Layered Architecture
Ceph Day Tokyo - High Performance Layered Architecture  Ceph Day Tokyo - High Performance Layered Architecture
Ceph Day Tokyo - High Performance Layered Architecture
 
Ceph Day Tokyo - Bring Ceph to Enterprise
Ceph Day Tokyo - Bring Ceph to Enterprise Ceph Day Tokyo - Bring Ceph to Enterprise
Ceph Day Tokyo - Bring Ceph to Enterprise
 
Ceph Day Tokyo - Delivering cost effective, high performance Ceph cluster
Ceph Day Tokyo - Delivering cost effective, high performance Ceph clusterCeph Day Tokyo - Delivering cost effective, high performance Ceph cluster
Ceph Day Tokyo - Delivering cost effective, high performance Ceph cluster
 
Ceph Day Tokyo - Ceph Community Update
Ceph Day Tokyo - Ceph Community Update Ceph Day Tokyo - Ceph Community Update
Ceph Day Tokyo - Ceph Community Update
 
Ceph Day Tokyo -- Ceph on All-Flash Storage
Ceph Day Tokyo -- Ceph on All-Flash StorageCeph Day Tokyo -- Ceph on All-Flash Storage
Ceph Day Tokyo -- Ceph on All-Flash Storage
 
Ceph Day Tokyo - Ceph on ARM: Scaleable and Efficient
Ceph Day Tokyo - Ceph on ARM: Scaleable and Efficient Ceph Day Tokyo - Ceph on ARM: Scaleable and Efficient
Ceph Day Tokyo - Ceph on ARM: Scaleable and Efficient
 
Ceph Day Shanghai - Ceph in Ctrip
Ceph Day Shanghai - Ceph in CtripCeph Day Shanghai - Ceph in Ctrip
Ceph Day Shanghai - Ceph in Ctrip
 
Ceph Day Taipei - Bring Ceph to Enterprise
Ceph Day Taipei - Bring Ceph to EnterpriseCeph Day Taipei - Bring Ceph to Enterprise
Ceph Day Taipei - Bring Ceph to Enterprise
 
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
 
librados
libradoslibrados
librados
 
Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance
Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance
Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance
 
Ceph Day Shanghai - Recovery Erasure Coding and Cache Tiering
Ceph Day Shanghai - Recovery Erasure Coding and Cache TieringCeph Day Shanghai - Recovery Erasure Coding and Cache Tiering
Ceph Day Shanghai - Recovery Erasure Coding and Cache Tiering
 
Ceph on 64-bit ARM with X-Gene
Ceph on 64-bit ARM with X-GeneCeph on 64-bit ARM with X-Gene
Ceph on 64-bit ARM with X-Gene
 
Ceph Day Taipei - Community Update
Ceph Day Taipei - Community Update Ceph Day Taipei - Community Update
Ceph Day Taipei - Community Update
 

Similar to Ceph Day KL - Bluestore

BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
Sage Weil
 
Ceph Tech Talk: Bluestore
Ceph Tech Talk: BluestoreCeph Tech Talk: Bluestore
Ceph Tech Talk: Bluestore
Ceph Community
 
Linuxcommands 091018105536-phpapp01
Linuxcommands 091018105536-phpapp01Linuxcommands 091018105536-phpapp01
Linuxcommands 091018105536-phpapp01
Nagarajan Kamalakannan
 
DEVIEW 2013
DEVIEW 2013DEVIEW 2013
DEVIEW 2013
Patrick McGarry
 
Diagnostics and Debugging
Diagnostics and DebuggingDiagnostics and Debugging
Diagnostics and DebuggingMongoDB
 
Scaling Dropbox
Scaling DropboxScaling Dropbox
Scaling Dropbox
C4Media
 
Ceph Internals
Ceph InternalsCeph Internals
Ceph Internals
Victor Santos
 
NSC #2 - Challenge Solution
NSC #2 - Challenge SolutionNSC #2 - Challenge Solution
NSC #2 - Challenge Solution
NoSuchCon
 
Bringing up Android on your favorite X86 Workstation or VM (AnDevCon Boston, ...
Bringing up Android on your favorite X86 Workstation or VM (AnDevCon Boston, ...Bringing up Android on your favorite X86 Workstation or VM (AnDevCon Boston, ...
Bringing up Android on your favorite X86 Workstation or VM (AnDevCon Boston, ...
Ron Munitz
 
Logging and ranting / Vytis Valentinavičius (Lamoda)
Logging and ranting / Vytis Valentinavičius (Lamoda)Logging and ranting / Vytis Valentinavičius (Lamoda)
Logging and ranting / Vytis Valentinavičius (Lamoda)
Ontico
 
Hadoop in Data Warehousing
Hadoop in Data WarehousingHadoop in Data Warehousing
Hadoop in Data Warehousing
Alexey Grigorev
 
Experiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah WatkinsExperiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah Watkins
Ceph Community
 
RocksDB meetup
RocksDB meetupRocksDB meetup
RocksDB meetup
Javier González
 
PyLadies Talk: Learn to love the command line!
PyLadies Talk: Learn to love the command line!PyLadies Talk: Learn to love the command line!
PyLadies Talk: Learn to love the command line!
Blanca Mancilla
 
Ceph Day Melbourne - Troubleshooting Ceph
Ceph Day Melbourne - Troubleshooting Ceph Ceph Day Melbourne - Troubleshooting Ceph
Ceph Day Melbourne - Troubleshooting Ceph
Ceph Community
 
What's new in Redis v3.2
What's new in Redis v3.2What's new in Redis v3.2
What's new in Redis v3.2
Itamar Haber
 
Building Data applications with Go: from Bloom filters to Data pipelines / FO...
Building Data applications with Go: from Bloom filters to Data pipelines / FO...Building Data applications with Go: from Bloom filters to Data pipelines / FO...
Building Data applications with Go: from Bloom filters to Data pipelines / FO...
Sergii Khomenko
 
Code GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limitersCode GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limiters
Marina Kolpakova
 
2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph
Ceph Community
 
Stripe CTF3 wrap-up
Stripe CTF3 wrap-upStripe CTF3 wrap-up
Stripe CTF3 wrap-up
Stripe
 

Similar to Ceph Day KL - Bluestore (20)

BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
 
Ceph Tech Talk: Bluestore
Ceph Tech Talk: BluestoreCeph Tech Talk: Bluestore
Ceph Tech Talk: Bluestore
 
Linuxcommands 091018105536-phpapp01
Linuxcommands 091018105536-phpapp01Linuxcommands 091018105536-phpapp01
Linuxcommands 091018105536-phpapp01
 
DEVIEW 2013
DEVIEW 2013DEVIEW 2013
DEVIEW 2013
 
Diagnostics and Debugging
Diagnostics and DebuggingDiagnostics and Debugging
Diagnostics and Debugging
 
Scaling Dropbox
Scaling DropboxScaling Dropbox
Scaling Dropbox
 
Ceph Internals
Ceph InternalsCeph Internals
Ceph Internals
 
NSC #2 - Challenge Solution
NSC #2 - Challenge SolutionNSC #2 - Challenge Solution
NSC #2 - Challenge Solution
 
Bringing up Android on your favorite X86 Workstation or VM (AnDevCon Boston, ...
Bringing up Android on your favorite X86 Workstation or VM (AnDevCon Boston, ...Bringing up Android on your favorite X86 Workstation or VM (AnDevCon Boston, ...
Bringing up Android on your favorite X86 Workstation or VM (AnDevCon Boston, ...
 
Logging and ranting / Vytis Valentinavičius (Lamoda)
Logging and ranting / Vytis Valentinavičius (Lamoda)Logging and ranting / Vytis Valentinavičius (Lamoda)
Logging and ranting / Vytis Valentinavičius (Lamoda)
 
Hadoop in Data Warehousing
Hadoop in Data WarehousingHadoop in Data Warehousing
Hadoop in Data Warehousing
 
Experiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah WatkinsExperiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah Watkins
 
RocksDB meetup
RocksDB meetupRocksDB meetup
RocksDB meetup
 
PyLadies Talk: Learn to love the command line!
PyLadies Talk: Learn to love the command line!PyLadies Talk: Learn to love the command line!
PyLadies Talk: Learn to love the command line!
 
Ceph Day Melbourne - Troubleshooting Ceph
Ceph Day Melbourne - Troubleshooting Ceph Ceph Day Melbourne - Troubleshooting Ceph
Ceph Day Melbourne - Troubleshooting Ceph
 
What's new in Redis v3.2
What's new in Redis v3.2What's new in Redis v3.2
What's new in Redis v3.2
 
Building Data applications with Go: from Bloom filters to Data pipelines / FO...
Building Data applications with Go: from Bloom filters to Data pipelines / FO...Building Data applications with Go: from Bloom filters to Data pipelines / FO...
Building Data applications with Go: from Bloom filters to Data pipelines / FO...
 
Code GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limitersCode GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limiters
 
2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph
 
Stripe CTF3 wrap-up
Stripe CTF3 wrap-upStripe CTF3 wrap-up
Stripe CTF3 wrap-up
 

Recently uploaded

GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 

Recently uploaded (20)

GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 

Ceph Day KL - Bluestore

  • 1. B L U E S TO R E : A N E W, FASTER S T O R A G E B A C K E N D F O R C E P H Patrick McGarry Ceph Days APAC Roadshow 2016
  • 2. 2 O UTLIN E ● Ce p h b a c k g r o u n d a n d c o n t e x t – – FileStore, a n d w h y POSIX failed us Ne wS to r e – a h y b r i d a p p r o a c h ● BlueStore – a n e w Ce p h OSD b a c k e n d – – M e t a d a t a D a t a ● ● ● Performance Status a n d availability S u m m a r y
  • 4. CEPH ● ● ● ● ● ● Object, block, a n d file storage in a single cluster All c o m p o n e n t s scale horizontally N o single p o in t of failure H a r d w a r e agnostic, c o m m o d i t y h a r d w a r e Self-manage w h e n e v e r possible O p e n source (LGPL) ● ● “ A Scalable, High-Performance Distributed File S y s t e m ” “ p e r f o r ma n c e , reliability, a n d scalability” 4
  • 5. CEPH COMPONENTS RGW A w e b services g a t e w a y for o b je ct storage, co mp a t ib le w i t h S3 a n d Swift LIBRADOS A library a llo wing a p p s t o directly access RADOS (C, C + + , Java, Python, Ruby, PHP) RADOS A software -based, reliable, a u t o n o m o u s , d is t rib ute d o bject store c o m p r i s e d of self-healing, self-managing, intelligent st o ra g e n o d e s a n d lig h t we ig h t mo n it o rs RBD A reliable, fully-distributed block d e vice w i t h clo u d p la t f o rm in t e g rat ion CEPHFS A d ist ribut ed file s y s t e m w i t h POSIX se ma n t ics a n d scale-out m e t a d a t a m a n a g e m e n t OBJECT 5 BLOCK FILE
  • 6. OBJECT STORAGE DAEMONS (OSDS) FS DISK OSD DISK OSD FS DISK OSD FS DISK OSD FS xfs b t rfs ex t 4 M M M 6
  • 7. OBJECT STORAGE DAEMONS (OSDS) FS DISK OSD DISK OSD FS DISK OSD FS DISK OSD FS xfs b t rfs ex t 4 M M M FileStore 7 FileStoreFileStoreFileStore
  • 8. ● 8 ObjectStore – – abs t ract interface for storing local d a t a EBOFS, FileStore ● EBOFS – – a us er -s pac e e x t e n t - b a s e d o b j e c t file s y s t e m deprec at ed in f av or of FileStore o n btrfs in 2 0 0 9 ● Object – “ file ” – – – d a t a (file-like b y t e s t ream ) at t ributes (small key/value) o m a p ( u n b o u n d e d key/value) ● Collection – “ d i r e c t o r y ” – – p l a c e m e n t g r o u p shard (slice of t h e RADOS pool) s h a r d e d b y 3 2 - b i t h a s h v a l u e ● All writes are transactions – – A t o m i c + C o n s i s t e n t + D u r a b l e Isolation prov ided b y OSD OBJECTSTORE A N D DATA MODEL
  • 9. ● 9 FileSt ore – – PG = collection = directory object = file ● Le v e ld b – – large x a t t r spillover object o m a p (key/value) d a t a ● Originally just for development... – later, o n l y s u p p o r t e d b a c k e n d ( o n XFS) ● /var/lib/ceph/osd/ceph-123/ – current/ ● meta/ – – osdmap123 osdmap124 ● 0.1_head/ – – object1 object12 ● 0.7_head/ – – object3 object5 ● 0.a_head/ – – object4 object6 ● db/ – <leveldb files> FILESTORE
  • 10. ● 1 0 OSD carefully m a n a g e s c o n s is te n c y of its d a t a ● All w rite s a re tra n s a c tio n s – w e n e e d A + C + D ; OSD prov ides I ● M o s t a re s i m p l e – – – w r i t e s o m e b y t e s t o objec t (file) u p d a t e objec t a t t r i b u t e (file x a t t r ) a p p e n d t o u p d a t e log (lev eldb insert) ...but o t h e r s a re arbitrarily l a r g e / c o m p l e x [ { "op_name": "write", "collection": "0.6_head", "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#", "length": 4194304, "offset": 0, "bufferlist length": 4194304 }, { "op_name": "setattrs", "collection": "0.6_head", "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#", "attr_lens": { "_": 269, "snapset": 31 } }, { "op_name": "omap_setkeys", "collection": "0.6_head", "oid": "#0:60000000::::head#", "attr_lens": { "0000000005.00000000000000000006": 178, "_info": 847 } } ] POSIX FAILS: TRANSACTIONS
  • 11. POSIX FAILS: TRANSACTIONS
    ● Btrfs transaction hooks
        /* trans start and trans end are dangerous, and only for
         * use by applications that know how to avoid the
         * resulting deadlocks */
        #define BTRFS_IOC_TRANS_START  _IO(BTRFS_IOCTL_MAGIC, 6)
        #define BTRFS_IOC_TRANS_END    _IO(BTRFS_IOCTL_MAGIC, 7)
    ● Writeback ordering
        #define BTRFS_MOUNT_FLUSHONCOMMIT (1 << 7)
    ● What if we hit an error? ceph-osd process dies?
      – There is no rollback...
        #define BTRFS_MOUNT_WEDGEONTRANSABORT (1 << …)
  • 12. POSIX FAILS: TRANSACTIONS
    ● Write-ahead journal
      – serialize and journal every ObjectStore::Transaction
      – then write it to the file system
    ● Btrfs parallel journaling
      – periodic sync takes a snapshot, then trim old journal entries
      – on OSD restart: rollback and replay journal against last snapshot
    ● XFS/ext4 write-ahead journaling
      – periodic sync, then trim old journal entries
      – on restart, replay entire journal
      – lots of ugly hackery to deal with events that aren't idempotent
        ● e.g., renames, collection delete + create, …
    ● full data journal → we double write everything → ~halve disk throughput
  • 13. POSIX FAILS: ENUMERATION
    ● Ceph objects are distributed by a 32-bit hash
    ● Enumeration is in hash order
      – scrubbing
      – “backfill” (data rebalancing, recovery)
      – enumeration via librados client API
    ● POSIX readdir is not well-ordered
    ● Need O(1) “split” for a given shard/range
    ● Build directory tree by hash-value prefix
      – split any directory when size > ~100 files
      – merge when size < ~50 files
      – read entire directory, sort in-memory
        …
        DIR_A/
        DIR_A/A03224D3_qwer
        DIR_A/A247233E_zxcv
        …
        DIR_B/
        DIR_B/DIR_8/
        DIR_B/DIR_8/B823032D_foo
        DIR_B/DIR_8/B8474342_bar
        DIR_B/DIR_9/
        DIR_B/DIR_9/B924273B_baz
        DIR_B/DIR_A/
        DIR_B/DIR_A/BA4328D2_asdf
        …
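As a rough illustration of the workaround above, the sketch below derives a FileStore-like nested path from the leading nibbles of an object's 32-bit hash; splitting a too-large directory corresponds to adding one more nibble level. The hash function and path layout are simplified stand-ins, not FileStore's exact scheme.

    // Simplified hash-prefix directory nesting (not FileStore's exact layout).
    #include <cstdint>
    #include <cstdio>
    #include <string>

    // Stand-in for Ceph's object-name hash; any stable 32-bit hash shows the idea.
    uint32_t toy_hash(const std::string& name) {
      uint32_t h = 2166136261u;                    // FNV-1a
      for (unsigned char c : name) { h ^= c; h *= 16777619u; }
      return h;
    }

    // Place an object under `levels` nibble directories: DIR_X/DIR_Y/<HASH>_<name>.
    // Splitting a directory that grew too large means adding one more level for
    // that subtree and moving its files down, so enumeration stays in hash order.
    std::string object_path(const std::string& name, int levels) {
      uint32_t h = toy_hash(name);
      char hex[9];
      std::snprintf(hex, sizeof(hex), "%08X", h);
      std::string path;
      for (int i = 0; i < levels; ++i) {
        path += "DIR_";
        path += hex[i];
        path += "/";
      }
      return path + hex + "_" + name;
    }

    int main() {
      std::printf("%s\n", object_path("foo", 1).c_str());  // DIR_?/<hash>_foo
      std::printf("%s\n", object_path("foo", 2).c_str());  // DIR_?/DIR_?/<hash>_foo
    }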
  • 15. NEW OBJECTSTORE GOALS
    ● More natural transaction atomicity
    ● Avoid double writes
    ● Efficient object enumeration
    ● Efficient clone operation
    ● Efficient splice (“move these bytes from object X to object Y”)
    ● Efficient IO pattern for HDDs, SSDs, NVMe
    ● Minimal locking, maximum parallelism (between PGs)
    ● Full data and metadata checksums
    ● Inline compression
  • 16. NEWSTORE – WE MANAGE NAMESPACE
    ● POSIX has the wrong metadata model for us
    ● Ordered key/value is perfect match
      – well-defined object name sort order
      – efficient enumeration and random lookup
    ● NewStore = rocksdb + object files
      – /var/lib/ceph/osd/ceph-123/
        ● db/
          – <rocksdb, leveldb, whatever>
        ● blobs.1/
          – 0
          – 1
          – ...
        ● blobs.2/
          – 100000
          – 100001
          – ...
    [diagram: NewStore (RocksDB plus object files) running over HDD and SSD OSD devices]
  • 17. NEWSTORE FAIL: CONSISTENCY OVERHEAD
    ● RocksDB has a write-ahead log “journal”
    ● XFS/ext4(/btrfs) have their own journal (tree-log)
    ● Journal-on-journal has high overhead
      – each journal manages half of overall consistency, but incurs the same overhead
    ● write(2) + fsync(2) to new blobs.2/10302
      – 1 write + flush to block device
      – 1 write + flush to XFS/ext4 journal
    ● write(2) + fsync(2) on RocksDB log
      – 1 write + flush to block device
      – 1 write + flush to XFS/ext4 journal
  • 18. NEWSTORE FAIL: ATOMICITY NEEDS WAL
    ● We can't overwrite a POSIX file as part of an atomic transaction
      – (we must preserve old data until the transaction commits)
    ● Writing overwrite data to a new file means many files for each object
    ● Write-ahead logging
      – put overwrite data in a “WAL” record in RocksDB
      – commit atomically with transaction
      – then overwrite original file data
      – ...but then we're back to a double-write for overwrites
    ● Performance sucks again
    ● Overwrites dominate RBD block workloads
  • 20. BLUESTORE
    ● BlueStore = Block + NewStore
      – consume raw block device(s)
      – key/value database (RocksDB) for metadata
      – data written directly to block device
      – pluggable block Allocator (policy)
    ● We must share the block device with RocksDB
      – implement our own rocksdb::Env
      – implement tiny “file system” BlueFS
      – make BlueStore and BlueFS share device(s)
    [diagram: ObjectStore → BlueStore; data goes straight to the BlockDevice, metadata goes to RocksDB via BlueRocksEnv → BlueFS; an Allocator manages the shared block device(s)]
  • 21. ROCKSDB: BLUEROCKSENV + BLUEFS
    ● class BlueRocksEnv : public rocksdb::EnvWrapper
      – passes file IO operations to BlueFS
    ● BlueFS is a super-simple “file system”
      – all metadata loaded in RAM on start/mount
      – no need to store block free list
      – coarse allocation unit (1 MB blocks)
      – all metadata is written to a journal
      – journal rewritten/compacted when it gets large
    [diagram: BlueFS on-disk layout: superblock, journal records (file 10, file 11, file 12, rm file 12, file 13, ...), and data extents interleaved with more journal space]
    ● Map “directories” to different block devices
      – db.wal/ – on NVRAM, NVMe, SSD
      – db/ – level0 and hot SSTs on SSD
      – db.slow/ – cold SSTs on HDD
    ● BlueStore periodically balances free space
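A minimal sketch of the BlueFS idea described above, under the stated assumptions: all file metadata lives in an in-memory map, every change is appended to a journal, mount replays the journal, and the journal is rewritten as a compact snapshot when it grows. The record format and names are invented for illustration; real BlueFS journals extent lists and much more, not just file sizes.

    // Toy BlueFS-style metadata journal: replay on mount, compact when large.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    struct Record {                        // journal entry (illustrative format)
      enum { UPSERT, REMOVE } op;
      std::string name;
      uint64_t size = 0;                   // real BlueFS tracks extents, not just size
    };

    struct ToyBlueFS {
      std::vector<Record> journal;                  // what lives on disk
      std::map<std::string, uint64_t> files;        // all metadata, kept in RAM

      void replay() {                               // "mount": rebuild RAM state
        files.clear();
        for (const auto& r : journal)
          if (r.op == Record::UPSERT) files[r.name] = r.size;
          else files.erase(r.name);
      }
      void upsert(const std::string& n, uint64_t s) {
        journal.push_back({Record::UPSERT, n, s});
        files[n] = s;
        maybe_compact();
      }
      void remove(const std::string& n) {
        journal.push_back({Record::REMOVE, n, 0});
        files.erase(n);
        maybe_compact();
      }
      void maybe_compact() {                        // rewrite journal as one snapshot
        if (journal.size() < 8) return;
        std::vector<Record> fresh;
        for (const auto& f : files) fresh.push_back({Record::UPSERT, f.first, f.second});
        journal.swap(fresh);
      }
    };

    int main() {
      ToyBlueFS fs;
      fs.upsert("db/000010.sst", 64 << 20);
      fs.upsert("db.wal/000012.log", 16 << 20);
      fs.remove("db.wal/000012.log");
      fs.replay();                                   // simulate a restart
      std::cout << fs.journal.size() << " journal records, "
                << fs.files.size() << " files\n";
    }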
  • 22. ROCKSDB: JOURNAL RECYCLING
    ● rocksdb LogReader only understands two modes
      – read until end of file (need accurate file size)
      – read all valid records, then ignore zeros at end (need zeroed tail)
    ● writing to “fresh” log “files” means > 1 IO for a log append
    ● modified upstream rocksdb to re-use previous log files
      – now resembles “normal” journaling behavior over a circular buffer
    ● works with vanilla RocksDB on files and on BlueFS
  • 23. MULTI-DEVICE SUPPORT
    ● Single device (HDD or SSD)
      – rocksdb
      – object data
    ● Two devices
      – 128 MB of SSD or NVRAM: rocksdb WAL
      – big device: everything else
    ● Two devices
      – a few GB of SSD: rocksdb WAL, rocksdb (warm data)
      – big device: rocksdb (cold data), object data
    ● Three devices
      – 128 MB NVRAM: rocksdb WAL
      – a few GB SSD: rocksdb (warm data)
      – big device: rocksdb (cold data), object data
  • 25. BLUESTORE METADATA
    ● Partition namespace for different metadata
      – S* – “superblock” metadata for the entire store
      – B* – block allocation metadata (free block bitmap)
      – T* – stats (bytes used, compressed, etc.)
      – C* – collection name → cnode_t
      – O* – object name → onode_t or bnode_t
      – L* – write-ahead log entries, promises of future IO
      – M* – omap (user key/value data, stored in objects)
  • 26. CNODE
    ● Collection metadata
      – Interval of object namespace

        struct spg_t {
          uint64_t pool;
          uint32_t hash;
          shard_id_t shard;
        };
        struct bluestore_cnode_t {
          uint32_t bits;
        };

        shard     pool hash      name       bits
        C<NOSHARD,12,  3d3e0000> “12.e3d3” = <19>

        shard     pool hash      name snap    gen
        O<NOSHARD,12,  3d3d880e, foo, NOSNAP, NOGEN> = …
        O<NOSHARD,12,  3d3d9223, bar, NOSNAP, NOGEN> = …
        O<NOSHARD,12,  3d3e02c2, baz, NOSNAP, NOGEN> = …
        O<NOSHARD,12,  3d3e125d, zip, NOSNAP, NOGEN> = …
        O<NOSHARD,12,  3d3e1d41, dee, NOSNAP, NOGEN> = …
        O<NOSHARD,12,  3d3e3832, dah, NOSNAP, NOGEN> = …

    ● Nice properties
      – Ordered enumeration of objects
      – We can “split” collections by adjusting cnode metadata only
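The sketch below illustrates why ordered keys make collection splitting cheap: with a (pool, hash, name) key encoding, each collection is a contiguous key range, so a split only narrows the range (adjusting the cnode's bits) and never rewrites object keys. The encoding shown is a hypothetical stand-in, not BlueStore's real key format.

    // Illustration: objects keyed by (pool, hash, name) form contiguous ranges,
    // so a collection split only changes range boundaries, never object keys.
    #include <cstdint>
    #include <cstdio>
    #include <iterator>
    #include <map>
    #include <string>

    std::string make_key(uint64_t pool, uint32_t hash, const std::string& name) {
      char buf[32];
      std::snprintf(buf, sizeof(buf), "O.%016llx.%08x.",
                    static_cast<unsigned long long>(pool), hash);
      return std::string(buf) + name;      // fixed-width hex sorts numerically
    }

    int main() {
      std::map<std::string, std::string> kv;   // ordered, like RocksDB
      kv[make_key(12, 0x3d3d880e, "foo")] = "onode";
      kv[make_key(12, 0x3d3d9223, "bar")] = "onode";
      kv[make_key(12, 0x3d3e02c2, "baz")] = "onode";
      kv[make_key(12, 0x3d3e125d, "zip")] = "onode";

      // "Collection" = a hash interval. Splitting it just picks a midpoint;
      // enumerating each child is a range scan, and no object key is rewritten.
      auto lo  = kv.lower_bound(make_key(12, 0x3d3d0000, ""));
      auto mid = kv.lower_bound(make_key(12, 0x3d3e0000, ""));
      auto hi  = kv.lower_bound(make_key(12, 0x3d3f0000, ""));

      std::printf("child A: %zu objects\n", (size_t)std::distance(lo, mid));  // 2
      std::printf("child B: %zu objects\n", (size_t)std::distance(mid, hi));  // 2
    }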
  • 27. ONODE
    ● Per object metadata
      – Lives directly in key/value pair
      – Serializes to 100s of bytes
    ● Size in bytes
    ● Inline attributes (user attr data)
    ● Data pointers (user byte data)
      – lextent_t → (blob, offset, length)
      – blob → (disk extents, csums, ...)
    ● Omap prefix/ID (user k/v data)

        struct bluestore_onode_t {
          uint64_t size;
          map<string,bufferptr> attrs;
          map<uint64_t,bluestore_lextent_t> extent_map;
          uint64_t omap_head;
        };

        struct bluestore_blob_t {
          vector<bluestore_pextent_t> extents;
          uint32_t compressed_length;
          bluestore_extent_ref_map_t ref_map;
          uint8_t csum_type, csum_order;
          bufferptr csum_data;
        };

        struct bluestore_pextent_t {
          uint64_t offset;
          uint64_t length;
        };
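To show how those data pointers are used, here is a simplified, self-contained walk from a logical object offset through an lextent to a blob and then to a disk address. The structs are pared-down versions of the ones above, and the lookup logic is an illustrative approximation, not BlueStore's actual code.

    // Simplified lextent -> blob -> pextent resolution (illustrative only).
    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <vector>

    struct pextent_t { uint64_t offset, length; };        // physical disk extent

    struct blob_t {                                        // pared-down bluestore_blob_t
      std::vector<pextent_t> extents;                      // where the blob lives on disk
      // Map an offset inside the blob to a disk address.
      uint64_t map(uint64_t x) const {
        for (const auto& e : extents) {
          if (x < e.length) return e.offset + x;
          x -= e.length;
        }
        return ~0ull;                                      // out of range
      }
    };

    struct lextent_t { int blob_id; uint64_t blob_off, length; };

    struct onode_t {                                       // pared-down bluestore_onode_t
      uint64_t size = 0;
      std::map<uint64_t, lextent_t> extent_map;            // logical offset -> lextent
      std::map<int, blob_t> blob_map;                      // local blobs (bnode not shown)
    };

    // Resolve one logical byte offset to a disk address.
    uint64_t resolve(const onode_t& o, uint64_t logical_off) {
      auto it = o.extent_map.upper_bound(logical_off);     // first lextent past the offset
      if (it == o.extent_map.begin()) return ~0ull;        // hole before first extent
      --it;
      uint64_t delta = logical_off - it->first;
      if (delta >= it->second.length) return ~0ull;        // hole between extents
      const blob_t& b = o.blob_map.at(it->second.blob_id);
      return b.map(it->second.blob_off + delta);
    }

    int main() {
      onode_t o;
      o.size = 0x200000;
      o.blob_map[1] = blob_t{{{0x8000000, 0x100000}, {0x9000000, 0x100000}}};  // two pextents
      o.extent_map[0] = lextent_t{1, 0, 0x200000};          // whole object -> blob 1
      std::printf("0x%llx\n", (unsigned long long)resolve(o, 0x180000));       // 0x9080000
    }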
  • 28. BNODE
    ● Blob metadata
      – Usually blobs stored in the onode
      – Sometimes we share blocks between objects (usually clones/snaps)
      – We need to reference count those extents
      – We still want to split collections and repartition extent metadata by hash

        shard     pool hash      name snap    gen
        O<NOSHARD,12,  3d3d9223, bar, NOSNAP, NOGEN> = onode
        O<NOSHARD,12,  3d3e02c2>                     = bnode
        O<NOSHARD,12,  3d3e02c2, baz, NOSNAP, NOGEN> = onode
        O<NOSHARD,12,  3d3e125d>                     = bnode
        O<NOSHARD,12,  3d3e125d, zip, NOSNAP, NOGEN> = onode
        O<NOSHARD,12,  3d3e1d41, dee, NOSNAP, NOGEN> = onode
        O<NOSHARD,12,  3d3e3832, dah, NOSNAP, NOGEN> = onode

    ● onode value includes, and bnode value is:
        map<int64_t,bluestore_blob_t> blob_map;
    ● lextent blob ids
      – > 0 → blob in onode
      – < 0 → blob in bnode
  • 29. CHECKSUMS
    ● We scrub... periodically
      – window before we detect error
      – we may read bad data
      – we may not be sure which copy is bad
    ● We want to validate checksum on every read
    ● Must store more metadata in the blobs
      – 32-bit csum metadata for 4 MB object and 4 KB blocks = 4 KB
      – larger csum blocks
        ● csum_order > 12
      – smaller csums
        ● crc32c_8 or 16
    ● IO hints
      – seq read + write → big chunks
      – compression → big chunks
    ● Per-pool policy
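The metadata-size trade-off in the first bullet is easy to check: checksum metadata per object is (object size / checksum block size) times the checksum width, where csum_order is log2 of the block size. A small sketch, using the slide's 4 MB object as input; treat the helper as illustrative rather than BlueStore's actual accounting.

    // Checksum metadata size = (object_size >> csum_order) * csum_width_bytes.
    // csum_order is log2 of the checksum block size (12 -> 4 KB, 16 -> 64 KB).
    #include <cstdint>
    #include <cstdio>

    uint64_t csum_bytes_for(uint64_t object_size, unsigned csum_order,
                            unsigned csum_width_bytes) {
      uint64_t blocks = object_size >> csum_order;
      return blocks * csum_width_bytes;
    }

    int main() {
      const uint64_t obj = 4ull << 20;                                          // 4 MB object
      std::printf("%llu\n", (unsigned long long)csum_bytes_for(obj, 12, 4));    // 4096: crc32c on 4 KB blocks
      std::printf("%llu\n", (unsigned long long)csum_bytes_for(obj, 16, 4));    // 256:  crc32c on 64 KB blocks
      std::printf("%llu\n", (unsigned long long)csum_bytes_for(obj, 12, 1));    // 1024: crc32c_8 on 4 KB blocks
    }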
  • 30. INLINE COMPRESSION
    ● 3x replication is expensive
      – Any scale-out cluster is expensive
    ● Lots of stored data is (highly) compressible
    ● Need largish extents to get compression benefit (64 KB, 128 KB)
      – may need to support small (over)writes
      – overwrites occlude/obscure compressed blobs
      – compacted (rewritten) when > N layers deep
    [diagram: an object's extent map from start of object to end of object; legend: allocated, written, written (compressed), uncompressed blob]
  • 32. DATA PATH BASICS
    Terms
    ● Sequencer
      – An independent, totally ordered queue of transactions
      – One per PG
    ● TransContext
      – State describing an executing transaction

    Two ways to write
    ● New allocation
      – Any write larger than min_alloc_size goes to a new, unused extent on disk
      – Once that IO completes, we commit the transaction
    ● WAL (write-ahead-logged)
      – Commit temporary promise to (over)write data with transaction
        ● includes data!
      – Do async overwrite
      – Then clean up temporary k/v pair
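A compact sketch of the decision just described: large, aligned writes go to freshly allocated space and commit after the data IO completes, while small or unaligned overwrites take the WAL path (data rides along in the k/v commit, then an async overwrite, then cleanup). Only the min_alloc_size name comes from the slide; the alignment check and the example threshold value are illustrative simplifications.

    // Illustrative write-path selection, loosely following the slide's two cases.
    #include <cstdint>
    #include <cstdio>

    enum class WritePath { NewAllocation, WAL };

    // min_alloc_size is the knob named on the slide; the alignment check is an
    // illustrative simplification of "goes to a new, unused extent on disk".
    WritePath choose_path(uint64_t offset, uint64_t length, uint64_t min_alloc_size) {
      bool aligned = (offset % min_alloc_size == 0) && (length % min_alloc_size == 0);
      if (length >= min_alloc_size && aligned)
        return WritePath::NewAllocation;  // write data to new extent, then commit kv
      return WritePath::WAL;              // commit kv (incl. data), async overwrite, clean up
    }

    int main() {
      const uint64_t min_alloc = 64 * 1024;                                   // example value only
      std::printf("%d\n", (int)choose_path(0, 4 * 1024, min_alloc));          // WAL (small write)
      std::printf("%d\n", (int)choose_path(0, 1024 * 1024, min_alloc));       // NewAllocation
      std::printf("%d\n", (int)choose_path(4096, 1024 * 1024, min_alloc));    // WAL (unaligned)
    }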
  • 33. TRANSCONTEXT STATE MACHINE
    [diagram: TransContext state machine with states PREPARE, AIO_WAIT, KV_QUEUED, KV_COMMITTING, WAL_QUEUED, WAL_AIO_WAIT, WAL_CLEANUP, WAL_CLEANUP_COMMITTING, FINISH; annotations: “Initiate some AIO”, “Wait for next TransContext(s) in Sequencer to be ready”, “Sequencer queue”, “Wait for next commit batch”]
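For reference, one plausible reading of the diagram as code: the states as an enum plus the two happy-path sequences (new allocation vs. WAL). This is an interpretation of the picture, not BlueStore's actual implementation, and it omits the less common states.

    // One reading of the TransContext state diagram (illustrative, not Ceph code).
    #include <cstdio>

    enum class State {
      PREPARE, AIO_WAIT, KV_QUEUED, KV_COMMITTING,
      WAL_QUEUED, WAL_AIO_WAIT, WAL_CLEANUP, FINISH
    };

    // Advance one step; `has_wal` selects the longer path that must also apply
    // and then clean up the write-ahead-logged overwrite.
    State next(State s, bool has_wal) {
      switch (s) {
        case State::PREPARE:       return State::AIO_WAIT;      // initiate some AIO
        case State::AIO_WAIT:      return State::KV_QUEUED;     // wait for commit batch
        case State::KV_QUEUED:     return State::KV_COMMITTING;
        case State::KV_COMMITTING: return has_wal ? State::WAL_QUEUED : State::FINISH;
        case State::WAL_QUEUED:    return State::WAL_AIO_WAIT;  // do the deferred overwrite
        case State::WAL_AIO_WAIT:  return State::WAL_CLEANUP;   // drop the temporary k/v pair
        case State::WAL_CLEANUP:   return State::FINISH;
        default:                   return State::FINISH;
      }
    }

    int main() {
      int steps = 0;
      for (State s = State::PREPARE; s != State::FINISH; s = next(s, /*has_wal=*/true))
        ++steps;
      std::printf("WAL path takes %d transitions\n", steps);    // 7
    }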
  • 34. CACHING
    ● OnodeSpace per collection
      – in-memory ghobject_t → Onode map of decoded onodes
    ● BufferSpace for in-memory blobs
      – may contain cached on-disk data
    ● Both buffers and onodes have lifecycles linked to a Cache
      – LRUCache – trivial LRU
      – TwoQCache – implements 2Q cache replacement algorithm (default)
    ● Cache is sharded for parallelism
      – Collection → shard mapping matches OSD's op_wq
      – same CPU context that processes client requests will touch the LRU/2Q lists
      – IO completion execution not yet sharded – TODO?
  • 35. BLOCK FREE LIST
    ● FreelistManager
      – persist list of free extents to key/value store
      – prepare incremental updates for allocate or release
    ● Initial implementation
      – extent-based
          <offset> = <length>
      – kept in-memory copy
      – enforces an ordering on commits; freelist updates had to pass through single thread/lock
          del 1600=100000
          put 1700=0fff00
      – small initial memory footprint, very expensive when fragmented
    ● New bitmap-based approach
          <offset> = <region bitmap>
      – where region is N blocks
        ● 128 blocks = 8 bytes
      – use k/v merge operator to XOR allocation or release
          merge 10=0000000011
          merge 20=1110000000
      – RocksDB log-structured-merge tree coalesces keys during compaction
      – no in-memory state
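A small standalone model of the XOR-merge idea in the bitmap approach: allocation and release of the same blocks submit identical XOR deltas, and the k/v store folds the deltas together at compaction time, so no read-modify-write and no in-memory freelist are needed. This models the merge semantics directly rather than going through RocksDB's MergeOperator API.

    // Standalone model of an XOR merge operator over per-region bitmaps.
    // Allocating and freeing the same blocks submit identical deltas; XOR-folding
    // them in any order yields the final bitmap with no read-modify-write.
    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <utility>
    #include <vector>

    using Region = uint64_t;     // region id = offset / (blocks per region)
    using Bitmap = uint64_t;     // 64 blocks per region in this toy (1 bit per block)

    struct ToyFreelistKV {
      std::map<Region, Bitmap> committed;             // fully compacted state
      std::vector<std::pair<Region, Bitmap>> deltas;  // pending "merge" operands

      void merge(Region r, Bitmap delta) { deltas.push_back({r, delta}); }

      void compact() {                                // what the LSM tree does for us
        for (auto& d : deltas) committed[d.first] ^= d.second;
        deltas.clear();
      }
    };

    // Flip bits [first, first + count) within one region.
    Bitmap bits(unsigned first, unsigned count) {
      Bitmap m = (count >= 64) ? ~0ull : ((1ull << count) - 1);
      return m << first;
    }

    int main() {
      ToyFreelistKV fl;
      fl.merge(10, bits(0, 2));   // allocate 2 blocks in region 10
      fl.merge(20, bits(5, 3));   // allocate 3 blocks in region 20
      fl.merge(10, bits(0, 2));   // release the same 2 blocks: identical delta
      fl.compact();
      std::printf("region 10: %llx  region 20: %llx\n",
                  (unsigned long long)fl.committed[10],
                  (unsigned long long)fl.committed[20]);   // 0 and e0
    }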
  • 36. BLOCK ALLOCATOR
    ● Allocator
      – abstract interface to allocate blocks
    ● StupidAllocator
      – extent-based
        ● bin free extents by size (powers of 2)
        ● choose sufficiently large extent closest to hint
      – highly variable memory usage
        ● btree of free extents
      – implemented, works
      – based on ancient ebofs policy
    ● BitmapAllocator
      – hierarchy of indexes
        ● L1: 2 bits = 2^6 blocks
        ● L2: 2 bits = 2^12 blocks
        ● ...
        ● 00 = all free, 11 = all used, 01 = mix
      – fixed memory consumption
        ● ~35 MB RAM per TB
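A rough sketch of the StupidAllocator policy named above: free extents are binned by power-of-two size, and an allocation takes a sufficiently large extent from an adequate bin, preferring one near the hint. This is a simplified illustration of the stated policy, not the real allocator.

    // Simplified power-of-two binning allocator (illustrates the stated policy only).
    #include <array>
    #include <cstdint>
    #include <cstdio>
    #include <map>

    struct Extent { uint64_t offset, length; };

    struct ToyStupidAllocator {
      // bin index = floor(log2(length)); within a bin, extents are keyed by offset
      std::array<std::map<uint64_t, uint64_t>, 64> bins;

      static int bin_of(uint64_t len) { int b = 0; while (len >>= 1) ++b; return b; }

      // Smallest bin whose extents are all guaranteed to be >= want
      // (a fuller implementation would also check the partially-fitting bin below it).
      static int first_fit_bin(uint64_t want) {
        int b = bin_of(want);
        return (want & (want - 1)) ? b + 1 : b;
      }

      void release(uint64_t off, uint64_t len) { bins[bin_of(len)][off] = len; }

      // Take `want` bytes from a sufficiently large extent, preferring one near `hint`.
      bool allocate(uint64_t want, uint64_t hint, Extent* out) {
        for (int b = first_fit_bin(want); b < 64; ++b) {
          if (bins[b].empty()) continue;
          auto it = bins[b].lower_bound(hint);            // closest at-or-after the hint
          if (it == bins[b].end()) it = bins[b].begin();  // otherwise wrap around
          *out = {it->first, want};
          uint64_t tail_off = it->first + want;
          uint64_t tail_len = it->second - want;
          bins[b].erase(it);
          if (tail_len) release(tail_off, tail_len);      // re-bin the unused remainder
          return true;
        }
        return false;
      }
    };

    int main() {
      ToyStupidAllocator a;
      a.release(0, 1 << 20);                // 1 MB free at offset 0
      a.release(8 << 20, 64 * 1024);        // 64 KB free at offset 8 MB
      Extent e{};
      if (a.allocate(64 * 1024, /*hint=*/8 << 20, &e))
        std::printf("allocated %llu bytes @ %llu\n",
                    (unsigned long long)e.length, (unsigned long long)e.offset);
    }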
  • 37. SMR HDD
    ● 256 MB zones / bands
      – must be written sequentially, but not all at once
      – libzbc supports ZAC and ZBC HDDs
      – host-managed or host-aware
    ● Let's support them natively!
    ● SMRAllocator
      – write pointer per zone
      – used + free counters per zone
      – Bonus: almost no memory!
    ● IO ordering
      – must ensure allocated writes reach disk in order
    ● Cleaning
      – store k/v hints: zone offset → object hash
      – pick emptiest closed zone, scan hints, move objects that are still there
      – opportunistically rewrite objects we read if the zone is flagged for cleaning soon
  • 39. HDD: SEQUENTIAL WRITE
    [chart: Ceph 10.1.0 Bluestore vs Filestore, sequential writes; x-axis: IO size, y-axis: throughput (MB/s); series: FS HDD, BS HDD]
  • 40. HDD: RANDOM WRITE
    [charts: Ceph 10.1.0 Bluestore vs Filestore, random writes; x-axis: IO size; one panel shows throughput (MB/s), the other IOPS; series: FS HDD, BS HDD]
  • 41. HDD: SEQUENTIAL READ
    [chart: Ceph 10.1.0 Bluestore vs Filestore, sequential reads; x-axis: IO size, y-axis: throughput (MB/s); series: FS HDD, BS HDD]
  • 42. HDD: RANDOM READ
    [charts: Ceph 10.1.0 Bluestore vs Filestore, random reads; x-axis: IO size; one panel shows throughput (MB/s), the other IOPS; series: FS HDD, BS HDD]
  • 43. SSD AND NVME?
    ● NVMe journal
      – random writes ~2x faster
      – some testing anomalies (problem with test rig kernel?)
    ● SSD only
      – similar to HDD result
      – small write benefit is more pronounced
    ● NVMe only
      – more testing anomalies on test rig... WIP
  • 45. STATUS
    ● Done
      – fully functional IO path with checksums and compression
      – fsck
      – bitmap-based allocator and freelist
    ● Current efforts
      – optimize metadata encoding efficiency
      – performance tuning
      – ZetaScale key/value db as RocksDB alternative
      – bounds on compressed blob occlusion
    ● Soon
      – per-pool properties that map to compression, checksum, IO hints
      – more performance optimization
      – native SMR HDD support
      – SPDK (kernel bypass for NVMe devices)
  • 46. AVAILABILITY
    ● Experimental backend in Jewel v10.2.z (just released)
      – enable experimental unrecoverable data corrupting features = bluestore rocksdb
      – ceph-disk --bluestore DEV
        ● no multi-device magic provisioning just yet
      – predates checksums and compression
    ● Current master
      – new disk format
      – checksums
      – compression
    ● The goal...
      – stable in Kraken (Fall '16)
      – default in Luminous (Spring '17)
  • 47. SUMMARY
    ● Ceph is great
    ● POSIX was a poor choice for storing objects
    ● RocksDB rocks and was easy to embed
    ● Our new BlueStore backend is awesome
    ● Full data checksums and inline compression!
  • 48. THANK YOU! Patrick McGarry Dir Ceph Community pmcgarry@redhat.com @scuttlemonkey

Editor's Notes

  1. A bit of background: what Ceph is; what FileStore is and why it doesn't work anymore; what NewStore is (the first attempt); and BlueStore, the current effort. High level: how it's structured, the data path, performance numbers. Current status of development, where we're at, and how to try it.
  2. [basic stuff] The original paper used the [last two bullets] but performance has been a challenge compared to raw hardware capabilities
  3. The RADOS cluster is structured as a series of hosts: a collection of OSD daemons sitting in front of HDDs, with a filesystem sitting on top of each disk.
  4. In reality there is a well-contained piece of the OSD called FileStore that is responsible for writing that data to the filesystem on that disk. It's that piece that is getting replaced.
  5. FileStore implements an interface called ObjectStore: an abstract interface that describes how each OSD daemon stores data on its local disk (just the local disk). The larger Ceph system is responsible for replicating across multiple OSDs. Originally there were two implementations, EBOFS and FileStore. It is built around two abstractions. Objects (sort of like files): data (a bunch of bytes), attributes (extended attributes), and omap (an unbounded key/value store, less commonly used). Collections (directories, i.e. groups of objects): a pool of objects is sharded into PGs, and PGs map to collections. All writes are transactions, applied atomically, consistently, and durably; don't worry about the I in ACID (that's provided by another layer).
  6. EBOFS was first: a user-space, extent-based, copy-on-write btree filesystem (full control of the stack, most natural interface). We got rid of it and switched to writing to btrfs in 2009, which had everything we needed and a growing community. FileStore writes objects as files, with leveldb for xattrs (when they are too big). Originally it was just for development without dedicated disks; it morphed into production. OSD dir: a dir for each PG. DB dir: holds leveldb. Meta dir: high-level metadata objects for the OSD as a whole.
  7. Because this is built on an existing FS, we are constrained by POSIX, which has problems. First, the interface wants to provide atomicity (because the OSD is managing the consistency of the data it stores locally; if it fails it can recover and resync with other replicas), so we need that transactionality. In practice most transactions are pretty simple: write some bytes, attr = what version, log = what version. But we can't rely on that simplicity. On the right is an example of one of these transactions.
  8. Initially, to support these, we tied into btrfs. It had an ioctl that we'd use to bracket all of our work, to prevent btrfs from committing a transaction while we were in the middle of ours. Internal checkpoints got us most of the way there. The problem is: what happens if the OSD daemon crashes and doesn't finish writing the full transaction? Btrfs would see the write start and some of the writes but never the end, so it would never get the second half. We got around that by adding a very horrible mount option to deliberately make btrfs wedge itself and crash. Internally there was no option for rollback; btrfs was not meant to be transactional in that way, and it's hard to shoehorn that in later. It didn't work, so instead we...
  9. Did a write-ahead journal, serialize into a sequence of bytes In btr we could be a little bit clever – snapshot == full checkpoint (after checkpoint, we could trim journal). If OSD restarted, roll back to snapshot and replay journal (nice consistency model) Non-btr not so elegant – still do periodic sync, but on restart we just replayed the journal blindly, might be repeating operations unfortunately the objects or interfaces that were supported aren’t all idempotent – had things like renames/clones/etc – whole bunch of hackery so we don’t apply those operations twice (kinda nasty…but it works) Write twice – journal + disk…this halves disk throughput
  10. Another place POSIX gets in our way is enumeration. Objects are distributed in a pool based on a 32-bit hash, and we do enumeration in hash order for scrub, for backfill, and when you request a list of objects via the API. POSIX readdir order is essentially random. We also need the ability to take a given collection and split it in half, quarters, etc.; part of Ceph is that we can repartition our data collections. You can't do that with POSIX: you can't take a dir of a million files and split it into two dirs. In practice, in FileStore we build an ugly tree of directories and files where the dir names are based on the prefix of the hash for the file (a deeply nested structure, similar to what other projects do). Not terribly efficient because of the complicated dir structure; we hit some bottlenecks.
  11. Time to do something different. POSIX is more trouble than it's worth.
  12. Objects aren't files; collections aren't directories. Use an ordered k/v database: RocksDB (picked somewhat randomly); the idea is that you plug in your k/v db (rocksdb / leveldb / any kv db). The actual data for an object is written to a simple file with a simple (short!) name, in nice big efficient directories.
  13. This didn't work very well. The main issue is that RocksDB has a write-ahead journal to maintain its consistency, and the FS also has a journal. Journal-on-journal is very inefficient (there are papers about it): each journal manages half of the overall consistency of the system, so you pay the overhead twice. When writing a file in NewStore you write the blob file and do an fsync: one IO with the file data, another IO to the FS journal, flushing the device twice. Then NewStore updates metadata: append a record to RocksDB, append to the RocksDB log file, then fsync that: another two IOs, one for the RocksDB log and one for the FS log file. You pay four IOs when you want to pay two. The solution is to put everything in one big journal.
  14. The problem is that the system still needs atomicity for overwrites. In POSIX you can't overwrite part of a file that already exists as part of a larger transaction (POSIX doesn't understand transactions). In Ceph we need these overwrites to be atomic (so they don't overwrite things unless they are ready to be committed). We could have had NewStore write to a new file, but that leads to a big, complex mapping structure. You end up where we were before, with write-ahead logging.
  15. The allocator is something we used to get from XFS or whatever, and now do ourselves. We have to share the block device with RocksDB (which writes a bunch of files, like its log file). We do that by implementing a RocksDB backend: there is a nicely abstracted Env class that captures the platform-dependent stuff. We implement a very simple FS (just complicated enough to support RocksDB's operations).
  16. All metadata is stored in RAM. The idea is to write to the journal: write updates to fnodes (like inodes) as they happen. When you hit a threshold, you rewrite the whole thing in a more compact form. RocksDB writes big files only, so that keeps things simple. BlueFS is smart about multiple devices (RocksDB writes different types of data to different dirs, e.g. logs to SSD). BlueStore and BlueFS communicate so that as BlueFS runs out of space, BlueStore gives it more, and vice versa.
  17. Did one tricky thing w/ rocksdb upstream Rocksdb written to use logfiles (journal) – write a new log file each time which leads to a pretty inefficient io pattern Every file system / db that does data logging uses a circular buffer – so we implemented that
  18. Two devices is like what people do now (SSD journal + multiple HDDs for data). The larger device in a two-device setup can do more. Three devices could be split even further. We don't support BlueStore tiering of object data yet, but we are exploring it.
  19. Ordered enumeration of objects: we carefully construct keys that sort in the order we want. Because objects are in hash order, we can take a collection that represents a range and split it into two collections without rewriting any k/v pairs (just change the collection metadata to arbitrarily carve it into two pieces). This is something FileStore had to work hard to do.
  20. ONODEs store per-object metadata. The main things in here are: the size of the object in bytes; inline attributes like ver=2; data pointers that indicate where the byte data is stored on disk; and an "omap head" that says, if you have user data stored as k/v data, where to find it.
  21. One other structure. We need to store metadata about the blob: the ONODE has a mapping from object space to logical extents, which map to blobs, but it doesn't always contain the blobs themselves. Usually they are stored next to the ONODE, but occasionally blocks are shared, with multiple ONODEs mapping to the same blobs. It is a map from an identifier to a blob; the blob tells you where to find the data.
  22. Blobs let us do checksums (every day = metadata, every week = data) With bluestore we want to validate a checksums on every read – that means bluestore blobs have to store more metadata (to include checksum) Use industry standard crc32c IOHints – (we control whole stack) things like RGW = read/write sequentially (no small overwrites) -> large checksum block If we compress a block, checksum for entire region Idea is policies for a pool basis
  23. 3x is expensive Bluestore implements in-line compression Trick is when you need to support overwrites (hopefully diagram makes sense) Figuring out performance is future work
  24. How the code flows when we’re taking data from OSD to disk Sequencer – independent stream fed to object store (1 per PG) Each transaction is represented by a transcontext New allocation – (most of the time) new region of disk, update metadata to point to the data WAL – (sometimes small writes) temporary k/v pair in rocksdb – effectively data journaling like filestore, only do it with small writes
  25. Complicated slide describes flow of transactions through this process
  26. Bluestore implements its own cache in user space memory (not using the kernel for any caching)
  27. Couple other things that happen Freelist – keeps track of unused space on disk
  28. A separate module is responsible for deciding where we should allocate new data. It's pluggable and has two implementations: StupidAllocator (not bad, but highly variable memory usage) and BitmapAllocator (a new implementation from SanDisk).
  29. Because the allocator is pluggable, we also have a GSoC student who is adding support for SMR hard disks (annoying: they prevent overwrites, so you have to write in stripes).
  30. These graphs were produced a couple of months ago; they are preliminary and not super detailed.
  31. Sequential write for spinning platter – large io is twice as fast (as you would expect, removing double writes)
  32. Random writes are much better, also about twice as fast (left is streaming throughput, right is IOPS) – kink between 32k and 64k writes is where we transition from WAL to writing to a new region of disk
  33. Sequential reads are a little more interesting, high end we’re a little better, low end we’re the same…middle there is a dip pattern Newstore is based on XFS with the readahead Bluestore isn’t…b/c ceph has its own read ahead (cephfs, rbd, radosgw all have their own) ...this is faster when you look at the client level, but not at the OSD
  34. Random reads, sort of what you’d expect. Small IO our metadata is more efficient
  35. Did do SSD and NVME but don’t have graphs