BLUESTORE: A NEW, FASTER STORAGE BACKEND FOR CEPH
Patrick McGarry
Ceph Days APAC Roadshow
2016
OUTLINE
● Ceph background and context
  – FileStore, and why POSIX failed us
  – NewStore – a hybrid approach
● BlueStore – a new Ceph OSD backend
  – Metadata
  – Data
● Performance
● Status and availability
● Summary
MOTIVATION
CEPH
● Object, block, and file storage in a single cluster
● All components scale horizontally
● No single point of failure
● Hardware agnostic, commodity hardware
● Self-manage whenever possible
● Open source (LGPL)
● “A Scalable, High-Performance Distributed File System”
● “performance, reliability, and scalability”
CEPH COMPONENTS
RGW (OBJECT)
A web services gateway for object storage, compatible with S3 and Swift
LIBRADOS
A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
RADOS
A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
RBD (BLOCK)
A reliable, fully-distributed block device with cloud platform integration
CEPHFS (FILE)
A distributed file system with POSIX semantics and scale-out metadata management
OBJECT STORAGE DAEMONS (OSDS)
[Diagram: several OSDs, each sitting on a local file system (xfs, btrfs, ext4) on its own disk, alongside a small cluster of monitors (M)]
OBJECT STORAGE DAEMONS (OSDS)
[Same diagram, highlighting that the layer between each OSD and its local file system (xfs, btrfs, ext4) is FileStore]
OBJECTSTORE AND DATA MODEL
● ObjectStore
  – abstract interface for storing local data
  – EBOFS, FileStore
● EBOFS
  – a user-space extent-based object file system
  – deprecated in favor of FileStore on btrfs in 2009
● Object – “file”
  – data (file-like byte stream)
  – attributes (small key/value)
  – omap (unbounded key/value)
● Collection – “directory”
  – placement group shard (slice of the RADOS pool)
  – sharded by 32-bit hash value
● All writes are transactions (sketched below)
  – Atomic + Consistent + Durable
  – Isolation provided by OSD
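To make the data model concrete, here is a minimal sketch of a transaction that bundles several mutations which must commit atomically. The types below are simplified stand-ins invented for illustration, not the real Ceph ObjectStore classes.

// Hypothetical, simplified model of the ObjectStore data model above.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Op {
    enum class Type { Write, SetAttrs, OmapSetKeys };
    Type type;
    std::string collection;                 // placement group shard, e.g. "0.6_head"
    std::string oid;                        // object name
    uint64_t offset;                        // byte offset (Write only)
    std::string data;                       // payload (Write only)
    std::map<std::string, std::string> kv;  // attrs or omap keys
};

// A transaction is an ordered list of ops the backend must apply
// atomically, consistently and durably; isolation comes from the OSD.
struct Transaction {
    std::vector<Op> ops;
    void write(std::string c, std::string o, uint64_t off, std::string bytes) {
        ops.push_back({Op::Type::Write, std::move(c), std::move(o), off, std::move(bytes), {}});
    }
    void setattrs(std::string c, std::string o, std::map<std::string, std::string> attrs) {
        ops.push_back({Op::Type::SetAttrs, std::move(c), std::move(o), 0, "", std::move(attrs)});
    }
    void omap_setkeys(std::string c, std::string o, std::map<std::string, std::string> kv) {
        ops.push_back({Op::Type::OmapSetKeys, std::move(c), std::move(o), 0, "", std::move(kv)});
    }
};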
FILESTORE
● FileStore
  – PG = collection = directory
  – object = file
● Leveldb
  – large xattr spillover
  – object omap (key/value) data
● Originally just for development...
  – later, the only supported backend (on XFS)
● /var/lib/ceph/osd/ceph-123/
  – current/
    ● meta/
      – osdmap123
      – osdmap124
    ● 0.1_head/
      – object1
      – object12
    ● 0.7_head/
      – object3
      – object5
    ● 0.a_head/
      – object4
      – object6
    ● db/
      – <leveldb files>
POSIX FAILS: TRANSACTIONS
● OSD carefully manages consistency of its data
● All writes are transactions
  – we need A + C + D; OSD provides I
● Most are simple
  – write some bytes to object (file)
  – update object attribute (file xattr)
  – append to update log (leveldb insert)
● ...but others are arbitrarily large/complex

[
  {
    "op_name": "write",
    "collection": "0.6_head",
    "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
    "length": 4194304,
    "offset": 0,
    "bufferlist length": 4194304
  },
  {
    "op_name": "setattrs",
    "collection": "0.6_head",
    "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
    "attr_lens": {
      "_": 269,
      "snapset": 31
    }
  },
  {
    "op_name": "omap_setkeys",
    "collection": "0.6_head",
    "oid": "#0:60000000::::head#",
    "attr_lens": {
      "0000000005.00000000000000000006": 178,
      "_info": 847
    }
  }
]
POSIX FAILS: TRANSACTIONS
● Btrfs transaction hooks

  /* trans start and trans end are dangerous, and only for
   * use by applications that know how to avoid the
   * resulting deadlocks
   */
  #define BTRFS_IOC_TRANS_START _IO(BTRFS_IOCTL_MAGIC, 6)
  #define BTRFS_IOC_TRANS_END _IO(BTRFS_IOCTL_MAGIC, 7)

● Writeback ordering

  #define BTRFS_MOUNT_FLUSHONCOMMIT (1 << 7)

● What if we hit an error? ceph-osd process dies?

  #define BTRFS_MOUNT_WEDGEONTRANSABORT (1 << …)

  – There is no rollback...
POSIX FAILS: TRANSACTIONS
● Write-ahead journal (sketched below)
  – serialize and journal every ObjectStore::Transaction
  – then write it to the file system
● Btrfs parallel journaling
  – periodic sync takes a snapshot, then trim old journal entries
  – on OSD restart: roll back and replay journal against last snapshot
● XFS/ext4 write-ahead journaling
  – periodic sync, then trim old journal entries
  – on restart, replay entire journal
  – lots of ugly hackery to deal with events that aren't idempotent
    ● e.g., renames, collection delete + create, …
● full data journal → we double write everything → ~halve disk throughput
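The write-ahead journal above boils down to a simple discipline: make the serialized transaction durable first, then apply it. A minimal sketch follows (illustrative only; FileStore's real journal is far more involved, and the helper names are invented):

// Sketch of full-data write-ahead journaling: every byte is written twice,
// once to the journal and once to the file system, which is why disk
// throughput is roughly halved.
#include <fstream>
#include <functional>
#include <string>

void commit(const std::string& serialized_txn,
            std::ofstream& journal,
            const std::function<void()>& apply_to_fs) {
    journal << serialized_txn;   // 1) append the serialized transaction
    journal.flush();             // 2) make it durable (a real journal would fsync here)
    apply_to_fs();               // 3) apply the same mutations to the file system
}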
POSIX FAILS: ENUMERATION
● Ceph objects are distributed by a 32-bit hash
● Enumeration is in hash order
  – scrubbing
  – “backfill” (data rebalancing, recovery)
  – enumeration via librados client API
● POSIX readdir is not well-ordered
● Need O(1) “split” for a given shard/range
● Build directory tree by hash-value prefix (see the sketch below)
  – split any directory when size > ~100 files
  – merge when size < ~50 files
  – read entire directory, sort in-memory

…
DIR_A/
DIR_A/A03224D3_qwer
DIR_A/A247233E_zxcv
…
DIR_B/
DIR_B/DIR_8/
DIR_B/DIR_8/B823032D_foo
DIR_B/DIR_8/B8474342_bar
DIR_B/DIR_9/
DIR_B/DIR_9/B924273B_baz
DIR_B/DIR_A/
DIR_B/DIR_A/BA4328D2_asdf
…
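A rough sketch of the hash-prefix scheme behind the listing above; the function name and the fixed depth are illustrative (real FileStore splits and merges directories dynamically):

// Build a FileStore-style directory path from the leading hex nibbles of an
// object's 32-bit hash, e.g. depth 2 and hash 0xB823032D -> "DIR_B/DIR_8/".
#include <cstdint>
#include <cstdio>
#include <string>

std::string dir_for_hash(uint32_t hash, int depth) {
    char hex[9];
    std::snprintf(hex, sizeof(hex), "%08X", hash);
    std::string path;
    for (int i = 0; i < depth; ++i) {
        path += "DIR_";
        path += hex[i];
        path += '/';
    }
    return path;
}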
NEWSTORE
NEW OBJECTSTORE GOALS
● More natural transaction atomicity
● Avoid double writes
● Efficient object enumeration
● Efficient clone operation
● Efficient splice (“move these bytes from object X to object Y”)
● Efficient IO pattern for HDDs, SSDs, NVMe
● Minimal locking, maximum parallelism (between PGs)
● Full data and metadata checksums
● Inline compression
NEWSTORE – WE MANAGE NAMESPACE
● POSIX has the wrong metadata model for us
● Ordered key/value is a perfect match (key-encoding sketch below)
  – well-defined object name sort order
  – efficient enumeration and random lookup
● NewStore = rocksdb + object files
  – /var/lib/ceph/osd/ceph-123/
    ● db/
      – <rocksdb, leveldb, whatever>
    ● blobs.1/
      – 0
      – 1
      – ...
    ● blobs.2/
      – 100000
      – 100001
      – ...

[Diagram: three OSDs running NewStore, with RocksDB for metadata and object files on HDD and/or SSD devices]
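A sketch of why an ordered key/value store fits: if the object name is encoded into a key that sorts by (pool, hash, name), then enumeration in hash order is just an ordered iteration. The encoding below is hypothetical, not NewStore's or BlueStore's actual key format:

// Fixed-width hex fields make lexicographic key order in the KV store match
// the (pool, hash, name) order the OSD needs for enumeration.
#include <cstdint>
#include <cstdio>
#include <string>

std::string object_key(uint64_t pool, uint32_t hash, const std::string& name) {
    char prefix[32];
    std::snprintf(prefix, sizeof(prefix), "O_%016llx_%08x_",
                  static_cast<unsigned long long>(pool), hash);
    return prefix + name;   // iterate keys with this prefix for ordered enumeration
}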
NEWSTORE FAIL: CONSISTENCY OVERHEAD
● RocksDB has a write-ahead log (“journal”)
● XFS/ext4(/btrfs) have their own journal (tree-log)
● Journal-on-journal has high overhead
  – each journal manages half of overall consistency, but incurs the same overhead
● write(2) + fsync(2) to new blobs.2/10302
  – 1 write + flush to block device
  – 1 write + flush to XFS/ext4 journal
● write(2) + fsync(2) on RocksDB log
  – 1 write + flush to block device
  – 1 write + flush to XFS/ext4 journal
NEWSTORE FAIL: ATOMICITY NEEDS WAL
● We can't overwrite a POSIX file as part of an atomic transaction
  – (we must preserve old data until the transaction commits)
● Writing overwrite data to a new file means many files for each object
● Write-ahead logging
  – put overwrite data in “WAL” records in RocksDB
  – commit atomically with the transaction
  – then overwrite the original file data
  – ...but then we're back to a double-write for overwrites
● Performance sucks again
● Overwrites dominate RBD block workloads
BLUESTORE
BLUESTORE
● BlueStore = Block + NewStore
  – consume raw block device(s)
  – key/value database (RocksDB) for metadata
  – data written directly to block device
  – pluggable block Allocator (policy)
● We must share the block device with RocksDB
  – implement our own rocksdb::Env (simplified sketch below)
  – implement tiny “file system” BlueFS
  – make BlueStore and BlueFS share device(s)

[Diagram: ObjectStore → BlueStore; data goes straight to the BlockDevice(s); metadata goes to RocksDB via BlueRocksEnv and BlueFS; an Allocator manages free space]
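The role of BlueRocksEnv is simply to route RocksDB's file I/O into BlueFS, which in turn writes to the raw device it shares with BlueStore. The sketch below models that layering with invented, heavily simplified interfaces; the real rocksdb::Env and BlueFS APIs are much larger.

// Simplified model of the stack: RocksDB -> (BlueRocksEnv-like shim) ->
// (BlueFS-like mini file system) -> raw block device.
#include <cstdint>
#include <string>

struct BlockDevice {                      // shared with the BlueStore data path
    virtual void write(uint64_t off, const std::string& buf) = 0;
    virtual ~BlockDevice() = default;
};

struct SimpleBlueFS {                     // stand-in for BlueFS
    explicit SimpleBlueFS(BlockDevice* dev) : dev_(dev) {}
    void append(const std::string& /*file*/, const std::string& buf) {
        dev_->write(next_off_, buf);      // real BlueFS also journals (file -> extent) metadata
        next_off_ += buf.size();
    }
    BlockDevice* dev_;
    uint64_t next_off_ = 0;
};

struct SimpleBlueRocksEnv {               // stand-in for BlueRocksEnv
    explicit SimpleBlueRocksEnv(SimpleBlueFS* fs) : fs_(fs) {}
    void append_to_log(const std::string& name, const std::string& record) {
        fs_->append(name, record);        // RocksDB file ops are passed straight to BlueFS
    }
    SimpleBlueFS* fs_;
};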
ROCKSDB: BLUEROCKSENV + BLUEFS
● class BlueRocksEnv : public rocksdb::EnvWrapper
  – passes file IO operations to BlueFS
● BlueFS is a super-simple “file system”
  – all metadata loaded in RAM on start/mount
  – no need to store block free list
  – coarse allocation unit (1 MB blocks)
  – all metadata is written to a journal
  – journal rewritten/compacted when it gets large

[Diagram: on-disk layout of superblock, journal extents, and data extents; journal records: file 10, file 11, file 12, file 12, file 13, rm file 12, file 13, ...]

● Map “directories” to different block devices
  – db.wal/ – on NVRAM, NVMe, SSD
  – db/ – level0 and hot SSTs on SSD
  – db.slow/ – cold SSTs on HDD
● BlueStore periodically balances free space
ROCKSDB: JOURNAL RECYCLING
● rocksdb LogReader only understands two modes
  – read until end of file (need accurate file size)
  – read all valid records, then ignore zeros at end (need zeroed tail)
● writing to “fresh” log “files” means > 1 IO for a log append
● modified upstream rocksdb to re-use previous log files
  – now resembles “normal” journaling behavior over a circular buffer
● works with vanilla RocksDB on files and on BlueFS
MULTI-DEVICE SUPPORT
● Single device
  – HDD or SSD
    ● rocksdb
    ● object data
● Two devices
  – 128MB of SSD or NVRAM
    ● rocksdb WAL
  – big device
    ● everything else
● Two devices
  – a few GB of SSD
    ● rocksdb WAL
    ● rocksdb (warm data)
  – big device
    ● rocksdb (cold data)
    ● object data
● Three devices
  – 128MB NVRAM
    ● rocksdb WAL
  – a few GB SSD
    ● rocksdb (warm data)
  – big device
    ● rocksdb (cold data)
    ● object data
METADATA
BLUESTORE METADATA
● Partition the key namespace for different metadata (see the sketch below)
  – S* – “superblock” metadata for the entire store
  – B* – block allocation metadata (free block bitmap)
  – T* – stats (bytes used, compressed, etc.)
  – C* – collection name → cnode_t
  – O* – object name → onode_t or bnode_t
  – L* – write-ahead log entries, promises of future IO
  – M* – omap (user key/value data, stored in objects)
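A sketch of how such prefix partitioning looks in practice; the prefixes follow the slide, but the encoding of the rest of each key is simplified and the helper names are invented:

// One leading byte selects the namespace; everything after it sorts within
// that namespace, so a prefix scan enumerates one kind of metadata.
#include <cstdint>
#include <string>

std::string onode_key(const std::string& encoded_object) { return "O" + encoded_object; }
std::string cnode_key(const std::string& encoded_coll)   { return "C" + encoded_coll; }
std::string wal_key(uint64_t seq)                        { return "L" + std::to_string(seq); }
std::string omap_key(uint64_t omap_head, const std::string& user_key) {
    return "M" + std::to_string(omap_head) + "." + user_key;
}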
CNODE
● Collection metadata
  – Interval of object namespace

  struct spg_t {
    uint64_t pool;
    uint32_t hash;
    shard_id_t shard;
  };
  struct bluestore_cnode_t {
    uint32_t bits;
  };

     shard pool hash       name
  C<NOSHARD,12,3d3e0000> “12.e3d3” = <19>

     shard pool hash       name snap   gen
  O<NOSHARD,12,3d3d880e,foo,NOSNAP,NOGEN> = …
  O<NOSHARD,12,3d3d9223,bar,NOSNAP,NOGEN> = …
  O<NOSHARD,12,3d3e02c2,baz,NOSNAP,NOGEN> = …
  O<NOSHARD,12,3d3e125d,zip,NOSNAP,NOGEN> = …
  O<NOSHARD,12,3d3e1d41,dee,NOSNAP,NOGEN> = …
  O<NOSHARD,12,3d3e3832,dah,NOSNAP,NOGEN> = …

● Nice properties
  – Ordered enumeration of objects
  – We can “split” collections by adjusting cnode metadata only (membership sketch below)
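A simplified membership test implied by bluestore_cnode_t::bits: a collection owns every object whose hash agrees with the collection's hash on `bits` bits. (The real code also bit-reverses hashes for key ordering; this sketch ignores that detail.)

#include <cstdint>

bool collection_contains(uint32_t coll_hash, uint32_t bits, uint32_t obj_hash) {
    uint32_t mask = bits >= 32 ? 0xffffffffu : ((1u << bits) - 1);
    return (coll_hash & mask) == (obj_hash & mask);
}

// Splitting is then metadata-only: increase `bits` by one and the objects
// partition themselves into the two resulting hash ranges.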
ONODE
● Per-object metadata
  – Lives directly in key/value pair
  – Serializes to 100s of bytes
● Size in bytes
● Inline attributes (user attr data)
● Data pointers (user byte data; lookup sketch below)
  – lextent_t → (blob, offset, length)
  – blob → (disk extents, csums, ...)
● Omap prefix/ID (user k/v data)

  struct bluestore_onode_t {
    uint64_t size;
    map<string,bufferptr> attrs;
    map<uint64_t,bluestore_lextent_t> extent_map;
    uint64_t omap_head;
  };
  struct bluestore_blob_t {
    vector<bluestore_pextent_t> extents;
    uint32_t compressed_length;
    bluestore_extent_ref_map_t ref_map;
    uint8_t csum_type, csum_order;
    bufferptr csum_data;
  };
  struct bluestore_pextent_t {
    uint64_t offset;
    uint64_t length;
  };
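A sketch of how a read resolves a logical offset through the onode's extent map; the types mirror the structs above but are simplified, and the helper itself is invented for illustration:

#include <cstdint>
#include <map>

struct Lextent { int64_t blob_id; uint64_t blob_off; uint64_t length; };

// logical offset -> lextent, as in bluestore_onode_t::extent_map
using ExtentMap = std::map<uint64_t, Lextent>;

// Find which blob (and offset within it) holds logical offset `off`.
bool map_offset(const ExtentMap& em, uint64_t off, Lextent* out, uint64_t* delta) {
    auto it = em.upper_bound(off);
    if (it == em.begin()) return false;                   // before the first extent
    --it;                                                 // extent starting at or before off
    uint64_t start = it->first;
    if (off >= start + it->second.length) return false;   // falls into a hole
    *out = it->second;
    *delta = off - start;                                 // read at blob_off + delta in that blob
    return true;
}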
BNODE
● Blob metadata
  – Usually blobs are stored in the onode
  – Sometimes we share blocks between objects (usually clones/snaps)
  – We need to reference count those extents
  – We still want to split collections and repartition extent metadata by hash

     shard pool hash       name snap   gen
  O<NOSHARD,12,3d3d9223,bar,NOSNAP,NOGEN> = onode
  O<NOSHARD,12,3d3e02c2> = bnode
  O<NOSHARD,12,3d3e02c2,baz,NOSNAP,NOGEN> = onode
  O<NOSHARD,12,3d3e125d> = bnode
  O<NOSHARD,12,3d3e125d,zip,NOSNAP,NOGEN> = onode
  O<NOSHARD,12,3d3e1d41,dee,NOSNAP,NOGEN> = onode
  O<NOSHARD,12,3d3e3832,dah,NOSNAP,NOGEN> = onode

● The onode value includes, and the bnode value is,
  map<int64_t,bluestore_blob_t> blob_map;
● lextent blob ids
  – > 0 → blob in onode
  – < 0 → blob in bnode
CHECKSUMS
● We scrub... periodically
  – window before we detect error
  – we may read bad data
  – we may not be sure which copy is bad
● We want to validate the checksum on every read
● Must store more metadata in the blobs (sizing helper below)
  – 32-bit csum metadata for a 4MB object and 4KB blocks = 4KB
  – larger csum blocks
    ● csum_order > 12
  – smaller csums
    ● crc32c_8 or 16
● IO hints
  – seq read + write → big chunks
  – compression → big chunks
● Per-pool policy
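The checksum overhead above is simple arithmetic: the csum block size is 1 << csum_order, and each block costs csum_size bytes of metadata. A small helper (names are illustrative) reproduces the slide's example, 4 MB at csum_order 12 with 4-byte crc32c giving 4 KB of checksum data:

#include <cstdint>

uint64_t csum_bytes(uint64_t object_bytes, unsigned csum_order, unsigned csum_size = 4) {
    uint64_t block  = 1ull << csum_order;                 // e.g. order 12 -> 4 KB blocks
    uint64_t blocks = (object_bytes + block - 1) / block; // round up
    return blocks * csum_size;                            // 4 MB / 4 KB * 4 B = 4 KB
}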
INLINE COMPRESSION
● 3x replication is expensive
  – Any scale-out cluster is expensive
● Lots of stored data is (highly) compressible
● Need largish extents to get compression benefit (64 KB, 128 KB)
  – may need to support small (over)writes
  – overwrites occlude/obscure compressed blobs
  – compacted (rewritten) when > N layers deep

[Diagram: a compressed blob spanning an object from start to end, with later uncompressed overwrites occluding parts of it; legend: allocated, written, written (compressed), uncompressed blob]
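As one illustration of the policy trade-off, a compressed blob is only worth keeping if it still saves space after rounding to allocation units. The following is a hypothetical sketch of that kind of decision, not BlueStore's actual heuristic:

#include <cstdint>

bool keep_compressed(uint64_t raw_len, uint64_t compressed_len, uint64_t alloc_unit) {
    auto round_up = [alloc_unit](uint64_t v) {
        return (v + alloc_unit - 1) / alloc_unit * alloc_unit;
    };
    // require at least one allocation unit of savings after rounding
    return round_up(compressed_len) + alloc_unit <= round_up(raw_len);
}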
DATA PATH
DATA PATH BASICS
Terms
● Sequencer
  – An independent, totally ordered queue of transactions
  – One per PG
● TransContext
  – State describing an executing transaction

Two ways to write (see the sketch below)
● New allocation
  – Any write larger than min_alloc_size goes to a new, unused extent on disk
  – Once that IO completes, we commit the transaction
● WAL (write-ahead-logged)
  – Commit temporary promise to (over)write data with transaction
    ● includes data!
  – Do async overwrite
  – Then clean up the temporary k/v pair
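A rough sketch of the decision between the two write paths; the function and its single threshold are illustrative, and BlueStore's real logic also weighs alignment and how existing blobs are overwritten:

#include <cstdint>

enum class WriteMode { NewAllocation, WriteAheadLog };

// Big writes go to fresh, unused extents and commit once the data IO lands;
// small (over)writes are logged in the KV transaction and applied afterwards.
WriteMode choose_write_mode(uint64_t length, uint64_t min_alloc_size) {
    return length >= min_alloc_size ? WriteMode::NewAllocation
                                    : WriteMode::WriteAheadLog;
}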
TRANSCONTEXT STATE MACHINE
[State-machine diagram. States: PREPARE, AIO_WAIT, KV_QUEUED, KV_COMMITTING, WAL_QUEUED, WAL_AIO_WAIT, WAL_CLEANUP, WAL_CLEANUP_COMMITTING, FINISH. A TransContext is prepared, waits for its AIO, is queued and committed to the key/value store, then either finishes or passes through the WAL stages before finishing. Annotations: “Initiate some AIO”, “Wait for next TransContext(s) in Sequencer to be ready”, “Wait for next commit batch”, Sequencer queue.]
CACHING
● OnodeSpace per collection
  – in-memory ghobject_t → Onode map of decoded onodes
● BufferSpace for in-memory blobs
  – may contain cached on-disk data
● Both buffers and onodes have lifecycles linked to a Cache
  – LRUCache – trivial LRU
  – TwoQCache – implements 2Q cache replacement algorithm (default)
● Cache is sharded for parallelism (see the sketch below)
  – Collection → shard mapping matches OSD's op_wq
  – same CPU context that processes client requests will touch the LRU/2Q lists
  – IO completion execution not yet sharded – TODO?
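A sketch of the sharding idea: hash the collection to a fixed shard so the CPU context that handles a PG's requests also owns that shard's LRU/2Q lists. The hash choice and shard count here are illustrative:

#include <cstddef>
#include <functional>
#include <string>

size_t cache_shard_for(const std::string& collection_id, size_t num_shards) {
    return std::hash<std::string>{}(collection_id) % num_shards;
}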
BLOCK FREE LIST
● FreelistManager
  – persist list of free extents to key/value store
  – prepare incremental updates for allocate or release
● Initial implementation
  – extent-based
      <offset> = <length>
  – kept in-memory copy
  – enforces an ordering on commits; freelist updates had to pass through a single thread/lock
      del 1600=100000
      put 1700=0fff00
  – small initial memory footprint, very expensive when fragmented
● New bitmap-based approach (see the XOR sketch below)
      <offset> = <region bitmap>
  – where region is N blocks
    ● 128 blocks = 8 bytes
  – use k/v merge operator to XOR allocation or release
      merge 10=0000000011
      merge 20=1110000000
  – RocksDB log-structured-merge tree coalesces keys during compaction
  – no in-memory state
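A sketch of the XOR merge-operator trick: allocate and release both submit a mask with 1s for the affected blocks, and the KV store folds the operands together during compaction, so BlueStore keeps no in-memory freelist state. (The 64-bit value here stands in for one region's bitmap.)

#include <cstdint>

uint64_t apply_xor_merge(uint64_t stored_bitmap, uint64_t operand) {
    return stored_bitmap ^ operand;   // flips the allocation bits in the region
}

// e.g. allocate blocks 0-1 of a region, later free block 1:
//   uint64_t v1 = apply_xor_merge(0b00, 0b11);  // -> 0b11 (both in use)
//   uint64_t v2 = apply_xor_merge(v1,   0b10);  // -> 0b01 (block 1 free again)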
BLOCK ALLOCATOR
● Allocator
  – abstract interface to allocate blocks
● StupidAllocator
  – extent-based
  – bin free extents by size (powers of 2)
  – choose sufficiently large extent closest to hint
  – highly variable memory usage
    ● btree of free extents
  – implemented, works
  – based on ancient ebofs policy
● BitmapAllocator
  – hierarchy of indexes
    ● L1: 2 bits = 2^6 blocks
    ● L2: 2 bits = 2^12 blocks
    ● ...
    ● 00 = all free, 11 = all used, 01 = mix
  – fixed memory consumption
    ● ~35 MB RAM per TB
SMR HDD
● Let's support them natively!
● 256MB zones/bands
  – must be written sequentially, but not all at once
  – libzbc supports ZAC and ZBC HDDs
  – host-managed or host-aware
● SMRAllocator
  – write pointer per zone
  – used + free counters per zone
  – Bonus: almost no memory!
● IO ordering
  – must ensure allocated writes reach disk in order
● Cleaning
  – store k/v hints: zone offset → object hash
  – pick emptiest closed zone, scan hints, move objects that are still there
  – opportunistically rewrite objects we read if the zone is flagged for cleaning soon
PERFORMANCE
HDD: SEQUENTIAL WRITE
[Chart: Ceph 10.1.0 Bluestore vs Filestore Sequential Writes; throughput (MB/s) vs IO size; series: FS HDD, BS HDD]

HDD: RANDOM WRITE
[Charts: Ceph 10.1.0 Bluestore vs Filestore Random Writes; throughput (MB/s) and IOPS vs IO size; series: FS HDD, BS HDD]

HDD: SEQUENTIAL READ
[Chart: Ceph 10.1.0 Bluestore vs Filestore Sequential Reads; throughput (MB/s) vs IO size; series: FS HDD, BS HDD]

HDD: RANDOM READ
[Charts: Ceph 10.1.0 Bluestore vs Filestore Random Reads; throughput (MB/s) and IOPS vs IO size; series: FS HDD, BS HDD]
SSD AND NVME?
● NVMe journal
  – random writes ~2x faster
  – some testing anomalies (problem with test rig kernel?)
● SSD only
  – similar to HDD result
  – small-write benefit is more pronounced
● NVMe only
  – more testing anomalies on test rig.. WIP
STATUS
STATUS
● Done
  – fully functional IO path with checksums and compression
  – fsck
  – bitmap-based allocator and freelist
● Current efforts
  – optimize metadata encoding efficiency
  – performance tuning
  – ZetaScale key/value db as RocksDB alternative
  – bounds on compressed blob occlusion
● Soon
  – per-pool properties that map to compression, checksum, IO hints
  – more performance optimization
  – native SMR HDD support
  – SPDK (kernel bypass for NVMe devices)
AVAILABILITY
● Experimental backend in Jewel v10.2.z (just released)
  – enable experimental unrecoverable data corrupting features = bluestore rocksdb
  – ceph-disk --bluestore DEV
    ● no multi-device magic provisioning just yet
  – predates checksums and compression
● Current master
  – new disk format
  – checksums
  – compression
● The goal...
  – stable in Kraken (Fall '16)
  – default in Luminous (Spring '17)
SUMMARY
● Ceph is great
● POSIX was a poor choice for storing objects
● RocksDB rocks and was easy to embed
● Our new BlueStore backend is awesome
● Full data checksums and inline compression!
THANK YOU!
Patrick McGarry
Dir Ceph Community
pmcgarry@redhat.com
@scuttlemonkey
More Related Content

What's hot

Intorduce to Ceph
Intorduce to CephIntorduce to Ceph
Intorduce to Ceph
kao kuo-tung
 
Ceph - A distributed storage system
Ceph - A distributed storage systemCeph - A distributed storage system
Ceph - A distributed storage system
Italo Santos
 
librados
libradoslibrados
librados
Patrick McGarry
 
HKG15-401: Ceph and Software Defined Storage on ARM servers
HKG15-401: Ceph and Software Defined Storage on ARM serversHKG15-401: Ceph and Software Defined Storage on ARM servers
HKG15-401: Ceph and Software Defined Storage on ARM servers
Linaro
 
Hadoop over rgw
Hadoop over rgwHadoop over rgw
Hadoop over rgw
zhouyuan
 
Unified readonly cache for ceph
Unified readonly cache for cephUnified readonly cache for ceph
Unified readonly cache for ceph
zhouyuan
 
Ceph and RocksDB
Ceph and RocksDBCeph and RocksDB
Ceph and RocksDB
Sage Weil
 
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on CephBuild an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Rongze Zhu
 
Community Update at OpenStack Summit Boston
Community Update at OpenStack Summit BostonCommunity Update at OpenStack Summit Boston
Community Update at OpenStack Summit Boston
Sage Weil
 
CephFS update February 2016
CephFS update February 2016CephFS update February 2016
CephFS update February 2016
John Spray
 
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud StorageCeph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
Sage Weil
 
Quick-and-Easy Deployment of a Ceph Storage Cluster with SLES
Quick-and-Easy Deployment of a Ceph Storage Cluster with SLESQuick-and-Easy Deployment of a Ceph Storage Cluster with SLES
Quick-and-Easy Deployment of a Ceph Storage Cluster with SLESJan Kalcic
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Odinot Stanislas
 
Introduction to redis - version 2
Introduction to redis - version 2Introduction to redis - version 2
Introduction to redis - version 2
Dvir Volk
 
Redis modules 101
Redis modules 101Redis modules 101
Redis modules 101
Dvir Volk
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
Sage Weil
 
Cephfs jewel mds performance benchmark
Cephfs jewel mds performance benchmarkCephfs jewel mds performance benchmark
Cephfs jewel mds performance benchmark
Xiaoxi Chen
 
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
Glenn K. Lockwood
 

What's hot (18)

Intorduce to Ceph
Intorduce to CephIntorduce to Ceph
Intorduce to Ceph
 
Ceph - A distributed storage system
Ceph - A distributed storage systemCeph - A distributed storage system
Ceph - A distributed storage system
 
librados
libradoslibrados
librados
 
HKG15-401: Ceph and Software Defined Storage on ARM servers
HKG15-401: Ceph and Software Defined Storage on ARM serversHKG15-401: Ceph and Software Defined Storage on ARM servers
HKG15-401: Ceph and Software Defined Storage on ARM servers
 
Hadoop over rgw
Hadoop over rgwHadoop over rgw
Hadoop over rgw
 
Unified readonly cache for ceph
Unified readonly cache for cephUnified readonly cache for ceph
Unified readonly cache for ceph
 
Ceph and RocksDB
Ceph and RocksDBCeph and RocksDB
Ceph and RocksDB
 
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on CephBuild an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
 
Community Update at OpenStack Summit Boston
Community Update at OpenStack Summit BostonCommunity Update at OpenStack Summit Boston
Community Update at OpenStack Summit Boston
 
CephFS update February 2016
CephFS update February 2016CephFS update February 2016
CephFS update February 2016
 
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud StorageCeph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
 
Quick-and-Easy Deployment of a Ceph Storage Cluster with SLES
Quick-and-Easy Deployment of a Ceph Storage Cluster with SLESQuick-and-Easy Deployment of a Ceph Storage Cluster with SLES
Quick-and-Easy Deployment of a Ceph Storage Cluster with SLES
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
 
Introduction to redis - version 2
Introduction to redis - version 2Introduction to redis - version 2
Introduction to redis - version 2
 
Redis modules 101
Redis modules 101Redis modules 101
Redis modules 101
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
 
Cephfs jewel mds performance benchmark
Cephfs jewel mds performance benchmarkCephfs jewel mds performance benchmark
Cephfs jewel mds performance benchmark
 
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
 

Viewers also liked

Ceph Day Taipei - Ceph Tiering with High Performance Architecture
Ceph Day Taipei - Ceph Tiering with High Performance Architecture Ceph Day Taipei - Ceph Tiering with High Performance Architecture
Ceph Day Taipei - Ceph Tiering with High Performance Architecture
Ceph Community
 
Ceph Day Shanghai - Ceph Performance Tools
Ceph Day Shanghai - Ceph Performance Tools Ceph Day Shanghai - Ceph Performance Tools
Ceph Day Shanghai - Ceph Performance Tools
Ceph Community
 
iSCSI Target Support for Ceph
iSCSI Target Support for Ceph iSCSI Target Support for Ceph
iSCSI Target Support for Ceph
Ceph Community
 
Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community
 
Ceph Day Seoul - Community Update
Ceph Day Seoul - Community UpdateCeph Day Seoul - Community Update
Ceph Day Seoul - Community Update
Ceph Community
 
Ceph Day Tokyo - Bit-Isle's 3 years footprint with Ceph
Ceph Day Tokyo - Bit-Isle's 3 years footprint with Ceph Ceph Day Tokyo - Bit-Isle's 3 years footprint with Ceph
Ceph Day Tokyo - Bit-Isle's 3 years footprint with Ceph
Ceph Community
 
Ceph Day Tokyo - High Performance Layered Architecture
Ceph Day Tokyo - High Performance Layered Architecture  Ceph Day Tokyo - High Performance Layered Architecture
Ceph Day Tokyo - High Performance Layered Architecture
Ceph Community
 
Ceph Day Tokyo - Bring Ceph to Enterprise
Ceph Day Tokyo - Bring Ceph to Enterprise Ceph Day Tokyo - Bring Ceph to Enterprise
Ceph Day Tokyo - Bring Ceph to Enterprise
Ceph Community
 
Ceph Day Tokyo - Delivering cost effective, high performance Ceph cluster
Ceph Day Tokyo - Delivering cost effective, high performance Ceph clusterCeph Day Tokyo - Delivering cost effective, high performance Ceph cluster
Ceph Day Tokyo - Delivering cost effective, high performance Ceph cluster
Ceph Community
 
Ceph Day Tokyo - Ceph Community Update
Ceph Day Tokyo - Ceph Community Update Ceph Day Tokyo - Ceph Community Update
Ceph Day Tokyo - Ceph Community Update
Ceph Community
 
Ceph Day Tokyo -- Ceph on All-Flash Storage
Ceph Day Tokyo -- Ceph on All-Flash StorageCeph Day Tokyo -- Ceph on All-Flash Storage
Ceph Day Tokyo -- Ceph on All-Flash Storage
Ceph Community
 
Ceph Day Tokyo - Ceph on ARM: Scaleable and Efficient
Ceph Day Tokyo - Ceph on ARM: Scaleable and Efficient Ceph Day Tokyo - Ceph on ARM: Scaleable and Efficient
Ceph Day Tokyo - Ceph on ARM: Scaleable and Efficient
Ceph Community
 
Ceph Day Shanghai - Ceph in Ctrip
Ceph Day Shanghai - Ceph in CtripCeph Day Shanghai - Ceph in Ctrip
Ceph Day Shanghai - Ceph in Ctrip
Ceph Community
 
Ceph Day Taipei - Bring Ceph to Enterprise
Ceph Day Taipei - Bring Ceph to EnterpriseCeph Day Taipei - Bring Ceph to Enterprise
Ceph Day Taipei - Bring Ceph to Enterprise
Ceph Community
 
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
Ceph Community
 
librados
libradoslibrados
librados
Ceph Community
 
Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance
Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance
Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance
Ceph Community
 
Ceph Day Shanghai - Recovery Erasure Coding and Cache Tiering
Ceph Day Shanghai - Recovery Erasure Coding and Cache TieringCeph Day Shanghai - Recovery Erasure Coding and Cache Tiering
Ceph Day Shanghai - Recovery Erasure Coding and Cache Tiering
Ceph Community
 
Ceph on 64-bit ARM with X-Gene
Ceph on 64-bit ARM with X-GeneCeph on 64-bit ARM with X-Gene
Ceph on 64-bit ARM with X-Gene
Ceph Community
 
Ceph Day Taipei - Community Update
Ceph Day Taipei - Community Update Ceph Day Taipei - Community Update
Ceph Day Taipei - Community Update
Ceph Community
 

Viewers also liked (20)

Ceph Day Taipei - Ceph Tiering with High Performance Architecture
Ceph Day Taipei - Ceph Tiering with High Performance Architecture Ceph Day Taipei - Ceph Tiering with High Performance Architecture
Ceph Day Taipei - Ceph Tiering with High Performance Architecture
 
Ceph Day Shanghai - Ceph Performance Tools
Ceph Day Shanghai - Ceph Performance Tools Ceph Day Shanghai - Ceph Performance Tools
Ceph Day Shanghai - Ceph Performance Tools
 
iSCSI Target Support for Ceph
iSCSI Target Support for Ceph iSCSI Target Support for Ceph
iSCSI Target Support for Ceph
 
Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph
 
Ceph Day Seoul - Community Update
Ceph Day Seoul - Community UpdateCeph Day Seoul - Community Update
Ceph Day Seoul - Community Update
 
Ceph Day Tokyo - Bit-Isle's 3 years footprint with Ceph
Ceph Day Tokyo - Bit-Isle's 3 years footprint with Ceph Ceph Day Tokyo - Bit-Isle's 3 years footprint with Ceph
Ceph Day Tokyo - Bit-Isle's 3 years footprint with Ceph
 
Ceph Day Tokyo - High Performance Layered Architecture
Ceph Day Tokyo - High Performance Layered Architecture  Ceph Day Tokyo - High Performance Layered Architecture
Ceph Day Tokyo - High Performance Layered Architecture
 
Ceph Day Tokyo - Bring Ceph to Enterprise
Ceph Day Tokyo - Bring Ceph to Enterprise Ceph Day Tokyo - Bring Ceph to Enterprise
Ceph Day Tokyo - Bring Ceph to Enterprise
 
Ceph Day Tokyo - Delivering cost effective, high performance Ceph cluster
Ceph Day Tokyo - Delivering cost effective, high performance Ceph clusterCeph Day Tokyo - Delivering cost effective, high performance Ceph cluster
Ceph Day Tokyo - Delivering cost effective, high performance Ceph cluster
 
Ceph Day Tokyo - Ceph Community Update
Ceph Day Tokyo - Ceph Community Update Ceph Day Tokyo - Ceph Community Update
Ceph Day Tokyo - Ceph Community Update
 
Ceph Day Tokyo -- Ceph on All-Flash Storage
Ceph Day Tokyo -- Ceph on All-Flash StorageCeph Day Tokyo -- Ceph on All-Flash Storage
Ceph Day Tokyo -- Ceph on All-Flash Storage
 
Ceph Day Tokyo - Ceph on ARM: Scaleable and Efficient
Ceph Day Tokyo - Ceph on ARM: Scaleable and Efficient Ceph Day Tokyo - Ceph on ARM: Scaleable and Efficient
Ceph Day Tokyo - Ceph on ARM: Scaleable and Efficient
 
Ceph Day Shanghai - Ceph in Ctrip
Ceph Day Shanghai - Ceph in CtripCeph Day Shanghai - Ceph in Ctrip
Ceph Day Shanghai - Ceph in Ctrip
 
Ceph Day Taipei - Bring Ceph to Enterprise
Ceph Day Taipei - Bring Ceph to EnterpriseCeph Day Taipei - Bring Ceph to Enterprise
Ceph Day Taipei - Bring Ceph to Enterprise
 
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
 
librados
libradoslibrados
librados
 
Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance
Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance
Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance
 
Ceph Day Shanghai - Recovery Erasure Coding and Cache Tiering
Ceph Day Shanghai - Recovery Erasure Coding and Cache TieringCeph Day Shanghai - Recovery Erasure Coding and Cache Tiering
Ceph Day Shanghai - Recovery Erasure Coding and Cache Tiering
 
Ceph on 64-bit ARM with X-Gene
Ceph on 64-bit ARM with X-GeneCeph on 64-bit ARM with X-Gene
Ceph on 64-bit ARM with X-Gene
 
Ceph Day Taipei - Community Update
Ceph Day Taipei - Community Update Ceph Day Taipei - Community Update
Ceph Day Taipei - Community Update
 

Similar to Ceph Day KL - Bluestore

BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
Sage Weil
 
Ceph Tech Talk: Bluestore
Ceph Tech Talk: BluestoreCeph Tech Talk: Bluestore
Ceph Tech Talk: Bluestore
Ceph Community
 
Linuxcommands 091018105536-phpapp01
Linuxcommands 091018105536-phpapp01Linuxcommands 091018105536-phpapp01
Linuxcommands 091018105536-phpapp01
Nagarajan Kamalakannan
 
DEVIEW 2013
DEVIEW 2013DEVIEW 2013
DEVIEW 2013
Patrick McGarry
 
Diagnostics and Debugging
Diagnostics and DebuggingDiagnostics and Debugging
Diagnostics and DebuggingMongoDB
 
Scaling Dropbox
Scaling DropboxScaling Dropbox
Scaling Dropbox
C4Media
 
Ceph Internals
Ceph InternalsCeph Internals
Ceph Internals
Victor Santos
 
NSC #2 - Challenge Solution
NSC #2 - Challenge SolutionNSC #2 - Challenge Solution
NSC #2 - Challenge Solution
NoSuchCon
 
Bringing up Android on your favorite X86 Workstation or VM (AnDevCon Boston, ...
Bringing up Android on your favorite X86 Workstation or VM (AnDevCon Boston, ...Bringing up Android on your favorite X86 Workstation or VM (AnDevCon Boston, ...
Bringing up Android on your favorite X86 Workstation or VM (AnDevCon Boston, ...
Ron Munitz
 
Logging and ranting / Vytis Valentinavičius (Lamoda)
Logging and ranting / Vytis Valentinavičius (Lamoda)Logging and ranting / Vytis Valentinavičius (Lamoda)
Logging and ranting / Vytis Valentinavičius (Lamoda)
Ontico
 
Hadoop in Data Warehousing
Hadoop in Data WarehousingHadoop in Data Warehousing
Hadoop in Data Warehousing
Alexey Grigorev
 
Experiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah WatkinsExperiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah Watkins
Ceph Community
 
RocksDB meetup
RocksDB meetupRocksDB meetup
RocksDB meetup
Javier González
 
PyLadies Talk: Learn to love the command line!
PyLadies Talk: Learn to love the command line!PyLadies Talk: Learn to love the command line!
PyLadies Talk: Learn to love the command line!
Blanca Mancilla
 
Ceph Day Melbourne - Troubleshooting Ceph
Ceph Day Melbourne - Troubleshooting Ceph Ceph Day Melbourne - Troubleshooting Ceph
Ceph Day Melbourne - Troubleshooting Ceph
Ceph Community
 
What's new in Redis v3.2
What's new in Redis v3.2What's new in Redis v3.2
What's new in Redis v3.2
Itamar Haber
 
Building Data applications with Go: from Bloom filters to Data pipelines / FO...
Building Data applications with Go: from Bloom filters to Data pipelines / FO...Building Data applications with Go: from Bloom filters to Data pipelines / FO...
Building Data applications with Go: from Bloom filters to Data pipelines / FO...
Sergii Khomenko
 
Code GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limitersCode GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limiters
Marina Kolpakova
 
2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph
Ceph Community
 
Stripe CTF3 wrap-up
Stripe CTF3 wrap-upStripe CTF3 wrap-up
Stripe CTF3 wrap-up
Stripe
 

Similar to Ceph Day KL - Bluestore (20)

BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
 
Ceph Tech Talk: Bluestore
Ceph Tech Talk: BluestoreCeph Tech Talk: Bluestore
Ceph Tech Talk: Bluestore
 
Linuxcommands 091018105536-phpapp01
Linuxcommands 091018105536-phpapp01Linuxcommands 091018105536-phpapp01
Linuxcommands 091018105536-phpapp01
 
DEVIEW 2013
DEVIEW 2013DEVIEW 2013
DEVIEW 2013
 
Diagnostics and Debugging
Diagnostics and DebuggingDiagnostics and Debugging
Diagnostics and Debugging
 
Scaling Dropbox
Scaling DropboxScaling Dropbox
Scaling Dropbox
 
Ceph Internals
Ceph InternalsCeph Internals
Ceph Internals
 
NSC #2 - Challenge Solution
NSC #2 - Challenge SolutionNSC #2 - Challenge Solution
NSC #2 - Challenge Solution
 
Bringing up Android on your favorite X86 Workstation or VM (AnDevCon Boston, ...
Bringing up Android on your favorite X86 Workstation or VM (AnDevCon Boston, ...Bringing up Android on your favorite X86 Workstation or VM (AnDevCon Boston, ...
Bringing up Android on your favorite X86 Workstation or VM (AnDevCon Boston, ...
 
Logging and ranting / Vytis Valentinavičius (Lamoda)
Logging and ranting / Vytis Valentinavičius (Lamoda)Logging and ranting / Vytis Valentinavičius (Lamoda)
Logging and ranting / Vytis Valentinavičius (Lamoda)
 
Hadoop in Data Warehousing
Hadoop in Data WarehousingHadoop in Data Warehousing
Hadoop in Data Warehousing
 
Experiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah WatkinsExperiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah Watkins
 
RocksDB meetup
RocksDB meetupRocksDB meetup
RocksDB meetup
 
PyLadies Talk: Learn to love the command line!
PyLadies Talk: Learn to love the command line!PyLadies Talk: Learn to love the command line!
PyLadies Talk: Learn to love the command line!
 
Ceph Day Melbourne - Troubleshooting Ceph
Ceph Day Melbourne - Troubleshooting Ceph Ceph Day Melbourne - Troubleshooting Ceph
Ceph Day Melbourne - Troubleshooting Ceph
 
What's new in Redis v3.2
What's new in Redis v3.2What's new in Redis v3.2
What's new in Redis v3.2
 
Building Data applications with Go: from Bloom filters to Data pipelines / FO...
Building Data applications with Go: from Bloom filters to Data pipelines / FO...Building Data applications with Go: from Bloom filters to Data pipelines / FO...
Building Data applications with Go: from Bloom filters to Data pipelines / FO...
 
Code GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limitersCode GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limiters
 
2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph
 
Stripe CTF3 wrap-up
Stripe CTF3 wrap-upStripe CTF3 wrap-up
Stripe CTF3 wrap-up
 

Recently uploaded

GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 

Recently uploaded (20)

GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 

Ceph Day KL - Bluestore

  • 1. B L U E S TO R E : A N E W, FASTER S T O R A G E B A C K E N D F O R C E P H Patrick McGarry Ceph Days APAC Roadshow 2016
  • 2. 2 O UTLIN E ● Ce p h b a c k g r o u n d a n d c o n t e x t – – FileStore, a n d w h y POSIX failed us Ne wS to r e – a h y b r i d a p p r o a c h ● BlueStore – a n e w Ce p h OSD b a c k e n d – – M e t a d a t a D a t a ● ● ● Performance Status a n d availability S u m m a r y
  • 4. CEPH ● ● ● ● ● ● Object, block, a n d file storage in a single cluster All c o m p o n e n t s scale horizontally N o single p o in t of failure H a r d w a r e agnostic, c o m m o d i t y h a r d w a r e Self-manage w h e n e v e r possible O p e n source (LGPL) ● ● “ A Scalable, High-Performance Distributed File S y s t e m ” “ p e r f o r ma n c e , reliability, a n d scalability” 4
  • 5. CEPH COMPONENTS RGW A w e b services g a t e w a y for o b je ct storage, co mp a t ib le w i t h S3 a n d Swift LIBRADOS A library a llo wing a p p s t o directly access RADOS (C, C + + , Java, Python, Ruby, PHP) RADOS A software -based, reliable, a u t o n o m o u s , d is t rib ute d o bject store c o m p r i s e d of self-healing, self-managing, intelligent st o ra g e n o d e s a n d lig h t we ig h t mo n it o rs RBD A reliable, fully-distributed block d e vice w i t h clo u d p la t f o rm in t e g rat ion CEPHFS A d ist ribut ed file s y s t e m w i t h POSIX se ma n t ics a n d scale-out m e t a d a t a m a n a g e m e n t OBJECT 5 BLOCK FILE
  • 6. OBJECT STORAGE DAEMONS (OSDS) FS DISK OSD DISK OSD FS DISK OSD FS DISK OSD FS xfs b t rfs ex t 4 M M M 6
  • 7. OBJECT STORAGE DAEMONS (OSDS) FS DISK OSD DISK OSD FS DISK OSD FS DISK OSD FS xfs b t rfs ex t 4 M M M FileStore 7 FileStoreFileStoreFileStore
  • 8. ● 8 ObjectStore – – abs t ract interface for storing local d a t a EBOFS, FileStore ● EBOFS – – a us er -s pac e e x t e n t - b a s e d o b j e c t file s y s t e m deprec at ed in f av or of FileStore o n btrfs in 2 0 0 9 ● Object – “ file ” – – – d a t a (file-like b y t e s t ream ) at t ributes (small key/value) o m a p ( u n b o u n d e d key/value) ● Collection – “ d i r e c t o r y ” – – p l a c e m e n t g r o u p shard (slice of t h e RADOS pool) s h a r d e d b y 3 2 - b i t h a s h v a l u e ● All writes are transactions – – A t o m i c + C o n s i s t e n t + D u r a b l e Isolation prov ided b y OSD OBJECTSTORE A N D DATA MODEL
  • 9. ● 9 FileSt ore – – PG = collection = directory object = file ● Le v e ld b – – large x a t t r spillover object o m a p (key/value) d a t a ● Originally just for development... – later, o n l y s u p p o r t e d b a c k e n d ( o n XFS) ● /var/lib/ceph/osd/ceph-123/ – current/ ● meta/ – – osdmap123 osdmap124 ● 0.1_head/ – – object1 object12 ● 0.7_head/ – – object3 object5 ● 0.a_head/ – – object4 object6 ● db/ – <leveldb files> FILESTORE
  • 10. ● 1 0 OSD carefully m a n a g e s c o n s is te n c y of its d a t a ● All w rite s a re tra n s a c tio n s – w e n e e d A + C + D ; OSD prov ides I ● M o s t a re s i m p l e – – – w r i t e s o m e b y t e s t o objec t (file) u p d a t e objec t a t t r i b u t e (file x a t t r ) a p p e n d t o u p d a t e log (lev eldb insert) ...but o t h e r s a re arbitrarily l a r g e / c o m p l e x [ { "op_name": "write", "collection": "0.6_head", "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#", "length": 4194304, "offset": 0, "bufferlist length": 4194304 }, { "op_name": "setattrs", "collection": "0.6_head", "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#", "attr_lens": { "_": 269, "snapset": 31 } }, { "op_name": "omap_setkeys", "collection": "0.6_head", "oid": "#0:60000000::::head#", "attr_lens": { "0000000005.00000000000000000006": 178, "_info": 847 } } ] POSIX FAILS: TRANSACTIONS
  • 11. POSIX FAILS: TRANSACTIONS
    ● Btrfs transaction hooks
        /* trans start and trans end are dangerous, and only for
         * use by applications that know how to avoid the
         * resulting deadlocks */
        #define BTRFS_IOC_TRANS_START  _IO(BTRFS_IOCTL_MAGIC, 6)
        #define BTRFS_IOC_TRANS_END    _IO(BTRFS_IOCTL_MAGIC, 7)
    ● Writeback ordering
        #define BTRFS_MOUNT_FLUSHONCOMMIT (1 << 7)
    ● What if we hit an error? ceph-osd process dies?
      – There is no rollback...
        #define BTRFS_MOUNT_WEDGEONTRANSABORT (1 << …)
  • 12. POSIX FAILS: TRANSACTIONS
    ● Write-ahead journal
      – serialize and journal every ObjectStore::Transaction
      – then write it to the file system
    ● Btrfs parallel journaling
      – periodic sync takes a snapshot, then trim old journal entries
      – on OSD restart: rollback and replay journal against last snapshot
    ● XFS/ext4 write-ahead journaling
      – periodic sync, then trim old journal entries
      – on restart, replay entire journal
      – lots of ugly hackery to deal with events that aren't idempotent
        ● e.g., renames, collection delete + create, …
    ● full data journal → we double write everything → ~halve disk throughput
  • 13. POSIX FAILS: ENUMERATION
    ● Ceph objects are distributed by a 32-bit hash
    ● Enumeration is in hash order
      – scrubbing
      – “backfill” (data rebalancing, recovery)
      – enumeration via librados client API
    ● POSIX readdir is not well-ordered
    ● Need O(1) “split” for a given shard/range
    ● Build directory tree by hash-value prefix
      – split any directory when size > ~100 files
      – merge when size < ~50 files
      – read entire directory, sort in-memory
        …
        DIR_A/
        DIR_A/A03224D3_qwer
        DIR_A/A247233E_zxcv
        …
        DIR_B/
        DIR_B/DIR_8/
        DIR_B/DIR_8/B823032D_foo
        DIR_B/DIR_8/B8474342_bar
        DIR_B/DIR_9/
        DIR_B/DIR_9/B924273B_baz
        DIR_B/DIR_A/
        DIR_B/DIR_A/BA4328D2_asdf
        …
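As a rough illustration of the workaround above, the sketch below derives a FileStore-like nested path from the leading nibbles of an object's 32-bit hash; splitting a too-large directory corresponds to adding one more nibble level. The hash function and path layout are simplified stand-ins, not FileStore's exact scheme.

    // Simplified hash-prefix directory nesting (not FileStore's exact layout).
    #include <cstdint>
    #include <cstdio>
    #include <string>

    // Stand-in for Ceph's object-name hash; any stable 32-bit hash shows the idea.
    uint32_t toy_hash(const std::string& name) {
      uint32_t h = 2166136261u;                    // FNV-1a
      for (unsigned char c : name) { h ^= c; h *= 16777619u; }
      return h;
    }

    // Place an object under `levels` nibble directories: DIR_X/DIR_Y/<HASH>_<name>.
    // Splitting a directory that grew too large means adding one more level for
    // that subtree and moving its files down, so enumeration stays in hash order.
    std::string object_path(const std::string& name, int levels) {
      uint32_t h = toy_hash(name);
      char hex[9];
      std::snprintf(hex, sizeof(hex), "%08X", h);
      std::string path;
      for (int i = 0; i < levels; ++i) {
        path += "DIR_";
        path += hex[i];
        path += "/";
      }
      return path + hex + "_" + name;
    }

    int main() {
      std::printf("%s\n", object_path("foo", 1).c_str());  // DIR_?/<hash>_foo
      std::printf("%s\n", object_path("foo", 2).c_str());  // DIR_?/DIR_?/<hash>_foo
    }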
  • 15. NEW OBJECTSTORE GOALS
    ● More natural transaction atomicity
    ● Avoid double writes
    ● Efficient object enumeration
    ● Efficient clone operation
    ● Efficient splice (“move these bytes from object X to object Y”)
    ● Efficient IO pattern for HDDs, SSDs, NVMe
    ● Minimal locking, maximum parallelism (between PGs)
    ● Full data and metadata checksums
    ● Inline compression
  • 16. NEWSTORE – WE MANAGE NAMESPACE
    ● POSIX has the wrong metadata model for us
    ● Ordered key/value is perfect match
      – well-defined object name sort order
      – efficient enumeration and random lookup
    ● NewStore = rocksdb + object files
      – /var/lib/ceph/osd/ceph-123/
        ● db/
          – <rocksdb, leveldb, whatever>
        ● blobs.1/
          – 0
          – 1
          – ...
        ● blobs.2/
          – 100000
          – 100001
          – ...
    [diagram: NewStore (RocksDB plus object files) running over HDD and SSD OSD devices]
  • 17. NEWSTORE FAIL: CONSISTENCY OVERHEAD
    ● RocksDB has a write-ahead log “journal”
    ● XFS/ext4(/btrfs) have their own journal (tree-log)
    ● Journal-on-journal has high overhead
      – each journal manages half of overall consistency, but incurs the same overhead
    ● write(2) + fsync(2) to new blobs.2/10302
      – 1 write + flush to block device
      – 1 write + flush to XFS/ext4 journal
    ● write(2) + fsync(2) on RocksDB log
      – 1 write + flush to block device
      – 1 write + flush to XFS/ext4 journal
  • 18. NEWSTORE FAIL: ATOMICITY NEEDS WAL
    ● We can't overwrite a POSIX file as part of an atomic transaction
      – (we must preserve old data until the transaction commits)
    ● Writing overwrite data to a new file means many files for each object
    ● Write-ahead logging
      – put overwrite data in a “WAL” record in RocksDB
      – commit atomically with transaction
      – then overwrite original file data
      – ...but then we're back to a double-write for overwrites
    ● Performance sucks again
    ● Overwrites dominate RBD block workloads
  • 20. BLUESTORE
    ● BlueStore = Block + NewStore
      – consume raw block device(s)
      – key/value database (RocksDB) for metadata
      – data written directly to block device
      – pluggable block Allocator (policy)
    ● We must share the block device with RocksDB
      – implement our own rocksdb::Env
      – implement tiny “file system” BlueFS
      – make BlueStore and BlueFS share device(s)
    [diagram: ObjectStore → BlueStore; data goes straight to the BlockDevice, metadata goes to RocksDB via BlueRocksEnv → BlueFS; an Allocator manages the shared block device(s)]
  • 21. ROCKSDB: BLUEROCKSENV + BLUEFS
    ● class BlueRocksEnv : public rocksdb::EnvWrapper
      – passes file IO operations to BlueFS
    ● BlueFS is a super-simple “file system”
      – all metadata loaded in RAM on start/mount
      – no need to store block free list
      – coarse allocation unit (1 MB blocks)
      – all metadata is written to a journal
      – journal rewritten/compacted when it gets large
    [diagram: BlueFS on-disk layout: superblock, journal records (file 10, file 11, file 12, rm file 12, file 13, ...), and data extents interleaved with more journal space]
    ● Map “directories” to different block devices
      – db.wal/ – on NVRAM, NVMe, SSD
      – db/ – level0 and hot SSTs on SSD
      – db.slow/ – cold SSTs on HDD
    ● BlueStore periodically balances free space
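A minimal sketch of the BlueFS idea described above, under the stated assumptions: all file metadata lives in an in-memory map, every change is appended to a journal, mount replays the journal, and the journal is rewritten as a compact snapshot when it grows. The record format and names are invented for illustration; real BlueFS journals extent lists and much more, not just file sizes.

    // Toy BlueFS-style metadata journal: replay on mount, compact when large.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    struct Record {                        // journal entry (illustrative format)
      enum { UPSERT, REMOVE } op;
      std::string name;
      uint64_t size = 0;                   // real BlueFS tracks extents, not just size
    };

    struct ToyBlueFS {
      std::vector<Record> journal;                  // what lives on disk
      std::map<std::string, uint64_t> files;        // all metadata, kept in RAM

      void replay() {                               // "mount": rebuild RAM state
        files.clear();
        for (const auto& r : journal)
          if (r.op == Record::UPSERT) files[r.name] = r.size;
          else files.erase(r.name);
      }
      void upsert(const std::string& n, uint64_t s) {
        journal.push_back({Record::UPSERT, n, s});
        files[n] = s;
        maybe_compact();
      }
      void remove(const std::string& n) {
        journal.push_back({Record::REMOVE, n, 0});
        files.erase(n);
        maybe_compact();
      }
      void maybe_compact() {                        // rewrite journal as one snapshot
        if (journal.size() < 8) return;
        std::vector<Record> fresh;
        for (const auto& f : files) fresh.push_back({Record::UPSERT, f.first, f.second});
        journal.swap(fresh);
      }
    };

    int main() {
      ToyBlueFS fs;
      fs.upsert("db/000010.sst", 64 << 20);
      fs.upsert("db.wal/000012.log", 16 << 20);
      fs.remove("db.wal/000012.log");
      fs.replay();                                   // simulate a restart
      std::cout << fs.journal.size() << " journal records, "
                << fs.files.size() << " files\n";
    }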
  • 22. ROCKSDB: JOURNAL RECYCLING
    ● rocksdb LogReader only understands two modes
      – read until end of file (need accurate file size)
      – read all valid records, then ignore zeros at end (need zeroed tail)
    ● writing to “fresh” log “files” means > 1 IO for a log append
    ● modified upstream rocksdb to re-use previous log files
      – now resembles “normal” journaling behavior over a circular buffer
    ● works with vanilla RocksDB on files and on BlueFS
  • 23. MULTI-DEVICE SUPPORT
    ● Single device (HDD or SSD)
      – rocksdb
      – object data
    ● Two devices
      – 128 MB of SSD or NVRAM: rocksdb WAL
      – big device: everything else
    ● Two devices
      – a few GB of SSD: rocksdb WAL, rocksdb (warm data)
      – big device: rocksdb (cold data), object data
    ● Three devices
      – 128 MB NVRAM: rocksdb WAL
      – a few GB SSD: rocksdb (warm data)
      – big device: rocksdb (cold data), object data
  • 25. BLUESTORE METADATA
    ● Partition namespace for different metadata
      – S* – “superblock” metadata for the entire store
      – B* – block allocation metadata (free block bitmap)
      – T* – stats (bytes used, compressed, etc.)
      – C* – collection name → cnode_t
      – O* – object name → onode_t or bnode_t
      – L* – write-ahead log entries, promises of future IO
      – M* – omap (user key/value data, stored in objects)
  • 26. CNODE
    ● Collection metadata
      – Interval of object namespace

        struct spg_t {
          uint64_t pool;
          uint32_t hash;
          shard_id_t shard;
        };
        struct bluestore_cnode_t {
          uint32_t bits;
        };

        shard     pool hash      name       bits
        C<NOSHARD,12,  3d3e0000> “12.e3d3” = <19>

        shard     pool hash      name snap    gen
        O<NOSHARD,12,  3d3d880e, foo, NOSNAP, NOGEN> = …
        O<NOSHARD,12,  3d3d9223, bar, NOSNAP, NOGEN> = …
        O<NOSHARD,12,  3d3e02c2, baz, NOSNAP, NOGEN> = …
        O<NOSHARD,12,  3d3e125d, zip, NOSNAP, NOGEN> = …
        O<NOSHARD,12,  3d3e1d41, dee, NOSNAP, NOGEN> = …
        O<NOSHARD,12,  3d3e3832, dah, NOSNAP, NOGEN> = …

    ● Nice properties
      – Ordered enumeration of objects
      – We can “split” collections by adjusting cnode metadata only
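The sketch below illustrates why ordered keys make collection splitting cheap: with a (pool, hash, name) key encoding, each collection is a contiguous key range, so a split only narrows the range (adjusting the cnode's bits) and never rewrites object keys. The encoding shown is a hypothetical stand-in, not BlueStore's real key format.

    // Illustration: objects keyed by (pool, hash, name) form contiguous ranges,
    // so a collection split only changes range boundaries, never object keys.
    #include <cstdint>
    #include <cstdio>
    #include <iterator>
    #include <map>
    #include <string>

    std::string make_key(uint64_t pool, uint32_t hash, const std::string& name) {
      char buf[32];
      std::snprintf(buf, sizeof(buf), "O.%016llx.%08x.",
                    static_cast<unsigned long long>(pool), hash);
      return std::string(buf) + name;      // fixed-width hex sorts numerically
    }

    int main() {
      std::map<std::string, std::string> kv;   // ordered, like RocksDB
      kv[make_key(12, 0x3d3d880e, "foo")] = "onode";
      kv[make_key(12, 0x3d3d9223, "bar")] = "onode";
      kv[make_key(12, 0x3d3e02c2, "baz")] = "onode";
      kv[make_key(12, 0x3d3e125d, "zip")] = "onode";

      // "Collection" = a hash interval. Splitting it just picks a midpoint;
      // enumerating each child is a range scan, and no object key is rewritten.
      auto lo  = kv.lower_bound(make_key(12, 0x3d3d0000, ""));
      auto mid = kv.lower_bound(make_key(12, 0x3d3e0000, ""));
      auto hi  = kv.lower_bound(make_key(12, 0x3d3f0000, ""));

      std::printf("child A: %zu objects\n", (size_t)std::distance(lo, mid));  // 2
      std::printf("child B: %zu objects\n", (size_t)std::distance(mid, hi));  // 2
    }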
  • 27. ONODE
    ● Per object metadata
      – Lives directly in key/value pair
      – Serializes to 100s of bytes
    ● Size in bytes
    ● Inline attributes (user attr data)
    ● Data pointers (user byte data)
      – lextent_t → (blob, offset, length)
      – blob → (disk extents, csums, ...)
    ● Omap prefix/ID (user k/v data)

        struct bluestore_onode_t {
          uint64_t size;
          map<string,bufferptr> attrs;
          map<uint64_t,bluestore_lextent_t> extent_map;
          uint64_t omap_head;
        };

        struct bluestore_blob_t {
          vector<bluestore_pextent_t> extents;
          uint32_t compressed_length;
          bluestore_extent_ref_map_t ref_map;
          uint8_t csum_type, csum_order;
          bufferptr csum_data;
        };

        struct bluestore_pextent_t {
          uint64_t offset;
          uint64_t length;
        };
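To show how those data pointers are used, here is a simplified, self-contained walk from a logical object offset through an lextent to a blob and then to a disk address. The structs are pared-down versions of the ones above, and the lookup logic is an illustrative approximation, not BlueStore's actual code.

    // Simplified lextent -> blob -> pextent resolution (illustrative only).
    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <vector>

    struct pextent_t { uint64_t offset, length; };        // physical disk extent

    struct blob_t {                                        // pared-down bluestore_blob_t
      std::vector<pextent_t> extents;                      // where the blob lives on disk
      // Map an offset inside the blob to a disk address.
      uint64_t map(uint64_t x) const {
        for (const auto& e : extents) {
          if (x < e.length) return e.offset + x;
          x -= e.length;
        }
        return ~0ull;                                      // out of range
      }
    };

    struct lextent_t { int blob_id; uint64_t blob_off, length; };

    struct onode_t {                                       // pared-down bluestore_onode_t
      uint64_t size = 0;
      std::map<uint64_t, lextent_t> extent_map;            // logical offset -> lextent
      std::map<int, blob_t> blob_map;                      // local blobs (bnode not shown)
    };

    // Resolve one logical byte offset to a disk address.
    uint64_t resolve(const onode_t& o, uint64_t logical_off) {
      auto it = o.extent_map.upper_bound(logical_off);     // first lextent past the offset
      if (it == o.extent_map.begin()) return ~0ull;        // hole before first extent
      --it;
      uint64_t delta = logical_off - it->first;
      if (delta >= it->second.length) return ~0ull;        // hole between extents
      const blob_t& b = o.blob_map.at(it->second.blob_id);
      return b.map(it->second.blob_off + delta);
    }

    int main() {
      onode_t o;
      o.size = 0x200000;
      o.blob_map[1] = blob_t{{{0x8000000, 0x100000}, {0x9000000, 0x100000}}};  // two pextents
      o.extent_map[0] = lextent_t{1, 0, 0x200000};          // whole object -> blob 1
      std::printf("0x%llx\n", (unsigned long long)resolve(o, 0x180000));       // 0x9080000
    }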
  • 28. BNODE
    ● Blob metadata
      – Usually blobs stored in the onode
      – Sometimes we share blocks between objects (usually clones/snaps)
      – We need to reference count those extents
      – We still want to split collections and repartition extent metadata by hash

        shard     pool hash      name snap    gen
        O<NOSHARD,12,  3d3d9223, bar, NOSNAP, NOGEN> = onode
        O<NOSHARD,12,  3d3e02c2>                     = bnode
        O<NOSHARD,12,  3d3e02c2, baz, NOSNAP, NOGEN> = onode
        O<NOSHARD,12,  3d3e125d>                     = bnode
        O<NOSHARD,12,  3d3e125d, zip, NOSNAP, NOGEN> = onode
        O<NOSHARD,12,  3d3e1d41, dee, NOSNAP, NOGEN> = onode
        O<NOSHARD,12,  3d3e3832, dah, NOSNAP, NOGEN> = onode

    ● onode value includes, and bnode value is:
        map<int64_t,bluestore_blob_t> blob_map;
    ● lextent blob ids
      – > 0 → blob in onode
      – < 0 → blob in bnode
  • 29. CHECKSUMS
    ● We scrub... periodically
      – window before we detect error
      – we may read bad data
      – we may not be sure which copy is bad
    ● We want to validate checksum on every read
    ● Must store more metadata in the blobs
      – 32-bit csum metadata for 4 MB object and 4 KB blocks = 4 KB
      – larger csum blocks
        ● csum_order > 12
      – smaller csums
        ● crc32c_8 or 16
    ● IO hints
      – seq read + write → big chunks
      – compression → big chunks
    ● Per-pool policy
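The metadata-size trade-off in the first bullet is easy to check: checksum metadata per object is (object size / checksum block size) times the checksum width, where csum_order is log2 of the block size. A small sketch, using the slide's 4 MB object as input; treat the helper as illustrative rather than BlueStore's actual accounting.

    // Checksum metadata size = (object_size >> csum_order) * csum_width_bytes.
    // csum_order is log2 of the checksum block size (12 -> 4 KB, 16 -> 64 KB).
    #include <cstdint>
    #include <cstdio>

    uint64_t csum_bytes_for(uint64_t object_size, unsigned csum_order,
                            unsigned csum_width_bytes) {
      uint64_t blocks = object_size >> csum_order;
      return blocks * csum_width_bytes;
    }

    int main() {
      const uint64_t obj = 4ull << 20;                                          // 4 MB object
      std::printf("%llu\n", (unsigned long long)csum_bytes_for(obj, 12, 4));    // 4096: crc32c on 4 KB blocks
      std::printf("%llu\n", (unsigned long long)csum_bytes_for(obj, 16, 4));    // 256:  crc32c on 64 KB blocks
      std::printf("%llu\n", (unsigned long long)csum_bytes_for(obj, 12, 1));    // 1024: crc32c_8 on 4 KB blocks
    }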
  • 30. INLINE COMPRESSION
    ● 3x replication is expensive
      – Any scale-out cluster is expensive
    ● Lots of stored data is (highly) compressible
    ● Need largish extents to get compression benefit (64 KB, 128 KB)
      – may need to support small (over)writes
      – overwrites occlude/obscure compressed blobs
      – compacted (rewritten) when > N layers deep
    [diagram: an object's extent map from start of object to end of object; legend: allocated, written, written (compressed), uncompressed blob]
  • 32. DATA PATH BASICS
    Terms
    ● Sequencer
      – An independent, totally ordered queue of transactions
      – One per PG
    ● TransContext
      – State describing an executing transaction

    Two ways to write
    ● New allocation
      – Any write larger than min_alloc_size goes to a new, unused extent on disk
      – Once that IO completes, we commit the transaction
    ● WAL (write-ahead-logged)
      – Commit temporary promise to (over)write data with transaction
        ● includes data!
      – Do async overwrite
      – Then clean up temporary k/v pair
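A compact sketch of the decision just described: large, aligned writes go to freshly allocated space and commit after the data IO completes, while small or unaligned overwrites take the WAL path (data rides along in the k/v commit, then an async overwrite, then cleanup). Only the min_alloc_size name comes from the slide; the alignment check and the example threshold value are illustrative simplifications.

    // Illustrative write-path selection, loosely following the slide's two cases.
    #include <cstdint>
    #include <cstdio>

    enum class WritePath { NewAllocation, WAL };

    // min_alloc_size is the knob named on the slide; the alignment check is an
    // illustrative simplification of "goes to a new, unused extent on disk".
    WritePath choose_path(uint64_t offset, uint64_t length, uint64_t min_alloc_size) {
      bool aligned = (offset % min_alloc_size == 0) && (length % min_alloc_size == 0);
      if (length >= min_alloc_size && aligned)
        return WritePath::NewAllocation;  // write data to new extent, then commit kv
      return WritePath::WAL;              // commit kv (incl. data), async overwrite, clean up
    }

    int main() {
      const uint64_t min_alloc = 64 * 1024;                                   // example value only
      std::printf("%d\n", (int)choose_path(0, 4 * 1024, min_alloc));          // WAL (small write)
      std::printf("%d\n", (int)choose_path(0, 1024 * 1024, min_alloc));       // NewAllocation
      std::printf("%d\n", (int)choose_path(4096, 1024 * 1024, min_alloc));    // WAL (unaligned)
    }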
  • 33. TRANSCONTEXT STATE MACHINE
    [diagram: TransContext state machine with states PREPARE, AIO_WAIT, KV_QUEUED, KV_COMMITTING, WAL_QUEUED, WAL_AIO_WAIT, WAL_CLEANUP, WAL_CLEANUP_COMMITTING, FINISH; annotations: “Initiate some AIO”, “Wait for next TransContext(s) in Sequencer to be ready”, “Sequencer queue”, “Wait for next commit batch”]
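For reference, one plausible reading of the diagram as code: the states as an enum plus the two happy-path sequences (new allocation vs. WAL). This is an interpretation of the picture, not BlueStore's actual implementation, and it omits the less common states.

    // One reading of the TransContext state diagram (illustrative, not Ceph code).
    #include <cstdio>

    enum class State {
      PREPARE, AIO_WAIT, KV_QUEUED, KV_COMMITTING,
      WAL_QUEUED, WAL_AIO_WAIT, WAL_CLEANUP, FINISH
    };

    // Advance one step; `has_wal` selects the longer path that must also apply
    // and then clean up the write-ahead-logged overwrite.
    State next(State s, bool has_wal) {
      switch (s) {
        case State::PREPARE:       return State::AIO_WAIT;      // initiate some AIO
        case State::AIO_WAIT:      return State::KV_QUEUED;     // wait for commit batch
        case State::KV_QUEUED:     return State::KV_COMMITTING;
        case State::KV_COMMITTING: return has_wal ? State::WAL_QUEUED : State::FINISH;
        case State::WAL_QUEUED:    return State::WAL_AIO_WAIT;  // do the deferred overwrite
        case State::WAL_AIO_WAIT:  return State::WAL_CLEANUP;   // drop the temporary k/v pair
        case State::WAL_CLEANUP:   return State::FINISH;
        default:                   return State::FINISH;
      }
    }

    int main() {
      int steps = 0;
      for (State s = State::PREPARE; s != State::FINISH; s = next(s, /*has_wal=*/true))
        ++steps;
      std::printf("WAL path takes %d transitions\n", steps);    // 7
    }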
  • 34. CACHING
    ● OnodeSpace per collection
      – in-memory ghobject_t → Onode map of decoded onodes
    ● BufferSpace for in-memory blobs
      – may contain cached on-disk data
    ● Both buffers and onodes have lifecycles linked to a Cache
      – LRUCache – trivial LRU
      – TwoQCache – implements 2Q cache replacement algorithm (default)
    ● Cache is sharded for parallelism
      – Collection → shard mapping matches OSD's op_wq
      – same CPU context that processes client requests will touch the LRU/2Q lists
      – IO completion execution not yet sharded – TODO?
  • 35. BLOCK FREE LIST
    ● FreelistManager
      – persist list of free extents to key/value store
      – prepare incremental updates for allocate or release
    ● Initial implementation
      – extent-based
          <offset> = <length>
      – kept in-memory copy
      – enforces an ordering on commits; freelist updates had to pass through single thread/lock
          del 1600=100000
          put 1700=0fff00
      – small initial memory footprint, very expensive when fragmented
    ● New bitmap-based approach
          <offset> = <region bitmap>
      – where region is N blocks
        ● 128 blocks = 8 bytes
      – use k/v merge operator to XOR allocation or release
          merge 10=0000000011
          merge 20=1110000000
      – RocksDB log-structured-merge tree coalesces keys during compaction
      – no in-memory state
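A small standalone model of the XOR-merge idea in the bitmap approach: allocation and release of the same blocks submit identical XOR deltas, and the k/v store folds the deltas together at compaction time, so no read-modify-write and no in-memory freelist are needed. This models the merge semantics directly rather than going through RocksDB's MergeOperator API.

    // Standalone model of an XOR merge operator over per-region bitmaps.
    // Allocating and freeing the same blocks submit identical deltas; XOR-folding
    // them in any order yields the final bitmap with no read-modify-write.
    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <utility>
    #include <vector>

    using Region = uint64_t;     // region id = offset / (blocks per region)
    using Bitmap = uint64_t;     // 64 blocks per region in this toy (1 bit per block)

    struct ToyFreelistKV {
      std::map<Region, Bitmap> committed;             // fully compacted state
      std::vector<std::pair<Region, Bitmap>> deltas;  // pending "merge" operands

      void merge(Region r, Bitmap delta) { deltas.push_back({r, delta}); }

      void compact() {                                // what the LSM tree does for us
        for (auto& d : deltas) committed[d.first] ^= d.second;
        deltas.clear();
      }
    };

    // Flip bits [first, first + count) within one region.
    Bitmap bits(unsigned first, unsigned count) {
      Bitmap m = (count >= 64) ? ~0ull : ((1ull << count) - 1);
      return m << first;
    }

    int main() {
      ToyFreelistKV fl;
      fl.merge(10, bits(0, 2));   // allocate 2 blocks in region 10
      fl.merge(20, bits(5, 3));   // allocate 3 blocks in region 20
      fl.merge(10, bits(0, 2));   // release the same 2 blocks: identical delta
      fl.compact();
      std::printf("region 10: %llx  region 20: %llx\n",
                  (unsigned long long)fl.committed[10],
                  (unsigned long long)fl.committed[20]);   // 0 and e0
    }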
  • 36. BLOCK ALLOCATOR
    ● Allocator
      – abstract interface to allocate blocks
    ● StupidAllocator
      – extent-based
        ● bin free extents by size (powers of 2)
        ● choose sufficiently large extent closest to hint
      – highly variable memory usage
        ● btree of free extents
      – implemented, works
      – based on ancient ebofs policy
    ● BitmapAllocator
      – hierarchy of indexes
        ● L1: 2 bits = 2^6 blocks
        ● L2: 2 bits = 2^12 blocks
        ● ...
        ● 00 = all free, 11 = all used, 01 = mix
      – fixed memory consumption
        ● ~35 MB RAM per TB
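A rough sketch of the StupidAllocator policy named above: free extents are binned by power-of-two size, and an allocation takes a sufficiently large extent from an adequate bin, preferring one near the hint. This is a simplified illustration of the stated policy, not the real allocator.

    // Simplified power-of-two binning allocator (illustrates the stated policy only).
    #include <array>
    #include <cstdint>
    #include <cstdio>
    #include <map>

    struct Extent { uint64_t offset, length; };

    struct ToyStupidAllocator {
      // bin index = floor(log2(length)); within a bin, extents are keyed by offset
      std::array<std::map<uint64_t, uint64_t>, 64> bins;

      static int bin_of(uint64_t len) { int b = 0; while (len >>= 1) ++b; return b; }

      // Smallest bin whose extents are all guaranteed to be >= want
      // (a fuller implementation would also check the partially-fitting bin below it).
      static int first_fit_bin(uint64_t want) {
        int b = bin_of(want);
        return (want & (want - 1)) ? b + 1 : b;
      }

      void release(uint64_t off, uint64_t len) { bins[bin_of(len)][off] = len; }

      // Take `want` bytes from a sufficiently large extent, preferring one near `hint`.
      bool allocate(uint64_t want, uint64_t hint, Extent* out) {
        for (int b = first_fit_bin(want); b < 64; ++b) {
          if (bins[b].empty()) continue;
          auto it = bins[b].lower_bound(hint);            // closest at-or-after the hint
          if (it == bins[b].end()) it = bins[b].begin();  // otherwise wrap around
          *out = {it->first, want};
          uint64_t tail_off = it->first + want;
          uint64_t tail_len = it->second - want;
          bins[b].erase(it);
          if (tail_len) release(tail_off, tail_len);      // re-bin the unused remainder
          return true;
        }
        return false;
      }
    };

    int main() {
      ToyStupidAllocator a;
      a.release(0, 1 << 20);                // 1 MB free at offset 0
      a.release(8 << 20, 64 * 1024);        // 64 KB free at offset 8 MB
      Extent e{};
      if (a.allocate(64 * 1024, /*hint=*/8 << 20, &e))
        std::printf("allocated %llu bytes @ %llu\n",
                    (unsigned long long)e.length, (unsigned long long)e.offset);
    }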
  • 37. SMR HDD
    ● 256 MB zones / bands
      – must be written sequentially, but not all at once
      – libzbc supports ZAC and ZBC HDDs
      – host-managed or host-aware
    ● Let's support them natively!
    ● SMRAllocator
      – write pointer per zone
      – used + free counters per zone
      – Bonus: almost no memory!
    ● IO ordering
      – must ensure allocated writes reach disk in order
    ● Cleaning
      – store k/v hints: zone offset → object hash
      – pick emptiest closed zone, scan hints, move objects that are still there
      – opportunistically rewrite objects we read if the zone is flagged for cleaning soon
  • 39. HDD: SEQUENTIAL WRITE
    [chart: Ceph 10.1.0 Bluestore vs Filestore, sequential writes; x-axis: IO size, y-axis: throughput (MB/s); series: FS HDD, BS HDD]
  • 40. HDD: RANDOM WRITE
    [charts: Ceph 10.1.0 Bluestore vs Filestore, random writes; x-axis: IO size; one panel shows throughput (MB/s), the other IOPS; series: FS HDD, BS HDD]
  • 41. HDD: SEQUENTIAL READ
    [chart: Ceph 10.1.0 Bluestore vs Filestore, sequential reads; x-axis: IO size, y-axis: throughput (MB/s); series: FS HDD, BS HDD]
  • 42. HDD: RANDOM READ
    [charts: Ceph 10.1.0 Bluestore vs Filestore, random reads; x-axis: IO size; one panel shows throughput (MB/s), the other IOPS; series: FS HDD, BS HDD]
  • 43. SSD AND NVME?
    ● NVMe journal
      – random writes ~2x faster
      – some testing anomalies (problem with test rig kernel?)
    ● SSD only
      – similar to HDD result
      – small write benefit is more pronounced
    ● NVMe only
      – more testing anomalies on test rig... WIP
  • 45. STATUS
    ● Done
      – fully functional IO path with checksums and compression
      – fsck
      – bitmap-based allocator and freelist
    ● Current efforts
      – optimize metadata encoding efficiency
      – performance tuning
      – ZetaScale key/value db as RocksDB alternative
      – bounds on compressed blob occlusion
    ● Soon
      – per-pool properties that map to compression, checksum, IO hints
      – more performance optimization
      – native SMR HDD support
      – SPDK (kernel bypass for NVMe devices)
  • 46. AVAILABILITY
    ● Experimental backend in Jewel v10.2.z (just released)
      – enable experimental unrecoverable data corrupting features = bluestore rocksdb
      – ceph-disk --bluestore DEV
        ● no multi-device magic provisioning just yet
      – predates checksums and compression
    ● Current master
      – new disk format
      – checksums
      – compression
    ● The goal...
      – stable in Kraken (Fall '16)
      – default in Luminous (Spring '17)
  • 47. SUMMARY
    ● Ceph is great
    ● POSIX was a poor choice for storing objects
    ● RocksDB rocks and was easy to embed
    ● Our new BlueStore backend is awesome
    ● Full data checksums and inline compression!
  • 48. THANK YOU! Patrick McGarry Dir Ceph Community pmcgarry@redhat.com @scuttlemonkey

Editor's Notes

  1. A bit of background: what Ceph is; what FileStore is and why it doesn't work anymore; what NewStore is (the first attempt); and BlueStore, the current effort. High level: how it's structured, the data path, performance numbers. Current status of development, where we're at, and how to try it.
  2. [basic stuff] The original paper used the [last two bullets] but performance has been a challenge compared to raw hardware capabilities
  3. The RADOS cluster is structured as a series of hosts: a collection of OSD daemons sitting in front of HDDs, with a filesystem sitting on top of each disk.
  4. In reality there is a well-contained piece of the OSD called FileStore that is responsible for writing that data to the filesystem on that disk. It's that piece that is getting replaced.
  5. FileStore implements an interface called ObjectStore: an abstract interface that describes how each OSD daemon stores data on its local disk (just the local disk). The larger Ceph system is responsible for replicating across multiple OSDs. Originally there were two implementations, EBOFS and FileStore. It is built around two abstractions. Objects (sort of like files): data (a bunch of bytes), attributes (extended attributes), and omap (an unbounded key/value store, less commonly used). Collections (directories, i.e. groups of objects): a pool of objects is sharded into PGs, and PGs map to collections. All writes are transactions, applied atomically, consistently, and durably; don't worry about the I in ACID (that's provided by another layer).
  6. EBOFS was first: a user-space, extent-based, copy-on-write btree filesystem (full control of the stack, most natural interface). We got rid of it and switched to writing to btrfs in 2009, which had everything we needed and a growing community. FileStore writes objects as files, with leveldb for xattrs (when they are too big). Originally it was just for development without dedicated disks; it morphed into production. OSD dir: a dir for each PG. DB dir: holds leveldb. Meta dir: high-level metadata objects for the OSD as a whole.
  7. Because this is built on an existing FS, we are constrained by POSIX, which has problems. First, the interface wants to provide atomicity (because the OSD is managing the consistency of the data it stores locally; if it fails it can recover and resync with other replicas), so we need that transactionality. In practice most transactions are pretty simple: write some bytes, attr = what version, log = what version. But we can't rely on that simplicity. On the right is an example of one of these transactions.
  8. Initially, to support these, we tied into btrfs. It had an ioctl that we'd use to bracket all of our work, to prevent btrfs from committing a transaction while we were in the middle of ours. Internal checkpoints got us most of the way there. The problem is: what happens if the OSD daemon crashes and doesn't finish writing the full transaction? Btrfs would see the write start and some of the writes but never the end, so it would never get the second half. We got around that by adding a very horrible mount option to deliberately make btrfs wedge itself and crash. Internally there was no option for rollback; btrfs was not meant to be transactional in that way, and it's hard to shoehorn that in later. It didn't work, so instead we...
  9. Did a write-ahead journal, serialize into a sequence of bytes In btr we could be a little bit clever – snapshot == full checkpoint (after checkpoint, we could trim journal). If OSD restarted, roll back to snapshot and replay journal (nice consistency model) Non-btr not so elegant – still do periodic sync, but on restart we just replayed the journal blindly, might be repeating operations unfortunately the objects or interfaces that were supported aren’t all idempotent – had things like renames/clones/etc – whole bunch of hackery so we don’t apply those operations twice (kinda nasty…but it works) Write twice – journal + disk…this halves disk throughput
  10. Another place POSIX gets in our way is enumeration. Objects are distributed in a pool based on a 32-bit hash, and we do enumeration in hash order for scrub, for backfill, and when you request a list of objects via the API. POSIX readdir order is essentially random. We also need the ability to take a given collection and split it in half, quarters, etc.; part of Ceph is that we can repartition our data collections. You can't do that with POSIX: you can't take a dir of a million files and split it into two dirs. In practice, in FileStore we build an ugly tree of directories and files where the dir names are based on the prefix of the hash for the file (a deeply nested structure, similar to what other projects do). Not terribly efficient because of the complicated dir structure; we hit some bottlenecks.
  11. Time to do something different. POSIX is more trouble than it's worth.
  12. Objects aren't files; collections aren't directories. Use an ordered k/v database: RocksDB (picked somewhat randomly); the idea is that you plug in your k/v db (rocksdb / leveldb / any kv db). The actual data for an object is written to a simple file with a simple (short!) name, in nice big efficient directories.
  13. This didn't work very well. The main issue is that RocksDB has a write-ahead journal to maintain its consistency, and the FS also has a journal. Journal-on-journal is very inefficient (there are papers about it): each journal manages half of the overall consistency of the system, so you pay the overhead twice. When writing a file in NewStore you write the blob file and do an fsync: one IO with the file data, another IO to the FS journal, flushing the device twice. Then NewStore updates metadata: append a record to RocksDB, append to the RocksDB log file, then fsync that: another two IOs, one for the RocksDB log and one for the FS log file. You pay four IOs when you want to pay two. The solution is to put everything in one big journal.
  14. The problem is that the system still needs atomicity for overwrites. In POSIX you can't overwrite part of a file that already exists as part of a larger transaction (POSIX doesn't understand transactions). In Ceph we need these overwrites to be atomic (so they don't overwrite things unless they are ready to be committed). We could have had NewStore write to a new file, but that leads to a big, complex mapping structure. You end up where we were before, with write-ahead logging.
  15. The allocator is something we used to get from XFS or whatever, and now do ourselves. We have to share the block device with RocksDB (which writes a bunch of files, like its log file). We do that by implementing a RocksDB backend: there is a nicely abstracted Env class that captures the platform-dependent stuff. We implement a very simple FS (just complicated enough to support RocksDB's operations).
  16. All metadata is stored in RAM. The idea is to write to the journal: write updates to fnodes (like inodes) as they happen. When you hit a threshold, you rewrite the whole thing in a more compact form. RocksDB writes big files only, so that keeps things simple. BlueFS is smart about multiple devices (RocksDB writes different types of data to different dirs, e.g. logs to SSD). BlueStore and BlueFS communicate so that as BlueFS runs out of space, BlueStore gives it more, and vice versa.
  17. Did one tricky thing w/ rocksdb upstream Rocksdb written to use logfiles (journal) – write a new log file each time which leads to a pretty inefficient io pattern Every file system / db that does data logging uses a circular buffer – so we implemented that
  18. Two devices is like what people do now (SSD journal + multiple HDDs for data). The larger device in a two-device setup can do more. Three devices could be split even further. We don't support BlueStore tiering of object data yet, but we are exploring it.
  19. Ordered enumeration of objects: we carefully construct keys that sort in the order we want. Because objects are in hash order, we can take a collection that represents a range and split it into two collections without rewriting any k/v pairs (just change the collection metadata to arbitrarily carve it into two pieces). This is something FileStore had to work hard to do.
  20. ONODEs store per-object metadata. The main things in here are: the size of the object in bytes; inline attributes like ver=2; data pointers that indicate where the byte data is stored on disk; and an "omap head" that says, if you have user data stored as k/v data, where to find it.
  21. One other structure. We need to store metadata about the blob: the ONODE has a mapping from object space to logical extents, which map to blobs, but it doesn't always contain the blobs themselves. Usually they are stored next to the ONODE, but occasionally blocks are shared, with multiple ONODEs mapping to the same blobs. It is a map from an identifier to a blob; the blob tells you where to find the data.
  22. Blobs let us do checksums (every day = metadata, every week = data) With bluestore we want to validate a checksums on every read – that means bluestore blobs have to store more metadata (to include checksum) Use industry standard crc32c IOHints – (we control whole stack) things like RGW = read/write sequentially (no small overwrites) -> large checksum block If we compress a block, checksum for entire region Idea is policies for a pool basis
  23. 3x is expensive Bluestore implements in-line compression Trick is when you need to support overwrites (hopefully diagram makes sense) Figuring out performance is future work
  24. How the code flows when we’re taking data from OSD to disk Sequencer – independent stream fed to object store (1 per PG) Each transaction is represented by a transcontext New allocation – (most of the time) new region of disk, update metadata to point to the data WAL – (sometimes small writes) temporary k/v pair in rocksdb – effectively data journaling like filestore, only do it with small writes
  25. Complicated slide describes flow of transactions through this process
  26. Bluestore implements its own cache in user space memory (not using the kernel for any caching)
  27. Couple other things that happen Freelist – keeps track of unused space on disk
  28. A separate module is responsible for deciding where we should allocate new data. It's pluggable and has two implementations: StupidAllocator (not bad, but highly variable memory usage) and BitmapAllocator (a new implementation from SanDisk).
  29. Because the allocator is pluggable, we also have a GSoC student who is adding support for SMR hard disks (annoying: they prevent overwrites, so you have to write in stripes).
  30. These graphs were produced a couple of months ago; they are preliminary and not super detailed.
  31. Sequential write for spinning platter – large io is twice as fast (as you would expect, removing double writes)
  32. Random writes are much better, also about twice as fast (left is streaming throughput, right is IOPS) – kink between 32k and 64k writes is where we transition from WAL to writing to a new region of disk
  33. Sequential reads are a little more interesting, high end we’re a little better, low end we’re the same…middle there is a dip pattern Newstore is based on XFS with the readahead Bluestore isn’t…b/c ceph has its own read ahead (cephfs, rbd, radosgw all have their own) ...this is faster when you look at the client level, but not at the OSD
  34. Random reads, sort of what you’d expect. Small IO our metadata is more efficient
  35. Did do SSD and NVME but don’t have graphs