RBD: What Will the Future Bring?
Jason Dillaman
RBD Project Technical Lead
Introductions
Jason Dillaman
● Ceph developer since 2014
● Lead for RADOS Block Device (RBD)
● Principal Software Engineer at Red Hat
● dillaman@redhat.com
Ceph Components
RGW
A web services gateway for
object storage, compatible with
S3 and Swift
LIBRADOS
A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
RADOS
A software-based, reliable, autonomous, distributed object store comprised of
self-healing, self-managing, intelligent storage nodes and lightweight monitors
RBD
A reliable, fully-distributed block
device with cloud platform
integration
CEPHFS
A distributed file system with
POSIX semantics and scale-out
metadata management
OBJECT BLOCK FILE
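LIBRADOS also backs the "rados" CLI, which gives a quick feel for the object layer underneath RBD; a tiny illustrative session (pool and object names assumed):

    rados -p rbd put greeting ./hello.txt   # store a local file as an object
    rados -p rbd ls                         # list objects in the pool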
RADOS Block Device
● Block device abstraction
● Striped over fixed-size objects
● Highlights
– Broad integration
– Thinly provisioned
– Snapshots
– Copy-on-write clones
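A minimal illustrative session showing these highlights (pool and image names hypothetical):

    rbd create --size 10G rbd/demo     # thin: space consumed only as written
    rbd info rbd/demo                  # object size, striping, features
    rbd snap create rbd/demo@first     # point-in-time snapshot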
RADOS Block Device
[Diagram: an RBD image's IO reaches the cluster either from a user-space client (LIBRBD on LIBRADOS) or from a Linux host (KRBD on LIBCEPH), with the monitors (M) coordinating the cluster]
New Features in Mimic
Deep-Copy of Images
● Generic version of mirroring’s “image sync” logic
– Preserves clone parent linkage and snapshots
– Sparse copy of backing objects
● Useful for copying between pools or changing fixed
image format parameters
● Optionally flatten copies of cloned images
● Usage:
– “rbd deep copy <src-snap-spec> <dest-image-spec>”
– Similar to “rbd export --export-format 2 <src-snap-spec> -
| rbd import --export-format 2 <dest-image-spec> -”
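A hedged sketch of a cross-pool copy (all names hypothetical; the --flatten option, assuming your build supports it, detaches the copy from any clone parent):

    rbd snap create rbd/vm1@base
    rbd deep copy --flatten rbd/vm1@base ssd/vm1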
Live Image Migration
● Live “deep-copy” of image
– Safe to use with concurrent IO against image
● Reads follow hierarchy chain
● Writes only affect destination
– May require “copy-up” from hierarchy chain
[Diagram: during migration, the user-space client’s reads follow the chain parent (optional) → source → destination, while writes go only to the destination]
Live Image Migration, cont.
● Useful for copying between pools or changing fixed image
format parameters
– Cannot live-migrate between clusters (use mirroring)
● Alternative to bolt-on solutions
– QEMU drive-mirror
– Kernel software RAID “repair”
● Usage:
– “rbd migration prepare <src-image> <dst-image>”
– “rbd migration execute”
– “rbd migration commit”
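Putting the three steps together (names illustrative; clients can keep using the image between prepare and commit):

    rbd migration prepare rbd/vm1 fast/vm1   # destination takes over the image
    rbd migration execute fast/vm1           # background deep-copy of the data
    rbd migration commit fast/vm1            # remove the source when done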
Image Cloning (v2)
● Simplified operations for the existing clone feature
– No change to data flow handling
– Supported in krbd (new feature bit whitelisted)
● Parent snapshots no longer need to be protected
– Atomic reference counting in the OSD
– OSD “profile rbd[-read-only]” caps whitelist support
● Snapshots with linked children can be “deleted”
– In reality: moved to “trash” snapshot namespace
– Trashed snapshot auto-removed once the last child is gone
Image Cloning (v2), cont.
● Support enabled automatically once Mimic clients are
required
– “ceph osd set-require-min-compat-client mimic”
– Config override: “rbd default clone format = [1, 2, auto]”
● Usage:
– “rbd snap create <parent-image>@<snap>”
– “rbd clone <parent-image>@<snap> <child image>”
– “rbd snap rm <parent-image>@<snap>”
– “rbd snap ls --all parent”
SNAPID NAME SIZE TIMESTAMP NAMESPACE
4 29da36c8-9ff6-46b7-9888-9e9804c22525 1024 kB Sun Mar 18 20:41:13 2018 trash
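End to end, the v2 workflow looks like this (names illustrative; note the absence of any “snap protect” step):

    ceph osd set-require-min-compat-client mimic
    rbd snap create rbd/golden@base
    rbd clone rbd/golden@base rbd/vm1    # no protect required with clone v2
    rbd snap rm rbd/golden@base          # moved to the “trash” namespace
    rbd snap ls --all rbd/golden         # trashed snapshot still visible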
Image Groups
● Associate related images together
● Perform snapshots concurrently against grouped images
● Future work:
– Need to add support for group cloning
– Need to add support for group journaling (consistency
group)
● Usage:
– “rbd group create <group-name>”
– “rbd group image add <group-name> <image-name>”
– “rbd group snap create <group-name>@<snap-name>”
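For example, to snapshot a database’s data and log images as a mutually consistent pair (all names hypothetical):

    rbd group create rbd/dbgroup
    rbd group image add rbd/dbgroup rbd/db-data
    rbd group image add rbd/dbgroup rbd/db-log
    rbd group snap create rbd/dbgroup@pre-upgrade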
Active/Active Mirroring
● Active/Passive HA support added in Luminous
– A single daemon per pool is elected leader
– Failure of the leader triggers a new leader election
● Active/Active: the leader assigns images to followers
– Simple “equal # of images” policy
– Smarter load-based policies possible in future
[Diagram: three rbd-mirror daemons (each using LIBRBD) in site A; one acts as leader and two as followers, with the leader acquiring/releasing images to balance them]
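Deployment-wise, active/active just means running additional daemons; a sketch assuming the stock systemd units:

    systemctl enable --now ceph-rbd-mirror@a   # each instance joins the
    systemctl enable --now ceph-rbd-mirror@b   # leader election automatically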
Additional Mirroring Improvements
● Deferred Image Deletions
– Propagated deletes from primary move non-primary
image to RBD trash
– Config override: “rbd mirroring delete delay =
<seconds>”
– Restore via “rbd trash restore <image-id-spec>”
● Clone non-primary images
– Helpful for OpenStack Glance-style workloads
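For instance, to keep deleted non-primary images for a day before purging (value illustrative; set on the rbd-mirror host):

    # ceph.conf on the rbd-mirror node
    [client]
        rbd mirroring delete delay = 86400   # one day, in seconds

    rbd trash ls                        # later: find the image id
    rbd trash restore <image-id-spec>   # undo the deferred delete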
Minor Improvements
● IOPS throttling
– Client-side rate limiter
– Config override: “rbd qos iops limit = X”
● Trash purge CLI
– Human-readable “--expired-before X” optional
– Pool-usage “--threshold <percent>” optional
● Creation of image format v1 disabled
– Deprecated since Jewel
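Two illustrative invocations (values assumed):

    # ceph.conf: throttle this client's images to 1000 IOPS
    [client]
        rbd qos iops limit = 1000

    # purge only trashed images that expired over a week ago
    rbd trash purge --expired-before "7 days ago"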
Out-of-Tree Features /
Works-in-Progress
Linux Kernel RBD Driver (krbd)
● Support for “fancy” image striping
– Already supported simple striping (stripe unit ==
object size, stripe count == 1)
– Might be preferable in small, sequential IO
workloads
– (Hopefully) included in 4.17
● Support for object-map/fast-diff in progress
● Support for deep-flatten is planned
● New operations feature whitelisted in 4.16-rc1
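An illustrative fancy-striped image for such a kernel (sizes assumed; map will fail on kernels without the feature):

    rbd create --size 10G --stripe-unit 65536 --stripe-count 16 rbd/striped
    rbd map rbd/striped    # needs a krbd with fancy-striping support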
Linux IO (LIO) Target
● LIO + tcmu-runner provide a SCSI/iSCSI interface to RBD
● Live LUN resize support
● Persistent Group Reservations (PGRs)
– Required for Windows Clustering
– Required for VMware Certification
● Graceful failover for shared LUNs
– Avoid active/optimized path ping-pong when a failure is
visible from only a subset of initiators
● Integrate more storage admin workflows into tools
Works-in-Progress
QoS Reservations
● Utilizes the dmClock work scheduler within the OSDs
– Conditional on its completion
● Minimum IOPS reservation vs throttling
– Reduce impact from noisy neighbors
● Interface / CLI changes still TBD
Persistent Read-Only Cache
● Client-side, hot-spot cache for read-only image parent
objects
● Offload the cluster for heavily re-used parents
– VDI infrastructure / common “golden” images
[Diagram: LIBRBD in the client queries a local rbd-cache daemon for object hits; objects are read/promoted into the cache and demoted out of it]
Persistent Writeback Cache
Goal: reduce tail latency
Persistent Writeback Cache, cont.
● Replicated Write Log (RWL) plug-in for librbd cache
● All writes are streamed to persistent storage
– Writethrough mode: safely persisted locally on each
write
– Writeback mode: safely persisted locally on flush
● Uses the pmem.io libraries for atomic operations
– Optimized for persistent memory (e.g. NVDIMM)
– Optionally supports librpmem for RDMA streaming to a
secondary host
Persistent Writeback Cache, cont.
[Diagram: writes from the LIBRBD client are appended to a log in PMEM with periodic sync points, then written back to the Ceph cluster in batches]
Performance Improvements
● librbd uses more CPU per IOP than krbd
● Minor tweaks to librbd/librados can narrow the gap
[Chart: 4K rand-read and 4K rand-write comparisons of librbd (current), librbd (tweaked), and krbd, plotted both as raw IOPS and as IOPS per %CPU; the tweaked librbd narrows the gap to krbd]
Performance Improvements
● Current client libraries
[Flame graph: CPU profile of the current client libraries during a fio rbd 4K random-read run; librbd::io request dispatch, boost::variant visitation, the Objecter, and the AsyncMessenger/TCP stack dominate the stacks]
Performance Improvements, cont.
● Tweaked client libraries
[Flame graph: CPU profile of the tweaked client libraries for the same fio rbd 4K random-read run, showing the same librbd::io, Objecter, and AsyncMessenger code paths]
Feature Backlog
● librbd
– Improved image/pool configuration overrides
– Client-side encryption
– Improve performance
● Mirroring
– Deep scrub and repair
– QoS throttles
– Improved stats
– Improve performance
Questions?
Thank You!
