3. Ceph Components
RGW
A web services gateway for object storage, compatible with S3 and Swift
LIBRADOS
A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
RADOS
A software-based, reliable, autonomous, distributed object store comprised of
self-healing, self-managing, intelligent storage nodes and lightweight monitors
RBD
A reliable, fully-distributed block device with cloud platform integration
CEPHFS
A distributed file system with POSIX semantics and scale-out metadata management
(Diagram: RGW, RBD, and CephFS provide object, block, and file access on top of RADOS)
7. Deep-Copy of Images
● Generic version of mirroring’s “image sync” logic
– Preserves clone parent linkage and snapshots
– Sparse copy of backing objects
● Useful for copying between pools or changing fixed
image format parameters
● Optionally flatten copies of cloned images
● Usage:
– “rbd deep copy <src-snap-spec> <dest-image-spec>”
– Similar to “rbd export --export-format 2 <src-snap-spec> -
| rbd import --export-format 2 <dest-image-spec> -”
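The "sparse copy of backing objects" above can be sketched in Python (a toy model, not librbd code; the object size and image layout are made-up assumptions for illustration):

```python
# Illustrative sketch: a sparse copy skips backing objects that are
# unallocated or entirely zero, so the destination image only stores
# data-bearing objects.

def sparse_copy(src_objects):
    """Copy only objects that contain data; holes stay sparse."""
    dst = {}
    for objno, data in src_objects.items():
        if data is None or all(b == 0 for b in data):
            continue  # hole: nothing is written to the destination
        dst[objno] = bytes(data)
    return dst

# object 1 is all zeros, object 2 was never allocated
src = {0: b"\x01\x02\x03\x04", 1: b"\x00\x00\x00\x00", 2: None, 3: b"abcd"}
dst = sparse_copy(src)
print(sorted(dst))  # only the data-bearing objects were copied: [0, 3]
```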
8. Live Image Migration
● Live “deep-copy” of image
– Safe to use with concurrent IO against image
● Reads follow hierarchy chain
● Writes only affect destination
– May require “copy-up” from hierarchy chain
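A minimal Python model of the read/write semantics above (illustrative only, not librbd code; the `Image` class, object size, and copy-up granularity are simplified assumptions):

```python
# Illustrative sketch of hierarchy-chain IO: reads fall through to the
# parent when the child has no copy of an object; a write first
# "copies up" the parent's object into the child, so the child owns
# the full object afterwards.

class Image:
    def __init__(self, parent=None):
        self.objects = {}   # objno -> bytearray
        self.parent = parent

    def read(self, objno):
        if objno in self.objects:
            return bytes(self.objects[objno])
        if self.parent is not None:
            return self.parent.read(objno)   # follow hierarchy chain
        return b"\x00" * 4                   # unallocated: zeros

    def write(self, objno, offset, data):
        if objno not in self.objects:
            # copy-up: materialize the parent's object in this image
            self.objects[objno] = bytearray(self.read(objno))
        self.objects[objno][offset:offset + len(data)] = data

parent = Image()
parent.objects[0] = bytearray(b"base")
child = Image(parent=parent)

print(child.read(0))        # served from the parent: b'base'
child.write(0, 0, b"X")     # copy-up, then overwrite one byte
print(child.read(0))        # now served from the child: b'Xase'
print(parent.read(0))       # the parent is untouched: b'base'
```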
(Diagram: a user-space client reads and writes through LIBRBD/LIBRADOS; reads are served from the SOURCE and optional PARENT, writes go to the DESTINATION)
9. Live Image Migration, cont
● Useful for copying between pools or changing fixed image
format parameters
– Cannot live-migrate between clusters (use mirroring)
● Alternative to bolt-on solutions
– QEMU drive-mirror
– Kernel software RAID “repair”
● Usage:
– “rbd migration prepare <src-image> <dst-image>”
– “rbd migration execute”
– “rbd migration commit”
10. Image Cloning (v2)
● Simplified operations for existing clone feature
– No change to data flow handling
– Supported in krbd (new feature bit whitelisted)
● Parent snapshots no longer need to be protected
– Atomic reference counting in the OSD
– OSD “profile rbd[-read-only]” caps whitelist support
● Snapshots with linked children can be “deleted”
– In reality: moved to “trash” snapshot namespace
– Trashed snapshot auto-removed
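The trash-namespace lifecycle above can be sketched as a toy Python model (not OSD code; the `Snapshot` class and its fields are invented for illustration, and the real reference count is maintained atomically inside the OSD):

```python
# Illustrative sketch of clone v2 snapshot lifetime: deleting a
# snapshot that still has linked clones moves it to the "trash"
# namespace; once the last child detaches, it is removed for real.

class Snapshot:
    def __init__(self, name):
        self.name = name
        self.children = 0        # atomic refcount in the real OSD
        self.namespace = "user"
        self.exists = True

    def clone(self):
        self.children += 1

    def detach_child(self):
        self.children -= 1
        if self.namespace == "trash" and self.children == 0:
            self.exists = False  # trashed snapshot auto-removed

    def remove(self):
        if self.children > 0:
            self.namespace = "trash"  # deferred deletion
        else:
            self.exists = False

snap = Snapshot("golden")
snap.clone()
snap.clone()
snap.remove()
print(snap.namespace, snap.exists)   # trash True
snap.detach_child()
snap.detach_child()
print(snap.exists)                   # False
```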
11. Image Cloning (v2), cont
● Support enabled automatically when Mimic clients are
required
– “ceph osd set-require-min-compat-client mimic”
– Config override: “rbd default clone format = [1, 2, auto]”
● Usage:
– “rbd snap create <parent-image>@<snap>”
– “rbd clone <parent-image>@<snap> <child image>”
– “rbd snap rm <parent-image>@<snap>”
– “rbd snap ls --all parent”
SNAPID NAME SIZE TIMESTAMP NAMESPACE
4 29da36c8-9ff6-46b7-9888-9e9804c22525 1024 kB Sun Mar 18 20:41:13 2018 trash
12. Image Groups
● Associate related-images together
● Perform snapshots concurrently against grouped images
● Future work:
– Need to add support for group cloning
– Need to add support for group journaling (consistency
group)
● Usage:
– “rbd group create <group-name>”
– “rbd group image add <group-name> <image-name>”
– “rbd group snap create <group-name>@<snap-name>”
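The group-snapshot behavior above can be sketched in Python (illustrative only, not librbd code; the `Img` class and the three-phase sequence are simplifying assumptions about how a crash-consistent group snapshot could be taken):

```python
# Illustrative sketch of a group snapshot: all member images are
# quiesced first, then snapshotted, then resumed, so the per-image
# snapshots form one mutually consistent point in time.

class Img:
    def __init__(self, name):
        self.name = name
        self.quiesced = False
        self.snaps = []

def group_snap(images, snap_name):
    for img in images:        # phase 1: stop new writes everywhere
        img.quiesced = True
    for img in images:        # phase 2: snapshot while quiesced
        img.snaps.append(snap_name)
    for img in images:        # phase 3: resume IO
        img.quiesced = False

grp = [Img("db-data"), Img("db-wal")]
group_snap(grp, "backup-1")
print([i.snaps for i in grp])  # [['backup-1'], ['backup-1']]
```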
13. Active/Active Mirroring
● Active/Passive HA support added in Luminous
– Single daemon-per-pool elected as leader
– Failure of leader will result in new leader election
● Active/Active leader assigns images to followers
– Simple “equal # of images” policy
– Smarter load-based policies possible in future
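The "equal # of images" policy can be sketched as follows (a toy model, not the actual rbd-mirror leader code; daemon and image names are made up):

```python
# Illustrative sketch: the elected leader hands each image to the
# rbd-mirror daemon currently holding the fewest images, yielding an
# (approximately) equal number of images per daemon.

def assign(images, daemons):
    """Return daemon -> list of assigned images."""
    load = {d: [] for d in daemons}
    for image in images:
        # pick the least-loaded daemon (ties broken by name)
        target = min(load, key=lambda d: (len(load[d]), d))
        load[target].append(image)
    return load

load = assign([f"img{i}" for i in range(5)], ["a", "b", "c"])
print({d: len(imgs) for d, imgs in load.items()})  # {'a': 2, 'b': 2, 'c': 1}
```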
(Diagram: SITE A runs three rbd-mirror daemons on LIBRBD, one LEADER and two FOLLOWERs; the leader acquires/releases images to the followers)
17. Linux Kernel RBD Driver (krbd)
● Support for “fancy” image striping
– Already supported simple striping (stripe unit ==
object size, stripe count == 1)
– Might be preferable in small, sequential IO
workloads
– (Hopefully) included in 4.17
● Support for object-map/fast-diff in progress
● Support for deep-flatten is planned
● New operations feature whitelisted in 4.16 rc1
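The "fancy" striping layout can be illustrated with the standard RADOS striping address math (a sketch; the parameter values are made up, and the simple layout falls out as the special case stripe_unit == object_size, stripe_count == 1):

```python
# Illustrative sketch of striped address math: consecutive stripe
# units fan out across stripe_count objects, which is why small
# sequential IO can benefit from fancy striping.

def map_offset(off, object_size, stripe_unit, stripe_count):
    """Map a logical byte offset to (object number, offset in object)."""
    su_per_object = object_size // stripe_unit
    block = off // stripe_unit              # which stripe unit overall
    stripe = block // stripe_count          # which stripe
    stripe_pos = block % stripe_count       # which object within the set
    object_set = stripe // su_per_object    # which group of objects
    objno = object_set * stripe_count + stripe_pos
    obj_off = (stripe % su_per_object) * stripe_unit + off % stripe_unit
    return objno, obj_off

# simple striping: offsets below object_size all land in object 0
print(map_offset(5, object_size=8, stripe_unit=8, stripe_count=1))  # (0, 5)
# fancy striping: sequential IO fans out across objects
print(map_offset(0, object_size=8, stripe_unit=4, stripe_count=2))  # (0, 0)
print(map_offset(4, object_size=8, stripe_unit=4, stripe_count=2))  # (1, 0)
print(map_offset(8, object_size=8, stripe_unit=4, stripe_count=2))  # (0, 4)
```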
18. Linui IO (LIO) Target
● LIO + tcmu-runner provide SCSI/iSCSI interface to RBD
● Live LUN resize support
● Persistent Group Reservations (PGRs)
– Required for Windows Clustering
– Required for VMware Certification
● Graceful failover for shared LUNs
– Avoid active/optimized path ping-pong when failure
visible from only a subset of initiators
● Integrate more storage admin workflows into tools
20. QoS Reservations
● Utilizes the dmClock work scheduler within the OSDs
– Conditional on its completion
● Minimum IOPS reservation vs. throttling
– Reduce impact from noisy neighbors
● Interface / CLI changes still TBD
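The reservation-vs.-throttling distinction can be sketched as follows (a toy two-pass scheduler, much simpler than dmClock, which tags requests with reservation/weight/limit; the client names and numbers are made up):

```python
# Illustrative sketch: each client is first served up to its minimum
# IOPS reservation, and only the remaining capacity is shared, so a
# noisy neighbor cannot starve a client with a reservation.

def schedule(capacity, clients):
    """clients: name -> (reservation, demand). Returns name -> granted IOPS."""
    granted = {}
    for name, (res, demand) in clients.items():
        granted[name] = min(res, demand)      # honor reservations first
        capacity -= granted[name]
    for name, (res, demand) in clients.items():
        extra = min(demand - granted[name], capacity)
        granted[name] += extra                # best-effort for the rest
        capacity -= extra
    return granted

out = schedule(1000, {"db": (400, 500), "batch": (0, 2000)})
print(out)  # the db client gets its reservation plus some remainder
```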
21. Persistent Read-Only Cache
● Client-side, hot-spot cache for read-only image parent
objects
● Offload cluster for heavily re-used parents
– VDI infrastructure / common “golden” images
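The promote/hit/demote cycle can be sketched with a small LRU cache (illustrative only, not the rbd-cache daemon; the `ReadOnlyCache` class, capacity, and object names are invented for the example):

```python
# Illustrative sketch of a client-side read-only object cache: parent
# objects are promoted on first read, repeated reads are object hits,
# and the least-recently-used object is demoted when the cache fills.
from collections import OrderedDict

class ReadOnlyCache:
    def __init__(self, capacity, backend):
        self.capacity = capacity
        self.backend = backend       # stands in for the cluster
        self.cache = OrderedDict()   # objno -> data, in LRU order
        self.hits = self.misses = 0

    def read(self, objno):
        if objno in self.cache:              # object hit
            self.cache.move_to_end(objno)
            self.hits += 1
            return self.cache[objno]
        self.misses += 1
        data = self.backend[objno]           # read from the cluster
        self.cache[objno] = data             # promote object
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # demote LRU object
        return data

parent = {n: b"obj%d" % n for n in range(4)}
cache = ReadOnlyCache(capacity=2, backend=parent)
for objno in (0, 1, 0, 2, 0):   # the hot "golden" object 0 stays cached
    cache.read(objno)
print(cache.hits, cache.misses, sorted(cache.cache))  # 2 3 [0, 2]
```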
(Diagram: the client's LIBRBD talks to an rbd-cache daemon; it issues object-hit queries, reads/promotes objects into the cache, and demotes cold objects)