
Ceph Block Devices: A Deep Dive


In this session, you'll learn how RBD works, including how it:

Uses RADOS classes to make access easier from user space and within the Linux kernel.
Implements thin provisioning.
Builds on RADOS self-managed snapshots for cloning and differential backups.
Increases performance with caching of various kinds.
Uses watch/notify RADOS primitives to handle online management operations.
Integrates with QEMU, libvirt, and OpenStack.


Ceph Block Devices: A Deep Dive

  1. 1. Ceph Block Devices: A Deep Dive Josh Durgin RBD Lead June 24, 2015
  2. 2. Ceph Motivating Principles ● All components must scale horizontally ● There can be no single point of failure ● The solution must be hardware agnostic ● Should use commodity hardware ● Self-manage wherever possible ● Open Source (LGPL) ● Move beyond legacy approaches – client/cluster instead of client/server – Ad hoc HA
  3. 3. Ceph Components RGW A web services gateway for object storage, compatible with S3 and Swift LIBRADOS A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP) RADOS A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors RBD A reliable, fully- distributed block device with cloud platform integration CEPHFS A distributed file system with POSIX semantics and scale-out metadata management APP HOST/VM CLIENT
  4. 4. Ceph Components RGW A web services gateway for object storage, compatible with S3 and Swift LIBRADOS A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP) RADOS A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors RBD A reliable, fully- distributed block device with cloud platform integration CEPHFS A distributed file system with POSIX semantics and scale-out metadata management APP HOST/VM CLIENT
  5. 5. Storing Virtual Disks: a VM on a hypervisor reaches the RADOS cluster through librbd
  6. 6. Kernel Module: a Linux host reaches the RADOS cluster through krbd
  7. 7. RBD ● Stripe images across entire cluster (pool) ● Read-only snapshots ● Copy-on-write clones ● Broad integration – QEMU, libvirt – Linux kernel – iSCSI (STGT, LIO) – OpenStack, CloudStack, OpenNebula, Ganeti, Proxmox, oVirt ● Incremental backup (relative to snapshots)
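A minimal sketch of driving RBD from user space with the python-rados and python-rbd bindings; the pool name ('rbd') and image name ('vm-disk-1') are placeholders, not anything from the deck.

```python
import rados
import rbd

# Connect to the cluster using the local ceph.conf and default keyring.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')          # example pool
    try:
        # Create a 10 GiB image; data objects are allocated lazily (thin provisioning).
        rbd.RBD().create(ioctx, 'vm-disk-1', 10 * 1024**3)
        with rbd.Image(ioctx, 'vm-disk-1') as image:
            image.write(b'hello', 0)           # lands in the first rbd_data object
            print(image.size(), image.stat()['obj_size'])
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```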
  8. 8. Ceph Components RGW A web services gateway for object storage, compatible with S3 and Swift LIBRADOS A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP) RADOS A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors RBD A reliable, fully- distributed block device with cloud platform integration CEPHFS A distributed file system with POSIX semantics and scale-out metadata management APP HOST/VM CLIENT
  9. 9. RADOS ● Flat namespace within a pool ● Rich object API – Bytes, attributes, key/value data – Partial overwrite of existing data – Single-object compound atomic operations – RADOS classes (stored procedures) ● Strong consistency (CP system) ● Infrastructure aware, dynamic topology ● Hash-based placement (CRUSH) ● Direct client to server data path
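Slide 9's object API maps directly onto librados; here is a minimal sketch with the python-rados binding, using a throwaway object name ('demo-object') in an example pool. Compound atomic operations and RADOS classes go through the same library but are not shown here.

```python
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')                      # flat namespace within this pool
try:
    ioctx.write_full('demo-object', b'hello world')    # whole-object write
    ioctx.write('demo-object', b'HELLO', 0)            # partial overwrite in place
    ioctx.set_xattr('demo-object', 'owner', b'rbd-demo')  # attribute stored alongside the bytes
    print(ioctx.read('demo-object', 1024, 0))
    print(ioctx.get_xattr('demo-object', 'owner'))
finally:
    ioctx.close()
    cluster.shutdown()
```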
  10. 10. RADOS Components ● OSDs: 10s to 1000s in a cluster – One per disk (or one per SSD, RAID group…) – Serve stored objects to clients – Intelligently peer for replication & recovery ● Monitors: – Maintain cluster membership and state – Provide consensus for distributed decision-making – Small, odd number (e.g., 5) – Not part of data path
  11. 11. Ceph Components RGW A web services gateway for object storage, compatible with S3 and Swift LIBRADOS A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP) RADOS A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors RBD A reliable, fully- distributed block device with cloud platform integration CEPHFS A distributed file system with POSIX semantics and scale-out metadata management APP HOST/VM CLIENT
  12. 12. Metadata ● rbd_directory – Maps image name to id, and vice versa ● rbd_children – Lists clones in a pool, indexed by parent ● rbd_id.$image_name – The internal id, locatable using only the user-specified image name ● rbd_header.$image_id – Per-image metadata
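Since these are ordinary RADOS objects, they can be seen with librados. A small sketch (example pool 'rbd') that just lists the metadata objects by name; their contents are in Ceph's binary encoding, which is why librbd's RADOS classes read and update them rather than clients doing plain reads.

```python
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')
try:
    # The metadata objects described above all live in the same pool as the image.
    for obj in ioctx.list_objects():
        if obj.key.startswith(('rbd_directory', 'rbd_children',
                               'rbd_id.', 'rbd_header.')):
            print(obj.key)
finally:
    ioctx.close()
    cluster.shutdown()
```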
  13. 13. Data ● rbd_data.* objects – Named based on offset in image – Non-existent to start with – Plain data in each object – Snapshots handled by RADOS – Often sparse
  14. 14. Striping ● Objects are uniformly sized – Default is simple 4MB divisions of device ● Randomly distributed among OSDs by CRUSH ● Parallel work is spread across many spindles ● No single set of servers responsible for image ● Small objects lower OSD usage variance
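To make the offset-to-object mapping concrete, here is a small helper (an illustrative sketch, not part of librbd) that computes which data object an image offset lands in, assuming the default layout of plain 4 MB divisions; the object size and name prefix are taken from the image header via Image.stat(), and the image name is a placeholder.

```python
import rados
import rbd

def data_object_for_offset(block_name_prefix, obj_size, offset):
    """Return (data object name, offset within that object) for an image offset."""
    obj_no = offset // obj_size
    # Data objects are named <prefix>.<object number as 16 hex digits>,
    # e.g. rbd_data.1012ae8944a.0000000000000019
    return '%s.%016x' % (block_name_prefix, obj_no), offset % obj_size

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')
with rbd.Image(ioctx, 'vm-disk-1') as image:      # example image
    info = image.stat()
    name, off = data_object_for_offset(info['block_name_prefix'],
                                       info['obj_size'], 100 * 1024 * 1024)
    print('image offset 100 MiB ->', name, 'at object offset', off)
ioctx.close()
cluster.shutdown()
```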
  15. 15. I/O: the VM's I/O passes through librbd and its cache in the hypervisor on its way to the RADOS cluster
  16. 16. Snapshots ● Object granularity ● Snapshot context [list of snap ids, latest snap id] – Stored in rbd image header (self-managed) – Sent with every write ● Snapshot ids managed by monitors ● Deleted asynchronously ● RADOS keeps per-object overwrite stats, so diffs are easy
  17. 17. Snapshots: librbd issues writes with snap context ([], 7)
  18. 18. Snapshots: snap 8 is created
  19. 19. Snapshots: the snap context is set to ([8], 8)
  20. 20. Snapshots: a write is sent with snap context ([8], 8)
  21. 21. Snapshots: on that write, the OSD preserves the object's prior contents as snap 8 before applying it
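Putting slides 16-21 into code: a hedged sketch with python-rbd that snapshots an image, keeps writing, and then uses diff_iterate to enumerate the extents changed since the snapshot, which is what incremental backup builds on. The image and snapshot names are placeholders.

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')
with rbd.Image(ioctx, 'vm-disk-1') as image:
    image.create_snap('snap1')                 # the monitors hand out the new snap id
    image.write(b'x' * 4096, 8 * 1024**2)      # writes after the snap carry the new snap context

    changed = []
    def record(offset, length, exists):
        # exists=False means the extent was discarded/zeroed since the snapshot
        changed.append((offset, length, exists))

    # Extents that differ between snap1 and the current image head.
    image.diff_iterate(0, image.size(), 'snap1', record)
    print(changed)
ioctx.close()
cluster.shutdown()
```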
  22. 22. Watch/Notify ● establish stateful 'watch' on an object – client interest persistently registered with object – client keeps connection to OSD open ● send 'notify' messages to all watchers – notify message (and payload) sent to all watchers – notification (and reply payloads) on completion ● strictly time-bounded liveness check on watch – no notifier falsely believes we got a message
  23. 23. Watch/Notify: librbd in the hypervisor and in /usr/bin/rbd each establish a watch on the image header
  24. 24. Watch/Notify: /usr/bin/rbd adds a snapshot
  25. 25. Watch/Notify: a notify is sent to all watchers of the header
  26. 26. Watch/Notify: the watchers ack the notify; the notifier sees notify complete
  27. 27. Watch/Notify: the hypervisor's librbd re-reads the image metadata
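The same primitive is reachable from client code. Newer python-rados releases expose watch/notify (older ones only have it in the C/C++ librados API), so treat the exact binding calls below as an assumption; the shape of the exchange matches slides 22-27, just on a throwaway object instead of a real rbd_header.

```python
import time
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')
ioctx.write_full('demo-header', b'')           # stand-in for an rbd_header object

def on_notify(notify_id, notifier_id, watch_id, data):
    # librbd would re-read the image metadata at this point
    print('got notify:', data)

watch = ioctx.watch('demo-header', on_notify)  # register persistent interest with the object
ioctx.notify('demo-header', 'refresh')         # returns once watchers ack or the timeout expires
time.sleep(1)                                  # give the callback a moment to fire
watch.close()
ioctx.close()
cluster.shutdown()
```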
  28. 28. Clones ● Copy-on-write (and optionally read, in Hammer) ● Object granularity ● Independent settings – striping, feature bits, object size can differ – can be in different pools ● Clones are based on protected snapshots – 'protected' means they can't be deleted ● Can be flattened – Copy all data from parent – Remove parent relationship
  29. 29. Clones - read: librbd reads the clone; objects not yet present in the clone are read from the parent
  30. 30. Clones - write: librbd writes to the clone
  31. 31. Clones - write: if the object doesn't exist in the clone yet, the parent object is read first
  32. 32. Clones - write: the parent data is copied up into the clone, then the write is applied
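The clone lifecycle from slides 28-32 in python-rbd, assuming a format 2 parent image with layering enabled; image, snapshot, and clone names are placeholders.

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

with rbd.Image(ioctx, 'vm-disk-1') as parent:
    parent.create_snap('base')
    parent.protect_snap('base')                # clones may only hang off protected snapshots

# The clone may live in the same pool or a different one; same ioctx here for brevity.
rbd.RBD().clone(ioctx, 'vm-disk-1', 'base', ioctx, 'vm-disk-clone')

with rbd.Image(ioctx, 'vm-disk-clone') as clone:
    print(clone.parent_info())                 # (pool, parent image, snapshot)
    clone.flatten()                            # copy all parent data, remove the relationship

ioctx.close()
cluster.shutdown()
```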
  33. 33. Ceph and OpenStack
  34. 34. Virtualized Setup ● Secret key stored by libvirt ● XML defining the VM is fed in; it includes – Monitor addresses – Client name – QEMU block device cache setting ● Writeback recommended – Bus on which to attach the block device ● virtio-blk/virtio-scsi recommended ● IDE OK for legacy systems – Discard options (ide/scsi only), I/O throttling if desired
  35. 35. Virtualized Setup: libvirt reads the VM XML and launches qemu -enable-kvm ..., which opens the image through librbd (with cache) on behalf of the VM
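As a concrete (hypothetical) example of the XML that libvirt feeds to QEMU, the disk element below defines an RBD-backed virtio disk with writeback cache. The domain name, pool/image, monitor host, and secret UUID are all placeholders; libvirt-python is used only to hand the snippet to a running domain, and discard would additionally require an ide/scsi bus as the previous slide notes.

```python
import libvirt

# Example RBD disk element; in practice this lives inside the full domain XML.
disk_xml = """
<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='writeback'/>
  <auth username='libvirt'>
    <secret type='ceph' uuid='00000000-0000-0000-0000-000000000000'/>
  </auth>
  <source protocol='rbd' name='rbd/vm-disk-1'>
    <host name='mon1.example.com' port='6789'/>
  </source>
  <target dev='vdb' bus='virtio'/>
</disk>
"""

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('demo-vm')             # hypothetical running domain
dom.attachDeviceFlags(disk_xml, libvirt.VIR_DOMAIN_AFFECT_LIVE)
conn.close()
```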
  36. 36. Kernel RBD ● rbd map sets everything up ● /etc/ceph/rbdmap is like /etc/fstab ● udev adds handy symlinks: – /dev/rbd/$pool/$image[@snap] ● striping v2 and later feature bits not supported yet ● Can be used to back LIO, NFS, SMB, etc. ● No specialized cache, page cache used by filesystem on top
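On the kernel side there is no library to link against; the rbd CLI (or /etc/ceph/rbdmap at boot) does the mapping. A small hedged wrapper showing the map, the udev symlink, and the cleanup; pool and image names are placeholders and the commands need root privileges.

```python
import os
import subprocess

# 'rbd map' asks the kernel module to attach the image and prints /dev/rbdN.
dev = subprocess.check_output(
    ['rbd', 'map', 'rbd/vm-disk-1']).decode().strip()
print('mapped at', dev)

# udev also creates a stable per-pool/image (and per-snapshot) symlink.
link = '/dev/rbd/rbd/vm-disk-1'
print(link, '->', os.path.realpath(link))

# Tear down when finished.
subprocess.check_call(['rbd', 'unmap', dev])
```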
  37. 37. What's new in Hammer ● Copy-on-read ● librbd cache enabled in safe mode by default ● Readahead during boot ● LTTng tracepoints ● Allocation hints ● Cache hints ● Exclusive locking (off by default for now) ● Object map (off by default for now)
  38. 38. Infernalis ● Easier space tracking (rbd du) ● Faster differential backups ● Per-image metadata – Can persist rbd options ● RBD journaling ● Enabling new features on the fly
  39. 39. Future Work ● Kernel client catch up ● RBD mirroring ● Consistency groups ● QoS for RADOS (policy at the RBD level) ● Active/Active iSCSI ● Performance improvements – NewStore OSD backend – Improved cache tiering – Finer-grained caching – Many more
  40. 40. Questions? Josh Durgin jdurgin@redhat.com
