Ceph and Mirantis OpenStack


On January 14, 2014, Dmitry Borodaenko presented information on Ceph and OpenStack. This is the slide deck for that presentation.

  1. Ceph in Mirantis OpenStack. Dmitry Borodaenko, Mountain View, 2014
  2. The Plan
       1. What is Ceph?
       2. What is Mirantis OpenStack?
       3. How does Ceph fit into OpenStack?
       4. What has Fuel ever done for Ceph?
       5. What does it look like?
       6. Things we’ve done
       7. Disk partitioning for Ceph OSD
       8. Cephx authentication settings
       9. Types of VM migrations
       10. Live VM migrations with Ceph
       11. Things we left undone
       12. Diagnostics and troubleshooting
       13. Resources
  3. What is Ceph?
     Ceph is a free clustered storage platform that provides unified object, block, and file storage.
       - Object storage: RADOS objects support snapshotting, replication, and consistency.
       - Block storage: RBD block devices are thinly provisioned over RADOS objects and can be accessed by QEMU via the librbd library.
       - File storage: CephFS metadata servers (MDS) provide a POSIX-compliant overlay over RADOS.
     [Diagram labels: Kernel Module, librbd, RADOS Protocol, OSDs, Monitors]
  4. What is Mirantis OpenStack?
     OpenStack is an open source cloud computing platform.
     [Diagram: Nova, Cinder, Glance, Swift, and VM, connected by edges labeled "provisions", "provides volumes for", "provides images for", "stores objects in", and "stores images in"]
     Mirantis ships hardened OpenStack packages and provides the Fuel utility to simplify deployment of OpenStack and Ceph.
     Fuel uses Cobbler, MCollective, and Puppet to discover nodes, provision the OS, and set up OpenStack services.
     [Diagram labels: Fuel master node, Astute, Nailgun, Cobbler, MCollective, Target node, MCollective Agent, Puppet; actions: serialize, orchestrate, provision, facts, configure, start]
  5. How does Ceph fit into OpenStack?
     RBD drivers for OpenStack make libvirt configure the QEMU interface to librbd.
     [Diagram labels: OpenStack, libvirt, "configures", QEMU, librbd, librados, OSDs, Monitors]
     Ceph benefits:
       - Multi-node striping and redundancy for block storage (Cinder volumes and Nova ephemeral drives)
       - Copy-on-write cloning of images to volumes and instances
       - Unified storage pool for all types of storage (object, block, POSIX)
       - Live migration of Ceph-backed instances
     Problems:
       - Sensitivity to clock drift
       - Multi-site (async replication in Emperor)
       - Block storage density (erasure coding in Firefly)
       - Swift API gap (rbd backend for Swift)
  6. What has Fuel ever done for Ceph?
       1. Deploys Ceph Monitors and OSDs on dedicated nodes or in combination with OpenStack components.
       2. Creates partitions for OSDs when nodes are provisioned.
       3. Creates separate RADOS pools and sets up Cephx authentication for Cinder, Glance, and Nova.
       4. Configures Cinder, Glance, and Nova to use the RBD backend with the right pools and credentials.
       5. Deploys RADOS Gateway (S3 and Swift API frontend to Ceph) behind HAProxy on controller nodes.
     [Diagram labels: controller 1 to controller 3 (controller, ceph-mon), storage 1 ... storage n (ceph-osd), compute 1 ... compute n (nova, ceph client), storage network, management network]
  7. What does it look like?
     Select storage options ⇒ assign roles to nodes ⇒ allocate disks.
     [Fuel UI screenshots]
  8. Things we’ve done
       1. Set the right GPT type GUIDs on OSD and journal partitions for udev automount rules
       2. ceph-deploy: set up root SSH between Ceph nodes
       3. Basic Ceph settings: cephx, pool size, networks
       4. Cephx: the ceph auth command line can’t be split across lines
       5. RADOS Gateway: requires Inktank’s fork of FastCGI; set an infinite revocation interval for UUID auth tokens to work
       6. Patch Cinder to convert non-raw images when creating an RBD-backed volume from Glance
       7. Patch Nova: clone RBD-backed Glance images into RBD-backed ephemeral volumes, pass the RBD user to qemu-img
       8. Ephemeral RBD: disable SSH key injection, set up Nova, libvirt, and QEMU for live migrations
  9. Disk partitioning for Ceph OSD
     Flow of disk partitioning information during discovery, configuration, provisioning, and deployment:
     [Diagram labels: Fuel master node, Fuel UI, Nailgun, Cobbler, Target node, MCAgent, parted, pmanager, sgdisk, Base OS, Facter, Puppet, ceph-deploy; data and actions: allocation, ceph-osd role, volumes, openstack.json, ks_spaces, disks, scan, create, set type, osd:journal, osd_devices_list, ceph::osd, OSD, Journal]
     GPT partition type GUIDs according to ceph-disk:
       JOURNAL_UUID = '45b0969e-9b03-4f30-b4c6-b4b80ceff106'
       OSD_UUID = '4fbd7e29-9d25-41b8-afd0-062c0ceff05d'
     If more than one device is allocated for OSD Journal, journal devices are evenly distributed between OSDs.
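The even distribution of journal devices between OSDs mentioned on this slide can be illustrated with a short sketch. It assumes a simple round-robin assignment and the `osd:journal` pair notation from the diagram; the function name `pair_osds_with_journals` and the device paths are hypothetical, and this is not the actual Fuel or Puppet code.

```python
from typing import List

# GPT partition type GUIDs quoted on the slide (as used by ceph-disk).
JOURNAL_UUID = "45b0969e-9b03-4f30-b4c6-b4b80ceff106"
OSD_UUID = "4fbd7e29-9d25-41b8-afd0-062c0ceff05d"


def pair_osds_with_journals(osds: List[str], journals: List[str]) -> List[str]:
    """Return 'osd:journal' strings, cycling through journals round-robin.

    Round-robin is an assumption made for illustration; the slide only
    says journals are evenly distributed. With no journal devices, each
    OSD device is returned on its own.
    """
    if not journals:
        return list(osds)
    return [f"{osd}:{journals[i % len(journals)]}" for i, osd in enumerate(osds)]


# Hypothetical device names: four OSD partitions, two journal partitions.
pairs = pair_osds_with_journals(
    ["/dev/sdb2", "/dev/sdc2", "/dev/sdd2", "/dev/sde2"],
    ["/dev/sdf1", "/dev/sdg1"],
)
print(pairs)
```

With four OSDs and two journal devices, each journal device ends up serving two OSDs.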
  10. Cephx authentication settings
      Monitor ACL is the same for all Cephx users:
        allow r
      OSD ACLs vary per OpenStack component:
        Glance: allow class-read object_prefix rbd_children, allow rwx pool=images
        Cinder: allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rx pool=images
        Nova: allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rx pool=images, allow rwx pool=compute
      Watch out: Cephx is easily tripped up by unexpected whitespace in ceph auth command line parameters, so we have to keep them all on a single line.
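As a rough illustration of the single-line requirement, the sketch below assembles one `ceph auth` command per component from the ACLs listed on this slide. The `get-or-create` subcommand and the `client.<component>` user names are assumptions (the slide names neither), and the helper functions are hypothetical.

```python
import shlex
from typing import Dict, List

# Monitor capability shared by all users, as stated on the slide.
MON_CAP = "allow r"

# OSD capabilities per OpenStack component, as stated on the slide.
OSD_CAPS: Dict[str, List[str]] = {
    "glance": [
        "allow class-read object_prefix rbd_children",
        "allow rwx pool=images",
    ],
    "cinder": [
        "allow class-read object_prefix rbd_children",
        "allow rwx pool=volumes",
        "allow rx pool=images",
    ],
    "nova": [
        "allow class-read object_prefix rbd_children",
        "allow rwx pool=volumes",
        "allow rx pool=images",
        "allow rwx pool=compute",
    ],
}


def ceph_auth_argv(component: str) -> List[str]:
    """Build the argument vector; each capability string is one argument."""
    osd_cap = ", ".join(OSD_CAPS[component])
    # 'client.<component>' and 'get-or-create' are assumptions for illustration.
    return ["ceph", "auth", "get-or-create", f"client.{component}",
            "mon", MON_CAP, "osd", osd_cap]


def ceph_auth_command_line(component: str) -> str:
    """Render the command as one shell line with no embedded newlines."""
    return " ".join(shlex.quote(arg) for arg in ceph_auth_argv(component))


glance_line = ceph_auth_command_line("glance")
print(glance_line)
```

Passing the arguments as a list (for example to `subprocess.run`) avoids shell quoting altogether; the rendered string is useful when the command has to be embedded in a manifest or script.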
  11. Types of VM migrations
      OpenStack:
        - Live vs offline: Is the VM stopped during migration?
        - Block vs shared storage vs volume-backed: Is VM data shared between nodes? Is VM metadata (e.g. libvirt domain XML) shared?
      Libvirt:
        - Native vs tunneled: Is VM state transferred directly between hypervisors or tunneled by libvirtd?
        - Direct vs peer-to-peer: Is migration controlled by the libvirt client or by the source libvirtd?
        - Managed vs unmanaged: Is migration controlled by libvirt or by the hypervisor itself?
      Our type: Live, volume-backed*, native, peer-to-peer, managed.
  12. Live VM migrations with Ceph
      [Diagram: source compute node (Nova, libvirtd, VM-A, VM-B, VM-C) and destination compute node (Nova, libvirtd, VM-C, VM-D, VM-E)]
        - Enable native peer-to-peer live migration with libvirt VIR_MIGRATE_* flags: LIVE, PEER2PEER, UNDEFINE_SOURCE, PERSIST_DEST
        - Patch Nova to decouple shared volumes from shared libvirt metadata logic during live migration
        - Set VNC listen address to and block VNC from outside the management network in iptables
        - Open ports 49152+ between computes for QEMU migrations
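The flag combination on this slide can be checked against the migration taxonomy from the previous slide with a small sketch. The numeric values mirror libvirt's `virDomainMigrateFlags` enum so the example runs without the libvirt Python bindings; the helper `describe` is hypothetical and only for illustration.

```python
# Values mirror libvirt's virDomainMigrateFlags enum; with the libvirt
# Python bindings installed, the same names exist as libvirt.VIR_MIGRATE_*.
VIR_MIGRATE_LIVE = 1 << 0
VIR_MIGRATE_PEER2PEER = 1 << 1
VIR_MIGRATE_TUNNELLED = 1 << 2
VIR_MIGRATE_PERSIST_DEST = 1 << 3
VIR_MIGRATE_UNDEFINE_SOURCE = 1 << 4

# The combination named on the slide.
SLIDE_FLAGS = (
    VIR_MIGRATE_LIVE
    | VIR_MIGRATE_PEER2PEER
    | VIR_MIGRATE_UNDEFINE_SOURCE
    | VIR_MIGRATE_PERSIST_DEST
)


def describe(flags: int) -> dict:
    """Map a flag bitmask onto the live / peer-to-peer / native taxonomy."""
    return {
        "live": bool(flags & VIR_MIGRATE_LIVE),
        "peer_to_peer": bool(flags & VIR_MIGRATE_PEER2PEER),
        # Native means the TUNNELLED flag is absent.
        "native": not flags & VIR_MIGRATE_TUNNELLED,
        "persist_on_destination": bool(flags & VIR_MIGRATE_PERSIST_DEST),
        "undefine_on_source": bool(flags & VIR_MIGRATE_UNDEFINE_SOURCE),
    }


summary = describe(SLIDE_FLAGS)
print(SLIDE_FLAGS, summary)
```

Leaving out `VIR_MIGRATE_TUNNELLED` is what makes the migration native: QEMU processes talk to each other directly, which is why the migration port range has to be open between compute nodes.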
  13. Things we left undone
       1. Non-root user with sudo for ceph-deploy
       2. Calculate PG numbers based on the number of OSDs
       3. Ceph public network should go to a second storage network instead of management
       4. Dedicated Monitor nodes, list all Monitors in ceph.conf on each Ceph node
       5. Multi-backend configuration for Cinder
       6. A better way to configure pools for OpenStack services (than CEPH_ARGS in the init script)
       7. Make Nova update the VM’s VNC listen address to vncserver_listen of the destination compute after migration
       8. Replace ’qemu-img convert’ with clone_image() in LibvirtDriver.snapshot() in Nova
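For item 2 above, one possible starting point is the rule of thumb commonly cited in the Ceph documentation: roughly 100 placement groups per OSD, divided by the replica count and rounded up to a power of two. The sketch below assumes that rule; the function name and the default of 100 are illustrative, not what Fuel implements.

```python
def suggested_pg_num(num_osds: int, replicas: int, pgs_per_osd: int = 100) -> int:
    """Return (num_osds * pgs_per_osd) / replicas rounded up to a power of two.

    This follows a widely quoted rule of thumb, assumed here for
    illustration; it is a total across pools, not a per-pool value.
    """
    if num_osds < 1 or replicas < 1 or pgs_per_osd < 1:
        raise ValueError("all arguments must be positive integers")
    power = 1
    # Integer comparison avoids floating point: power >= osds * per_osd / replicas.
    while power * replicas < num_osds * pgs_per_osd:
        power *= 2
    return power


# Example: 9 OSDs with 3 replicas gives a target of 300, rounded up to 512.
pg_num = suggested_pg_num(9, 3)
print(pg_num)
```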
  14. Diagnostics and troubleshooting
      Commands:
        ceph -s
        ceph osd tree
        cinder create 1
        rados df
        qemu-img convert -O raw cirros.qcow2 cirros.raw
        glance image-create --name cirros-raw --is-public yes --container-format bare --disk-format raw < cirros.raw
        nova boot --flavor 1 --image cirros-raw vm0
        nova live-migration vm0 node-3
      Common problems:
        - Disk partitioning failed during provisioning: check if traces of previous partition tables are left on any drives
        - ’ceph-deploy config pull’ failed: check if the node can ssh to the primary controller over the management network
        - HEALTH_WARN: clock skew detected: check your ntpd settings, make sure your NTP server is reachable from all nodes
        - ENOSPC when storing small objects in RGW: try setting a smaller rgw object stripe size
  15. Resources
      Read the docs:
        - http://ceph.com/docs/next/rbd/rbd-openstack/
        - http://docs.mirantis.com/fuel/fuel-4.0/
        - http://libvirt.org/migration.html
        - http://docs.openstack.org/admin-guide-cloud/content/ch_introduction-to-openstack-compute.html
      Get the code: Mirantis OpenStack ISO image and VirtualBox scripts, ceph Puppet module for Fuel, Josh Durgin’s havana-ephemeral-rbd branch for Nova.
      Vote on Nova bugs: #1226351, #1261675, #1262450, #1262914.
      Sign up for the Mirantis and Inktank webcast on Ceph and OpenStack.