Ceph in Mirantis OpenStack

Dmitry Borodaenko

Mountain View, 2014
The Plan
1. What is Ceph?
2. What is Mirantis OpenStack?
3. How does Ceph fit into OpenStack?
4. What has Fuel ever done for Ceph?
5. What does it look like?
6. Things we’ve done
7. Disk partitioning for Ceph OSD
8. Cephx authentication settings
9. Types of VM migrations
10. Live VM migrations with Ceph
11. Things we left undone
12. Diagnostics and troubleshooting
13. Resources
What is Ceph?
Ceph is a free clustered storage platform that provides unified
object, block, and file storage.
Object Storage RADOS objects support snapshotting, replication,
and consistency.
Block Storage RBD block devices are thinly provisioned over
RADOS objects and can be accessed by QEMU via the
librbd library.
[Diagram: the kernel RBD module and librbd speak the RADOS protocol to the OSDs and Monitors.]

File Storage CephFS metadata servers (MDS) provide a
POSIX-compliant overlay over RADOS.
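As a quick illustration (image name, pool, and default admin keyring are assumed), an RBD image is created thin-provisioned and can then be opened by QEMU tooling through librbd with no kernel mount:

# create a 10 GiB image; space is only allocated as data is written
rbd create --size 10240 rbd/test-image
# qemu-img reads it directly via librbd
qemu-img info rbd:rbd/test-image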
What is Mirantis OpenStack?
OpenStack is an open source cloud computing platform.
[Diagram: Nova provisions VMs; Cinder provides volumes for VMs; Glance provides images for Nova and stores them in Swift; Swift stores objects.]

Mirantis ships hardened OpenStack packages and provides the Fuel
utility to simplify deployment of OpenStack and Ceph.
Fuel uses Cobbler, MCollective, and Puppet to discover
nodes, provision the OS, and set up OpenStack services.
[Diagram: on the Fuel master node, Nailgun serializes the configuration and Astute orchestrates deployment; Cobbler provisions the target node, MCollective starts Puppet through the MCollective Agent, and Puppet configures services from the facts it receives.]
How does Ceph fit into OpenStack?
RBD drivers for OpenStack make libvirt
configure the QEMU interface to librbd.
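Roughly speaking, libvirt ends up handing QEMU an rbd: drive spec along these lines (volume name and Cephx user are illustrative); QEMU then opens the image through librbd with no kernel RBD mapping:

qemu-system-x86_64 -m 1024 -enable-kvm \
  -drive file=rbd:volumes/volume-0001:id=volumes,format=raw,if=virtio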

[Diagram: OpenStack tells libvirt to configure QEMU, which talks to the OSDs and Monitors through librbd and librados.]

Ceph benefits:
Multi-node striping and redundancy for block storage (Cinder volumes and Nova ephemeral drives)
Copy-on-write cloning of images to volumes and instances
Unified storage pool for all types of storage (object, block, POSIX)
Live migration of Ceph-backed instances

Problems: sensitivity to clock drift; multi-site (async replication in
Emperor); block storage density (erasure coding in Firefly); Swift
API gap (rbd backend for Swift)
What has Fuel ever done for Ceph?
1. Fuel deploys Ceph Monitors and OSDs on dedicated nodes or
in combination with OpenStack components.

[Diagram: controllers 1-3 run ceph-mon alongside the OpenStack controller services; storage nodes 1-n run ceph-osd; compute nodes 1-n run nova with the Ceph client; all nodes are connected over the storage and management networks.]

2. Creates partitions for OSDs when nodes are provisioned.
3. Creates separate RADOS pools and sets up Cephx
authentication for Cinder, Glance, and Nova.
4. Configures Cinder, Glance, and Nova to use the RBD backend
with the right pools and credentials (sample configuration after this list).
5. Deploys RADOS Gateway (S3 and Swift API frontend to
Ceph) behind HAProxy on controller nodes.
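For item 4, the resulting configuration looks roughly like the sketch below (Havana-era option names; the Nova options come from the patched ephemeral-RBD branch, and user/pool names follow the Cephx slide later on):

# /etc/cinder/cinder.conf
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_pool=volumes
rbd_user=volumes
rbd_secret_uuid=<libvirt secret UUID>

# /etc/glance/glance-api.conf
default_store=rbd
rbd_store_pool=images
rbd_store_user=images

# /etc/nova/nova.conf (havana-ephemeral-rbd branch)
libvirt_images_type=rbd
libvirt_images_rbd_pool=compute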
What does it look like?
Select storage options ⇒ assign roles to nodes ⇒ allocate disks:
Things we’ve done
1. Set the right GPT type GUIDs on OSD and journal partitions
for udev automount rules
2. ceph-deploy: set up root SSH between Ceph nodes
3. Basic Ceph settings: cephx, pool size, networks
4. Cephx: ceph auth command line can’t be split
5. RADOS Gateway: requires Inktank’s fork of FastCGI; set an
infinite token revocation interval for UUID auth tokens to work
6. Patch Cinder to convert non-raw images when creating an
RBD-backed volume from Glance
7. Patch Nova: clone RBD-backed Glance images into RBD-backed
ephemeral volumes, pass the RBD user to qemu-img
8. Ephemeral RBD: disable SSH key injection, set up Nova,
libvirt, and QEMU for live migrations
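For item 8, disabling key injection amounts to a few nova.conf settings along these lines (Havana-era option names; these are the values that turn injection off):

# /etc/nova/nova.conf
libvirt_inject_key=false
libvirt_inject_password=false
# -2 tells Nova not to mount any partition of the guest image for file injection
libvirt_inject_partition=-2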
Disk partitioning for Ceph OSD
Flow of disk partitioning information during discovery,
configuration, provisioning, and deployment:
[Diagram: the Fuel UI records disk allocation for the ceph-osd role; Nailgun passes it to Cobbler via openstack.json and ks_spaces; on the target node the MCollective agent scans the disks with parted and creates OSD and journal partitions with pmanager/sgdisk, setting the GPT partition type; Facter exposes them as osd_devices_list, and the ceph::osd Puppet class runs ceph-deploy on them.]

GPT partition type GUIDs according to ceph-disk:
JOURNAL_UUID = '45b0969e-9b03-4f30-b4c6-b4b80ceff106'
OSD_UUID     = '4fbd7e29-9d25-41b8-afd0-062c0ceff05d'

If more than one device is allocated for the OSD journal, journal
devices are evenly distributed between the OSDs.
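A hypothetical sgdisk invocation (device and partition numbers assumed) that tags a data and a journal partition with the GUIDs above, so the Ceph udev rules can auto-activate them:

sgdisk --typecode=1:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/sdb   # OSD data
sgdisk --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdb   # OSD journal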
Cephx authentication settings
Monitor ACL is the same for all Cephx users:
allow r

OSD ACLs vary per OpenStack component:
Glance: allow class-read object_prefix rbd_children,
        allow rwx pool=images

Cinder: allow class-read object_prefix rbd_children,
        allow rwx pool=volumes,
        allow rx pool=images

Nova:   allow class-read object_prefix rbd_children,
        allow rwx pool=volumes,
        allow rx pool=images,
        allow rwx pool=compute

Watch out: Cephx is easily tripped up by unexpected whitespace in
ceph auth command line parameters, so we have to keep them all
on a single line.
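For example, a Glance key with the ACLs above could be created in one shot, keeping the whole capability string on a single line (client name assumed):

ceph auth get-or-create client.images mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=images'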
Types of VM migrations
OpenStack:
Live vs offline: Is VM stopped during migration?
Block vs shared storage vs volume-backed: Is VM data shared
between nodes? Is VM metadata (e.g. libvirt domain
XML) shared?
Libvirt:
Native vs tunneled: Is VM state transferred directly between
hypervisors or tunneled by libvirtd?
Direct vs peer-to-peer: Is migration controlled by libvirt client or by
source libvirtd?
Managed vs unmanaged: Is migration controlled by libvirt or by
hypervisor itself?
Our type:
Live, volume-backed*, native, peer-to-peer, managed.
Live VM migrations with Ceph
Enable native peer-to-peer live migration:

[Diagram: VMs running under libvirtd on the source compute node; VM-C migrating to libvirtd on the destination compute node, coordinated by Nova on both sides.]

libvirt VIR_MIGRATE_* flags: LIVE, PEER2PEER,
UNDEFINE_SOURCE, PERSIST_DEST
Patch Nova to decouple shared volumes from shared libvirt
metadata logic during live migration
Set VNC listen address to 0.0.0.0 and block VNC from outside
the management network in iptables
Open ports 49152+ between computes for QEMU migrations
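Put together, the compute-node side of this looks roughly like the following (Havana-era nova.conf option names; the subnet and the upper port bound, libvirt's default of 49215, are illustrative):

# /etc/nova/nova.conf
live_migration_flag=VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_PERSIST_DEST
vncserver_listen=0.0.0.0

# iptables: QEMU migration ports reachable only from the management network
iptables -A INPUT -p tcp --dport 49152:49215 -s 192.168.0.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 49152:49215 -j DROP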
Things we left undone
1. Non-root user with sudo for ceph-deploy
2. Calculate PG numbers based on the number of OSDs (rule of thumb sketched after this list)
3. Ceph public network should go to a second storage network
instead of management
4. Dedicated Monitor nodes, list all Monitors in ceph.conf on
each Ceph node
5. Multi-backend configuration for Cinder
6. A better way to configure pools for OpenStack services (than
CEPH_ARGS in the init script)
7. Make Nova update VM’s VNC listen address to
vncserver_listen of the destination compute after migration
8. Replace ’qemu-img convert’ with clone_image() in
LibvirtDriver.snapshot() in Nova
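For item 2, the usual Ceph rule of thumb is pg_num ≈ (number of OSDs × 100) / pool size, rounded up to a power of two; for example, 20 OSDs with 3 replicas give about 667, so a pool would be created with 1024 PGs (pool name illustrative):

ceph osd pool create volumes 1024 1024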
Diagnostics and troubleshooting
ceph -s
ceph osd tree
cinder create 1
rados df
qemu-img convert -O raw cirros.qcow2 cirros.raw
glance image-create --name cirros-raw --is-public yes \
    --container-format bare --disk-format raw < cirros.raw
nova boot --flavor 1 --image cirros-raw vm0
nova live-migration vm0 node-3

disk partitioning failed during provisioning – check whether traces of
previous partition tables are left on any drives
'ceph-deploy config pull' failed – check that the node can ssh to the
primary controller over the management network
HEALTH_WARN: clock skew detected – check your ntpd settings and
make sure your NTP server is reachable from all nodes
ENOSPC when storing small objects in RGW – try setting a
smaller rgw object stripe size
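For the RGW case, the stripe size can be lowered in ceph.conf along these lines (section name and value are illustrative; the default stripe size is 4 MB):

[client.radosgw.gateway]
rgw obj stripe size = 1048576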
Resources
Read the docs:
http://ceph.com/docs/next/rbd/rbd-openstack/
http://docs.mirantis.com/fuel/fuel-4.0/
http://libvirt.org/migration.html
http://docs.openstack.org/admin-guide-cloud/content/
ch_introduction-to-openstack-compute.html
Get the code:
Mirantis OpenStack ISO image and VirtualBox scripts,
ceph Puppet module for Fuel,
Josh Durgin’s havana-ephemeral-rbd branch for Nova.
Vote on Nova bugs:
#1226351, #1261675, #1262450, #1262914.
Sign up for Mirantis and Inktank webcast on Ceph and OpenStack.
