
Keeping OpenStack storage trendy with Ceph and containers


The conventional approach to deploying applications on OpenStack uses virtual machines (usually KVM) backed by block devices (usually Ceph RBD). As interest increases in container-based application deployment models like Docker, it is worth looking at what alternatives exist for combining compute and storage (both shared and non-shared). Mapping RBD block devices directly to host kernels trades isolation for performance and may be appropriate for many private clouds without significant changes to the infrastructure. More importantly, moving away from virtualization allows for non-block interfaces and a range of alternative models based on file or object.

Attendees will leave this talk with a basic understanding of the storage components and services available to both virtual machines and Linux containers; a view of several ways they can be combined, along with the performance, reliability, and security trade-offs of each; and several proposals for how the relevant OpenStack projects (Nova, Cinder, Manila) can work together to make this easy.


  1. 1. KEEPING OPENSTACK STORAGE TRENDY WITH CEPH AND CONTAINERS SAGE WEIL, HAOMAI WANG OPENSTACK SUMMIT - 2015.05.20
  2. 2. 2 AGENDA ● Motivation ● Block ● File ● Container orchestration ● Summary
  3. 3. MOTIVATION
  4. 4. 4 A CLOUD SMORGASBORD ● Compelling clouds offer options ● Compute – VM (KVM, Xen, …) – Containers (lxc, Docker, OpenVZ, ...) ● Storage – Block (virtual disk) – File (shared) – Object (RESTful, …) – Key/value – NoSQL – SQL
  5. 5. 5 WHY CONTAINERS? Technology ● Performance – Shared kernel – Faster boot – Lower baseline overhead – Better resource sharing ● Storage – Shared kernel → efficient IO – Small image → efficient deployment Ecosystem ● Emerging container host OSs – Atomic – http://projectatomic.io ● os-tree (s/rpm/git/) – CoreOS ● systemd + etcd + fleet – Snappy Ubuntu ● New app provisioning model – Small, single-service containers – Standalone execution environment ● New open container spec nulecule – https://github.com/projectatomic/nulecule
  6. 6. 6 WHY NOT CONTAINERS? Technology ● Security – Shared kernel – Limited isolation ● OS flexibility – Shared kernel limits OS choices ● Inertia Ecosystem ● New models don't capture many legacy services
  7. 7. 7 WHY CEPH? ● All components scale horizontally ● No single point of failure ● Hardware agnostic, commodity hardware ● Self-manage whenever possible ● Open source (LGPL) ● Move beyond legacy approaches – client/cluster instead of client/server – avoid ad hoc HA
  8. 8. 8 CEPH COMPONENTS ● RGW – a web services gateway for object storage, compatible with S3 and Swift ● LIBRADOS – a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP) ● RADOS – a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors ● RBD – a reliable, fully-distributed block device with cloud platform integration ● CEPHFS – a distributed file system with POSIX semantics and scale-out metadata management
  9. 9. BLOCK STORAGE
  10. 10. 10 EXISTING BLOCK STORAGE MODEL ● VMs are the unit of cloud compute ● Block devices are the unit of VM storage – ephemeral: not redundant, discarded when VM dies – persistent volumes: durable, (re)attached to any VM ● Block devices are single-user ● For shared storage, – use objects (e.g., Swift or S3) – use a database (e.g., Trove) – ...
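
To ground the existing model: a persistent volume is created in Cinder and then attached to a Nova instance as a block device. Below is a minimal sketch assuming python-cinderclient (v2 API) and python-novaclient; the credentials, server name, and device path are illustrative only.

    # Sketch: create a persistent Cinder volume and attach it to a Nova VM.
    # Assumes python-cinderclient (v2 API) and python-novaclient; the
    # credentials, names, and device path are examples, not real values.
    from cinderclient import client as cinder_client
    from novaclient import client as nova_client

    AUTH = ('demo', 'secret', 'demo', 'http://keystone:5000/v2.0')

    cinder = cinder_client.Client('2', *AUTH)
    nova = nova_client.Client('2', *AUTH)

    # Persistent volume: durable, lives independently of any VM
    # (backed by RBD when Cinder uses the Ceph driver).
    vol = cinder.volumes.create(size=10, name='data-vol')

    # Attach it to a running instance; the guest sees a new block device.
    server = nova.servers.find(name='demo-vm')
    nova.volumes.create_server_volume(server.id, vol.id, '/dev/vdb')
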
  11. 11. 11 KVM + LIBRBD.SO ● Model – Nova → libvirt → KVM → librbd.so – Cinder → rbd.py → librbd.so – Glance → rbd.py → librbd.so ● Pros – proven – decent performance – good security ● Cons – performance could be better ● Status – most common deployment model today (~44% in latest survey)
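
What rbd.py does under the hood: the librados/librbd Python bindings talk to the cluster directly from userspace, with no kernel client involved. A minimal sketch, where the pool and image names are just examples:

    # Minimal librbd sketch: create an RBD image and write to it from
    # userspace, roughly what Cinder's rbd.py driver does. Pool and image
    # names are examples; assumes a readable /etc/ceph/ceph.conf and keyring.
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('volumes')      # Cinder's RBD pool, by convention
    try:
        rbd.RBD().create(ioctx, 'volume-1234', 10 * 1024 ** 3)   # 10 GiB image
        image = rbd.Image(ioctx, 'volume-1234')
        image.write(b'hello from librbd', 0)   # I/O goes straight to the OSDs
        image.close()
    finally:
        ioctx.close()
        cluster.shutdown()
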
  12. 12. 12 MULTIPLE CEPH DRIVERS ● librbd.so – qemu-kvm – rbd-fuse (experimental) ● rbd.ko (Linux kernel) – /dev/rbd* – stable and well-supported on modern kernels and distros – some feature gap ● no client-side caching ● no “fancy striping” – performance delta ● more efficient → more IOPS ● no client-side cache → higher latency for some workloads
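
The kernel-client alternative looks different from the host's point of view: an image is mapped to a /dev/rbd* device, which is what the container-based models on the following slides hand to a guest. A rough sketch, assuming the rbd CLI and root privileges on the host:

    # Sketch: map/unmap an RBD image through the kernel client (rbd.ko).
    # Assumes the 'rbd' CLI and root privileges; pool/image names are examples.
    import subprocess

    def map_rbd(pool, image):
        # 'rbd map' loads rbd.ko if needed and prints the device it created.
        out = subprocess.check_output(['rbd', 'map', '%s/%s' % (pool, image)])
        return out.decode().strip()            # e.g. '/dev/rbd0'

    def unmap_rbd(device):
        subprocess.check_call(['rbd', 'unmap', device])

    if __name__ == '__main__':
        dev = map_rbd('volumes', 'volume-1234')
        print('mapped at', dev)
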
  13. 13. 13 LXC + CEPH.KO ● The model – libvirt-based lxc containers – map kernel RBD on host – pass host device to libvirt, container ● Pros – fast and efficient – implement existing Nova API ● Cons – weaker security than VM ● Status – lxc is maintained – lxc is less widely used – no prototype
  14. 14. 14 NOVA-DOCKER + CEPH.KO ● The model – docker container as mini-host – map kernel RBD on host – pass RBD device to container, or – mount RBD, bind dir to container ● Pros – buzzword-compliant – fast and efficient ● Cons – different image format – different app model – only a subset of docker feature set ● Status – no prototype – nova-docker is out of tree https://wiki.openstack.org/wiki/Docker
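
For the "mount RBD, bind dir to container" variant above, the host does the mapping and filesystem mount, and Docker bind-mounts the directory into the container. A hedged sketch follows; the device, paths, and container image are examples, and it assumes the docker CLI plus a filesystem already created on the device.

    # Sketch: expose an RBD-backed filesystem to a Docker container via a
    # host-side mount plus a bind mount (-v). Device, paths, and the
    # container image are examples; run as root on the container host.
    import subprocess

    dev = '/dev/rbd0'                               # mapped earlier with 'rbd map'
    host_dir = '/var/lib/ceph-volumes/volume-1234'

    subprocess.check_call(['mkdir', '-p', host_dir])
    subprocess.check_call(['mount', dev, host_dir])  # host mounts the volume

    # Docker bind-mounts the host directory into the container's namespace.
    subprocess.check_call(['docker', 'run', '-d',
                           '-v', host_dir + ':/data',
                           'fedora', 'sleep', 'infinity'])
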
  15. 15. 15 IRONIC + CEPH.KO ● The model – bare metal provisioning – map kernel RBD directly from guest image ● Pros – fast and efficient – traditional app deployment model ● Cons – guest OS must support rbd.ko – requires agent – boot-from-volume tricky ● Status – Cinder and Ironic integration is a hot topic at summit ● 5:20p Wednesday (cinder) – no prototype ● References – https://wiki.openstack.org/wiki/Ironic/blueprints/cinder-integration
  16. 16. 16 BLOCK - SUMMARY ● But – block storage is the same old, boring model – volumes are only semi-elastic (grow, not shrink; tedious to resize) – storage is not shared between guests ● Comparison (performance / efficiency / VM / client cache / striping / same images? / exists): – kvm + librbd.so: best / good / yes / yes / yes / yes / yes – lxc + rbd.ko: good / best / no / no / no / close / no – nova-docker + rbd.ko: good / best / no / no / no / no / no – ironic + rbd.ko: good / best / no / no / no / close? / planned!
  17. 17. FILE STORAGE
  18. 18. 18 MANILA FILE STORAGE ● Manila manages file volumes – create/delete, share/unshare – tenant network connectivity – snapshot management ● Why file storage? – familiar POSIX semantics – fully shared volume – many clients can mount and share data – elastic storage – amount of data can grow/shrink without explicit provisioning
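
From the tenant's side, the Manila workflow is create, allow, then mount. A sketch assuming python-manilaclient's v1 API; constructor arguments differ slightly between releases, and the protocol, CIDR, and names below are examples.

    # Sketch of the Manila tenant workflow: create a share and grant access.
    # Assumes python-manilaclient's v1 API (constructor args vary by release);
    # credentials, protocol, and the tenant CIDR are examples. The guest still
    # performs the actual mount.
    from manilaclient import client

    manila = client.Client('1', 'demo', 'secret', 'demo',
                           'http://keystone:5000/v2.0')

    # A 1 GiB shared file volume: many clients, elastic usage, no fixed device.
    share = manila.shares.create(share_proto='NFS', size=1, name='web-content')

    # Grant the tenant network access to the share.
    manila.shares.allow(share, 'ip', '10.0.0.0/24')

    # The export location is what clients mount (the "mount problem" below);
    # the attribute name varies a bit between releases.
    print(manila.shares.get(share.id).export_location)
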
  19. 19. 19 MANILA CAVEATS ● Last mile problem – must connect storage to guest network – somewhat limited options (focus on Neutron) ● Mount problem – Manila makes it possible for guest to mount – guest is responsible for actual mount – ongoing discussion around a guest agent … ● Current baked-in assumptions about both of these
  20. 20. 20 APPLIANCE DRIVERS ● Appliance drivers – tell an appliance to export NFS to guests – map appliance IP into tenant network (Neutron) – boring (closed, proprietary, expensive, etc.) ● Status – several drivers from usual suspects – security punted to vendor
  21. 21. 21 GANESHA DRIVER ● Model – service VM running nfs-ganesha server – mount file system on storage network – export NFS to tenant network – map IP into tenant network ● Status – in-tree, well-supported
  22. 22. 22 KVM + GANESHA + LIBCEPHFS ● Model – existing Ganesha driver, backed by Ganesha's libcephfs FSAL ● Pros – simple, existing model – security ● Cons – extra hop → higher latency – service VM is SpoF – service VM consumes resources ● Status – Manila Ganesha driver exists – untested with CephFS
  23. 23. 23 KVM + CEPH.KO (CEPH-NATIVE) ● Model – allow tenant access to storage network – mount CephFS directly from tenant VM ● Pros – best performance – access to full CephFS feature set – simple ● Cons – guest must have modern distro/kernel – exposes tenant to Ceph cluster – must deliver mount secret to client ● Status – no prototype – CephFS isolation/security is work-in-progress
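
For the ceph-native model, the mount step happens inside the guest, using a cephx key delivered to the tenant. A sketch of that guest-side step, assuming a modern kernel with ceph.ko; the monitor address, client name, and paths are examples.

    # Sketch: guest-side CephFS mount for the ceph-native model. Assumes a
    # modern kernel with ceph.ko and a cephx secret delivered to the guest;
    # monitor address, client name, and paths are examples.
    import subprocess

    MON = '10.0.0.1:6789'
    SECRET_FILE = '/etc/ceph/tenant.secret'

    subprocess.check_call(['mkdir', '-p', '/mnt/share'])
    subprocess.check_call(['mount', '-t', 'ceph', MON + ':/', '/mnt/share',
                           '-o', 'name=tenant,secretfile=' + SECRET_FILE])
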
  24. 24. 24 NETWORK-ONLY MODEL IS LIMITING ● Current assumption of NFS or CIFS sucks ● Always relying on guest mount support sucks – mount -t ceph -o what? ● Even assuming storage connectivity is via the network sucks ● There are other options! – KVM virtfs/9p ● fs pass-through to host ● 9p protocol ● virtio for fast data transfer ● upstream; not widely used – NFS re-export from host ● mount and export fs on host ● private host/guest net ● avoid network hop from NFS service VM – containers and 'mount --bind'
  25. 25. 25 NOVA “ATTACH FS” API ● The mount problem is an ongoing discussion in the Manila team – discussed this morning – simple prototype using cloud-init – Manila agent? leverage Zaqar tenant messaging service? ● A different proposal – expand Nova to include an “attach/detach file system” API – analogous to the current attach/detach volume for block – each Nova driver may implement the function differently – “plumb” storage to the tenant VM or container ● Open question – Would the API do the final “mount” step as well? (I say yes!)
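
Since this API is only a proposal, the sketch below is purely hypothetical: nothing in it exists in Nova or novaclient today. It only illustrates what the tenant-facing "attach fs" call might look like, by analogy with volume attach.

    # HYPOTHETICAL sketch of the proposed Nova "attach fs" API -- nothing
    # below exists in Nova or novaclient today; it only illustrates the
    # proposal that attaching a Manila share should look like attaching a
    # Cinder volume.
    def attach_filesystem(nova, server, share_id, mount_path):
        """Hypothetical tenant-facing call: ask Nova to plumb a Manila share
        into an instance. The Nova driver would pick the mechanism
        (virtfs/9p, an NFS proxy via the host, or a bind mount for
        containers) and, per the open question above, could also perform the
        final mount at mount_path inside the guest."""
        raise NotImplementedError("proposal only -- no such API yet")

    # Intended usage (by analogy with nova volume-attach):
    #   attach_filesystem(nova, server, share_id='SHARE-UUID',
    #                     mount_path='/mnt/share')
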
  26. 26. 26 KVM + VIRTFS/9P + CEPHFS.KO ● Model – mount kernel CephFS on host – pass-through to guest via virtfs/9p ● Pros – security: tenant remains isolated from storage net + locked inside a directory ● Cons – require modern Linux guests – 9p not supported on some distros – “virtfs is ~50% slower than a native mount?” ● Status – Prototype from Haomai Wang
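
The guest half of the virtfs/9p model is an ordinary 9p mount over virtio; the mount tag is whatever was configured for the virtfs export on the host. A sketch of that guest-side mount, where the tag and target path are examples and the guest kernel needs 9p/9pnet_virtio support.

    # Sketch: guest-side mount for the virtfs/9p model. The tag must match
    # the mount_tag configured for the virtfs export on the host; the tag and
    # target path here are examples.
    import subprocess

    MOUNT_TAG = 'manila_share'

    subprocess.check_call(['mkdir', '-p', '/mnt/share'])
    subprocess.check_call(['mount', '-t', '9p',
                           '-o', 'trans=virtio,version=9p2000.L',
                           MOUNT_TAG, '/mnt/share'])
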
  27. 27. 27 KVM + NFS + CEPHFS.KO ● Model – mount kernel CephFS on host – pass-through to guest via NFS ● Pros – security: tenant remains isolated from storage net + locked inside a directory – NFS is more standard ● Cons – NFS has weak caching consistency – NFS is slower ● Status – no prototype
  28. 28. 28 (LXC, NOVA-DOCKER) + CEPHFS.KO ● Model – host mounts CephFS directly – mount --bind share into container namespace ● Pros – best performance – full CephFS semantics ● Cons – rely on container for security ● Status – no prototype
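
In the container case the host owns the Ceph credentials: it mounts CephFS once and bind-mounts only the tenant's share directory into the container's filesystem tree, so the tenant never sees a Ceph key. A sketch of those host-side steps; the paths, monitor address, and share UUID are examples.

    # Sketch: host-side steps for the container model. The host mounts CephFS
    # with its own credentials and bind-mounts just the share directory into
    # the container rootfs. Paths, monitor address, and the share UUID are
    # examples.
    import subprocess

    CEPHFS_ROOT = '/srv/cephfs'
    SHARE_DIR = CEPHFS_ROOT + '/manila/SHARE-UUID'
    TARGET = '/var/lib/containers/web1/rootfs/mnt/share'

    subprocess.check_call(['mount', '-t', 'ceph', '10.0.0.1:6789:/',
                           CEPHFS_ROOT,
                           '-o', 'name=manila,secretfile=/etc/ceph/manila.secret'])
    subprocess.check_call(['mkdir', '-p', TARGET])
    subprocess.check_call(['mount', '--bind', SHARE_DIR, TARGET])
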
  29. 29. 29 IRONIC + CEPHFS.KO ● Model – mount CephFS directly from bare metal “guest” ● Pros – best performance – full feature set ● Cons – rely on CephFS security – networking? – agent to do the mount? ● Status – no prototype – no suitable (ironic) agent (yet)
  30. 30. 30 THE MOUNT PROBLEM ● Containers may break the current 'network fs' assumption – mounting becomes driver-dependent; harder for tenant to do the right thing ● Nova “attach fs” API could provide the needed entry point – KVM: qemu-guest-agent – Ironic: no guest agent yet... – containers (lxc, nova-docker): use mount --bind from host ● Or, make tenant do the final mount? – Manila API to provide command (template) to perform the mount ● e.g., “mount -t ceph $cephmonip:/manila/$uuid $PATH -o ...” – Nova lxc and docker ● bind share to a “dummy” device /dev/manila/$uuid ● API mount command is 'mount --bind /dev/manila/$uuid $PATH'
  31. 31. 31 SECURITY: NO FREE LUNCH ● (KVM, Ironic) + ceph.ko – access to storage network relies on Ceph security ● KVM + (virtfs/9p, NFS) + ceph.ko – better security, but – pass-through/proxy limits performance ● (by how much?) ● Containers – security (vs a VM) is weak at baseline, but – host performs the mount; tenant locked into their share directory
  32. 32. 32 PERFORMANCE ● 2 nodes – Intel E5-2660 – 96GB RAM – 10gb NIC ● Server – 3 OSD (Intel S3500) – 1 MON – 1 MDS ● Client VMs – 4 cores – 2GB RAM ● iozone, 2x available RAM ● CephFS native – VM ceph.ko → server ● CephFS 9p/virtfs – VM 9p → host ceph.ko → server ● CephFS NFS – VM NFS → server ceph.ko → server
  33. 33. 33 SEQUENTIAL
  34. 34. 34 RANDOM
  35. 35. 35 SUMMARY MATRIX (performance / consistency / VM / gateway / net hops / security / agent / mount agent / prototype) – kvm + ganesha + libcephfs: slower (?) / weak (nfs) / yes / yes / 2 / host / no / yes / yes – kvm + virtfs + ceph.ko: good / good / yes / yes / 1 / host / no / yes / yes – kvm + nfs + ceph.ko: good / weak (nfs) / yes / yes / 1 / host / no / yes / no – kvm + ceph.ko: better / best / yes / no / 1 / ceph / no / yes / no – lxc + ceph.ko: best / best / no / no / 1 / ceph / no / no / no – nova-docker + ceph.ko: best / best / no / no / 1 / ceph / no / no / IBM talk - Thurs 9am – ironic + ceph.ko: best / best / no / no / 1 / ceph / yes / yes / no
  36. 36. CONTAINER ORCHESTRATION
  37. 37. 37 CONTAINERS ARE DIFFERENT ● nova-docker implements a Nova view of a (Docker) container – treats container like a standalone system – does not leverage most of what Docker has to offer – Nova == IaaS abstraction ● Kubernetes is the new hotness – higher-level orchestration for containers – draws on years of Google experience running containers at scale – vibrant open source community
  38. 38. 38 KUBERNETES SHARED STORAGE ● Pure Kubernetes – no OpenStack ● Volume drivers – Local ● hostPath, emptyDir – Unshared ● iSCSI, GCEPersistentDisk, Amazon EBS, Ceph RBD – local fs on top of existing device – Shared ● NFS, GlusterFS, Amazon EFS, CephFS ● Status – Ceph drivers under review ● Finalizing model for secret storage, cluster parameters (e.g., mon IPs) – Drivers expect pre-existing volumes ● recycled; missing REST API to create/destroy volumes
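
For reference, this is roughly what a pod using the RBD and CephFS volume plugins looks like, written here as the Python dict you would POST to the Kubernetes API server. The field names follow the drivers that were under review at the time, so treat them as indicative; the monitors, pool/image names, and secret name are examples.

    # Hedged sketch of a pod spec using the (then under-review) Ceph volume
    # plugins: an unshared RBD image for one mount path and a shared CephFS
    # mount for another. Field names are indicative of the proposed drivers;
    # addresses, names, and the secret reference are examples.
    pod = {
        'apiVersion': 'v1',
        'kind': 'Pod',
        'metadata': {'name': 'web'},
        'spec': {
            'containers': [{
                'name': 'nginx',
                'image': 'nginx',
                'volumeMounts': [
                    {'name': 'data', 'mountPath': '/var/lib/data'},
                    {'name': 'content', 'mountPath': '/usr/share/nginx/html'},
                ],
            }],
            'volumes': [
                # Unshared: an RBD image with a local filesystem on top.
                {'name': 'data',
                 'rbd': {'monitors': ['10.0.0.1:6789'], 'pool': 'rbd',
                         'image': 'web-data', 'user': 'admin',
                         'secretRef': {'name': 'ceph-secret'},
                         'fsType': 'ext4'}},
                # Shared: CephFS, mountable by every pod that needs the data.
                {'name': 'content',
                 'cephfs': {'monitors': ['10.0.0.1:6789'], 'user': 'admin',
                            'secretRef': {'name': 'ceph-secret'}}},
            ],
        },
    }
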
  39. 39. 39 KUBERNETES ON OPENSTACK ● Provision Nova VMs – KVM or ironic – Atomic or CoreOS ● Kubernetes per tenant ● Provision storage devices – Cinder for volumes – Manila for shares ● Kubernetes binds into pod/container ● Status – Prototype Cinder plugin for Kubernetes https://github.com/spothanis/kubernetes/tree/cinder-vol-plugin
  40. 40. 40 WHAT NEXT? ● Ironic agent – enable Cinder (and Manila?) on bare metal – Cinder + Ironic ● 5:20p Wednesday (Cinder) ● Expand breadth of Manila drivers – virtfs/9p, ceph-native, NFS proxy via host, etc. – the last mile is not always the tenant network! ● Nova “attach fs” API (or equivalent) – simplify tenant experience – paper over VM vs container vs bare metal differences
  41. 41. THANK YOU! Sage Weil CEPH PRINCIPAL ARCHITECT Haomai Wang FREE AGENT sage@redhat.com haomaiwang@gmail.com @liewegas
  42. 42. 42 FOR MORE INFORMATION ● http://ceph.com ● http://github.com/ceph ● http://tracker.ceph.com ● Mailing lists – ceph-users@ceph.com – ceph-devel@vger.kernel.org ● irc.oftc.net – #ceph – #ceph-devel ● Twitter – @ceph
