John Spray - Ceph in Kubernetes


John Spray - Ceph in Kubernetes. Talk from the CloudStack / Ceph day Thursday, April 19 in London

Published in: Technology

  1. Combining Ceph with Kubernetes. 19 April 2018. John Spray, Principal Software Engineer, Office of the CTO
  2. About me
       commit ac30e6cee2b2d3815438f1a392a951d511bddfd4
       Author: John Spray <>
       Date:   Thu Jun 30 14:05:02 2016 +0100

           mgr: create ceph-mgr service

           Signed-off-by: John Spray <>
  3. Ceph operations today
     ● RPM packages (all daemons on a server run the same version)
     ● Physical services configured by an external orchestrator:
       ● Ansible, Salt, etc.
     ● Logical entities configured via Ceph itself (pools, filesystems, auth):
       ● CLI, mgr module interface, restful module
       ● Separate workflow from the physical deployment
     ● Plus some external monitoring to make sure your services stay up
  4. Pain points
     ● All those elements combine to create a high surface area between users and the software.
     ● Lots of human decision making, with many opportunities for mistakes
     ● In practice, deployments are often kept relatively static after the initial decision making is done.
     Can new container environments enable something better?
  5. Glorious Container Future
     ● Unicorns for everyone!
     ● Ice cream for breakfast!
     ● Every Ceph cluster comes with a free Pony!
     ● Sunny and warm every day!
  6. The real container future
     ● Kubernetes is a tool that implements the basic operations that we need for the management of cluster services:
       ● Deploy builds (in container format)
       ● Detect devices, start container in specific location (OSD)
       ● Schedule/place groups of services (MDS, RGW)
     ● If we were writing a Ceph management server/agent today, it would look much like Kubernetes: so let’s just use Kubernetes!
     Kubernetes gives us the primitives; we still have to do the business logic and UI.
  7. Why Kubernetes?
     ● Widely adopted (Red Hat OpenShift, Google Kubernetes Engine, Amazon EKS, etc.)
     ● CLI/REST driven (extensible API)
     ● Lightweight design
  8. Rook
  9. Rook
     ● Simplified, container-native way of consuming Ceph
     ● Built for Kubernetes, extending the Kubernetes API
     ● CNCF inception project
  10. Rook components
      ● Image: Ceph and Rook binaries in one artifact
      ● ‘agent’ handles mounting volumes
        ● Hides the complexity of client version and kernel version variations
      ● ‘operator’ watches objects in etcd, manipulates Ceph in response
        ● Create a “Filesystem” object, and the Rook operator does the corresponding “ceph fs new”
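The operator behaviour on this slide (compare declared objects with what Ceph has, then act on the difference) can be sketched as a small reconcile function. This is an illustration only, not Rook's code: `FilesystemSpec`, `reconcile` and the pool names are hypothetical, and the sketch assumes the pools already exist, since `ceph fs new` takes a filesystem name, a metadata pool and a data pool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilesystemSpec:
    """Hypothetical stand-in for a Rook "Filesystem" object."""
    name: str
    metadata_pool: str
    data_pool: str


def reconcile(desired, existing_names):
    """Return the Ceph commands needed to create the missing filesystems.

    desired: iterable of FilesystemSpec (what the user declared).
    existing_names: set of filesystem names Ceph already reports.
    """
    commands = []
    for spec in desired:
        if spec.name in existing_names:
            continue  # already present, nothing to do
        commands.append(
            ["ceph", "fs", "new", spec.name, spec.metadata_pool, spec.data_pool]
        )
    return commands


if __name__ == "__main__":
    declared = [FilesystemSpec("myfs", "myfs-metadata", "myfs-data")]
    for command in reconcile(declared, existing_names=set()):
        print(" ".join(command))
```

Returning the commands instead of running them keeps the sketch side-effect free; a real operator would also watch for changes and handle deletion.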
  11. Rook example
        $ kubectl create -f rook-cluster.yaml
        $ kubectl -n rook get pod
        NAME                              READY   STATUS
        rook-api-1511082791-7qs0m         1/1     Running
        rook-ceph-mgr0-1279756402-wc4vt   1/1     Running
        rook-ceph-mon0-jflt5              1/1     Running
        rook-ceph-mon1-wkc8p              1/1     Running
        rook-ceph-mon2-p31dj              1/1     Running
        rook-ceph-osd-0h6nb               1/1     Running
  12. Rook user interface
      ● Rook objects are created via the extensible Kubernetes API service (Custom Resource Definitions)
        ● aka: kubectl + yaml files
      ● This style is consistent with the Kubernetes ecosystem, but could benefit from a friendlier layer on top
        ● “point and click” is desirable for many users (& vendors)
        ● declarative configuration is not always a good fit for storage: deleting a pool should require a confirmation button!
  13. Combining Rook with ceph-mgr
  14. “Just give me the storage”
      ● Rook’s simplified model is suitable for people who do not want to pay any attention to how Ceph is configured: they just want to see a volume attached to their container.
      ● However: people buying hardware (or paying for cloud) often care a lot about how the storage cluster is configured.
      ● Lifecycle: start out not caring about details, but care more and more as time goes on, eventually want to get into the details and optimize use of resources.
  15. What is ceph-mgr?
      ● Component of RADOS: a sibling of the mon and OSD daemons. C++ code using the same auth/networking stack.
      ● Mandatory component: includes key functionality
      ● Host to Python modules that do monitoring/management
      ● Relatively simple in itself: the fun parts are the Python modules.
  16. dashboard module
      ● Mimic (13.2.x) release includes an extended management web UI based on openATTIC
      ● Would like Kubernetes integration, so that we can create containers from the dashboard too:
        ● The “Create Filesystem” button starts an MDS cluster
        ● A “Create OSD” button that starts OSDs
      → Call out to Rook from ceph-mgr (and to other orchestrators too)
  17. Why not build Rook-like functionality into mgr?
      1. Upgrades! An out-of-Ceph component that knows how to orchestrate a Ceph upgrade, while other Ceph services may be offline (aka “who manages the manager?”)
      2. Commonality between simplified pure-Rook systems and fully-featured containerized Ceph clusters.
      3. Retain Rook’s client mounting/volume capabilities: we are publishing info about the Ceph cluster into K8s so that Rook can take care of the volume management.
  18. How can we re-use the Rook operator?
      How can we share Rook’s code for running containers, without limiting ourselves to their Ceph feature subset?
      → Modify Rook to make the non-container parts of CRD objects optional (e.g. pools on a Filesystem)
      → ceph-mgr creates a cut-down Filesystem object to get MDS containers created
      → Migration path from pure-Rook systems to general purpose Ceph clusters
  19. Two ways to consume containerized Ceph
      [Diagram. Labels: Rook operator; K8s; ceph-mgr; Rook user; “kubectl, limited feature set”; “Full control, point+click”; “Migration (if desired)”; Ceph image]
  20. What doesn’t Kubernetes do for us?
      ● Installing itself (obviously)
      ● Configuring the underlying networks
      ● Bootstrapping Rook
      → External setup tools will continue to have a role in the non-Ceph-specific tasks
  21. Status/tasks
      ● Getting Rook to consume the upstream Ceph container image, instead of its own custom-built single-binary image.
      ● Patching Rook operator to enable doing just the container parts
      ● Patching Rook to enable injecting config+key to manage an existing cluster
      ● Connecting ceph-mgr backend to drive Rook via the K8s API
      ● Exposing K8s-enabled workflows in the dashboard UI
      → Goal: one click Filesystem creation (...and one click {everything_else} too)
  22. Other enabling work
  23. Background
      ● Recall: external orchestrators are handling physical deployment of services, but most logical management is still direct to Ceph
      ● Or is it? Increasingly, orchestrators mix physically deploying Ceph services with logical configuration:
        ● Rook creates volumes as CephFS filesystems, but this means creating underlying pools. How does it know how to configure them?
        ● Same for anything deploying RGW
        ● Rook also exposes some health/monitoring of the Ceph cluster, but is this in terms a non-Ceph-expert can understand?
      ● We must continue to make managing Ceph easier, and where possible, remove the need for intervention.
  24. Placement group merging (Experimental for Mimic)
      ● Historically, pg_num could be increased but not decreased
      ● Sometimes problematic, e.g. when physically shrinking a cluster, or if bad pg_nums were chosen.
      ● Bigger problem: prevented automatic pg_num selection, because mistakes could not be reversed.
      ● Implementation is not simple, and doing it still has an IO cost, but the option will be there
      → Now we can autoselect pg_num!
  25. Automatic pg_num selection (Experimental for Mimic)
      ● Hard (impossible?) to do perfectly
      ● Pretty easy to do useful common cases:
        ● Select initial pg_nums according to expected space use
        ● Increase pg_nums if actual space use has gone ~2x over ideal PG capacity
        ● Decrease pg_num for underused pools if another pool needs to increase theirs
      ● Not an optimiser! But probably going to do the job as well as most humans are doing it today.
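The first two common cases on this slide can be sketched in a few lines. This is not the Ceph implementation: the per-PG capacity target, the rounding to a power of two and the minimum of 8 PGs are all assumptions chosen for illustration, as are the function names.

```python
def round_up_pow2(n):
    """Smallest power of two that is >= n (never less than 1)."""
    power = 1
    while power < n:
        power *= 2
    return power


def initial_pg_num(expected_bytes, target_bytes_per_pg, min_pgs=8):
    """Pick a starting pg_num from the space the pool is expected to use."""
    wanted = -(-expected_bytes // target_bytes_per_pg)  # ceiling division
    return max(min_pgs, round_up_pow2(wanted))


def should_increase(actual_bytes, pg_num, target_bytes_per_pg, factor=2):
    """True once actual use exceeds `factor` times the ideal capacity of the PGs."""
    return actual_bytes > factor * pg_num * target_bytes_per_pg


if __name__ == "__main__":
    GIB = 2**30
    pg_num = initial_pg_num(100 * GIB, target_bytes_per_pg=GIB)
    print(pg_num, should_increase(300 * GIB, pg_num, GIB))
```

The third case (shrinking an underused pool so another can grow) needs a view across all pools and depends on PG merging, so it is left out of this sketch.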
  26. Automatic pg_num selection (continued) (Experimental for Mimic)
      ● Prompting users for expected capacity makes sense for data pools, but not for metadata pools:
        ● Combine data and metadata pool creation into one command
        ● Wrap pools into a new “poolset” structure describing policy
        ● Auto-construct poolsets for existing deployments, but don’t auto-adjust unless explicitly enabled
      ceph poolset create cephfs my_filesystem 100GB
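One way to picture the “poolset” idea is as a small record that groups the pools with the policy the user supplied. Everything below is hypothetical: the field names, the pool naming scheme, the use of decimal units for “GB”, and the choice to enable auto-adjustment for newly created poolsets are assumptions, not the design from the talk.

```python
from dataclasses import dataclass, field

# Assumption: decimal units. The slide does not say whether GB means 10**9 or 2**30.
UNITS = {"TB": 10**12, "GB": 10**9, "MB": 10**6, "KB": 10**3}


def parse_size(text):
    """Convert a string such as '100GB' into a byte count."""
    upper = text.strip().upper()
    for suffix, multiplier in UNITS.items():
        if upper.endswith(suffix):
            return int(upper[: -len(suffix)]) * multiplier
    return int(upper)  # no suffix: treat the value as plain bytes


@dataclass
class Poolset:
    """Hypothetical poolset: a group of pools plus the policy used to size them."""
    application: str
    name: str
    expected_bytes: int
    auto_adjust: bool
    pools: dict = field(default_factory=dict)  # role -> pool name


def create_poolset(application, name, size_text):
    """Model a request like 'ceph poolset create cephfs my_filesystem 100GB'."""
    poolset = Poolset(application, name, parse_size(size_text), auto_adjust=True)
    # One request covers both pools; the user gives a single capacity figure.
    poolset.pools = {"data": f"{name}_data", "metadata": f"{name}_metadata"}
    return poolset


def adopt_existing(application, name, pools):
    """Wrap the pools of an existing deployment, leaving auto-adjustment off."""
    return Poolset(application, name, expected_bytes=0,
                   auto_adjust=False, pools=dict(pools))
```

For example, `create_poolset("cephfs", "my_filesystem", "100GB")` yields a poolset with a data and a metadata pool and an expected size of 100 * 10**9 bytes under the decimal-unit assumption.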
  27. Progress bars (Experimental for Mimic)
      ● Health reporting was improved in Luminous, but in many cases it is still too low level.
      ● Especially placement groups:
        ● Hard to distinguish between real problems and normal rebalancing
        ● Once we start auto-picking pg_num, users won’t know what a PG is until they see them in the health status
      ● Introduce `progress` module to synthesize a high level view from PG state: “56% recovered from failure of OSD 123”
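A minimal sketch of how such a message could be derived from PG state, assuming the module records which PGs were affected when the OSD failed and can later ask which PGs are still degraded. The function names and the PG bookkeeping are hypothetical and not taken from the actual `progress` module.

```python
def recovery_percent(affected_pgs, still_degraded):
    """Whole-number percentage of the initially affected PGs that have recovered."""
    affected = set(affected_pgs)
    if not affected:
        return 100  # nothing was affected, so nothing is left to recover
    remaining = affected & set(still_degraded)
    recovered = len(affected) - len(remaining)
    return 100 * recovered // len(affected)  # integer maths avoids float rounding


def describe(osd_id, affected_pgs, still_degraded):
    """Render the high level, user-facing summary line."""
    percent = recovery_percent(affected_pgs, still_degraded)
    return f"{percent}% recovered from failure of OSD {osd_id}"


if __name__ == "__main__":
    affected = [f"1.{i:x}" for i in range(25)]  # 25 made-up PG ids
    print(describe(123, affected, still_degraded=affected[:11]))
```

PGs that become degraded for unrelated reasons are ignored because only the initially affected set is counted, which is one way to separate a specific failure from normal rebalancing.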
  28. Wrap up
      ● All these improvements reduce the cognitive load on the ordinary user.
        ● Do not need to know what an MDS is: ask Rook for a filesystem, and get one.
        ● Do not need to know what a placement group is
        ● Do not need to know magic commands: look at the dashboard
      ● Actions that no longer require human thought can now be tied into automated workflows: fulfil the promise of software defined storage.
  29. Q&A