Combining Ceph with Kubernetes
19 April 2018
John Spray
Principal Software Engineer, Office of the CTO
<john.spray@redhat.com>
2
About me
commit ac30e6cee2b2d3815438f1a392a951d511bddfd4
Author: John Spray <john.spray@redhat.com>
Date: Thu Jun 30 14:05:02 2016 +0100
mgr: create ceph-mgr service
Signed-off-by: John Spray <john.spray@redhat.com>
3
Ceph operations today
● RPM packages (all daemons on a server share the same version)
● Physical services configured by an external orchestrator:
● Ansible, Salt, etc.
● Logical entities configured via Ceph itself (pools, filesystems, auth):
● CLI, mgr module interface, restful module
● Separate workflow from the physical deployment
● Plus some external monitoring to make sure your services stay up
4
Pain points
● All those elements combine to create a high surface area between users and the software.
● Lots of human decision-making, and opportunities for mistakes
● In practice, deployments are often kept relatively static after the initial decision making is done.
Can new container environments enable something better?
5
Glorious Container Future
● Unicorns for everyone!
● Ice cream for breakfast!
● Every Ceph cluster comes with a free Pony!
● Sunny and warm every day!
6
The real container future
● Kubernetes is a tool that implements the basic operations that
we need for the management of cluster services
● Deploy builds (in container format)
● Detect devices, start container in specific location (OSD)
● Schedule/place groups of services (MDS, RGW)
● If we were writing a Ceph management server/agent today, it
would look much like Kubernetes: so let’s just use Kubernetes!
Kubernetes gives us the primitives; we still have to provide the business logic and the UI
7
Why Kubernetes?
● Widely adopted (Red Hat OpenShift, Google Kubernetes Engine, Amazon EKS, etc.)
● CLI/REST driven (extensible API)
● Lightweight design
Rook
9
Rook
● Simplified, container-native way of consuming Ceph
● Built for Kubernetes, extending the Kubernetes API
● CNCF inception project
http://rook.io/
http://github.com/rook/
10
Rook components
● Image: Ceph and Rook binaries in one artifact
● ‘agent’ handles mounting volumes
● Hide complexity of client version, kernel version variations
● ‘operator’ watches objects in etcd, manipulates Ceph in
response
● Create a “Filesystem” object, and the Rook operator runs the corresponding “ceph fs new”
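To make the “Filesystem” example concrete, here is a minimal sketch of creating such an object through the official Kubernetes Python client; the apiVersion, plural and spec field names are assumptions modelled on Rook CRDs of this era, so check them against the Rook version you actually run.

# Hypothetical sketch: hand the Rook operator a "Filesystem" custom object,
# which it turns into pools, "ceph fs new" and MDS containers.
# Group/version/spec field names are assumptions, not a definitive schema.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

filesystem = {
    "apiVersion": "rook.io/v1alpha1",          # assumed CRD group/version
    "kind": "Filesystem",
    "metadata": {"name": "myfs", "namespace": "rook"},
    "spec": {
        "metadataPool": {"replicated": {"size": 3}},
        "dataPools": [{"replicated": {"size": 3}}],
        "metadataServer": {"activeCount": 1},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="rook.io", version="v1alpha1", namespace="rook",
    plural="filesystems", body=filesystem,
)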
11
Rook example
$ kubectl create -f rook-cluster.yaml
$ kubectl -n rook get pod
NAME READY STATUS
rook-api-1511082791-7qs0m 1/1 Running
rook-ceph-mgr0-1279756402-wc4vt 1/1 Running
rook-ceph-mon0-jflt5 1/1 Running
rook-ceph-mon1-wkc8p 1/1 Running
rook-ceph-mon2-p31dj 1/1 Running
rook-ceph-osd-0h6nb 1/1 Running
12
Rook user interface
● Rook objects are created via the extensible Kubernetes API service (Custom Resource Definitions)
● aka: kubectl + YAML files
● This style is consistent with the Kubernetes ecosystem, but could benefit from a friendlier layer on top
● “point and click” is desirable for many users (& vendors)
● declarative configuration is not always a good fit for storage: deleting a pool should require a confirmation button!
Combining Rook with ceph-mgr
14
“Just give me the storage”
● Rook’s simplified model is suitable for people who do not want to pay any attention to how Ceph is configured: they just want to see a volume attached to their container.
● However: people buying hardware (or paying for cloud) often care a lot about how the storage cluster is configured.
● Lifecycle: users start out not caring about details, care more and more as time goes on, and eventually want to get into the details and optimize their use of resources.
15
What is ceph-mgr?
● Component of RADOS: a sibling of the mon and OSD daemons. C++ code using the same auth/networking stack.
● Mandatory component: includes key functionality
● Hosts the Python modules that do monitoring/management (minimal sketch below)
● Relatively simple in itself: the fun parts are the Python modules.
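As a rough illustration of what hosting Python modules means, here is a minimal sketch of a mgr module; the hook names follow the Luminous/Mimic-era mgr_module interface, and exact signatures vary between releases.

# Minimal, illustrative ceph-mgr module: registers one command and answers it.
# Hook names/signatures are approximate for this era, not a definitive example.
from mgr_module import MgrModule

class Module(MgrModule):
    COMMANDS = [
        {
            "cmd": "hello name=who,type=CephString,req=false",
            "desc": "Say hello from a mgr module",
            "perm": "r",
        },
    ]

    def handle_command(self, cmd):
        # Modules can also read cluster state, e.g. self.get("osd_map")
        who = cmd.get("who", "world")
        return 0, "hello, %s" % who, ""  # (retcode, stdout, stderr)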
16
dashboard module
● The Mimic (13.2.x) release includes an extended management web UI based on openATTIC
● Would like Kubernetes integration, so that we can create
containers from the dashboard too:
● The “Create Filesystem” button starts MDS cluster
● A “Create OSD” button that starts OSDs
→ Call out to Rook from ceph-mgr
(and to other orchestrators too)
17
Why not build Rook-like functionality into mgr?
1. Upgrades! An out-of-Ceph component that knows how to orchestrate a Ceph upgrade, while other Ceph services may be offline (aka “who manages the manager?”)
2. Commonality between simplified pure-Rook systems and fully-featured containerized Ceph clusters.
3. Retain Rook’s client mounting/volume capabilities: we are publishing info about the Ceph cluster into K8s so that Rook can take care of the volume management.
18
How can we re-use the Rook operator?
How can we share Rook’s code for running containers, without
limiting ourselves to their Ceph feature subset?
→ Modify Rook to make the non-container parts of CRD
objects optional (e.g. pools on a Filesystem)
→ ceph-mgr creates cut-down Filesystem object to get
MDS containers created
→ migration path from pure-Rook systems to general
purpose Ceph clusters
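Assuming that change, the cut-down object might look something like this (field names hypothetical): ceph-mgr asks Rook only for the MDS containers and keeps pool/filesystem creation to itself.

# Hypothetical cut-down Filesystem object created by ceph-mgr: no pool spec,
# so the Rook operator is only asked to schedule MDS containers for a
# filesystem that ceph-mgr has already configured itself.
mds_only = {
    "apiVersion": "rook.io/v1alpha1",          # assumed group/version
    "kind": "Filesystem",
    "metadata": {"name": "my_filesystem", "namespace": "rook"},
    "spec": {
        # metadataPool / dataPools intentionally omitted
        "metadataServer": {"activeCount": 1},
    },
}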
19
Two ways to consume containerized Ceph
[Diagram] Both paths drive the same Rook operator and Ceph image on K8s: a Rook user works via kubectl (limited feature set), while ceph-mgr offers full control with point+click; migration (if desired) from the Rook-only path to the ceph-mgr path.
20
What doesn’t Kubernetes do for us?
● Installing itself (obviously)
● Configuring the underlying networks
● Bootstrapping Rook
→ External setup tools will continue to have a role in the non-Ceph-specific tasks
21
Status/tasks
● Getting Rook to consume the upstream Ceph container image,
instead of its own custom-built single-binary image.
● Patching the Rook operator to enable doing just the container parts
● Patching Rook to enable injecting a config+key to manage an existing cluster
● Connecting the ceph-mgr backend to drive Rook via the K8s API
● Exposing K8s-enabled workflows in the dashboard UI
→ Goal: one click Filesystem creation
(...and one click {everything_else} too)
Other enabling work
23
Background
● Recall: external orchestrators are handling physical deployment
of services, but most logical management is still direct to Ceph
● Or is it? Increasingly, orchestrators mix physically deploying
Ceph services with logical configuration:
● Rook creates volumes as CephFS filesystems, but this means creating the underlying pools. How does it know how to configure them?
● Same for anything deploying RGW
● Rook also exposes some health/monitoring of the Ceph cluster, but is this in terms that a non-Ceph-expert can understand?
● We must continue to make managing Ceph easier, and where possible, remove the need for intervention.
24
Placement group merging
Experimental for Mimic
● Historically, pg_num could be increased but not decreased
● Sometimes problematic, when e.g. physically shrinking a cluster,
or if bad pg_nums were chosen.
● Bigger problem: prevented automatic pg_num selection,
because mistakes could not be reversed.
● The implementation is not simple, and doing it still has an IO cost, but the option will be there → now we can auto-select pg_num!
25
Automatic pg_num selection
Experimental for Mimic
● Hard (impossible?) to do perfectly
● Pretty easy to handle the useful common cases (illustrative sketch below):
● Select initial pg_nums according to expected space use
● Increase pg_nums if actual space use has gone ~2x over ideal PG capacity
● Decrease pg_num for underused pools if another pool needs to increase
theirs
● Not an optimiser! But probably going to do the job as well as
most humans are doing it today.
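A toy sketch of the increase case (not the actual mgr implementation); the “ideal PG capacity” threshold is an assumed value purely for illustration.

# Illustrative heuristic only: grow a pool's pg_num once its data has
# outgrown the ideal PG size by roughly 2x.
TARGET_BYTES_PER_PG = 50 * 2**30       # assumed ideal PG capacity (50 GiB)

def next_pg_num(current_pg_num, stored_bytes):
    bytes_per_pg = stored_bytes / float(current_pg_num)
    if bytes_per_pg > 2 * TARGET_BYTES_PER_PG:
        return current_pg_num * 2      # keep pg_num a power of two
    return current_pg_num              # decreases would follow similar logic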
26
Automatic pg_num selection (continued)
Experimental for Mimic
● Prompting users for expected capacity makes sense for data
pools, but not for metadata pools:
● Combine data and metadata pool creation into one command
● Wrap pools into new “poolset” structure describing policy
● Auto-construct poolsets for existing deployments, but don’t auto-adjust
unless explicitly enabled
ceph poolset create cephfs my_filesystem 100GB
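Purely to illustrate how an expected-capacity hint could drive the initial pg_num choice (threshold and rounding are assumptions, not the real policy):

# Illustrative only: derive an initial pg_num from the poolset's expected
# capacity, rounded up to a power of two.
TARGET_BYTES_PER_PG = 50 * 2**30       # assumed ideal PG capacity (50 GiB)

def initial_pg_num(expected_bytes, minimum=8):
    ideal = max(minimum, expected_bytes // TARGET_BYTES_PER_PG)
    return 1 << (int(ideal) - 1).bit_length()   # round up to a power of two

print(initial_pg_num(100 * 2**30))     # the 100GB poolset above -> 8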
27
Progress bars
Experimental for Mimic
● Health reporting was improved in Luminous, but in many cases it is still too low-level.
● Especially placement groups:
● hard to distinguish between real problems and normal rebalancing
● Once we start auto-picking pg_num, users won’t know what a PG is until they first see one in the health status
● Introduce `progress` module to synthesize high level view from
PG state: “56% recovered from failure of OSD 123”
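The arithmetic behind such a message is simple; here is a sketch of the idea (not the module's real internals) that reduces raw recovery counters to a single figure:

# Illustrative only: turn raw recovery counters into one human-readable number.
def recovery_progress(degraded_objects, misplaced_objects, total_objects):
    if total_objects == 0:
        return 1.0
    unhealthy = degraded_objects + misplaced_objects
    return 1.0 - float(unhealthy) / total_objects

# e.g. 440 of 1000 objects still unhealthy -> "56% recovered from failure of OSD 123"
print("%d%% recovered" % round(100 * recovery_progress(340, 100, 1000)))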
28
Wrap up
● All these improvements reduce the cognitive load on the ordinary user.
● Do not need to know what an MDS is: ask Rook for a filesystem, and get one.
● Do not need to know what a placement group is
● Do not need to know magic commands: look at the dashboard
● Actions that no longer require human thought can now be tied into automated workflows: fulfilling the promise of software-defined storage.
Q&A
