SRE principles and (Kubernetes) Operator practice | DevNation Tech Talk

Why should you care about Operators?

Any application in any system must be installed, configured, managed and upgraded over time
Patching is critical to security

“Anything that isn’t automated is slowing you down”

$ kubectl scale deploy/staticweb --replicas=3

Deploying a database is easy

$ kubectl create deployment db --image=quay.io/my/db

Running a database over time is harder

● Resize/Upgrade
● Reconfigure
● Backup
● Healing

Extending the Kubernetes API
1. Application-specific custom controllers
2. Custom resource definitions (CRD)

How Does an Operator Work?
[Diagram: Developer / Kubernetes User; Custom Resource; K8s API; Kubernetes Operator = Custom Kubernetes Controller (Watch Events, Reconciliation) + Custom Resource Definition; Native Kubernetes Resources: Deployments, StatefulSets, Autoscalers, Secrets, Config maps, PersistentVolume]
kind: ProductionReadyDatabase
apiVersion: database.example.com/v1alpha1
metadata:
  name: my-important-database
spec:
  connectionPoolSize: 300
  readReplicas: 2
  version: v4.0.1
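
The "Watch Events" and "Reconciliation" steps can be sketched in plain Python without a real cluster. This is only an illustration: the `reconcile` and `apply` functions, the `DatabaseSpec` class, and the dict standing in for observed cluster state are all hypothetical names, not the API of any Operator SDK.

```python
from dataclasses import dataclass


@dataclass
class DatabaseSpec:
    """Desired state, mirroring two spec fields of the example custom resource."""
    read_replicas: int
    version: str


def reconcile(spec: DatabaseSpec, observed: dict) -> list[str]:
    """Compare desired and observed state; return the actions needed to converge.

    `observed` is a hypothetical stand-in for what a controller would read
    back from the Kubernetes API.
    """
    actions = []
    if observed.get("version") != spec.version:
        actions.append(f"upgrade to {spec.version}")
    current = observed.get("read_replicas", 0)
    if current != spec.read_replicas:
        actions.append(f"scale read replicas {current} -> {spec.read_replicas}")
    return actions


def apply(spec: DatabaseSpec) -> dict:
    """Pretend to carry out the actions by returning the converged state."""
    return {"version": spec.version, "read_replicas": spec.read_replicas}


spec = DatabaseSpec(read_replicas=2, version="v4.0.1")
observed = {"version": "v4.0.0", "read_replicas": 1}

first_pass = reconcile(spec, observed)   # drift detected: two actions
observed = apply(spec)
second_pass = reconcile(spec, observed)  # converged: nothing left to do

print(first_pass)
print(second_pass)
```

The second pass returning no actions is the point of the pattern: reconciliation is idempotent, so the controller can run it on every watch event without side effects once the state has converged.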

Custom Resource (CR)
kind: ProductionReadyDatabase
apiVersion: database.example.com/v1alpha1
metadata:
  name: my-production-ready-database
spec:
  clusterSize: 3
  readReplicas: 2
  version: v4.0.1
[...]

Operators are automated software managers, or software SREs, that manage the entire lifecycle of Kubernetes applications

controllers
Hausenblas, Schimanski. Programming Kubernetes. O’Reilly, 2019.

Value of Operators
● Improve the “time to first value” for your customers
● Minimize software upgrade risk and associated operational costs
● Embed best practices from the experts (you) into the Operator
● Provide a cloud-like "As a Service" experience

OPERATOR HUB
Operator Hub allows administrators to selectively make operators available from curated sources to users in the cluster.
● Red Hat Products
● ISV Partners
● Community

OPERATORS ACROSS THE INDUSTRY
...and many more

Operator Maturity Model
● Phase I, Basic Install: Automated application provisioning and configuration management
● Phase II, Seamless Upgrades: Patch and minor version upgrades supported
● Phase III, Full Lifecycle: App lifecycle, storage lifecycle (backup, failure recovery)
● Phase IV, Deep Insights: Metrics, alerts, log processing and workload analysis
● Phase V, Auto Pilot: Horizontal/vertical scaling, auto config tuning, abnormal detection, scheduling tuning

Site Reliability Engineering (SRE)
● O’Reilly “SRE Book” (Beyer et al.)
● Carla Geisser: “Human intervention… is a bug”
● SREs write code to fix those bugs
● SREs write software to run other software
● SREs write Kubernetes Operators

Level 1
Installation - Deployment
● Can you set Operand configuration in the CR?
● Do CR changes cause non-disruptive updates to the Operand?
● Does CR status show what has and hasn’t been applied?
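
One way to make CR status show what has and hasn’t been applied is to diff the requested spec against what the Operator last applied. The sketch below follows the Kubernetes conventions of `metadata.generation`, `status.observedGeneration`, and a `conditions` list; the `build_status` function, the `applied`/`pending` status fields, and the reason strings are hypothetical, chosen only for illustration.

```python
def build_status(cr: dict, applied_spec: dict) -> dict:
    """Report which spec fields have been applied and which are still pending."""
    spec = cr["spec"]
    # A field is pending when the last-applied value differs from the request.
    pending = sorted(k for k, v in spec.items() if applied_spec.get(k) != v)
    applied = sorted(k for k in spec if k not in pending)
    return {
        # Tells the reader which version of the spec this status describes.
        "observedGeneration": cr["metadata"]["generation"],
        "applied": applied,   # hypothetical field
        "pending": pending,   # hypothetical field
        "conditions": [{
            "type": "Ready",
            "status": "False" if pending else "True",
            "reason": "ConfigPending" if pending else "ConfigApplied",
        }],
    }


cr = {
    "metadata": {"name": "my-production-ready-database", "generation": 2},
    "spec": {"clusterSize": 3, "readReplicas": 2, "version": "v4.0.1"},
}
# Assume the Operator has so far rolled out only one of the two read replicas.
applied_spec = {"clusterSize": 3, "readReplicas": 1, "version": "v4.0.1"}

status = build_status(cr, applied_spec)
print(status)
```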

Level 2
Upgrades
● Can the Operator upgrade its Operand?
● Without disruption?
● Does CR status show what has and hasn’t been upgraded?

Level 3
Full Lifecycle Management
● Can your Operator back up its Operand?
● Can your Operator restore from a previous Operand backup?
● Readiness/Liveness probes? Active monitoring of basic execution state?
● CPU and other requests and limits set for Operand?

Level 4
Deep Insights
● Does the Operator expose metrics about its own health?
● Metrics and alerts for the Operand?
● Does CR status show what has and hasn’t been applied?

RED
Rate (aka Traffic) - Errors - Duration (aka Latency)
The RED Method defines the three key metrics for every service in your architecture.
● Rate (the number of requests per second)
● Errors (the number of those requests that are failing)
● Duration (the amount of time those requests take)
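
As a rough illustration of the three RED metrics, the sketch below computes them from a list of request records collected over a fixed time window. The `Request` record, the `red_metrics` function, and the choice of the mean as the duration summary are assumptions; production systems usually report duration as percentiles from a metrics library.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # how long the request took, in seconds
    failed: bool       # whether the request ended in an error


def red_metrics(requests: list[Request], window_s: float) -> dict:
    """Summarize a window of requests as Rate, Errors, and Duration."""
    count = len(requests)
    return {
        "rate_per_s": count / window_s,
        "errors": sum(1 for r in requests if r.failed),
        "mean_duration_s": (
            sum(r.duration_s for r in requests) / count if count else 0.0
        ),
    }


# Four requests observed during a two-second window, one of which failed.
window = [
    Request(0.25, False),
    Request(0.5, False),
    Request(0.25, False),
    Request(1.0, True),
]
metrics = red_metrics(window, window_s=2.0)
print(metrics)
```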

Level 5
Auto Pilot
● Marine autopilots are reasonable models, especially with rudder position feedback
● Auto scaling, healing, tuning
○ Detect condition from metrics, scale horizontally (Replicas) or vertically (Requests/Limits)
○ Think especially about scaling back down; resource savings
○ Detect deterioration in Operand(s) (based on Level 4’s metrics) and take action to redeploy or reconfigure
● CR Status, custom Events: Clear status and especially error conditions
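
Horizontal scaling from a metric, including scaling back down, can be sketched with the rule documented for the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current * currentMetric / targetMetric). The function name, the replica bounds, and the tolerance band below are assumptions for illustration; a real Operator would read the metric from its monitoring stack and then patch the Operand's replica count.

```python
import math


def desired_replicas(
    current: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Return the replica count that brings the per-replica metric to target."""
    ratio = current_metric / target_metric
    # Inside the tolerance band, do nothing; this avoids flapping on noise.
    if abs(ratio - 1.0) <= tolerance:
        return current
    wanted = math.ceil(current * ratio)
    # Clamp so a bad metric can never scale to zero or without bound.
    return max(min_replicas, min(max_replicas, wanted))


scale_up = desired_replicas(3, current_metric=200.0, target_metric=100.0)
scale_down = desired_replicas(6, current_metric=25.0, target_metric=100.0)
steady = desired_replicas(3, current_metric=105.0, target_metric=100.0)
capped = desired_replicas(8, current_metric=300.0, target_metric=100.0)
print(scale_up, scale_down, steady, capped)
```

The same function handles both directions: a metric well below target shrinks the replica count, which is where the resource savings come from.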

Level 5 (cont.)
Auto Pilot
“Toil Not, Neither Spin” (Kubernetes Operators, Dobies & Wood)
SRE defines “toil” as:
● Automatable - your computer would enjoy it!
● Without enduring value - needs to be done but doesn’t change the system
● Grows linearly with growth of the system

Operator Maturity Model
● Phase I, Basic Install: Automated application provisioning and configuration management
● Phase II, Seamless Upgrades: Patch and minor version upgrades supported
● Phase III, Full Lifecycle: App lifecycle, storage lifecycle (backup, failure recovery)
● Phase IV, Deep Insights: Metrics, alerts, log processing and workload analysis
● Phase V, Auto Pilot: Horizontal/vertical scaling, auto config tuning, abnormal detection, scheduling tuning

Experiments/Challenge Coins
“...left as an exercise for the reader…”
● SRE stuff: Add metrics awareness and tuning to your Operator
● Other APIs / API representations: k8fs?
● K8fs presents the Kubernetes API as a synthetic file hierarchy
● % cp manifest.yaml /mnt/k8s/ns/default/deployments/
● % echo 3 >/mnt/k8s/ns/default/deployments/myapp/replicas
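
As a starting point for the k8fs exercise, the sketch below maps a path in the synthetic hierarchy to the API object it would address. The layout `<mount>/ns/<namespace>/<resource>/<name>/<field>` is only inferred from the two example commands, and `Target` and `parse_path` are hypothetical names; nothing here talks to a cluster or implements a real filesystem.

```python
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class Target(NamedTuple):
    namespace: str
    resource: str
    name: Optional[str]   # None when the path addresses the whole collection
    field: Optional[str]  # None when the path addresses a whole object


def parse_path(path: str, mount: str = "/mnt/k8s") -> Target:
    """Split a k8fs-style path into namespace, resource, name, and field."""
    # relative_to raises ValueError if the path is not under the mount point.
    parts = PurePosixPath(path).relative_to(mount).parts
    if not 3 <= len(parts) <= 5 or parts[0] != "ns":
        raise ValueError(f"unsupported path layout: {path}")
    namespace, resource = parts[1], parts[2]
    name = parts[3] if len(parts) > 3 else None
    field = parts[4] if len(parts) > 4 else None
    return Target(namespace, resource, name, field)


collection = parse_path("/mnt/k8s/ns/default/deployments/")
replicas = parse_path("/mnt/k8s/ns/default/deployments/myapp/replicas")
print(collection)
print(replicas)
```

With a mapping like this, writing `3` to the `replicas` file would translate into a scale request for `deployments/myapp` in the `default` namespace.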

Resources
https://operatorframework.io
https://operatorhub.io
https://learn.openshift.com/operatorframework/
http://bit.ly/kubernetes-operators
