SRE & Kubernetes

February, 2022

Hello!
Afkham Azeez
VP & Deputy CTO - Cloud
azeez@wso2.com

Off-roading, camping, birding & nature enthusiast
Amateur radio operator - 4S7AZE
/afkham_azeez /afkhamazeez

Software development in 2020 and beyond…
A paradigm shift
● Major changes to how software is designed & built are taking place
● Businesses have realized that they have to build digital experiences
● Building a ‘Digitally-driven Business’ takes time and significant engineering
effort

Cloud-native software engineering
Building for the Cloud, on the Cloud
● Start building your product on the cloud
⦿ Have your dev environment on the cloud
● Multi-environment on the cloud
⦿ dev, test, staging, prod
● Leverage cloud services and APIs
⦿ Don’t run everything yourself
● Containers & Kubernetes are game changers

With great power comes great complexity!

What is Kubernetes?
● A cluster operating system
● A collection of control loops
https://buttondown.email/nelhage/archive/two-reasons-kubernetes-is-so-complex/

IaC
● The process of managing and provisioning computer data centers through
machine-readable definition files, rather than physical hardware configuration or
interactive configuration tools
● Everything is code
⦿ Cluster creation
⦿ Creating workloads
⦿ System configuration
⦿ Security
⦿ etc.
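
As a small illustration of "everything is code", a workload can be declared in a manifest that lives in version control and is applied to the cluster. This is only a sketch: the name `hello-web`, the labels, the resource figures and the `nginx:1.25` image are placeholders chosen for the example, not taken from this deck.

```yaml
# Hypothetical example: every name, label and value here is illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-web
  labels:
    app: hello-web
spec:
  replicas: 2                # desired state; a control loop reconciles toward it
  selector:
    matchLabels:
      app: hello-web
  template:
    metadata:
      labels:
        app: hello-web
    spec:
      containers:
      - name: web
        image: nginx:1.25    # placeholder image
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 250m
            memory: 256Mi
```

A file like this would typically be applied with `kubectl apply -f`, so a change to the workload becomes a reviewed change to the file rather than a manual action on the cluster.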

Site Reliability Engineering
● SRE is an approach taken to solve IT Operations challenges using Software
Engineering principles.
● SREs use software as a tool to manage cloud systems, diagnose problems, and
automate tasks.
● A key role of SRE is to find the right balance between releasing new features
and ensuring they are reliable:
⦿ Dev teams want to deploy as many features as possible as soon as possible
⦿ SRE tries to facilitate the dev team’s goals while ensuring reliability
● What is reliability?
⦿ Minimizing the impact on end users by minimizing outages

What do SREs do?
● Define compliance standards & processes
● Write cluster/system setup code
● Define build pipelines & help dev teams set up pipelines
● Set up monitoring and alerting (code)
● Plan backup and recovery
● Plan DR strategy
● Threat modeling & security scanning
● Incident management
● Chaos engineering
● Root cause analysis
● Perform routine tasks
● Cost analysis & optimization

CI/CD and GitOps
● Git repos as the single, central source of truth for the current cluster
configuration
● Use standard git practices
⦿ fork -> branch -> change -> build -> send PR -> CI -> review -> merge -> CD
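
The CD step at the end of this flow is often handled by a GitOps controller that keeps the cluster in sync with the repo. The deck does not name a tool, so the sketch below assumes Argo CD purely as an example. The application name, repo URL, path and namespaces are hypothetical.

```yaml
# Assumes Argo CD is installed in the `argocd` namespace (an assumption, not from the deck).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: expenserpt
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/cluster-config.git   # hypothetical repo
    targetRevision: main
    path: apps/expenserpt                                  # hypothetical path
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD itself runs in
    namespace: finance                        # hypothetical namespace
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual changes made directly on the cluster
```

With automated sync enabled, merging a PR is what changes the cluster, which is what keeps Git as the source of truth.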

Logging
[Diagram: a Kubernetes cluster with an omsagent pod on each of Node 1, Node 2 and Node 3, plus omsagent-rs; Data Explorer]

Log Analytics
https://docs.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-log-query

Logging + analytics + alerting
Log publishing -> Analytics -> Issue or anomaly detection -> Alerting -> Incident management

Observability, Monitoring & Alerting
● Observability vs monitoring - monitoring is what you do after a system is
observable
● System level monitoring
⦿ Cluster, pod, node health
⦿ System level services/APIs health - includes errors & latencies
⦿ System logs
⦿ Intrusion detection
⦿ DoS
● Application level monitoring
⦿ Application level services/APIs health - includes errors & latencies
⦿ Internal application level observability
⦿ Application logs
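
Alerting on API errors can itself be kept as code. The sketch below assumes Prometheus-style alerting rules, which is an assumption for illustration (the logging slides in this deck reference Azure Monitor). The metric `http_requests_total` with a `status` label, the 1% threshold and the `severity` value are all hypothetical.

```yaml
# Hypothetical Prometheus rule file; metric name, labels and thresholds are illustrative.
groups:
- name: api-health
  rules:
  - alert: HighErrorRate
    # Ratio of 5xx responses to all responses over the last 5 minutes.
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m]))
        /
      sum(rate(http_requests_total[5m])) > 0.01
    for: 10m              # fire only if the condition holds for 10 minutes
    labels:
      severity: page
    annotations:
      summary: "5xx error ratio has been above 1% for 10 minutes"
```

A latency alert would follow the same shape, with the expression replaced by a latency query.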

Incident Management
An unplanned interruption to, or reduction in the quality of,
an IT service

Normal incident management process

Major incident management process

SLI, SLO, SLA
● SLI
⦿ Metrics used to measure the level of service provided to end-users (e.g., availability,
latency, throughput)
● SLO
⦿ Targeted levels of service, measured by SLIs
⦿ Typically expressed as a percentage over a period of time
⦿ Help you figure out the right balance between product innovation and reliability
● SLA
⦿ Contractual agreements that outline the level of service end users can expect
⦿ If these promises are not met, there can be significant consequences for the provider,
which are often financial in nature

Error Budget
● Error budget = 1-SLO
● Acceptable levels of unreliability for a service before it falls out of compliance
with an SLO
● Measure of the risk you can take on for
⦿ getting new features in
⦿ stopping services for maintenance
⦿ routine improvements
⦿ network and infrastructure outages
⦿ unforeseen circumstances
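
A worked example may help. Assume, purely for illustration, an availability SLO of 99.9% measured over a 30-day window (neither figure comes from this deck):

```latex
\begin{align*}
\text{Error budget} &= 1 - \text{SLO} = 1 - 0.999 = 0.001 \\
\text{Window} &= 30 \times 24 \times 60 = 43\,200 \text{ minutes} \\
\text{Allowed downtime} &= 0.001 \times 43\,200 = 43.2 \text{ minutes}
\end{align*}
```

Under these assumptions, roughly 43 minutes of unavailability per window can be spent on releases, maintenance and unexpected failures before the SLO is breached.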

Toil & Toil Budget
● Toil
⦿ Kind of work tied to running a production service that tends to be manual, repetitive,
automatable, tactical, devoid of enduring value, and that scales linearly as a service
grows.
⦿ The SRE discipline focuses on reducing toil as much as possible.
● Toil budget
⦿ A measure of acceptable toil

Cron jobs
apiVersion: batch/v1
kind: CronJob
metadata:
  name: expenserpt
spec:
  # Run at 00:00 on the 1st day of every month
  schedule: "0 0 1 * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: report
            image: expenserpt
            imagePullPolicy: IfNotPresent
          # Restart the container in place if the job fails
          restartPolicy: OnFailure

Cost management
● Use tools provided by cloud platforms
● Set proper cost thresholds
● Resource audit & cost analysis reports
● Set up a cost management team &
weekly reviews
● Kubecost
⦿ Provides real-time cost visibility and
insights for teams using Kubernetes
⦿ Helps to continuously reduce cloud
costs

Anti-fragility
● Improving resilience using fire drills, chaos monkey, security and automation
● Kubernetes liveness & readiness probes can be used for health checks
● Kubernetes secret management for sensitive data using Secrets and CSI
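
A minimal sketch of the two probe types on a single container follows. The Pod name, image, port and the `/healthz` and `/ready` paths are hypothetical, and the timings are illustrative rather than recommended values.

```yaml
# Hypothetical Pod; the endpoints are assumed to be served by the application.
apiVersion: v1
kind: Pod
metadata:
  name: expenserpt-api
spec:
  containers:
  - name: api
    image: expenserpt-api       # hypothetical image
    ports:
    - containerPort: 8080
    livenessProbe:              # on repeated failure the kubelet restarts the container
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:             # while failing, the Pod is removed from Service endpoints
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
```

Keeping the two probes separate lets a slow-starting or temporarily overloaded instance stop receiving traffic without being restarted.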

Security
● Threat modeling using methodologies such as STRIDE
● Scan code repos using tools such as Checkov
● Security specialists - DevSecOps
● Security Operations Center (SOC)
● Kubernetes
⦿ Service Accounts, roles & role bindings
⦿ Network Policies
⦿ Cluster and namespace level isolation
⦿ mTLS enforcement via service meshes
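
The first two Kubernetes items above might look like the following sketch: a read-only Role bound to a service account, and a default-deny ingress policy for a namespace. All names and the `finance` namespace are hypothetical.

```yaml
# Hypothetical names throughout; adjust to the actual workloads.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: finance
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]   # read-only access to Pods
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: report-reads-pods
  namespace: finance
subjects:
- kind: ServiceAccount
  name: report-sa
  namespace: finance
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: finance
spec:
  podSelector: {}     # selects every Pod in the namespace
  policyTypes:
  - Ingress           # no ingress rules listed, so all inbound traffic is denied
```

Network policies only take effect when the cluster's network plugin enforces them. Specific allow rules would then be layered on top of the default deny.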

Business Continuity & Disaster Recovery
● BCP is the process involved in creating a system of prevention and recovery
from potential threats to a company
● What is a disaster?
⦿ An unforeseen event that could potentially put the organization at risk by interfering
with operations
● Ideally there should be BC plans for all functions of the company which are
amalgamated into a single corporate BC plan

How can your organization adopt SRE?
● Start small & evolve
● Analyze existing team structures/processes and see how they can be adapted
● Recruiting experienced SREs can be hard
⦿ Dev2SRE program
● On the job training
● Certifications are important
⦿ CKAD, CKA, CKS
⦿ Cloud platform certifications - Azure, AWS, GCP etc.
⦿ “Well architected” programs
● Maintain a central knowledge base - document everything
● Define standards, conventions & best practices and ensure that those are
followed
● Define and continuously improve processes
● Work closely with development teams. Engage with all stakeholders.
● Get standards certifications/reports - SOC2, ISO 27001, HIPAA, HITRUST etc.

TL;DR
● Kubernetes & even app development are just the tip of the iceberg in your
organization’s overall SRE & cloud native story
● Establishment of the SRE discipline is essential for running seamless
operations
● Start small, adapt & evolve
