SRE & Kubernetes
February, 2022
Hello!
Afkham Azeez
VP & Deputy CTO - Cloud
azeez@wso2.com
Off-roading, camping, birding & nature enthusiast
Amateur radio operator - 4S7AZE
/afkham_azeez /afkhamazeez
Software development in 2020 and beyond…
A paradigm shift
● Major changes to how software is designed & built are taking place
● Businesses have realized that they have to build digital experiences
● Building a ‘Digitally-driven Business’ takes time and significant engineering
effort
Cloud-native software engineering
Building for the Cloud, on the Cloud
● Start building your product on the cloud
⦿ Have your dev environment on the cloud
● Multi-environment on the cloud
⦿ dev, test, staging, prod
● Leverage cloud services and APIs
⦿ Don’t run everything yourself
● Containers & Kubernetes are game changers
With great power comes great complexity!
What is Kubernetes?
● A cluster operating system
● A collection of control loops
https://buttondown.email/nelhage/archive/two-reasons-kubernetes-is-so-complex/
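The "collection of control loops" idea above can be illustrated with a minimal sketch. It is not how Kubernetes is implemented; `State`, `reconcile`, `apply_action` and `control_loop` are hypothetical names, and the only thing being reconciled here is a replica count.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class State:
    """Hypothetical cluster state: just a replica count."""
    replicas: int


def reconcile(desired: State, observed: State) -> List[str]:
    """One pass of a control loop: return the actions needed to move the
    observed state toward the desired state (empty list means converged)."""
    diff = desired.replicas - observed.replicas
    if diff > 0:
        return ["start_pod"] * diff
    if diff < 0:
        return ["stop_pod"] * (-diff)
    return []


def apply_action(observed: State, action: str) -> State:
    """Simulate the effect of a single action on the observed state."""
    step = 1 if action == "start_pod" else -1
    return State(replicas=observed.replicas + step)


def control_loop(desired: State, observed: State, max_passes: int = 5) -> State:
    """Repeat observe -> diff -> act until nothing is left to do."""
    for _ in range(max_passes):
        actions = reconcile(desired, observed)
        if not actions:
            break
        for action in actions:
            observed = apply_action(observed, action)
    return observed
```

The loop never needs to know how the system drifted; it only compares the desired and observed states and acts on the difference.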
Infrastructure as Code (IaC)
● The process of managing and provisioning computer data centers through
machine-readable definition files, rather than physical hardware configuration or
interactive configuration tools
● Everything is code
⦿ Cluster creation
⦿ Creating workloads
⦿ System configuration
⦿ Security
⦿ etc.
Site Reliability Engineering
● SRE is an approach taken to solve IT Operations challenges using Software
Engineering principles.
● SREs use software as a tool to manage cloud systems, diagnose problems, and
automate tasks.
● A key role of SRE is to find the right balance between releasing new features
and ensuring they are reliable:
⦿ Dev teams want to deploy as many features as possible as soon as possible
⦿ SRE tries to facilitate the dev team’s goals while ensuring reliability
● What is reliability?
⦿ Minimizing the impact on end users by minimizing outages
What do SREs do?
● Define compliance standards & processes
● Write cluster/system setup code
● Define build pipelines & help dev teams set up pipelines
● Set up monitoring and alerting (code)
● Plan backup and recovery
● Plan DR strategy
● Threat modeling & security scanning
● Incident management
● Chaos engineering
● Root cause analysis
● Perform routine tasks
● Cost analysis & optimization
Core Concepts & Methodologies
CI/CD and GitOps
● Git repos as the single, central source of truth for the current cluster
configuration
● Use standard git practices
⦿ fork -> branch -> change -> build -> send PR -> CI -> review -> merge -> CD
Logging
[Diagram: a Kubernetes Cluster with Node 1, Node 2 and Node 3, each running omsagent, plus omsagent-rs; Log Analytics; Data Explorer]
https://docs.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-log-query
Logging + analytics + alerting
Log publishing -> Analytics -> Issue or anomaly detection -> Alerting -> Incident management
Observability, Monitoring & Alerting
● Observability vs monitoring - monitoring is what you do after a system is
observable
● System level monitoring
⦿ Cluster, pod, node health
⦿ System level services/APIs health - includes errors & latencies
⦿ System logs
⦿ Intrusion detection
⦿ DoS
● Application level monitoring
⦿ Application level services/APIs health - includes errors & latencies
⦿ Internal application level observability
⦿ Application logs
Incident Management
An incident is an unplanned interruption to, or a reduction in the quality of,
an IT service
Normal incident management process
Major incident management process
SLI, SLO, SLA
● SLI
⦿ Metrics used to measure the level of service provided to end-users (e.g., availability,
latency, throughput)
● SLO
⦿ Targeted levels of service, measured by SLIs
⦿ Typically expressed as a percentage over a period of time
⦿ Help you figure out the right balance between product innovation and reliability
● SLA
⦿ Contractual agreements that outline the level of service end users can expect
⦿ If these promises are not met, there can be significant consequences for the provider,
which are often financial in nature
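As a rough illustration of how an SLI is measured against an SLO, here is a minimal sketch. It assumes a request-based availability SLI (good requests divided by total requests); the function names and the figures in the usage note are hypothetical, not taken from any real service.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: the fraction of requests served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target
```

For example, 999,500 good requests out of 1,000,000 is an SLI of 0.9995, which meets a 99.9% SLO but misses a 99.99% SLO.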
Error Budget
● Error budget = 1 - SLO
● The acceptable level of unreliability for a service before it falls out of compliance
with an SLO
● A measure of the risk you can take; it is spent on
⦿ getting new features in
⦿ stopping services for maintenance
⦿ routine improvements
⦿ network and infrastructure outages
⦿ unforeseen circumstances
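To make "Error budget = 1 - SLO" concrete, the sketch below converts an availability SLO into allowed downtime over a window. It assumes a time-based SLO; the function names are hypothetical. With a 99.9% SLO over 30 days, the budget works out to roughly 43.2 minutes.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed downtime in the window: (1 - SLO) * window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    return window * (1.0 - slo)


def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Budget left after the downtime already incurred in the window.
    A negative result means the SLO has been breached."""
    return error_budget(slo, window) - downtime
```

A negative `budget_remaining` is the usual signal to slow down feature releases and spend time on reliability work instead.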
Toil & Toil Budget
● Toil
⦿ Kind of work tied to running a production service that tends to be manual, repetitive,
automatable, tactical, devoid of enduring value, and that scales linearly as a service
grows.
⦿ The SRE discipline focuses on reducing toil as much as possible.
● Toil budget
⦿ A measure of acceptable toil
Cron jobs
apiVersion: batch/v1
kind: CronJob
metadata:
  name: expenserpt
spec:
  # Runs at 00:00 on the 1st day of every month
  schedule: "0 0 1 * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: report
            image: expenserpt
            imagePullPolicy: IfNotPresent
          # Restart the container in place if the job fails
          restartPolicy: OnFailure
Cost management
● Use tools provided by cloud platforms
● Set proper cost thresholds
● Resource audit & cost analysis reports
● Set up a cost management team &
weekly reviews
● Kubecost
⦿ Provides real-time cost visibility and
insights for teams using Kubernetes
⦿ Helps to continuously reduce cloud
costs
Anti-fragility
● Improving resilience using fire drills, Chaos Monkey, security and automation
● Kubernetes liveness & readiness probes can be used for health checks
● Kubernetes secret management for sensitive data using Secrets and CSI
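On the application side, liveness and readiness probes are often backed by simple HTTP endpoints. The sketch below is one possible shape, not a prescribed one: the `/healthz` and `/readyz` paths, the `READY` flag and the default port are all assumptions. A Kubernetes `httpGet` probe counts any status code from 200 to 399 as success.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical readiness flag: set to True once startup work (cache warm-up,
# database connections, and so on) has finished.
READY = {"value": False}


def probe_status(path: str, ready: bool) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/healthz":   # liveness: the process is up and responding
        return 200
    if path == "/readyz":    # readiness: only OK once the app can take traffic
        return 200 if ready else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path, READY["value"]))
        self.end_headers()


def make_server(port: int = 8080) -> HTTPServer:
    """Build (but do not start) the probe server; call serve_forever() to run it."""
    return HTTPServer(("", port), ProbeHandler)
```

Keeping the two endpoints separate lets Kubernetes hold traffic back from a pod that is alive but not yet ready, without restarting it.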
Security
● Threat modeling using methodologies such as STRIDE
● Scan code repos using tools such as Checkov
● Security specialists - DevSecOps
● Security Operations Center (SOC)
● Kubernetes
⦿ Service Accounts, roles & role bindings
⦿ Network Policies
⦿ Cluster and namespace level isolation
⦿ mTLS enforcement via service meshes
Business Continuity & Disaster Recovery
● BCP is the process involved in creating a system of prevention and recovery
from potential threats to a company
● What is a disaster?
⦿ An unforeseen event that could potentially put the organization at risk by interfering
with operations
● Ideally there should be BC plans for all functions of the company which are
amalgamated into a single corporate BC plan
Adopting SRE
A way of structuring teams
How can your organization adopt SRE?
● Start small & evolve
● Analyze existing team structures/processes and see how they can be adapted
● Recruiting experienced SREs can be hard
⦿ Dev2SRE program
● On the job training
● Certifications are important
⦿ CKAD, CKA, CKS
⦿ Cloud platform certifications - Azure, AWS, GCP etc.
⦿ “Well architected” programs
● Maintain a central knowledge base - document everything
● Define standards, conventions & best practices and ensure that those are
followed
● Define and continuously improve processes
● Work closely with development teams. Engage with all stakeholders.
● Get standards certifications/reports - SOC 2, ISO 27001, HIPAA, HITRUST etc.
TL;DR
● Kubernetes & even app development are just the tip of the iceberg in your
organization’s overall SRE & cloud native story
● Establishment of the SRE discipline is essential for running seamless
operations
● Start small, adapt & evolve
Question Time!
wso2.com
Thanks!
