The document discusses site reliability engineering (SRE) practices for managing Kubernetes clusters. It describes how SRE teams use infrastructure as code, continuous integration/delivery (CI/CD), monitoring, logging, incident response processes, and other methodologies to ensure reliability and reduce toil. The document recommends that organizations adopt SRE practices gradually by starting small, defining standards, and working closely with development teams.
4. Software development in 2020 and beyond…
A paradigm shift
● Major changes to how software is designed & built are taking place
● Businesses have realized that they have to build digital experiences
● Building a ‘Digitally-driven Business’ takes time and significant engineering
effort
5. Cloud-native software engineering
Building for the Cloud, on the Cloud
● Start building your product on the cloud
⦿ Have your dev environment on the cloud
● Multi-environment on the cloud
⦿ dev, test, staging, prod
● Leverage cloud services and APIs
⦿ Don’t run everything yourself
● Containers & Kubernetes are game changers
7. What is Kubernetes?
● A cluster operating system
● A collection of control loops
https://buttondown.email/nelhage/archive/two-reasons-kubernetes-is-so-complex/
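To make the "collection of control loops" idea concrete, here is a minimal sketch of a reconcile loop in Python. It is not Kubernetes code: the function names, the replica-count dictionaries, and the one-replica-per-pass step are all assumptions made for illustration. The point is only the shape of the loop: compare desired state with actual state, nudge actual toward desired, repeat until nothing changes.

```python
def reconcile_once(desired: dict[str, int], actual: dict[str, int]) -> bool:
    """Move every workload one replica closer to its desired count.

    Returns True if anything changed during this pass.
    (Hypothetical model: state is just workload name -> replica count.)
    """
    changed = False
    for name in set(desired) | set(actual):
        want = desired.get(name, 0)
        have = actual.get(name, 0)
        if have < want:
            actual[name] = have + 1
            changed = True
        elif have > want:
            actual[name] = have - 1
            changed = True
        # Workloads that are no longer declared are removed once they reach zero.
        if name not in desired and actual.get(name) == 0:
            del actual[name]
    return changed


def control_loop(desired: dict[str, int], actual: dict[str, int], max_passes: int = 100) -> int:
    """Run reconcile passes until the state converges; return the number of passes that changed something."""
    passes = 0
    while passes < max_passes and reconcile_once(desired, actual):
        passes += 1
    return passes


desired_state = {"web": 3, "worker": 1}
actual_state = {"web": 1, "old-job": 2}
passes_needed = control_loop(desired_state, actual_state)
```

After the loop finishes, `actual_state` matches `desired_state`. Nobody issued an explicit "scale up" or "delete" command; the loop derived those actions from the difference between the two states.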
8. IaC
● The process of managing and provisioning computer data centers through
machine-readable definition files, rather than physical hardware configuration or
interactive configuration tools
● Everything is code
⦿ Cluster creation
⦿ Creating workloads
⦿ System configuration
⦿ Security
⦿ etc.
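As a small illustration of "everything is code", the sketch below builds a Kubernetes Namespace definition in Python and renders it as JSON, which can then be kept in version control. The function names, the `team` label, and the example values are assumptions for illustration; real setups more commonly use dedicated IaC tools.

```python
import json


def namespace_manifest(name: str, team: str) -> dict:
    """Return a Namespace definition as plain data (the 'team' label is a hypothetical convention)."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": name, "labels": {"team": team}},
    }


def render(manifest: dict) -> str:
    """Serialize the definition deterministically so diffs in version control stay small."""
    return json.dumps(manifest, indent=2, sort_keys=True)


staging_json = render(namespace_manifest("staging", "payments"))
```

Because the definition is ordinary data produced by ordinary code, it can be reviewed, tested, and reproduced like any other change.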
9. Site Reliability Engineering
● SRE is an approach taken to solve IT Operations challenges using Software
Engineering principles.
● SREs use software as a tool to manage cloud systems, diagnose problems, and
automate tasks.
● A key role of SRE is to find the right balance between releasing new features
and ensuring they are reliable:
⦿ Dev teams want to deploy as many features as possible as soon as possible
⦿ SRE tries to facilitate the dev team’s goals while ensuring reliability
● What is reliability?
⦿ Minimizing the impact on end users by minimizing outages
10. What do SREs do?
● Define compliance standards & processes
● Write cluster/system setup code
● Define build pipelines & help dev teams set up pipelines
● Set up monitoring and alerting (code)
● Plan backup and recovery
● Plan DR strategy
● Threat modeling & security scanning
● Incident management
● Chaos engineering
● Root cause analysis
● Perform routine tasks
● Cost analysis & optimization
12. CICD and GitOps
● Git repos as the single, central source of truth for the current cluster
configuration
● Use standard git practices
⦿ fork -> branch -> change -> build -> send PR -> CI -> review -> merge -> CD
16. Observability, Monitoring & Alerting
● Observability vs monitoring - monitoring is what you do after a system is
observable
● System level monitoring
⦿ Cluster, pod, node health
⦿ System level services/APIs health - includes errors & latencies
⦿ System logs
⦿ Intrusion detection
⦿ DoS
● Application level monitoring
⦿ Application level services/APIs health - includes errors & latencies
⦿ Internal application level observability
⦿ Application logs
20. SLI, SLO, SLA
● SLI
⦿ Metrics used to measure the level of service provided to end-users (e.g., availability,
latency, throughput)
● SLO
⦿ Targeted levels of service, measured by SLIs
⦿ Typically expressed as a percentage over a period of time
⦿ Help you figure out the right balance between product innovation and reliability
● SLA
⦿ Contractual agreements that outline the level of service end users can expect
⦿ If these promises are not met, there can be significant consequences for the provider,
which are often financial in nature
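A minimal sketch of how an SLI relates to an SLO, assuming availability is measured as the share of successful requests. The class name, function names, request counts, and targets below are illustrative assumptions, not values from this deck.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    """Request totals observed over the SLO period (hypothetical numbers in the example)."""
    total: int
    successful: int


def availability_sli(counts: RequestCounts) -> float:
    """SLI: fraction of requests that succeeded."""
    if counts.total == 0:
        # Assumption: a period with no traffic is treated as fully available.
        return 1.0
    return counts.successful / counts.total


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO check: the measured SLI must be at or above the target."""
    return sli >= slo_target


month = RequestCounts(total=100_000, successful=99_950)
sli = availability_sli(month)
```

With these example counts the SLI is 0.9995, so a 99.9% SLO is met while a 99.99% SLO is not. An SLA would then attach contractual consequences to missing an agreed target.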
21. Error Budget
● Error budget = 1-SLO
● Acceptable levels of unreliability for a service before it falls out of compliance
with an SLO
● Measure of the risk you can take on for
⦿ shipping new features
⦿ stopping services for maintenance
⦿ routine improvements
⦿ network and infrastructure outages
⦿ unforeseen circumstances
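A worked example of the error budget formula, as a rough Python sketch. The 99.9% target, the 30-day window, and the function names are assumptions chosen for illustration.

```python
def error_budget_fraction(slo: float) -> float:
    """Error budget = 1 - SLO, with the SLO given as a fraction (0.999 for 99.9%)."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int) -> float:
    """Translate the budget into minutes of full outage over the window."""
    window_minutes = window_days * 24 * 60
    return error_budget_fraction(slo) * window_minutes


def budget_remaining_minutes(slo: float, window_days: int, downtime_so_far: float) -> float:
    """Minutes of budget left after the downtime already spent in this window."""
    return allowed_downtime_minutes(slo, window_days) - downtime_so_far


monthly_budget = allowed_downtime_minutes(0.999, 30)
left_after_incident = budget_remaining_minutes(0.999, 30, downtime_so_far=10.0)
```

For a 99.9% SLO over 30 days the budget works out to roughly 43.2 minutes. A 10-minute incident leaves about 33.2 minutes to spend on releases, maintenance, and surprises for the rest of the window.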
22. Toil & Toil Budget
● Toil
⦿ Kind of work tied to running a production service that tends to be manual, repetitive,
automatable, tactical, devoid of enduring value, and that scales linearly as a service
grows.
⦿ The SRE discipline focuses on reducing toil as much as possible.
● Toil budget
⦿ A measure of acceptable toil
24. Cost management
● Use tools provided by cloud platforms
● Set proper cost thresholds
● Resource audit & cost analysis reports
● Set up a cost management team &
weekly reviews
● Kubecost
⦿ Provides real-time cost visibility and
insights for teams using Kubernetes
⦿ Helps to continuously reduce cloud
costs
25. Anti-fragility
● Improving resilience using fire drills, Chaos Monkey, security and automation
● Kubernetes liveness & readiness probes can be used for health checks
● Kubernetes secret management for sensitive data using Secrets and CSI
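On the application side, liveness and readiness probes are often served as plain HTTP endpoints. The sketch below uses only the Python standard library. The paths `/healthz` and `/readyz`, the `AppState` flag, and port 8080 are assumptions for illustration. Kubernetes HTTP probes treat any status code from 200 up to (but not including) 400 as success, so returning 503 signals "not ready".

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class AppState:
    """Hypothetical process-wide flag, flipped to True once startup work is done."""
    ready = False


def probe_status(path: str, ready: bool) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/healthz":
        # Liveness: the process is up and able to answer.
        return 200
    if path == "/readyz":
        # Readiness: only accept traffic once dependencies are initialized.
        return 200 if ready else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status = probe_status(self.path, AppState.ready)
        self.send_response(status)
        self.end_headers()
        self.wfile.write(b"ok\n" if status == 200 else b"unavailable\n")


def serve(port: int = 8080) -> None:
    """Start the probe server; this call blocks until the process is stopped."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

Calling `serve()` starts the endpoints. Keeping the two checks separate matters: a failing liveness probe leads to a container restart, while a failing readiness probe only removes the pod from service endpoints until it recovers.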
26. Security
● Threat modeling using methodologies such as STRIDE
● Scan code repos using tools such as Checkov
● Security specialists - DevSecOps
● Security Operations Center (SOC)
● Kubernetes
⦿ Service Accounts, roles & role bindings
⦿ Network Policies
⦿ Cluster and namespace level isolation
⦿ mTLS enforcement via service meshes
27. Business Continuity & Disaster Recovery
● Business continuity planning (BCP) is the process of creating a system of
prevention and recovery from potential threats to a company
● What is a disaster?
⦿ An unforeseen event that could potentially put the organization at risk by interfering
with operations
● Ideally there should be BC plans for all functions of the company which are
amalgamated into a single corporate BC plan
30. How can your organization adopt SRE?
● Start small & evolve
● Analyze existing team structures/processes and see how they can be adapted
● Recruiting experienced SREs can be hard
⦿ Dev2SRE program
● On the job training
● Certifications are important
⦿ CKAD, CKA, CKS
⦿ Cloud platform certifications - Azure, AWS, GCP etc.
⦿ “Well architected” programs
● Maintain a central knowledge base - document everything
● Define standards, conventions & best practices and ensure that those are
followed
● Define and continuously improve processes
● Work closely with development teams. Engage with all stakeholders.
● Get standards certifications/reports - SOC2, ISO 27001, HIPAA, HITRUST etc.
31. TL;DR
● Kubernetes & even app development are just the tip of the iceberg in your
organization’s overall SRE & cloud native story
● Establishment of the SRE discipline is essential for running seamless
operations
● Start small, adapt & evolve