Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa

Site Reliability Engineering
Presenter Name: Keet Malin Sugathadasa
Designation: Associate Technical Lead

Presented By
Keet Malin Sugathadasa
Associate Tech Lead at Cognite
More than 3 years of experience in various roles related to Software Engineering
Contributor to NPM and Stack Overflow
Research Interests: Cyber Security, Cloud Computing, Distributed Computing

AGENDA
• What is Site Reliability Engineering (SRE)
• The 5 Pillars of SRE
• SLOs, SLIs, SLAs
• Error Budgets
• Toil
• Ensuring successful operation of a production system

What is DevOps
Just as Agile came in to close the gap between BA & Dev, DevOps closes the gap between Dev & Ops

What is SRE?
• DevOps is a community-built set of practices and a culture,
• while SRE was developed inside Google as its secret sauce.

Reduce Silos
• SRE teams share ownership of production with developers
• SRE teams get involved in development at very early stages
• But products may not start with SRE support at first. When a product is onboarded, the following items get checked:
• System architecture and interservice dependencies
• Instrumentation, metrics, and monitoring
• Emergency response
• Capacity planning
• Change management
• Performance: availability, latency, and efficiency

Blameless Postmortems
• When things have actually gone bazooka, whose fault is it?
• Answer: Nobody’s. It's the system’s fault. It allowed people to act that way!
• Ask WHY, not WHO!
If nobody is blamed, people open up, and then the chain of root causes comes to light.

Agility [Devs] vs Stability [Ops]
• What is availability?
• Clear definitions
• How available do you want to be?
• Clear numerical indicators
• What to do when availability is not met?

SLI - SLO - SLA : Service Level what?
Service Level Indicator (SLI): A metric aggregated over time (e.g. 90th percentile, median)
• Batch throughput
• Failures per request
• Is the ratio of errors to the total number of requests received in the last 5 minutes < 1%?
• Request latency
• Is the average latency of requests in the last 5 minutes < 300ms?
• Is the 90th percentile of request latency in the last 5 minutes < 300ms?
Service Level Objective (SLO): The target value the SLI needs to meet
• Is the above indicator YES 99.9% of the time?
• Monitor the SLIs over a long period before deciding this
Service Level Agreement (SLA): A legal agreement
• The level of reliability I promise, and what I will do if I do not meet it
• Usually based on SLOs, but it is a business agreement
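
The SLI questions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the nearest-rank percentile method, and the sample latencies are hypothetical and not taken from the talk.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 1 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_ratio_ok(errors, total, threshold=0.01):
    """SLI check: is the error ratio in the window below the threshold (1%)?"""
    return total == 0 or errors / total < threshold


def latency_ok(latencies_ms, pct=90, threshold_ms=300):
    """SLI check: is the given latency percentile below the threshold (300 ms)?"""
    return percentile(latencies_ms, pct) < threshold_ms


def slo_met(window_checks, target=0.999):
    """SLO check: was the SLI answer YES in at least `target` of the windows?"""
    return sum(window_checks) / len(window_checks) >= target


# Hypothetical request latencies (ms) collected in one 5-minute window
window = [120, 150, 180, 200, 220, 250, 260, 280, 290, 900]
print(mean(window) < 300)   # average-latency check
print(latency_ok(window))   # 90th-percentile check
```

Each 5-minute window yields one YES/NO answer; `slo_met` then asks how often the answer was YES over a longer period.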

Risk and availability
• 100% availability is impossible.
• Each 9 you add to the SLO increases your cost
• With each 9 you add, you lose some comfort
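
To make the price of each extra 9 concrete, the Python sketch below converts an SLO percentage into the downtime it allows per year. The function name and the 365-day year are assumptions made for illustration.

```python
def allowed_downtime_minutes(slo_percent, period_days=365):
    """Minutes of unavailability permitted by an SLO over the period."""
    return (1 - slo_percent / 100) * period_days * 24 * 60


for slo in (99.0, 99.9, 99.99, 99.999):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} minutes per year")
```

Under these assumptions, 99.9% allows about 525.6 minutes (under 9 hours) per year, while 99.99% allows only about 52.6 minutes.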

Error Budgets
• Once you decide the SLO, you get X minutes during which you may be unavailable.
• X is your Error Budget
• If you use up that budget, you cannot release new features until it recovers
• Under AND over spending are both bad.
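
A minimal Python sketch of the error budget idea follows. It assumes a 30-day rolling window and a simple "freeze releases when the budget is spent" policy; the function names and the window length are hypothetical choices, not something the talk specifies.

```python
def error_budget_minutes(slo, window_days=30):
    """Total minutes of allowed unavailability for an SLO given as a fraction."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Minutes of error budget left after the downtime already consumed."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


def can_release(slo, downtime_minutes, window_days=30):
    """Assumed policy: new features ship only while budget remains."""
    return budget_remaining(slo, downtime_minutes, window_days) > 0


print(round(error_budget_minutes(0.999), 1))   # budget for a 99.9% SLO
print(can_release(0.999, downtime_minutes=10))
print(can_release(0.999, downtime_minutes=50))
```

With a 99.9% SLO over 30 days the budget is about 43.2 minutes, so 10 minutes of downtime still leaves room to release, while 50 minutes does not.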

Gradual change
• Updates should be pushed as canaries, not as bulk version changes
• Smaller code changes mean a shorter mean time to recover on failure
• The rate of change depends on the chosen SLO
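
One way to picture a canary rollout is sketched below in Python. It is an assumption-laden illustration, not a description of any particular deployment tool: the hash-based routing, the function names, and the 0.5% tolerance are all hypothetical.

```python
import hashlib


def routes_to_canary(user_id, canary_percent):
    """Deterministically send roughly canary_percent% of users to the canary."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0..99
    return bucket < canary_percent


def canary_decision(baseline_error_rate, canary_error_rate, tolerance=0.005):
    """Promote the canary only if it is not noticeably worse than the baseline."""
    if canary_error_rate <= baseline_error_rate + tolerance:
        return "promote"
    return "rollback"


print(routes_to_canary("user-42", canary_percent=5))
print(canary_decision(baseline_error_rate=0.010, canary_error_rate=0.012))
print(canary_decision(baseline_error_rate=0.010, canary_error_rate=0.050))
```

Because only a small share of users sees the new version, a bad change consumes little error budget before it is rolled back.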

Toil
Toil is the manual, repetitive work tied to running in PROD (which can be automated)

Toil & Toil budget
SREs actively measure Toil. The toil budget should be around 30% to 50%.
If toil is not kept within its limits, it easily fills up to 100%.
But a small amount of toil is not harmful.
• Automation might be harder than the manual work
• Helps newcomers to orient themselves
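
Measuring toil can be as simple as tracking the share of working hours spent on it. The Python sketch below assumes hours are logged per person per week and uses 50% as the upper limit; both the function names and the sample numbers are hypothetical.

```python
def toil_fraction(toil_hours, total_hours):
    """Share of working time spent on manual, repetitive production work."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def over_toil_budget(toil_hours, total_hours, budget=0.5):
    """True when toil exceeds the agreed budget (assumed here to be 50%)."""
    return toil_fraction(toil_hours, total_hours) > budget


print(toil_fraction(12, 40))       # 12 toil hours in a 40-hour week
print(over_toil_budget(12, 40))
print(over_toil_budget(26, 40))
```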

Measuring
Service reliability needs to be measured
• Uptime
• Mean time to failure
• Mean time to recover
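
The three measures above can be derived from a list of outage durations. The Python sketch below is a simplification under stated assumptions: mean time to failure is approximated as total uptime divided by the number of failures, and the function name and sample data are hypothetical.

```python
def reliability_summary(period_hours, outage_hours):
    """Compute uptime ratio, MTTF and MTTR from outage durations (in hours)."""
    downtime = sum(outage_hours)
    uptime = period_hours - downtime
    failures = len(outage_hours)
    return {
        "uptime_ratio": uptime / period_hours,
        # Approximation: average running time per failure
        "mttf_hours": uptime / failures if failures else None,
        # Average time taken to recover from a failure
        "mttr_hours": downtime / failures if failures else None,
    }


# Hypothetical 30-day period (720 hours) with three outages
print(reliability_summary(720, [1, 2, 3]))
```

With these definitions the uptime ratio equals MTTF / (MTTF + MTTR), which ties the three numbers together.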

WhatsApp (Example Use Case)
• Message Delivery Time
• Message Throughput
• Image Resolution (Compression Algorithm)
• Video Compression Quality
• Etc.
