Site Reliability Engineering
Presenter Name: Keet Malin Sugathadasa
Designation: Associate Technical Lead
Presented By
Keet Malin Sugathadasa
Associate Tech Lead at Cognite
More than 3 years of experience in
various roles related to Software
Engineering
Contributor to NPM and
Stackoverflow
Research Interests – Cyber
Security, Cloud Computing,
Distributed Computing.
AGENDA
• What is Site Reliability Engineering (SRE)
• The 5 Pillars of SRE
• SLOs, SLIs, SLAs
• Error Budgets
• Toil
• Ensuring successful operation of a
production system
What is DevOps
Just as Agile came in to remove the gap between BA &
Dev, DevOps removes the gap between Dev & Ops
What is SRE?
• DevOps is a community-built set of practices and a culture,
• while SRE was developed inside Google as its secret sauce.
Reduce Organizational Silos
• SRE teams share ownership of production with
developers
• SRE teams get involved in development at very early
stages
• But products may not start with SRE support at first.
When onboarding, the following items get checked:
• System architecture and interservice dependencies
• Instrumentation, metrics, and monitoring
• Emergency response
• Capacity planning
• Change management
• Performance: availability, latency, and efficiency
Reduce Silos
Accept Failure as Normal
Blameless Postmortems
• When things have actually gone bazooka,
whose fault is it?
• Answer: Nobody’s. It's the system’s fault.
It allowed people to act that way!
• Ask WHY not WHO!
If nobody is blamed, people open up, and
then the root cause cascade opens up.
Agility[Devs] vs Stability[Ops]
• What is availability?
• Clear definitions
• How available do you want to be?
• Clear numerical indicators
• What to do when availability is
not met?
SLI - SLO - SLA : Service Level what?
Service Level Indicator: A metric aggregated over time (e.g. 90th percentile, median)
• Batch throughput
• Failures per request
• Is the ratio of errors to the total number of requests received in the last 5 minutes < 1%?
• Request latency
• Is the average latency of requests in the last 5 minutes < 300ms?
• Is the 90th percentile of the latency of requests in the last 5 minutes < 300ms?
Service Level Objective: The target the SLI needs to meet
• Is the above indicator YES 99.9% of the time?
• Monitor the SLIs over a long time and decide this
Service Level Agreement: A legal agreement
• The level of reliability I promise, and what I will do if I do not deliver it
• Usually based on SLOs, but it is a business agreement
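The SLI questions and the SLO check above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Window` type, the function names, and the nearest-rank percentile method are hypothetical choices, not part of the talk. The 1%, 300ms and 99.9% thresholds are the ones from the slide.

```python
import math
from dataclasses import dataclass


@dataclass
class Window:
    """Requests observed in one 5-minute window (hypothetical shape)."""
    total: int
    errors: int
    latencies_ms: list


def error_ratio(w: Window) -> float:
    """SLI: ratio of errors to total requests in the window."""
    return w.errors / w.total if w.total else 0.0


def percentile(values, pct: int) -> float:
    """Nearest-rank percentile; one of several common definitions."""
    if not values:
        raise ValueError("no values to aggregate")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def sli_good(w: Window, max_error_ratio: float = 0.01,
             max_p90_ms: float = 300.0) -> bool:
    """Answers the slide's yes/no questions for a single window."""
    return (error_ratio(w) < max_error_ratio
            and percentile(w.latencies_ms, 90) < max_p90_ms)


def slo_met(windows, target: float = 0.999) -> bool:
    """SLO: the indicator is YES in at least `target` of all windows."""
    good = sum(1 for w in windows if sli_good(w))
    return good / len(windows) >= target
```

An SLA would sit on top of this as a contract: it names a threshold (usually looser than the internal SLO) and the compensation owed when that threshold is missed.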
Risk and availability
• 100% availability is impossible.
• Each 9 you add to the SLO
increases your cost
• Each 9 you add leaves you less room
for comfort
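To make the cost of each extra 9 concrete, here is a small arithmetic sketch. The function name and the 30-day window are assumptions chosen for illustration; the talk does not fix a measurement window.

```python
def allowed_downtime_minutes(slo_percent: float, days: int = 30) -> float:
    """Minutes of unavailability an availability target tolerates per window."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    # Each additional 9 cuts the tolerated downtime by a factor of ten.
    for slo in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(slo)
        print(f"{slo}% -> {minutes:.2f} minutes per 30 days")
```

Over 30 days, going from 99% to 99.9% shrinks the tolerated downtime from roughly 7 hours to roughly 43 minutes.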
Error Budgets
• Once you decide the SLO, you get X minutes during which the service may be unavailable.
• X is your Error Budget
• If you exhaust that budget, you can no longer release new features
• Under- AND over-spending are both bad.
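A minimal sketch of an error budget used as a release gate follows. The `ErrorBudget` class and its field names are hypothetical; the only rules taken from the slide are that the budget is derived from the SLO and that releases stop once it is used up.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo: float                    # target availability, e.g. 0.999
    window_minutes: float         # measurement window, e.g. 30 * 24 * 60
    downtime_minutes: float = 0.0  # unavailability recorded so far

    @property
    def budget_minutes(self) -> float:
        """X: the minutes the service may be unavailable in the window."""
        return (1.0 - self.slo) * self.window_minutes

    @property
    def remaining_minutes(self) -> float:
        return self.budget_minutes - self.downtime_minutes

    @property
    def spent_fraction(self) -> float:
        """Near 0 suggests under-spending; 1 or more means the budget is gone."""
        return self.downtime_minutes / self.budget_minutes

    def record_outage(self, minutes: float) -> None:
        self.downtime_minutes += minutes

    def can_release(self) -> bool:
        """Feature releases are allowed only while budget remains."""
        return self.remaining_minutes > 0
```

A budget that is never touched is also a signal: the team could probably afford to ship changes faster.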
Implement Gradual Change
Gradual change
• Updates should be pushed as canaries, not as bulk version changes
• Smaller code changes mean a shorter mean time to recover on failure
• The rate of change depends on the chosen SLO
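A canary rollout can be sketched as a loop over traffic stages that aborts as soon as the observed error ratio crosses a threshold. Everything here is an assumption made for illustration: the stage percentages, the `error_ratio_at` callback standing in for real monitoring, and the 1% threshold borrowed from the SLI example.

```python
from typing import Callable, Sequence


def gradual_rollout(stages: Sequence[int],
                    error_ratio_at: Callable[[int], float],
                    max_error_ratio: float = 0.01) -> int:
    """Return the traffic percentage the new version serves at the end.

    `stages` lists increasing traffic percentages, e.g. [1, 10, 50, 100].
    `error_ratio_at(percent)` is a hypothetical hook that reports the
    error ratio observed while the canary serves that share of traffic.
    """
    for percent in stages:
        if error_ratio_at(percent) >= max_error_ratio:
            return 0  # roll back: the new version serves no traffic
    return stages[-1] if stages else 0
```

A stricter SLO would typically mean more stages, longer observation at each stage, or a lower abort threshold.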
Tooling & Automation
Toil
Toil is the manual, repetitive work tied to running in PROD (which can be
automated)
Toil & Toil budget
SREs actively measure toil. The toil budget should be
around 30% to 50% of their time.
If toil is not kept within that budget, it easily grows
to fill 100% of the time.
But a small amount of toil is not harmful.
• Automation might be harder than the manual
work
• Helps newcomers to orient themselves
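Measuring toil against its budget is simple arithmetic. The sketch below assumes toil is tracked in hours and uses the upper end of the slide's range (50%) as the default budget; the function names are hypothetical.

```python
def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of working time spent on manual, repetitive production work."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def within_toil_budget(toil_hours: float, total_hours: float,
                       budget: float = 0.5) -> bool:
    """True while toil stays at or below the agreed budget."""
    return toil_fraction(toil_hours, total_hours) <= budget
```

For example, 12 hours of toil in a 40-hour week is 30%, which is inside the budget, while 30 hours is 75% and signals that automation work is overdue.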
Measuring
Service reliability needs to be measured
• Uptime
• Mean time to failure
• Mean time to recover
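The three measurements above can be derived from a list of incidents. This sketch assumes each incident is recorded as a start and end time in minutes within the observation period, and that incidents do not overlap; the `Incident` type and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    start_min: float  # when the outage began, in minutes from period start
    end_min: float    # when service was restored


def reliability_metrics(incidents, period_minutes: float) -> dict:
    """Compute uptime ratio, mean time to failure and mean time to recover."""
    downtime = sum(i.end_min - i.start_min for i in incidents)
    uptime = period_minutes - downtime
    count = len(incidents)
    return {
        "uptime_ratio": uptime / period_minutes,
        # Average operating time per failure.
        "mttf_minutes": uptime / count if count else float("inf"),
        # Average outage duration.
        "mttr_minutes": downtime / count if count else 0.0,
    }
```

With two outages of 10 and 30 minutes in a 1000-minute period, this gives an uptime ratio of 0.96, a mean time to failure of 480 minutes, and a mean time to recover of 20 minutes.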
WhatsApp (Example Use Case)
• Message Delivery Time
• Message Throughput
• Image Resolution (Compression Algorithm)
• Video Compression Quality
• Etc.
Hope is not a
Strategy!
Thank you

Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
