SlideShare a Scribd company logo
Site Reliability engineering /
Emergency Procedures
Ashutosh Agarwal
Context: Launching new features vs reliability
● Product teams want to move fast
○ Deploy as many features as possible
○ Deploy as fast as possible
● Reliability teams want to be stable
○ Focus on reliability
○ Break as little as possible
Goal of SREs is to enable the product teams to move as fast as practically possible
without impacting users.
This talk
Users are having issues with your service
● What does “having issues with your service” mean ?
● How do we know if something went wrong ?
● What went wrong ?
● What action to take ?
● How to prevent these from occurring in future ?
Availability
Definition varies
● % Uptime
● % Requests served
Path towards high availability
● Add redundancy (N+2)
○ More replicas - webservers, database
○ Load Balancers
● Write more tests : integration tests
● More humans to respond to pagers
● Maintain checklist of possible bugs, and run through the list for each release
● Reduce release velocity
○ Release one feature at a time
○ Release a binary only after exhaustive automated & human testing
Quiz:
I want my
service
availability to be
… ?
❏ 97%
❏ 99.9%
❏ 99.999%
❏ 100%
Right answer
It depends. Trade-off between launch velocity
and system stability.
● A feature with low availability would be
unusable - always broken
● A feature with very high availability would
be unusable - very boring, features missing
Deciding on right availability target
● Infrastructure (shared service) vs user-facing service
● Mature Product vs New Product
● Paid vs Free Product
● High Revenue impact vs Low
Basically, it’s a very logical cost-benefit analysis.
Service Level Indicators (SLI),
Service Level Objectives (SLO)
Service Level Indicators
What are the metrics that matter ?
Latency
Throughput : Number of requests processed per second
Error Rate : % of requests answered
It is important that these metrics are clearly
thought-about and mutually agreed upon.
Service Level Objectives (SLOs)
What are the target values for your SLIs
Latency - 300ms.
Often, it is hard to define SLOs for some metrics.
In those cases:
- It is important to have a hard upper limit
- Enables abnormal behavior detection
- Gives a guiding benchmark to teams
Expected downtime
99% = 4 days of downtime a year
99.9% = 9 hours of downtime
99.99% = 1 hour of downtime
99.999% = 5 min of downtime
Error Budget
Ex: Target Availability 99.9%
Practically, the service is available 99.99% time.
The service has gained some error budget to launch new features faster now !
Likewise, vice-versa.
Reliable only up to the defined limit, no more
● Sometimes, systems may give false impression of being over-reliable
● Leads other teams to build with false assumptions
Extreme measure : Planned Downtime
Abnormality Detection
It would have been easy if …
Database
Server
Web
Server
A simple web application
But most often in reality
Load
Balancer
Database
Server
Database
Server
Master
/
Router
Cache
Layer
Service n
Database
Server
Database
Server
Master
/
Router
Cache
Layer
Service 1
Merge
Node
Web
Server
Web
Server
High-level architecture of a web-application
Monitoring
Without monitoring ...
● No way to know if my system is working, it is like flying in the dark
● Predicting a future issue, is not at all possible
Good monitoring
● Enables analysis of long-term trends
● Enables comparative analysis
○ 1 Machine with 32GB RAM vs 2 machines with 16 GM RAM each
● Enables conducting ad hoc analysis - latency shot up, what else changed ?
Good monitoring
● Never requires a human to interpret an alert
○ Humans add latency
○ Alert fatigue
○ Ignore, if alarms are usually false
● 3 kinds of monitoring output
○ Alert : True alarm - red alert. Human needs to take action.
○ Ticket : Alarm - yellow alert. Human needs to take action eventually
○ Logging : Not a alarm - no action needed.
Good Alerts
● Alert is a sign that something is
actually broken
Bad alert : Average latency is 125ms.
Good alert: Average latency is
breaking latency SLOs
● Alerts MUST be actionable.
● Alert for symptoms, not causes
What = Symptom, Why = (Possible)
Cause
“What” versus “Why” is an important
distinction between good monitoring
with maximum signal and minimal
noise.
Black box monitoring
● Mimics a system user
● Shows how your users view your
system
Types of monitoring
White box monitoring
● Viewing your system from the inside,
knowing very well how it functions
● For e.g each service keeps counters of
all backend service failures, latencies
Black-box monitoring example
Request
GET /mysite.html HTTP/1.1
Host: www.example.com
Response
HTTP/1.1 200 OK
Date: 01 Jan 1900
Content Type: text/html
<html> … </html>
White-box monitoring
Load
Balancer
Database
Server
Database
Server
Master
/
Router
Cache
Layer
Service n
Database
Server
Database
Server
Master
/
Router
Cache
Layer
Service 1
Merge
Node
Web
Server
Web
Server
qps: 100
peak: 500
num_shards: 9
backend_latency:
200ms
mem_size: 123 mb
peak: 1029 mb
backend_latency:
20ms
Life of an incident
10.59 PM
Error
404
11.00 PM
Error
404
Error
404
Error
404
Error
404
Error
404
10 users get 404 error in 1 minute.
11.00 PM
Error
404
Error
404
Error
404
Error
404
Error
404
10 users get 404 error in 1 minute.
Monitoring
system kicks in
Sends a
pager to
oncall
engineer /
SRE
In this case, black box monitoring kicked off an alert.
This is after the fact alert.
Sometimes, white-box monitoring can help predict an
issue is about to happen.
11.05 PM
Playbook
(s)
Verifies the
user impact
Finds out the
underlying
broken service
causing the
issue
Consults the
playbook of
underlying
service
Finds how to
disable a
particular
feature
White-box monitorning invariably plays a big role in figuring
out the origin of the problem.
Also, a good central logging system used by ALL servers
throughout the system.
11.10 PM
Load
Balancer
Database
Server
Database
Server
Master
/
Router
Cache
Layer
Service n
Database
Server
Database
Server
Master
/
Router
Cache
Layer
Service 1
Merge
Node
Web
Server
Web
Server
Issue
fixed
In summary
● Playbooks: guide to turning off features, feature flag names & descriptions
● Mean time to failure (MTTF)
● Mean time to recovery (MTTR)
Reliability = function (MTTF, MTTR)
● Statistically, having playbooks improves MTTR by 3x.
Culture of SREs
John is a hero.
John knows about the system more than
others.
John can fix this issue in under a minute.
John doesn’t need to read docs to fix it.
But there is only ONE John.
And John needs to sleep, eat …
And John can solve only ONE issue at a
time.
Avoid a culture of
heros.
Preventing future hiccups
Release management : Progressive rollouts
● Canary servers
○ A few production instances serving traffic from new binary
● Canary for sufficient time
○ Give the new binary some time to experience all possible use cases
● Canary becomes prod on green
● Changes be backward compatible with 2-3 releases
○ Binaries can be safely rolled back
No-blame postmortem
● Teams write detailed postmortem after resolving outages
● Not attributed to any single person / team
● Postmortem is structured as:
○ What went wrong
○ What was the immediate fix
○ What is the long term fix
○ What can we do to prevent it occurring in the future ?
Catastrophe Testing (Artificial)
Dedicated days to stress test the system
● Bring down a few machines
● Lose network connectivity
In summary
● Judiciously decide on what metrics to measure, and target values
● Monitor both user facing as well as internal metrics
● Alert only when necessary
● Plan for failure management - dashboards, playbooks and uniform logging
● Avoid a culture of heros
● Learn from failures
Questions ?

More Related Content

What's hot

Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
ITSM Academy, Inc.
 
SRE in Startup
SRE in StartupSRE in Startup
SRE in Startup
Ladislav Prskavec
 
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet SugathadasaSite Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Keet Sugathadasa
 
SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...
DevClub_lv
 
A Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityA Crash Course in Building Site Reliability
A Crash Course in Building Site Reliability
Acquia
 
Sre summary
Sre summarySre summary
Sre summary
Yogesh Shah
 
What is Site Reliability Engineering (SRE)
What is Site Reliability Engineering (SRE)What is Site Reliability Engineering (SRE)
What is Site Reliability Engineering (SRE)
jeetendra mandal
 
DevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE ConceptsDevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE Concepts
Rauno De Pasquale
 
SRE 101
SRE 101SRE 101
SRE 101
Diego Pacheco
 
SRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLASRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLA
Dr Ganesh Iyer
 
SRE-iously! Reliability!
SRE-iously! Reliability!SRE-iously! Reliability!
SRE-iously! Reliability!
New Relic
 
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
DevOpsDays Tel Aviv
 
How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)
Setyo Legowo
 
SRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil EliminationSRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil Elimination
Dr Ganesh Iyer
 
SRE-iously: Defining the Principles, Habits, and Practices of Site Reliabilit...
SRE-iously: Defining the Principles, Habits, and Practices of Site Reliabilit...SRE-iously: Defining the Principles, Habits, and Practices of Site Reliabilit...
SRE-iously: Defining the Principles, Habits, and Practices of Site Reliabilit...
New Relic
 
SRE vs DevOps
SRE vs DevOpsSRE vs DevOps
SRE vs DevOps
Levon Avakyan
 
Site reliability engineering - Lightning Talk
Site reliability engineering - Lightning TalkSite reliability engineering - Lightning Talk
Site reliability engineering - Lightning Talk
Michae Blakeney
 
Managing software projects & teams effectively
Managing software projects & teams effectivelyManaging software projects & teams effectively
Managing software projects & teams effectively
Ashutosh Agarwal
 
Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7
NUS-ISS
 
DevOps & SRE at Google Scale
DevOps & SRE at Google ScaleDevOps & SRE at Google Scale
DevOps & SRE at Google Scale
Kaushik Bhattacharya
 

What's hot (20)

Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
 
SRE in Startup
SRE in StartupSRE in Startup
SRE in Startup
 
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet SugathadasaSite Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
 
SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...
 
A Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityA Crash Course in Building Site Reliability
A Crash Course in Building Site Reliability
 
Sre summary
Sre summarySre summary
Sre summary
 
What is Site Reliability Engineering (SRE)
What is Site Reliability Engineering (SRE)What is Site Reliability Engineering (SRE)
What is Site Reliability Engineering (SRE)
 
DevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE ConceptsDevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE Concepts
 
SRE 101
SRE 101SRE 101
SRE 101
 
SRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLASRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLA
 
SRE-iously! Reliability!
SRE-iously! Reliability!SRE-iously! Reliability!
SRE-iously! Reliability!
 
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
 
How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)
 
SRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil EliminationSRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil Elimination
 
SRE-iously: Defining the Principles, Habits, and Practices of Site Reliabilit...
SRE-iously: Defining the Principles, Habits, and Practices of Site Reliabilit...SRE-iously: Defining the Principles, Habits, and Practices of Site Reliabilit...
SRE-iously: Defining the Principles, Habits, and Practices of Site Reliabilit...
 
SRE vs DevOps
SRE vs DevOpsSRE vs DevOps
SRE vs DevOps
 
Site reliability engineering - Lightning Talk
Site reliability engineering - Lightning TalkSite reliability engineering - Lightning Talk
Site reliability engineering - Lightning Talk
 
Managing software projects & teams effectively
Managing software projects & teams effectivelyManaging software projects & teams effectively
Managing software projects & teams effectively
 
Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7
 
DevOps & SRE at Google Scale
DevOps & SRE at Google ScaleDevOps & SRE at Google Scale
DevOps & SRE at Google Scale
 

Similar to Overview of Site Reliability Engineering (SRE) & best practices

Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache KafkaStrata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
confluent
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum Japan
Brian Brazil
 
Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)
Brian Brazil
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
Brian Brazil
 
Gatling - Bordeaux JUG
Gatling - Bordeaux JUGGatling - Bordeaux JUG
Gatling - Bordeaux JUGslandelle
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Brian Brazil
 
Sistemas Distribuidos
Sistemas DistribuidosSistemas Distribuidos
Sistemas Distribuidos
Locaweb
 
Tef con2016 (1)
Tef con2016 (1)Tef con2016 (1)
Tef con2016 (1)
ggarber
 
HowTo DR
HowTo DRHowTo DR
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slides
Ed Hunter
 
The on-call survival guide - how to be confident on-call
The on-call survival guide - how to be confident on-call The on-call survival guide - how to be confident on-call
The on-call survival guide - how to be confident on-call
Raygun
 
Diesel load testing tool
Diesel load testing toolDiesel load testing tool
Diesel load testing tool
Syed Zaid Irshad
 
Monitoring &amp; alerting presentation sabin&amp;mustafa
Monitoring &amp; alerting presentation sabin&amp;mustafaMonitoring &amp; alerting presentation sabin&amp;mustafa
Monitoring &amp; alerting presentation sabin&amp;mustafa
Lama K Banna
 
Ensuring Your Technology Will Scale
Ensuring Your Technology Will ScaleEnsuring Your Technology Will Scale
Ensuring Your Technology Will Scale
basissetventures
 
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization TechniqueSquare Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique
ScyllaDB
 
OSMC 2019 | How to improve database Observability by Charles Judith
OSMC 2019 | How to improve database Observability by Charles JudithOSMC 2019 | How to improve database Observability by Charles Judith
OSMC 2019 | How to improve database Observability by Charles Judith
NETWAYS
 
Benchmarks, performance, scalability, and capacity what s behind the numbers...
Benchmarks, performance, scalability, and capacity  what s behind the numbers...Benchmarks, performance, scalability, and capacity  what s behind the numbers...
Benchmarks, performance, scalability, and capacity what s behind the numbers...
james tong
 
Benchmarks, performance, scalability, and capacity what's behind the numbers
Benchmarks, performance, scalability, and capacity what's behind the numbersBenchmarks, performance, scalability, and capacity what's behind the numbers
Benchmarks, performance, scalability, and capacity what's behind the numbers
Justin Dorfman
 
Nfr testing(performance)
Nfr testing(performance)Nfr testing(performance)
Nfr testing(performance)
Dilip Sharma
 

Similar to Overview of Site Reliability Engineering (SRE) & best practices (20)

Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache KafkaStrata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum Japan
 
Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
 
Gatling - Bordeaux JUG
Gatling - Bordeaux JUGGatling - Bordeaux JUG
Gatling - Bordeaux JUG
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
Sistemas Distribuidos
Sistemas DistribuidosSistemas Distribuidos
Sistemas Distribuidos
 
Tef con2016 (1)
Tef con2016 (1)Tef con2016 (1)
Tef con2016 (1)
 
HowTo DR
HowTo DRHowTo DR
HowTo DR
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slides
 
The on-call survival guide - how to be confident on-call
The on-call survival guide - how to be confident on-call The on-call survival guide - how to be confident on-call
The on-call survival guide - how to be confident on-call
 
Diesel load testing tool
Diesel load testing toolDiesel load testing tool
Diesel load testing tool
 
Monitoring &amp; alerting presentation sabin&amp;mustafa
Monitoring &amp; alerting presentation sabin&amp;mustafaMonitoring &amp; alerting presentation sabin&amp;mustafa
Monitoring &amp; alerting presentation sabin&amp;mustafa
 
Ensuring Your Technology Will Scale
Ensuring Your Technology Will ScaleEnsuring Your Technology Will Scale
Ensuring Your Technology Will Scale
 
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization TechniqueSquare Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique
 
OSMC 2019 | How to improve database Observability by Charles Judith
OSMC 2019 | How to improve database Observability by Charles JudithOSMC 2019 | How to improve database Observability by Charles Judith
OSMC 2019 | How to improve database Observability by Charles Judith
 
Benchmarks, performance, scalability, and capacity what s behind the numbers...
Benchmarks, performance, scalability, and capacity  what s behind the numbers...Benchmarks, performance, scalability, and capacity  what s behind the numbers...
Benchmarks, performance, scalability, and capacity what s behind the numbers...
 
Benchmarks, performance, scalability, and capacity what's behind the numbers
Benchmarks, performance, scalability, and capacity what's behind the numbersBenchmarks, performance, scalability, and capacity what's behind the numbers
Benchmarks, performance, scalability, and capacity what's behind the numbers
 
sat_presentation
sat_presentationsat_presentation
sat_presentation
 
Nfr testing(performance)
Nfr testing(performance)Nfr testing(performance)
Nfr testing(performance)
 

Recently uploaded

The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 

Recently uploaded (20)

The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 

Overview of Site Reliability Engineering (SRE) & best practices

  • 1. Site Reliability engineering / Emergency Procedures Ashutosh Agarwal
  • 2. Context: Launching new features vs reliability ● Product teams want to move fast ○ Deploy as many features as possible ○ Deploy as fast as possible ● Reliability teams want to be stable ○ Focus on reliability ○ Break as little as possible Goal of SREs is to enable the product teams to move as fast as practically possible without impacting users.
  • 3. This talk Users are having issues with your service ● What does “having issues with your service” mean ? ● How do we know if something went wrong ? ● What went wrong ? ● What action to take ? ● How to prevent these from occurring in future ?
  • 4. Availability Definition varies ● % Uptime ● % Requests served
  • 5. Path towards high availability ● Add redundancy (N+2) ○ More replicas - webservers, database ○ Load Balancers ● Write more tests : integration tests ● More humans to respond to pagers ● Maintain checklist of possible bugs, and run through the list for each release ● Reduce release velocity ○ Release one feature at a time ○ Release a binary only after exhaustive automated & human testing
  • 6. Quiz: I want my service availability to be … ? ❏ 97% ❏ 99.9% ❏ 99.999% ❏ 100%
  • 7. Right answer It depends. Trade-off between launch velocity and system stability. ● A feature with low availability would be unusable - always broken ● A feature with very high availability would be unusable - very boring, features missing
  • 8. Deciding on right availability target ● Infrastructure (shared service) vs user-facing service ● Mature Product vs New Product ● Paid vs Free Product ● High Revenue impact vs Low Basically, it’s a very logical cost-benefit analysis.
  • 9. Service Level Indicators (SLI), Service Level Objectives (SLO)
  • 10. Service Level Indicators What are the metrics that matter ? Latency Throughput : Number of requests processed per second Error Rate : % of requests answered It is important that these metrics are clearly thought-about and mutually agreed upon.
  • 11. Service Level Objectives (SLOs) What are the target values for your SLIs Latency - 300ms. Often, it is hard to define SLOs for some metrics. In those cases: - It is important to have a hard upper limit - Enables abnormal behavior detection - Gives a guiding benchmark to teams
  • 12. Expected downtime 99% = 4 days of downtime a year 99.9% = 9 hours of downtime 99.99% = 1 hour of downtime 99.999% = 5 min of downtime
  • 13. Error Budget Ex: Target Availability 99.9% Practically, the service is available 99.99% time. The service has gained some error budget to launch new features faster now ! Likewise, vice-versa.
  • 14. Reliable only up to the defined limit, no more ● Sometimes, systems may give false impression of being over-reliable ● Leads other teams to build with false assumptions Extreme measure : Planned Downtime
  • 15.
  • 17. It would have been easy if … Database Server Web Server A simple web application
  • 18. But most often in reality Load Balancer Database Server Database Server Master / Router Cache Layer Service n Database Server Database Server Master / Router Cache Layer Service 1 Merge Node Web Server Web Server High-level architecture of a web-application
  • 19. Monitoring Without monitoring ... ● No way to know if my system is working, it is like flying in the dark ● Predicting a future issue, is not at all possible Good monitoring ● Enables analysis of long-term trends ● Enables comparative analysis ○ 1 Machine with 32GB RAM vs 2 machines with 16 GM RAM each ● Enables conducting ad hoc analysis - latency shot up, what else changed ?
  • 20. Good monitoring ● Never requires a human to interpret an alert ○ Humans add latency ○ Alert fatigue ○ Ignore, if alarms are usually false ● 3 kinds of monitoring output ○ Alert : True alarm - red alert. Human needs to take action. ○ Ticket : Alarm - yellow alert. Human needs to take action eventually ○ Logging : Not a alarm - no action needed.
  • 21. Good Alerts ● Alert is a sign that something is actually broken Bad alert : Average latency is 125ms. Good alert: Average latency is breaking latency SLOs ● Alerts MUST be actionable. ● Alert for symptoms, not causes What = Symptom, Why = (Possible) Cause “What” versus “Why” is an important distinction between good monitoring with maximum signal and minimal noise.
  • 22. Black box monitoring ● Mimics a system user ● Shows how your users view your system Types of monitoring White box monitoring ● Viewing your system from the inside, knowing very well how it functions ● For e.g each service keeps counters of all backend service failures, latencies
  • 23. Black-box monitoring example Request GET /mysite.html HTTP/1.1 Host: www.example.com Response HTTP/1.1 200 OK Date: 01 Jan 1900 Content Type: text/html <html> … </html>
  • 24. White-box monitoring Load Balancer Database Server Database Server Master / Router Cache Layer Service n Database Server Database Server Master / Router Cache Layer Service 1 Merge Node Web Server Web Server qps: 100 peak: 500 num_shards: 9 backend_latency: 200ms mem_size: 123 mb peak: 1029 mb backend_latency: 20ms
  • 25. Life of an incident
  • 28. 11.00 PM Error 404 Error 404 Error 404 Error 404 Error 404 10 users get 404 error in 1 minute. Monitoring system kicks in Sends a pager to oncall engineer / SRE In this case, black box monitoring kicked off an alert. This is after the fact alert. Sometimes, white-box monitoring can help predict an issue is about to happen.
  • 29. 11.05 PM Playbook (s) Verifies the user impact Finds out the underlying broken service causing the issue Consults the playbook of underlying service Finds how to disable a particular feature White-box monitorning invariably plays a big role in figuring out the origin of the problem. Also, a good central logging system used by ALL servers throughout the system.
  • 31. In summary ● Playbooks: guide to turning off features, feature flag names & descriptions ● Mean time to failure (MTTF) ● Mean time to recovery (MTTR) Reliability = function (MTTF, MTTR) ● Statistically, having playbooks improves MTTR by 3x.
  • 33. John is a hero. John knows about the system more than others. John can fix this issue in under a minute. John doesn’t need to read docs to fix it. But there is only ONE John. And John needs to sleep, eat … And John can solve only ONE issue at a time. Avoid a culture of heros.
  • 35. Release management : Progressive rollouts ● Canary servers ○ A few production instances serving traffic from new binary ● Canary for sufficient time ○ Give the new binary some time to experience all possible use cases ● Canary becomes prod on green ● Changes be backward compatible with 2-3 releases ○ Binaries can be safely rolled back
  • 36. No-blame postmortem ● Teams write detailed postmortem after resolving outages ● Not attributed to any single person / team ● Postmortem is structured as: ○ What went wrong ○ What was the immediate fix ○ What is the long term fix ○ What can we do to prevent it occurring in the future ?
  • 37. Catastrophe Testing (Artificial) Dedicated days to stress test the system ● Bring down a few machines ● Lose network connectivity
  • 38. In summary ● Judiciously decide on what metrics to measure, and target values ● Monitor both user facing as well as internal metrics ● Alert only when necessary ● Plan for failure management - dashboards, playbooks and uniform logging ● Avoid a culture of heros ● Learn from failures