SlideShare a Scribd company logo
1 of 61
Download to read offline
Abeer Rahman
Toronto Agile Conference
Oct 2018
Getting started with
Site Reliability Engineering (SRE):
A guide to improving systems reliability
at production
@abeer486
3:32 am
Things
are
good
No… it
can’t be
down…
It’s
actually
down!
Oh,
things
look
bad..
Things
look
REALLY
BAD!
I’m just
going to
cry in
agony!
(Mental) State transitions during an outage…
Abeer Rahman
Engineering Manager, Deloitte
Focusing on Software delivery, Agile & DevOps for clients
@abeer486
Hi, I’m Abeer
Let’s shift
operations from
reactive
to
proactive
Topics
Closing
Q&A
Measuring
Systems
Budgeting
for Failures
Introduction
to SRE
Enabling
SREs
Managing
Incidents
Technology
Enablers
Introduction to
Site Reliability Engineering
Typical Software Lifecycle
Idea Business Development Operations Customers
The goal is not just for software to be pushed out, but to also be
maintained once it’s live
Sanity Check
How long does it take for an incident to reach the right team?
Can it be minutes, instead of hours?
How often does the same incident re-occur?
Can it be reduced to never?
Reliability is the outcome of a system to be able
to withstand some certain types of failure and still
remain functional from a user’s perspective.
…is what happens when a software engineer is put to solve
operations problems
Site Reliability Engineering…
Site reliability engineering (SRE) is a specialized skillset where software
engineers develop and contribute to the ongoing operation of systems in
production.
Site
Reliability
Engineer
Spend 50% of
contributions in
Systems Engineering
problems
--
infrastructure,
operating system &
network, databases
--
availability,
observability, efficiency
& reliability of services
Spend 50% of
contributions in
Software Engineering
problems
--
application design, data
structure, service design
& key integrations
--
development tasks that
contribute to systems
reliability
SRE team objectives
Site Reliability Engineers spend less than 50% of their time performing operational work to
allow them to spend more time on improving infrastructure and task automation
SRE approach to Operations:
• Treat operations like a software engineering problem
• Automate tasks normally done by system administers (such as
deployments)
• Design more reliable and operable service architectures from the
ground-up, not an afterthought
Support systems in production Contribute to Project work
The evolution of systems requires an evolution of systems
engineers
Why another role/discipline?
Common issues:
• Reliability is usually left to the architects to design, typically at the beginning
of a product design
• Non-functional requirements are not reviewed as often
• Change of functionality may impact previously assumed reliability
requirements
• Most outages are typically caused by change, and a typical Ops mindset does
not necessarily prepare for fast remedy
In the new world, think of systems as software-defined.
Incentives aren’t aligned
Wall of confusion
Development
Speed
Agility
Operations
Stability
Control
DevOps vs. SRE?
DevOps:
A set of principles & culture guidelines
that helps to breakdown the silos
between development and operations
/ networking / security
• Reduce silos
• Plan for failure
• Small batch changes
• Tooling & automation
• Measure everything
Site Reliability Engineering:
A set of practices with an emphasis of
strong engineering capabilities that
implement the DevOps practices, and
sets a job role + team
• Reduce silos by shared ownership
• Plan for failure using error budgets
• Small batch changes with focus on
stability
• Tooling & automation of manual
tasks
• Measure everything monitoring the
right things
SRE implements DevOps
SREs aim create more value in linking Operations activities with
development
This pattern is used by many
organizations that have a high
degree of organizational &
engineering maturity.
• SRE often act as gate-keepers
for production readiness
• Sub-standard software is
rejected and SREs provide
support in refortifying
potential operational issues
https://web.devopstopologies.com/
OpsSREDevOpsDev
Shared Responsibility Model
Application development teams take part in supporting applications in
productions
• SRE work can flow into application development teams
• 5% of operational work to developers
• Take part in on-calls (common these days)
• Give the pager to the developer once the reliability objectives are met
Enabling SREs
Designing & setting up SRE in
an organization
Enabling successful SRE teams requires buy-in across the entire
organization
Rollout of SRE in an Organization
The values of the company must align with how SRE operates
• Organizational mandate
• Leadership buy-in
• Shared Responsibilities
Horizontal enablement of SRE across the organization
Model 1: Centralized SRE Teams as a shared service
Centralized teams
Platform Operations
Product
Team
SRE
… … …
Product Mgmt.
Centralized SRE team that can act as a
shared service for all of the
organization:
• Common backlog of organization’s
reliability improvements – so rolled
out evenly
• SREs are brought in as part of
design, review, validate, go-lives
• Limited level of influence within team
Sample organizations:
• Facebook (via Production
Engineering)
Placing SREs in vertical slices across the organization
Model 2: Horizontal enablement across products/teams
(Embedded with)
Application teams
SREs embedded as full-time team
members as part of the product
teams:
• Treated like a team member of the
product team
• Regular contribution to the overall
product reliability design &
implementation
Sample organizations:
• Google
• Bloomberg
• Uber
Mature state: Pairing with application
team to uplift the application, then
rotate out
Platform Operations
Product
Team
SRE
Product Mgmt.
…SRE
…SRE
…SRE
Platform Operations
Using these as a baseline helps to track progress
Key metrics to measures during enablement of SRE
• Mean Time to Restore (MTTR) service
• Lead time to release (or rollback)
• Improved monitoring to catch & detect issues earlier
• Establishing error budgets to enable budget-based risk management
Finding key talent who have natural curiosity for software
systems and aggressively automates
Hiring & Onboarding SREs
Hiring for skillsets:
• Ideally, multi-skilled individuals who
have experience in writing software and
maintaining software
• Traditional Ops / Sys Admins who
are willing to learn scripting /
programming
• App developers who are willing to step
up for operational roles
• Exposure to and interest in systems
architecture
Onboarding SREs:
• SRE Bootcamps that consist of insight
into the system architecture, software
delivery process and operations
• On-call rotations with senior SRE
members
• Participation in incidents response
calls
Budgeting for failures
Error Budgets
How do we go about measuring availability of a system from the
eyes of a user?
Measuring reliability
• Easy to measure through
systematic means (e.g., server
uptime)
• Difficult to judge success for
distributed systems
• Does not really speak on behalf of
the user’s experience
Availability =
actual system uptime
anticipated system uptime
• Measures the amount for real users
for whom the service is working
• Better for measuring distributed
systems
Availability =
good interactions
total interactions
Taking one step further, setup a budget that allows for taking
risks
Budget for unreliability
100% - Availability target = Budget of unreliability
or, the Error Budget
Method:
1. SRE and Product Management jointly define an availability target
2. After that, an error budget is established
3. Monitoring (via uptime) is set in place to measure performance +
metrics relative to the objective set
4. A feedback loop is created for maintaining / utilizing the budget
Benefits of having Error Budgets
Shared responsibility for uptime
Functionality failures as well as infrastructure issues fall into the same error budget
Common incentive for Product teams & SREs
Creates a balance between reliability and feature innovation
Product teams can prioritize capabilities
The team now can decide where and how to spend the error budget
Reliability goals become credible & achievable
Provides a more pragmatic approach to setting a realistic availability
Error budgets become an explicit trackable metric that glues
product feature planning to service reliability
Setting up a reliability goal depends on the nature of the
business
Unavailability Window
Availability
Target
Allowed Unavailability Window
per year per 30 days
95 % 18.25 days 1.5 days
99 % 3.65 days 7.2 hours
99.9 % 8.76 hours 43.2 minutes
99.99 % 52.6 minutes 4.32 minutes
99.999 % 5.26 minutes 25.9 seconds
Question: Why is 100% reliability a not a good idea?
Product Management with SREs define service levels for the
systems as part of product design
Approach to defining Service Levels
Service levels allow us to:
• Balance features being built with operational reliability
• Qualify and quantify the user experience
• Set user expectations for availability and performance
Method to defining Service Levels
Service Level Agreement (SLA)
• A business contract the service provider + the customer
• Loss of service could be related to loss of business
• Typically, an SLA breach would mean some form of compensation
to the client
Service Level Objective (SLO)
• A threshold beyond which an improvement of the service is
required
• The point at which the users may consider opening up support
ticket, the "pain threshold", e.g., YouTube buffering
• Driven by business requirements, not just current performance
Service Level Indicator (SLI)
• Quantitative measure of an attribute of the service, such as
throughput, latency
• A directly measurable & observable by the users – not an internal
metric (response time vs. CPU utilization)
• Could be a way to represent the user's experience
Group exercise: Setting up Service Levels
Care-Van is an app that helps to find
& book van owners who are available
to give lifts to elderly people who have
mobility issues.
You are an SRE working with the
product team that designs the ride
request for the customers.
What do you suggest are good
measures/metrics of the following:
• Service Level Indicator (SLI)
• Service Level Objective (SLO)
• Service Level Agreement (SLA)
5 minutes
For Reference
SLI: a measurable attribute of a
service that represents
availability/performance of the
system
SLO: defines how the service
should perform, from the
perspective of the user
(measured via SLI)
SLA: a binding contract to
provide a customer some form of
compensation if the service did
not meet expectations
Sample solution
Service Level Indicator (SLI):
the latency of successful HTTP responses (HTTP 200)
Service Level Objective (SLO):
The latency of 95% of the responses must be less than 200ms
Service Level Agreement (SLA):
Customer compensated if the 95th percentile latency exceeds 300ms
Exhausting an Error Budget
As long as there is remaining error budget (in a rolling
timeframe), new features can be deployed.
Question: what can you do if you exhaust the error budget?
Possibly:
• No new feature launches (for a period of time)
• Change the velocity of updates
• Prioritize engineering efforts to reliability
Measuring the Systems
Monitoring & Alerting
It is the primary means in determining and maintaining
reliability of a system
Monitoring is a foundational capability for any organization
Product
Development
Capacity Planning
Test/Release Process
Post Mortems
Incident Response
Monitoring
Mikey Dickerson’s Hierarchy of Service Reliability
Monitoring helps with
identifying what needs
urgent attention, and
helps prioritization:
• Urgency
• Trends
• Planning
• Improvements
Golden Signals of Monitoring
Golden Signals
Latency
Traffic
Errors
Saturation
The case for better monitoring:
• Focused more on uptime to deduce
availability vs. user experience
• Monitoring system metrics, like CPU,
memory, storage are good indicators,
however may not mean anything
meaningful for the client experience
• “Noise overload”
It’s better to other metrics, such as
latency: are the response times taking
longer & longer?
It’s best to define
SLIs & SLOs using
these golden signals
as it accounts for
user’s perception of
the system
Outputs of a good monitoring system
Logging
Alerts
Tickets
Level of
urgency
A carefully crafted
monitoring system should
be able to distinguish
urgent requests from
important ones, and only
alert when required
An alert should only be paged for urgent requests that need
immediate attention
Alerting design consideration
Common issues:
• Not all alerts are helpful
• Alerts is not maintained regularly to keep up
with product functionality being rolled out
Good practices:
• Measure the system along with all of the parts,
but alert only on problems will cause or is
causing end-user pain. This helps to reduce
“alert fatigue”
• Alerts should be actionable (consider logging for
informational items)
Managing incidents
Incident Response & On-Calls
Effective incident response process increases user trust
Incident Response
Users don’t know always expect
a service to be perfect.
However, they expect their
service provider to be
transparent, fast to resolve
issues, and learns quickly from
mistakes.
Incident Response helps to
create that sense of trust when
there is a crisis.
Product
Development
Capacity Planning
Test/Release Process
Post Mortems
Incident Response
Monitoring
Dickerson’s Hierarchy of Service Reliability
People do not typically work well in situations of emergencies, so
having a structured incident response process is essential
Structured Incident Response
Have sufficient monitoring
• Service dashboards indicating throughputs, loads, latency, etc.
• Typically, an SLA breach would mean some form of compensation to the
client
Good alerting & communication process
• Notifying on-call member(s) and having an automated escalation process
if primary on-call is not reachable
• Notifying end-users before or as soon as there is degraded experience
Playbook-style plans & tools
• An easily accessible step-by-step checklist of actions to do when an
incident occurs
• Tools available for system introspection
1
2
3
Great resource:
https://www.incidentresponse.com/playbooks/improper-computer-usage
A post-mortem conducted after an incident helps to build a
culture of self-improvement
Post-mortems
Goals of writing a post-mortem:
• All contributing causes are understood and documented
• Action items are put in place to avoid repeating the same issue again
Good habits:
• Post-mortems conducted & closed within 3-5 days after an incident has
occurred
• Document in details errors, logs, steps taken and decisions made that led to
the incident/outage
• Document fixes if an SLO is breached, even if the SLA is not impacted
VW’s defeat device
scandal Oct 2015
 Transparency
 Accountability
 Blamelessness
https://nypost.com/2015/10/08/volkswagen-exec-blames-rogue-engineers-for-emissions-scandal/
Blameless culture enables faster experimentation without fear.
Because failures are part of the process of experimentation.
Conducting blameless post-mortem
“Human errors” are
system problems. It’s
better to fix systems
& processes to better
support people in
making good choices.
If the culture of
finger-point prevails,
there is no incentive
for team members to
bring forth issues in
fear of punishment.
Useful approach & questions:
• Conduct Five Why’s
• How can you provide better information to
make decisions?
• Was the information misleading? Can this
information be fixed in an automated way?
• Can the activity be automated so it requires
minimal human intervention?
• Could a new hire have made the same mistake
the same way?
John Allspaw’s reference paper on making decisions under pressure:
http://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=8084520&fileOId=8084521
Gitlab’s database
outage post-mortem
in January 2017
 Transparency
 Accountability
 Blamelessness
https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/
Create a work environments that have a balance between
planned/project-driven work and interrupt-driven work
Better On-Calls
Good practices for creating a more sustainable on-call:
• Have a single person be in charge of on-call (Note: if no one is on call, then
everyone becomes on call!)
• On-call person makes sure interrupt-based work is completed without
impacting others who are not on-call
• Create a rotation schedule to ensure relief
• No more than 2 key incidents per on-call engineer per shift(or staff
appropriately)
Project-driven work
Planned work to reduce toil
Interrupt-driven work
Supporting incidents
Enablers
Technology components
Technology capabilities that help with speed & reliability
Microservices
Microservices provide a
foundation to support
automation, rapid
deployment, and DevOps
infrastructure.
Cloud
Cloud architecture is a
direct response to the
need for agility.
Containers
Infrastructure manages
services and software
deployments on IT
resources where
components are
replaced rather than
changed.
Considerations for storage, compute, networking…
Infrastructure Design
Good practices:
• Version control your infrastructure design and code
• Offload as much work to cloud provider
• Design for reliability by having the application + data
situated in more than one datacenter
• Design your infrastructure architecture to support
Canary deployment
Using the cloud as an enabler for reliability
Distribute your infrastructure across multiple datacenters (or
regions/zones)
Sample taken from Google Cloud Platform, however, most cloud providers have this capability.
Considerations for business logic, data layer and presentation…
Application design
Good practices:
• Re-think how your features can be re-architected to
support for scalability (hence increasing reliability)
• Favour scale out over scale up
• De-couple / modular architecture: such as Microservices
• Distribute your applications across multiple geographies
• Consistent build, deploy process (as much as possible)
Design your
applications using
the principles of
Twelve-Factor App
 Enables “cloud-
readiness”
 Helps with
increasing
reliability
https://12factor.net/
Closing
Recap
Measuring
Systems
Budgeting
for Failures
Introduction
to SRE
Enabling
SREs
Managing
Incidents
Technology
Enablers
Get started with your own SRE journey!
Acknowledge: failures are inevitable, it’s best to plan for it
Buy-in: from leadership & top-management
Culture: creating a culture of self improvement
Skillset: look for multi-skilled individuals who strive on collaboration
1. Select a sample application
2. Identify it’s SLI/SLO
3. Create a baseline for current state
4. Incrementally update systems & processes to meet SLO
5. Rinse, repeat
Organization-wide Reliability Improvement backlog
Sources of improvement backlog:
• Previous outages and incidents, especially the ones that have been recurring
• Review of tools used in operations (monitoring, alerting, on-call schedule)
• Operational technical debt – automation of manual scripts
• Mass reboot of all your systems in the whole datacenter
• Do a disaster-recovery (DR) test
Using these as a baseline helps to track progress…
Metrics to measures during enablement of SRE
• Mean Time to Restore (MTTR) service
• Lead time to release (or rollback)
• Improved monitoring to catch & detect issues earlier
• Establishing error budgets to enable budget-based risk management
…or make one that makes sense to you!
Metrics to measures during enablement of SRE
Mean time-to-return-to-bed
(MTTRTB)
What to watch out for when enabling SRE…
Anti-patterns!
• Rebranding your existing Operations team to SRE is not SRE!
• Treat SREs like any other software developers – not like operations
teams
• 100% is not a good SLO. It’s better to plan for failure as so many
interactions are out of your hands
• Do not do the transformation in silo. Teams are willing to change;
best to involve them as part of the transformation journey. E.g.
Change Management, Incident Management, Security
Resources
@abeer486
Thank you!
Let’s shift
operations from
reactive
to
proactive

More Related Content

What's hot

Site reliability engineering
Site reliability engineeringSite reliability engineering
Site reliability engineeringJason Loeffler
 
Site reliability engineering - Lightning Talk
Site reliability engineering - Lightning TalkSite reliability engineering - Lightning Talk
Site reliability engineering - Lightning TalkMichae Blakeney
 
SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)Hussain Mansoor
 
How to SRE when you have no SRE
How to SRE when you have no SREHow to SRE when you have no SRE
How to SRE when you have no SRESquadcast Inc
 
Building an SRE Organization @ Squarespace
Building an SRE Organization @ SquarespaceBuilding an SRE Organization @ Squarespace
Building an SRE Organization @ SquarespaceFranklin Angulo
 
SRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil EliminationSRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil EliminationDr Ganesh Iyer
 
How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)Setyo Legowo
 
Service Level Terminology : SLA ,SLO & SLI
Service Level Terminology : SLA ,SLO & SLIService Level Terminology : SLA ,SLO & SLI
Service Level Terminology : SLA ,SLO & SLIKnoldus Inc.
 
SRE-iously! Reliability!
SRE-iously! Reliability!SRE-iously! Reliability!
SRE-iously! Reliability!New Relic
 
Kks sre book_ch1,2
Kks sre book_ch1,2Kks sre book_ch1,2
Kks sre book_ch1,2Chris Huang
 
What is Site Reliability Engineering (SRE)
What is Site Reliability Engineering (SRE)What is Site Reliability Engineering (SRE)
What is Site Reliability Engineering (SRE)jeetendra mandal
 
A Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityA Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityAcquia
 
SRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLASRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLADr Ganesh Iyer
 
Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7NUS-ISS
 
The Next Wave of Reliability Engineering
The Next Wave of Reliability EngineeringThe Next Wave of Reliability Engineering
The Next Wave of Reliability EngineeringMichael Kehoe
 

What's hot (20)

Site reliability engineering
Site reliability engineeringSite reliability engineering
Site reliability engineering
 
Site reliability engineering - Lightning Talk
Site reliability engineering - Lightning TalkSite reliability engineering - Lightning Talk
Site reliability engineering - Lightning Talk
 
SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)
 
SRE in Startup
SRE in StartupSRE in Startup
SRE in Startup
 
How to SRE when you have no SRE
How to SRE when you have no SREHow to SRE when you have no SRE
How to SRE when you have no SRE
 
Sre summary
Sre summarySre summary
Sre summary
 
Building an SRE Organization @ Squarespace
Building an SRE Organization @ SquarespaceBuilding an SRE Organization @ Squarespace
Building an SRE Organization @ Squarespace
 
SRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil EliminationSRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil Elimination
 
How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)
 
Service Level Terminology : SLA ,SLO & SLI
Service Level Terminology : SLA ,SLO & SLIService Level Terminology : SLA ,SLO & SLI
Service Level Terminology : SLA ,SLO & SLI
 
SRE-iously! Reliability!
SRE-iously! Reliability!SRE-iously! Reliability!
SRE-iously! Reliability!
 
Kks sre book_ch1,2
Kks sre book_ch1,2Kks sre book_ch1,2
Kks sre book_ch1,2
 
What is Site Reliability Engineering (SRE)
What is Site Reliability Engineering (SRE)What is Site Reliability Engineering (SRE)
What is Site Reliability Engineering (SRE)
 
A Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityA Crash Course in Building Site Reliability
A Crash Course in Building Site Reliability
 
SRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLASRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLA
 
SRE vs DevOps
SRE vs DevOpsSRE vs DevOps
SRE vs DevOps
 
SRE 101
SRE 101SRE 101
SRE 101
 
Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7
 
DevOps & SRE at Google Scale
DevOps & SRE at Google ScaleDevOps & SRE at Google Scale
DevOps & SRE at Google Scale
 
The Next Wave of Reliability Engineering
The Next Wave of Reliability EngineeringThe Next Wave of Reliability Engineering
The Next Wave of Reliability Engineering
 

Similar to Getting started with Site Reliability Engineering (SRE)

Puppet Labs EMC DevOps Day NYC Aug-2015
Puppet Labs  EMC DevOps Day NYC Aug-2015Puppet Labs  EMC DevOps Day NYC Aug-2015
Puppet Labs EMC DevOps Day NYC Aug-2015Bob Sokol
 
How to Build High-Performing IT Teams - Including New Data on IT Performance ...
How to Build High-Performing IT Teams - Including New Data on IT Performance ...How to Build High-Performing IT Teams - Including New Data on IT Performance ...
How to Build High-Performing IT Teams - Including New Data on IT Performance ...Puppet
 
Webinar: A Roadmap for DevOps Success
Webinar: A Roadmap for DevOps SuccessWebinar: A Roadmap for DevOps Success
Webinar: A Roadmap for DevOps SuccessJules Pierre-Louis
 
Ais development strategy
Ais development strategyAis development strategy
Ais development strategyRahat Chowdhury
 
Webinar: "5 semplici passi per migliorare la Quality e i processi di Test".
Webinar: "5 semplici passi per migliorare la Quality e i processi di Test".Webinar: "5 semplici passi per migliorare la Quality e i processi di Test".
Webinar: "5 semplici passi per migliorare la Quality e i processi di Test".Emerasoft, solutions to collaborate
 
Software System Engineering - Chapter 1
Software System Engineering - Chapter 1Software System Engineering - Chapter 1
Software System Engineering - Chapter 1Fadhil Ismail
 
Technical Capabilities as enabler for Agile and DevOps
Technical Capabilities as enabler for Agile and DevOpsTechnical Capabilities as enabler for Agile and DevOps
Technical Capabilities as enabler for Agile and DevOpsNelis Boucké
 
Agile metrics - Agile KC Meeting 9/26/13
Agile metrics - Agile KC Meeting 9/26/13Agile metrics - Agile KC Meeting 9/26/13
Agile metrics - Agile KC Meeting 9/26/13molsonkc
 
Recent and-future-trends spm
Recent and-future-trends spmRecent and-future-trends spm
Recent and-future-trends spmPrakash Poudel
 
DevOps 101 - IBM Impact 2014
DevOps 101 - IBM Impact 2014 DevOps 101 - IBM Impact 2014
DevOps 101 - IBM Impact 2014 Sanjeev Sharma
 
Quality Testing and Agile at Salesforce
Quality Testing and Agile at Salesforce Quality Testing and Agile at Salesforce
Quality Testing and Agile at Salesforce Salesforce Engineering
 
Lect5 improving software economics
Lect5 improving software economicsLect5 improving software economics
Lect5 improving software economicsmeena466141
 
Project control and process instrumentation
Project control and process instrumentationProject control and process instrumentation
Project control and process instrumentationKuppusamy P
 
Quantifying DevOps Adoption Empirically for Demonstrable ROI
Quantifying DevOps Adoption Empirically for Demonstrable ROIQuantifying DevOps Adoption Empirically for Demonstrable ROI
Quantifying DevOps Adoption Empirically for Demonstrable ROIDevOps for Enterprise Systems
 
Introduction To Software Concepts Unit 1 & 2
Introduction To Software Concepts Unit 1 & 2Introduction To Software Concepts Unit 1 & 2
Introduction To Software Concepts Unit 1 & 2Raj vardhan
 
05_SQA_Overview.ppt
05_SQA_Overview.ppt05_SQA_Overview.ppt
05_SQA_Overview.pptSaqibHabib11
 
Balanced Agile Approach
Balanced Agile Approach Balanced Agile Approach
Balanced Agile Approach Cary Xie
 

Similar to Getting started with Site Reliability Engineering (SRE) (20)

Puppet Labs EMC DevOps Day NYC Aug-2015
Puppet Labs  EMC DevOps Day NYC Aug-2015Puppet Labs  EMC DevOps Day NYC Aug-2015
Puppet Labs EMC DevOps Day NYC Aug-2015
 
How to Build High-Performing IT Teams - Including New Data on IT Performance ...
How to Build High-Performing IT Teams - Including New Data on IT Performance ...How to Build High-Performing IT Teams - Including New Data on IT Performance ...
How to Build High-Performing IT Teams - Including New Data on IT Performance ...
 
Webinar: A Roadmap for DevOps Success
Webinar: A Roadmap for DevOps SuccessWebinar: A Roadmap for DevOps Success
Webinar: A Roadmap for DevOps Success
 
Ais development strategy
Ais development strategyAis development strategy
Ais development strategy
 
Webinar: "5 semplici passi per migliorare la Quality e i processi di Test".
Webinar: "5 semplici passi per migliorare la Quality e i processi di Test".Webinar: "5 semplici passi per migliorare la Quality e i processi di Test".
Webinar: "5 semplici passi per migliorare la Quality e i processi di Test".
 
Software System Engineering - Chapter 1
Software System Engineering - Chapter 1Software System Engineering - Chapter 1
Software System Engineering - Chapter 1
 
Technical Capabilities as enabler for Agile and DevOps
Technical Capabilities as enabler for Agile and DevOpsTechnical Capabilities as enabler for Agile and DevOps
Technical Capabilities as enabler for Agile and DevOps
 
Agile metrics - Agile KC Meeting 9/26/13
Agile metrics - Agile KC Meeting 9/26/13Agile metrics - Agile KC Meeting 9/26/13
Agile metrics - Agile KC Meeting 9/26/13
 
Recent and-future-trends spm
Recent and-future-trends spmRecent and-future-trends spm
Recent and-future-trends spm
 
Softwaretesting
SoftwaretestingSoftwaretesting
Softwaretesting
 
DevOps 101 - IBM Impact 2014
DevOps 101 - IBM Impact 2014 DevOps 101 - IBM Impact 2014
DevOps 101 - IBM Impact 2014
 
Quality Testing and Agile at Salesforce
Quality Testing and Agile at Salesforce Quality Testing and Agile at Salesforce
Quality Testing and Agile at Salesforce
 
Lect5 improving software economics
Lect5 improving software economicsLect5 improving software economics
Lect5 improving software economics
 
DevOps 1 (1).pptx
DevOps 1 (1).pptxDevOps 1 (1).pptx
DevOps 1 (1).pptx
 
Project control and process instrumentation
Project control and process instrumentationProject control and process instrumentation
Project control and process instrumentation
 
Quantifying DevOps Adoption Empirically for Demonstrable ROI
Quantifying DevOps Adoption Empirically for Demonstrable ROIQuantifying DevOps Adoption Empirically for Demonstrable ROI
Quantifying DevOps Adoption Empirically for Demonstrable ROI
 
Introduction To Software Concepts Unit 1 & 2
Introduction To Software Concepts Unit 1 & 2Introduction To Software Concepts Unit 1 & 2
Introduction To Software Concepts Unit 1 & 2
 
Software testing
Software testingSoftware testing
Software testing
 
05_SQA_Overview.ppt
05_SQA_Overview.ppt05_SQA_Overview.ppt
05_SQA_Overview.ppt
 
Balanced Agile Approach
Balanced Agile Approach Balanced Agile Approach
Balanced Agile Approach
 

Recently uploaded

Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl ServiceRussian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl Servicegwenoracqe6
 
(+971568250507 ))# Young Call Girls in Ajman By Pakistani Call Girls in ...
(+971568250507  ))#  Young Call Girls  in Ajman  By Pakistani Call Girls  in ...(+971568250507  ))#  Young Call Girls  in Ajman  By Pakistani Call Girls  in ...
(+971568250507 ))# Young Call Girls in Ajman By Pakistani Call Girls in ...Escorts Call Girls
 
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)Delhi Call girls
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirtrahman018755
 
Russian Call Girls Pune (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
Russian Call Girls Pune  (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...Russian Call Girls Pune  (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
Russian Call Girls Pune (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...SUHANI PANDEY
 
Moving Beyond Twitter/X and Facebook - Social Media for local news providers
Moving Beyond Twitter/X and Facebook - Social Media for local news providersMoving Beyond Twitter/X and Facebook - Social Media for local news providers
Moving Beyond Twitter/X and Facebook - Social Media for local news providersDamian Radcliffe
 
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...APNIC
 
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting High Prof...
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting  High Prof...VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting  High Prof...
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting High Prof...singhpriety023
 
Top Rated Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated  Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...Top Rated  Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...Call Girls in Nagpur High Profile
 
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort ServiceEnjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort ServiceDelhi Call girls
 
On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024APNIC
 
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...Delhi Call girls
 
Al Barsha Night Partner +0567686026 Call Girls Dubai
Al Barsha Night Partner +0567686026 Call Girls  DubaiAl Barsha Night Partner +0567686026 Call Girls  Dubai
Al Barsha Night Partner +0567686026 Call Girls DubaiEscorts Call Girls
 
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.soniya singh
 
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.soniya singh
 
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...Neha Pandey
 

Recently uploaded (20)

Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
 
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
 
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl ServiceRussian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
 
(+971568250507 ))# Young Call Girls in Ajman By Pakistani Call Girls in ...
(+971568250507  ))#  Young Call Girls  in Ajman  By Pakistani Call Girls  in ...(+971568250507  ))#  Young Call Girls  in Ajman  By Pakistani Call Girls  in ...
(+971568250507 ))# Young Call Girls in Ajman By Pakistani Call Girls in ...
 
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirt
 
Russian Call Girls Pune (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
Russian Call Girls Pune  (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...Russian Call Girls Pune  (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
Russian Call Girls Pune (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
 
Moving Beyond Twitter/X and Facebook - Social Media for local news providers
Moving Beyond Twitter/X and Facebook - Social Media for local news providersMoving Beyond Twitter/X and Facebook - Social Media for local news providers
Moving Beyond Twitter/X and Facebook - Social Media for local news providers
 
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
 
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting High Prof...
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting  High Prof...VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting  High Prof...
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting High Prof...
 
Top Rated Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated  Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...Top Rated  Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
 
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort ServiceEnjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
 
On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024
 
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
 
Al Barsha Night Partner +0567686026 Call Girls Dubai
Al Barsha Night Partner +0567686026 Call Girls  DubaiAl Barsha Night Partner +0567686026 Call Girls  Dubai
Al Barsha Night Partner +0567686026 Call Girls Dubai
 
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
 
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
 
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
 
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
 
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
 

Getting started with Site Reliability Engineering (SRE)

  • 1. Abeer Rahman Toronto Agile Conference Oct 2018 Getting started with Site Reliability Engineering (SRE): A guide to improving systems reliability at production @abeer486
  • 4. Abeer Rahman Engineering Manager, Deloitte Focusing on Software delivery, Agile & DevOps for clients @abeer486 Hi, I’m Abeer Let’s shift operations from reactive to proactive
  • 7. Typical Software Lifecycle Idea Business Development Operations Customers
  • 8. The goal is not just for software to be pushed out, but to also be maintained once it’s live Sanity Check How long does it take for an incident to reach the right team? Can it be minutes, instead of hours? How often does the same incident re-occur? Can it be reduced to never?
  • 9. Reliability is the outcome of a system to be able to withstand some certain types of failure and still remain functional from a user’s perspective.
  • 10. …is what happens when a software engineer is put to solve operations problems Site Reliability Engineering… Site reliability engineering (SRE) is a specialized skillset where software engineers develop and contribute to the ongoing operation of systems in production. Site Reliability Engineer Spend 50% of contributions in Systems Engineering problems -- infrastructure, operating system & network, databases -- availability, observability, efficiency & reliability of services Spend 50% of contributions in Software Engineering problems -- application design, data structure, service design & key integrations -- development tasks that contribute to systems reliability
  • 11. SRE team objectives Site Reliability Engineers spend less than 50% of their time performing operational work to allow them to spend more time on improving infrastructure and task automation SRE approach to Operations: • Treat operations like a software engineering problem • Automate tasks normally done by system administers (such as deployments) • Design more reliable and operable service architectures from the ground-up, not an afterthought Support systems in production Contribute to Project work
  • 12. The evolution of systems requires an evolution of systems engineers Why another role/discipline? Common issues: • Reliability is usually left to the architects to design, typically at the beginning of a product design • Non-functional requirements are not reviewed as often • Change of functionality may impact previously assumed reliability requirements • Most outages are typically caused by change, and a typical Ops mindset does not necessarily prepare for fast remedy In the new world, think of systems as software-defined.
  • 13. Incentives aren’t aligned Wall of confusion Development Speed Agility Operations Stability Control
  • 14. DevOps vs. SRE? DevOps: A set of principles & culture guidelines that helps to breakdown the silos between development and operations / networking / security • Reduce silos • Plan for failure • Small batch changes • Tooling & automation • Measure everything Site Reliability Engineering: A set of practices with an emphasis of strong engineering capabilities that implement the DevOps practices, and sets a job role + team • Reduce silos by shared ownership • Plan for failure using error budgets • Small batch changes with focus on stability • Tooling & automation of manual tasks • Measure everything monitoring the right things SRE implements DevOps
  • 15. SREs aim create more value in linking Operations activities with development This pattern is used by many organizations that have a high degree of organizational & engineering maturity. • SRE often act as gate-keepers for production readiness • Sub-standard software is rejected and SREs provide support in refortifying potential operational issues https://web.devopstopologies.com/ OpsSREDevOpsDev
  • 16. Shared Responsibility Model Application development teams take part in supporting applications in productions • SRE work can flow into application development teams • 5% of operational work to developers • Take part in on-calls (common these days) • Give the pager to the developer once the reliability objectives are met
  • 17. Enabling SREs Designing & setting up SRE in an organization
  • 18. Enabling successful SRE teams requires buy-in across the entire organization Rollout of SRE in an Organization The values of the company must align with how SRE operates • Organizational mandate • Leadership buy-in • Shared Responsibilities
  • 19. Horizontal enablement of SRE across the organization Model 1: Centralized SRE Teams as a shared service Centralized teams Platform Operations Product Team SRE … … … Product Mgmt. Centralized SRE team that can act as a shared service for all of the organization: • Common backlog of organization’s reliability improvements – so rolled out evenly • SREs are brought in as part of design, review, validate, go-lives • Limited level of influence within team Sample organizations: • Facebook (via Production Engineering)
  • 20. Placing SREs in vertical slices across the organization Model 2: Horizontal enablement across products/teams (Embedded with) Application teams SREs embedded as full-time team members as part of the product teams: • Treated like a team member of the product team • Regular contribution to the overall product reliability design & implementation Sample organizations: • Google • Bloomberg • Uber Mature state: Pairing with application team to uplift the application, then rotate out Platform Operations Product Team SRE Product Mgmt. …SRE …SRE …SRE Platform Operations
  • 21. Using these as a baseline helps to track progress Key metrics to measures during enablement of SRE • Mean Time to Restore (MTTR) service • Lead time to release (or rollback) • Improved monitoring to catch & detect issues earlier • Establishing error budgets to enable budget-based risk management
  • 22. Finding key talent who have natural curiosity for software systems and aggressively automates Hiring & Onboarding SREs Hiring for skillsets: • Ideally, multi-skilled individuals who have experience in writing software and maintaining software • Traditional Ops / Sys Admins who are willing to learn scripting / programming • App developers who are willing to step up for operational roles • Exposure to and interest in systems architecture Onboarding SREs: • SRE Bootcamps that consist of insight into the system architecture, software delivery process and operations • On-call rotations with senior SRE members • Participation in incidents response calls
  • 24. How do we go about measuring availability of a system from the eyes of a user? Measuring reliability • Easy to measure through systematic means (e.g., server uptime) • Difficult to judge success for distributed systems • Does not really speak on behalf of the user’s experience Availability = actual system uptime anticipated system uptime • Measures the amount for real users for whom the service is working • Better for measuring distributed systems Availability = good interactions total interactions
  • 25. Taking one step further, setup a budget that allows for taking risks Budget for unreliability 100% - Availability target = Budget of unreliability or, the Error Budget Method: 1. SRE and Product Management jointly define an availability target 2. After that, an error budget is established 3. Monitoring (via uptime) is set in place to measure performance + metrics relative to the objective set 4. A feedback loop is created for maintaining / utilizing the budget
  • 26. Benefits of having Error Budgets Shared responsibility for uptime Functionality failures as well as infrastructure issues fall into the same error budget Common incentive for Product teams & SREs Creates a balance between reliability and feature innovation Product teams can prioritize capabilities The team now can decide where and how to spend the error budget Reliability goals become credible & achievable Provides a more pragmatic approach to setting a realistic availability Error budgets become an explicit trackable metric that glues product feature planning to service reliability
  • 27. Setting up a reliability goal depends on the nature of the business Unavailability Window Availability Target Allowed Unavailability Window per year per 30 days 95 % 18.25 days 1.5 days 99 % 3.65 days 7.2 hours 99.9 % 8.76 hours 43.2 minutes 99.99 % 52.6 minutes 4.32 minutes 99.999 % 5.26 minutes 25.9 seconds Question: Why is 100% reliability a not a good idea?
  • 28. Product Management with SREs define service levels for the systems as part of product design Approach to defining Service Levels Service levels allow us to: • Balance features being built with operational reliability • Qualify and quantify the user experience • Set user expectations for availability and performance
  • 29. Method to defining Service Levels Service Level Agreement (SLA) • A business contract the service provider + the customer • Loss of service could be related to loss of business • Typically, an SLA breach would mean some form of compensation to the client Service Level Objective (SLO) • A threshold beyond which an improvement of the service is required • The point at which the users may consider opening up support ticket, the "pain threshold", e.g., YouTube buffering • Driven by business requirements, not just current performance Service Level Indicator (SLI) • Quantitative measure of an attribute of the service, such as throughput, latency • A directly measurable & observable by the users – not an internal metric (response time vs. CPU utilization) • Could be a way to represent the user's experience
  • 30. Group exercise: Setting up Service Levels Care-Van is an app that helps to find & book van owners who are available to give lifts to elderly people who have mobility issues. You are an SRE working with the product team that designs the ride request for the customers. What do you suggest are good measures/metrics of the following: • Service Level Indicator (SLI) • Service Level Objective (SLO) • Service Level Agreement (SLA) 5 minutes For Reference SLI: a measurable attribute of a service that represents availability/performance of the system SLO: defines how the service should perform, from the perspective of the user (measured via SLI) SLA: a binding contract to provide a customer some form of compensation if the service did not meet expectations
  • 31. Sample solution Service Level Indicator (SLI): the latency of successful HTTP responses (HTTP 200) Service Level Objective (SLO): The latency of 95% of the responses must be less than 200ms Service Level Agreement (SLA): Customer compensated if the 95th percentile latency exceeds 300ms
  • 32. Exhausting an Error Budget As long as there is remaining error budget (in a rolling timeframe), new features can be deployed. Question: what can you do if you exhaust the error budget? Possibly: • No new feature launches (for a period of time) • Change the velocity of updates • Prioritize engineering efforts to reliability
  • 34. It is the primary means in determining and maintaining reliability of a system Monitoring is a foundational capability for any organization Product Development Capacity Planning Test/Release Process Post Mortems Incident Response Monitoring Mikey Dickerson’s Hierarchy of Service Reliability Monitoring helps with identifying what needs urgent attention, and helps prioritization: • Urgency • Trends • Planning • Improvements
  • 35. Golden Signals of Monitoring Golden Signals Latency Traffic Errors Saturation The case for better monitoring: • Focused more on uptime to deduce availability vs. user experience • Monitoring system metrics, like CPU, memory, storage are good indicators, however may not mean anything meaningful for the client experience • “Noise overload” It’s better to other metrics, such as latency: are the response times taking longer & longer? It’s best to define SLIs & SLOs using these golden signals as it accounts for user’s perception of the system
  • 36. Outputs of a good monitoring system Logging Alerts Tickets Level of urgency A carefully crafted monitoring system should be able to distinguish urgent requests from important ones, and only alert when required
  • 37. An alert should only be paged for urgent requests that need immediate attention Alerting design consideration Common issues: • Not all alerts are helpful • Alerts is not maintained regularly to keep up with product functionality being rolled out Good practices: • Measure the system along with all of the parts, but alert only on problems will cause or is causing end-user pain. This helps to reduce “alert fatigue” • Alerts should be actionable (consider logging for informational items)
  • 39. Effective incident response process increases user trust Incident Response Users don’t know always expect a service to be perfect. However, they expect their service provider to be transparent, fast to resolve issues, and learns quickly from mistakes. Incident Response helps to create that sense of trust when there is a crisis. Product Development Capacity Planning Test/Release Process Post Mortems Incident Response Monitoring Dickerson’s Hierarchy of Service Reliability
  • 40.
  • 41. People do not typically work well in situations of emergencies, so having a structured incident response process is essential Structured Incident Response Have sufficient monitoring • Service dashboards indicating throughputs, loads, latency, etc. • Typically, an SLA breach would mean some form of compensation to the client Good alerting & communication process • Notifying on-call member(s) and having an automated escalation process if primary on-call is not reachable • Notifying end-users before or as soon as there is degraded experience Playbook-style plans & tools • An easily accessible step-by-step checklist of actions to do when an incident occurs • Tools available for system introspection 1 2 3 Great resource: https://www.incidentresponse.com/playbooks/improper-computer-usage
  • 42. A post-mortem conducted after an incident helps to build a culture of self-improvement Post-mortems Goals of writing a post-mortem: • All contributing causes are understood and documented • Action items are put in place to avoid repeating the same issue again Good habits: • Post-mortems conducted & closed within 3-5 days after an incident has occurred • Document in details errors, logs, steps taken and decisions made that led to the incident/outage • Document fixes if an SLO is breached, even if the SLA is not impacted
  • 43. VW’s defeat device scandal Oct 2015  Transparency  Accountability  Blamelessness https://nypost.com/2015/10/08/volkswagen-exec-blames-rogue-engineers-for-emissions-scandal/
  • 44. Blameless culture enables faster experimentation without fear. Because failures are part of the process of experimentation. Conducting blameless post-mortem “Human errors” are system problems. It’s better to fix systems & processes to better support people in making good choices. If the culture of finger-point prevails, there is no incentive for team members to bring forth issues in fear of punishment. Useful approach & questions: • Conduct Five Why’s • How can you provide better information to make decisions? • Was the information misleading? Can this information be fixed in an automated way? • Can the activity be automated so it requires minimal human intervention? • Could a new hire have made the same mistake the same way? John Allspaw’s reference paper on making decisions under pressure: http://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=8084520&fileOId=8084521
  • 45. Gitlab’s database outage post-mortem in January 2017  Transparency  Accountability  Blamelessness https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/
  • 46. Create a work environments that have a balance between planned/project-driven work and interrupt-driven work Better On-Calls Good practices for creating a more sustainable on-call: • Have a single person be in charge of on-call (Note: if no one is on call, then everyone becomes on call!) • On-call person makes sure interrupt-based work is completed without impacting others who are not on-call • Create a rotation schedule to ensure relief • No more than 2 key incidents per on-call engineer per shift(or staff appropriately) Project-driven work Planned work to reduce toil Interrupt-driven work Supporting incidents
  • 48. Technology capabilities that help with speed & reliability Microservices Microservices provide a foundation to support automation, rapid deployment, and DevOps infrastructure. Cloud Cloud architecture is a direct response to the need for agility. Containers Infrastructure manages services and software deployments on IT resources where components are replaced rather than changed.
  • 49. Considerations for storage, compute, networking… Infrastructure Design Good practices: • Version control your infrastructure design and code • Offload as much work to cloud provider • Design for reliability by having the application + data situated in more than one datacenter • Design your infrastructure architecture to support Canary deployment
  • 50. Using the cloud as an enabler for reliability Distribute your infrastructure across multiple datacenters (or regions/zones) Sample taken from Google Cloud Platform, however, most cloud providers have this capability.
  • 51. Considerations for business logic, data layer and presentation… Application design Good practices: • Re-think how your features can be re-architected to support for scalability (hence increasing reliability) • Favour scale out over scale up • De-couple / modular architecture: such as Microservices • Distribute your applications across multiple geographies • Consistent build, deploy process (as much as possible)
  • 52. Design your applications using the principles of Twelve-Factor App  Enables “cloud- readiness”  Helps with increasing reliability https://12factor.net/
  • 55. Get started with your own SRE journey! Acknowledge: failures are inevitable, it’s best to plan for it Buy-in: from leadership & top-management Culture: creating a culture of self improvement Skillset: look for multi-skilled individuals who strive on collaboration 1. Select a sample application 2. Identify it’s SLI/SLO 3. Create a baseline for current state 4. Incrementally update systems & processes to meet SLO 5. Rinse, repeat
  • 56. Organization-wide Reliability Improvement backlog Sources of improvement backlog: • Previous outages and incidents, especially the ones that have been recurring • Review of tools used in operations (monitoring, alerting, on-call schedule) • Operational technical debt – automation of manual scripts • Mass reboot of all your systems in the whole datacenter • Do a disaster-recovery (DR) test
  • 57. Using these as a baseline helps to track progress… Metrics to measures during enablement of SRE • Mean Time to Restore (MTTR) service • Lead time to release (or rollback) • Improved monitoring to catch & detect issues earlier • Establishing error budgets to enable budget-based risk management
  • 58. …or make one that makes sense to you! Metrics to measures during enablement of SRE Mean time-to-return-to-bed (MTTRTB)
  • 59. What to watch out for when enabling SRE… Anti-patterns! • Rebranding your existing Operations team to SRE is not SRE! • Treat SREs like any other software developers – not like operations teams • 100% is not a good SLO. It’s better to plan for failure as so many interactions are out of your hands • Do not do the transformation in silo. Teams are willing to change; best to involve them as part of the transformation journey. E.g. Change Management, Incident Management, Security
  • 61. @abeer486 Thank you! Let’s shift operations from reactive to proactive