S.R.E.
create ultra-scalable and highly
reliable systems
Ricardo Amaro
DevOps - https://events.drupal.org/node/13519
Who am I?
@Drupal
@ricardoamaro
Portugal
Lisbon
Drupal
Community
Family
+8 years Drupal
90’s Linux Adopter
5 years at Acquia
Site Reliability Engineer,
Senior Tier2 Ops
https://drupal.org/user/666176
About Acquia Metrics
○ Acquia Cloud:
○ # of Instances (17,200+)
○ # of Production Sites (54,000+)
○ # API Calls (3,000 + per sec)
○ # Of Availability Zones (20+)
○ # Of Regions (8)
We will talk about
A brief summary inspired by Google’s S.R.E. book
○ What is S.R.E?
○ Tenets of S.R.E.
○ Reliability & Toil
○ Error budget - keeping the Service Level Objective (SLO)
○ Development & Operations
○ Monitoring and Being On-Call
○ Release Engineering
○ Postmortem culture - Learning from failure
What is S.R.E.?
➔ Term coined at Google in 2003.
➔ Ben Treynor was hired to run “production” and ended up
“applying software engineering to an operations function”
➔ Motivation: “as a software engineer, how would I want to invest my
time to accomplish a set of repetitive tasks?”
Site Reliability Engineering
➔ SRE is taken seriously by major companies
Microsoft
Apple
Amazon
SREs are engineers who...
➔ Apply the principles of computer science and engineering to
design and develop large, distributed computing systems.
➔ Write software for those systems alongside product developers.
➔ Build all additional pieces those systems need, like backups and
load balancing.
➔ Reuse old solutions for new problems.
DevOps & S.R.E.
DevOps is a practice, which was coined around
2008, that encompasses automation of manual
tasks, continuous integration and continuous
delivery. It applies to a wide audience of
companies whereas SRE might be considered a
subset of DevOps that possesses additional skill
sets.
Source:
https://en.wikipedia.org/wiki/Site_reliability_engineering
Tenets of S.R.E.
1. Ensuring a Durable Focus on Engineering
2. Pursuing Maximum Change Velocity
3. Monitoring
4. Emergency Response
5. Change Management
6. Demand Forecasting and Capacity Planning
7. Provisioning
8. Efficiency and Performance
➔ Hire only coders
➔ Have Service Level Objectives (SLOs) for your service
➔ Measure and report performance against SLOs
➔ Use Error Budgets and gate launches on them
➔ Have a Common staffing pool for SRE and DEV
➔ Excess Ops work overflows to DEV team
➔ Cap SRE operational load at 50% and share 5% of the ops work with the DEV team
➔ On-call teams of at least 8 people at one site, or 6 at each of two sites, per product
➔ Maximum of 2 events per on-call shift
➔ Post mortem for every event
➔ Post mortems are BLAMELESS and focus on process and technology, not people
How to achieve S.R.E.
Treynor’s Action items
IMPORTANT
Reliability & Toil
The latest feature
or
That the product works?
What is the most important feature of a product?
How about the “503” feature ?
...most important thing is that the product works!
“Reliability is the most fundamental feature of any product.”
Ben Treynor, Google’s VP for 24/7 Operations
The 80’s Waterfall software delivery model
Operations @customer
➔ *Provisioning
➔ *Installing
➔ *Upgrading
➔ *Maintaining
➔ *Backups/Restore
➔ *Scaling
Source: wikipedia
Then came the web...
● Software as a Service
● Platform as a Service
● Cloud computing
● ...
➔ Operations overhead not on the customer side
➔ Features could now be delivered faster
➔ Customer feedback important for product improvements
Product
Development
Ship Features
Operations
Users
Conflicting incentives
Objectives:
➔ Ship new features
➔ Launch new products
Objectives:
➔ Reliability & Availability
➔ Provision & Scale
Dev Ops
The problem: Toil*
*exhausting labour
➔ Manual
➔ Repetitive
➔ Automatable
➔ Tactical (Unplanned work)
➔ No enduring value
➔ O(n) with service growth
(not just “work I don’t like to do.”)
An Old Solution to Toil
● Scale with bodies
In the old operations model, you throw
people at a reliability problem and keep
pushing (sometimes for a year or more)
until the problem either goes away or
blows up in your face.
As your business succeeds,
workload tends to infinity
(x) time
● Cap Ops Workload
If you are successful and your
business grows, you need to reduce
errors and toil. Put a 50% cap on Ops
work and leave the rest of the SRE team's
time for writing code that reduces toil.
(y)customers/traffic
Workload/Toil over time
➔ Keep operational work (i.e., toil) below 50% of each SRE's time
➔ More than 50% of each SRE's time is spent on:
◆ Engineering project work to reduce toil
◆ Add service features - improving reliability, performance,
utilization
➔ Improves career planning for the SRE
➔ Improves morale in the organization
➔ An SRE team can easily devolve into an Ops team if the 50% target
is broken
Why is less Toil better?
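The 50% cap can be checked with simple arithmetic over a time log. The following is a minimal sketch, assuming hours are tracked per category; the category names and the sample week are hypothetical and not taken from the talk.

```python
TOIL_CAP = 0.5  # the 50% cap on operational work described above


def toil_fraction(hours_by_category: dict[str, float],
                  toil_categories: set[str]) -> float:
    """Return the share of logged hours that counts as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(h for c, h in hours_by_category.items() if c in toil_categories)
    return toil / total


# Hypothetical week for one SRE (category names are assumptions).
week = {"interrupts": 10, "on_call": 8, "manual_releases": 6, "project_work": 16}
toil = {"interrupts", "on_call", "manual_releases"}

fraction = toil_fraction(week, toil)
print(f"toil: {fraction:.0%}, over cap: {fraction > TOIL_CAP}")
```

When the fraction stays above the cap, the overflow is the work that would be redirected to the development team.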
S.R.E. - A modern solution
not bad...
DEV + OPS
➔ This conflict is not inevitable
➔ The solution is: Error Budgets!
➔ Everyone agrees on an Error Budget (as we will explain next)
➔ SRE only prevents releases or Launches if the Error Budget is exceeded.
Dev Ops
error budget
keeping the SLO
➔ SLO - Service level objective is agreed as a means of measuring the performance of the
Service Provider.
➔ SLA - Service Level Agreement specifies what service is to be provided, how it is
supported, times, locations, costs, performance, and responsibilities of the parties
involved. SLOs are specific measurable characteristics of the SLA such as availability,
throughput, frequency, response time, or quality.
➔ SLI - Service Level Indicator is a measure of the service level provided by a service
provider to a customer. SLIs form the basis of Service Level Objectives (SLOs), which in
turn form the basis of Service Level Agreements (SLAs).
SLO, SLA & SLI Terminology
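To make the terminology concrete: an SLI is a measured value, and an SLO is a target for that value. Here is a minimal sketch, assuming availability is measured as the ratio of successful requests to total requests; the function names and request counts are illustrative assumptions.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the fraction of requests served successfully (the SLI)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is a target for an SLI; it is met when the SLI reaches it."""
    return sli >= slo_target


# Hypothetical measurement window.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
print(f"SLI = {sli:.2%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```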
What is an Error Budget?
The business or the product establishes Service Level Objectives (SLOs) for the system, based on
Service Level indicators such as error rate, availability or latency...
Error Budget
Example: A 99.9% availability SLO means that the service can be 0.1% unavailable, which is the error budget.
100% - 99.9% = 0.1%
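The same subtraction can be turned into allowed downtime for a given period. This is a small sketch, assuming a time-based availability SLO and a 30-day window; both the window length and the function names are assumptions chosen for illustration.

```python
from datetime import timedelta


def error_budget(slo_target: float) -> float:
    """Error budget as a fraction: 100% minus the SLO."""
    return 1.0 - slo_target


def allowed_downtime(slo_target: float, window: timedelta) -> timedelta:
    """Downtime the budget permits over the given window."""
    return window * error_budget(slo_target)


for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} SLO -> {allowed_downtime(slo, timedelta(days=30))} per 30 days")
```

For a 99.9% SLO over 30 days this works out to about 43 minutes of unavailability.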
➔ 100% is the wrong reliability target for basically everything.
➔ Set a goal that acknowledges the trade-off and leaves an error budget
➔ Error budget can be spent on anything: launching features, etc.
➔ Error budget allows for discussion about how phased rollouts and 1%
experiments can maintain tolerable levels of errors.
➔ Goal of SRE team isn’t “zero outages” – SRE and product devs are incentive
aligned to spend the error budget to get maximum feature velocity.
➔ Out of budget? No problem. Do more testing between releases.
How to obtain the Error Budget
➔ This gives developers an incentive to value stability (not just change)
➔ And gives SREs a control that lets them permit change (not just stability)
➔ It forces decisions based on metrics, not politics or feelings: just data
Error Budget
A Self-regulating mechanism
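The self-regulating mechanism can be sketched as a release gate: launches proceed while budget remains and stop once it is spent. This is an illustrative sketch, assuming a request-based budget; `BudgetStatus` and `launch_allowed` are hypothetical names, not part of any tool mentioned in the talk.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    slo_target: float     # e.g. 0.999 for a 99.9% SLO
    total_requests: int   # requests served in the current window
    failed_requests: int  # requests that violated the SLO

    @property
    def budget_requests(self) -> float:
        """Number of failures the SLO tolerates in this window."""
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        """Budget left; negative once the budget is exceeded."""
        return self.budget_requests - self.failed_requests


def launch_allowed(status: BudgetStatus) -> bool:
    """Gate launches on the error budget, based on data alone."""
    return status.remaining > 0


healthy = BudgetStatus(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
spent = BudgetStatus(slo_target=0.999, total_requests=1_000_000, failed_requests=1_500)
print(launch_allowed(healthy), launch_allowed(spent))
```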
Development &
Operations
➔ Development and SRE teams share a
single staffing pool
◆ If all is reliable, Devs are
rewarded with teammates
◆ If Ops is overloaded, SREs are
contracted to support code
How are Development & Operations
teams organized?
Now tell me… Why should I hire you?
Systems, code…
Are you able to cook also?
➔ SREs are developer/sys-admin
hybrids
◆ They perform more Dev work as
things become stable
Development & Operations
➔ SRE can only spend up to 50% of their
time on ops work
➔ If operational load exceeds 50%, the ops
work overflows to Dev
➔ Allow SRE to move to other projects
Highly motivated and effective teamwork
Monitoring and Being
On-Call
➔ Three valid kinds of monitoring output
◆ Alerts: human needs to take action immediately
● If you get a huge volume of critical email alerts, disable them and stick with
paging
◆ Tickets: human needs to take action eventually
● On-call engineers can actually accomplish work when they aren’t being kept
up by pages at all hours. Ultimately, temporarily backing off on your alerts will
allow you to make faster progress toward a better service
◆ Logging: no action needed
Monitoring and taking action
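The three kinds of output map onto two questions: does a human need to act, and how soon? Here is a minimal routing sketch under that assumption; the enum and function names are hypothetical.

```python
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # human needs to take action immediately
    TICKET = "ticket"  # human needs to take action eventually
    LOG = "log"        # no action needed


def route(needs_human: bool, urgent: bool) -> Output:
    """Pick one of the three valid kinds of monitoring output."""
    if needs_human and urgent:
        return Output.ALERT
    if needs_human:
        return Output.TICKET
    return Output.LOG


print(route(needs_human=True, urgent=True))    # Output.ALERT
print(route(needs_human=True, urgent=False))   # Output.TICKET
print(route(needs_human=False, urgent=False))  # Output.LOG
```

Anything that does not fit one of these three outputs is a candidate for removal from the pipeline.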
➔ Maximum of 2 events per 8–12 hour on-call shift
➔ Handle the event accurately and quickly, clean up and restore
normal service
➔ Conducting postmortems
➔ If more than 2 events occur regularly per on-call shift,
problems can’t be investigated
➔ Pager fatigue also won’t improve with scale
➔ If they receive fewer than one event per shift, keeping them
on point is a waste of their time
Being On-Call
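The limits above can be tracked per rotation. The sketch below assumes the event count of each recent shift is recorded and uses the average as the measure of what happens "regularly"; that choice, and the labels returned, are assumptions.

```python
from statistics import mean


def on_call_load(events_per_shift: list[int],
                 max_events: float = 2, min_events: float = 1) -> str:
    """Classify a rotation by its average number of events per shift."""
    if not events_per_shift:
        raise ValueError("need at least one shift")
    average = mean(events_per_shift)
    if average > max_events:
        return "overloaded"   # too many events to investigate properly
    if average < min_events:
        return "underloaded"  # keeping people on call is poor use of time
    return "sustainable"


print(on_call_load([3, 4, 2, 5]))
print(on_call_load([0, 1, 0, 0]))
print(on_call_load([1, 2, 2, 1]))
```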
➔ Monitoring should never require a human to
interpret any part of the alerting domain
➔ The four golden signals of monitoring are
latency, traffic, errors, and saturation.
Start to focus on these four
“Don’t suggest, expose!”
Dashboards
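As an illustration, the four golden signals can be derived from a window of request records. This is a rough sketch with assumed names and fields: latency is reported as a nearest-rank percentile, and saturation is simplified to traffic divided by an assumed capacity, which real systems would usually measure from resource utilization instead.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    ok: bool           # False when the request failed


def golden_signals(requests: list[Request], window_seconds: float,
                   capacity_rps: float, percentile: int = 99) -> dict[str, float]:
    """Summarize latency, traffic, errors, and saturation for one window."""
    n = len(requests)
    if n == 0:
        raise ValueError("need at least one request")
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank percentile, computed with integer arithmetic.
    rank = -(-percentile * n // 100)  # ceil(percentile * n / 100)
    traffic = n / window_seconds
    return {
        "latency_ms": latencies[max(rank - 1, 0)],
        "traffic_rps": traffic,
        "error_rate": sum(not r.ok for r in requests) / n,
        "saturation": traffic / capacity_rps,  # simplifying assumption
    }


# Hypothetical window: 100 requests in 10 seconds, the 5 fastest ones failed.
sample = [Request(latency_ms=float(i), ok=(i > 5)) for i in range(1, 101)]
print(golden_signals(sample, window_seconds=10, capacity_rps=20))
```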
➔ An engineer can only react with urgency a
few times a day before they get fatigued
➔ Every page should be actionable
➔ Every page response should require
intelligence
➔ Pages should be about a new problem or
an event that hasn’t been seen before
Pager fatigue
A serious problem to be addressed
Root Cause Analysis: The Core of Problem
Solving and Corrective Action
by Duke Okes
https://www.amazon.com/Root-Cause-Analysis-Problem-Corrective/dp/0873897641
Find and eliminate all root causes
➔ When humans are really necessary, thinking through and recording best practices ahead
of time in a playbook or runbook yields roughly a 3x improvement in Mean Time To Repair (MTTR)
➔ SREs write and rely on on-call playbooks/runbooks
Example: http://docs.ansible.com/ansible/playbooks_intro.html
Playbooks/Runbooks
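MTTR itself is just an average of repair times, so it is easy to track before and after introducing playbooks. Here is a minimal sketch, assuming each incident is recorded as a (detected, restored) pair of timestamps; the sample incidents are made up.

```python
from datetime import datetime, timedelta


def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean Time To Repair: average of (restored - detected) per incident."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((restored - detected for detected, restored in incidents),
                timedelta())
    return total / len(incidents)


# Hypothetical incidents lasting 30 and 90 minutes.
incidents = [
    (datetime(2017, 1, 10, 9, 0), datetime(2017, 1, 10, 9, 30)),
    (datetime(2017, 1, 12, 14, 0), datetime(2017, 1, 12, 15, 30)),
]
print(mttr(incidents))
```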
A healthy monitoring and alerting pipeline
should be simple and easy to reason about
Monitoring Conclusion
What do I do with this?
➔ Always try to have a high-level stack overview
➔ Performance analysis of services like databases
often must be done on the system itself
➔ A dashboard might also be paired with a log, in
order to analyze historical correlations rapidly
Release Engineering
➔ All activities in between regular development and delivery of a software product
to the end user:
◆ i.e., integration, build, test execution, packaging and delivery of software
➔ “Accelerating the path from development to operations”
➔ A part of the SRE team where some more seasoned members are transitioned
there to conduct this highly important task
➔ Is an internal service
What is Release Engineering?
1. Use version control
2. Use the right building tool(s) for the job
3. Write simple and portable build files
4. Use a release process that is reproducible (CI process)
5. Use a package manager
6. Define upgrade process before reaching 1.0
7. Create detailed logs of changes made
8. Do “Canary”
9. Keep the big picture in mind
10. Apply these commandments to yourself
10 Commandments of Release Engineering
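Commandment 8, “Canary”, means exposing a new release to a small slice of traffic and comparing it with the current release before rolling out further. The following is a sketch of one possible promotion rule, assuming error rate is the only signal compared; the tolerance value and function names are hypothetical.

```python
def error_rate(errors: int, total: int) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def promote_canary(baseline: tuple[int, int], canary: tuple[int, int],
                   tolerance: float = 0.005) -> bool:
    """Promote only if the canary's error rate stays within `tolerance`
    of the baseline. Both arguments are (errors, total) pairs."""
    return error_rate(*canary) <= error_rate(*baseline) + tolerance


baseline = (10, 10_000)  # current release: 0.1% errors
print(promote_canary(baseline, canary=(2, 1_000)))   # 0.2% errors
print(promote_canary(baseline, canary=(30, 1_000)))  # 3.0% errors
```

A real rollout would usually also require a minimum amount of canary traffic before trusting the comparison.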
Collaboration
Developers, SREs, and release engineers work together
Postmortem culture
Learning from failure
➔ Document written for ALL significant incidents
➔ Non-paged incidents are even more valuable -
monitoring gaps
➔ Explain what happened in detail
➔ Find all root causes of the event
➔ Assign actions to correct the problem or improve how it
is addressed next time
What are Postmortems?
Postmortems?!
Postmortems Are Blameless!
➔ Use a blame free postmortem culture, with the
goal of exposing faults
◆ Apply engineering to fix these faults
◆ Don't just try to avoid or minimize them
Learn and teach with postmortems
Source: http://www.xkcd.com/1495/
SERIOUSLY: BLAMELESS!
The Field Guide to Understanding
Human Error
by Sidney Dekker
https://www.amazon.com/Field-Guide-Understanding-Human-Error/dp/0754648265
Conclusions
The S.R.E. Google Book
and more resources
● https://g.co/SREBook
● There is now #SRE on @hangops
Slack. https://t.co/btPgSGkGNz to
join.
QUESTIONS!
Evaluate This Session
THANK YOU!
WHAT DID YOU THINK?
We are hiring:
https://www.acquia.com/careers/open-positions
https://events.drupal.org/node/13519

More Related Content

What's hot

PAC 2020 Santorin - Joerek Van Gaalen
PAC 2020 Santorin - Joerek Van GaalenPAC 2020 Santorin - Joerek Van Gaalen
PAC 2020 Santorin - Joerek Van GaalenNeotys
 
Overview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practicesOverview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practicesAshutosh Agarwal
 
DevOps vs. Site Reliability Engineering (SRE) in Age of Kubernetes
DevOps vs. Site Reliability Engineering (SRE) in Age of KubernetesDevOps vs. Site Reliability Engineering (SRE) in Age of Kubernetes
DevOps vs. Site Reliability Engineering (SRE) in Age of KubernetesDevOps.com
 
DevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE ConceptsDevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE ConceptsRauno De Pasquale
 
Surviving the Script-apocalypse
Surviving the Script-apocalypseSurviving the Script-apocalypse
Surviving the Script-apocalypseDevOps.com
 
Security Certification or How I Learned to Stop Worrying & Love Stories - And...
Security Certification or How I Learned to Stop Worrying & Love Stories - And...Security Certification or How I Learned to Stop Worrying & Love Stories - And...
Security Certification or How I Learned to Stop Worrying & Love Stories - And...AgileNZ Conference
 
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015DevOpsDays Tel Aviv
 
Deep Dive into Disaster Recovery in the Cloud
Deep Dive into Disaster Recovery in the CloudDeep Dive into Disaster Recovery in the Cloud
Deep Dive into Disaster Recovery in the CloudBluelock
 
Serena Webcast: Accelerating Application Delivery with Continuous Testing
Serena Webcast: Accelerating Application Delivery with Continuous TestingSerena Webcast: Accelerating Application Delivery with Continuous Testing
Serena Webcast: Accelerating Application Delivery with Continuous TestingSerena Software
 
Key Measurements For Testers
Key Measurements For TestersKey Measurements For Testers
Key Measurements For TestersQA Programmer
 
2017 03-10 - vu amsterdam - testing safety critical systems
2017 03-10 - vu amsterdam - testing safety critical systems2017 03-10 - vu amsterdam - testing safety critical systems
2017 03-10 - vu amsterdam - testing safety critical systemsJaap van Ekris
 
DS Crisis Management Foundation - Lifecycle
DS Crisis Management Foundation - LifecycleDS Crisis Management Foundation - Lifecycle
DS Crisis Management Foundation - LifecycleDS
 
Verify Your Kubernetes Clusters with Upstream e2e tests
Verify Your Kubernetes Clusters with Upstream e2e testsVerify Your Kubernetes Clusters with Upstream e2e tests
Verify Your Kubernetes Clusters with Upstream e2e testsKen'ichi Ohmichi
 
ICPE2015
ICPE2015ICPE2015
ICPE2015swy351
 
Issre2010 malik
Issre2010 malikIssre2010 malik
Issre2010 malikSAIL_QU
 

What's hot (20)

PAC 2020 Santorin - Joerek Van Gaalen
PAC 2020 Santorin - Joerek Van GaalenPAC 2020 Santorin - Joerek Van Gaalen
PAC 2020 Santorin - Joerek Van Gaalen
 
Overview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practicesOverview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practices
 
DevOps vs. Site Reliability Engineering (SRE) in Age of Kubernetes
DevOps vs. Site Reliability Engineering (SRE) in Age of KubernetesDevOps vs. Site Reliability Engineering (SRE) in Age of Kubernetes
DevOps vs. Site Reliability Engineering (SRE) in Age of Kubernetes
 
DevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE ConceptsDevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE Concepts
 
Surviving the Script-apocalypse
Surviving the Script-apocalypseSurviving the Script-apocalypse
Surviving the Script-apocalypse
 
Security Certification or How I Learned to Stop Worrying & Love Stories - And...
Security Certification or How I Learned to Stop Worrying & Love Stories - And...Security Certification or How I Learned to Stop Worrying & Love Stories - And...
Security Certification or How I Learned to Stop Worrying & Love Stories - And...
 
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
 
SRE in Startup
SRE in StartupSRE in Startup
SRE in Startup
 
Deep Dive into Disaster Recovery in the Cloud
Deep Dive into Disaster Recovery in the CloudDeep Dive into Disaster Recovery in the Cloud
Deep Dive into Disaster Recovery in the Cloud
 
Serena Webcast: Accelerating Application Delivery with Continuous Testing
Serena Webcast: Accelerating Application Delivery with Continuous TestingSerena Webcast: Accelerating Application Delivery with Continuous Testing
Serena Webcast: Accelerating Application Delivery with Continuous Testing
 
Key Measurements For Testers
Key Measurements For TestersKey Measurements For Testers
Key Measurements For Testers
 
091414 Rufran's Acumen Fuse Tips & Tricks 01-14 (Issues on imported durations)
091414 Rufran's Acumen Fuse Tips & Tricks 01-14 (Issues on imported durations)091414 Rufran's Acumen Fuse Tips & Tricks 01-14 (Issues on imported durations)
091414 Rufran's Acumen Fuse Tips & Tricks 01-14 (Issues on imported durations)
 
043015 Rufran's Acumen Fuse Tips and Tricks 01-15 (Duration Translation Issue)
043015 Rufran's Acumen Fuse Tips and Tricks 01-15 (Duration Translation Issue)043015 Rufran's Acumen Fuse Tips and Tricks 01-15 (Duration Translation Issue)
043015 Rufran's Acumen Fuse Tips and Tricks 01-15 (Duration Translation Issue)
 
2017 03-10 - vu amsterdam - testing safety critical systems
2017 03-10 - vu amsterdam - testing safety critical systems2017 03-10 - vu amsterdam - testing safety critical systems
2017 03-10 - vu amsterdam - testing safety critical systems
 
DS Crisis Management Foundation - Lifecycle
DS Crisis Management Foundation - LifecycleDS Crisis Management Foundation - Lifecycle
DS Crisis Management Foundation - Lifecycle
 
Verify Your Kubernetes Clusters with Upstream e2e tests
Verify Your Kubernetes Clusters with Upstream e2e testsVerify Your Kubernetes Clusters with Upstream e2e tests
Verify Your Kubernetes Clusters with Upstream e2e tests
 
ICPE2015
ICPE2015ICPE2015
ICPE2015
 
113015 - Understanding ALAP
113015 - Understanding ALAP113015 - Understanding ALAP
113015 - Understanding ALAP
 
Fundamentals Performance Testing
Fundamentals Performance TestingFundamentals Performance Testing
Fundamentals Performance Testing
 
Issre2010 malik
Issre2010 malikIssre2010 malik
Issre2010 malik
 

Viewers also liked

Um milhao de usuários simultâneos
Um milhao de usuários simultâneosUm milhao de usuários simultâneos
Um milhao de usuários simultâneosFernando Ike
 
DOES15 - Jody Mulkey - DevOps in the Enterprise: A Transformation Journey
DOES15 - Jody Mulkey - DevOps in the Enterprise: A Transformation JourneyDOES15 - Jody Mulkey - DevOps in the Enterprise: A Transformation Journey
DOES15 - Jody Mulkey - DevOps in the Enterprise: A Transformation JourneyGene Kim
 
Docker containers & the Future of Drupal testing
Docker containers & the Future of Drupal testing Docker containers & the Future of Drupal testing
Docker containers & the Future of Drupal testing Ricardo Amaro
 
Open Source Tools for Container Security and Compliance @Docker LA Meetup 2/13
Open Source Tools for Container Security and Compliance @Docker LA Meetup 2/13Open Source Tools for Container Security and Compliance @Docker LA Meetup 2/13
Open Source Tools for Container Security and Compliance @Docker LA Meetup 2/13Zach Hill
 
How To Train Your APIs
How To Train Your APIsHow To Train Your APIs
How To Train Your APIsAshley Roach
 
Drupal workshop ist 2014
Drupal workshop ist 2014Drupal workshop ist 2014
Drupal workshop ist 2014Ricardo Amaro
 
Microservice architecture
Microservice architectureMicroservice architecture
Microservice architectureSlim Ouertani
 
Building a REST API Microservice for the DevNet API Scavenger Hunt
Building a REST API Microservice for the DevNet API Scavenger HuntBuilding a REST API Microservice for the DevNet API Scavenger Hunt
Building a REST API Microservice for the DevNet API Scavenger HuntAshley Roach
 
Drupalcamp es 2013 drupal with lxc docker and vagrant
Drupalcamp es 2013  drupal with lxc docker and vagrant Drupalcamp es 2013  drupal with lxc docker and vagrant
Drupalcamp es 2013 drupal with lxc docker and vagrant Ricardo Amaro
 
Introduction to Infrastructure as Code & Automation / Introduction to Chef
Introduction to Infrastructure as Code & Automation / Introduction to ChefIntroduction to Infrastructure as Code & Automation / Introduction to Chef
Introduction to Infrastructure as Code & Automation / Introduction to ChefNathen Harvey
 
Priming Your Teams For Microservice Deployment to the Cloud
Priming Your Teams For Microservice Deployment to the CloudPriming Your Teams For Microservice Deployment to the Cloud
Priming Your Teams For Microservice Deployment to the CloudMatt Callanan
 
Docker security: Rolling out Trust in your container
Docker security: Rolling out Trust in your containerDocker security: Rolling out Trust in your container
Docker security: Rolling out Trust in your containerRonak Kogta
 
DOXLON November 2016 - Data Democratization Using Splunk
DOXLON November 2016 - Data Democratization Using SplunkDOXLON November 2016 - Data Democratization Using Splunk
DOXLON November 2016 - Data Democratization Using SplunkOutlyer
 
Drupal workshop fcul_2014
Drupal workshop fcul_2014Drupal workshop fcul_2014
Drupal workshop fcul_2014Ricardo Amaro
 
Docker Security
Docker SecurityDocker Security
Docker SecurityBladE0341
 
The free software history and communities’ journey ahead
The free software history and communities’ journey aheadThe free software history and communities’ journey ahead
The free software history and communities’ journey aheadRicardo Amaro
 
Docker and Cloud - Enables for DevOps - by ACA-IT
Docker and Cloud - Enables for DevOps - by ACA-ITDocker and Cloud - Enables for DevOps - by ACA-IT
Docker and Cloud - Enables for DevOps - by ACA-ITStijn Wijndaele
 
DevOps meetup 16oct docker and jenkins
DevOps meetup 16oct docker and jenkinsDevOps meetup 16oct docker and jenkins
DevOps meetup 16oct docker and jenkinsBenoit Wilcox
 
Docker (compose) in devops - prague docker meetup
Docker (compose) in devops - prague docker meetupDocker (compose) in devops - prague docker meetup
Docker (compose) in devops - prague docker meetupJuraj Kojdjak
 

Viewers also liked (20)

Um milhao de usuários simultâneos
Um milhao de usuários simultâneosUm milhao de usuários simultâneos
Um milhao de usuários simultâneos
 
DOES15 - Jody Mulkey - DevOps in the Enterprise: A Transformation Journey
DOES15 - Jody Mulkey - DevOps in the Enterprise: A Transformation JourneyDOES15 - Jody Mulkey - DevOps in the Enterprise: A Transformation Journey
DOES15 - Jody Mulkey - DevOps in the Enterprise: A Transformation Journey
 
Docker containers & the Future of Drupal testing
Docker containers & the Future of Drupal testing Docker containers & the Future of Drupal testing
Docker containers & the Future of Drupal testing
 
Open Source Tools for Container Security and Compliance @Docker LA Meetup 2/13
Open Source Tools for Container Security and Compliance @Docker LA Meetup 2/13Open Source Tools for Container Security and Compliance @Docker LA Meetup 2/13
Open Source Tools for Container Security and Compliance @Docker LA Meetup 2/13
 
How To Train Your APIs
How To Train Your APIsHow To Train Your APIs
How To Train Your APIs
 
Drupal workshop ist 2014
Drupal workshop ist 2014Drupal workshop ist 2014
Drupal workshop ist 2014
 
Microservice architecture
Microservice architectureMicroservice architecture
Microservice architecture
 
Building a REST API Microservice for the DevNet API Scavenger Hunt
Building a REST API Microservice for the DevNet API Scavenger HuntBuilding a REST API Microservice for the DevNet API Scavenger Hunt
Building a REST API Microservice for the DevNet API Scavenger Hunt
 
Drupalcamp es 2013 drupal with lxc docker and vagrant
Drupalcamp es 2013  drupal with lxc docker and vagrant Drupalcamp es 2013  drupal with lxc docker and vagrant
Drupalcamp es 2013 drupal with lxc docker and vagrant
 
Introduction to Infrastructure as Code & Automation / Introduction to Chef
Introduction to Infrastructure as Code & Automation / Introduction to ChefIntroduction to Infrastructure as Code & Automation / Introduction to Chef
Introduction to Infrastructure as Code & Automation / Introduction to Chef
 
Priming Your Teams For Microservice Deployment to the Cloud
Priming Your Teams For Microservice Deployment to the CloudPriming Your Teams For Microservice Deployment to the Cloud
Priming Your Teams For Microservice Deployment to the Cloud
 
Docker security: Rolling out Trust in your container
Docker security: Rolling out Trust in your containerDocker security: Rolling out Trust in your container
Docker security: Rolling out Trust in your container
 
DOXLON November 2016 - Data Democratization Using Splunk
DOXLON November 2016 - Data Democratization Using SplunkDOXLON November 2016 - Data Democratization Using Splunk
DOXLON November 2016 - Data Democratization Using Splunk
 
DATA CENTER
DATA CENTER DATA CENTER
DATA CENTER
 
Drupal workshop fcul_2014
Drupal workshop fcul_2014Drupal workshop fcul_2014
Drupal workshop fcul_2014
 
Docker Security
Docker SecurityDocker Security
Docker Security
 
The free software history and communities’ journey ahead
The free software history and communities’ journey aheadThe free software history and communities’ journey ahead
The free software history and communities’ journey ahead
 
Docker and Cloud - Enables for DevOps - by ACA-IT
Docker and Cloud - Enables for DevOps - by ACA-ITDocker and Cloud - Enables for DevOps - by ACA-IT
Docker and Cloud - Enables for DevOps - by ACA-IT
 
DevOps meetup 16oct docker and jenkins
DevOps meetup 16oct docker and jenkinsDevOps meetup 16oct docker and jenkins
DevOps meetup 16oct docker and jenkins
 
Docker (compose) in devops - prague docker meetup
Docker (compose) in devops - prague docker meetupDocker (compose) in devops - prague docker meetup
Docker (compose) in devops - prague docker meetup
 

Similar to S.R.E - create ultra-scalable and highly reliable systems

A Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityA Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityAcquia
 
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...ITSM Academy, Inc.
 
Introduction to Lean Software Development
Introduction to Lean Software DevelopmentIntroduction to Lean Software Development
Introduction to Lean Software DevelopmentMichael Vax
 
Lessons from DevOps: Taking DevOps practices into your AppSec Life
Lessons from DevOps: Taking DevOps practices into your AppSec LifeLessons from DevOps: Taking DevOps practices into your AppSec Life
Lessons from DevOps: Taking DevOps practices into your AppSec LifeMatt Tesauro
 
SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...DevClub_lv
 
Webinar: Demonstrating Business Value for DevOps & Continuous Delivery
Webinar: Demonstrating Business Value for DevOps & Continuous DeliveryWebinar: Demonstrating Business Value for DevOps & Continuous Delivery
Webinar: Demonstrating Business Value for DevOps & Continuous DeliveryXebiaLabs
 
Using SaltStack to DevOps the enterprise
Using SaltStack to DevOps the enterpriseUsing SaltStack to DevOps the enterprise
Using SaltStack to DevOps the enterpriseChristian McHugh
 
Compliance Automation: detect & correct
Compliance Automation: detect & correctCompliance Automation: detect & correct
Compliance Automation: detect & correctKangaroot
 
The journey to Continuous Automation - Chef Automate
The journey to Continuous Automation - Chef AutomateThe journey to Continuous Automation - Chef Automate
The journey to Continuous Automation - Chef AutomateKangaroot
 
DevOps Transformation Solution Recommendation.pptx
DevOps Transformation Solution Recommendation.pptxDevOps Transformation Solution Recommendation.pptx
DevOps Transformation Solution Recommendation.pptxPrasannaKumarN8
 
Improving Regression Testing Effectiveness With Defect Detection Percentage (...
Improving Regression Testing Effectiveness With Defect Detection Percentage (...Improving Regression Testing Effectiveness With Defect Detection Percentage (...
Improving Regression Testing Effectiveness With Defect Detection Percentage (...DevOps.com
 
How to use Istio/Anthos to build Enterprise SRE
How to use Istio/Anthos to build Enterprise SREHow to use Istio/Anthos to build Enterprise SRE
How to use Istio/Anthos to build Enterprise SRETzung-Hsien (Shawn) Ho
 
Robert Mc Geachy Common Pitfalls Agile
Robert Mc Geachy Common Pitfalls AgileRobert Mc Geachy Common Pitfalls Agile
Robert Mc Geachy Common Pitfalls AgileRobert McGeachy
 
DOES14: Scott Prugh, CSG - DevOps and Lean in Legacy Environments
DOES14: Scott Prugh, CSG - DevOps and Lean in Legacy EnvironmentsDOES14: Scott Prugh, CSG - DevOps and Lean in Legacy Environments
DOES14: Scott Prugh, CSG - DevOps and Lean in Legacy EnvironmentsDevOps Enterprise Summmit
 
Planning presentation introduction to planning
Planning presentation introduction to planningPlanning presentation introduction to planning
Planning presentation introduction to planningUS-Analytics
 
DevOps Roadtrip Final Speaking Deck
DevOps Roadtrip Final Speaking Deck DevOps Roadtrip Final Speaking Deck
DevOps Roadtrip Final Speaking Deck VictorOps
 
Migrating Your Apps to the Cloud: How to do it and What to Avoid
Migrating Your Apps to the Cloud: How to do it and What to AvoidMigrating Your Apps to the Cloud: How to do it and What to Avoid
Migrating Your Apps to the Cloud: How to do it and What to AvoidVMware Tanzu
 

S.R.E - create ultra-scalable and highly reliable systems

  • 2. S.R.E. create ultra-scalable and highly reliable systems Ricardo Amaro DevOps - https://events.drupal.org/node/13519
  • 3. Who am I? @Drupal @ricardoamaro Portugal Lisbon Drupal Community Family +8 years Drupal 90’s Linux Adopter 5 years at Acquia Site Reliability Engineer, Senior Tier2 Ops https://drupal.org/user/666176
  • 4. About Acquia Metrics ○ Acquia Cloud: ○ # of Instances (17,200+) ○ # of Production Sites (54,000+) ○ # API Calls (3,000 + per sec) ○ # Of Availability Zones (20+) ○ # Of Regions (8)
  • 5. We will talk about A brief summary inspired by Google’s S.R.E. book ○ What is S.R.E.? ○ Tenets of S.R.E. ○ Reliability & Toil ○ Error budget - keeping the Service Level Objective (SLO) ○ Development & Operations ○ Monitoring and Being On-Call ○ Release Engineering ○ Postmortem culture - Learning from failure
  • 7. ➔ Term coined at Google in 2003, ➔ when Ben Treynor was hired to run “production” and ended up “applying software engineering to an operations function” ➔ Motivation: “as a software engineer, how would I want to invest my time to accomplish a set of repetitive tasks?” Site Reliability Engineering
  • 8. ➔ SRE is taken seriously by major companies Site Reliability Engineering Microsoft Apple Amazon
  • 9. SRE’s are engineers that... ➔ Apply the principles of computer science and engineering to design and develop large, distributed computing systems. ➔ Write software for those systems alongside product developers. ➔ Build all additional pieces those systems need, like backups and load balancing. ➔ Reuse old solutions for new problems. Site Reliability Engineering
  • 10. DevOps & S.R.E. DevOps is a practice, which was coined around 2008, that encompasses automation of manual tasks, continuous integration and continuous delivery. It applies to a wide audience of companies whereas SRE might be considered a subset of DevOps that possesses additional skill sets. Source: https://en.wikipedia.org/wiki/Site_reliability_engineering
  • 12. Tenets of SRE 1. Ensuring a Durable Focus on Engineering 2. Pursuing Maximum Change Velocity 3. Monitoring 4. Emergency Response 5. Change Management 6. Demand Forecasting and Capacity Planning 7. Provisioning 8. Efficiency and Performance
  • 13. ➔ Hire only coders ➔ Have Service Level Objectives (SLOs) for your service ➔ Measure and report performance against SLOs ➔ Use Error Budgets and gate launches on them ➔ Have a Common staffing pool for SRE and DEV ➔ Excess Ops work overflows to DEV team ➔ Cap SRE operational load at 50% and share 5% with the DEV team ➔ On-call teams at least 8 or 6 people in rotation, per product ➔ Maximum of 2 events per on-call shift ➔ Post mortem for every event ➔ Post mortems are BLAMELESS and focus on process and technology, not people How to achieve S.R.E. Treynor’s Action items IMPORTANT IMPORTANT
  • 15. What is the most important feature of a product? The latest feature, or that the product works?
  • 16. How about the “503” feature ? ...most important thing is that the product works!
  • 17. “Reliability is the most fundamental feature of any product.” Ben Treynor, Google’s VP for 24/7 Operations
  • 18. The 80’s Waterfall software delivery model Operations @customer ➔ *Provisioning ➔ *Installing ➔ *Upgrading ➔ *Maintaining ➔ *Backups/Restore ➔ *Scaling Source: wikipedia
  • 19. Then came the web... ● Software as a Service ● Platform as a Service ● Cloud computing ● ... ➔ Operations overhead not on the customer side ➔ Features could now be delivered faster ➔ Customer feedback important for product improvements Product Development Ship Features Operations Users
  • 20. Opposite rewarding conflicts Objectives: ➔ Ship new features ➔ Launch new products Objectives: ➔ Reliability & Availability ➔ Provision & Scale Dev Ops
  • 21. The problem: Toil* *exhausting labour ➔ Manual ➔ Repetitive ➔ Automatable ➔ Tactical (Unplanned work) ➔ No enduring value ➔ O(n) with service growth (not just “work I don’t like to do.”)
  • 22. An Old Solution to Toil ● Scale with bodies In the old operations model, you throw people at a reliability problem and keep pushing (sometimes for a year or more) until the problem either goes away or blows up in your face.
  • 23. As your business succeeds, workload tends to infinity ● Cap Ops Workload If you are successful and your business grows, you need to reduce errors and toil. Put a 50% cap on Ops work and leave most of the SRE team’s time for writing code and reducing toil. Workload/Toil over time (x: time, y: customers/traffic)
  • 24. ➔ Keep operational work (i.e., toil) below 50% of each SRE’s time ➔ More than 50% of each SRE’s time is spent on: ◆ Engineering project work to reduce toil ◆ Adding service features: improving reliability, performance, utilization ➔ Improves career planning for the SRE ➔ Improves morale in the organization ➔ An SRE team can easily devolve into an Ops team if the 50% target is broken Why is less Toil better? S.R.E. - A modern solution not bad...
  • 25. S.R.E. - A modern solution DEV + OPS ➔ This conflict is not inevitable ➔ The solution is: Error Budgets! ➔ Everyone agrees on an Error Budget (as we will explain next) ➔ SRE only prevents releases or Launches if the Error Budget is exceeded. Dev Ops
  • 27. ➔ SLO - Service level objective is agreed as a means of measuring the performance of the Service Provider. ➔ SLA - Service Level Agreement specifies what service is to be provided, how it is supported, times, locations, costs, performance, and responsibilities of the parties involved. SLOs are specific measurable characteristics of the SLA such as availability, throughput, frequency, response time, or quality. ➔ SLI - Service Level Indicator is a measure of the service level provided by a service provider to a customer. SLIs form the basis of Service Level Objectives (SLOs), which in turn form the basis of Service Level Agreements (SLAs). SLO, SLA & SLI Terminology
  • 28. What is an Error Budget? The business or the product establishes Service Level Objectives (SLOs) for the system, based on Service Level indicators such as error rate, availability or latency... Error Budget Example: A 99.9% availability SLO means that the service can be 0.1% unavailable, which is the error budget. 100% - 99.9% = 0.1%
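The arithmetic on this slide can be sketched in a few lines of Python. This is a minimal illustration, not something from the deck: the function names `error_budget` and `allowed_downtime_minutes` and the 30-day window are assumptions chosen for the example.

```python
def error_budget(slo_percent: float) -> float:
    """Return the error budget as a fraction, e.g. 99.9 -> roughly 0.001."""
    if not 0 < slo_percent <= 100:
        raise ValueError("SLO must be a percentage in (0, 100]")
    return (100.0 - slo_percent) / 100.0


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Translate an availability SLO into minutes of tolerated downtime.

    The 30-day default window is an assumption for illustration only.
    """
    window_minutes = window_days * 24 * 60
    return error_budget(slo_percent) * window_minutes


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(slo)
        print(f"{slo}% SLO -> {minutes:.1f} minutes of downtime per 30 days")
```

Expressing the budget in minutes per window tends to make the trade-off easier to discuss than a bare percentage.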
  • 29. ➔ 100% is the wrong reliability target for basically everything. ➔ Set a goal that acknowledges the trade-off and leaves an error budget ➔ Error budget can be spent on anything: launching features, etc. ➔ Error budget allows for discussion about how phased rollouts and 1% experiments can maintain tolerable levels of errors. ➔ Goal of SRE team isn’t “zero outages” – SRE and product devs are incentive aligned to spend the error budget to get maximum feature velocity. ➔ Out of Budget? No problems. Do more testing between releases. How to obtain the Error Budget
  • 30. ➔ This gives developers an incentive to value stability (not just change) ➔ And gives SREs the control to permit change (not just stability) ➔ It forces decisions based on metrics, not politics or feelings, just data Error Budget A Self-regulating mechanism
  • 32. ➔ Development and SRE teams share a single staffing pool ◆ If all is reliable, Devs are rewarded with teammates ◆ If Ops is overloaded, SREs are contracted to support code How are Development & Operations teams organized? Now tell me… Why should I hire you?
  • 33. ➔ SREs are developer/sys-admin hybrids ◆ They perform more Dev work as things become stable Development & Operations Systems, code… Are you able to cook also?
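One way to picture the self-regulating mechanism is a release gate driven only by measured data. The sketch below is hypothetical: `remaining_error_budget` and `can_release` are names invented here, and it assumes the SLO is measured as the fraction of successful requests in the current window.

```python
def remaining_error_budget(slo: float, total_requests: int, failed_requests: int) -> float:
    """Return how many more failed requests the window can absorb.

    `slo` is a fraction such as 0.999. A negative result means the
    error budget has been exceeded.
    """
    allowed_failures = (1.0 - slo) * total_requests
    return allowed_failures - failed_requests


def can_release(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Allow a launch unless the error budget is already exceeded."""
    return remaining_error_budget(slo, total_requests, failed_requests) >= 0


if __name__ == "__main__":
    # With a 99.9% SLO and one million requests, about 1000 failures are tolerated.
    print(can_release(0.999, 1_000_000, 500))   # budget left, release may proceed
    print(can_release(0.999, 1_000_000, 1500))  # budget exceeded, release is held
```

Because the gate depends only on the SLO and the counters, neither team has to argue the decision case by case.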
  • 34. ➔ SRE can only spend up to 50% of their time on ops work ➔ If operational load exceeds 50%, the ops work overflows to Dev ➔ Allow SRE to move to other projects Highly motivated and effective teamwork
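The 50% cap can be expressed as a small calculation. This is only a sketch under assumed inputs (weekly hours tracked per SRE); `ops_overflow_hours` is a hypothetical helper, not part of any tool mentioned in the talk.

```python
def ops_overflow_hours(ops_hours: float, total_hours: float, cap: float = 0.5) -> float:
    """Return the operational hours above the cap that should overflow to Dev.

    `cap` is the maximum fraction of time an SRE spends on ops work
    (0.5 per the slide). Returns 0.0 when the load is under the cap.
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return max(0.0, ops_hours - cap * total_hours)


if __name__ == "__main__":
    # An SRE with a 40-hour week who spent 30 hours on ops work
    # is 10 hours over the 50% cap.
    print(ops_overflow_hours(30, 40))
    print(ops_overflow_hours(15, 40))
```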
  • 36. ➔ Three valid kinds of monitoring output ◆ Alerts: human needs to take action immediately ● If you get a huge volume of critical email alerts disable them and stick with paging ◆ Tickets: human needs to take action eventually ● On-call engineers can actually accomplish work when they aren’t being kept up by pages at all hours. Ultimately, temporarily backing off on our alerts will allow you to make faster progress toward a better service ◆ Logging: no action needed Monitoring and taking action
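The three kinds of output map naturally onto a tiny routing function. The following is an illustrative sketch; the `Output` enum and `route` function are assumptions made for this example and do not correspond to a specific monitoring product.

```python
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # a human needs to take action immediately
    TICKET = "ticket"  # a human needs to take action eventually
    LOG = "log"        # no action needed, kept for later analysis


def route(needs_human: bool, urgent: bool) -> Output:
    """Pick the monitoring output for a signal, following the slide's three categories."""
    if needs_human and urgent:
        return Output.ALERT
    if needs_human:
        return Output.TICKET
    return Output.LOG


if __name__ == "__main__":
    print(route(needs_human=True, urgent=True))
    print(route(needs_human=True, urgent=False))
    print(route(needs_human=False, urgent=False))
```

Note that a signal which is urgent but needs no human falls through to logging in this sketch, since automation is expected to handle it.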
  • 37. ➔ Maximum of 2 events per 8–12 hour on-call shift ➔ Handle the event accurately and quickly, clean up and restore normal service ➔ Conduct postmortems ➔ If more than 2 events occur regularly per on-call shift, problems can’t be investigated ➔ Pager fatigue also won’t improve with scale ➔ If engineers receive fewer than one event per shift, keeping them on point is a waste of their time Being On-Call
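The on-call thresholds above lend themselves to a simple health check over recent shifts. This is a rough sketch: `on_call_load` and its labels are invented for illustration, and averaging over shifts is one assumed reading of "regularly".

```python
from statistics import mean
from typing import Sequence


def on_call_load(events_per_shift: Sequence[int], max_events: int = 2, min_events: int = 1) -> str:
    """Classify a rotation from the number of events seen in each recent shift.

    More than `max_events` on average leaves no time to investigate;
    fewer than `min_events` suggests the rotation is not a good use of time.
    """
    if not events_per_shift:
        raise ValueError("need at least one shift of data")
    average = mean(events_per_shift)
    if average > max_events:
        return "overloaded"
    if average < min_events:
        return "underused"
    return "healthy"


if __name__ == "__main__":
    print(on_call_load([3, 4, 2]))
    print(on_call_load([1, 2, 2]))
    print(on_call_load([0, 1, 0]))
```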
  • 38. ➔ Monitoring should never require a human to interpret any part of the alerting domain ➔ The four golden signals of monitoring are latency, traffic, errors, and saturation. Start to focus on these four “Don’t suggest, expose!” Dashboards
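As a starting point for the four golden signals, the sketch below summarizes one measurement window. Everything here is an assumption for illustration: the `GoldenSignals` dataclass, the `summarize` function, the choice of a 95th-percentile latency, and the idea that saturation is reported as used capacity over total capacity.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class GoldenSignals:
    latency_p95_ms: float  # latency: how long requests take
    traffic_rps: float     # traffic: demand on the system
    error_rate: float      # errors: fraction of failed requests
    saturation: float      # saturation: how full the service is


def summarize(
    latencies_ms: Sequence[float],
    error_count: int,
    window_seconds: float,
    used_capacity: float,
    total_capacity: float,
) -> GoldenSignals:
    """Compute the four signals for one window of request data."""
    n = len(latencies_ms)
    if n == 0 or window_seconds <= 0 or total_capacity <= 0:
        raise ValueError("need requests, a positive window and a positive capacity")
    ordered = sorted(latencies_ms)
    # Nearest-rank percentile: ceil(0.95 * n), computed with integer math.
    rank = (95 * n + 99) // 100
    return GoldenSignals(
        latency_p95_ms=ordered[rank - 1],
        traffic_rps=n / window_seconds,
        error_rate=error_count / n,
        saturation=used_capacity / total_capacity,
    )


if __name__ == "__main__":
    signals = summarize(list(range(1, 101)), error_count=5,
                        window_seconds=50, used_capacity=6, total_capacity=8)
    print(signals)
```

A dashboard built on these four numbers exposes the state of the service directly rather than asking a human to interpret raw data.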
  • 39. ➔ An engineer can only react with urgency a few times a day before they get fatigued ➔ Every page should be actionable ➔ Every page response should require intelligence ➔ Pages should be about a new problem or an event that hasn’t been seen before Pager fatigue A serious problem to be addressed
  • 40. Root Cause Analysis: The Core of Problem Solving and Corrective Action by Duke Okes https://www.amazon.com/Root-Cause-Analysis-Problem-Corrective/dp/0873897641 Find and eliminate all root causes
  • 41. ➔ When humans are really necessary, thinking through and recording the best practices ahead of time in a playbook or runbook yields roughly a 3x improvement in Mean Time To Repair (MTTR) ➔ SREs write and rely on on-call playbooks/runbooks Example: http://docs.ansible.com/ansible/playbooks_intro.html Playbooks/Runbooks
  • 42. A healthy monitoring and alerting pipeline should be simple and easy to reason about Monitoring Conclusion What do I do with this? ➔ Always try to have a high level stack overview ➔ Despite performance of services like databases often must be performed on the system itself ➔ A dashboard might also be paired with a log, in order to analyze historical correlations rapidly
  • 44. ➔ All activities in between regular development and delivery of a software product to the end user: ◆ i.e., integration, build, test execution, packaging and delivery of software ➔ “Accelerating the path from development to operations” ➔ A part of the SRE team where some more seasoned members are transitioned there to conduct this highly important task ➔ Is an internal service What is Release Engineering?
  • 45. 1. Use version control 2. Use the right building tool(s) for the job 3. Write simple and portable build files 4. Use a release process that is reproducible (CI process) 5. Use a package manager 6. Define upgrade process before reaching 1.0 7. Create detailed logs of changes made 8. Do “Canary” 9. Keep the big picture in mind 10. Apply these commands to yourself 10 Commandments of Release Engineering
  • 46. Collaboration developers, SRE’s and release engineers work together
  • 48. ➔ Document written for ALL significant incidents ➔ Non-paged incidents are even more valuable - monitoring gaps ➔ Explain what happened in detail ➔ Find all root causes of the event ➔ Assign actions to correct the problem or improve how it is addressed next time What are Postmortems? Postmortems?!
  • 49. Postmortems Are Blameless! ➔ Use a blame-free postmortem culture, with the goal of exposing faults ◆ Apply engineering to fix these faults ◆ Try not to just avoid or minimize them
  • 50. Learn and teach with postmortems Source: http://www.xkcd.com/1495/
  • 51. SERIOUSLY: BLAMELESS! The Field Guide to Understanding Human Error by Sidney Dekker https://www.amazon.com/Field-Guide-Understanding-Human-Error/dp/0754648265
  • 53. The S.R.E. Google Book and more resources ● https://g.co/SREBook ● There is now #SRE on @hangops Slack. https://t.co/btPgSGkGNz to join.
  • 55. Evaluate This Session THANK YOU! WHAT DID YOU THINK? We are hiring: https://www.acquia.com/careers/open-positions https://events.drupal.org/node/13519