Site (Service) Reliability Engineering

SITE RELIABILITY
ENGINEERING*
SEEN FROM DEVOPS AND AGILE PERSPECTIVES
*SERVICE
M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY
OWN

GAPS IN AGILE, DEVOPS APPROACHES
WHY ADDITIONAL OR SUPPLEMENTARY APPROACHES ARE NEEDED
*EDITORIAL

HOW OPS GETS OVERLOOKED
• No obvious “product” release cycle
• Keeping complex systems running is not primarily a software
problem
• Ops troubleshooting may not follow any SDLC model
• Some Ops entail managing systems for which no code is readily
available

PHILOSOPHICAL NOTES
• Technical approaches to privacy are inextricably tied to security
• Similarly, reliability engineering is also tied to security
• -- and not just “Availability”
• Quality engineering comfortably straddles both Dev and Ops
• Most quality engineering in practice is pure Ops
• Software engineering has immature notions of quality
• Supporting legacy systems may be more Ops than Dev

USE CASES
• Call center operations
• Field service
• Sales, sales support
• Most of health care (17.8% of US GDP spending)
• Rework and repair (all sectors)
• Financial services
• Government operations (e.g., voting systems, regulation, transportation management)
• Utilities
• Even the less obvious: decision support

SOFTWARE SUPPORTS OPS, BUT . . .
• Complex systems lack human-machine controls
• Humans are almost always “man in the middle” by design
• Ops were not designed to be automated
• Software only lightly mitigates labor increases when service
load increases
• Ops must encompass non-automated tasks

SITE RELIABILITY ENGINEERING
Edited by Betsy Beyer, Chris Jones, Jennifer Petoff
and Niall Richard Murphy
(O’Reilly). Copyright 2016 Google, Inc., 978-1-491-92912-4.

SITE RELIABILITY WORKBOOK
Edited by Betsy Beyer, Niall Richard Murphy,
David K. Rensin, Kent Kawahara and Stephen
Thorne
O’Reilly Media
Source

CREDIT GOOGLE
GOOGLE DEVELOPED SRE AND PUBLISHES A FREE ONLINE TEXT.
BEN TREYNOR SLOSS ORIGINATED THE TERM.

GOOGLE’S DEFINITION
“SRE IS WHAT YOU GET WHEN YOU TREAT OPERATIONS AS IF IT’S A SOFTWARE
PROBLEM. OUR MISSION IS TO PROTECT, PROVIDE FOR, AND PROGRESS THE SOFTWARE
AND SYSTEMS BEHIND ALL OF GOOGLE’S PUBLIC SERVICES — GOOGLE SEARCH, ADS,
GMAIL, ANDROID, YOUTUBE, AND APP ENGINE, TO NAME JUST A FEW — WITH AN EVER-
WATCHFUL EYE ON THEIR AVAILABILITY, LATENCY, PERFORMANCE, AND CAPACITY.”
SOURCE

WHAT IS IT?
• Quasi open standardized process (vs. “standard”)
• Scalable, proven (albeit inside deep pocket enterprises)
• Begun in 2003, it predated DevOps
• Left-shift Sysadmin functions
• But with healthy skills in layers 1-3 in UNIX network stack

IS IT DEVOPS?
• “. . . We are distinct from the industry term DevOps, because
although we definitely regard infrastructure as code, we
have reliability as our main focus. Additionally, we are strongly
oriented toward removing the necessity for operations—
see The Evolution of Automation at Google for more details.”

IS IT DEVOPS? (PER GOOGLE)
“One could view DevOps as a generalization of several core SRE
principles to a wider range of organizations, management
structures, and personnel. One could equivalently view SRE as a
specific implementation of DevOps with some idiosyncratic
extensions.” (Chapter 1)

OPS SRE RESPONSIBILITIES
• Availability
• Latency
• Performance [sic]
• Efficiency*
• Change Management
• Monitoring*
• Emergency Response
• Capacity Planning

HOW SRE LEFT-SHIFTS OPS
• No more than 50% duty in Ops
• Remaining 50% is “coding skills on project work”
• Heavy reliance on “blame-free postmortem culture”
• Ed: Quality principle
• Ed: Implies analytics, evidence-, data-driven processes

SRE EVENT ANALYTICS
• Max of two events per 8/12 hr on-call shift
• No equivalent to these events in software engineering
• Tied to monitoring (alerts, tickets, logging)
• Emergency response is a useful event + event metrics
• MTTF and MTTR – MTTR is key
• Playbook* building as synthetic event / scenario construction
• “We have found that thinking through and recording the best practices ahead of time
in a ‘playbook’ produces roughly a 3x improvement in MTTR as compared to the
strategy of ‘winging it.’”
• “Wheel of Misfortune” (software engineering equivalent: Adversarial testing?)
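Ed: A minimal sketch of the event metrics above, assuming incidents are recorded as (start, end) pairs in hours within an observation window. The function names and sample data are hypothetical, and availability is approximated with the common MTTF / (MTTF + MTTR) formula. The last line applies the "roughly 3x" playbook improvement quoted above as an assumption, not a measured result.

```python
from statistics import mean

def mttr(incidents):
    """Mean time to repair: average outage duration in hours."""
    return mean(end - start for start, end in incidents)

def mttf(incidents, window_start, window_end):
    """Mean time to failure: total uptime in the window divided by failures."""
    total_down = sum(end - start for start, end in incidents)
    uptime = (window_end - window_start) - total_down
    return uptime / len(incidents)

def availability(mttf_hours, mttr_hours):
    """Steady-state availability approximation."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Illustrative data: three outages in a 720-hour (30-day) window.
incidents = [(10.0, 11.5), (200.0, 200.5), (500.0, 501.0)]
repair = mttr(incidents)
failure = mttf(incidents, 0.0, 720.0)
baseline = availability(failure, repair)
# Assumed effect of a playbook: MTTR shrinks by about a factor of three.
with_playbook = availability(failure, repair / 3)
```

Because MTTR sits in the denominator, cutting it raises availability even when failures are no less frequent, which is one way to read "MTTR is key."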

CHANGE MANAGEMENT IN @RL
• “SRE: 70% of outages due to changes in a live system.”
• SRE automation enables:
• Progressive rollouts (Ed: not just “promote to QA”)
• Rapid problem diagnosis
• Automated rollback (Ed: typically not an app “requirement”)
• Mitigate user exposure to service disruptions
• Automation reduces impact of fatigue, familiarity/contempt, challenges of
highly repetitive tasks
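Ed: A minimal sketch of a progressive rollout with automated rollback, assuming the new version is exposed to increasing fractions of traffic and an error rate can be observed at each stage. The function, stage fractions, and threshold are hypothetical illustrations, not Google's tooling.

```python
def progressive_rollout(stages, error_rate_for, max_error_rate):
    """Advance through traffic stages; roll back at the first unhealthy one.

    stages: increasing traffic fractions, e.g. [0.01, 0.1, 0.5, 1.0]
    error_rate_for: callable returning the observed error rate at a stage
    Returns (final_traffic_fraction, history of (fraction, error_rate)).
    """
    history = []
    for fraction in stages:
        observed = error_rate_for(fraction)
        history.append((fraction, observed))
        if observed > max_error_rate:
            # Automated rollback: the new version receives no traffic.
            return 0.0, history
    return stages[-1], history

stages = [0.01, 0.1, 0.5, 1.0]
healthy = {0.01: 0.001, 0.1: 0.001, 0.5: 0.002, 1.0: 0.002}
regressed = {0.01: 0.001, 0.1: 0.05, 0.5: 0.05, 1.0: 0.05}

good_result, good_history = progressive_rollout(stages, healthy.get, 0.01)
bad_result, bad_history = progressive_rollout(stages, regressed.get, 0.01)
```

In the regressed case the rollout stops at 10% of traffic, so most users never see the faulty change.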

SRE TACKLES PLANNING, CAPACITY
• Dev rarely has eyes on metrics, processes for provisioning
• Provisioning is higher risk than load shifting: a class of Ops use cases
• Dev rarely accounts for ingest of demand data streams
• Dev has little insight into aperiodic spikes, trends, schedules,
dependencies
• Weather, cascading power outages
• Resource utilization entails variables Dev may be blind to
• Monitoring must utilize alerting from time series data (Few
devs get it)
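Ed: A minimal sketch of alerting from time series data, assuming a metric sampled at regular intervals. The alert fires only when every sample in a sliding window exceeds the threshold, which suppresses one-off spikes. The function name, threshold, and samples are hypothetical.

```python
from collections import deque

def sustained_alerts(samples, threshold, window):
    """Return sample indices at which the last `window` values all exceed threshold."""
    recent = deque(maxlen=window)
    fired = []
    for index, value in enumerate(samples):
        recent.append(value)
        if len(recent) == window and all(v > threshold for v in recent):
            fired.append(index)
    return fired

# Illustrative utilization samples; the isolated spike at index 1 is ignored.
utilization = [0.2, 0.9, 0.3, 0.85, 0.9, 0.95, 0.4]
alerts = sustained_alerts(utilization, threshold=0.8, window=3)
```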

SRE LEFT-SHIFTED COMPONENTS
• Abstract Machine (Apache Mesos-like)
• Distributed Storage
• OpenFlow-based SDN
• Prometheus-like Monitoring & Alerting for:
• Acute incidents
• A/B and E1/E2 comparisons

DEV FOR OPS @GOOGLE
• Single shared repo
• “All software is reviewed before being submitted”
• Even large builds are fast
• Same infrastructure for continuous testing

SOFTWARE-CENTRIC OPS
“Unlike traditional operations groups, we view software as the
primary tool through which our systems are managed,
maintained, and minded; to that end, we have the source-level
access and moral authority required to fix, extend and scale code
to keep it working, harden it against the vagaries of the Internet,
and develop our own planet-scale platforms.”

“FULL DEPTH OF THE STACK”
“In Google, we have the good fortune to have developed many
large systems ranging from planet-spanning databases to near
real-time scalable data warehousing to fault-tolerant datastream
joining. In SRE, we flip between the fine-grained detail of disk
driver IO scheduling to the big picture of continental-level
service capacity, across a range of systems and a user population
measured in billions. We own those products in production. We
drive reliability and performance across massive scale by
mastering the full depth of the stack.”

PRINCIPLES
• Embracing Risk (Ed: Listen up, FinTechs)
• Service Level Objectives
• Eliminating Toil (Ed: More than efficiency, velocity)
• Monitor (Ed: Integrated monitoring)
• Release Engineering
• Simplicity (Ed: Complexity evolved from simplicity?)

RISK MANAGEMENT IN SRE
“We strive to make a service reliable enough, but
no more reliable than it needs to be. That is, when we set an
availability target of 99.99%, we want to exceed it, but not by
much: that would waste opportunities to add features to the
system, clean up technical debt, or reduce its operational costs.
In a sense, we view the availability target as both a minimum and
a maximum. The key advantage of this framing is that it unlocks
explicit, thoughtful risktaking.” Source
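Ed: A minimal sketch of what an availability target means in clock time, assuming availability is measured as uptime over a fixed period. The function name and the 90-day period are illustrative assumptions.

```python
def allowed_downtime_minutes(availability_target, period_days):
    """Minutes of downtime the target tolerates over the period."""
    return (1.0 - availability_target) * period_days * 24 * 60

# Each extra nine cuts the tolerated downtime by a factor of ten.
for target in (0.999, 0.9999, 0.99999):
    minutes = allowed_downtime_minutes(target, 90)
    print(f"{target:.3%}: {minutes:.2f} minutes per 90 days")
```

At 99.99% the allowance is about 13 minutes per 90 days, which shows why exceeding the target "by much" is expensive.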

SRE RISK PROCESS INSIGHTS
• Risk tolerance of consumer services
• Differential impact of failure types on product/service offering
• Google Apps for Business vs. Consumer
• Cost vs. availability (“an extra nine of availability means . . .”)
• Google + Google Partner latency objectives

SRE “ERROR BUDGET”
“In order to base these decisions [product velocity vs. reliability] on
objective data, the two teams jointly define a quarterly error budget
based on the service’s service level objective, or SLO (see Service Level
Objectives). The error budget provides a clear, objective metric that
determines how unreliable the service is allowed to be within a single
quarter. This metric removes the politics from negotiations between
the SREs and the product developers when deciding how much risk to
allow.”
“The main benefit of an error budget is that it provides a common
incentive that allows both product development and SRE to focus on
finding the right balance between innovation and reliability.”
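Ed: A minimal sketch of an error budget, assuming a request-based SLO (fraction of successful requests) measured over one quarter. The function names and numbers are hypothetical, and gating releases on a positive remaining budget is shown as one possible policy, not the only one.

```python
def error_budget(slo, total_requests):
    """Number of failed requests the SLO tolerates in the period."""
    return (1.0 - slo) * total_requests

def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the budget still unspent; negative means overspent."""
    budget = error_budget(slo, total_requests)
    return (budget - failed_requests) / budget

def can_release(slo, total_requests, failed_requests):
    """Assumed policy: allow risky changes only while budget remains."""
    return budget_remaining(slo, total_requests, failed_requests) > 0

# Illustrative quarter: 99.9% SLO over one million requests.
budget = error_budget(0.999, 1_000_000)
remaining = budget_remaining(0.999, 1_000_000, 250)
release_ok = can_release(0.999, 1_000_000, 250)
release_blocked = can_release(0.999, 1_000_000, 1_500)
```

Both teams read the same number, which is what lets the budget replace negotiation with measurement.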

KEY INSIGHT
Ed: Ops has a perspective on product performance that Dev will
rarely have. SRE leverages this by integrating processes to
monitor and manage the product while making improvements.

SERVICE ABSTRACTIONS
• SLA: Set by product owners, not SRE
• SLI Service Level Indicator (Ed: Domain specific dependent
measure)
• SLO Service Level Objective (Ed: Complex target range of
values; sets expectations)
• Agreements (usually, what happens when SLO not met)
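Ed: A minimal sketch of how an SLI relates to an SLO, assuming a latency SLI defined as the fraction of requests served within a threshold. The threshold, target, and sample latencies are hypothetical.

```python
def latency_sli(latencies_ms, threshold_ms):
    """SLI: fraction of requests answered within the latency threshold."""
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)

def meets_slo(sli, slo_target):
    """SLO: the target the indicator is expected to reach or exceed."""
    return sli >= slo_target

# Illustrative sample of request latencies in milliseconds.
latencies = [120, 80, 300, 95, 110, 450, 70, 101, 99, 130]
sli = latency_sli(latencies, threshold_ms=200)
within_objective = meets_slo(sli, slo_target=0.9)
```

The SLI is the measurement, the SLO is the target for it, and an SLA would add the consequences of missing that target.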

OPS-DRIVEN TARGET GOALS
“Choosing targets (SLOs) is not a purely technical activity
because of the product and business implications, which should
be reflected in both the SLIs and SLOs (and maybe SLAs) that are
selected. Similarly, it may be necessary to trade off certain
product attributes against others within the constraints posed by
staffing, time to market, hardware availability, and funding.”
• SRE Ops-driven concepts: safety margin, throttling, systems
engineering (mod configs, OS tuning, load balancing, physical
updates)

SRE KEY MONITORING INSIGHT
“Monitoring a complex application is a significant engineering
endeavor in and of itself.”
Ed: Software engineering is 7-20 years away from fully
integrating monitoring concepts into IDEs

ALERTING INSIGHTS
• Human alerts must be simple and fast
• Monitoring should identify what’s broken and why (Ed: Domain
dependent!)
• Focus should be on better post hoc analysis (Ed: Forensics; big data)
• “Google SRE has experienced only limited success with complex
dependency hierarchies”
• “Different aspects of a system should be measured with different
levels of granularity.”
• “In Google’s experience, basic collection and aggregation of metrics,
paired with alerting and dashboards, has worked well as a relatively
standalone system.”

TYPES OF AUTOMATION
• No automation
• Externally maintained system-specific automation
• Externally maintained generic automation
• Internally maintained system-specific automation
• Systems need no automation
• Ed: Conclude Ops is closer to automation (except domain
specific)

LEFT-SHIFTING OPS ISN’T ONE-AND-DONE
“Automation code, like unit test code, dies when the maintaining
team isn’t obsessive about keeping the code in sync with the
codebase it covers. The world changes around the code: the DNS
team adds new configuration options, the storage team changes
their package names, and the networking team needs to support
new devices.”

TYPICAL SRE RELEASE PROCESS
• A typical release process proceeds as follows:
• Rapid uses the requested integration revision number (often obtained automatically from
our continuous test system) to create a release branch.
• Rapid uses Blaze to compile all the binaries and execute the unit tests, often performing
these two steps in parallel. Compilation and testing occur in environments dedicated to
those specific tasks, as opposed to taking place in the Borg job where the Rapid workflow
is executing. This separation allows us to parallelize work easily.
• Build artifacts are then available for system testing and canary deployments. A typical
canary deployment involves starting a few jobs in our production environment after the
completion of system tests.
• The results of each step of the process are logged. A report of all changes since the last
release is created.
• Rapid allows us to manage our release branches and cherry picks; individual cherry pick
requests can be approved or rejected for inclusion in a release. Source
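Ed: A minimal sketch of the shape of such a release process: ordered steps, each result logged, and a halt at the first failure. It does not model Rapid, Blaze, or Borg; the step names and the runner are hypothetical stand-ins.

```python
def run_release(steps):
    """Run (name, action) steps in order, logging each result.

    Each action is a callable returning True on success. Execution stops
    at the first failing step so later stages never see a bad build.
    """
    log = []
    for name, action in steps:
        succeeded = action()
        log.append((name, "ok" if succeeded else "failed"))
        if not succeeded:
            break
    return log

# Hypothetical steps mirroring the bullets above.
release_steps = [
    ("create_release_branch", lambda: True),
    ("build_and_unit_test", lambda: True),
    ("system_test", lambda: False),   # simulated failure
    ("canary_deploy", lambda: True),  # never reached
]
release_log = run_release(release_steps)
```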

SOME CONCLUSIONS
BY ED

1. Complex IT operations are challenging to left-shift at scale
2. Python (+ Go etc.) have facilitated left-shift
3. SDN (5-6G) is a game-changer; Ops is in the game, like it or
not
4. Monitoring and alerting are beyond current SE skills
5. SRE treats security as a feature (casual?)
6. SRE measures manual processes as part of using automation
to drive reliability
7. SRE has a more formal, Ops-driven approach to trade-off
compacts with product owners
8. Current DevOps SDLC practices have not formalized how to
capture and manage quality, reliability
9. Except for CMMI, risk is weakly integrated into the DevOps
SDLC
10. DevOps does not identify “toil,” hence may not participate in
PDCA cycle from Ops
11. Dev teams may not know what can/should be automated.
