GDG Cloud Southlake #11 Steve McGhee Reliability Theory and Practice

© 2018 Google Inc. All rights reserved.
Intro to Reliability at Google

Proprietary + Confidential
Steve McGhee
Reliability Advocate, SRE
@stevemcghee
smcghee@google.com
He/Him

Nathen Harvey
Developer Advocate
@nathenharvey
nathenharvey@google.com
He/Him

Outline
● Intro to Reliability at Google
● Theory: Cloud Reliability, Risks + Mitigations
● "r9y mapping"

For the past 15 years, Google has been building out the fastest, most
powerful, highest-quality cloud infrastructure on the planet.

Confidential & Proprietary
Nine products with over
one billion users each,
all powered by the cloud.

Google Cloud's Global Presence
76 zones in 25 regions
(as of May 2021)

[Map: current and future regions, each labeled with its number of zones]
Network (subsea cables):
● Unity (US, JP) 2010
● Monet (US, BR) 2017
● Tannat (BR, UY, AR) 2017
● Junior (Rio, Santos) 2017
● FASTER (US, JP, TW) 2016
● PLCN (HK, LA) 2019
● Indigo (SG, ID, AU) 2019
● Curie (CL, US) 2019
● Havfrue (US, IE, DK) 2019
● SJC (JP, HK, SG) 2013
● HK-G (HK, GU) 2019
Edge node locations: >1000
Edge points of presence: >100
Scale on the same reliable infrastructure Google uses

The Network Matters
[Diagram: network path between the user and the cloud, Typical Cloud Provider vs. Google Cloud. Labels: User, ISP, Google PoP, Google Cloud, Cloud Provider]

GCP - Architected for Resilience and Scale
Compute: Borg
● Scalable job scheduler
● Behind Google's 8+ Billion-user Products
● Inspiration for Kubernetes
Storage: Colossus
● Exabyte storage clusters
● Next-Generation cluster storage system
Networking: Andromeda
● Global software-defined network
● Highly-available, flat global network

GCP leadership in infrastructure innovation
Compute: Borg
● 10+ years of evolution
○ Cloud-specific clusters, layers of failure domains, flexible, fast control
● Live Migration of running VMs
○ No more maintenance windows. Security patches and hardware changes without VM downtime.
Storage: Colossus
● Every bit triple-redundant
○ Services using Colossus inherit world-class replication and encoding
● Distributed metadata model
○ Allows for fast, independent retrieval of "hot" or "cold" data
Networking: Andromeda
● Fail static
○ In the case of programming failure or control plane fault, last-known-good network remains in place

Zones & Regions are the basic building blocks of global compute infrastructure
Zone: a unit of deployment of computing and supporting infrastructure
Region: A collection of Zones, typically in a single metro or in nearby metros. Expectation: a Region has >= 3 Zones.
Networking connects resources within a zone, region, and across regions
[Diagram, a logical view: clusters sit inside zones, zones inside regions, and regions connect over the global network]
GCP building blocks - Regions, Zones

GCP Service Topology
● Zones, Regions, Multi-Region (visible)
● Campuses, Buildings (internal)
● Borg Clusters (internal)
● Racks, Machines, Power/Cooling (internal)
Think of Services within a scope:
● Zonal Service generally @ 99.9%
● Regional Service generally @ 99.99%
Survive disasters (e.g. hurricanes, floods) via multi-regional deployments.
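To make the zonal and regional figures concrete, here is a minimal sketch that converts an availability target into allowed downtime per year. The function name and the 365-day year are assumptions made for illustration; they are not part of the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored for simplicity


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes per year a service may be down while still meeting its target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


# Targets taken from the slide: zonal at 99.9%, regional at 99.99%.
for scope, target in [("zonal", 99.9), ("regional", 99.99)]:
    minutes = allowed_downtime_minutes(target)
    print(f"{scope:8s} {target}% -> {minutes:.1f} min/year")
```

Each extra nine divides the allowance by ten: roughly 526 minutes a year for a zonal service versus roughly 53 for a regional one.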

© 2020 Google LLC. All rights reserved.
“100% is the wrong reliability target for basically everything.”
Benjamin Treynor Sloss, Vice President of 24x7 Engineering, Google

● Share ownership
● SLOs & Blameless PMs
● Reduce costs of failure
● Build the solution, don’t be the solution
● Quantify the impact, including toil and reliability

Areas of practice
Metrics & Monitoring
● Paging vs. ticketing
● Involve humans for serious threats to SLO
● Triggers, actions
Capacity Planning
● Organic growth
● Inorganic growth:
○ BFCM
○ COVID-19
● Buffer capacity
Change Management
● Slow rollouts
● Efficient rollbacks
● Remember: ~70% of outages are caused by changes
Emergency Response
● Clear outage thresholds
● Pre-defined RACI
● Playbooks & documentation
Culture
● Psychological safety
● Blamelessness
● Data-driven
40k foot Theory: Cloud Reliability

Context: The Pyramids
Component-level reliability:
- solid base (big cold building, heavy iron, redundant disks/net/power)
- each component up as much as possible
- total availability as goal
- "scale up"
Scalable reliability:
- less-reliable, cost-effective base
- "warehouse scale" (many machines)
- software improves availability
- aggregate availability as goal
- "scale out"

This Bears Repeating
You can build more reliable things on top of less reliable things.
A simple example: RAID. See: "The SRE I Aspire to Be", @aknin, SREcon EMEA 2019
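The arithmetic behind this claim can be sketched in a few lines. The sketch assumes replica failures are independent and that the service is up whenever at least one replica is up. Both are simplifying assumptions: real failures are often correlated, which is why failure domains matter. The function name is hypothetical.

```python
def aggregate_availability(component: float, replicas: int) -> float:
    """Availability of a service that is up when at least one replica is up.

    Assumes replica failures are independent (an idealization).
    """
    if replicas < 1:
        raise ValueError("need at least one replica")
    # The service is down only when every replica is down at the same time.
    return 1 - (1 - component) ** replicas


# A "two nines" component, replicated.
for n in (1, 2, 3):
    print(f"{n} replica(s): {aggregate_availability(0.99, n):.6f}")
```

Under these assumptions, three 99% components already give roughly six nines in aggregate, which is the "scale out" side of the pyramid.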

More Theory: Risks and Mitigations

The SRE Virtuous Cycle
smcghee@google.com

Risk
● outages
  ○ planned
    ■ maint windows
  ○ unplanned
    ■ bad pushes
      ● bad data
      ● bad binaries
      ● bad config
    ■ natural disasters
● poor performance
● inability to innovate
● security issues

Equation 1:
Risk = Impact * probability
R=I*p
https://www.usenix.org/conference/srecon18asia/presentation/brown

Impact
$/second lost
users affected
types of users
types of user actions
reputation / brand impact
(more on this in a minute)

probability
naaaah, it'll never happen
● aka "likelihood"
○ pls not a matrix!
○ [catastrophic, rare]
○ vs:
○ [minimal, frequent]
○ ¯_(ツ)_/¯
● we can know:
○ MTBF / ETBF
○ MTTD + MTTR
○ % Users affected
○ SLO / Error Budget
Let's use ⇒ "Bad Minutes/year"
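One way to arrive at "bad minutes/year" from the quantities listed above is sketched below. It assumes MTTD and MTTR are additive (repair time counted from detection) and that each incident is weighted by the fraction of users affected. These modeling choices, and all names and sample numbers, are assumptions for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def bad_minutes_per_year(mtbf_days: float, mttd_min: float, mttr_min: float,
                         users_affected: float = 1.0) -> float:
    """Expected bad minutes per year for one failure mode."""
    incidents_per_year = 365 / mtbf_days
    return incidents_per_year * (mttd_min + mttr_min) * users_affected


def error_budget_minutes(slo_pct: float) -> float:
    """Bad minutes per year that an SLO allows."""
    return MINUTES_PER_YEAR * (1 - slo_pct / 100)


# Hypothetical failure mode: a bad push every 30 days, 5 minutes to detect,
# 25 minutes to repair, 20% of users affected.
spent = bad_minutes_per_year(mtbf_days=30, mttd_min=5, mttr_min=25, users_affected=0.2)
budget = error_budget_minutes(99.99)
print(f"expected: {spent:.1f} bad min/year, budget at 99.99%: {budget:.1f}")
```

With these sample numbers the single failure mode already costs about 73 bad minutes a year, more than the roughly 53 a 99.99% SLO allows.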

Equation 2:
Impact = blast radius * time
I=br*t
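Substituting Equation 2 into Equation 1 gives R = br * t * p, which lets different risks be compared with one number instead of a matrix. The sketch below uses expected occurrences per year as a stand-in for probability. That substitution, the class, and the sample risks are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    blast_radius: float  # fraction of users affected, 0..1
    minutes: float       # time to detect and fix one occurrence
    per_year: float      # expected occurrences per year (stands in for p)

    @property
    def impact(self) -> float:
        # Equation 2: I = br * t
        return self.blast_radius * self.minutes

    @property
    def expected_bad_minutes(self) -> float:
        # Equation 1: R = I * p
        return self.impact * self.per_year


risks = [
    Risk("regional disaster", blast_radius=1.0, minutes=240, per_year=0.1),  # [catastrophic, rare]
    Risk("bad config push", blast_radius=0.05, minutes=30, per_year=24),     # [minimal, frequent]
]

ranked = sorted(risks, key=lambda r: r.expected_bad_minutes, reverse=True)
for r in ranked:
    print(f"{r.name}: {r.expected_bad_minutes:.1f} expected bad minutes/year")
```

With these made-up inputs the frequent, small risk outranks the rare, catastrophic one, which a qualitative matrix would not reveal.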

blast radius
"how many users were affected by this change"
● everybody 💥
● just one region 🤭
● just logged-in users 😿
● anyone who was checking out during the time 🛍
● 1% of all users 🤓
● 0.001% of all users 😮

time
"area under the curve"
● MTTD | MTTR
● Detect, Mitigate, Prevent!
● Total outage → Partial outage → Degraded State → Recovered
Note: Incident time might be different, due to post-incident "cleanup" or analysis.
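The "area under the curve" can be approximated by summing each phase's duration times the fraction of users affected during it. The phase durations and fractions below are invented for illustration, and the function name is hypothetical.

```python
def user_minutes_lost(phases):
    """Sum duration * affected fraction over incident phases.

    phases: iterable of (duration_minutes, fraction_of_users_affected).
    """
    return sum(duration * fraction for duration, fraction in phases)


incident = [
    (10, 1.0),  # total outage
    (20, 0.5),  # partial outage
    (60, 0.1),  # degraded state
]

total = user_minutes_lost(incident)
wall_clock = sum(duration for duration, _ in incident)
print(f"{total:.1f} full-outage-equivalent minutes over {wall_clock} wall-clock minutes")
```

Here a 90-minute incident costs about 26 full-outage-equivalent minutes, so quick mitigation to a partial or degraded state shrinks the area even before the full fix lands.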

So What?
We have three things we can try to minimize:
● probability of the bad thing occurring
● blast radius, when it does happen
● time to get it fixed
Reduce any of these, ideally all of them.

sample resilience engineering methods
● canary releases (blast radius)
● instrumenting for distributed traces (time)
● exponential retries with jitter (probability)
● sharding / partitioning data, traffic (blast radius)
● …
● Cost-based Load Balancing (probability)
● Throttling, Rate-Limiting (probability)
● Feature Flags, Dark Deployments (blast radius, probability)
● Multicluster Deployments (w/ internal loadbalancing) (blast radius)
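As one example from the list above, exponential retries with jitter can be sketched as follows. This is a minimal illustration using the "full jitter" variant (a uniformly random delay up to an exponentially growing cap). The function name, parameters, and defaults are assumptions, and production code would retry only on errors known to be transient.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # sketch only: narrow this to retryable errors
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random time in [0, cap) so that many
            # clients do not retry in lockstep after a shared failure.
            sleep(rng() * cap)


# Demo with a hypothetical flaky operation; sleeps are recorded, not performed.
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


waits = []
result = call_with_retries(flaky, sleep=waits.append)
print(result, "after", attempts["n"], "attempts")
```

The `sleep` and `rng` parameters exist only so the behavior can be exercised without real waiting or randomness.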

r9y mapping

The Reliability (r9y) Journey
Cloud customers have a hard time knowing what reliability is, what they've already done, and even what they
want! We need to learn how best to help them.
● Start with a map of reliability capabilities
○ both known + unknown unknowns are presented, in context!
● Plot their current position with an orienteering survey
● Determine their destination with a compass
○ making a choice based on cost, business needs ("nines" availability, latency, DR, geography)
● Help plan their journey with a guidebook
○ how to decide next steps (feedback loops)
○ how to implement that step
○ what to buy or adopt along the way

The Reliability Map (WIP)
Eras (nines):
● Demo (90%)
● Deterministic (99.0%)
● Reactive (99.9%)
● Proactive (99.99%)
● Autonomic (99.999%)
Streams / Personas:
● Development
● Infra
● Operations
● Observability
● People

Quick Hack: the Virtuous Cycle
First: SLOs / Error Budget
⇒ Incident Response
⇒ Blameless Postmortems + Postmortem review
⇒ Risk Analysis
⇒ Resilience Engineering Backlog and prioritization
⇒ Risk / Impact Reduction!
⇒ SLOs (adjust)
This then becomes your flywheel for deciding which capabilities to build next.
* Separate: reduce toil as needed
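Since the cycle starts and ends with SLOs and the error budget, a small sketch of the budget arithmetic may help. It assumes a request-based SLI (good events over total events) measured over some window. The function name and sample numbers are hypothetical.

```python
def error_budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    slo is a fraction such as 0.999; good and total are event counts.
    """
    if not 0 < slo < 1:
        raise ValueError("SLO must be strictly between 0 and 1; a 100% target leaves no budget")
    if total <= 0:
        raise ValueError("need at least one event")
    allowed_bad = (1 - slo) * total
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad


# Hypothetical window: 1,000,000 requests, 600 of them bad, against a 99.9% SLO.
remaining = error_budget_remaining(0.999, good=999_400, total=1_000_000)
print(f"{remaining:.0%} of the error budget remains")
```

A shrinking or negative remainder is one possible trigger for moving reliability work up the resilience engineering backlog.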

Start with SLOs, unless you can't
In order to define and use SLOs (SLIs, error budgets etc), you need:
● accuracy
○ metrics that sufficiently represent the state of your system
○ relying only on blackbox/synthetic or "ping" checks is insufficient and not representative of user traffic
○ changing a system to export its internal state can be more useful, either via metrics or logs
● precision
○ can't measure per-minute SLOs if you're only tracking "good days"
○ move from average latency to the latency distribution over time
● breakdown per-service
○ measuring only at "the front door" or cross-stack can often be misleading
○ this is just another form of precision, breaking down per-service or per-container

The Pyramids
Component-level reliability:
- solid base (big cold building, heavy iron, redundant disks/net/power)
- each component up as much as possible
- union of availability as goal
- "scale up"
Scalable reliability:
- less-reliable, cost-effective base
- "warehouse scale" (many machines)
- highly connected, API-driven
- software improves availability
- aggregate availability as goal
- "scale out"

Key Takeaway
We can build more-reliable things on top of less-reliable things.
This is counterintuitive!
Software lets us build systems that cope with failures that hardware alone cannot.
Apply this at many levels (app, system, team, org!) for great success.

Business Service Orientation
Business Service 1
● Capability A, Limitation X
● Capability B, Limitation Y
Business Service 2
● Capability B, Limitation Y
● Capability D, Limitation Z
Business Service N
● Capability A, Limitation X
● Capability F, Limitation W
Identification of common limitations across Business Services surfaces the high impact modernization tasks
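The identification step can be sketched as a simple count over the capability and limitation pairs from the diagram. The dictionary layout and variable names are assumptions for illustration.

```python
from collections import Counter

# Capability -> limitation, per business service, as labeled in the diagram.
services = {
    "Business Service 1": {"A": "X", "B": "Y"},
    "Business Service 2": {"B": "Y", "D": "Z"},
    "Business Service N": {"A": "X", "F": "W"},
}

# Count how many services are held back by each limitation.
limitation_counts = Counter(
    limitation
    for capabilities in services.values()
    for limitation in capabilities.values()
)

# Limitations shared by more than one service are modernization candidates.
shared = sorted(name for name, count in limitation_counts.items() if count > 1)
print("shared limitations:", shared)
```

In this example limitations X and Y each affect two services, so addressing them first would help the most services.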

Modernization Adoption
[Chart: platform maturity over time, with capabilities 1 through 4 added in sequence]
● service 1: low-risk. Early adoption, slow progress.
● service N: high-risk. Don't adopt prematurely! Gain confidence in capabilities.
● service N: high-risk. Late adoption, fast safe progress!