© 2018 Google Inc. All rights reserved.
Intro to Reliability at Google
Proprietary + Confidential
Steve McGhee
Reliability Advocate, SRE
@stevemcghee
smcghee@google.com
He/Him
Nathen Harvey
Developer Advocate
@nathenharvey
nathenharvey@google.com
He/Him
Outline
● Intro to Reliability at Google
● Theory: Cloud Reliability, Risks + Mitigations
● "r9y mapping"
For the past 15 years, Google has been
building out the fastest, most
powerful, highest-quality cloud
infrastructure on the planet.
Confidential & Proprietary
Nine products with over
one billion users each,
all powered by the cloud.
Google Cloud's Global Presence
76 zones in 25 regions
(as of May 2021)
Unity (US, JP) 2010
Monet (US, BR) 2017
Tannat (BR, UY, AR) 2017
Junior (Rio, Santos) 2017
FASTER (US, JP, TW) 2016
PLCN (HK, LA) 2019
Indigo (SG, ID, AU) 2019
Curie (CL, US) 2019
Havfrue (US, IE, DK) 2019
SJC (JP, HK, SG) 2013
HK-G (HK, GU) 2019
Edge node locations: >1000
Edge points of presence: >100
[Map: Google network, showing current and future regions and the number of zones in each]
Scale on the same reliable infrastructure Google uses
The Network Matters
[Diagram comparing network paths between the User and the cloud. Labels: Typical Cloud Provider, Google Cloud, ISP, Google PoP]
GCP - Architected for Resilience and Scale
Compute
Borg
Scalable job scheduler
Behind Google's 8+
Billion-user Products
Inspiration for Kubernetes
Storage
Colossus
Exabyte storage clusters
Next-Generation cluster
storage system
Networking
Andromeda
Global software-defined network
Highly-available, flat global
network
GCP leadership in infrastructure innovation
Compute
Borg
10+ years of evolution
Cloud specific clusters,
Layers of failure domains,
Flexible, fast control
Live Migration running VMs
No more maintenance windows.
Security patches and hardware
changes without VM downtime.
Storage
Colossus
Every bit triple-redundant
Services using Colossus inherit
world-class replication and
encoding
Distributed metadata model
Allows for fast, independent
retrieval of "hot" or "cold" data
Networking
Andromeda
Fail static
In the case of programming failure
or control plane fault, the
last-known-good network remains in place
Zones & Regions are the basic building blocks of global compute infrastructure
Zone: a unit of deployment of computing and supporting infrastructure
Region: a collection of Zones, typically in a single metro or in nearby metros. Expectation: a Region has >= 3 Zones.
Networking connects resources within a zone, region, and across regions
cluster cluster cluster
zone zone zone
region region region region
global network
A Logical view
GCP building blocks - Regions, Zones
GCP Service Topology
● Zones, Regions, Multi-Region (visible)
● Campuses, Buildings (internal)
● Borg Clusters (internal)
● Racks, Machines, Power/Cooling (internal)
Think of Services within a scope:
● Zonal Service generally @ 99.9%
● Regional Service generally @ 99.99%
Survive disaster (e.g., hurricanes, floods) via multi-regional deployments.
© 2020 Google LLC. All rights reserved.
“100% is the wrong reliability target for basically everything.”
Benjamin Treynor Sloss, Vice President of 24x7 Engineering, Google
● Share ownership
● SLOs & blameless postmortems
● Reduce costs of failure
● Build the solution, don’t be the solution
● Quantify the impact, including toil and reliability
Areas of practice
Metrics & Monitoring
● Paging vs. ticketing
● Involve humans for serious threats to SLO
● Triggers, actions
Capacity Planning
● Organic growth
● Inorganic growth:
  ○ BFCM
  ○ COVID-19
● Buffer capacity
Change Management
● Slow rollouts
● Efficient rollbacks
● Remember: ~70% of outages are caused by changes
Emergency Response
● Clear outage thresholds
● Pre-defined RACI
● Playbooks & documentation
Culture
● Psychological safety
● Blamelessness
● Data-driven
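One way to make "paging vs. ticketing" concrete is to route alerts by how fast the error budget is being spent. The sketch below is an illustration only: the function names and both thresholds (`page_at`, `ticket_at`) are assumptions, not values from the talk.

```python
def burn_rate(bad_fraction: float, slo: float) -> float:
    """Rate at which the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo
    if budget <= 0.0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return bad_fraction / budget


def route_alert(bad_fraction: float, slo: float,
                page_at: float = 10.0, ticket_at: float = 1.0) -> str:
    """Return 'page', 'ticket', or 'none'. Thresholds are illustrative assumptions."""
    rate = burn_rate(bad_fraction, slo)
    if rate >= page_at:
        return "page"    # serious threat to the SLO: involve a human now
    if rate >= ticket_at:
        return "ticket"  # budget is eroding: handle during working hours
    return "none"


# Hypothetical observed fractions of bad requests against a 99.9% SLO.
for observed in (0.02, 0.002, 0.0005):
    print(observed, route_alert(observed, slo=0.999))
```

The point of the split is that only a fast burn interrupts a human; a slow burn becomes a ticket.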
40k foot Theory: Cloud Reliability
Context: The Pyramids
Component-level reliability:
- solid base (big cold building, heavy
iron, redundant disks/net/power)
- each component up as much as
possible
- total availability as goal
- "scale up"
Scalable reliability:
- less-reliable, cost-effective base
- "warehouse scale" (many machines)
- software improves availability
- aggregate availability as goal
- "scale out"
This Bears Repeating
You can build
more reliable things
on top of
less reliable things
a simple example: RAID. see: The SRE I Aspire to Be, @aknin SREconEMEA 2019
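A back-of-the-envelope sketch of why this works, assuming replicas fail independently (real systems only approximate this, so treat the numbers as an upper bound). The function name and the 99% figure are illustrative, not from the talk.

```python
def combined_availability(component_availability: float, replicas: int) -> float:
    """Availability of a group that is up while at least one replica is up.

    Assumes replica failures are independent of each other.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    # The group is down only when every replica is down at the same time.
    return 1.0 - (1.0 - component_availability) ** replicas


# Hypothetical component that is up 99% of the time.
for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
```

Each extra replica of a "two nines" component adds roughly two more nines, as long as something (software, a RAID controller) routes around the failed copy.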
More Theory: Risks and Mitigations
The SRE Virtuous Cycle
smcghee@google.com
Risk
● outages
  ○ planned
    ■ maintenance windows
  ○ unplanned
    ■ bad pushes
      ● bad data
      ● bad binaries
      ● bad config
    ■ natural disasters
● poor performance
● inability to innovate
● security issues
Equation 1:
Risk = Impact * probability
R=I*p
https://www.usenix.org/conference/srecon18asia/presentation/brown
Impact
$/second lost
users affected
types of users
types of user actions
reputation / brand impact
(more on this in a minute)
probability
naaaah, it'll never happen
● aka "likelihood"
○ pls not a matrix!
○ [catastrophic, rare]
○ vs:
○ [minimal, frequent]
○ ¯_(ツ)_/¯
● we can know:
○ MTBF / ETBF
○ MTTD + MTTR
○ % Users affected
○ SLO / Error Budget
Let's use ⇒ "Bad Minutes/year"
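A minimal sketch of turning the "we can know" inputs into bad minutes per year. The formula (incident frequency × time to detect and repair × share of users affected) is one plausible reading of the slide, and every risk name and number below is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskEstimate:
    name: str
    incidents_per_year: float       # derived from MTBF / ETBF
    minutes_to_detect: float        # MTTD
    minutes_to_repair: float        # MTTR
    fraction_users_affected: float  # 0.0 to 1.0

    def bad_minutes_per_year(self) -> float:
        """Expected user-weighted bad minutes per year for this risk."""
        duration = self.minutes_to_detect + self.minutes_to_repair
        return self.incidents_per_year * duration * self.fraction_users_affected


# Hypothetical risk catalog; replace with your own estimates.
risks = [
    RiskEstimate("bad config push", 12, 10, 30, 0.5),
    RiskEstimate("zone outage", 0.5, 5, 120, 0.25),
]

# Rank risks by expected cost so the biggest one is addressed first.
for risk in sorted(risks, key=RiskEstimate.bad_minutes_per_year, reverse=True):
    print(f"{risk.name}: {risk.bad_minutes_per_year():.1f} bad minutes/year")
```

Putting every risk in the same unit makes a [catastrophic, rare] risk directly comparable with a [minimal, frequent] one, which a matrix cannot do.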
Equation 2:
Impact = blast radius * time
I=br*t
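Substituting Equation 2 into Equation 1 puts all three quantities in one expression, using the slides' own symbols ($R$ risk, $I$ impact, $p$ probability, $br$ blast radius, $t$ time):

```latex
\begin{align*}
R &= I \cdot p                      && \text{(Equation 1)} \\
I &= \mathit{br} \cdot t            && \text{(Equation 2)} \\
\Rightarrow\quad R &= \mathit{br} \cdot t \cdot p
\end{align*}
```

Because the terms multiply, halving any one of them halves the risk.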
blast radius
"how many users were affected
by this change"
● everybody 💥
● just one region 🤭
● just logged-in users 😿
● anyone who was checking out
during the time 🛍
● 1% of all users 🤓
● 0.001% of all users 😮
time
"area under the curve"
● MTTD | MTTR
● Detect, Mitigate, Prevent!
● Total outage → Partial outage →
Degraded State → Recovered
Note: Incident time might be
different, due to post-incident
"cleanup" or analysis.
So What?
We have 3 things we want to potentially minimize:
● probability of bad thing occurring
● blast radius, when it does happen
● time to get it fixed
reduce any of these, ideally all of these.
sample resilience engineering methods
● canary releases (blast radius)
● instrumenting for distributed traces (time)
● exponential retries with jitter (probability)
● sharding / partitioning data, traffic (blast radius)
● …
● Cost-based Load Balancing (probability)
● Throttling, Rate-Limiting (probability)
● Feature Flags, Dark Deployments (blast radius, probability)
● Multicluster Deployments (w/ internal loadbalancing) (blast radius)
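As one example from this list, here is a minimal sketch of exponential retries with jitter. It uses the "full jitter" variant (sleep a random time between zero and the capped exponential delay); the function name, defaults, and the choice to retry on any exception are assumptions for illustration.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    *,
    max_attempts: int = 5,
    base: float = 0.1,   # seconds; first retry waits at most this long
    cap: float = 5.0,    # seconds; upper bound on any single wait
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call operation, retrying failures with capped exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # real code should catch only retryable errors
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Jitter spreads clients out so they do not retry in lockstep.
            sleep(rng.uniform(0.0, min(cap, base * 2 ** attempt)))
    raise AssertionError("unreachable")
```

The exponential growth keeps a struggling backend from being hammered, and the jitter avoids synchronized retry waves; together they lower the probability that a transient fault becomes a user-visible failure.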
r9y mapping
The Reliability (r9y) Journey
Cloud customers have a hard time knowing what reliability is, what they've done, and what they even
want! We need to learn how to best help them.
● Start with a map of reliability capabilities
○ both known + unknown unknowns are presented, in context!
● Plot their current position with an orienteering survey
● Determine their destination with a compass
○ making a choice based on cost, business needs ("nines" availability, latency, DR, geography)
● Help plan their journey with a guidebook
○ how to decide next steps (feedback loops)
○ how to implement that step
○ what to buy or adopt along the way
The Reliability Map (WIP)
Eras (nines):
● Demo (90%)
● Deterministic (99.0%)
● Reactive (99.9%)
● Proactive (99.99%)
● Autonomic (99.999%)
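To get a feel for what each era's target allows, the sketch below converts the nines into bad minutes per year. The era names and percentages come from the list above; the 365-day year and the helper name are assumptions.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years

ERAS = {
    "Demo": 0.90,
    "Deterministic": 0.99,
    "Reactive": 0.999,
    "Proactive": 0.9999,
    "Autonomic": 0.99999,
}


def allowed_bad_minutes_per_year(availability: float) -> float:
    """Minutes per year a service may be unavailable and still meet its target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for era, target in ERAS.items():
    budget = allowed_bad_minutes_per_year(target)
    print(f"{era:<14}{target:>9.3%}{budget:>12.1f} min/year")
```

Each additional nine cuts the budget by a factor of ten: about 525.6 minutes a year at 99.9%, and only about 5 at 99.999%.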
Streams / Personas:
● Development
● Infra
● Operations
● Observability
● People
Quick Hack: the Virtuous Cycle
First: SLOs / Error Budget
⇒ Incident Response
⇒ Blameless Postmortems + Postmortem review
⇒ Risk Analysis
⇒ Resilience Engineering Backlog and prioritization
⇒ Risk / Impact Reduction!
⇒ SLOs (adjust)
This then becomes your flywheel for deciding which capabilities to build next.
* Separate: reduce toil as needed
Start with SLOs, unless you can't
In order to define and use SLOs (SLIs, error budgets etc), you need:
● accuracy
○ metrics that sufficiently represent the state of your system
○ using only blackbox/synthetic or "ping" probes is insufficient and not representative of user traffic
○ changing a system to export its internal state can be more useful, either via metrics or logs
● precision
○ can't measure per-minute SLOs if you're only tracking "good days"
○ average latency ⇒ latency distribution over time
● breakdown per-service
○ measuring only at "the front door" or cross-stack can often be misleading
○ this is just another form of precision, breaking down per-service or per-container
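A small sketch of the "average latency ⇒ latency distribution" point: the same minute of traffic looks healthy by its mean and unhealthy by its tail. The sample data, the 200 ms threshold, and the nearest-rank percentile helper are all illustrative assumptions.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical minute of traffic: 98 fast requests and 2 very slow ones.
latencies_ms = [50.0] * 98 + [3000.0] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

# A latency SLI: fraction of requests served under an assumed 200 ms threshold.
good_fraction = sum(1 for ms in latencies_ms if ms < 200.0) / len(latencies_ms)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms, good={good_fraction:.2%}")
```

The mean blends the two slow requests into a number no user actually experienced; the percentiles and the good-request fraction show what users saw.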
The Pyramids
Component-level reliability:
- solid base (big cold building, heavy iron,
redundant disks/net/power)
- each component up as much as possible
- union of availability as goal
- "scale up"
Scalable reliability:
- less-reliable, cost-effective base
- "warehouse scale" (many machines)
- highly connected, API-driven
- software improves availability
- aggregate availability as goal
- "scale out"
Key Takeaway
We can build
more-reliable things
on top of
less-reliable things
This is counterintuitive!
Software lets us build systems that can cope with failure which hardware can't.
Apply this at many levels (app, system, team, org!) for great success.
Business Service Orientation
Business
Service 1
Capability
A
Limitation
X
Capability
B
Limitation
Y
Business
Service 2
Capability
B
Limitation
Y
Capability
D
Limitation
Z
Business
Service N
Capability
A
Limitation
X
Capability
F
Limitation
W
Identification of common limitations across Business Services surfaces the high impact modernization tasks
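The identification step can be sketched as a simple count. The service, capability, and limitation labels below are the placeholders from the diagram; pairing each capability with the limitation listed next to it is an assumption about the slide's layout.

```python
from collections import Counter

# Capability -> limitation for each business service, as read from the diagram.
services = {
    "Business Service 1": {"A": "X", "B": "Y"},
    "Business Service 2": {"B": "Y", "D": "Z"},
    "Business Service N": {"A": "X", "F": "W"},
}

# Count how many business services are held back by each limitation.
limitation_counts = Counter(
    limitation
    for capabilities in services.values()
    for limitation in capabilities.values()
)

for limitation, count in limitation_counts.most_common():
    print(f"Limitation {limitation}: affects {count} service(s)")
```

Limitations that show up across several services (here X and Y) are the candidates for high-impact modernization work.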
Modernization Adoption
time
capability 1
capability 2
capability 3
capability 4
service 1: low-risk
early adoption, slow progress
service N: high-risk
late adoption, fast safe progress!
platform
maturity
service N: high-risk
don't adopt prematurely!
gain confidence in
capabilities
GDG Cloud Southlake #11 Steve McGhee Reliability Theory and Practice

GDG Cloud Southlake #11 Steve McGhee Reliability Theory and Practice

  • 1. © 2018 Google Inc. All rights reserved. Intro to Reliability at Google
  • 2. Proprietary + Confidential Steve McGhee Reliability Advocate, SRE @stevemcghee smcghee@google.com He/Him
  • 3. Proprietary + Confidential Nathen Harvey Developer Advocate @nathenharvey nathenharvey@google.com He/Him
  • 4. © 2018 Google Inc. All rights reserved. Outline ● Intro to Reliability at Google ● Theory: Cloud Reliability, Risks + Mitigations ● "r9y mapping"
  • 5. For the past 15 years, Google has been building out the world’s fastest, most powerful, highest quality cloud infrastructure on the planet.
  • 6. Confidential & Proprietary 6 Nine products with over one billion users each, all powered by the cloud.
  • 7. Confidential & Proprietary Google Cloud's Global Presence 76 zones in 25 regions (as of May 2021)
  • 8. Unity (US, JP) 2010 Monet (US, BR) 2017 Tannat (BR, UY, AR) 2017 Junior (Rio, Santos) 2017 FASTER (US, JP, TW) 2016 PLCN (HK, LA) 2019 Indigo (SG, ID, AU) 2019 Curie (CL, US) 2019 Havfrue (US,IE, DK) 2019 SJC (JP, HK, SG) 2013 HK-G (HK, GU) 2019 Edge node locations >1000 Edge points of presence >100 Network Future region and number of zones Current region and number of zones 3 2 3 3 3 3 3 3 3 3 3 4 3 3 3 3 3 3 3 3 3 3 Scale on the same reliable infrastructure Google uses
  • 9. The Network Matters Typical Cloud Provider Cloud Provider User Google Cloud Google Cloud Google Pop ISP User Google Pop
  • 10. Confidential & Proprietary GCP - Architected for Resilience and Scale Compute Borg Scalable job scheduler Behind Google's 8+ Billion-user Products Inspiration for Kubernetes Storage Colossus Exabyte storage clusters Next-Generation cluster storage system Networking Andromeda Global software-defined network Highly-available, flat global network
  • 11. Confidential & Proprietary GCP leadership in infrastructure innovation Compute Borg 10+ years of evolution Cloud specific clusters, Layers of failure domains, Flexible, fast control Live Migration running VMs No more maintenance windows. Security patches and hardware changes without VM downtime. Storage Colossus Every bit triple-redundant Services using Colossus inherit world-class replication and encoding Distributed metadata model Allows for fast, independent retrieval of "hot" or "cold" data Networking Andromeda Fail static In the case of programming failure or control plane fault, last-known- good network remains in place
  • 12. Confidential & Proprietary Zones & Regions are the basic building blocks of global compute infrastructure Zone: a unit of deployment of computing and supporting infrastructure Region: A collection of Zones, typically in a single or nearby metros. Expectation: Region is >= 3 Zones. Networking connects resources within a zone, region, and across regions cluster cluster cluster zone zone zone region region region region global network A Logical view GCP building blocks - Regions, Zones
  • 13. Confidential & Proprietary GCP Service Topology Zones, Regions, Multi-Region (visible) ● Campuses, Buildings (internal) ● Borg Clusters (internal) ● Racks, Machines, Power/Cooling (internal) Think of Services within a scope: ● Zonal Service generally @ 99.9% ● Regional Service generally @ 99.99% Survive disaster (eg: hurricanes, floods) via multi-regional deployments.
  • 14. © 2020 Google LLC. All rights reserved. “100% is the wrong reliability target for basically everything.” Benjamin Treynor Sloss, Vice President of 24x7 Engineering, Google
  • 15. Share ownership; SLOs & blameless postmortems (PMs); Reduce costs of failure; Build the solution, don’t be the solution; Quantify the impact, including toil and reliability
  • 16. Areas of practice ● Metrics & Monitoring: paging vs. ticketing; involve humans for serious threats to SLO; triggers, actions ● Capacity Planning: organic growth; inorganic growth (BFCM, COVID-19); buffer capacity ● Change Management: slow rollouts; efficient rollbacks; remember: ~70% of outages are caused by changes ● Emergency Response: clear outage thresholds; pre-defined RACI; playbooks & documentation ● Culture: psychological safety; blamelessness; data-driven
  • 17. 40k-foot Theory: Cloud Reliability
  • 18. Context: The Pyramids Component-level reliability: - solid base (big cold building, heavy iron, redundant disks/net/power) - each component up as much as possible - total availability as goal - "scale up" Scalable reliability: - less-reliable, cost-effective base - "warehouse scale" (many machines) - software improves availability - aggregate availability as goal - "scale out"
  • 19. This Bears Repeating: You can build more reliable things on top of less reliable things. A simple example: RAID. See: "The SRE I Aspire to Be", @aknin, SREcon EMEA 2019
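A rough way to see why redundancy can beat its parts, in the spirit of the RAID example above: if a group stays up as long as at least one replica is up, its availability is one minus the chance that every replica is down at once. The sketch below assumes replica failures are independent, which real systems only approximate; the function name is hypothetical.

```python
def parallel_availability(component_availability: float, replicas: int) -> float:
    """Availability of a group that is up while at least one replica is up.

    Assumes independent failures; correlated failures (shared power, a bad
    push to every replica) make the real number lower.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    probability_all_down = (1.0 - component_availability) ** replicas
    return 1.0 - probability_all_down


if __name__ == "__main__":
    # A modest 99% component, replicated one to three times.
    for n in (1, 2, 3):
        print(f"{n} replica(s) at 99% each: {parallel_availability(0.99, n):.6f}")
```

The independence assumption is the part that software has to earn in practice, for example by placing replicas in separate failure domains.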
  • 20. More Theory: Risks and Mitigations
  • 21. The SRE Virtuous Cycle smcghee@google.com
  • 22. Risk ● outages ○ planned ■ maintenance windows ○ unplanned ■ bad pushes (bad data, bad binaries, bad config) ■ natural disasters ● poor performance ● inability to innovate ● security issues
  • 23. Equation 1: Risk = Impact * probability R=I*p https://www.usenix.org/conference/srecon18asia/presentation/brown
  • 24. Impact ● $/second lost ● users affected ● types of users ● types of user actions ● reputation / brand impact (more on this in a minute)
  • 25. probability: "naaaah, it'll never happen" ● aka "likelihood" ○ pls not a matrix! ○ [catastrophic, rare] ○ vs: ○ [minimal, frequent] ○ ¯\_(ツ)_/¯ ● we can know: ○ MTBF / ETBF ○ MTTD + MTTR ○ % Users affected ○ SLO / Error Budget Let's use ⇒ "Bad Minutes/year"
  • 26. Equation 2: Impact = blast radius * time I=br*t
  • 27. blast radius "how many users were affected by this change" ● everybody 💥 ● just one region 🤭 ● just logged-in users 😿 ● anyone who was checking out during the time 🛍 ● 1% of all users 🤓 ● 0.001% of all users 😮
  • 28. time "area under the curve" ● MTTD | MTTR ● Detect, Mitigate, Prevent! ● Total outage → Partial outage → Degraded State → Recovered Note: Incident time might be different, due to post-incident "cleanup" or analysis.
  • 29. So What? We have three things we may want to minimize: ● probability of a bad thing occurring ● blast radius, when it does happen ● time to get it fixed Reduce any of these, ideally all of these.
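The two equations (R = I * p and I = br * t) combine into R = br * t * p, which can be expressed in the "bad minutes per year" unit suggested earlier. The sketch below is one possible way to compare scenarios; the class, its field names, and every number in it are hypothetical and chosen only to show how each of the three levers moves the result.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RiskScenario:
    name: str
    incidents_per_year: float  # p: probability, expressed as a yearly rate
    blast_radius: float        # br: fraction of users affected, 0.0 to 1.0
    minutes_to_recover: float  # t: time from start of impact to recovery


def bad_minutes_per_year(scenario: RiskScenario) -> float:
    """R = I * p, with I = br * t (user-weighted bad minutes per year)."""
    impact = scenario.blast_radius * scenario.minutes_to_recover
    return impact * scenario.incidents_per_year


if __name__ == "__main__":
    # Hypothetical baseline: four bad pushes a year, everyone affected, an hour each.
    baseline = RiskScenario("bad push, no mitigations", 4, 1.0, 60)
    options = [
        baseline,
        replace(baseline, name="fewer bad pushes (probability)", incidents_per_year=2),
        replace(baseline, name="canary at 1% (blast radius)", blast_radius=0.01),
        replace(baseline, name="faster rollback (time)", minutes_to_recover=15),
    ]
    for option in options:
        print(f"{option.name}: {bad_minutes_per_year(option):.1f} bad minutes/year")
```

Because the three terms multiply, a large reduction in any one of them reduces the total by the same factor, which is why any of the three is worth attacking.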
  • 30. sample resilience engineering methods ● canary releases (blast radius) ● instrumenting for distributed traces (time) ● exponential retries with jitter (probability) ● sharding / partitioning data, traffic (blast radius) ● … ● Cost-based Load Balancing (probability) ● Throttling, Rate-Limiting (probability) ● Feature Flags, Dark Deployments (blast radius, probability) ● Multicluster Deployments (w/ internal loadbalancing) (blast radius)
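As an illustration of one method from the list above, here is a minimal sketch of exponential retries with jitter. It uses the "full jitter" variant (sleep a random amount between zero and an exponentially growing ceiling), which is one common choice and not necessarily the one the speakers had in mind. The function name, parameters, and defaults are assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_jitter(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or max_attempts calls have failed.

    Before retry k (counting from 0) the ceiling is min(max_delay, base_delay * 2**k)
    and the actual sleep is a uniform random value in [0, ceiling].
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Sketch only: real code should retry just the errors known to be transient.
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0.0, ceiling))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

The random spread keeps many clients that failed at the same moment from retrying in lockstep and overloading the recovering service, which is how retries reduce the probability of a second failure. The `sleep` parameter exists so callers and tests can substitute their own waiting function.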
  • 31. r9y mapping
  • 32. The Reliability (r9y) Journey. Cloud customers have a hard time knowing what Reliability is, what they've done, and what they even want! We need to learn how to best help them ● Start with a map of reliability capabilities ○ both known + unknown unknowns are presented, in context! ● Plot their current position with an orienteering survey ● Determine their destination with a compass ○ making a choice based on cost, business needs ("nines" availability, latency, DR, geography) ● Help plan their journey with a guidebook ○ how to decide next steps (feedback loops) ○ how to implement that step ○ what to buy or adopt along the way
  • 33. The Reliability Map (WIP) Eras (nines): ● Demo (90%) ● Deterministic (99.0%) ● Reactive (99.9%) ● Proactive (99.99%) ● Autonomic (99.999%) Streams / Personas: ● Development ● Infra ● Operations ● Observability ● People
  • 34. Quick Hack: the Virtuous Cycle First: SLOs / Error Budget ⇒ Incident Response ⇒ Blameless Postmortems + Postmortem review ⇒ Risk Analysis ⇒ Resilience Engineering Backlog and prioritization ⇒ Risk / Impact Reduction! ⇒ SLOs (adjust) This then becomes your flywheel for deciding which capabilities to build next. * Separate: reduce toil as needed
  • 35. Start with SLOs, unless you can't. In order to define and use SLOs (SLIs, error budgets, etc.), you need: ● accuracy ○ metrics that sufficiently represent the state of your system ○ using only blackbox/synthetic or "ping" checks is insufficient and not representative of user traffic ○ changing a system to export its internal state can be more useful, either via metrics or logs ● precision ○ can't measure per-minute SLOs if you're only tracking "good days" ○ average latency ⇒ latency distribution over time ● breakdown per-service ○ measuring only at "the front door" or cross-stack can often be misleading ○ this is just another form of precision, breaking down per-service or per-container
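Once a service exports accurate good/total event counts, the SLI and the error budget follow from simple arithmetic. The sketch below assumes a request-based SLO measured over some fixed window; the function name, the dictionary keys, and the example numbers are hypothetical.

```python
def error_budget_report(slo_target: float, good_events: int, total_events: int) -> dict:
    """Summarize SLI and error budget consumption for one measurement window.

    slo_target is a fraction such as 0.999; counts cover the same window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0 or not 0 <= good_events <= total_events:
        raise ValueError("need 0 <= good_events <= total_events and total_events > 0")
    allowed_bad = (1.0 - slo_target) * total_events  # the error budget, in events
    actual_bad = total_events - good_events
    return {
        "sli": good_events / total_events,
        "allowed_bad_events": allowed_bad,
        "actual_bad_events": actual_bad,
        # Negative once the budget is overspent.
        "budget_remaining_fraction": 1.0 - actual_bad / allowed_bad,
    }


if __name__ == "__main__":
    # Hypothetical window: one million requests, 600 of them bad, 99.9% target.
    report = error_budget_report(0.999, good_events=999_400, total_events=1_000_000)
    for key, value in report.items():
        print(f"{key}: {value}")
```

Running the same calculation per service or per container, as the slide suggests, only changes which counters are fed in.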
  • 36. The Pyramids Component-level reliability: - solid base (big cold building, heavy iron, redundant disks/net/power) - each component up as much as possible - union of availability as goal - "scale up" Scalable reliability: - less-reliable, cost-effective base - "warehouse scale" (many machines) - highly connected, API-driven - software improves availability - aggregate availability as goal - "scale out"
  • 37. Key Takeaway: We can build more-reliable things on top of less-reliable things. This is counterintuitive! Software lets us build systems that can cope with failure, which hardware can't. Apply this at many levels (app, system, team, org!) for great success.
  • 38. Business Service Orientation ● Business Service 1: Capability A (Limitation X), Capability B (Limitation Y) ● Business Service 2: Capability B (Limitation Y), Capability D (Limitation Z) ● Business Service N: Capability A (Limitation X), Capability F (Limitation W) Identifying common limitations across Business Services surfaces the high-impact modernization tasks.
  • 39. Modernization Adoption [Chart; labels: time, platform maturity, capability 1, capability 2, capability 3, capability 4] ● service 1: low-risk; early adoption, slow progress ● service N: high-risk; late adoption, fast safe progress! ● service N: high-risk; don't adopt prematurely! gain confidence in capabilities