From Apollo 13 to Google SRE

© 2015 IBM Corporation
Sanjeev Sharma
CTO – DevOps Adoption
IBM Distinguished Engineer
@sd_architect | sdarchitect.blog
When DevOps met SRE

#WhoAmI
• 20+ years in software development and delivery
• IBM Distinguished Engineer and CTO for DevOps Adoption
• Author of two DevOps books:
  • DevOps For Dummies: https://ibm.biz/BdsPMX
  • The DevOps Adoption Playbook: http://amzn.to/2hH7rt2
• Blog: https://sdarchitect.blog
• @sd_architect

What is SRE?
“SRE is what happens when you ask a software engineer to design an operations team.”
- Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. “Site Reliability Engineering.”
Site Reliability Engineering (SRE): Google’s approach to Service Management

Apollo 13 – The real heroes
Image Courtesy:
Universal Pictures, NASA

Reliability: The Real Availability Numbers!
How much downtime does 5-nines, or 99.999%, availability translate to?
• Daily: 0.9s
• Weekly: 6.0s
• Monthly: 26.3s
• Yearly: 5m 15.6s
4-nines or 99.99% translates to downtime of:
• Daily: 8.6s
• Weekly: 1m 0.5s
• Monthly: 4m 23.0s
• Yearly: 52m 35.7s
Even the more common 99.95% availability SLO allows
a mere 43 seconds/day, or about
5 minutes 2 seconds/week.
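The downtime figures on this slide follow from simple arithmetic: allowed downtime = period length × (1 − availability). A minimal sketch, assuming an average month of 365.25/12 days and a year of 365.25 days (the slide does not state which period lengths it uses); the function names are illustrative only:

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumed period lengths in days (average month = 365.25 / 12).
PERIOD_DAYS = {
    "daily": 1,
    "weekly": 7,
    "monthly": 365.25 / 12,
    "yearly": 365.25,
}


def allowed_downtime_seconds(availability_percent):
    """Return the downtime allowed per period, in seconds."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return {
        period: days * SECONDS_PER_DAY * unavailable_fraction
        for period, days in PERIOD_DAYS.items()
    }


def format_duration(seconds):
    """Format seconds as minutes and seconds, e.g. 262.98 -> '4m 23.0s'."""
    minutes, remainder = divmod(seconds, 60)
    if minutes:
        return f"{int(minutes)}m {remainder:.1f}s"
    return f"{remainder:.1f}s"


# Print the downtime allowance for each availability target on the slide.
for target in (99.999, 99.99, 99.95):
    downtime = allowed_downtime_seconds(target)
    summary = ", ".join(f"{p}: {format_duration(s)}" for p, s in downtime.items())
    print(f"{target}% -> {summary}")
```

Running the same function for other targets (for example 99.9%) makes it easy to see how quickly the allowance shrinks with each added nine.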

Eight Tenets of Google SRE
1. Ensuring a Durable Focus on Engineering
2. Pursuing Maximum Change Velocity Without Violating a Service’s SLO
3. Monitoring
4. Emergency Response
5. Change Management
6. Demand Forecasting and Capacity Planning
7. Provisioning
8. Efficiency and Performance
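Tenet 2 is usually put into practice through an error budget: the gap between 100% and the SLO is the amount of unreliability a service may "spend" on change. A minimal sketch, assuming a request-based SLO; the function names and the release rule are hypothetical illustrations, not Google's implementation:

```python
def error_budget_remaining(slo_percent, total_requests, failed_requests):
    """Return the fraction of the error budget still unspent.

    1.0 means nothing spent, 0.0 means fully spent, negative means overspent.
    """
    if not 0 < slo_percent < 100:
        raise ValueError("SLO must be strictly between 0 and 100 percent")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # Failures the SLO tolerates over this measurement window.
    allowed_failures = total_requests * (1 - slo_percent / 100)
    return 1 - failed_requests / allowed_failures


def can_release(slo_percent, total_requests, failed_requests):
    """Hypothetical policy: allow new releases only while budget remains."""
    return error_budget_remaining(slo_percent, total_requests, failed_requests) > 0


# Example: a 99.9% SLO over one million requests tolerates about 1,000 failures.
print(error_budget_remaining(99.9, 1_000_000, 250))
print(can_release(99.9, 1_000_000, 1_500))
```

The design choice is that velocity and reliability share one number: while budget remains, teams ship; once it is spent, effort shifts to stability.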

Best Practices of Incident Management
1. Prioritize
2. Prepare
3. Trust
4. Introspect
5. Consider alternatives
6. Practice
7. Change it around
Image Courtesy:
Universal Pictures, NASA

DevOps + SRE in the Enterprise
Balancing Innovation and Optimization
[Diagram: delivery pipelines with the stages Development, SCM, Build, Package, Repo, Deploy, Test, Stage and Production for four application types (Mainframe Hosted App, Mobile App, App Server Monolithic App, Cloud Native App), labeled Enterprise Release and Business Capability]
Agile/Innovation Edge
Rapid Delivery for Innovation • Agile • Antifragile • Experimentation • New and Innovative • Hybrid Cloud • IaaS/PaaS • Containers
Industrialized Core
Deliver at regular cadence • Agile • Stability • Predictability • Lean Delivery pipeline • Core and Legacy Systems
Hybrid Infrastructure – Physical, Cloud • IaaS/PaaS • Containers

Touchpoints of Standardization Across Delivery Pipelines
[Diagram: Repo and Deploy pipelines through Test, Stage and Production for Application A, Application B, Application C and Application N in the Industrialized Core, labeled Enterprise Release and Business Capability]
• Deployment Automation and Orchestration
• Service and Test Environment Virtualization
• APIs
• Planning and Architecture
• Release Management
• Operational Readiness

When DevOps met SRE
[Diagram: Repo and Deploy pipelines through Test, Stage and Production for Application A, Application B, Application C and Application N in the Industrialized Core, labeled Enterprise Release and Business Capability, with the touchpoints below marked by DevOps and SRE labels]
• Deployment Automation and Orchestration
• Service and Test Environment Virtualization
• APIs
• Planning and Architecture
• Release Management
• Operational Readiness

Your Delivery Pipeline will only be as fast as the slowest Delivery Pipeline it depends on
Architecture and Planning

Modernizing to a Microservices-based Architecture: refactoring code and data, and defining REST APIs
APIs

Developers are paid to write code, not to maintain deployment and configuration scripts
Application Deployment and Environment Orchestration

If you are doing 2-week Sprints, but it takes 3 weeks to get a Test Server, how long are your Sprints?
Test Service and Environment Virtualization

It is not possible to patch the software of a missile AFTER it has been launched
Release Management

Shift thinking from Mean Time Between Failures (MTBF) to Mean Time To Repair (MTTR).
Operational Readiness for SRE

MTTR Calculus
Mean Time to Repair =
Mean Time to Detect + Mean Time to Triage +
Mean Time to Restore
+ Mean Time to Pass Blame…
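Leaving the tongue-in-cheek blame term aside, the decomposition can be checked against incident data. A minimal sketch, assuming each incident records how long detection, triage and restoration took in minutes; the `Incident` fields and sample values are hypothetical:

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    detect_min: float   # outage start until an alert or report
    triage_min: float   # detection until the cause is identified
    restore_min: float  # triage complete until service is restored


def mttr_breakdown(incidents):
    """Return mean detect, triage and restore times plus their sum (MTTR)."""
    if not incidents:
        raise ValueError("at least one incident is required")
    mttd = mean([i.detect_min for i in incidents])
    mttt = mean([i.triage_min for i in incidents])
    mtt_restore = mean([i.restore_min for i in incidents])
    return {
        "detect": mttd,
        "triage": mttt,
        "restore": mtt_restore,
        "mttr": mttd + mttt + mtt_restore,
    }


# Made-up sample data, for illustration only.
sample = [Incident(5.0, 10.0, 30.0), Incident(15.0, 20.0, 60.0)]
print(mttr_breakdown(sample))
```

Breaking MTTR into its terms shows where to invest: better monitoring shortens detection, better runbooks shorten triage, and automated rollback shortens restoration.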

Antifragile Systems
Antifragile: Things that are
neither fragile nor robust,
but rather thrive in chaos.

Delivering Antifragile Systems
Servers may go “red,” services are always “green”
Cattle, not pets
Fragility in systems actually comes from a desire to make them too robust.

Organizational Change
• “Everyone is responsible for
Delivering to Production”
• Squad-Tribe-Guild Team Model
• SRE Squads
• A Learning Organization

When DevOps meets SRE
DevOps: “Everyone is responsible for
delivery to production.”
SRE: “(Everyone) is responsible for
delivering Continuous Business Value”
