container-solutions.com info@container-solutions.com Cloud Native Operating Models @michmueller_
Cloud Native Operating Models
Jan 2021
Michael Mueller
@michmueller_
When digital transformation is done right, it’s like a
caterpillar turning into a butterfly, but when done wrong
all you have is a really fast caterpillar.
- George Westerman
About me
■ Michael Mueller
■ CC*O @ Container Solutions
■ obviously the best Cloud Native Consultancy in the world
■ CNCF Ambassador
■ I have two kids and a dog, so no time for fancy hobbies
*[Container|Comedian|Consulting|Customer|Cloud Native]
What is Cloud Native?
...not just tech!
What is Cloud Native
Digital Transformation
Agile Transformation
DevOps Transformation
Cloud Native Transformation
Cloud Native
Platform
Engineering
but we’re going to talk about...
SRE
What does SRE have to do with Cloud Native?
SREs operate AND improve systems/applications, provided they are manageable and well architected.
→ Follow Cloud Native architectural and development practices
SREs apply software engineering practices to operational tasks
SRE Example Impact
● Processes 4 billion notifications per month
● Filters out 99.9999875% of noise
● 99.7% of the remaining notifications are auto-remediated
● Developed and maintained by two SREs
● Doing the work of roughly 250 system administrators
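To put these percentages in absolute terms, here is a small worked calculation. It assumes the two percentages apply one after the other to the monthly volume, which the slide implies but does not state; the variable and function names are illustrative only.

```python
from decimal import Decimal

# Figures taken from the slide; Decimal keeps the arithmetic exact.
notifications_per_month = Decimal("4000000000")
noise_filtered = Decimal("99.9999875") / 100
auto_remediated = Decimal("99.7") / 100


def remaining_after(total, fraction_removed):
    """Return what is left of total after removing the given fraction."""
    return total * (1 - fraction_removed)


# Notifications that survive noise filtering.
actionable = remaining_after(notifications_per_month, noise_filtered)
# Notifications that are not auto-remediated and need a human.
needs_human = remaining_after(actionable, auto_remediated)

print(f"Left after noise filtering: {actionable:.0f} per month")
print(f"Left for humans: {needs_human:.1f} per month")
```

Under that assumption, 4 billion notifications shrink to 500 after filtering, and only about 1.5 per month are left for a person to handle.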
10 Signs your App is not Cloud Native
● Your deployment involves manual steps
● An operator needs to decide which server a new service instance goes on
● Multiple services must be deployed at the same time to prevent downtime
● Your database change requires coordinating releases of multiple services
● Your releases regularly break other consuming services
● You can’t replace service instances one-by-one in a rolling manner
● One service crashing has a cascading effect and tears down the whole application
● You don’t know which request caused the exception in a service down the road
● Your application feels slow, but you can’t say which service is the culprit
● Your services are too chatty: One user transaction creates hundreds of requests
What does well architected look like?
Identify Failure Scenarios:
● Service A is not able to communicate with Service B.
● Database is not accessible.
● Your application is not able to connect to the file system.
● Server is down or not responding.
● Inject faults/delays into the services.
Avoid Cascading Failure:
When you have service dependencies built into your architecture, you need to ensure that one failed service does not cause a ripple effect across the entire chain.
Avoid Single Points of Failure:
Ensure that your services aren’t dependent on one single component.
Handle Failures Gracefully and Allow for Fast Degradation:
If there are errors/exceptions, the service should handle them gracefully by providing an error message or a default value.
Design for Failures:
By following some commonly used design patterns, you can make your service self-healing.
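As a rough illustration of handling failures gracefully, the sketch below retries a failing call a few times and then falls back to a default value instead of propagating the error. The helper `call_with_fallback` and the `flaky_recommendations` service are hypothetical names invented for this example; the slides do not prescribe a specific pattern or library.

```python
import time


def call_with_fallback(operation, default, retries=2, delay_seconds=0.0):
    """Run operation; after retries are exhausted, return default instead of raising."""
    for attempt in range(retries + 1):
        try:
            return operation()
        except Exception:
            # Broad catch is deliberate in this sketch: any failure degrades
            # to the default. Real code would catch narrower exception types.
            if attempt < retries:
                time.sleep(delay_seconds)
    return default


def flaky_recommendations():
    """Hypothetical dependency that is currently unreachable."""
    raise ConnectionError("Service B is not reachable")


# The caller still gets a usable (empty) answer rather than a crash.
print(call_with_fallback(flaky_recommendations, default=[]))
```

The point of the design is that the caller keeps working with reduced functionality, so one failed dependency does not ripple through the whole chain.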
A Good Start - The Original 12 Factors
Codebase: One codebase tracked in revision control, many deploys
Dependencies: Explicitly declare and isolate dependencies
Config: Store config in the environment
Backing services: Treat backing services as attached resources
Build, release, run: Strictly separate build and run stages
Processes: Execute the app as one or more stateless processes
Port binding (debated): Export services via port binding
Concurrency: Scale out via the process model
Disposability: Maximize robustness with fast startup and graceful shutdown
Dev/prod parity (debated): Keep development, staging, and production as similar as possible
Logs: Treat logs as event streams
Admin processes: Run admin/management tasks as one-off processes
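The "Config" factor is easy to show in code. The following is a minimal sketch, assuming two made-up variables, `DATABASE_URL` and `LOG_LEVEL`; neither the names nor the `Settings` class come from the slides.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    log_level: str


def load_settings(environ=os.environ):
    """Build settings from environment variables (variable names are hypothetical)."""
    return Settings(
        # Required value: a missing variable raises KeyError at startup.
        database_url=environ["DATABASE_URL"],
        # Optional value with a default.
        log_level=environ.get("LOG_LEVEL", "INFO"),
    )


# A plain dict stands in for the process environment in this example.
example_env = {"DATABASE_URL": "postgres://db.example.internal/app"}
print(load_settings(example_env))
```

Because nothing environment-specific is baked into the code, the same build can run unchanged in development, staging, and production.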
3 more factors for “cloud-native-ness”
Composability: Applications are composed of independent services
Resilience: Failures of individual services have localized impact
Observability: Metrics and service interactions are exposed as data
Where does it come from?
And how does it compare to DevOps?
Origins of SRE
■ SRE evolved at Google in the early 2000s
■ Independent of the DevOps movement
■ Happens to embody the philosophies of DevOps
■ SRE prescribes how to succeed in the various areas
DevOps & Site Reliability Engineering
reliability
And why?
Are we moving away from YBIYRI (“you build it, you run it”), aka DevOps?
2-Pizza-DevOps-Teams + 24/7 Ops = ¯_(ツ)_/¯
■ say, 5 engineers capable of and willing to handle on-call duty
■ 365 days, 2 people on-call (1 primary, 1 backup)
■ everyone is on duty 146 days a year (365 × 2 / 5, almost every 2nd week)!
⇒ DevOps alone can't reasonably operate critical systems 24/7 and deliver features at high quality
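The on-call figure can be reproduced with a few lines of arithmetic. This sketch assumes on-call days are shared evenly across the team; the function name is illustrative.

```python
def on_call_days_per_engineer(engineers, people_on_call, days_per_year=365):
    """Days per year each engineer is on call when shifts are shared evenly."""
    return days_per_year * people_on_call / engineers


# Numbers from the slide: 5 engineers, 1 primary plus 1 backup every day.
days = on_call_days_per_engineer(engineers=5, people_on_call=2)
share = days / 365
print(f"{days:.0f} days per engineer per year ({share:.0%} of the year)")
```

With the slide's numbers this gives 146 days, or 40% of the year, for every engineer on the team.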
Ok, got it!
How do they work?
The guiding principles of SRE
■ The ability to regulate their own workload
■ Service Level Objectives (SLOs) with consequences
■ Time to make tomorrow better than today
■ Failure is an opportunity to improve
Regulate their own workload
Glossary of terms
SLI (service level indicator): a well-defined measure of 'good enough or user pains'
● used to specify SLO/SLA
● Software test / probe
SLO (service level objective): a top-line target for the fraction of good interactions
● specifies goals (SLI + goal)
SLA (service level agreement): consequences
● SLA = (SLO + margin) + consequences = SLI + goal + consequences
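One common way to make these terms concrete is to compute the SLI as the fraction of good interactions and compare it with the SLO target. The sketch below assumes a simple request-based SLI; the function names and the sample numbers are made up for illustration.

```python
def sli(good_events, total_events):
    """Fraction of good interactions; a window without traffic counts as fully good."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def meets_slo(good_events, total_events, slo_target):
    """SLO = SLI + goal: the measured fraction must reach the target."""
    return sli(good_events, total_events) >= slo_target


# Hypothetical month: 1,000,000 requests, 1,200 of them failed or were too slow.
print(sli(998_800, 1_000_000))
print(meets_slo(998_800, 1_000_000, slo_target=0.9985))
```

An SLA would add consequences on top of this check, typically with a margin so that the internal SLO is stricter than what is promised externally.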
Error Budgets!
■ Don’t use the 99.85% SLO in daily discussions
■ View it as 0.15% Error Budget instead!
■ Can also be seen as user discomfort budget
■ Negotiate Error Budget with business stakeholders
■ Use unspent error budget for innovation, e.g. to release features early (most outages are caused by changes, like releases)
■ Error budget blown? ➔ Release freeze until budget is replenished
■ Make it public
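The release-freeze rule can be sketched as a small calculation: turn the SLO into an allowance of bad events, compare it with what has actually gone wrong, and block releases once the allowance is used up. The function names and the traffic figures are assumptions for this example; only the 99.85% target comes from the slide.

```python
def budget_remaining(slo_target, good_events, total_events):
    """Share of the error budget still unspent (1.0 = untouched, below 0 = blown)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    return 1 - actual_bad / allowed_bad


def release_allowed(slo_target, good_events, total_events):
    """A release freeze applies once the budget is used up."""
    return budget_remaining(slo_target, good_events, total_events) > 0


# Hypothetical window: 1,000,000 requests against the 99.85% SLO.
print(release_allowed(0.9985, good_events=998_800, total_events=1_000_000))
print(release_allowed(0.9985, good_events=998_200, total_events=1_000_000))
```

With 1,200 bad requests the team has spent about 80% of its budget and may still release; with 1,800 the budget is blown and the second check returns `False`.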
ITIL Approximation
Service in SLO → most operational work is a standard change
Service close to being out of SLO → revert to normal change
(No, I don't understand the difference between "standard" and "normal" either…)
Benefits of error budgets
● Teams become self-policing
  The error budget is a valuable resource for them
● Shared responsibility for system uptime
  Infrastructure failures eat into the devs’ error budget
● Common incentive for your DevOps/SRE team
  Find the right balance between innovation and reliability, or, put differently, features vs. technical debt
● Teams can manage the risk themselves
  They decide how to spend their error budget
● Unrealistic reliability goals become unattractive
  Such goals dampen the velocity of innovation
SLOs
With Consequences
Which systems should aspire to 100% availability?
Which probably shouldn’t?
Why?
“No user can tell the difference between a system being 100% available and, let's say, 99.999% available”
-- Ben Treynor, VP of Engineering at Google
The cost of inadequate availability targets
Too low:
● Loss of revenue due to lower usage of the product
● Expensive workarounds in other systems that need to duplicate unreliable features
● Frustrated customers and loss of reputation due to an unreliable product
Too high:
● Long time-to-market for new features due to excessive test periods
● Disproportionately higher cost for development and infrastructure
● Dependent systems gravitate to higher coupling as they get used to the HA
● Frustrated developers and stakeholders as they can’t ship new features
Image credit: Google
“Nines” cost money and add complexity
Availability Table
Target SLO | Error Budget / 30 days | Requires
99.999 %   | 0.43 min (25 sec)      | Automated failover
99.99 %    | 4.32 min               | Automated rollback
99.95 %    | 21.6 min               | Automated rollback
99.9 %     | 43.2 min               | Comprehensive monitoring, 24/7 on-call
99.5 %     | 216 min                | Comprehensive monitoring, 24/7 on-call
99 %       | 432 min                | Alerting via user complaints
Image credit: Google
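The error budget column follows directly from the target: it is the non-available fraction of a 30-day window, expressed in minutes. The sketch below recomputes it; the function name is illustrative, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction

WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window


def error_budget_minutes(slo_percent):
    """Allowed downtime in minutes per 30 days for an SLO given as a percent string."""
    return (1 - Fraction(slo_percent) / 100) * WINDOW_MINUTES


for target in ["99.999", "99.99", "99.95", "99.9", "99.5", "99"]:
    minutes = float(error_budget_minutes(target))
    print(f"{target:>7} %  {minutes:8.2f} min")
```

Each additional nine divides the budget by ten, which is why the operational requirements in the right-hand column escalate so quickly.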
Continuous improvements
Time to make tomorrow better than today
Prerequisites
● Blameless Post-Mortems*
● Teams must lean towards automation:
  ○ Self-Service / APIs
  ○ GitOps
  ○ Test Automation
  ○ Continuous Delivery
● …
*) https://codeascraft.com/2012/05/22/blameless-postmortems/
Failure is an opportunity to improve
If humans aren’t enough, artificially create failures
Related: Netflix Chaos Monkey*
■ Forces service decoupling by randomly disabling services or components
■ Beginners: Use the monkey only in a test environment and file cascading failures as bugs
■ Advanced: Use it in production (during business hours)
■ Pro: Use it in production (24 x 7)
*) https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey
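The idea of creating failures on purpose can be tried at a much smaller scale inside a test suite. The sketch below is not Chaos Monkey and does not use its API; it is a hypothetical in-process wrapper that makes a fraction of calls to a dependency fail, so that callers can be checked for cascading failures. All names (`with_fault_injection`, `InjectedFault`, `lookup_price`) are invented for this example.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real call to simulate a failing dependency."""


def with_fault_injection(func, failure_rate, rng=None):
    """Wrap func so that roughly failure_rate of its calls raise InjectedFault."""
    rng = rng if rng is not None else random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def lookup_price(item):
    """Hypothetical dependency used only for the demonstration."""
    return {"coffee": 3}.get(item, 0)


# Fail about one call in five; a seeded generator keeps test runs repeatable.
chaotic_lookup = with_fault_injection(lookup_price, failure_rate=0.2, rng=random.Random(42))
failures = 0
for _ in range(100):
    try:
        chaotic_lookup("coffee")
    except InjectedFault:
        failures += 1
print(f"{failures} of 100 calls failed on purpose")
```

In line with the beginner advice above, a wrapper like this belongs in a test environment first, with any cascading failure it exposes filed as a bug.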
Site Reliability Engineering, Summary
■ Keep your users happy
■ Manage the innovation / reliability tension
■ Maintain all the things
Free SRE e-book

More Related Content

What's hot

Pivotal Digital Transformation Forum: Requirements to Deliver Innovation to M...
Pivotal Digital Transformation Forum: Requirements to Deliver Innovation to M...Pivotal Digital Transformation Forum: Requirements to Deliver Innovation to M...
Pivotal Digital Transformation Forum: Requirements to Deliver Innovation to M...
VMware Tanzu
 
Microservice Scars - Alt.net 2hr
Microservice Scars - Alt.net 2hrMicroservice Scars - Alt.net 2hr
Microservice Scars - Alt.net 2hr
Joshua Toth
 

What's hot (20)

Keynote: Architecting for Continuous Delivery (Pivotal Cloud Platform Roadshow)
Keynote: Architecting for Continuous Delivery (Pivotal Cloud Platform Roadshow)Keynote: Architecting for Continuous Delivery (Pivotal Cloud Platform Roadshow)
Keynote: Architecting for Continuous Delivery (Pivotal Cloud Platform Roadshow)
 
devops, platforms and devops platforms
devops, platforms and devops platformsdevops, platforms and devops platforms
devops, platforms and devops platforms
 
Evolving to Cloud-Native - Nate Schutta (2/2)
Evolving to Cloud-Native - Nate Schutta (2/2)Evolving to Cloud-Native - Nate Schutta (2/2)
Evolving to Cloud-Native - Nate Schutta (2/2)
 
devops, microservices, and platforms, oh my!
devops, microservices, and platforms, oh my!devops, microservices, and platforms, oh my!
devops, microservices, and platforms, oh my!
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 
Evolving to Cloud-Native - Nate Schutta (1/2)
Evolving to Cloud-Native - Nate Schutta (1/2)Evolving to Cloud-Native - Nate Schutta (1/2)
Evolving to Cloud-Native - Nate Schutta (1/2)
 
Cloud Native Architecture Patterns Tutorial
Cloud Native Architecture Patterns TutorialCloud Native Architecture Patterns Tutorial
Cloud Native Architecture Patterns Tutorial
 
Cloud Native Infrastructure Automation
Cloud Native Infrastructure AutomationCloud Native Infrastructure Automation
Cloud Native Infrastructure Automation
 
Cloud-Native Fundamentals: An Introduction to 12-Factor Applications
Cloud-Native Fundamentals: An Introduction to 12-Factor ApplicationsCloud-Native Fundamentals: An Introduction to 12-Factor Applications
Cloud-Native Fundamentals: An Introduction to 12-Factor Applications
 
Keynote: Software Kept Eating the World (Pivotal Cloud Platform Roadshow)
Keynote: Software Kept Eating the World (Pivotal Cloud Platform Roadshow)Keynote: Software Kept Eating the World (Pivotal Cloud Platform Roadshow)
Keynote: Software Kept Eating the World (Pivotal Cloud Platform Roadshow)
 
Cloud Foundry Summit 2015: A Year of Innovation: Cloud Foundry Lessons Learned
Cloud Foundry Summit 2015: A Year of Innovation: Cloud Foundry Lessons LearnedCloud Foundry Summit 2015: A Year of Innovation: Cloud Foundry Lessons Learned
Cloud Foundry Summit 2015: A Year of Innovation: Cloud Foundry Lessons Learned
 
Spring Boot & Spring Cloud on Pivotal Application Service
Spring Boot & Spring Cloud on Pivotal Application ServiceSpring Boot & Spring Cloud on Pivotal Application Service
Spring Boot & Spring Cloud on Pivotal Application Service
 
Platform Requirements for CI/CD Success—and the Enterprises Leading the Way
Platform Requirements for CI/CD Success—and the Enterprises Leading the WayPlatform Requirements for CI/CD Success—and the Enterprises Leading the Way
Platform Requirements for CI/CD Success—and the Enterprises Leading the Way
 
Accelerating Time to Market
Accelerating Time to MarketAccelerating Time to Market
Accelerating Time to Market
 
Cloud native enterprise
Cloud native enterpriseCloud native enterprise
Cloud native enterprise
 
OPS Executive insights Webinar - Accenture
OPS Executive insights Webinar - AccentureOPS Executive insights Webinar - Accenture
OPS Executive insights Webinar - Accenture
 
DevOps, Microservices and containers - a high level overview
DevOps, Microservices and containers - a high level overviewDevOps, Microservices and containers - a high level overview
DevOps, Microservices and containers - a high level overview
 
The Cloud Foundry Story
The Cloud Foundry StoryThe Cloud Foundry Story
The Cloud Foundry Story
 
Pivotal Digital Transformation Forum: Requirements to Deliver Innovation to M...
Pivotal Digital Transformation Forum: Requirements to Deliver Innovation to M...Pivotal Digital Transformation Forum: Requirements to Deliver Innovation to M...
Pivotal Digital Transformation Forum: Requirements to Deliver Innovation to M...
 
Microservice Scars - Alt.net 2hr
Microservice Scars - Alt.net 2hrMicroservice Scars - Alt.net 2hr
Microservice Scars - Alt.net 2hr
 

Similar to Cloud Native Operations

Migrating to cloud-native_app_architectures_pivotal
Migrating to cloud-native_app_architectures_pivotalMigrating to cloud-native_app_architectures_pivotal
Migrating to cloud-native_app_architectures_pivotal
kkdlavak3
 
Migrating_to_Cloud-Native_App_Architectures_Pivotal (2)
Migrating_to_Cloud-Native_App_Architectures_Pivotal (2)Migrating_to_Cloud-Native_App_Architectures_Pivotal (2)
Migrating_to_Cloud-Native_App_Architectures_Pivotal (2)
Dean Bruckman
 
Migrating_to_Cloud-Native_App_Architectures_Pivotal (2)
Migrating_to_Cloud-Native_App_Architectures_Pivotal (2)Migrating_to_Cloud-Native_App_Architectures_Pivotal (2)
Migrating_to_Cloud-Native_App_Architectures_Pivotal (2)
Tim Kirby
 
Migrating_to_Cloud-Native_App_Architectures_Pivotal
Migrating_to_Cloud-Native_App_Architectures_PivotalMigrating_to_Cloud-Native_App_Architectures_Pivotal
Migrating_to_Cloud-Native_App_Architectures_Pivotal
Estevan McCalley
 

Similar to Cloud Native Operations (20)

HLayer / Cloud Native Best Practices
HLayer / Cloud Native Best PracticesHLayer / Cloud Native Best Practices
HLayer / Cloud Native Best Practices
 
From Monoliths to Services: Grafually paying your Technical Debt
From Monoliths to Services: Grafually paying your Technical DebtFrom Monoliths to Services: Grafually paying your Technical Debt
From Monoliths to Services: Grafually paying your Technical Debt
 
Get Loose! Microservices and Loosely Coupled Architectures
Get Loose! Microservices and Loosely Coupled ArchitecturesGet Loose! Microservices and Loosely Coupled Architectures
Get Loose! Microservices and Loosely Coupled Architectures
 
Get Loose! Microservices and Loosely Coupled Architectures
Get Loose! Microservices and Loosely Coupled Architectures Get Loose! Microservices and Loosely Coupled Architectures
Get Loose! Microservices and Loosely Coupled Architectures
 
Productionizing Predictive Analytics using the Rendezvous Architecture - for ...
Productionizing Predictive Analytics using the Rendezvous Architecture - for ...Productionizing Predictive Analytics using the Rendezvous Architecture - for ...
Productionizing Predictive Analytics using the Rendezvous Architecture - for ...
 
Migrating to cloud-native_app_architectures_pivotal
Migrating to cloud-native_app_architectures_pivotalMigrating to cloud-native_app_architectures_pivotal
Migrating to cloud-native_app_architectures_pivotal
 
Migrating_to_Cloud-Native_App_Architectures_Pivotal (2)
Migrating_to_Cloud-Native_App_Architectures_Pivotal (2)Migrating_to_Cloud-Native_App_Architectures_Pivotal (2)
Migrating_to_Cloud-Native_App_Architectures_Pivotal (2)
 
Migrating_to_Cloud-Native_App_Architectures_Pivotal (2)
Migrating_to_Cloud-Native_App_Architectures_Pivotal (2)Migrating_to_Cloud-Native_App_Architectures_Pivotal (2)
Migrating_to_Cloud-Native_App_Architectures_Pivotal (2)
 
Migrating_to_Cloud-Native_App_Architectures_Pivotal
Migrating_to_Cloud-Native_App_Architectures_PivotalMigrating_to_Cloud-Native_App_Architectures_Pivotal
Migrating_to_Cloud-Native_App_Architectures_Pivotal
 
Microservices: Why and When? - Alon Fliess, CodeValue - Cloud Native Day Tel ...
Microservices: Why and When? - Alon Fliess, CodeValue - Cloud Native Day Tel ...Microservices: Why and When? - Alon Fliess, CodeValue - Cloud Native Day Tel ...
Microservices: Why and When? - Alon Fliess, CodeValue - Cloud Native Day Tel ...
 
devops, platforms and devops platforms
devops, platforms and devops platformsdevops, platforms and devops platforms
devops, platforms and devops platforms
 
Implementing Service Oriented Architecture
Implementing Service Oriented ArchitectureImplementing Service Oriented Architecture
Implementing Service Oriented Architecture
 
Implementing Service Oriented Architecture
Implementing Service Oriented Architecture Implementing Service Oriented Architecture
Implementing Service Oriented Architecture
 
Implementing Service Oriented Architecture
Implementing Service Oriented ArchitectureImplementing Service Oriented Architecture
Implementing Service Oriented Architecture
 
Resetting Your Culture for Cloud-Native Success
Resetting Your Culture for Cloud-Native SuccessResetting Your Culture for Cloud-Native Success
Resetting Your Culture for Cloud-Native Success
 
Twelve Factor - Designing for Change
Twelve Factor - Designing for ChangeTwelve Factor - Designing for Change
Twelve Factor - Designing for Change
 
It summit 2014_migrating_applications_to_the_cloud-5
It summit 2014_migrating_applications_to_the_cloud-5It summit 2014_migrating_applications_to_the_cloud-5
It summit 2014_migrating_applications_to_the_cloud-5
 
Single vs. multi tenant cost comparison
Single vs. multi tenant cost comparisonSingle vs. multi tenant cost comparison
Single vs. multi tenant cost comparison
 
The "Why", "What" & "How" of Microservices - short version
The "Why", "What" & "How" of Microservices - short versionThe "Why", "What" & "How" of Microservices - short version
The "Why", "What" & "How" of Microservices - short version
 
Modernizing the monolithic architecture to container based architecture apaco...
Modernizing the monolithic architecture to container based architecture apaco...Modernizing the monolithic architecture to container based architecture apaco...
Modernizing the monolithic architecture to container based architecture apaco...
 

Recently uploaded

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Recently uploaded (20)

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using Ballerina
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Choreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringChoreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software Engineering
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Navigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseNavigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern Enterprise
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Quantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingQuantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation Computing
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptx
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 

Cloud Native Operations

  • 1. container-solutions.com info@container-solutions.com Cloud Native Operating Models @michmueller_ info@container-solutions.com container-solutions.com Cloud Native Operating Models Jan 2021 Michael Mueller @michmueller_ When digital transformation is done right, it’s like a caterpillar turning into a butterfly, but when done wrong all you have is a really fast caterpillar. - George Westerman
  • 2. container-solutions.com info@container-solutions.com Cloud Native Operating Models @michmueller_ About me ■ Michael Mueller ■ CC*O @ Container Solutions ■ obviously the best Cloud Native Consultancy in the world ■ CNCF Ambassador ■ I have two kids and a dog, so no time for fancy hobbies *[Container|Comedian|Consulting|Customer|Cloud Native]
  • 3. container-solutions.com info@container-solutions.com Cloud Native Operating Models @michmueller_ What is Cloud Native? ...not just tech!
  • 4. container-solutions.com info@container-solutions.com Cloud Native Operating Models @michmueller_ What is Cloud Native Digital Transformation Agile Transformation DevOps Transformation Cloud Native Transformation
  • 5. container-solutions.com info@container-solutions.com Cloud Native Operating Models @michmueller_ Cloud Native Platform Engineering
  • 6. container-solutions.com info@container-solutions.com Cloud Native Operating Models @michmueller_ but we’re going to talk about... SRE
  • 7. container-solutions.com info@container-solutions.com Cloud Native Operating Models @michmueller_ What has SRE to do with Cloud Native SREs operate AND improve systems/applications if they are manageable and well architected. → Follow Cloud Native architectural and development practises SREs apply Software Engineering practises towards operational tasks
  • 8. container-solutions.com info@container-solutions.com Cloud Native Operating Models @michmueller_ SRE Example Impact ● Processes 4 billion notifications per month ● Filters out 99,9999875 % of noise ● 99,7% of the remaining are auto remediated ● Developed and maintained by two SREs ● Doing the work of roughly 250 system administrators
  • 9. container-solutions.com info@container-solutions.com Cloud Native Operating Models @michmueller_ 10 Signs your App is not Cloud Native Your deployment involves manual steps An operator needs to decide on which server a new service instance goes Multiple services must be deployed at the same time to prevent downtime Your database change requires coordinating releases of multiple services Your releases regularly break other consuming services You can’t replace service instances one-by-one in a rolling manner One service crashing has a cascading effect and tears down the whole application You don’t know which request caused the exception in a service down the road Your application feels slow, but you can’t say which service is the culprit Your services are too chatty: One user transaction creates hundreds requests
  • 10. container-solutions.com info@container-solutions.com Cloud Native Operating Models @michmueller_ How does well architected look like? Identify Failure Scenarios: ● Service A is not able to communicate with Service B. ● Database is not accessible. ● Your application is not able to connect to the file system. ● Server is down or not responding. ● Inject faults/delays into the services Avoid Cascading Failure: When you have service dependencies built inside your architecture, you need to ensure that one failed service does not cause ripple effect among the entire chain. Avoid Single Point if Failure: Ensure that your services aren’t dependent on one single component. Handle Failures Gracefully and Allow for Fast Degradation: If there are errors/exceptions, the service should handle it gracefully by providing an error message or a default value. Design for Failures: By following some commonly used design patterns you can make your service self-healing.
  • 11. container-solutions.com info@container-solutions.com Cloud Native Operating Models @michmueller_ A Good Start - The Original 12 Factors Codebase: One codebase tracked in revision control, many deploys Dependencies: Explicitly declare and isolate dependencies Config: Store config in the environment Backing services: Treat backing services as attached resources Build, release, run: Strictly separate build and run stages Processes: Execute the app as one or more stateless processes Port binding (debated): Export services via port binding Concurrency: Scale out via the process model Disposability: Maximize robustness with fast startup and graceful shutdown Dev/prod parity (debated): Keep dev., staging, and production as similar as possible Logs: Treat logs as event streams Admin processes: Run admin/management tasks as one-off processes
  • 12. 3 more factors for “cloud-native-ness”
    ● Composability: Applications are composed of independent services
    ● Resilience: Failures of individual services have localized impact
    ● Observability: Metrics and service interactions are exposed as data
  • 13. Where does it come from? And how does it compare to DevOps?
  • 14. Origins of SRE
    ■ Early 2000s SRE evolved at Google
    ■ Independent of the DevOps movement
    ■ Happens to embody the philosophies of DevOps
    ■ SRE prescribes how to succeed in the various areas
  • 15. DevOps & Site Reliability Engineering reliability
  • 16. And why? Are we moving away from YBIYRI (you build it, you run it), aka DevOps?
  • 17. 2-Pizza-DevOps-Teams + 24/7 Ops = ¯_(ツ)_/¯
  • 18. 2-Pizza-DevOps-Teams + 24/7 Ops = ¯_(ツ)_/¯
    ■ say, 5 engineers capable and willing to handle on-call duty
    ■ 365 days, 2 people on-call (1 primary, 1 backup)
    ■ everyone is on duty 146 days a year (almost every 2nd week)!
    ⇒ DevOps alone can't reasonably operate critical systems 24/7 and deliver high-quality features
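The on-call arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the calculation; the function name and parameters are made up for illustration.

```python
def on_call_days_per_engineer(engineers: int, slots_per_day: int = 2,
                              days_per_year: int = 365) -> float:
    """Days per year each engineer is on call if shifts are shared evenly."""
    total_shifts = days_per_year * slots_per_day  # 365 * 2 = 730 person-days
    return total_shifts / engineers


days = on_call_days_per_engineer(engineers=5)
share_of_year = days / 365

print(days)           # 146.0
print(share_of_year)  # 0.4
```

With 5 engineers, each person carries 40% of the year on call, which is the "146 days" from the slide. Doubling the rotation to 10 engineers halves that to 73 days.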
  • 19. Ok, got it! How do they work?
  • 20. The guiding principles of SRE
    ■ The ability to regulate their own workload
    ■ Service Level Objectives (SLOs) with consequences
    ■ Time to make tomorrow better than today
    ■ Failure is an opportunity to improve
  • 21. Regulate their own workload
  • 22. Glossary of terms
    SLI (service level indicator): a well-defined measure of 'good enough' or of user pain
    ● used to specify SLO/SLA
    ● Software test / probe
    SLO (service level objective): a top-line target for the fraction of good interactions
    ● specifies goals (SLI + goal)
    SLA (service level agreement): consequences
    ● SLA = (SLO + margin) + consequences = SLI + goal + consequences
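One common way to read "SLO = SLI + goal" is to measure the SLI as the fraction of good interactions and compare it with the target. The Python sketch below assumes a request-based SLI; the function names and the sample numbers are illustrative, not from the talk.

```python
def sli(good_interactions: int, total_interactions: int) -> float:
    """Fraction of interactions that were 'good enough' for the user."""
    if total_interactions == 0:
        return 1.0  # Assumption: no traffic counts as meeting the objective.
    return good_interactions / total_interactions


def meets_slo(sli_value: float, slo_target: float) -> bool:
    """An SLO is simply an SLI plus a goal it has to reach."""
    return sli_value >= slo_target


measured = sli(good_interactions=99_900, total_interactions=100_000)
print(measured)                     # 0.999
print(meets_slo(measured, 0.9985))  # True
```

An SLA would add a margin and contractual consequences on top of this check.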
  • 23. Error Budgets!
    ■ Don’t use the 99.85% SLO in daily discussions
    ■ View it as a 0.15% Error Budget instead!
    ■ Can also be seen as a user discomfort budget
    ■ Negotiate the Error Budget with business stakeholders
    ■ Use free error budget for innovation, e.g. to release features early (most outages are caused by changes, like releases)
    ■ Error budget blown? ➔ Release freeze until the budget is replenished
    ■ Make it public
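The "budget blown ➔ release freeze" rule can be sketched in Python as follows. This assumes a request-based budget over some measurement window; the function names and the traffic numbers are hypothetical.

```python
def allowed_failures(slo_target: float, total_requests: int) -> float:
    """Error budget expressed as a number of requests that may fail."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Requests that may still fail before the budget is used up."""
    return allowed_failures(slo_target, total_requests) - failed_requests


def release_allowed(slo_target: float, total_requests: int,
                    failed_requests: int) -> bool:
    """Releases are frozen once the error budget is spent."""
    return budget_remaining(slo_target, total_requests, failed_requests) > 0


# A 99.85% SLO over 1,000,000 requests leaves a budget of about 1,500 failures.
print(release_allowed(0.9985, 1_000_000, failed_requests=1_000))  # True
print(release_allowed(0.9985, 1_000_000, failed_requests=2_000))  # False
```

Publishing the remaining budget, as the slide suggests, lets teams see for themselves how much risk they can still afford.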
  • 24. ITIL Approximation
    ● Service in SLO → most operational work is a standard change
    ● Service close to being out of SLO → revert to normal change
    (No, I don't understand the difference between "standard" and "normal" either…)
  • 25. Benefits of error budgets
    ● Teams become self-policing: The error budget is a valuable resource for them
    ● Shared responsibility for system uptime: Infrastructure failures eat into the devs’ error budget
    ● Common incentive for your DevOps/SRE team: Find the right balance between innovation and reliability, or, put differently, Features vs. Technical Debt
    ● Teams can manage the risk themselves: They decide how to spend their error budget
    ● Unrealistic reliability goals become unattractive: Such goals dampen the velocity of innovation
  • 26. SLOs With Consequences
  • 27. Which systems should aspire to 100% availability?
  • 28. Which probably shouldn’t?
  • 29. Why?
    "No user can tell the difference between a system being 100% available and, let's say, 99.999% available."
    -- Ben Treynor, VP of Engineering at Google
  • 30. The cost of inadequate availability targets
    Too low:
    ● Loss of revenue due to lower usage of the product
    ● Expensive workarounds in other systems that need to duplicate unreliable features
    ● Frustrated customers and loss of reputation due to an unreliable product
    Too high:
    ● Long time-to-market for new features due to excessive test periods
    ● Disproportionately higher cost for development and infrastructure
    ● Dependent systems gravitate to higher coupling as they get used to the HA
    ● Frustrated developers and stakeholders as they can’t ship new features
    Image credit: Google
  • 31. “Nines” cost money and add complexity
    Availability Table
    Target SLO | Error Budget / 30 days   | Requires
    99.999 %   | 0.43 min (about 26 sec)  | Automated failover
    99.99 %    | 4.32 min                 | Automated rollback
    99.95 %    | 21.6 min                 | Automated rollback
    99.9 %     | 43.2 min                 | Comprehensive monitoring, 24/7 on-call
    99.5 %     | 216 min                  | Comprehensive monitoring, 24/7 on-call
    99 %       | 432 min                  | Alerting via user complaints
    Image credit: Google
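The error budget column follows directly from the target: a 30-day window has 30 × 24 × 60 = 43,200 minutes, and the budget is the share of those minutes the service is allowed to be unavailable. A minimal Python sketch of that calculation (the function name is made up for illustration):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a given SLO over the window."""
    window_minutes = window_days * 24 * 60  # 43,200 minutes for 30 days
    return (100.0 - slo_percent) / 100.0 * window_minutes


for slo in (99.999, 99.99, 99.95, 99.9, 99.5, 99.0):
    # Rounded for display; floating-point results are not exact.
    print(f"{slo} % -> {error_budget_minutes(slo):.2f} min")
```

Each additional nine cuts the budget by a factor of ten, which is why the "Requires" column moves from user complaints to fully automated failover.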
  • 32. Continuous improvements: Time to make tomorrow better than today
  • 33. Prerequisites
    ● Blameless Post-Mortems*
    ● Teams must lean towards automation:
      ○ Self-Service / APIs
      ○ GitOps
      ○ Test Automation
      ○ Continuous Delivery
    ● …
    *) https://codeascraft.com/2012/05/22/blameless-postmortems/
  • 34. Failure is an opportunity to improve: If humans aren’t enough, artificially create failures
  • 35. Related: Netflix Chaos Monkey*
    ■ Forces service decoupling by randomly disabling services or components
    ■ Beginners: Use the monkey only in a test environment and file cascading failures as bugs
    ■ Advanced: Use it in production (during business hours)
    ■ Pro: Use it in production (24 x 7)
    *) https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey
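The core idea of "randomly disabling services" can be sketched in a few lines of Python. This is not Chaos Monkey's actual implementation or API; the function, the service names and the `disable` callback are hypothetical stand-ins for whatever a test environment uses to stop an instance.

```python
import random
from typing import Callable, List, Sequence


def disable_random_service(services: Sequence[str],
                           disable: Callable[[str], None],
                           rng: random.Random) -> str:
    """Pick one service at random, disable it, and report which one it was."""
    victim = rng.choice(services)
    disable(victim)
    return victim


# Stand-in for a real "stop this instance" call in a test environment.
disabled: List[str] = []
victim = disable_random_service(["cart", "search", "checkout"],
                                disable=disabled.append,
                                rng=random.Random(42))
print(f"Disabled: {victim}")
```

If the rest of the application keeps working after the victim is gone, the services are decoupled; any cascading failure that shows up is a bug to file.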
  • 36. Site Reliability Engineering, Summary
    ■ Keep your users happy
    ■ Manage the innovation / reliability tension
    ■ Maintain all the things
  • 37. Free SRE e-book