Cloud Native Operations

Cloud Native Operating Models
Michael Mueller, @michmueller_
Jan 2021
container-solutions.com, info@container-solutions.com
When digital transformation is done right, it’s like a
caterpillar turning into a butterfly, but when done wrong
all you have is a really fast caterpillar.
- George Westerman

About me
■ Michael Mueller
■ CC*O @ Container Solutions
■ obviously the best Cloud Native Consultancy in the world
■ CNCF Ambassador
■ I have two kids and a dog, so no time for fancy hobbies
*[Container|Comedian|Consulting|Customer|Cloud Native]

What is Cloud Native?
...not just tech!

What is Cloud Native
Digital Transformation
Agile Transformation
DevOps Transformation
Cloud Native Transformation

Cloud Native
Platform
Engineering

but we’re going to talk about... SRE

What does SRE have to do with Cloud Native?
SREs operate AND improve systems/applications if they are manageable and well architected.
→ Follow Cloud Native architectural and development practices
SREs apply Software Engineering practices to operational tasks

SRE Example Impact
● Processes 4 billion notifications per month
● Filters out 99.9999875% of noise
● 99.7% of the remaining notifications are auto-remediated
● Developed and maintained by two SREs
● Doing the work of roughly 250 system administrators
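To make the scale of these numbers tangible, here is a small back-of-the-envelope calculation. The figures come from the slide; the variable names are only illustrative.

```python
# Figures taken from the slide; names are illustrative.
notifications_per_month = 4_000_000_000
noise_filtered_ratio = 0.999999875   # 99.9999875% filtered out as noise
auto_remediated_ratio = 0.997        # 99.7% of the remainder is auto-remediated

# What survives the noise filter.
remaining = notifications_per_month * (1 - noise_filtered_ratio)
# What is left after auto-remediation and needs a human.
needs_human = remaining * (1 - auto_remediated_ratio)

print(f"actionable notifications per month: {remaining:.0f}")
print(f"left for a human per month: {needs_human:.1f}")
```

Under these figures, roughly 500 notifications per month survive the filter, and only one or two of them need human attention.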

10 Signs your App is not Cloud Native
● Your deployment involves manual steps
● An operator needs to decide on which server a new service instance goes
● Multiple services must be deployed at the same time to prevent downtime
● Your database change requires coordinating releases of multiple services
● Your releases regularly break other consuming services
● You can’t replace service instances one-by-one in a rolling manner
● One service crashing has a cascading effect and tears down the whole application
● You don’t know which request caused the exception in a service down the road
● Your application feels slow, but you can’t say which service is the culprit
● Your services are too chatty: One user transaction creates hundreds of requests

What does well architected look like?
Identify Failure Scenarios:
● Service A is not able to communicate with Service B.
● Database is not accessible.
● Your application is not able to connect to the file system.
● Server is down or not responding.
● Inject faults/delays into the services
Avoid Cascading Failure:
When you have service dependencies built into your architecture, you need to ensure that one failed service does not cause a ripple effect across the entire chain.
Avoid Single Point of Failure:
Ensure that your services aren’t dependent on one single component.
Handle Failures Gracefully and Allow for Fast Degradation:
If there are errors/exceptions, the service should handle them gracefully by providing an error message or a default value.
Design for Failures:
By following some commonly used design patterns you can make your service self-healing.
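One commonly used pattern for graceful degradation and for avoiding cascading failures is a circuit breaker with a fallback value. The sketch below is a minimal illustration, not a production implementation; the class name, its parameters, and the `flaky_recommendations` function are assumptions made for this example.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: after too many consecutive failures, stop calling
    the dependency for a while and answer with the fallback right away."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, func, fallback):
        # While the circuit is open, skip the dependency until the timeout expires.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback
            self.opened_at = None  # timeout over: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback  # degrade gracefully instead of propagating the error
        self.failures = 0
        return result


def flaky_recommendations():
    # Hypothetical dependency: "Service A is not able to communicate with Service B".
    raise ConnectionError("Service B is not reachable")


breaker = CircuitBreaker(max_failures=2, reset_after_s=60.0)
# The caller always gets the default value (an empty list) instead of an exception.
results = [breaker.call(flaky_recommendations, fallback=[]) for _ in range(5)]
print(results)
```

After two failed attempts the breaker stops calling the broken dependency, so a struggling service is not hammered with further requests while its callers keep answering with a default.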

A Good Start - The Original 12 Factors
Codebase: One codebase tracked in revision control, many deploys
Dependencies: Explicitly declare and isolate dependencies
Config: Store config in the environment
Backing services: Treat backing services as attached resources
Build, release, run: Strictly separate build and run stages
Processes: Execute the app as one or more stateless processes
Port binding (debated): Export services via port binding
Concurrency: Scale out via the process model
Disposability: Maximize robustness with fast startup and graceful shutdown
Dev/prod parity (debated): Keep dev., staging, and production as similar as possible
Logs: Treat logs as event streams
Admin processes: Run admin/management tasks as one-off processes
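As an illustration of the Config factor, the following sketch reads all settings from environment variables. The variable names (`DATABASE_URL`, `REQUEST_TIMEOUT_S`, `DEBUG`) and the `Settings` class are hypothetical and chosen only for this example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    request_timeout_s: float
    debug: bool


def load_settings(env=os.environ):
    """Build the settings from environment variables (hypothetical names)."""
    return Settings(
        database_url=env["DATABASE_URL"],  # required: raises KeyError if missing
        request_timeout_s=float(env.get("REQUEST_TIMEOUT_S", "2.5")),
        debug=env.get("DEBUG", "false").lower() == "true",
    )


# Usage with an explicit mapping; in a real deploy the process environment is used.
settings = load_settings({"DATABASE_URL": "postgres://db.example/app", "DEBUG": "true"})
print(settings)
```

The same build can then run unchanged in dev, staging, and production, because only the environment differs between deploys.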

3 more factors for “cloud-native-ness”
Composability: Applications are composed of independent services
Resilience: Failures of individual services have localized impact
Observability: Metrics and service interactions are exposed as data

Where does it come from?
And how does it compare to DevOps?

Origins of SRE
■ SRE evolved at Google in the early 2000s
■ Independent of the DevOps movement
■ Happens to embody the philosophies of DevOps
■ SRE prescribes how to succeed in the various DevOps areas

DevOps & Site Reliability Engineering

And why?
Are we moving away from YBIYRI ("you build it, you run it"), aka DevOps?

2-Pizza-DevOps-Teams + 24/7 Ops = ¯_(ツ)_/¯
■ say, 5 engineers capable and willing to handle on-call duty
■ 365 days, 2 people on-call (1 primary, 1 backup)
■ everyone is on duty 146 days a year (365 × 2 / 5, almost every 2nd week)!
⇒ DevOps alone can't reasonably operate critical systems 24/7 and deliver high-quality features
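The on-call arithmetic behind this slide can be written out as follows. The numbers are the ones from the slide; the variable names are illustrative.

```python
engineers = 5                # capable and willing to handle on-call duty
days_per_year = 365
people_on_call_per_day = 2   # 1 primary + 1 backup

# Total person-days of on-call duty that must be covered per year.
on_call_days_total = days_per_year * people_on_call_per_day
# Share of that load per engineer.
days_per_engineer = on_call_days_total / engineers
share_of_year = days_per_engineer / days_per_year

print(f"{days_per_engineer:.0f} on-call days per engineer ({share_of_year:.0%} of the year)")
```

Each engineer ends up on call for 40% of the year, which is why a single small team struggles to combine 24/7 operations with feature work.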

Ok, got it!
How do they work?

The guiding principles of SRE
■ The ability to regulate their own workload
■ Service Level Objectives (SLOs) with consequences
■ Time to make tomorrow better than today
■ Failure is an opportunity to improve

Regulate their own workload

Glossary of terms
SLI (service level indicator): a well-defined measure of 'good enough or user pains'
● used to specify SLO/SLA
● Software test / probe
SLO (service level objective): a top-line target for fraction of good interactions
● specifies goals (SLI + goal)
SLA (service level agreement): consequences
● SLA = (SLO + margin) + consequences = SLI + goal + consequences
Error Budgets!
■ Don’t use the 99.85% SLO in daily discussions
■ View it as a 0.15% Error Budget instead!
■ Can also be seen as a user discomfort budget
■ Negotiate the Error Budget with business stakeholders
■ Use unspent error budget for innovation, e.g. to release features early (most outages are caused by changes, like releases)
■ Error budget blown? ➔ Release freeze until budget is replenished
■ Make it public
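A minimal sketch of how an error budget could be tracked for a request-based SLO follows. The function name, the request counts, and the rule that the freeze starts once the remaining budget reaches zero are assumptions for illustration.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Turn an SLO into an error budget and check how much of it is left."""
    # Budget in whole requests, e.g. 0.15% of all requests for a 99.85% SLO.
    allowed_failures = round((1 - slo_target) * total_requests)
    remaining = allowed_failures - failed_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": remaining,
        "release_freeze": remaining <= 0,  # assumption: freeze once nothing is left
    }


# Sample values, assumed for illustration only.
healthy = error_budget_report(0.9985, total_requests=1_000_000, failed_requests=900)
blown = error_budget_report(0.9985, total_requests=1_000_000, failed_requests=1_800)
print(healthy)
print(blown)
```

In the first case 600 failures of budget remain and can be spent on riskier releases; in the second the budget is overspent and releases pause.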

ITIL Approximation
Service in SLO → most operational work is a standard change
Service close to being out of SLO → revert to normal change
(No, I don't understand the difference between "standard" and "normal" either…)

Benefits of error budgets
● Common incentive for your DevOps/SRE team
Find the right balance between innovation and reliability, or, better put, features vs. technical debt
● Teams can manage the risk themselves
They decide how to spend their error budget
● Unrealistic reliability goals become unattractive
Such goals dampen the velocity of innovation
● Teams become self-policing
The error budget is a valuable resource for them
● Shared responsibility for system uptime
Infrastructure failures eat into the devs’ error budget

SLOs with Consequences

Which systems should aspire to 100% availability?

Which probably shouldn’t?

Why?
no user can tell the difference between a
system being 100% available and, let's say,
99.999% available
-- Ben Treynor, VP of Engineering at Google

The cost of inadequate availability targets
Too low:
● Loss of revenue due to lower usage of the product
● Expensive workarounds in other systems that need to duplicate unreliable features
● Frustrated customers and loss of reputation due to an unreliable product
Too high:
● Long time-to-market for new features due to excessive test periods
● Disproportionately higher cost for development and infrastructure
● Dependent systems gravitate to higher coupling as they get used to the HA
● Frustrated developers and stakeholders as they can’t ship new features
Image credit: Google

“Nines” cost money and add complexity
Availability Table
Target SLO   Error Budget / 30 days   Requires
99.999 %     0.43 min (26 sec)        Automated failover
99.99 %      4.32 min                 Automated rollback
99.95 %      21.6 min                 Automated rollback
99.9 %       43.2 min                 Comprehensive monitoring, 24/7 on-call
99.5 %       216 min                  Comprehensive monitoring, 24/7 on-call
99 %         432 min                  Alerting via user complaints
Image credit: Google
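The error budget column follows directly from the SLO and the 30-day window. The short sketch below recomputes it; the function and constant names are illustrative.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in the window


def error_budget_minutes(slo_percent, window_minutes=MINUTES_PER_30_DAYS):
    """Allowed downtime in minutes for a given availability target."""
    return (100 - slo_percent) / 100 * window_minutes


for slo in (99.999, 99.99, 99.95, 99.9, 99.5, 99.0):
    print(f"{slo:>7} %  {error_budget_minutes(slo):8.2f} min")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why the requirements column moves from user complaints to automated failover.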

Continuous improvements
Time to make tomorrow better than today

Prerequisites
● Blameless Post-Mortems*
● Teams must lean towards automation:
○ Self-Service / APIs
○ GitOps
○ Test Automation
○ Continuous Delivery
● …
*) https://codeascraft.com/2012/05/22/blameless-postmortems/

Failure is an opportunity to improve
If humans aren’t enough, artificially create failures

Related: Netflix Chaos Monkey*
■ Forces service decoupling by randomly disabling services or components
■ Beginners: Use the monkey only in a test environment and file cascading failures as bugs
■ Advanced: Use it in production (during business hours)
■ Pro: Use it in production (24 x 7)
*) https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey
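The beginner stage can be illustrated with a toy simulation. This is not the Chaos Monkey API; it is a self-contained sketch in which the service topology, the set of hard dependencies, and all function names are assumptions made up for the example.

```python
import random

# Hypothetical test-environment topology: service -> services it calls.
DEPENDENCIES = {
    "frontend": ["cart", "recommendations"],
    "cart": ["database"],
    "recommendations": [],
    "database": [],
}
# Assumption: calls for which the caller has no fallback implemented.
HARD_DEPENDENCIES = {("frontend", "cart"), ("cart", "database")}


def failed_services(disabled):
    """Return every service that goes down when `disabled` is switched off."""
    down = {disabled}
    changed = True
    while changed:  # repeat until no further service is dragged down
        changed = False
        for service, deps in DEPENDENCIES.items():
            if service in down:
                continue
            if any(dep in down and (service, dep) in HARD_DEPENDENCIES for dep in deps):
                down.add(service)
                changed = True
    return down


def chaos_round(rng):
    """Disable one random service and report which others failed with it."""
    victim = rng.choice(sorted(DEPENDENCIES))
    cascade = failed_services(victim) - {victim}
    return victim, cascade


victim, cascade = chaos_round(random.Random())
if cascade:
    print(f"BUG: disabling {victim} also took down {sorted(cascade)}")
else:
    print(f"OK: disabling {victim} had only local impact")
```

Every reported cascade is a candidate bug to file: the fix is usually a fallback or a looser coupling on the corresponding hard dependency.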

Site Reliability Engineering, Summary
■ Keep your users happy
■ Manage the innovation / reliability tension
■ Maintain all the things

Free SRE e-book
