Cloud Native is more than a set of tools. It is a full architecture, a philosophical approach to building applications that take full advantage of cloud computing, and an organisational change. Going Cloud Native requires an organisation to shift not only its tech stack but also its culture, processes and team setup. In this talk I'll dive into possible operating models for Cloud Native systems.
Cloud Native Operations
1. container-solutions.com info@container-solutions.com Cloud Native Operating Models @michmueller_
Cloud Native Operating Models
Jan 2021
Michael Mueller
@michmueller_
When digital transformation is done right, it’s like a
caterpillar turning into a butterfly, but when done wrong
all you have is a really fast caterpillar.
- George Westerman
2.
About me
■ Michael Mueller
■ CC*O @ Container Solutions
■ obviously the best Cloud Native Consultancy in the world
■ CNCF Ambassador
■ I have two kids and a dog, so no time for fancy hobbies
*[Container|Comedian|Consulting|Customer|Cloud Native]
7.
What does SRE have to do with Cloud Native?
SREs operate AND improve systems/applications, provided they are manageable and
well architected.
→ Follow Cloud Native architectural and development practices
SREs apply software engineering practices to operational tasks
8.
SRE Example Impact
● Processes 4 billion notifications per month
● Filters out 99.9999875% of noise
● 99.7% of the remaining notifications are auto-remediated
● Developed and maintained by two SREs
● Doing the work of roughly 250 system
administrators
9.
10 Signs your App is not Cloud Native
Your deployment involves manual steps
An operator needs to decide on which server
a new service instance goes
Multiple services must be deployed at the
same time to prevent downtime
Your database change requires coordinating
releases of multiple services
Your releases regularly break other
consuming services
You can’t replace service instances one-by-one
in a rolling manner
One service crashing has a cascading effect and
tears down the whole application
You don’t know which request caused the
exception in a service down the road
Your application feels slow, but you can’t say
which service is the culprit
Your services are too chatty: one user
transaction creates hundreds of requests
10.
What does well architected look like?
Identify Failure Scenarios:
● Service A is not able to communicate with
Service B.
● Database is not accessible.
● Your application is not able to connect to the
file system.
● Server is down or not responding.
● Inject faults/delays into the services
Avoid Cascading Failure:
When you have service dependencies built into
your architecture, you need to ensure that one
failed service does not cause a ripple effect across
the entire chain.
Avoid Single Points of Failure:
Ensure that your services aren’t dependent on one
single component.
Handle Failures Gracefully and Allow for Fast
Degradation:
If there are errors/exceptions, the service should
handle them gracefully by providing an error message
or a default value.
Design for Failures:
By following some commonly used design
patterns you can make your service self-healing.
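One such pattern is the circuit breaker combined with a default value. The following is a minimal sketch, not a production implementation: the class name, the thresholds and the injectable clock are assumptions made for illustration. After a number of consecutive failures it stops calling the dependency for a cool-down period and returns the default instead, which limits cascading failure.

```python
import time


class CircuitBreaker:
    """Wraps calls to a dependency. After too many consecutive failures
    it skips the dependency for a while and serves a default value."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # illustrative threshold
        self.reset_after_s = reset_after_s  # illustrative cool-down
        self.clock = clock                  # injectable for testing
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, default):
        # While the circuit is open, do not touch the dependency at all.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return default
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return default  # degrade gracefully instead of propagating
        self.failures = 0
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_recommendations, default=[])`, where `fetch_recommendations` stands for any hypothetical function that may raise.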
11.
A Good Start - The Original 12 Factors
Codebase: One codebase tracked in revision
control, many deploys
Dependencies: Explicitly declare and isolate
dependencies
Config: Store config in the environment
Backing services: Treat backing services as
attached resources
Build, release, run: Strictly separate build and
run stages
Processes: Execute the app as one or more
stateless processes
Port binding (debated): Export services via port
binding
Concurrency: Scale out via the process model
Disposability: Maximize robustness with fast
startup and graceful shutdown
Dev/prod parity (debated): Keep dev., staging,
and production as similar as possible
Logs: Treat logs as event streams
Admin processes: Run admin/management tasks
as one-off processes
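As an illustration of the Config factor, here is a small sketch that reads all settings from environment variables. The variable names (DATABASE_URL, PORT, LOG_LEVEL) and the defaults are assumptions for this example, not part of the 12 factors themselves.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    port: int
    log_level: str


def load_settings(env=os.environ):
    """Build settings from environment variables only (no config files)."""
    return Settings(
        # Required: fail fast at startup if the backing service is not configured.
        database_url=env["DATABASE_URL"],
        # Optional values fall back to illustrative defaults.
        port=int(env.get("PORT", "8080")),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

The same build artefact can then run in dev, staging and production, with only the environment differing between deploys.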
12.
3 more factors for “cloud-native-ness”
Composability: Applications are composed of
independent services
Resilience: Failures of individual services have
localized impact
Observability: Metrics and service
interactions are exposed as data
14.
Origins of SRE
■ Early 2000s SRE evolved at Google
■ Independent of the DevOps movement
■ Happens to embody the philosophies of DevOps
■ SRE prescribes how to succeed in the various areas
18.
2-Pizza-DevOps-Teams + 24/7 Ops = ¯_(ツ)_/¯
■ say, 5 engineers capable and willing to handle on-call duty
■ 365 days, 2 people on-call (1 primary, 1 backup)
■ everyone is on duty 146 days a year (almost every 2nd week)!
⇒ DevOps alone can't reasonably operate critical systems 24/7 and
deliver features at high quality
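The arithmetic behind the 146 days can be checked with a few lines. This sketch assumes an even rotation across the whole team; the function name is made up for illustration.

```python
def on_call_days_per_engineer(engineers, people_on_call, days=365):
    """Days per year each engineer is on call, assuming an even rotation."""
    return days * people_on_call / engineers


days = on_call_days_per_engineer(engineers=5, people_on_call=2)
share = days / 365
print(f"{days:.0f} days on call per engineer ({share:.0%} of the year)")
# prints: 146 days on call per engineer (40% of the year)
```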
20.
The guiding principles of SRE
■ The ability to regulate their own workload
■ Service Level Objectives (SLOs) with consequences
■ Time to make tomorrow better than today
■ Failure is an opportunity to improve
22.
Glossary of terms
SLI (service level indicator): a well-defined measure of 'good enough or user pains'
● used to specify SLO/SLA
● Software test / probe
SLO (service level objective): a top-line target for the fraction of good interactions
● specifies goals (SLI + goal)
SLA (service level agreement): consequences
● SLA = (SLO + margin) + consequences = SLI + goal + consequences
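To make the relationship between SLI and SLO concrete, here is a minimal sketch. It assumes the SLI is measured as the fraction of good interactions out of all interactions; the function names and the sample numbers are illustrative only.

```python
def sli(good_events, total_events):
    """Service level indicator: fraction of interactions that were good."""
    if total_events == 0:
        return 1.0  # no traffic means nothing failed
    return good_events / total_events


def meets_slo(good_events, total_events, slo_target):
    """SLO = SLI + goal: compare the measured SLI against the target."""
    return sli(good_events, total_events) >= slo_target


# Example: 999,000 good requests out of 1,000,000 against a 99.85% target.
print(sli(999_000, 1_000_000))                # 0.999
print(meets_slo(999_000, 1_000_000, 0.9985))  # True
```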
23.
Error Budgets!
■ Don’t use the 99.85% SLO in daily discussions
■ View it as 0.15% Error Budget instead!
■ Can also be seen as user discomfort budget
■ Negotiate Error Budget with business stakeholders
■ Use free error budgets for innovation, e.g. to release features early
(most outages are caused by changes, like releases)
■ Error budget blown? ➔ Release freeze until budget is replenished
■ Make it public
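The release-freeze rule can be sketched as a simple calculation. This is an assumption-laden example: it counts good and bad events within one measurement window and leaves out how the window rolls forward to replenish the budget. All names are hypothetical.

```python
def error_budget_remaining(slo_target, good_events, total_events):
    """Unspent share of the error budget:
    1.0 = untouched, 0.0 = used up, negative = overspent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


def release_allowed(slo_target, good_events, total_events):
    """Release freeze applies once the budget is used up."""
    return error_budget_remaining(slo_target, good_events, total_events) > 0.0


# 99.85% SLO, 1,000,000 requests: the budget is about 1,500 bad requests.
print(release_allowed(0.9985, 999_000, 1_000_000))  # 1,000 bad -> True
print(release_allowed(0.9985, 998_000, 1_000_000))  # 2,000 bad -> False
```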
24.
ITIL Approximation
Service in SLO → most operational work is a standard change
Service close to being out of SLO → revert to normal change
(No, I don't understand the difference between "standard" and "normal" either…)
25.
Benefits of error budgets
● Teams become self-policing
The error budget is a valuable resource for them
● Shared responsibility for system uptime
Infrastructure failures eat into the devs’ error budget
● Common incentive for your DevOps/SRE team
Find the right balance between innovation and
reliability, or put differently, features vs. technical
debt
● Teams can manage the risk themselves
They decide how to spend their error budget
● Unrealistic reliability goals become
unattractive
Such goals dampen the velocity of innovation
29.
Why?
no user can tell the difference between a
system being 100% available and, let's say,
99.999% available
-- Ben Treynor, VP of Engineering at Google
30.
The cost of inadequate availability targets
Too low:
● Loss of revenue due to lower usage of the
product
● Expensive workarounds in other systems that
need to duplicate unreliable features
● Frustrated customers and loss of reputation
due to an unreliable product
Too high:
● Long time-to-market for new features due to
excessive test periods
● Disproportionately higher cost for development
and infrastructure
● Dependent systems gravitate to higher coupling
as they get used to the high availability
● Frustrated developers and stakeholders as they
can’t ship new features
Image credit: Google
31.
“Nines” cost money and add complexity
Availability Table
Target SLO   Error Budget / 30 days      Requires
99.999 %     0.43 min (approx. 26 sec)   Automated failover
99.99 %      4.32 min                    Automated rollback
99.95 %      21.6 min                    Automated rollback
99.9 %       43.2 min                    Comprehensive monitoring, 24/7 on-call
99.5 %       216 min                     Comprehensive monitoring, 24/7 on-call
99 %         432 min                     Alerting via user complaints
Image credit: Google
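The error budget column follows directly from the target: a 30-day window has 43,200 minutes, and the budget is the share of that window the SLO allows to fail. A short sketch to reproduce the figures (the function name is an assumption for illustration):

```python
WINDOW_MINUTES = 30 * 24 * 60  # 30-day window, as in the table


def error_budget_minutes(slo_percent, window_minutes=WINDOW_MINUTES):
    """Minutes of allowed unavailability per window for a given SLO."""
    return (100.0 - slo_percent) / 100.0 * window_minutes


for slo in (99.999, 99.99, 99.95, 99.9, 99.5, 99.0):
    print(f"{slo:>7} %  {error_budget_minutes(slo):8.2f} min")
```

Each additional nine cuts the budget by a factor of ten, which is why the requirements column moves from on-call humans to automated rollback and failover.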
35.
Related: Netflix Chaos Monkey*
■ Forces service decoupling by
randomly disabling services or
components
■ Beginners: Use the monkey only in
a test environment and file
cascading failures as bugs
■ Advanced: Use it in production
(during business hours)
■ Pro: Use it in production (24 x 7)
*) https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey
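The idea can be tried on paper before touching any environment. The sketch below is a toy simulation and does not use Chaos Monkey or any of its APIs: the service names and the dependency map are hypothetical. It disables one randomly chosen service and lists which others would go down with it, i.e. the cascading failures a beginner would file as bugs.

```python
import random

# Hypothetical services: name -> hard dependencies (no fallback in place).
HARD_DEPENDENCIES = {
    "frontend": {"cart", "catalog"},
    "cart": {"database"},
    "catalog": set(),
    "database": set(),
}


def affected_by(disabled, dependencies):
    """Return all services that fail when `disabled` is switched off."""
    failed = {disabled}
    changed = True
    while changed:  # repeat until no further service is dragged down
        changed = False
        for service, deps in dependencies.items():
            if service not in failed and deps & failed:
                failed.add(service)
                changed = True
    return failed


def chaos_round(dependencies, rng=random):
    """Disable one random service; return it and the resulting cascade."""
    victim = rng.choice(sorted(dependencies))
    cascade = affected_by(victim, dependencies) - {victim}
    return victim, cascade


victim, cascade = chaos_round(HARD_DEPENDENCIES)
print(f"disabled {victim}; cascading failures: {sorted(cascade)}")
```

A well-decoupled system would report an empty cascade for every victim.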