Availability in a cloud native world - Guidelines for mere mortals v2.0
1. Availability in a Cloud-native World.
Guidelines for mere mortals.
Academy of Technology - PREVAIL 2019 – München 🇩🇪
—
Haytham Elkhoja
Chief Architect & Global Tech Leader
IBM Services - Continuous Availability (a.k.a Always On)
haytham.elkhoja@ibm.com
Relevant links and assets:
https://ibm.biz/alwaysonbook
2. /WHOIS
@hek
/in/haytham.Elkhoja
3. March 2017: “Amazon broke the internet with a typo” (cnn.com)
Impacted apps:
- Netflix
- HootSuite
- Expedia
- Slack
- Business Insider
- Reddit
4. June 2019: “Google details 'catastrophic' cloud outage events: Promises to do better next time” (zdnet.com)
Impacted apps:
- Snapchat
- Spotify
- Google Docs
- YouTube
- Pokémon GO
- Gmail
6. On why outages happen.
Planned outages:
- App and DB: 67%
- Batch: 11%
- Hardware: 14%
- Environmental: 8%
Unplanned outages:
- Process: 40%
- Application: 40%
- Hardware: 10%
- OS: 10%
8. Keeping your app available during planned and unplanned outages or failures requires geographically distributed, multi-active, multi-region deployments.
[Diagram labels: Users; Traffic; Data Replication; Session Replication]
9. The IBM Always On Pattern starts at the infrastructure layer, progresses to the data, influences application design and extends to the people and the culture.
Herbie Pearthree, Distinguished Engineer
hpear3@us.ibm.com
11. Thinking differently about Availability in a Cloud-native world.
- State & Consistency
- Chaos & Validation
- Zones, Regions & Swimlanes
- Portability & Deployment
13. Code differently.
Cloud-native apps should be self-contained, polyglot, loosely coupled, cattle-scaled, immutable, idempotent, ephemeral and protocol-aware.
14. No two clouds are created equal.
Architect for cloud mobility. Your app should be cloud-, infrastructure- and OS-agnostic. The twelve-factor patterns will help you get there.
15. No strings attached.
Bootstrap configuration through environment variables. This is also a requirement for environment parity, and for your own sanity.
# Dockerfile: the image carries no environment-specific settings
FROM alpine:3.1
COPY app /app
COPY docker-entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]

# Build once, then inject the configuration at run time
docker build -t app:v2 .
docker run --rm \
  -e "APP_DATADIR=/var/lib/data" \
  -e "APP_HOST=host.com" \
  -e "APP_PORT=3306" \
  -e "APP_USERNAME=user" \
  -e "APP_PASSWORD=password" \
  -e "APP_DATABASE=test" \
  app:v2
2019/10/15 04:44:29 Starting application...
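On the application side, the same idea might look like the sketch below. It reads the APP_* variables from the docker run example and fails fast when one is missing. The names AppConfig and load_config are made up for illustration and are not part of the original deck.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AppConfig:
    """Hypothetical settings object; fields mirror the APP_* variables."""
    datadir: str
    host: str
    port: int
    username: str
    password: str
    database: str


REQUIRED = ["APP_DATADIR", "APP_HOST", "APP_PORT",
            "APP_USERNAME", "APP_PASSWORD", "APP_DATABASE"]


def load_config(env=os.environ) -> AppConfig:
    """Build the config from the environment; refuse to start half-configured."""
    missing = [name for name in REQUIRED if name not in env]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return AppConfig(
        datadir=env["APP_DATADIR"],
        host=env["APP_HOST"],
        port=int(env["APP_PORT"]),
        username=env["APP_USERNAME"],
        password=env["APP_PASSWORD"],
        database=env["APP_DATABASE"],
    )
```

Nothing environment-specific lives in the code or the image, so the same build can run in every environment.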
16. Delegate responsibilities.
Whatever as a Service. Somebody, somewhere, has already done a much better job.
17. Trim down the fat.
Dependency management with multi-stage builds is an art one must pursue to keep apps clean and lean.
18. Got Syslog?
Write timestamped log events to STDOUT and STDERR, and make it clear which component is the source.
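A minimal sketch of that rule using Python's standard logging module: routine events go to STDOUT, warnings and errors go to STDERR, and every line carries a timestamp and the name of its source. The source name checkout-service and the helper build_logger are assumptions for illustration.

```python
import logging
import sys


class MaxLevelFilter(logging.Filter):
    """Let through only records below a given level."""

    def __init__(self, max_level: int) -> None:
        super().__init__()
        self.max_level = max_level

    def filter(self, record: logging.LogRecord) -> bool:
        return record.levelno < self.max_level


def build_logger(source: str) -> logging.Logger:
    """Return a logger that writes to the standard streams only, never to files."""
    fmt = logging.Formatter("%(asctime)s %(levelname)s source=%(name)s %(message)s")
    out = logging.StreamHandler(sys.stdout)    # DEBUG and INFO
    out.addFilter(MaxLevelFilter(logging.WARNING))
    err = logging.StreamHandler(sys.stderr)    # WARNING and above
    err.setLevel(logging.WARNING)
    for handler in (out, err):
        handler.setFormatter(fmt)

    logger = logging.getLogger(source)
    logger.setLevel(logging.DEBUG)
    logger.handlers = [out, err]
    logger.propagate = False                   # avoid duplicate lines via the root logger
    return logger


log = build_logger("checkout-service")         # hypothetical service name
log.info("Starting application...")
log.error("database unreachable")
```

The platform, not the app, then decides where the two streams are shipped.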
19. git’s your bible.
Everything should be versioned, ephemeral and reproducible using GitOps methods. This includes configuration files and Infrastructure as Code.
20. Design for failure.
Handle SIGTERM like a champ: finish in-flight work and exit cleanly. SIGKILL cannot be caught, so make sure a sudden kill leaves nothing broken.
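One way to handle SIGTERM, sketched in Python on a POSIX system. The handler only sets a flag, and the main loop notices it and returns so cleanup can run. The function serve and its process_one_item argument are hypothetical names for illustration.

```python
import signal
import threading

shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Only record the request; the main loop does the actual cleanup."""
    shutdown_requested.set()


# SIGKILL cannot be caught or ignored, so only SIGTERM is registered here.
signal.signal(signal.SIGTERM, handle_sigterm)


def serve(process_one_item, poll_seconds=0.1):
    """Process work until SIGTERM arrives, then return so the caller can clean up."""
    processed = 0
    while not shutdown_requested.is_set():
        process_one_item()
        processed += 1
        shutdown_requested.wait(poll_seconds)   # returns early once the flag is set
    return processed
```

Since no handler runs on SIGKILL, the safety net there is idempotent work and state that tolerates an abrupt stop.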
21. #$@&%*!
Fail gracefully and inform your customers what’s up (or down), pun intended.
22. Robots > humans.
Actions performed by humans hundreds of times won’t be performed the same way each time, even with the best intentions. Automate.
24. Resilient clouds don’t mean resilient apps.
Multi-active regions help you scale while being resilient.
Out of Region is more than just an insurance policy.
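A back-of-the-envelope sketch of why more active regions help. It assumes regions fail independently and that any single region can carry the full load. Both are strong assumptions that real deployments only approximate, and the function name is made up for illustration.

```python
def combined_availability(per_region: float, regions: int) -> float:
    """Availability of N independent active regions when one surviving region is enough."""
    all_down = (1.0 - per_region) ** regions   # probability that every region is down at once
    return 1.0 - all_down


two_regions = combined_availability(0.99, 2)
```

Under those assumptions, two regions at 99% each come out at roughly 99.99% combined. Shared control planes or stretched clusters break the independence assumption and eat into that number.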
25. Stay in your swimlane.
Respect region affinity and stickiness: use geo load balancers to resolve traffic to the nearest region and keep it there.
Crossing regions is a no-no.
26. DNS is your best friend.
Religiously steer clear of IP addresses. Service discovery will point you to the right path.
And if you can’t, use Anycast.
27. The most boring OS configs are also the most important ones.
An /etc/resolv.conf ‘search’ entry makes short hostnames resolve within your swimlane’s subdomain first, helping you with region affinity.
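To see why the ‘search’ entry keeps traffic in the swimlane, here is a simplified model of how a stub resolver expands a name. It follows the search and ndots rules described in resolv.conf(5) but is not a resolver. The host name db and the domain eu-de.example.internal are invented for illustration.

```python
def candidate_names(name, search, ndots=1):
    """Return the fully qualified names that would be tried, in order."""
    if name.endswith("."):
        return [name]                           # already absolute: search list is skipped
    expanded = [f"{name}.{domain}." for domain in search]
    as_is = name + "."
    if name.count(".") >= ndots:
        return [as_is] + expanded               # enough dots: try the name as given first
    return expanded + [as_is]                   # short name: search domains come first


names = candidate_names("db", ["eu-de.example.internal"])
```

With that search entry, a lookup for db tries db.eu-de.example.internal. first, so code that uses short names lands on its own region's instance without knowing which region it runs in.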
28. Share-nothing. Cluster-nothing. Stretch-nothing.
Control planes are delicate creatures, especially if stretched or shared.
[Diagram: three database layouts built from DB, Disk and Networking boxes: Share Everything; Share Disks and Networking; Share Nothing]
29. Bypass failures altogether.
Disaster recovery processes lead to a mediocre and sometimes catastrophic experience.
30. Are we there yet?
Discover the awesome world of readiness and liveness probes, circuit breakers, retries, rate limiting, bulkheading and fallbacks.
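Of the patterns listed, the circuit breaker with a fallback is perhaps the easiest to sketch. The version below is a deliberately small illustration, not a production library. The class name, thresholds and injectable clock are all assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None                   # None means the circuit is closed

    def call(self, func, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast instead of piling onto a sick dependency.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Half-open: the cool-down is over, so let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()   # trip, or re-trip after a failed trial
            if fallback is not None:
                return fallback()
            raise
        self.failures = 0
        self.opened_at = None                   # success closes the circuit again
        return result
```

The fallback is where a cached or degraded answer goes, which is often better for the user than an error page.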
31. One deployment at a time.
Rolling update strategies give you zero-downtime deployments within a cluster or availability zone:
- Deploy by adding an instance, then removing an old one
- Deploy by removing an instance, then adding a new one
- Deploy by updating instances as fast as possible
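The difference between the first two strategies is easiest to see in a toy simulation. The sketch below models a fleet as a plain list of version strings and tracks the lowest number of instances serving at any moment. It is an illustration only and not how any particular orchestrator implements rollouts.

```python
def rolling_update(replicas, new_version, surge_first=True):
    """Replace instances one at a time.

    Returns the updated fleet and the lowest instance count seen on the way.
    """
    fleet = list(replicas)
    lowest = len(fleet)
    for _ in range(len(replicas)):
        if surge_first:
            fleet.append(new_version)           # add a new instance...
            fleet.pop(0)                        # ...then remove an old one
        else:
            fleet.pop(0)                        # remove an old instance...
            lowest = min(lowest, len(fleet))
            fleet.append(new_version)           # ...then add a new one
        lowest = min(lowest, len(fleet))
    return fleet, lowest


surge_result = rolling_update(["v1", "v1", "v1"], "v2", surge_first=True)
drain_result = rolling_update(["v1", "v1", "v1"], "v2", surge_first=False)
```

Adding first keeps full capacity at the cost of one extra instance for a while. Removing first needs no spare room but runs one instance short during the rollout.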
32. One region at a time.
Then do the same across regions. Your customers will not even know what’s happening behind the scenes.
33. Love thy neighbor.
Configure resource requests and limits. Throttle API requests.
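Throttling is commonly done with a token bucket. The following is a minimal single-process sketch with an injectable clock so it can be tested. The class name and parameters are assumptions, and a shared limiter across instances would need a shared store.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second, with bursts of up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)           # start with a full bucket
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill for the elapsed time, never beyond the bucket size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                            # caller can reject or queue the request
```

A rejected request is typically answered with HTTP 429 so that well-behaved clients back off.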
35. The network is reliable. Right.
The CAP theorem must be well understood when choosing data stores. Since partition tolerance cannot be sacrificed in a distributed system, pick consistency or availability.
[Diagram: CAP triangle (P, A, C). Pick A or C. Oracle, DB2, MySQL etc…]
36. Do you really need Strong Consistency?
Applications can support weak, eventual, or strong consistency.
37. Distributed consistency is already difficult as it is.
Normally, higher availability means higher revenue. Think of ATMs: A trumps C.
Educate your business on eventual consistency. Strong consistency should be the last option, unless you’re the NYSE.
38. Master! Master!
Write anywhere and everywhere, using master-master, masterless or peer-to-peer database-level replication.
If you can’t, shard, partition or split write/query.
39. Data Replication. More than meets the eye.
Data patterns differ. Not all data is created equal.
[Diagram labels: Messaging; BPM; CEP; APP]
- Active standby, or active/query
- Hot standby, or configured active/active for fast switchover
- Multi-master, or peer-to-peer write anywhere
- Data distribution: filter and push
- Data warehouse: integration and federation
- Data through messaging: filter and push distribution
40. Conflict resolution during a network partition will make you creative.
Log and notify conflicts. Last-write-wins, CQRS and write partitioning are all valid but subjective (and emotional) decisions.
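As one concrete option, here is a last-write-wins merge sketched in Python that also logs the write it throws away. The Version record, the region names in the usage lines and the tie-breaking rule are assumptions for illustration. Last-write-wins also quietly assumes the writers' clocks are close enough to be compared.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("replication")


@dataclass(frozen=True)
class Version:
    value: str
    timestamp: float    # writer's clock, assumed reasonably synchronized
    region: str         # used only to break exact timestamp ties


def merge_lww(local: Version, remote: Version) -> Version:
    """Keep the newer write and log the one that gets discarded."""
    if local == remote:
        return local
    # Deterministic ordering, so every replica picks the same winner.
    winner = max(local, remote, key=lambda v: (v.timestamp, v.region, v.value))
    loser = remote if winner is local else local
    if loser.value != winner.value:
        log.warning("conflict: kept %r from %s, dropped %r from %s",
                    winner.value, winner.region, loser.value, loser.region)
    return winner


a = Version("blue", 10.0, "eu-de")
b = Version("green", 12.0, "us-south")
kept = merge_lww(a, b)
```

Whatever rule you pick, the log line is the part the business will ask about later.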
41. NTP is dead. Long live NTP.
Achieve globally distributed, consensus-respecting, synchronously replicated databases with Google TrueTime and AWS Time Sync, if you really need it.
42. Database is much more than just a DBA’s job.
Database versioning and backward-compatible schemas are not optional; they are compulsory.
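A small sketch of both ideas using Python's built-in sqlite3 module: a version table records which migrations ran, and each migration only adds to the schema so that code written against the previous version keeps working during a rolling deployment. The customer table and migration list are invented for illustration.

```python
import sqlite3

# Hypothetical migrations: each one is additive, so older code still works.
MIGRATIONS = [
    (1, "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    (2, "ALTER TABLE customer ADD COLUMN email TEXT"),   # nullable: old writers omit it
]


def migrate(conn):
    """Apply pending migrations and return the resulting schema version."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    current = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    for version, statement in MIGRATIONS:
        if version > current:
            conn.execute(statement)
            conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
            current = version
    conn.commit()
    return current


conn = sqlite3.connect(":memory:")
migrate(conn)
# An instance still running the previous release writes without the new column.
conn.execute("INSERT INTO customer (name) VALUES (?)", ("Ada",))
```

Dropping or renaming a column would be a separate, later migration, applied once no running release depends on it.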
43. Why is my shopping cart empty?
Aim for stateless, but maintain sessions if you must.
45. Design for feedback.
Measure every single detail via KPIs and SLIs. Capture metrics and logs. There’s no such thing as too many logs.
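A worked example of one common SLI, request availability, and how it compares to a target. The SLO and error-budget framing comes from SRE practice rather than from this slide, and the function names and figures are illustrative assumptions.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0                              # no traffic, so nothing failed
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo: float) -> float:
    """Share of the error budget still unspent; negative once the SLO is missed.

    Assumes slo < 1.0, since a 100% target leaves no budget to divide by.
    """
    allowed = 1.0 - slo
    spent = 1.0 - sli
    return (allowed - spent) / allowed


sli = availability_sli(999_500, 1_000_000)
remaining = error_budget_remaining(sli, slo=0.999)
```

With 500 failures in a million requests against a 99.9% target, about half of the error budget is left.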
46. Hope is not a strategy.
Reduce uncertainty with game days, then aim to regularly inject failure into your production environment.
47. Continuous tinkering is healthy.
Use randomness to spoon-feed yourself discoveries. You’ll be surprised what you come across.
48. You don’t choose Chaos Monkey. Chaos Monkey chooses you.
When pursuing Chaos Engineering, start controlled and small, then observe, squash and learn.
Remember, there is nothing chaotic about Chaos Engineering.
“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production.”
https://principlesofchaos.org
49. Chaos Engineering is a collection of “What if”s.
What if I add latency? What if I DDoS a service? What if I change the hardware clock?
Examples of tests:
• tc qdisc add dev eth0 root netem delay 300ms
• wrk -t12 -c400 -d30s http://host/api/request
• stress-ng --random 50 -t 60 --metrics-brief --times
• iptables -I OUTPUT -p udp -d <DNS server> --dport 53 -j DROP
• umount /mnt/blockstorage
• hwclock
50. The rollback button is a lie.
That’s true not only for application deployments but also for fault injection, as both face the same fundamental problem: state.
51. Go beyond trivial ICMP and connection tests.
Synthetic automated monitoring helps you understand what your digital users actually experience, which typical platform monitoring does not show.
Do it from multiple locations.
52. Love DevOps? Wait till you meet SRE.
SRE is what happens when you ask a software engineer to design an operations team.