@alexsotob
Chaos Engineering in Kubernetes
Alex Soto
Director of Developer Experience, Red Hat
(@alexsotob)
Good afternoon, good evening,
ladies and gentlemen.
To be here with you tonight
brings me joy, what happiness!
— Miguel
Who Am I?
Alex Soto
Questions
I think that is the
most stupidest thing
I ever heard it.
— Gideon Grey
MyApp
Monolith
Modules
Components
Microservices
Network of Services
This is the most beautiful
miracle I’ve ever seen.
— Vanellope Von
Schweetz
Failure of a Service
Cascading Failure
Production is not sacrosanct anymore
- Unit Tests
- Component Tests
- Static Analysis
- Coverage Tests
- Benchmark Tests
- Contract Tests
- Acceptance Tests
- Mutation Tests
- Smoke Tests
- UI/UX Tests
- Penetration Tests
- Integration Tests
- Tap Compare
- Load Tests
- Shadowing
- Config Tests
- Canarying
- Dark Canaries
- Monitoring
- Feature Flagging
- Exception Tracking
- Feature Graduation
- Teeing
- Profiling
- Logs
- Chaos Testing
- Monitoring
- A/B Testing
- Tracing
- Auditing
- OnCall Experience
- Journey tests
Source: Cindy Sridharan
Phases: Pre-Production, then Testing in Production (Deploy, Release, Post Release)
The New ¿Pyramid?
All of this has happened
before,
and it will all happen
again.
— Peter Pan
The flower that blooms in adversity
is the most rare
and beautiful of all.
— The Emperor
Chaos Engineering
https://principlesofchaos.org/
Break your system on purpose
Find out how it behaves and fix it if necessary
The human world…
It’s a mess.
— Sebastian
Phases of chaos
Steady State
Hypothesis
Run
Validate
Fix
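The five phases above can be read as a small control loop. The following is a minimal Python sketch under stated assumptions: the `Experiment` class, its field names, and the returned messages are hypothetical and not part of any chaos tool mentioned in this deck.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """Hypothetical container for one chaos experiment (names are illustrative)."""
    hypothesis: str
    steady_state: Callable[[], bool]   # returns True while the system looks healthy
    inject_fault: Callable[[], None]   # the "Run" phase: break something on purpose
    rollback: Callable[[], None]       # undo the injected fault afterwards


def run_experiment(exp: Experiment) -> str:
    # Steady State: refuse to start if the system is already unhealthy.
    if not exp.steady_state():
        return "aborted: system not in steady state before the experiment"
    try:
        # Run: inject the fault described by the hypothesis.
        exp.inject_fault()
        # Validate: does the steady state still hold while the fault is active?
        held = exp.steady_state()
    finally:
        # Always undo the fault, even if a probe raised.
        exp.rollback()
    # Fix: a failed hypothesis is a finding to act on.
    return "hypothesis held" if held else "weakness found: fix and rerun"


if __name__ == "__main__":
    state = {"replicas": 2}
    exp = Experiment(
        hypothesis="Losing one replica does not take the service down",
        steady_state=lambda: state["replicas"] >= 1,
        inject_fault=lambda: state.update(replicas=state["replicas"] - 1),
        rollback=lambda: state.update(replicas=2),
    )
    print(run_experiment(exp))
```

Checking the steady state before injecting anything keeps an already degraded system from being blamed on the experiment.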
Steady State
Hypothesis
What happens if …
Service starts returning error codes
Latency increases to 500 ms
Database is not available
Time travel
Partially deleting Kafka topics
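Two of the hypotheses above (added latency and error responses) can be tried in-process before touching real infrastructure. This is a minimal Python sketch, assuming a hypothetical `with_chaos` wrapper and `InjectedFault` exception; neither comes from a real library.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised instead of calling the real service (hypothetical error type)."""


def with_chaos(
    call: Callable[[], T],
    *,
    latency_s: float = 0.0,
    error_rate: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, optionally delaying it and failing a fraction of the time."""
    rng = rng or random.Random()
    if latency_s > 0:
        sleep(latency_s)                  # e.g. 0.5 for "latency increases to 500 ms"
    if rng.random() < error_rate:         # error_rate=1.0 always fails, 0.0 never
        raise InjectedFault("injected error response")
    return call()


if __name__ == "__main__":
    # Delay every call by 500 ms and fail roughly one call in five.
    try:
        print(with_chaos(lambda: "hello", latency_s=0.5, error_rate=0.2))
    except InjectedFault as exc:
        print(f"caller must cope with: {exc}")
```

Passing `sleep` and `rng` as parameters keeps the wrapper deterministic in tests.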
Run
Canary Release: user traffic is split, 90% to X v1 and 10% to X v2
Dark Canaries: all users (*) reach X v1, only [10.0.X.Y] reaches X v2
Containerise the experiment
Define Expected Behaviour
Make it public within the organisation
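The two routing strategies on this slide differ only in how the target version is chosen. Here is a minimal Python sketch of both decisions; the function names and the `10.0.0.0/16` network are assumptions made for illustration (the slide only shows the pattern `[10.0.X.Y]`), and in a real cluster this choice would normally live in the service mesh or ingress, not in application code.

```python
import ipaddress
import random


def canary_route(rng: random.Random, canary_weight: float = 0.10) -> str:
    """Canary release: send a fixed share of all user traffic to v2."""
    return "v2" if rng.random() < canary_weight else "v1"


def dark_canary_route(client_ip: str, dark_network: str = "10.0.0.0/16") -> str:
    """Dark canary: only callers from a chosen internal network reach v2."""
    inside = ipaddress.ip_address(client_ip) in ipaddress.ip_network(dark_network)
    return "v2" if inside else "v1"


if __name__ == "__main__":
    rng = random.Random(42)
    hits = sum(canary_route(rng) == "v2" for _ in range(10_000))
    print(f"canary share: {hits / 10_000:.1%}")
    print(dark_canary_route("10.0.3.7"), dark_canary_route("203.0.113.9"))
```

With a dark canary, regular users never see v2, which makes it a convenient target for a chaos experiment.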
Validate
Compare the current state with the steady state
The system recovers to the steady state
Identify the problems
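Checking that the system recovers usually means polling a health probe until it passes or a deadline expires. A minimal Python sketch follows, assuming a hypothetical `wait_for_recovery` helper; the clock and sleep functions are parameters so the loop can be tested without waiting.

```python
import time
from typing import Callable, Optional


def wait_for_recovery(
    probe: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll `probe` until it reports steady state.

    Returns the seconds it took to recover, or None if the deadline passed,
    which is the signal to start identifying the problem.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


if __name__ == "__main__":
    answers = iter([False, False, True])
    took = wait_for_recovery(lambda: next(answers), timeout_s=5, interval_s=0.1)
    print("recovered" if took is not None else "did not recover")
```

The measured recovery time is worth recording for each run, since a slow recovery can be a finding even when the hypothesis holds.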
Fix
Worked: you are good.
Did not work: escalate and start developing a fix.
No blame.
Image: quay.io/images/custservice:1.1.0
Replicas: 2
Labels: customerservice=prod,ci_build=1213
ConfigMap: cust_config
Defining Steady State
"steady-state-hypothesis": {
  "title": "Services are all available and healthy",
  "probes": [
    {
      "type": "probe",
      "name": "application-should-be-alive-and-healthy",
      "tolerance": true,
      "provider": {
        "type": "python",
        "module": "chaosk8s.probes",
        "func": "microservice_available_and_healthy",
        "arguments": {
          "name": "greetings-app",
          "ns": "default"
        }
      }
    },
    {
      "type": "probe",
      "name": "application-must-respo
      "tolerance": 200,
      "provider": {
        "type": "http",
        "verify_tls": false,
        "url": "https://app.greetin
      }
    }
  ]
},
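Each probe in the hypothesis above carries a `tolerance`: `true` for the Python probe and `200` for the HTTP probe. The sketch below shows, in plain Python, how such scalar tolerances could be compared with observed values. It is a simplified illustration, not Chaos Toolkit's implementation; `within_tolerance`, `steady_state_met`, and the observed values are hypothetical, and richer tolerance forms are not modelled.

```python
from typing import Any, Callable, Dict, List


def within_tolerance(tolerance: Any, observed: Any) -> bool:
    """Scalar comparison only: the observed value must equal the tolerance."""
    # bool is a subclass of int in Python, so keep True from matching 1.
    if isinstance(tolerance, bool) or isinstance(observed, bool):
        return type(observed) is type(tolerance) and observed == tolerance
    return observed == tolerance


def steady_state_met(
    probes: List[Dict[str, Any]],
    observe: Callable[[Dict[str, Any]], Any],
) -> bool:
    """The steady state holds only if every probe is within its tolerance."""
    return all(within_tolerance(p["tolerance"], observe(p)) for p in probes)


if __name__ == "__main__":
    probes = [
        {"name": "application-should-be-alive-and-healthy", "tolerance": True},
        {"name": "application-must-respond", "tolerance": 200},
    ]
    # Hypothetical observations: a health flag and an HTTP status code.
    observed = {
        "application-should-be-alive-and-healthy": True,
        "application-must-respond": 503,
    }
    print(steady_state_met(probes, lambda p: observed[p["name"]]))
```

For the HTTP probe, the observed value in this sketch is the response status code, so a 503 during the experiment counts as a deviation from steady state.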
Defining Experiment
"method": [
  {
    "type": "action",
    "name": "terminate-db-master",
    "provider": {
      "type": "python",
      "module": "chaosk8s.pod.actions",
      "func": "terminate_pods",
      "arguments": {
        "label_selector": "spilo-role=master",
        "name_pattern": "greetings-db-[0-9]$",
        "rand": true,
        "ns": "default"
      }
    },
    "pauses": {
      "after": 2
    }
  },
Verifying Experiment
  {
    "type": "probe",
    "ref": "application-must-respond"
  },
  {
    "type": "probe",
    "name": "fetch-patroni-operator-logs",
    "provider": {
      "type": "python",
      "module": "chaosk8s.pod.probes",
      "func": "read_pod_logs",
      "arguments": {
        "label_selector": "name=postgres-operator",
        "last": "20s",
        "ns": "default"
      }
    }
  }
],
"rollbacks": []
chaos run experiment.json
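The `terminate-db-master` action in the experiment narrows its targets with a `name_pattern` regular expression and the `rand` flag. The Python sketch below illustrates that kind of selection on a plain list of pod names. It is not the `chaosk8s` implementation: `select_pods` and the pod names are hypothetical, and label-selector filtering (done by the Kubernetes API) is left out.

```python
import random
import re
from typing import List, Optional


def select_pods(
    pod_names: List[str],
    name_pattern: str,
    rand: bool = False,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Keep pods whose name matches the pattern; with rand, pick one of them."""
    pattern = re.compile(name_pattern)
    matching = [name for name in pod_names if pattern.search(name)]
    if rand and matching:
        rng = rng or random.Random()
        return [rng.choice(matching)]
    return matching


if __name__ == "__main__":
    pods = ["greetings-db-0", "greetings-db-1", "greetings-db-10", "greetings-app-0"]
    # The trailing "$" means exactly one digit must end the name,
    # so "greetings-db-10" is not selected.
    print(select_pods(pods, r"greetings-db-[0-9]$"))
    print(select_pods(pods, r"greetings-db-[0-9]$", rand=True))
```

A tight pattern like this limits the blast radius: only the database pods are candidates, never the application pods.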
AWS
{
  "name": "stop-an-ec2-instance-in-az-at-random",
  "provider": {
    "type": "python",
    "module": "chaosaws.ec2.actions",
    "func": "stop_instance",
    "arguments": {
      "az": "us-west-1"
    }
  }
}
metadata:
  name: abort-ratings-metric
  namespace: bookinfo
spec:
  spec:
    duration: 60s
    failureConditions:
      - trigger:
          prometheus:
            customQuery: |
              scalar(sum(rate(istio_requests_total{ source_app="productpage",response_code="500"
              reporter="destination",destination_app="reviews",destination_version!="v1"}[1
            thresholdValue: 0.01
            comparisonOperator: ">"
    faults:
      - destinationServices:
          - name: ratings
            namespace: app
        fault:
          abort:
            httpStatus: 500
          percentage: 100
    targetMesh:
kubectl create -f experiment.yml
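The `failureConditions` block above pairs a metric query with a `thresholdValue` and a `comparisonOperator`. A minimal Python sketch of that comparison follows, assuming a hypothetical `should_abort` helper and hand-written metric samples; only the `">"` operator appears in the experiment, the other operators are assumptions.

```python
import operator
from typing import Callable, Dict, Iterable

# Only ">" is shown in the experiment; the rest are illustrative assumptions.
_OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def should_abort(metric_value: float, threshold_value: float, comparison_operator: str) -> bool:
    """True when the observed metric trips the failure condition."""
    return _OPERATORS[comparison_operator](metric_value, threshold_value)


def first_failure(samples: Iterable[float], threshold_value: float, comparison_operator: str) -> int:
    """Index of the first sample that trips the condition, or -1 if none does."""
    for index, value in enumerate(samples):
        if should_abort(value, threshold_value, comparison_operator):
            return index
    return -1


if __name__ == "__main__":
    # Hypothetical rate of HTTP 500 responses sampled during the experiment.
    samples = [0.0, 0.004, 0.02, 0.3]
    print(first_failure(samples, threshold_value=0.01, comparison_operator=">"))
```

Stopping at the first tripped sample matches the purpose of a failure condition: end the experiment as soon as the damage exceeds what was agreed.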
It's time to see
what I can do
To test the limits
and break through.
— Elsa
Put on your Sunday clothes
there's lots of world
out there.
— Wall-E
[https://github.com/lordofthejars/chaos-quarkus]
What's the lesson?
What is the take-away?
— Maui
Every adventure
requires a first step.
— Alice
Start with the most sympathetic and innovative groups
The Technology Adoption Curve (Source: Moore and McKenna, Crossing The Chasm)
Axes: Market Growth over Time
Innovators (2.5%), Early Adopters (13.5%), The Chasm, Early Majority (34%), Late Majority (34%), Laggards (16%)
To get started
Start small
As close as possible to production
Communicate with everybody
Have a failover plan
Start manually
This is the circle of sadness.
Your job is to make sure
that all sadness stays
inside of it.
— Joy
Oh yes,
the past can hurt.
But the way I see it,
you can either run from it
or learn from it.
— Rafiki
You've got a friend in me,
when you set off to fly,
you've got a friend in me
— Toy Story
@alexsotob
asotobue@redhat.com
http://www.lordofthejars.com/
lordofthejars
