@alexsotob
Chaos Engineering in
Kubernetes
Alex Soto
Director of Developer Experience, Red Hat
(@alexsotob)
Good afternoon, good evening,
ladies and gentlemen.
To be here with you tonight
brings me joy, what a joy!
— Miguel
Who Am I?
Alex Soto
Questions
I think that is the
most stupidest thing
I ever heard it.
— Gideon Grey
2000: Red Hat Linux
2007: KVM
2009: DevOps
MyApp
Monolith
Modules
Components
Microservices
Network of Services
This is the most beautiful
miracle I’ve ever seen.
— Vanellope von Schweetz
Failure of a Service
Cascading Failure
Production is not sacrosanct anymore
The New Pyramid? (Cindy Sridharan)
Pre-Production
- Unit Tests
- Component Tests
- Static Analysis
- Coverage Tests
- Benchmark Tests
- Contract Tests
- Acceptance Tests
- Mutation Tests
- Smoke Tests
- UI/UX Tests
- Penetration Tests
Testing In Production: Deploy
- Integration Tests
- Tap Compare
- Load Tests
- Shadowing
- Config Tests
Testing In Production: Release
- Canarying
- Dark Canaries
- Monitoring
- Feature Flagging
- Exception Tracking
- Feature Graduation
Testing In Production: Post Release
- Teeing
- Profiling
- Logs
- Chaos Testing
- Monitoring
- A/B Testing
- Tracing
- Auditing
- OnCall Experience
- Journey Tests
There is a lot of grey area in
"make me a prince".
— Genie
All of this has happened
before,
and it will all happen
again.
— Peter Pan
The flower that blooms in adversity
is the most rare
and beautiful of all.
— The Emperor
Chaos Engineering
https://principlesofchaos.org/
Break Your System on purpose
Find out how it behaves and fix if necessary
The human world…
It’s a mess.
— Sebastian
Phases of chaos
Steady State
Hypothesis
Run
Validate
Fix
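The five phases can be read as one loop. A minimal Python sketch, assuming a toy in-memory `system` dict in place of a real cluster; `Experiment`, `run_experiment` and the three callables are hypothetical names for illustration, not part of any chaos tool:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """Hypothetical container for one chaos experiment."""
    hypothesis: str                      # phase 2: what we believe will happen
    steady_state: Callable[[], bool]     # probe: True when the system looks healthy
    inject_failure: Callable[[], None]   # action that breaks something on purpose
    rollback: Callable[[], None]         # undoes the injected failure


def run_experiment(exp: Experiment) -> str:
    # Phase 1: only start from a known steady state.
    if not exp.steady_state():
        return "aborted: system not in steady state"
    try:
        # Phase 3: run the failure.
        exp.inject_failure()
        # Phase 4: validate by probing the steady state again.
        recovered = exp.steady_state()
    finally:
        # Always undo the failure, even if a probe raised.
        exp.rollback()
    # Phase 5: a failed hypothesis is a finding to fix, not a blame.
    return "hypothesis held" if recovered else "weakness found: fix needed"


# Toy system: two replicas, healthy while at least one is running.
system = {"replicas": 2}
exp = Experiment(
    hypothesis="Losing one replica does not take the service down",
    steady_state=lambda: system["replicas"] >= 1,
    inject_failure=lambda: system.update(replicas=system["replicas"] - 1),
    rollback=lambda: system.update(replicas=2),
)
result = run_experiment(exp)
print(result)
```

The `finally` block is a deliberate choice in this sketch: the rollback runs whether validation passes, fails, or raises.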
Steady State
https://www.oreilly.com/ideas/chaos-engineering
Hypothesis
What happens if …
A service starts returning error codes
Latency increases to 500 ms
The database is not available
Time travel
Kafka topics are partially deleted
Run
Canary Release: user traffic is split 90% to X v1 and 10% to X v2
Dark Canaries: all user traffic (*) goes to X v1; only requests from [10.0.X.Y] go to X v2
Containerise the experiment
Define Expected Behaviour
Make it public within the organisation
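The two release styles on this slide differ only in how a request is assigned to a version. A rough Python sketch of both routing decisions, assuming `10.0.0.0/16` as a stand-in for the slide's `[10.0.X.Y]` range; `canary_route` and `dark_canary_route` are hypothetical helpers, not a mesh API:

```python
import ipaddress
import random


def canary_route(rng: random.Random, canary_weight: float = 0.10) -> str:
    """Send roughly `canary_weight` of requests to v2, the rest to v1."""
    return "v2" if rng.random() < canary_weight else "v1"


def dark_canary_route(client_ip: str, internal_net: str = "10.0.0.0/16") -> str:
    """Send only callers inside `internal_net` to v2; everyone else stays on v1."""
    if ipaddress.ip_address(client_ip) in ipaddress.ip_network(internal_net):
        return "v2"
    return "v1"


# Canary: count how many of 10,000 simulated requests reach v2.
rng = random.Random(42)  # seeded so the simulation is repeatable
hits = sum(canary_route(rng) == "v2" for _ in range(10_000))
canary_share = hits / 10_000
print(f"share of traffic on v2: {canary_share:.3f}")

# Dark canary: only the internal range sees v2.
print(dark_canary_route("10.0.3.7"))     # internal caller
print(dark_canary_route("192.168.1.5"))  # any other caller
```

With a dark canary the experiment's blast radius is limited to callers you control, which is why it pairs well with a first chaos run.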
Validate
Compare the current state with the steady state
Check that the system recovers to its steady state
Identify the problems
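Comparing against the steady state usually means allowing some drift. A small Python sketch, assuming metrics are plain numbers and a 5% relative tolerance; the metric names and values are made up for illustration:

```python
from __future__ import annotations


def within_tolerance(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """True when `current` deviates from `baseline` by at most `tolerance` (relative)."""
    if baseline == 0:
        return current == 0
    return abs(current - baseline) / abs(baseline) <= tolerance


def validate(steady: dict[str, float], current: dict[str, float],
             tolerance: float = 0.05) -> list[str]:
    """Return the names of metrics that drifted away from the steady state."""
    return [
        name
        for name, base in steady.items()
        # A metric missing from `current` becomes NaN, which never passes the check.
        if not within_tolerance(base, current.get(name, float("nan")), tolerance)
    ]


# Hypothetical measurements taken before and after the experiment.
steady = {"success_rate": 0.999, "p99_latency_ms": 200.0}
after_run = {"success_rate": 0.90, "p99_latency_ms": 205.0}

problems = validate(steady, after_run)
print(problems)
```

Here the latency moved by 2.5% and passes, while the success rate dropped by about 10% and is reported as a problem to identify.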
Fix
Worked: you are good.
Did not work: escalate and start developing a fix.
Do not blame.
Image: quay.io/images/custservice:1.1.0
Replicas: 2
Labels: customerservice=prod,ci_build=1213
ConfigMap: cust_config
Istio — ‘Sail’
(Kubernetes — the ‘Helmsman’)
Microservices Externalizing Capabilities
Service A: Pod > Container > JVM
Service B: Pod > Container > JVM
Service C: Pod > Container > JVM
The service mesh intercepts all network traffic
Istio and Chaos
Resiliency: Retries, Bulkhead, Circuit Breaker, Pool Ejection, Request Timeout
Chaos: Fault Injection, Fault Delay
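Of the resiliency features listed, the circuit breaker is the one most often exercised by fault injection. Istio applies it in the mesh rather than in application code; the Python class below is only a sketch of the pattern itself, with hypothetical names (`CircuitBreaker`, `CircuitOpenError`) and an injectable clock so it can run without waiting:

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the protected function."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after    # seconds to stay open
        self.clock = clock
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open: failing fast")
            # Half-open: allow one trial call; a single failure reopens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


# Simulated clock and a dependency that is always down.
now = [0.0]
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])


def flaky():
    raise ConnectionError("service B is down")


outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        outcomes.append("fast-fail")
    except ConnectionError:
        outcomes.append("error")
print(outcomes)

# After the reset window a trial call goes through and closes the circuit.
now[0] = 11.0
recovered = breaker.call(lambda: "ok")
print(recovered)
```

The third call never reaches the failing dependency, which is the behaviour a chaos experiment would try to confirm.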
Defining Failure
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
  ...
spec:
  hosts:
  - ratings
  http:
  - fault:
      abort:
        httpStatus: 500
        percentage:
          value: 100
    match:
    - headers:
        end-user:
          exact: alex
    route:
    - destination:
        host: ratings
kubectl apply -f failure.yml
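When several experiments need the same kind of fault with different users or status codes, the manifest can be generated instead of hand-edited. A sketch in Python, assuming `kubectl apply -f` is given the resulting JSON file; `abort_fault` is a hypothetical helper, and the second, unconditional route is an addition not shown on the slide so that other users still reach the service:

```python
import json


def abort_fault(host: str, user: str, http_status: int = 500,
                percent: float = 100.0) -> dict:
    """Build a VirtualService that aborts requests from one end-user."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "VirtualService",
        "metadata": {"name": host},
        "spec": {
            "hosts": [host],
            "http": [
                {
                    # Same fields as the slide: abort matching requests.
                    "fault": {
                        "abort": {
                            "httpStatus": http_status,
                            "percentage": {"value": percent},
                        }
                    },
                    "match": [{"headers": {"end-user": {"exact": user}}}],
                    "route": [{"destination": {"host": host}}],
                },
                # Assumption: fallback route for every other user.
                {"route": [{"destination": {"host": host}}]},
            ],
        },
    }


manifest = abort_fault("ratings", "alex")
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Scoping the fault to a single `end-user` header keeps the experiment invisible to everyone except the test account.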
Defining Steady State
"steady-state-hypothesis": {
  "title": "Services are all available and healthy",
  "probes": [
    {
      "type": "probe",
      "name": "application-should-be-alive-and-healthy",
      "tolerance": true,
      "provider": {
        "type": "python",
        "module": "chaosk8s.probes",
        "func": "microservice_available_and_healthy",
        "arguments": {
          "name": "greetings-app",
          "ns": "default"
        }
      }
    },
Defining Experiment
"method": [
  {
    "type": "action",
    "name": "terminate-db-master",
    "provider": {
      "type": "python",
      "module": "chaosk8s.pod.actions",
      "func": "terminate_pods",
      "arguments": {
        "label_selector": "spilo-role=master",
        "name_pattern": "greetings-db-[0-9]$",
        "rand": true,
        "ns": "default"
      }
    },
    "pauses": {
      "after": 2
    }
  },
Verifying Experiment
  {
    "type": "probe",
    "ref": "application-must-respond"
  },
  {
    "type": "probe",
    "name": "fetch-patroni-operator-logs",
    "provider": {
      "type": "python",
      "module": "chaosk8s.pod.probes",
      "func": "read_pod_logs",
      "arguments": {
        "label_selector": "name=postgres-operator",
        "last": "20s",
        "ns": "default"
      }
    }
  }
],
"rollbacks": []
chaos run experiment.json
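The three slides show fragments of one experiment file. The Python sketch below stitches them into a single document, assuming the usual Chaos Toolkit top-level keys; the `version`, `title` and `description` values are placeholders invented here, and the `application-must-respond` reference is left out because the probe it points to is not shown on the slides:

```python
import json


def build_experiment() -> dict:
    """Assemble the slide fragments into one experiment document."""
    steady_state = {
        "title": "Services are all available and healthy",
        "probes": [{
            "type": "probe",
            "name": "application-should-be-alive-and-healthy",
            "tolerance": True,
            "provider": {
                "type": "python",
                "module": "chaosk8s.probes",
                "func": "microservice_available_and_healthy",
                "arguments": {"name": "greetings-app", "ns": "default"},
            },
        }],
    }
    method = [
        {
            "type": "action",
            "name": "terminate-db-master",
            "provider": {
                "type": "python",
                "module": "chaosk8s.pod.actions",
                "func": "terminate_pods",
                "arguments": {
                    "label_selector": "spilo-role=master",
                    "name_pattern": "greetings-db-[0-9]$",
                    "rand": True,
                    "ns": "default",
                },
            },
            "pauses": {"after": 2},
        },
        {
            "type": "probe",
            "name": "fetch-patroni-operator-logs",
            "provider": {
                "type": "python",
                "module": "chaosk8s.pod.probes",
                "func": "read_pod_logs",
                "arguments": {
                    "label_selector": "name=postgres-operator",
                    "last": "20s",
                    "ns": "default",
                },
            },
        },
    ]
    return {
        "version": "1.0.0",  # assumption: placeholder format version
        "title": "Terminating the database master",  # placeholder
        "description": "Placeholder description for this sketch",
        "steady-state-hypothesis": steady_state,
        "method": method,
        "rollbacks": [],
    }


experiment = build_experiment()
experiment_json = json.dumps(experiment, indent=2)
print(experiment_json)
```

Saving the printed JSON as `experiment.json` gives a file in the shape that `chaos run experiment.json` expects, provided the placeholder fields are replaced with real values.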
AWS
{
  "name": "stop-an-ec2-instance-in-az-at-random",
  "provider": {
    "type": "python",
    "module": "chaosaws.ec2.actions",
    "func": "stop_instance",
    "arguments": {
      "az": "us-west-1"
    }
  }
}
apiVersion: glooshot.solo.io/v1
kind: Experiment
metadata:
  name: abort-ratings-metric
  namespace: bookinfo
spec:
  spec:
    duration: 60s
    failureConditions:
    - trigger:
        prometheus:
          customQuery: |
            scalar(sum(rate(istio_requests_total{ source_app="productpage",response_code="500",
            reporter="destination",destination_app="reviews",destination_version!="v1"}[1m])))
          thresholdValue: 0.01
          comparisonOperator: ">"
    faults:
    - destinationServices:
      - name: ratings
        namespace: app
      fault:
        abort:
          httpStatus: 500
        percentage: 100
    targetMesh:
      name: istio-istio-system
      namespace: app
kubectl create -f experiment.yml
The way to get started is
to quit talking
and begin doing.
— Walt Disney
Put on your Sunday clothes
there's lots of world
out there.
— Wall-E
https://github.com/lordofthejars/chaos-quarkus
What's the lesson?
What is the take-away?
— Maui
Every adventure
requires a first step.
— Alice
Start with the most sympathetic and innovative groups
[Figure: The Technology Adoption Curve, market growth over time: Innovators (2.5%), Early Adopters (13.5%), The Chasm, Early Majority (34%), Late Majority (34%), Laggards (16%). Source: Moore and McKenna, Crossing The Chasm]
Options and Hedges
To get started
Start small
As close as possible to production
Communicate with everybody
Have a failover plan
Start manually
Don’t you know
there’s part of me that
longs to go…
Into the Unknown
— Elsa
Adventure is
out there.
— Ellie
This is the circle of sadness.
Your job is to make sure
that all sadness stays
inside of it.
— Joy
Oh yes,
the past can hurt.
But the way I see it,
you can either run from it
or learn from it.
— Rafiki
You've got a friend in me,
when you set out to fly,
you've got a friend in me
— Toy Story
asotobue@redhat.com
http://www.lordofthejars.com/
lordofthejars
