Chaos Engineering Kubernetes

@alexsotob
Chaos Engineering in Kubernetes
Alex Soto
Director of Developer Experience, Red Hat
(@alexsotob)

Buenas tardes, buenas noches,
señoritas y señores.
To be here with you tonight
brings me joy, qué alegría
— Miguel

Who Am I?
Alex Soto

I think that is the
most stupidest thing
I ever heard it.
— Gideon Grey

Network of Services

This is the most beautiful
miracle I’ve ever seen.
— Vanellope Von
Schweetz

Failure of a Service

Cascading Failure

Production is not sacrosanct anymore

The New ¿Pyramid? (Cindy Sridharan)
Pre-Production
- Unit Tests
- Component Tests
- Static Analysis
- Coverage Tests
- Benchmark Tests
- Contract Tests
- Acceptance Tests
- Mutation Tests
- Smoke Tests
- UI/UX Tests
- Penetration Tests
Testing In Production: Deploy
- Integration Tests
- Tap Compare
- Load Tests
- Shadowing
- Config Tests
Testing In Production: Release
- Canarying
- Dark Canaries
- Monitoring
- Feature Flagging
- Exception Tracking
- Feature Graduation
Testing In Production: Post Release
- Teeing
- Profiling
- Logs
- Chaos Testing
- Monitoring
- A/B Testing
- Tracing
- Auditing
- OnCall Experience
- Journey tests

All of this has happened
before,
and it will all happen
again.
— Peter Pan

The flower that blooms in adversity
is the most rare
and beautiful of all.
— The Emperor

Chaos Engineering
https://principlesofchaos.org/
Break your system on purpose
Find out how it behaves and fix it if necessary

The human world…
It’s a mess.
— Sebastian

Phases of chaos
Steady State
Hypothesis
Run
Validate
Fix
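
The five phases read naturally as a loop. A minimal Python sketch, assuming a hypothetical `Experiment` container and an in-memory `state` dictionary standing in for a real cluster (neither comes from the talk or from any chaos tool):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """Hypothetical container for one chaos experiment (illustrative names)."""
    hypothesis: str
    steady_state: Callable[[], bool]   # True while the system looks healthy
    inject_fault: Callable[[], None]   # the "Run" phase
    rollback: Callable[[], None]       # undo the fault afterwards


def run_experiment(exp: Experiment) -> str:
    # Steady State: do not inject faults into a system that is already unhealthy.
    if not exp.steady_state():
        return "aborted: system not in steady state"
    try:
        exp.inject_fault()
        # Validate: is the system still in (or back to) its steady state?
        ok = exp.steady_state()
    finally:
        exp.rollback()
    # Fix: a failed hypothesis is a finding to work on, not a reason to blame.
    return "hypothesis held" if ok else "weakness found: fix it"


# Toy system: two replicas, healthy as long as at least one is running.
state = {"replicas": 2}
experiment = Experiment(
    hypothesis="Losing one replica does not take the service down",
    steady_state=lambda: state["replicas"] >= 1,
    inject_fault=lambda: state.update(replicas=state["replicas"] - 1),
    rollback=lambda: state.update(replicas=2),
)
result = run_experiment(experiment)
print(result)
```

The rollback sits in a `finally` block so the fault is undone even when validation itself raises.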

Steady State

Hypothesis
What happens if …
A service starts returning error codes
Latency increases to 500 ms
The database is not available
Time travel
Kafka topics are partially deleted
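
Two of these hypotheses (error codes and added latency) can be tried locally before touching a cluster. A minimal sketch, assuming a hypothetical `with_faults` wrapper around any callable; the wrapper, its parameters, and the injected error message are illustrative only:

```python
import random
import time
from typing import Callable, Optional


def with_faults(
    call: Callable[[], str],
    *,
    error_rate: float = 0.0,
    latency_s: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[], str]:
    """Wrap `call` so it is delayed and fails with the given probability."""
    rng = rng or random.Random()

    def wrapped() -> str:
        if latency_s > 0:
            sleep(latency_s)              # e.g. 0.5 for the 500 ms hypothesis
        if rng.random() < error_rate:     # random() is in [0, 1)
            raise RuntimeError("injected fault: service returned an error code")
        return call()

    return wrapped


# `sleep` is injectable so the demo records delays instead of really waiting.
delays = []
healthy = with_faults(lambda: "hello", latency_s=0.5, sleep=delays.append)
failing = with_faults(lambda: "hello", error_rate=1.0)
```

Calling `healthy()` returns the normal answer after the recorded delay, while `failing()` always raises, which is enough to see how a caller's timeouts and retries behave.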

Run
Canary Release: user traffic is split between X v1 (90%) and X v2 (10%)
Dark Canaries: all users (*) reach X v1; only [10.0.X.Y] reaches X v2
Containerise the experiment
Define Expected Behaviour
Make it public within the organisation
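
The canary and dark canary routing on this slide can be sketched in a few lines. This assumes the 10% share goes to X v2 and that `[10.0.X.Y]` means the `10.0.0.0/16` network; both function names are hypothetical, and a real setup would do this in the ingress or service mesh rather than in application code:

```python
import ipaddress
import random


def route_canary(rng: random.Random, canary_weight: float = 0.10) -> str:
    """Canary release: send a fixed share of all requests to v2."""
    return "v2" if rng.random() < canary_weight else "v1"


def route_dark(client_ip: str, dark_network: str = "10.0.0.0/16") -> str:
    """Dark canary: only clients inside an internal network reach v2."""
    inside = ipaddress.ip_address(client_ip) in ipaddress.ip_network(dark_network)
    return "v2" if inside else "v1"


rng = random.Random(42)  # seeded so the demo is repeatable
hits = sum(route_canary(rng) == "v2" for _ in range(10_000))
print(f"canary share: {hits / 10_000:.1%}")
print(route_dark("10.0.3.7"), route_dark("192.168.1.5"))
```

Running the experiment against the dark canary first keeps the blast radius limited to internal callers.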

Validate
Compare the current state with the steady state
Check that the system recovers to its steady state
Identify the problems
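
Comparing the current state with the steady state usually means comparing a handful of metrics against a baseline. A minimal sketch, assuming hypothetical metric names, non-zero baseline values, and a 10% relative tolerance chosen only for illustration:

```python
from typing import Dict


def deviations(
    steady: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.10,
) -> Dict[str, float]:
    """Return the metrics whose relative change from steady state exceeds `tolerance`."""
    problems = {}
    for name, baseline in steady.items():
        change = abs(current[name] - baseline) / baseline  # baseline assumed non-zero
        if change > tolerance:
            problems[name] = round(change, 3)
    return problems


steady_state = {"p99_latency_ms": 120.0, "success_rate": 0.999}
after_experiment = {"p99_latency_ms": 510.0, "success_rate": 0.995}
problems = deviations(steady_state, after_experiment)
print(problems)
```

An empty result means the system stayed within tolerance; anything else is the list of problems to identify and fix.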

Fix
Worked: you are good.
Did not work: escalate and start developing a fix.
Do not blame.

Image: quay.io/images/custservice:1.1.0
Replicas: 2
Labels: customerservice=prod,ci_build=1213
ConfigMap: cust_config

Defining Steady State
"steady-state-hypothesis": {
  "title": "Services are all available and healthy",
  "probes": [
    {
      "type": "probe",
      "name": "application-should-be-alive-and-healthy",
      "tolerance": true,
      "provider": {
        "type": "python",
        "module": "chaosk8s.probes",
        "func": "microservice_available_and_healthy",
        "arguments": {
          "name": "greetings-app",
          "ns": "default"
        }
      }
    },
    {
      "type": "probe",
      "name": "application-must-respo…",
      "tolerance": 200,
      "provider": {
        "type": "http",
        "verify_tls": false,
        "url": "https://app.greetin…"
      }
    }
  ]
},
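
Each probe's result is compared with its `tolerance` (`true` for the Kubernetes probe, `200` for the HTTP probe). The sketch below is a simplified model of that comparison, not Chaos Toolkit's own code; it assumes an HTTP probe result is a dictionary carrying a `status` field and handles only scalar and list tolerances:

```python
from typing import Any


def within_tolerance(value: Any, tolerance: Any) -> bool:
    """Simplified check: does a probe result satisfy its declared tolerance?"""
    # Assumption: an HTTP probe result is reduced to its status code.
    if isinstance(value, dict) and "status" in value:
        value = value["status"]
    # A list tolerance is read here as "any of these values is acceptable".
    if isinstance(tolerance, (list, tuple)):
        return value in tolerance
    return value == tolerance


print(within_tolerance(True, True))
print(within_tolerance({"status": 503, "body": ""}, 200))
```

If any probe falls outside its tolerance before the method runs, the system is not in its steady state and the experiment should not proceed.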

Defining Experiment
"method": [
  {
    "type": "action",
    "name": "terminate-db-master",
    "provider": {
      "type": "python",
      "module": "chaosk8s.pod.actions",
      "func": "terminate_pods",
      "arguments": {
        "label_selector": "spilo-role=master",
        "name_pattern": "greetings-db-[0-9]$",
        "rand": true,
        "ns": "default"
      }
    },
    "pauses": {
      "after": 2
    }
  },

Verifying Experiment
  {
    "type": "probe",
    "ref": "application-must-respond"
  },
  {
    "type": "probe",
    "name": "fetch-patroni-operator-logs",
    "provider": {
      "type": "python",
      "module": "chaosk8s.pod.probes",
      "func": "read_pod_logs",
      "arguments": {
        "label_selector": "name=postgres-operator",
        "last": "20s",
        "ns": "default"
      }
    }
  }
],
"rollbacks": []

chaos run experiment.json

AWS
{
  "name": "stop-an-ec2-instance-in-az-at-random",
  "provider": {
    "type": "python",
    "module": "chaosaws.ec2.actions",
    "func": "stop_instance",
    "arguments": {
      "az": "us-west-1"
    }
  }
}
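
The idea behind "stop an instance in an availability zone at random" can be shown without any AWS calls. A minimal sketch, assuming a hypothetical in-memory inventory and a hypothetical `pick_instance` helper; it illustrates the selection step only and is not how `chaosaws` is implemented:

```python
import random
from typing import Dict, List, Optional


def pick_instance(
    instances: List[Dict[str, str]],
    az: str,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick one running instance in the given availability zone, or None."""
    rng = rng or random.Random()
    candidates = [
        i["id"] for i in instances if i["az"] == az and i["state"] == "running"
    ]
    return rng.choice(candidates) if candidates else None


# Hypothetical inventory; the ids and zone names are made up for illustration.
inventory = [
    {"id": "i-aaa", "az": "us-west-1a", "state": "running"},
    {"id": "i-bbb", "az": "us-west-1a", "state": "stopped"},
    {"id": "i-ccc", "az": "us-west-1b", "state": "running"},
    {"id": "i-ddd", "az": "us-west-1a", "state": "running"},
]
victim = pick_instance(inventory, "us-west-1a", random.Random(7))
print(victim)
```

Returning `None` when nothing matches lets the experiment stop cleanly instead of failing on an empty selection.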

metadata:
  name: abort-ratings-metric
  namespace: bookinfo
spec:
  duration: 60s
  failureConditions:
    - trigger:
        prometheus:
          customQuery: |
            scalar(sum(rate(istio_requests_total{ source_app="productpage",response_code="500" …
            reporter="destination",destination_app="reviews",destination_version!="v1"}[1…
          thresholdValue: 0.01
          comparisonOperator: ">"
  faults:
    - destinationServices:
        - name: ratings
          namespace: app
      fault:
        abort:
          httpStatus: 500
          percentage: 100
  targetMesh:

kubectl create -f experiment.yml
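
The `failureConditions` block pairs a Prometheus query with `thresholdValue` and `comparisonOperator`. A minimal sketch of how such a condition might be evaluated, assuming the query has already been reduced to a single number; the function name and the operators other than `>` are illustrative and not taken from the tool on the slide:

```python
import operator

# Assumption: only ">" appears in the talk; the other operators are illustrative.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def should_abort(
    metric_value: float,
    threshold_value: float,
    comparison_operator: str,
) -> bool:
    """Return True when the observed metric trips the failure condition."""
    compare = COMPARATORS[comparison_operator]
    return compare(metric_value, threshold_value)


print(should_abort(0.05, 0.01, ">"))    # 0.05 is above the 0.01 threshold
print(should_abort(0.005, 0.01, ">"))   # still below the threshold
```

A tripped condition is the signal to stop injecting the fault before the 60s duration ends.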

It's time to see
what I can do
To test the limits
and break through.
— Elsa

Put on your Sunday clothes
there's lots of world
out there.
— Wall-E
[https://github.com/lordofthejars/chaos-quarkus]

What's the lesson?
What is the take-away?
— Maui

Every adventure
requires a first step.
— Alice

Start with the most sympathetic and innovative groups
[Figure: The Technology Adoption Curve, market growth over time. Innovators 2.5%, Early Adopters 13.5%, Early Majority 34%, Late Majority 34%, Laggards 16%; "The Chasm" is marked on the curve.]
The Technology Adoption Curve (Source: Moore and McKenna, Crossing The Chasm)

To get started
- Start small
- As close as possible to production
- Communicate with everybody
- Have a failover plan
- Start manual

This is the circle of sadness.
Your job is to make sure
that all sadness stays
inside of it.
— Joy

Oh yes,
the past can hurt.
But the way I see it,
you can either run from it
or learn from it.
— Rafiki

Hay un amigo en mí,
cuando salgan a volar,
hay un amigo en mí
— Toy Story
@alexsotob
asotobue@redhat.com
http://www.lordofthejars.com/
lordofthejars
