muCon 2017 - Build Confidence in your System with Chaos Engineering

Build Confidence in your
System with Chaos
Engineering
Try, Learn, Adapt
2017 - Sylvain Hellegouarch - http://chaosiq.io CHAOSIQ

What we tend to actually build

The environment matters... not?

Even prepared, things may go wrong...
Kudos for
the report!

Everything changes and nothing stands still
Heraclitus, as quoted in Plato's Cratylus (402a)

Non-functional is a force towards change
It’s fine, we
have
microservices

As soon as you are launched you are at
the mercy of your environment

Environment is full of unknowns

"It's just one of these cases where Mars is going to give us
a new deal, and we're going to have to play the cards we
get, not the ones we want"
Jim Erickson / Project Manager at NASA for Mars rover missions

Redundancy
Duplicate Components

Adaptation
Dynamic Response to Environment

One might rephrase “calculation and
correction of error” as “recognition of
and response to difference”.
Jeff Sussna / Designing Delivery: Rethinking IT in the Digital Service Economy

Your System is dynamic and fluid
Recognition

Environment is not stable
Recognition

Accept you do not control everything
Recognition

Be creative when you explore but seek
evidence
Response

Look for deviation from normal

It’s not about breaking stuff, you fool!

Open Ended Questions
Chaos Engineering

Probe your system for data
Chaos Engineering

Chaos Engineering is about trying
controlled change to observe system
availability deviation
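
The definition above can be sketched as a small loop: check the steady state, apply a controlled change, check again, and always roll back. This is a minimal illustration and not the Chaos Toolkit implementation; `run_experiment` and the toy `system` dictionary are hypothetical names introduced only for this example.

```python
from typing import Callable


def run_experiment(
    steady_state: Callable[[], bool],
    action: Callable[[], None],
    rollback: Callable[[], None],
) -> dict:
    """Run one controlled experiment and report whether the system deviated."""
    report = {"before": None, "after": None, "deviated": None}

    # 1. Only experiment on a system that is already behaving normally.
    report["before"] = steady_state()
    if not report["before"]:
        return report  # nothing to learn from an already unhealthy system

    try:
        # 2. Introduce the controlled change.
        action()
        # 3. Observe the same steady state again.
        report["after"] = steady_state()
    finally:
        # 4. Always undo the change, even if the observation raised.
        rollback()

    report["deviated"] = not report["after"]
    return report


# Toy system (hypothetical): a consumer that is healthy only while its provider is up.
system = {"provider_up": True}


def consumer_healthy() -> bool:
    return system["provider_up"]


def stop_provider() -> None:
    system["provider_up"] = False


def start_provider() -> None:
    system["provider_up"] = True


result = run_experiment(consumer_healthy, stop_provider, start_provider)
print(result)
```

Here the toy consumer has no way to cope with a stopped provider, so the run reports a deviation; that report is the data the experiment was run to obtain.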

Planned experiments of realistic events

Collective Effort
Shared Understanding

Communication is key
No surprises

Generate a Playbook
Automation helps confidence

Be careful when interpreting data

You can't always get what you want
But if you try sometimes you might find
The Rolling Stones

chaostoolkit.org
Chaos as Code

What does normal look like? Your steady state
{
  "probes": {
    "steady": {
      "title": "All services must be healthy before we begin",
      "layer": "application",
      "type": "python",
      "module": "chaosk8s.probes",
      "func": "all_microservices_healthy"
    }
  }
}
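
The `"type": "python"` entries above name a module and a function. One plausible way such an entry could be resolved is a dynamic import followed by a call with the declared arguments. The sketch below illustrates that idea only and is not the Chaos Toolkit's own code; `call_python_provider` is a hypothetical helper, and the probe uses the standard library's `shutil.disk_usage` as a stand-in so it runs without `chaosk8s` installed.

```python
import importlib
from typing import Any


def call_python_provider(activity: dict) -> Any:
    """Import activity["module"], look up activity["func"], call it with its arguments."""
    module = importlib.import_module(activity["module"])
    func = getattr(module, activity["func"])
    return func(**activity.get("arguments", {}))


# Stand-in probe (assumption): same fields as on the slide, but pointing at
# a standard-library function instead of chaosk8s.probes.
probe = {
    "title": "Fetch disk usage for the current directory",
    "type": "python",
    "module": "shutil",
    "func": "disk_usage",
    "arguments": {"path": "."},
}

usage = call_python_provider(probe)
print(probe["title"], "->", usage)
```

Keeping the probe as data (module, function, arguments) lets an experiment file stay declarative while the actual checks live in ordinary Python packages.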

Add sources of information with probes
"probes": {
  "close": {
    "title": "Fetch the CPU usage for our service",
    "type": "python",
    "module": "chaosprometheus.probes",
    "func": "query",
    "arguments": {
      "query": "process_cpu_seconds_total{job='websvc'}",
      "when": "2 minutes ago"
    }
  }
}
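
A probe like the one above ultimately has to turn a query and a point in time into a request against Prometheus. The sketch below builds such a request URL for the Prometheus instant-query endpoint (`/api/v1/query` with `query` and `time` parameters). It is an illustration, not the `chaosprometheus` code: `build_query_url` is a hypothetical helper, the base address is assumed, and "2 minutes ago" is passed as a plain integer rather than parsed from text.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(base_url: str, query: str, minutes_ago: int, now: datetime) -> str:
    """Build an instant-query URL evaluated `minutes_ago` minutes before `now`."""
    when = now - timedelta(minutes=minutes_ago)
    params = urlencode({"query": query, "time": int(when.timestamp())})
    return f"{base_url}/api/v1/query?{params}"


# Fixed reference time so the example is reproducible.
now = datetime(2017, 10, 6, 17, 37, 33, tzinfo=timezone.utc)
url = build_query_url(
    "http://localhost:9090",  # assumed Prometheus address
    "process_cpu_seconds_total{job='websvc'}",
    minutes_ago=2,
    now=now,
)
print(url)

# Decode the query string again to check what would be sent.
sent = parse_qs(urlsplit(url).query)
print(sent["query"][0], sent["time"][0])
```

Passing `now` explicitly keeps the function deterministic, which makes it easier to check that the probe asks about the intended moment.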

Set the condition for change in normality
"action": {
  "title": "Let's max out the CPU of a node",
  "type": "python",
  "module": "chaosgremlin.actions",
  "func": "attack",
  "secrets": "gremlin",
  "arguments": {
    "command": {
      "type": "cpu"
    },
    "target": {
      "type": "Random"
    }
  }
}

Before learning
$ chaos run experiment.json
[2017-10-06 17:37:33 INFO] Running experiment: System is resilient to provider's failures
[2017-10-06 17:37:33 INFO] Observing steady state: All services must be healthy before we begin
[2017-10-06 17:37:33 INFO] Steady State succeeded
[2017-10-06 17:37:33 INFO] Observing steady state: Before we kill it, our microservice should be alive
[2017-10-06 17:37:33 INFO] Observing action: Let's stop our provider
[2017-10-06 17:37:33 INFO] Action succeeded
[2017-10-06 17:37:33 INFO] Observing close state: All services must be healthy before we begin
[2017-10-06 17:37:33 INFO] Close State succeeded
[2017-10-06 17:37:33 INFO] Observing steady state: Consumer should respond as if nothing
[2017-10-06 17:37:44 ERROR] Steady State failed: {"timestamp":1507304264100,"status":500,"error":"Internal Server Error","exception":"feign.RetryableException","message":"connect timed out executing GET http://my-provider-service:8080/","path":"/invokeConsumedService"}
[2017-10-06 17:37:44 INFO] Experiment is now complete

Respond to the non-functional force of
change
Do not merely correct the error
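
The slides do not say which change was made between the two runs. One common way to let a consumer keep answering while its provider is unreachable is a fallback value, sketched below. All names here (`call_with_fallback`, `provider_up`, `provider_down`) are hypothetical and exist only for illustration.

```python
from typing import Callable


def call_with_fallback(fetch: Callable[[], str], fallback: str) -> str:
    """Return the provider's answer, or `fallback` if the provider is unreachable."""
    try:
        return fetch()
    except (TimeoutError, ConnectionError):
        # The provider is down or too slow: degrade gracefully instead of failing.
        return fallback


def provider_up() -> str:
    return "fresh value"


def provider_down() -> str:
    raise TimeoutError("connect timed out")


healthy_answer = call_with_fallback(provider_up, fallback="cached value")
degraded_answer = call_with_fallback(provider_down, fallback="cached value")
print(healthy_answer)
print(degraded_answer)
```

The point is that the consumer's behavior under provider failure becomes a designed response rather than an unhandled error.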

Adaptation
$ chaos run experiment.json
[2017-10-06 17:40:25 INFO] Running experiment: System is resilient to provider's failures
[2017-10-06 17:40:25 INFO] Observing steady state: All services must be healthy before we begin
[2017-10-06 17:40:25 INFO] Observing steady state: Before we kill it, our microservice should be alive
[2017-10-06 17:40:26 INFO] Observing action: Let's stop our provider
[2017-10-06 17:40:26 INFO] Action succeeded
[2017-10-06 17:40:26 INFO] Observing close state: All services must be healthy before we begin
[2017-10-06 17:40:26 INFO] Close State succeeded
[2017-10-06 17:40:26 INFO] Observing steady state: Consumer should respond as if nothing
[2017-10-06 17:40:30 INFO] Experiment is now complete

CHAOSIQ
Chaos Engineering for Cloud Native
SaaS & On-Premises

Sylvain Hellegouarch
CTO ChaosIQ
@lawouach / sylvain@chaosiq.io
