Mumbai Chaos Engineering Meetup #1
Sponsors
What is Gremlin?
› Practitioners of Chaos Engineering
› Builds software that helps organizations build resilient systems
› Offers 11 ways to inject chaos for Chaos Engineering experiments
Agenda
› Welcome
› Resilient Systems
› Introduction to Chaos Engineering
› Chaos Engineering at other companies
› Chaos Engineering with Pumba & Pod-Reaper
› Demo
HELLO!
I am Shantanu Deshpande
You can find me at:
Twitter: @ishantanu_
LinkedIn: shantanud10
Website: inherentlychaotic.me
Chaos Engineering Community Stats
1. What is a Resilient System?
Resilient System
› Highly available and durable system
› Can maintain an acceptable level of service in the face of failure
› Can weather the storm
2. How do we measure a System’s Resiliency?
Measuring a System’s Resiliency
› Mean Time to Failure (MTTF)
› Mean Time to Recovery (MTTR)
› Mean Time Between Failures (MTBF)
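As a rough illustration of how these three metrics relate, the sketch below derives them from a list of outage records. The `Outage` record, the sample numbers, and the convention MTBF = MTTF + MTTR are assumptions made for this example; they are not stated on the slide.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    start_hour: float  # hours since observation began, when the failure started
    end_hour: float    # hours since observation began, when service was restored


def resiliency_metrics(outages, observed_hours):
    """Return (mttf, mttr, mtbf, availability) for one observation window."""
    if not outages:
        raise ValueError("need at least one outage to compute the means")
    downtime = sum(o.end_hour - o.start_hour for o in outages)
    uptime = observed_hours - downtime
    mttf = uptime / len(outages)    # average running time per failure
    mttr = downtime / len(outages)  # average time to restore service
    mtbf = mttf + mttr              # assumed convention: one full failure cycle
    availability = mttf / (mttf + mttr)
    return mttf, mttr, mtbf, availability


# Hypothetical data: two outages in a 1000-hour observation window.
outages = [Outage(100.0, 102.0), Outage(500.0, 506.0)]
mttf, mttr, mtbf, availability = resiliency_metrics(outages, 1000.0)
```

A lower MTTR or a higher MTTF both push availability toward 1, which is why the two are usually tracked together.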
3. How do we create More Resilient Systems?
Chaos
Engineering
“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
What is Chaos Engineering?
“Inject something harmful, in order to build an immunity”
4. Prerequisites for Chaos Engineering
Prerequisites
› High Severity Incident Management
› Monitoring
› Measure the Impact of Downtime
4. Prerequisite #1: High Severity Incident Management
“The practice of recording, triaging, tracking and assigning business value to problems that impact critical systems”
- Gremlin, High Severity Incident Management blog post
What are SEVs?
SEV Level   Description                   Target Resolution Time    Who is notified
SEV 0       Catastrophic service impact   Resolve within 15 min     Entire company
SEV 1       Critical service impact       Resolve within 8 hours    Teams working on SEV and CTO
SEV 2       High service impact           Resolve within 24 hours   Teams working on SEV
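The SEV table above could be encoded as data so that tooling can check whether an incident has exceeded its target. This is a minimal sketch: the `SevPolicy` type and the `is_overdue` helper are hypothetical names introduced for illustration, and only the values come from the table.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SevPolicy:
    description: str
    target_resolution: timedelta
    notify: str


# Values copied from the SEV table; the structure itself is an assumption.
SEV_POLICIES = {
    0: SevPolicy("Catastrophic service impact", timedelta(minutes=15), "Entire company"),
    1: SevPolicy("Critical service impact", timedelta(hours=8), "Teams working on SEV and CTO"),
    2: SevPolicy("High service impact", timedelta(hours=24), "Teams working on SEV"),
}


def is_overdue(sev_level, open_for):
    """True if an incident has been open longer than its target resolution time."""
    return open_for > SEV_POLICIES[sev_level].target_resolution
```

For example, `is_overdue(0, timedelta(minutes=20))` flags a SEV 0 that has been open for 20 minutes.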
4. Prerequisite #2: Monitoring
Why Monitor?
› Analyzing long-term trends
› Comparing over time or experiment groups
› Alerting
› Building dashboards
› Conducting ad hoc retrospective analysis
How should you Monitor?
The Four Golden Signals - The Google SRE Book
Monitoring
Signal       Description                                                         Example
Latency      The time taken to serve a request                                   HTTP 500 error triggered due to loss of DB connection
Traffic      A measure of how much demand is being placed on your system         HTTP requests per second
Errors       Rate of requests that fail, either implicitly or explicitly         Catching HTTP 500s at the LB
Saturation   How full your service is. Should also signal impending saturation   It looks like your DB will fill its hard drive in 4 hours.
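To make the four signals concrete, here is a small sketch that computes one number per signal from a batch of request records. The `Request` record, the choice of a nearest-rank 95th percentile for latency, and disk usage as the saturation measure are all assumptions for illustration, not prescriptions from the SRE book.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def golden_signals(requests, window_seconds, disk_used_bytes, disk_total_bytes):
    """Summarize latency, traffic, errors and saturation for one time window."""
    if not requests:
        raise ValueError("no requests in window")
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile of request latency.
    rank = max(1, math.ceil(0.95 * len(latencies)))
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": latencies[rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation": disk_used_bytes / disk_total_bytes,
    }


# Hypothetical window: note the HTTP 500 is served quickly, which is why
# error latency is worth tracking separately from successful requests.
sample = [Request(20, 200), Request(35, 200), Request(50, 200), Request(5, 500)]
signals = golden_signals(sample, window_seconds=2, disk_used_bytes=80, disk_total_bytes=100)
```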
4. Prerequisite #3: Measure the Impact of Downtime
Measure the impact of downtime
System Impact:
› Availability
› Durability
Customer Impact:
› Outcome
› Cost
› Time
5. Getting Started with Chaos Engineering
Getting Started
› Build a hypothesis around steady state behavior
› Vary real-world events
› Run experiments in production
› Automate experiments to run continuously
› Minimize blast radius
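The steps above can be sketched as a tiny experiment loop: check the steady state, inject a failure within a bounded blast radius, re-check the hypothesis, and always roll back. `FakeService` is a hypothetical in-memory stand-in for a real system, and the 0.99 success-rate threshold is an arbitrary assumption; a real experiment would read the steady state from production monitoring.

```python
import random


class FakeService:
    """Hypothetical stand-in for a service: replicas behind a load balancer."""

    def __init__(self, replicas):
        self.total = replicas
        self.healthy = set(range(replicas))

    def success_rate(self):
        # Assumption: requests succeed as long as one replica is healthy.
        return 1.0 if self.healthy else 0.0

    def kill(self, replica):
        self.healthy.discard(replica)

    def restore(self):
        self.healthy = set(range(self.total))


def run_experiment(service, blast_radius, threshold=0.99, rng=None):
    rng = rng or random.Random()
    # 1. Hypothesis: the steady state holds before anything is injected.
    if service.success_rate() < threshold:
        return "aborted: system not in steady state"
    # 2. Vary a real-world event, limited to `blast_radius` replicas.
    for victim in rng.sample(sorted(service.healthy), blast_radius):
        service.kill(victim)
    try:
        # 3. Verify that the steady state still holds under failure.
        held = service.success_rate() >= threshold
    finally:
        # 4. Always roll back, whatever the outcome.
        service.restore()
    return "hypothesis held" if held else "hypothesis rejected"
```

Running `run_experiment(FakeService(3), blast_radius=1)` on a schedule is the simplest form of automating the experiment to run continuously.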
Chaos Engineering Tools
https://github.com/dastergon/awesome-chaos-engineering
5. Chaos Engineering at companies
Chaos at Netflix
- Used the Simian Army to keep the cloud safe, secure and highly available
- Chaos Monkey
- Janitor Monkey
- Conformity Monkey
Chaos at Google
- Has been running Disaster Recovery Tests (DRTs) for many years
Chaos at LinkedIn
- Project Waterbear, which provides “application resilience” as a service
- Application Failure (LinkedOut)
- Infrastructure Failure (FireDrill)
6. Chaos Engineering with Pumba & Pod-Reaper
What is Pumba?
- Well-known supporting character (warthog) from Disney’s animated film The Lion King
- Pumba means: to be foolish, silly, weak-minded, careless, negligent
- An open-source chaos testing tool for Docker containers
What can Pumba do?
- Can disturb the Docker runtime environment by injecting failures
- Victim containers can be specified by name(s) or by regex
- Random selection is supported (--random)
- Can disturb a single Docker host, a Swarm cluster, or a Kubernetes cluster
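The selection behaviour described above (explicit names, a regex, optional random pick) can be illustrated with a short sketch. This is not Pumba's implementation and does not call Pumba; `select_victims` and its parameters are hypothetical, and it uses Python's `re` module rather than whatever regex engine Pumba uses.

```python
import random
import re


def select_victims(container_names, names=(), pattern=None, pick_random=False, rng=None):
    """Pick target containers by explicit name(s) or regex, optionally one at random."""
    if names:
        matched = [c for c in container_names if c in names]
    elif pattern is not None:
        regex = re.compile(pattern)
        matched = [c for c in container_names if regex.search(c)]
    else:
        matched = list(container_names)
    if pick_random and matched:
        # Narrow the matched set to a single randomly chosen victim.
        rng = rng or random.Random()
        return [rng.choice(matched)]
    return matched


# Hypothetical container names on one Docker host.
running = ["web-1", "web-2", "db-1", "cache-1"]
by_regex = select_victims(running, pattern=r"^web-")
one_random = select_victims(running, pattern=r"^web-", pick_random=True)
```

Combining a regex with random selection keeps the blast radius small: only one of the matching containers is disturbed per run.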
What is Pod-Reaper?
- Designed to kill pods that meet specific conditions
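The idea of killing pods that meet specific conditions can be sketched as a set of rule functions applied to each pod. The `Pod` record, the two example rules, and the choice to require every rule to match are assumptions made for this illustration; they are not Pod-Reaper's actual configuration or code.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    age_seconds: float
    status: str


def older_than(max_age_seconds):
    """Rule: the pod has been running longer than the given duration."""
    return lambda pod: pod.age_seconds > max_age_seconds


def has_status(*statuses):
    """Rule: the pod is in one of the given statuses."""
    return lambda pod: pod.status in statuses


def pods_to_reap(pods, rules):
    """Return pods for which every rule matches (assumed AND semantics)."""
    if not rules:
        return []  # with no rules configured, reap nothing
    return [p for p in pods if all(rule(p) for rule in rules)]


# Hypothetical pods in one namespace.
pods = [
    Pod("api-1", 7200, "Running"),
    Pod("api-2", 60, "Running"),
    Pod("job-1", 9000, "Failed"),
]
doomed = pods_to_reap(pods, [older_than(3600), has_status("Running")])
```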
Time to break some containers
THANKS!
