Chaos Testing for Docker Containers
Who am I?
‣Alexei Ledenev (@alexeiled)
‣Chief of Research @codefresh.io
‣Open Source Projects
‣github.com/alexei-led/pumba
‣github.com/codefresh-io/microci
‣#docker #k8s #aws #gcloud
Complex Systems
"Sooner or later, any complex system will fail, and software systems are no exception.
Failure can occur anytime and almost anywhere. So you should never get too comfortable."
Last Year Outages
• IBM Cloud, January 26

• GitLab, January 31

• AWS, February 28

• Microsoft Azure, March 16

• ...

• Visit http://outage.report/
What can we do
to achieve better Quality?
More testing? Better monitoring?
Functional Testing
Performance Testing
Integration Testing
Penetration Testing
Acceptance Testing
Log Analytics
Monitoring Alerts
Failure Predictions
Building distributed software today is easier than ever
CAP Theorem
“Of three properties of
shared-data systems
(Consistency, Availability
and tolerance to network
Partitions) only two can be
achieved at any given
moment in time.”
Eric Brewer
Chaos Engineering
• Embrace the failure!
• Defines an empirical approach to resilience testing of distributed software systems 

• Chaos Experiment

- define a "normal/steady" state of the system (e.g. by monitoring a set of system and business
metrics)

- pseudo-randomly inject faults (e.g. by terminating VMs, killing containers or changing network
behavior)

- try to discover system weaknesses by deviation from expected or steady-state behavior 

The harder it is to disrupt the steady state, the more confidence we have in the behavior of the system.
 
http://principlesofchaos.org/
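
The three experiment steps above can be sketched as a small shell skeleton. This is only an illustration under stated assumptions: `measure_metric`, `inject_fault`, `within_tolerance` and `run_experiment` are hypothetical names, the metric is stubbed with a constant, and the 10% tolerance is arbitrary. A real experiment would read the metric from monitoring and inject the fault with a chaos tool.

```shell
#!/bin/sh
# Hypothetical chaos-experiment skeleton. All function names are assumptions;
# the stubs let the script run without Docker or a monitoring system.

# Placeholder: print the current value of one steady-state metric,
# e.g. successful requests per second. Stubbed with a constant here.
measure_metric() {
    echo 100
}

# Placeholder: inject a fault into the named target.
# A real experiment might invoke a chaos tool at this point.
inject_fault() {
    echo "injecting fault into: $1" >&2
}

# Succeed when observed deviates from baseline by at most tolerance_pct percent.
within_tolerance() {
    baseline=$1; observed=$2; tolerance_pct=$3
    diff=$((baseline - observed))
    if [ "$diff" -lt 0 ]; then
        diff=$((0 - diff))
    fi
    [ $((diff * 100)) -le $((baseline * tolerance_pct)) ]
}

run_experiment() {
    victim=$1
    baseline=$(measure_metric)   # 1. record the steady state
    inject_fault "$victim"       # 2. inject a fault
    observed=$(measure_metric)   # 3. look for deviation from the steady state
    if within_tolerance "$baseline" "$observed" 10; then
        echo "steady state held"
    else
        echo "weakness found"
    fi
}

run_experiment worker1
```

Keeping the steady-state check separate from the fault injection means the same check can be reused for every kind of fault.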
https://github.com/Netflix/SimianArmy
Google :// Chaos Monkey for Docker
Warthog
What is Pumba(a)?
1. Pumbaa is a well-known supporting character
(warthog) from Disney’s animated film The Lion King

2. In Swahili, pumbaa means “to be foolish, silly, weak-minded, careless, negligent”
3. It's also an open source Chaos Testing tool for Docker
containers 

1. https://github.com/gaia-adm/pumba

2. Linux, Windows, MacOS, Docker
What can Pumba do?
• Pumba disturbs the Docker runtime environment by injecting different failures

• The "victim" containers can be specified by name(s) or by a regular expression

• Random selection is also supported (with the `--random` flag)

• It's possible to define a repeat interval and a duration
to better control the chaos

• Pumba can disturb a single Docker host, a Swarm cluster, or a
Kubernetes cluster
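
To make the selection idea concrete, here is a minimal sketch of choosing victims by regular expression and then picking one at random. It does not call Pumba. The helper names `select_victims` and `pick_random` and the sample container names are hypothetical, and `grep -E` (POSIX extended regex) only stands in for the RE2 syntax behind the `re2:` prefix; simple patterns such as `^cf` behave the same in both.

```shell
#!/bin/sh
# Hypothetical helpers that illustrate victim selection; they do not call Pumba.

# Print the names (one per line) that match an extended regular expression.
select_victims() {
    pattern=$1
    shift
    printf '%s\n' "$@" | grep -E "$pattern"
}

# Read names on stdin and print one of them, chosen pseudo-randomly.
pick_random() {
    awk 'BEGIN { srand() }
         { names[NR] = $0 }
         END { if (NR > 0) print names[int(rand() * NR) + 1] }'
}

# Sample container names (made up for this example).
select_victims '^cf' cf-api cf-db mysql
select_victims '^cf' cf-api cf-db mysql | pick_random
```

The first call lists every match; the second narrows the matches to a single victim, which is the effect of adding random selection on top of a name filter.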
Pumba Docker Chaos Commands
1. stop: stop running Docker containers

2. kill: send a termination (or other) signal to the main process within a
Docker container

3. rm: remove "victim" containers, with their links and volumes

4. pause: pause all processes within a "victim" Docker container for a
specified time
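
For orientation, each of the four actions above has a rough manual counterpart in the plain Docker CLI. The sketch below only prints those commands (a dry run) and never executes them. The function name `chaos_cmd` is hypothetical, and the mapping is an assumption about equivalent effect, not a description of how Pumba is implemented.

```shell
#!/bin/sh
# Dry-run sketch: print a plain Docker CLI command with a similar effect.
# chaos_cmd is a hypothetical helper; nothing is executed against Docker.
chaos_cmd() {
    action=$1
    container=$2
    case "$action" in
        stop)
            echo "docker stop $container" ;;
        kill)
            # Optional third argument: signal name (defaults to SIGKILL).
            echo "docker kill --signal ${3:-SIGKILL} $container" ;;
        rm)
            # --volumes also removes anonymous volumes attached to the container.
            echo "docker rm --force --volumes $container" ;;
        pause)
            # Optional third argument: pause length in seconds (defaults to 15).
            echo "docker pause $container && sleep ${3:-15} && docker unpause $container" ;;
        *)
            echo "unknown action: $action" >&2
            return 1 ;;
    esac
}

chaos_cmd kill mysql SIGTERM
chaos_cmd pause queue 15
```

What a chaos tool adds on top of these one-off commands is the scheduling: repeat intervals, durations, and victim selection.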
demo time ...
Examples
# every 10 minutes, suspend a random container by sending it SIGSTOP
$ pumba --random --interval 10m kill --signal SIGSTOP
# every 15 minutes kill `mysql` container and
# every hour remove containers starting with "cf"
$ pumba --interval 15m kill --signal SIGTERM mysql &
$ pumba --interval 1h rm re2:^cf &
# every 5 min randomly kill "worker1" or "worker2" containers
# and every 3 minutes pause "queue" container for 15s
$ pumba --random --interval 5m kill --signal SIGKILL worker1 worker2 &
$ pumba --interval 3m pause --duration 15s queue &
Pumba Network Chaos Commands
1. Pumba can emulate network failures at container level (filter by IP too)

2. delay egress traffic for the specified containers

3. add packet-loss based on different probability loss models (2-3-4 state
Markov, Gilbert, Simple Gilbert and Bernoulli)

4. rate limit egress traffic for the specified containers
# add a 3-second delay to all outgoing packets
# on the default network device of the `mydb` container, for 5 minutes
$ pumba netem --duration 5m delay --time 3000 mydb
# add a delay of 3000ms ± 30ms,
# with the next random element depending 20% on the last one,
# to all outgoing packets on device `eth1` of all Docker containers
# whose names start with "hp", for 5 minutes
$ pumba netem --duration 5m --interface eth1 delay \
    --time 3000 --jitter 30 --correlation 20 re2:^hp
# add a delay of 3000ms ± 40ms, where the variation in delay
# follows a normal distribution,
# to all outgoing packets on the main network device of a randomly
# chosen Docker container from the specified list, for 5 minutes
$ pumba --random netem --duration 5m delay --time 3000 \
    --jitter 40 --distribution normal \
    container1 container2 container3
Pumba Netem under the hood
• The Linux kernel offers a native framework for routing, bridging, firewalling, address
translation and much else.

• Before a packet leaves the output interface, it passes through Linux Traffic Control (tc). This
component is a powerful tool for scheduling, shaping, classifying and prioritizing traffic.

• The basic component of Linux Traffic Control is the queuing discipline (qdisc).  The
simplest implementation of a qdisc is first in first out (FIFO). There are others too.

• The network emulation (netem) project adds queuing disciplines that emulate wide area
network properties such as latency, jitter, loss, duplication, corruption and reordering.
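
To connect these points to something concrete, the sketch below prints the kind of `tc` command that attaches a netem qdisc to an interface. It is a dry run: commands are echoed, not executed. The helper names are hypothetical, and the commands are an assumption about what a netem delay or loss rule looks like at the `tc` level, not Pumba's exact invocation.

```shell
#!/bin/sh
# Dry-run helpers: they print tc commands instead of running them.
# Function names are hypothetical; the netem syntax follows the tc-netem man page.

# Delay egress packets by delay_ms, optionally with jitter and correlation.
netem_delay_cmd() {
    dev=$1; delay_ms=$2; jitter_ms=${3:-0}; corr_pct=${4:-0}
    cmd="tc qdisc add dev $dev root netem delay ${delay_ms}ms"
    if [ "$jitter_ms" -gt 0 ]; then
        cmd="$cmd ${jitter_ms}ms"
        if [ "$corr_pct" -gt 0 ]; then
            cmd="$cmd ${corr_pct}%"
        fi
    fi
    echo "$cmd"
}

# Drop a fixed percentage of egress packets, independently at random.
netem_loss_cmd() {
    echo "tc qdisc add dev $1 root netem loss $2%"
}

# Remove the netem qdisc again, restoring normal behaviour.
netem_clear_cmd() {
    echo "tc qdisc del dev $1 root netem"
}

netem_delay_cmd eth1 3000 30 20
netem_loss_cmd eth0 10
netem_clear_cmd eth1
```

The first call prints `tc qdisc add dev eth1 root netem delay 3000ms 30ms 20%`. Running such a command for real requires network administration privileges and must happen inside the network namespace of the target container.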
demo time ...
pumba netem loss: https://asciinema.org/a/82430
pumba netem delay: https://asciinema.org/a/82428