RESILIENCE
TESTING!
Why should you?
Geoffrey Arij van der Tas
Quality & Team Performance Coach

Geoffrey van der Tas
TEAM & QUALITY COACH
2012 – Started in Testing
2015 – Training about Resilience Testing @ING
2018 – Workshop about Resilience Testing

Inspire
open your eyes to something new

AGENDA
• Why is it important?
• What is the impact of new technologies?
• What exactly is Resilience testing?
• How to start with it?
• Do-it-yourself tips for after the presentation

ABOUT THOSE DEMANDS
● Dutch banks (source: DNB):
• 2016: 99.64% (30 hours downtime)
• 2017: 99.76% (20 hours downtime)
• 2018: 99.88% (10 hours downtime)
• 2020: 99.94% (5 hours downtime)
● Netflix: 99.98% (less than 3 hours downtime)
● AWS: 99.7%
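The link between an availability percentage and hours of downtime per year is simple arithmetic. The sketch below assumes a 365-day year (8760 hours); the function name `downtime_hours` is made up for illustration. The hour figures on the slide appear to be rounded, so the computed values differ slightly.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def downtime_hours(availability_percent: float,
                   period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime implied by an availability percentage over a period."""
    return (100.0 - availability_percent) / 100.0 * period_hours


# Availability figures taken from the slide above.
for label, pct in [("Dutch banks 2016", 99.64),
                   ("Dutch banks 2017", 99.76),
                   ("Dutch banks 2018", 99.88),
                   ("Dutch banks 2020", 99.94),
                   ("Netflix", 99.98)]:
    print(f"{label}: {pct}% -> about {downtime_hours(pct):.1f} hours per year")
```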

RESILIENCE
“The ability of a substance or object to spring back into shape; elasticity” – Google Dictionary

IT RESILIENCE?
“Resilience is the ability of a system to withstand a major disruption within acceptable degradation parameters and to recover within an acceptable time and composite costs and risks.”

GOAL OF BEING RESILIENT IS:
● Availability
● Less downtime
● Quicker recovery
● More fault tolerant
● Performance
● Security
● Integrity
● Customer Feedback

RESILIENCE
● Infra
• Examples: Load Balancing, Stand-by servers
● People & Processes
• Examples: Stand-by shifts, Release protocols
● Software
• Examples: Re-try Pattern, Circuit Breaker Pattern
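As an illustration of the Re-try Pattern named under Software, here is a minimal sketch, assuming exponential backoff between attempts. The function name `retry`, its parameters and the backoff policy are choices made for this example, not something the slides prescribe.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T],
          attempts: int = 3,
          base_delay: float = 0.1,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation(); on failure wait and try again, up to `attempts` times.

    The wait doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The last failure is re-raised so the caller still sees the error.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: give up and surface the error
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A call such as `retry(lambda: fetch_orders(), attempts=5)` would wrap a hypothetical `fetch_orders` function. Retrying only makes sense for operations that are safe to repeat.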

BRAINSTORM ABOUT FAILURES
Matrix: business impact against probability. Failures placed on it:
● Database Failure
● Database Downtime
● API Failure
● Storage Full
● Network Downtime
● IP renewal
● High Load
● 1 API Fails
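One common way to turn such a brainstorm into a test order is to score each failure on impact and probability and sort by the product. The sketch below does that. The 1 to 5 scale, the multiplication and every score in it are placeholders invented for this example; the slides do not give any values.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FailureScenario:
    name: str
    impact: int       # business impact, 1 (low) to 5 (high); assumed scale
    probability: int  # likelihood, 1 (rare) to 5 (frequent); assumed scale

    @property
    def risk(self) -> int:
        return self.impact * self.probability


def prioritise(scenarios: Iterable[FailureScenario]) -> List[FailureScenario]:
    """Highest risk first, so the riskiest failures get tested first."""
    return sorted(scenarios, key=lambda s: s.risk, reverse=True)


# Scores are made-up placeholders; fill in your own team's estimates.
brainstorm = [
    FailureScenario("Database Failure", impact=5, probability=2),
    FailureScenario("Storage Full", impact=4, probability=3),
    FailureScenario("High Load", impact=3, probability=4),
    FailureScenario("1 API Fails", impact=2, probability=4),
]

for scenario in prioritise(brainstorm):
    print(f"{scenario.risk:>2}  {scenario.name}")
```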

RESILIENCE PATTERNS
https://blog.codecentric.de/en/2019/06/resilience-design-patterns-retry-fallback-timeout-circuit-breaker/
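Of the patterns listed in that article's title, the circuit breaker is the least obvious, so here is a minimal sketch of the idea: after a number of consecutive failures, calls are rejected immediately for a while, and then one trial call is let through. The class name, thresholds and single-threaded design are assumptions made for this example, not the article's implementation.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit is open; call rejected")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures in a row, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Rejecting calls quickly gives a struggling dependency room to recover instead of piling more requests onto it.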

HOW TO TEST IT?
Diagram: Load Generator, Monitoring

PERFORMANCE TESTING TYPES
Chart: load (0 to High) plotted against time (0 to Long, with a 12h mark), showing four test types: Load, Stress, Endurance, Spike
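The four test types differ mainly in how load changes over time. The sketch below expresses each as a function from elapsed minutes to a target number of virtual users, following the usual definitions of these test types. All user counts, step sizes and durations are invented for illustration, apart from the 12-hour endurance run, which mirrors the 12h mark on the chart.

```python
def load_profile(minute: int, users: int = 100) -> int:
    """Load test: a steady, expected level of traffic."""
    return users


def stress_profile(minute: int, start_users: int = 100,
                   step_users: int = 50, step_minutes: int = 10) -> int:
    """Stress test: keep stepping the load up to find the breaking point."""
    return start_users + step_users * (minute // step_minutes)


def endurance_profile(minute: int, users: int = 100,
                      duration_minutes: int = 12 * 60) -> int:
    """Endurance test: normal load held for a long time (here 12 hours)."""
    return users if minute < duration_minutes else 0


def spike_profile(minute: int, base_users: int = 20, peak_users: int = 500,
                  spike_start: int = 30, spike_minutes: int = 5) -> int:
    """Spike test: a short, sudden burst on top of a low baseline."""
    in_spike = spike_start <= minute < spike_start + spike_minutes
    return peak_users if in_spike else base_users


for minute in (0, 15, 32, 60):
    print(minute, load_profile(minute), stress_profile(minute),
          endurance_profile(minute), spike_profile(minute))
```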

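INTRODUCING FAILURES
Diagram: Load Generator, Monitoring
/etc/init.d/networking restart            # restart networking (Debian-style init script); briefly drops connectivity
dd bs=2048 if=/dev/urandom of=/dev/null   # generate CPU load by reading random data until interrupted
stress -i 1 -t 60                         # one worker spinning on sync() for 60 seconds
https://github.com/Netflix/SimianArmy
https://github.com/Shopify/toxiproxy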
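The same idea can be tried inside a single program before touching any servers. The sketch below wraps a function so that calls are delayed or fail at a chosen rate. It is an in-process illustration only and does not use or imitate the Simian Army or toxiproxy APIs; `with_faults` and `InjectedFailure` are names invented for this example.

```python
import random
import time
from typing import Any, Callable, Optional


class InjectedFailure(RuntimeError):
    """Marks an error that the test harness injected on purpose."""


def with_faults(operation: Callable[..., Any],
                failure_rate: float = 0.0,
                added_latency: float = 0.0,
                rng: Optional[random.Random] = None,
                sleep: Callable[[float], None] = time.sleep) -> Callable[..., Any]:
    """Return a version of `operation` that is slower and sometimes fails."""
    rng = rng or random.Random()

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        if added_latency > 0:
            sleep(added_latency)           # simulate a slow network or dependency
        if rng.random() < failure_rate:    # random() is in [0, 1)
            raise InjectedFailure("injected failure")
        return operation(*args, **kwargs)

    return wrapped
```

For example, `with_faults(call_api, failure_rate=0.2, added_latency=0.5)` would make a hypothetical `call_api` function fail roughly one time in five and respond half a second slower.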

WANT TO TRY IT YOURSELF?
Website with a microservices architecture and communication between front-end, API and database
● Platform: Docker or Kubernetes
● Monitoring: Prometheus & Grafana
Add a performance test:
● Gatling
● JMeter
Add a failure:
● Kill a pod
● Stress a pod
● Chaos Monkey
Socks Shop: https://microservices-demo.github.io/
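Killing a pod can be scripted in a few lines. The sketch below shells out to `kubectl` to list the pods in a namespace and delete one at random. It assumes `kubectl` is installed and pointed at a test cluster, and that the demo runs in a namespace called `sock-shop` (an assumption; check your own deployment). The function names are made up for this example.

```python
import random
import subprocess
from typing import Any, Callable, List, Optional


def list_pods(namespace: str, run: Callable[..., Any] = subprocess.run) -> List[str]:
    """Return pod references such as 'pod/<name>' for the given namespace."""
    result = run(["kubectl", "get", "pods", "-n", namespace, "-o", "name"],
                 capture_output=True, text=True, check=True)
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def kill_random_pod(namespace: str,
                    run: Callable[..., Any] = subprocess.run,
                    rng: Optional[random.Random] = None) -> str:
    """Delete one randomly chosen pod and return its reference."""
    rng = rng or random.Random()
    pods = list_pods(namespace, run)
    if not pods:
        raise RuntimeError(f"no pods found in namespace {namespace!r}")
    victim = rng.choice(pods)
    run(["kubectl", "delete", victim, "-n", namespace], check=True)
    return victim
```

Calling `kill_random_pod("sock-shop")` while a Gatling or JMeter test is running shows in the Grafana dashboards whether, and how quickly, Kubernetes replaces the pod. Only run this against an environment you are allowed to break.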

Read about it:
Resilience:
https://netflixtechblog.com/tagged/resilience
https://www.zerto.com/the-platform/what-is-it-resilience/
Resilience patterns:
https://www.jrebel.com/blog/microservices-resilience-patterns
https://medium.com/@adhorn/patterns-for-resilient-architecture-part-1-d3b60cd8d2b6
Resilience Testing:
https://usersnap.com/blog/resilience-testing/
https://thenewstack.io/the-importance-of-resilience-testing-and-observability/
https://en.wikipedia.org/wiki/Chaos_engineering
Doing it:
● Performance testing:
• Gatling - https://gatling.io/
• JMeter - https://jmeter.apache.org/
● Test target:
• Socks website - https://microservices-demo.github.io/
● Platform:
• Docker - https://www.docker.com/
• Kubernetes - https://kubernetes.io/
● Tools for failures:
• Nstress - https://www.ibm.com/support/pages/stress-test-your-aix-or-linux-server-nstress
• Simian Army - https://github.com/Netflix/SimianArmy
• ToxiProxy - https://github.com/Shopify/toxiproxy
• Stress Container - https://github.com/progrium/docker-stress
● Monitoring:
• Prometheus - https://prometheus.io/
• Grafana - https://grafana.com/
• Dynatrace - https://www.dynatrace.com/

COMPLEX? TOO HARD?
● Not really. Start small:
• Reboot a server
• Delete a database
• Kill a service on your machine
● See what happens. Resilience is about:
“the ability of a substance or object to spring back into shape; elasticity” – Google Dictionary
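"See what happens" becomes measurable once you time how long the system takes to spring back. The sketch below polls a health check after a failure has been introduced and reports the seconds until it passes again. The name `measure_recovery`, the polling interval and the timeout are assumptions made for this example; the health check itself is whatever makes sense for your system.

```python
import time
from typing import Callable, Optional


def measure_recovery(is_healthy: Callable[[], bool],
                     timeout: float = 60.0,
                     interval: float = 1.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> Optional[float]:
    """Poll is_healthy() until it returns True.

    Returns the elapsed seconds at the first healthy check,
    or None if the system has not recovered within `timeout` seconds.
    """
    start = clock()
    while True:
        elapsed = clock() - start
        if is_healthy():
            return elapsed
        if elapsed >= timeout:
            return None
        sleep(interval)
```

A typical run: kill the service, call `measure_recovery(check)` with a hypothetical `check` function that requests the home page and returns True on success, and compare the result with the recovery time you consider acceptable.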

CONCLUSION RESILIENCE
● Why should you?
• Microservices, Cloud & Always online
• Decreasing fault margins
• Decreasing response times
● Where to start?
• Brainstorm
• Resiliency Patterns
• Communication (Sequence Diagrams)
● Resilience – Testing the elasticity of your IT services
• Performance testing
• Introducing failures
• Monitoring/Alerting
• Analyse it
Diagram: Load Generator, Platform & System, Failures <./Command>, Monitoring & Analyzing
