RESILIENCE TESTING!
Why should you?
Geoffrey Arij van der Tas
Quality & Team Performance Coach
Geoffrey van der Tas
TEAM & QUALITY COACH
2012 – Started in Testing
2015 – Training about Resilience Testing @ING
2018 – Workshop about Resilience Testing
Inspire
open your eyes for something new
AGENDA
• Why is it important?
• What is the impact of new technologies?
• What is resilience testing exactly?
• How to start with it?
• Do-it-yourself tips for after the presentation
DEMANDS HAVE CHANGED
ABOUT THOSE DEMANDS
● Dutch banks (source: DNB):
• 2016: 99.64% (30 hours downtime)
• 2017: 99.76% (20 hours downtime)
• 2018: 99.88% (10 hours downtime)
• 2020: 99.94% (5 hours downtime)
● Netflix: 99.98% (less than 3 hours downtime)
● AWS: 99.7%
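The downtime figures follow directly from the availability percentages. A minimal sketch of that arithmetic, assuming downtime is counted over a full year of 8760 hours (the function name is illustrative, not from the slides):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime in hours implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


if __name__ == "__main__":
    # Availability figures taken from the slide above.
    for label, pct in [("2016", 99.64), ("2017", 99.76), ("2018", 99.88), ("2020", 99.94)]:
        print(f"{label}: {pct}% -> {downtime_hours_per_year(pct):.1f} hours/year")
```

The results land close to the rounded values on the slide: each extra 0.12 percentage points of availability removes roughly ten hours of downtime per year.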
IT HAS CHANGED
RESILIENCE
“the ability of a substance or object to spring back into shape; elasticity” – Google Dictionary
IT RESILIENCE?
“Resilience is the ability of a system to withstand a major disruption within acceptable degradation parameters and to recover within an acceptable time and composite costs and risks.”
GOAL OF BEING RESILIENT IS:
● Availability
● Less downtime
● Quicker recovery
● More fault tolerant
● Performance
● Security
● Integrity
● Customer Feedback
RESILIENCE
Infra
Examples:
● Load Balancing
● Stand-by servers
People & Processes
Examples:
● Stand-by shifts
● Release protocols
Software
Examples:
● Re-try Pattern
● Circuit Breaker Pattern
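As an illustration of the Re-try Pattern named above, here is a minimal sketch that retries a failing call with exponential backoff. The function name, the default values and the injectable `sleep` parameter are assumptions made for this example, not taken from the slides.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T],
          attempts: int = 3,
          base_delay: float = 0.1,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation(); on failure wait and try again, up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

A call such as `retry(lambda: fetch_orders(), attempts=5)` (where `fetch_orders` is a hypothetical function that may fail) would hide short outages from the caller while still failing once the attempts are used up.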
WHERE TO START?
BRAINSTORM ABOUT FAILURES
[Chart: failure scenarios plotted by business impact and probability]
● Database Failure
● Database Downtime
● API Failure
● Storage Full
● Network Downtime
● IP renewal
● High Load
● 1 API Fails
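One way to turn such a brainstorm into a test backlog is to score each scenario and sort by risk. The sketch below assumes a simple impact × probability score on a 1 to 5 scale; the scale and the class name are illustrative assumptions, and any scores you fill in should come from your own brainstorm.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FailureScenario:
    name: str
    impact: int       # business impact, 1 (minor) to 5 (severe); assumed scale
    probability: int  # likelihood, 1 (rare) to 5 (frequent); assumed scale

    @property
    def risk(self) -> int:
        """Simple risk score: impact multiplied by probability."""
        return self.impact * self.probability


def rank(scenarios: Iterable[FailureScenario]) -> List[FailureScenario]:
    """Return the scenarios ordered from highest to lowest risk."""
    return sorted(scenarios, key=lambda s: s.risk, reverse=True)


if __name__ == "__main__":
    # Placeholder scores for illustration only.
    backlog = rank([
        FailureScenario("Storage Full", impact=3, probability=4),
        FailureScenario("Database Downtime", impact=5, probability=2),
        FailureScenario("IP renewal", impact=2, probability=1),
    ])
    for scenario in backlog:
        print(f"{scenario.risk:>2}  {scenario.name}")
```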
RESILIENCE PATTERNS
https://blog.codecentric.de/en/2019/06/resilience-design-patterns-retry-fallback-timeout-circuit-breaker/
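To make the circuit breaker pattern concrete, here is a minimal single-threaded sketch. The class name, the thresholds and the injectable `clock` are assumptions for this example; production libraries add thread safety, metrics and fallbacks.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; call rejected")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of piling more load onto a dependency that is already struggling, which is exactly the behaviour a resilience test should confirm.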
SEQUENCE DIAGRAMS
HOW TO TEST IT?
[Diagram: Load Generator and Monitoring]
PERFORMANCE TESTING TYPES
[Chart: load (0 to High) plotted against time (0 to Long, with a 12h mark) for four test types]
● Load
● Stress
● Endurance
● Spike
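The four test types differ mainly in the shape of the load over time. The sketch below expresses each as a function from elapsed minutes to a target number of virtual users. All user counts and step sizes are made-up placeholders, and attaching the 12h mark to the endurance test is an assumption about the chart.

```python
def load_test(t_min: float, users: int = 100) -> int:
    """Load test: a steady, expected level of load."""
    return users


def stress_test(t_min: float, start_users: int = 100,
                step_users: int = 100, step_minutes: int = 10) -> int:
    """Stress test: keep stepping the load up until something breaks."""
    return start_users + step_users * int(t_min // step_minutes)


def endurance_test(t_min: float, users: int = 100,
                   duration_minutes: int = 12 * 60) -> int:
    """Endurance test: normal load held for a long time (here 12 hours)."""
    return users if t_min < duration_minutes else 0


def spike_test(t_min: float, base_users: int = 100, spike_users: int = 1000,
               spike_start: int = 30, spike_minutes: int = 5) -> int:
    """Spike test: a short, sudden burst on top of the normal load."""
    in_spike = spike_start <= t_min < spike_start + spike_minutes
    return spike_users if in_spike else base_users
```

Profiles like these can then be translated into the injection settings of a load generator such as Gatling or JMeter.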
INTRODUCING FAILURES
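[Diagram: Load Generator and Monitoring]
# Restart networking (Debian-style init script)
/etc/init.d/networking restart
# Generate CPU load by streaming random data to /dev/null (runs until interrupted)
dd bs=2048 if=/dev/urandom of=/dev/null
# Run one I/O worker for 60 seconds
stress -i 1 -t 60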
https://github.com/Netflix/SimianArmy
https://github.com/Shopify/toxiproxy
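Failures can also be injected in code, before any infrastructure tooling is in place. The following is a minimal in-process sketch that wraps a function with added latency and a configurable error rate. It is a stand-in for the idea only and does not use the Simian Army or Toxiproxy APIs; all names are assumptions for this example.

```python
import random
import time
from typing import Any, Callable, Optional


class InjectedFault(RuntimeError):
    """Raised in place of a real failure during a resilience test."""


def with_faults(operation: Callable[..., Any],
                error_rate: float = 0.0,
                latency_s: float = 0.0,
                rng: Optional[random.Random] = None,
                sleep: Callable[[float], None] = time.sleep) -> Callable[..., Any]:
    """Wrap operation so calls are delayed and randomly fail."""
    rng = rng or random.Random()

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        if latency_s > 0:
            sleep(latency_s)             # simulate a slow dependency
        if rng.random() < error_rate:    # simulate an intermittent failure
            raise InjectedFault("injected failure")
        return operation(*args, **kwargs)

    return wrapped
```

For example, `with_faults(call_api, error_rate=0.2, latency_s=0.5)` (with `call_api` a hypothetical client function) makes one in five calls fail on average and slows down every call, which is enough to see whether retries and circuit breakers behave as designed.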
HOW TO ANALYZE IT?
WANT TO TRY YOURSELF?
Website with Microservices architecture and communication between front-end, API and Database
● Platform: Docker or Kubernetes
● Monitoring: Prometheus & Grafana
Add a performance test:
● Gatling
● JMeter
Add a failure:
● Kill a Pod
● Stress a Pod
● Chaos Monkey
Sock Shop: https://microservices-demo.github.io/
Read about it:
Resilience:
https://netflixtechblog.com/tagged/resilience
https://www.zerto.com/the-platform/what-is-it-resilience/
Resilience patterns:
https://www.jrebel.com/blog/microservices-resilience-patterns
https://medium.com/@adhorn/patterns-for-resilient-architecture-part-1-d3b60cd8d2b6
Resilience Testing:
https://usersnap.com/blog/resilience-testing/
https://thenewstack.io/the-importance-of-resilience-testing-and-observability/
https://en.wikipedia.org/wiki/Chaos_engineering
Doing it:
● Performance testing:
• Gatling - https://gatling.io/
• JMeter - https://jmeter.apache.org/
● Test target:
• Sock Shop website - https://microservices-demo.github.io/
● Platform:
• Docker - https://www.docker.com/
• Kubernetes - https://kubernetes.io/
● Tools for failures:
• Nstress - https://www.ibm.com/support/pages/stress-test-your-aix-or-linux-server-nstress
• Simian Army - https://github.com/Netflix/SimianArmy
• ToxiProxy - https://github.com/Shopify/toxiproxy
• Stress Container - https://github.com/progrium/docker-stress
● Monitoring:
• Prometheus - https://prometheus.io/
• Grafana - https://grafana.com/
• Dynatrace - https://www.dynatrace.com/
COMPLEX? TOO HARD?
● Not really. Start small:
• Reboot a server
• Delete a database
• Kill a service on your machine
● See what happens. Resilience is about:
“the ability of a substance or object to spring back into shape; elasticity” – Google Dictionary
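"Springing back" can be measured. After killing a service, a small script can poll a health check and report how long recovery took. The sketch below assumes you supply your own `is_healthy` probe (for example an HTTP check against your service); the function name and defaults are illustrative.

```python
import time
from typing import Callable, Optional


def measure_recovery(is_healthy: Callable[[], bool],
                     timeout_s: float = 60.0,
                     interval_s: float = 1.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> Optional[float]:
    """Poll is_healthy() until it returns True.

    Returns the seconds elapsed until recovery, or None if the system
    did not recover within timeout_s.
    """
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

Start the measurement right after introducing the failure and compare the result with the recovery time your team considers acceptable.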
CONCLUSION RESILIENCE
● Why should you?
• Microservices, cloud & always online
• Decreasing fault margins
• Decreasing response times
● Where to start?
• Brainstorm
• Resilience patterns
• Communication (sequence diagrams)
● Resilience: testing the elasticity of your IT services
• Performance testing
• Introducing failures
• Monitoring/alerting
• Analyse it
[Diagram: Load Generator, Platform & System, Failures (<./Command>), Monitoring & Analyzing]
