GameDay - Achieving resilience through Chaos Engineering

1,715 views

Published on

http://dius.com.au/resources/game-day/

Agility has brought us iterative software development, independent feature teams, nimble architectures and distributed, scalable infrastructure. But how do you maintain confidence in these systems in the face of emergent complexity and fast-paced change? The answer is to anticipate and practice failure!

In this session we explore GameDays, a collaborative exercise where teams safely introduce chaos into their systems, in order to make them better.

Published in: Technology

GameDay - Achieving resilience through Chaos Engineering

  1. "GameDay": Achieving resilience through Chaos Engineering. Matt Fellows (@matthewfellows), Pete Cohen (@petecohen). #AAGameDay #ChaosTesting
  2. What is the common thread for these catastrophes?
  3. #1 They all combined Technology with People + Process
  4. #2 They all had multiple causes
  5. Overview ■ Why GameDay exercises? ■ Case Studies ■ How you can run one for yourself
  6. Classes of issues ■ Bugs ■ Integration issues ■ Distributed failure ■ The squishy stuff: People + Process
  7. Pace layered architecture. Diagram labels: User Interface, Mobile, API Gateway, Mainframe / DB, Middleware / APIs, Vertical Slice
  8. Bug. Same diagram, with a "bug" marker
  9. Integration Issues. Same diagram, with an "integration" marker
  10. Distributed Failures. Same diagram, with a "distributed" marker
  11. Catastrophes. Same diagram, with "bug", "integration" and "distributed" markers ... and the labels Customers, Engineers, Call Centre, Public Relations
  12. Classes of issues ■ Bugs ■ Integration issues ■ Distributed failure ■ The squishy stuff: People + Process
  13. So how do we avoid becoming front page news? (the bad kind)
  14. Fragility vs Resilience
  15. Resilience vs Antifragility
  16. Embracing Failure ■ We need to practice failure ■ Software Engineering needs its Fire Drill
  17. GameDay: An exercise where we place our systems - technology, people + processes - under stress in order to learn and improve resilience.
  18. A GameDay manifesto? (DR | GameDays)
      Driver: Process | Continuous Improvement
      Approach: Run sheet + requirements | Loose plan + a little chaos
      Focus: Infrastructure | Customer
      Who: Operations | Cross functional, multi-disciplinary team
      Assumption: System is built to a robust design | System is hazardous
  19. Once you finally start succeeding at agile… ■ Iterative software development ■ Independent feature teams ■ Nimble architectures ■ Distributed, scalable infrastructure
  20. We want to inspire you to give GameDays a go
  21. Case Studies
  22. Case Study: SEEK & nib
  23. Logistics - how to plan a GameDay (dius.com.au/resources/game-day) ■ People and roles to get involved ■ Preparation workshops and planning ■ Templates and checklist ■ Physical space set up
  24. Get buy in ■ Find the right people ■ Run workshops ■ Logistical preparation ■ Run the GameDay ■ Communicate and act on outcomes
  25. Get buy in ■ Find the right people ■ Run workshops ■ Logistical preparation ■ Run the GameDay ■ Communicate and act on outcomes
  26. Scenario and hypothesis generation workshop ■ Decide which broad areas to test ■ Identify scenarios ■ Capture hypotheses ■ Formulate an action plan to set up scenarios ■ Get a common view of the stack
  27. Post Mortem. Diagram labels: Load Balancer; API (x4); Load Balancer (x4)
  28. Post Mortem. Diagram labels: Load Balancer; API (x4); Load Balancer (x4), each marked X; "No visibility!"; ✅ (x4); X; Release Dashboard
  29. Ingredients for catastrophe ✓ Introduction of a change to the system ✓ Human error ✓ Missing local controls (tests) to prevent syntax issue ✓ Lack of salient information for operator (monitoring and alerting) ✓ Opportunity to misinterpret data ✓ Distance between expert and operator (process)
  30. What did we learn? ■ Just getting teams together to discuss resilience was worthwhile ■ We always found something ■ Our experiments reduced the impact of hindsight bias
  31. What matters: ■ Cross-functional team ■ Planning ■ Open to exposing failure ■ Customer focus ■ Bake it in - do GameDays frequently. What doesn’t matter: ■ Size of team/company ■ Waterfall/Agile ■ Language, technology...
  32. Are GameDays the new hack days? ■ Collaboration ■ Problem solving ■ Creates business value
  33. The journey towards automated resilience testing. Pre-Production: ■ Create local experiments in Docker ■ Manual chaos in integrated environments. Production: ■ Start small! ■ Metrics-driven approach. Also on the slide: Chaos Kong, pumba
  34. Thank you! Matt Fellows (@matthewfellows, mfellows@dius.com.au), Pete Cohen (@petecohen, pcohen@dius.com.au). For links, references, templates and your GameDay toolkit, head to: dius.com.au/resources/game-day
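
The workshop steps on slide 26 (identify scenarios, capture hypotheses) and the "create local experiments" step on slide 33 could be sketched roughly as follows. This is a minimal illustration only, not the presenters' toolkit and not how pumba or Chaos Kong work: the names `Experiment`, `run_experiment` and `ApiPool` are hypothetical, and the in-memory pool merely stands in for the four API instances behind a load balancer shown in the post mortem diagram.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """One GameDay scenario with its hypothesis (all names are hypothetical)."""
    scenario: str                       # what we break
    hypothesis: str                     # what we expect to happen
    inject: Callable[[], None]          # introduces the failure
    rollback: Callable[[], None]        # undoes the failure
    steady_state: Callable[[], bool]    # True while customers are unaffected


@dataclass
class Result:
    scenario: str
    hypothesis_held: bool
    notes: str


def run_experiment(exp: Experiment) -> Result:
    # Never inject chaos into a system that is already unhealthy.
    if not exp.steady_state():
        return Result(exp.scenario, False, "aborted: unhealthy before injection")
    exp.inject()
    try:
        held = exp.steady_state()
    finally:
        exp.rollback()  # always restore, even if the check itself fails
    notes = "hypothesis held" if held else "hypothesis disproved: take to post mortem"
    return Result(exp.scenario, held, notes)


class ApiPool:
    """Toy stand-in for API instances behind a load balancer (an assumption)."""

    def __init__(self, size: int) -> None:
        self.healthy = [True] * size

    def kill(self, index: int) -> None:
        self.healthy[index] = False

    def restore(self, index: int) -> None:
        self.healthy[index] = True

    def serves_traffic(self) -> bool:
        # Traffic is served as long as at least one instance is up.
        return any(self.healthy)


pool = ApiPool(4)
experiment = Experiment(
    scenario="One of four API instances dies",
    hypothesis="The load balancer routes around it; customers are unaffected",
    inject=lambda: pool.kill(0),
    rollback=lambda: pool.restore(0),
    steady_state=pool.serves_traffic,
)
result = run_experiment(experiment)
print(result.scenario, "->", result.notes)
```

In a real GameDay the `inject` and `steady_state` callables would be replaced by whatever the team agreed on in the workshop, for example stopping a container and checking a customer-facing metric; writing the hypothesis down before the run is what makes the outcome comparable afterwards.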
