2. “Chaos Engineering is the discipline of experimenting on a
distributed system in order to build confidence in the system's
capability to withstand turbulent conditions in production.”
-- http://principlesofchaos.org/
3. Introduction
• Paul Osman - Senior Engineering Manager
• posman@underarmour.com
• Previous Lives: PagerDuty, 500px, SoundCloud
7. Game Days
• Imagine what could fail.
• Figure out how to prevent it from affecting the business, and implement that.
• Cause the failure scenario to happen in production, ideally to demonstrate that the event has no user-visible effect, thereby gaining confidence in the system (see the sketch below).
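For concreteness, here is a minimal Python sketch of the "cause the failure" step, assuming a hypothetical fault-flag endpoint (FLAG_URL) on the service under test; the talk does not prescribe any particular tooling:

    import time
    import requests

    FLAG_URL = "https://weather.internal.example/_chaos/flag"  # hypothetical endpoint

    def run_experiment(duration_s: int = 300) -> None:
        """Flip a fault flag on, observe dashboards/alerts, flip it off."""
        requests.post(FLAG_URL, json={"enabled": True}, timeout=5)
        try:
            # Observation window: watch user-facing behaviour and PagerDuty.
            time.sleep(duration_s)
        finally:
            # Always roll back, even if the observation is interrupted.
            requests.post(FLAG_URL, json={"enabled": False}, timeout=5)

The finally block is the important design choice: the experiment must end with the system back in its normal state, whatever happens during the observation window.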
9. Engineers <> Engineers
This is just a healthy team. A few things I've found build trust on a team:
• Embrace failures. Learn from them.
• Incident Response Process (STAT)
• Practice blame free retrospectives.
• Embrace ownership - engineers own alerts.
10. Engineers <> Managers
What can managers do to build trust?
• Nurture a blame free and just culture.
• Protect time for action items.
11. Engineers <> Non-Engineers
How about building trust between Engineers and Non-Engineering stakeholders (e.g. product, executives, customer support)?
• Metrics that show business impact
• Be Transparent about Incidents
• Talk loudly about Chaos Engineering
12. Operational Maturity Checklist
• Incident Response Process
• Blame Free Retrospectives
• Action Items
• Metrics on Incidents
• Talk Loudly about Resiliency
15. Failure Scenarios
• Scenario A - Weather HTTP Service Unavailable
• Scenario B - Weather MySQL RDS Unavailable
• Scenario C - The Weather Channel API - High Latency
• Scenario D - Workout Service Unavailable
• Scenario E - Weather Async Service Unavailable
17. Scenario A - Weather HTTP Service Unavailable
• Workout still shown, just without weather
• PagerDuty alert? A low-urgency alert should fire (see the sketch below)
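A minimal sketch of that graceful degradation, assuming a hypothetical workout_with_weather helper and the requests library; if the weather call fails, the workout is served without it:

    import logging
    import requests

    log = logging.getLogger(__name__)

    def workout_with_weather(workout: dict, weather_url: str) -> dict:
        """Attach weather to a workout if we can; serve the workout regardless."""
        try:
            resp = requests.get(weather_url, timeout=1.0)
            resp.raise_for_status()
            workout["weather"] = resp.json()
        except requests.RequestException:
            # Weather is optional: log it (this can feed the low-urgency
            # alert) and return the workout without it.
            log.warning("weather service unavailable; serving workout without it")
        return workout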
18. Scenario B - Weather MySQL RDS Unavailable
• Expected 503s when the database was down; the service was actually throwing 504s
• Had to restart the service after the database was brought back up - connections were not being recycled (both fixes sketched below)
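A hedged sketch of both fixes, using Flask and SQLAlchemy as stand-ins (the talk doesn't name the stack): fail fast with a 503 when the database is unreachable, and let the connection pool discard stale connections so a restart isn't needed:

    from flask import Flask, jsonify
    from sqlalchemy import create_engine, text
    from sqlalchemy.exc import OperationalError

    app = Flask(__name__)
    engine = create_engine(
        "mysql+pymysql://user:pass@rds-host/weather",  # illustrative DSN
        pool_pre_ping=True,   # test connections before use; drop dead ones
        pool_recycle=3600,    # recycle connections older than an hour
    )

    @app.route("/weather/<city>")
    def weather(city):
        try:
            with engine.connect() as conn:
                row = conn.execute(
                    text("SELECT temp_c FROM weather WHERE city = :c"),
                    {"c": city},
                ).first()
        except OperationalError:
            # Database unreachable: fail fast with a 503, instead of
            # hanging until a gateway turns it into a 504.
            return jsonify(error="weather database unavailable"), 503
        return jsonify(city=city, temp_c=row.temp_c if row else None)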
19. Scenario C - High Latency from the Weather Channel API
• Requests timeout - should fire low urgency alert
• Action item: audit timeouts (sketched below)
• Expectation: asynchronous tasks are still processed
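One way the timeout audit could look in Python with requests (the URL and budgets are illustrative): every outbound call gets an explicit connect/read timeout, instead of the library default of waiting indefinitely:

    import requests

    CONNECT_TIMEOUT_S = 2.0  # time allowed to establish the connection
    READ_TIMEOUT_S = 5.0     # time allowed to wait for the response

    def fetch_forecast(city: str):
        """Fetch a forecast with an explicit (connect, read) timeout."""
        try:
            resp = requests.get(
                "https://api.weather.example/forecast",  # illustrative URL
                params={"city": city},
                timeout=(CONNECT_TIMEOUT_S, READ_TIMEOUT_S),
            )
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            # A slow or failing upstream times out quickly and degrades,
            # instead of holding the request open until a gateway 504.
            return None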
20. Takeaways!
• We learned a ton!
• Scheduled some valuable action items
• Just thinking about this stuff was worthwhile
• Less alert fatigue!
• Let's do more!
21. Next steps
• More teams doing more game days more frequently
• Build failure injection into our release process (production readiness) - see the sketch below
• Automate automate automate (hi Gremlin!)
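A hedged sketch of what that could look like: a pytest gate run against a staging environment before a release is promoted, reusing the same hypothetical fault flag as earlier; all endpoints are illustrative, not from the talk:

    import requests

    STAGING = "https://staging.example.com"
    FAULT_FLAG = f"{STAGING}/_chaos/weather-down"

    def test_workout_survives_weather_outage():
        """Release gate: workouts must still be served with weather down."""
        requests.post(FAULT_FLAG, json={"enabled": True}, timeout=5)
        try:
            resp = requests.get(f"{STAGING}/workouts/latest", timeout=5)
            assert resp.status_code == 200          # workout still served
            assert "weather" not in resp.json()     # just without weather
        finally:
            # Restore staging no matter how the assertions go.
            requests.post(FAULT_FLAG, json={"enabled": False}, timeout=5)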
22. Resources
• PagerDuty Incident Response Docs - https://response.pagerduty.com/
• Principles of Chaos - https://principlesofchaos.org/
• Fault Injection in Production - https://queue.acm.org/detail.cfm?id=2353017
• Gremlin Blog - https://www.gremlin.com/blog/