The Journey of Chaos Engineering Begins with a Single Step

PagerDuty Summit 2016
Presenters: Bruce Wong, James Burns
https://www.pagerduty.com/pagerduty-summit-2016/

Heard of Netflix's Chaos Engineering and the Simian Army? Google's legendary DiRT exercises? Hear how Twilio is getting started on its journey with Chaos Engineering. This talk is the story of how Twilio got started with Chaos Engineering, the lessons learned, and the impact on our engineering culture.

  1. #PDSummit16 The Journey of Chaos Engineering Begins with a Single Step
  2. Bruce Wong, Senior Engineering Manager, Twilio, @bruce_m_wong, https://www.linkedin.com/in/brucemwong
  4. 2009 2012 2014 http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html https://github.com/Netflix/SimianArmy http://techblog.netflix.com/2015/09/chaos-engineering-upgraded.html
  5. http://readwrite.com/2014/09/17/netflix-chaos-engineering-for-everyone/ http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html
  6. https://www.twilio.com/
  7. https://customers.twilio.com/
  8. "The journey of a thousand miles begins with a single step." - Lao Tzu
  9. James Burns, Tech Lead, Twilio, @1mentat, https://www.linkedin.com/in/james-burns-7816a82
  10. Preparation: Pre-Launch Log Aggregation System • Stage env • Synthetic Traffic
  11. The Master of Disaster • Network Issues • Partitions • Thundering Herds • Cascading Failures • Resource Starvation • CPU • Memory • Disk IO • Network IO • Application Load
  12. > sudo halt
  13. Incident Start
  14. Impact?
  15. Post-Mortem
  17. Round 2 • Network Issues • Partitions • Thundering Herds • Cascading Failures • Resource Starvation • CPU • Memory • Disk IO • Network IO • Application Load
  18. > sudo halt
  19. Third-Party API Failure
  20. Well, that’s not what I expected to see
  21. Outcomes • Instrument Instrument Instrument • API SLAs • Architectural Change!
  22. Recap • Start Simple • Instrumentation Gaps • Understand your dashboards • Prevent outages
  23. http://www.crisistextline.org/ http://polarisproject.org/befree-textline http://trekmedics.org/ https://www.twilio.org/
  24. When you wish upon a blue moon…
  25. Please provide feedback for this session by filling out the feedback survey
