
PagerDuty | OSCON 2016 Failure Testing


Failure Testing: Automating a Series of Unfortunate Events

Presentation by:
Alper Kokmen
Software Engineer

Published in: Technology

  2. Alper Kokmen
      PRESENT: Software Engineer at PagerDuty. Surrounded by smart people.
      PAST: Start-ups, Microsoft. Surrounded by smart people.
      #OSCON
  4. Goals
      - Start manually injecting failures.
      - Start automating your manual tests.
  5. CHAOS ENGINEERING
      “[T]he discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
      Principles of Chaos Engineering
  7. PagerDuty Simian Army?
      - Multiple cloud providers (AWS and Azure)
      - Experimentation
      - Application-specific failure scenarios
  8. PagerDuty Simian Human Army
      FAILURE FRIDAY
      - Time-boxed recurring meeting
      - Pre-announced agenda
      - Break things
      - Sign-off from service owners
      - Attendance
      GROUND RULES
      - Keep monitoring & alerting
      - Abort if needed
  9. Failure Friday: Agenda
  10. Failure Friday: Process
      Inject Failure, Monitor, Repeat
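
The inject, monitor, repeat cycle from slide 10, combined with the "abort if needed" ground rule from slide 8, can be sketched in plain Ruby. This is only an illustration under assumptions: `Scenario`, `run_failure_friday`, and the `healthy` callable are hypothetical names that do not come from the talk, and the health check stands in for whatever dashboards and alerts the team actually watches.

```ruby
# Hypothetical sketch of the inject -> monitor -> repeat loop.
# Scenario and run_failure_friday are illustrative names, not PagerDuty code.
Scenario = Struct.new(:name, :inject, :revert)

# Runs each scenario in order. `healthy` is a callable that stands in for
# the team's monitoring; when it returns false the session is aborted
# after the current failure has been reverted.
def run_failure_friday(scenarios, healthy:)
  log = []
  scenarios.each do |scenario|
    scenario.inject.call
    log << "injected #{scenario.name}"
    ok = healthy.call(scenario)
    scenario.revert.call
    log << "reverted #{scenario.name}"
    unless ok
      log << "aborted after #{scenario.name}"
      break
    end
  end
  log
end
```

The returned log is a plain array of strings, so a session can be reviewed afterwards or pasted into the meeting notes.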
  11. Failure Friday: Monitoring
  12. 2 Years Later
      BENEFITS
      - System design
      - Knowledge sharing
      - Incident response training
  13. 2 Years Later
      ACCOMPLISHMENTS
      - Whole DC outages
      - Target multiple services at once
      - Distribute failure testing to teams
      - Automation (in progress)
  14. Automation: Rationale
      “MANY” HOSTS
      - Distribute tasks to multiple people and keep executing manually.
      - Watch Operations team with envy while they use chef and knife.
      - Start automating.
  15. PagerDuty/blender
      A MODULAR ORCHESTRATION ENGINE
      - Ruby DSL
      - Host Discovery (blender-chef, blender-serf)
      Ranjib Dey (@RanjibDey)
  16. PagerDuty/blender
      CODE
          # example.rb
          ssh_task 'update' do
            execute 'sudo apt-get update -y'
            members ['ubuntu01', 'ubuntu02', 'ubuntu03']
          end
  17. PagerDuty/blender
      EXECUTION
          blend -f example.rb
          Run[example.rb] started
          3 job(s) computed using 'Default' strategy
          Job 1 [update on ubuntu01] finished
          Job 2 [update on ubuntu02] finished
          Job 3 [update on ubuntu03] finished
          Run finished (42.228923876 s)
  18. PagerDuty/smoothie
      A SIMPLE LIBRARY OF BLENDER RECIPES
      - Chef Integration
      - Recipes for Disaster
      - CLI to Specify Recipes
  19. PagerDuty/smoothie
      REBOOT RECIPE
          def recipe__reboot(hosts)
            ssh_task 'reboot' do
              members hosts
              execute 'sudo /sbin/reboot'
              # shutdown will break ssh connection.
              ignore_failure true
            end
          end
  20. PagerDuty/smoothie
      UNICORN SUSPEND & RESUME RECIPES
          def recipe__unicorn_suspend_master(hosts)
            ssh_task 'suspend unicorn[master] immediately' do
              members hosts
              execute 'sudo kill -s STOP `cat /u/.../pids/`'
            end
          end

          def recipe__unicorn_resume_master(hosts)
            ssh_task 'resume unicorn[master] immediately' do
              members hosts
              execute 'sudo kill -s CONT `cat /u/.../pids/`'
            end
          end
  21. PagerDuty/smoothie
      LATENCY RECIPE
          def recipe__tc_add_latency(hosts)
            ssh_task 'add network latency using tc' do
              members hosts
              execute 'sudo tc qdisc add dev eth0 root netem delay 500ms 100ms loss 20%'
            end
          end

          def recipe__tc_remove_latency(hosts)
            ssh_task 'remove network latency using tc' do
              members hosts
              execute 'sudo tc qdisc del dev eth0 root netem'
            end
          end
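
In the latency recipe, `netem delay 500ms 100ms loss 20%` asks the kernel's network emulator for a 500 ms delay with up to 100 ms of variation in either direction, and for roughly 20% of packets to be dropped at random. A minimal sketch of how those values could be made explicit follows; the helper names and keyword arguments are hypothetical and not part of smoothie, and the defaults simply reproduce the commands shown on the slide.

```ruby
# Hypothetical helpers that build the tc commands from the latency recipe.
# Only the netem option syntax (delay <time> <jitter>, loss <percent>)
# comes from tc itself; the method names and parameters are assumptions.
def netem_add_command(dev: 'eth0', delay_ms: 500, jitter_ms: 100, loss_pct: 20)
  "sudo tc qdisc add dev #{dev} root netem " \
    "delay #{delay_ms}ms #{jitter_ms}ms loss #{loss_pct}%"
end

# Builds the matching cleanup command that removes the netem qdisc again.
def netem_remove_command(dev: 'eth0')
  "sudo tc qdisc del dev #{dev} root netem"
end
```

Naming the parameters makes it easier to start with a milder failure, for example a smaller delay and no loss, before moving to the values used in the talk.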
  22. PagerDuty/smoothie
      EXECUTION
          HOSTFILTER=app1 RECIPE=reboot blend -f smoothie.rb
          def recipe__reboot(hosts)
  23. PagerDuty/smoothie
      EXECUTION
          ZONE=us-west-2a RECIPE=reboot blend -f smoothie.rb
          def recipe__reboot(hosts)
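
Slides 22 and 23 pair a `RECIPE=reboot` environment variable with the `recipe__reboot(hosts)` definition, which suggests that the recipe name selects a method by its `recipe__` prefix. The slides do not show how smoothie.rb performs that lookup, so the following is only a sketch of one way it could work. The `Recipes` module and `run_recipe` are hypothetical, and the recipe bodies return a description instead of running Blender `ssh_task` blocks so that the example runs without Blender installed.

```ruby
# Hypothetical stand-in for a recipe library. Real recipes would contain
# Blender ssh_task blocks; these only describe what they would do.
module Recipes
  module_function

  def recipe__reboot(hosts)
    "reboot on #{hosts.join(', ')}"
  end

  def recipe__tc_add_latency(hosts)
    "add latency on #{hosts.join(', ')}"
  end
end

# Maps RECIPE=<name> to the method recipe__<name> and rejects unknown
# names. `env` can be ENV or any hash with string keys.
def run_recipe(env, hosts)
  name = env.fetch('RECIPE')
  method_name = "recipe__#{name}"
  unless Recipes.respond_to?(method_name)
    raise ArgumentError, "unknown recipe: #{name}"
  end
  Recipes.public_send(method_name, hosts)
end
```

Restricting the lookup to methods that carry the `recipe__` prefix means a mistyped or unexpected `RECIPE` value fails loudly instead of calling an unrelated method.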
  24. Failure Friday: Blender
          ZONE=us-west-2a ROLE=web-app RECIPE=monit_unmonitor
          ZONE=us-west-2a ROLE=web-app RECIPE=monit_monitor
          ZONE=us-west-2a ROLE=web-app RECIPE=unicorn_stop_master_gracefully
          ZONE=us-west-2b ROLE=web-app RECIPE=unicorn_suspend_master
          ZONE=us-west-2b ROLE=web-app RECIPE=unicorn_resume_master
          ZONE=us-west-2c ROLE=web-app RECIPE=reboot
          ZONE=us-west-2a ROLE=web-app RECIPE=iptables_network_isolate
          ZONE=us-west-2a ROLE=web-app RECIPE=iptables_rebuild
          ZONE=us-west-2b ROLE=web-app RECIPE=tc_add_latency
          ZONE=us-west-2b ROLE=web-app RECIPE=tc_remove_latency
  25. Future
      AUTOMATION
      - Build more automation for service-specific scenarios.
      - Scheduled runs (similar to Netflix).
  26. Future
      CHATOPS
      - Inject failures by invoking chat commands.
      - Share metrics and graphs to help people follow along.
      - Collect TODOs during Failure Fridays and generate a report.
  27. Future
      NEW TYPES OF FAILURES
      - Distributed Denial of Service (DDoS) attacks for services.
      - Impediments that come up during Incident Response.
  28. Summary
      FAILURES WILL HAPPEN
      - Anything that can go wrong, will go wrong.
      - Proactively test failure handling now.
      - Start simple.
  29. PROPOSED EDIT
      “Experiments that aren’t introducing new insights should be automated and used to monitor ongoing health of the system. New experiments should be devised to continue to push the bounds of the system.”
      Culture From Chaos by @dougbarth
  30. Thank you.
      @alperkokmen