Lessons from Etsy: Avoiding Kitchen Nightmares - #ChefConf 2012

Talk by Patrick McDonnell (@mcdonnps) at #ChefConf 2012

Chef makes it so easy to change configuration en masse that it can be dangerous if not used with certain precautions and in accordance with a well-thought-out testing workflow. In our use of Chef at Etsy, we have devised many in-house best practices in response to failures, and these have helped greatly in avoiding catastrophic outages. This talk will focus on mistakes we've made and how we've avoided repeating them by enforcing standards in cookbooks, testing changes before rollout through the use of environments in conjunction with the Spork plugin for Knife, and linting cookbooks with Foodcritic. I'll also talk about using handlers intelligently to monitor Chef runs and how to generate reports from the myriad data available in CouchDB.

Published in: Technology, Business

Transcript of "Lessons from Etsy: Avoiding Kitchen Nightmares - #ChefConf 2012"

  1. Image Service Outage
  2. postrotate /bin/kill -HUP `cat /var/run/httpd.pid 2>/dev/null` 2> /dev/null || true
  3. NEVER TEST IN PRODUCTION!
  4. It only takes one tiny mistake
  5. How Do You Enforce This?
     • Documented standards and communicated best practices
     • Robust testing workflow
       • Environments
       • Knife Plugins
     • Linting with rules derived from standards
       • Foodcritic
  6. Testing Workflow
  7. How We Use Environments
     • Three environments: production, development, testing
       • Testing is unconstrained
       • Test nodes are depooled and “flipped” to the testing environment, then repooled and analyzed
       • Test nodes are then flipped back to production
  8. Working with Environments
     • knife-flip by Etsy engineer Jon Cowie (https://github.com/jonlives/knife-flip)
         % knife node flip somenode.etsy.com testing
         % knife role flip SomeRole testing
     • knife-bulkchangeenvironment (https://github.com/jonlives/knife-bulkchangeenvironment)
         % knife node bulk_change_environment testing production
  9. Keeping Environments in Sync
     • knife-env-diff by Etsy engineer John Goulah
       • Get it at https://github.com/jgoulah/knife-env-diff
         % knife environment diff development production
         diffing environment development against production
         cookbook: hadoop
           development version: = 0.1.0
           production version: = 0.1.8
         cookbook: mysql
           development version: = 0.2.4
           production version: = 0.2.5
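The comparison shown on slide 9 can be sketched in plain Ruby. This is only an illustration of the idea, not knife-env-diff's code: it assumes each environment is modelled as a simple hash of cookbook name to version constraint, and `diff_environments` is a hypothetical helper name. The `ldap` entry is made-up sample data added to show that matching constraints are skipped.

```ruby
# Hypothetical helper (not part of knife-env-diff): report cookbooks whose
# version constraints differ between two environments.
# Each environment is assumed to be a Hash of cookbook name => constraint.
def diff_environments(base, other)
  (base.keys | other.keys).sort.each_with_object({}) do |cookbook, diffs|
    left = base[cookbook]
    right = other[cookbook]
    diffs[cookbook] = [left, right] unless left == right
  end
end

# hadoop and mysql come from the slide; ldap is illustrative sample data.
development = { "hadoop" => "= 0.1.0", "mysql" => "= 0.2.4", "ldap" => "= 0.1.27" }
production  = { "hadoop" => "= 0.1.8", "mysql" => "= 0.2.5", "ldap" => "= 0.1.27" }

diff_environments(development, production).each do |cookbook, (dev, prod)|
  puts "cookbook: #{cookbook}"
  puts "  development version: #{dev || '(none)'}"
  puts "  production version: #{prod || '(none)'}"
end
```

A cookbook pinned in only one environment shows up with `(none)` on the other side, which is usually the drift worth looking at first.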
  10. Introducing Knife Spork
      • Knife plugin providing a testing/versioning workflow
      • Authored by Jon Cowie
      • Get it at https://github.com/jonlives/knife-spork
  11. Spork Features
      • Four stage process
        • Check: Look at versioning info for a cookbook
        • Bump: Automatically increment the cookbook’s version number
        • Upload: Knife upload and freeze
        • Promote: Set environment constraints equal to specified version
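The Check and Bump stages boil down to two small operations, sketched here in plain Ruby. This is an assumption-laden illustration rather than knife-spork's implementation: `needs_bump?` and `bump_patch` are hypothetical names, and versions are assumed to be simple MAJOR.MINOR.PATCH strings.

```ruby
# Hypothetical sketch of the "check" idea: a local version that already
# exists on the server cannot be uploaded again, so it needs a bump.
def needs_bump?(local_version, remote_versions)
  remote_versions.include?(local_version)
end

# Hypothetical sketch of the "bump" idea: increment the patch level of a
# MAJOR.MINOR.PATCH version string.
def bump_patch(version)
  parts = version.split(".")
  unless parts.length == 3 && parts.all? { |part| part.match?(/\A\d+\z/) }
    raise ArgumentError, "expected MAJOR.MINOR.PATCH, got #{version.inspect}"
  end
  major, minor, patch = parts.map(&:to_i)
  "#{major}.#{minor}.#{patch + 1}"
end

local = "0.0.4"
remote = %w[0.0.4 0.0.3 0.0.2 0.0.1]
local = bump_patch(local) if needs_bump?(local, remote)
puts local
```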
  12. Spork Config
      • /path/to/chef-repo/config/spork-config.yml
      • /etc/spork-config.yml
      • ~/.chef/spork-config.yml

        git:
          enabled: true
        irccat:
          enabled: true
          server: irccat.mycompany.com
          port: 12345
          channel: "#chef"
        graphite:
          enabled: true
          server: graphite.mycompany.com
          port: 2003
        gist:
          enabled: true
          in_chef: true
          chef_path: cookbooks/gist/files/default/gist
          path: /usr/bin/gist
        foodcritic:
          enabled: true
          fail_tags: [any]
          tags: [foo]
          include_rules: [/home/me/myrules]
        default_environments: [ production, development ]
  13. % knife spork check foodcritic
      Checking versions for cookbook foodcritic...
      Current local version: 0.0.4
      Remote versions (Max. 5 most recent only):
      *0.0.4, frozen
      0.0.3, frozen
      0.0.2, unfrozen
      0.0.1, frozen
      DANGER: Your local cookbook has same version number as the starred version above!
      Please bump your local version or you won't be able to upload.
  14. % knife spork bump foodcritic
      Loaded config file /home/pmcdonnell/git/chef-repo/config/spork-config.yml...
      Loaded config file /etc/spork-config.yml...
      Pulling latest changes from git
      Pulling latest changes from git submodules (if any)
      Bumping patch level of the foodcritic cookbook from 0.0.4 to 0.0.5
      Git adding /home/pmcdonnell/git/chef-repo/cookbooks/foodcritic/metadata.rb
  15. % knife spork upload foodcritic
      Loaded config file /home/pmcdonnell/git/chef-repo/config/spork-config.yml...
      Loaded config file /etc/spork-config.yml...
      Uploading and freezing foodcritic [0.0.5]
      upload complete
  16. % knife spork promote foodcritic --remote
      Pulling latest changes from git
      Checking that foodcritic version 0.0.5 exists on the server before promoting (any error means it hasn't been uploaded yet)...
      foodcritic version 0.0.5 found on server!
      Environment: production
      Adding version constraint foodcritic = 0.0.5
      Saving changes into production.json
      Git adding /home/pmcdonnell/git/chef-repo/environments/production.json
      Uploading production to server
  17. WARNING: You're about to promote changes to several cookbooks:
      logrotate: = 0.1.24 changed to = 0.1.23
      foodcritic: = 0.0.4 changed to = 0.0.5
      Are you sure you want to continue? (Y/N) n
      You said no, so I'm done here.
      Would you like to reset your local production.json to match the server?? (Y/N) y
      Git adding /home/pmcdonnell/git/chef-repo/environments/production.json
      production.json reset.
  18. Spork’s Logging Mechanisms
      • Irccat: Logs to IRC channel (https://github.com/RJ/irccat)
          [11:35:33] <irccat> CHEF: pmcdonnell uploaded and froze cookbook ldap version 0.1.27
          [11:35:43] <irccat> CHEF: pmcdonnell uploaded environment production https://github.etsycorp.com/gist/376967
          [11:35:43] <irccat> CHEF: pmcdonnell uploaded environment development https://github.etsycorp.com/gist/376968
      • Graphite: promote --remote sends to deploys.chef metric
      • Gist: Added to irccat notifications on promote --remote
          Environment production uploaded at 2012-05-15 18:35:42 UTC by pmcdonnell
          Constraints updated on server in this version:
          ldap: = 0.1.26 changed to = 0.1.27
  19. Linting
  20. Foodcritic
      • A lint tool for Chef cookbooks written by Andrew Crump (http://acrmp.github.com/foodcritic/)
      • Comes with a good set of default rules and is very easily extensible
      • To enable in spork config:
          foodcritic:
            enabled: true
            fail_tags: [any]
            tags: [foo]
            include_rules: [/home/me/myrules]
  21. Etsy’s Rules
      • A work in progress, but newly open-sourced at https://github.com/etsy/foodcritic-rules
      • Our rules are “style”-tagged rules that serve to enforce what we consider to be best practices in our environment
        • ETSY001 - Package or yum_package resource used with :upgrade action
        • ETSY002 - Execute resource used to run git commands
        • ETSY003 - Execute resource used to run curl or wget commands
        • ETSY004 - Execute resource defined without conditional or action :nothing
        • ETSY005 - Action :restart sent to a core service
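To make two of these rules concrete, here is a deliberately naive, line-based approximation of ETSY001 and ETSY005 in plain Ruby. It is not how Foodcritic or Etsy's rules work (those inspect the parsed recipe rather than matching text), and `lint_recipe` and `CORE_SERVICES` are hypothetical names; the service list is taken from the "trippable services" named in the talk.

```ruby
# Services treated as "core", per the list given in the talk.
CORE_SERVICES = %w[httpd mysql memcached postgresql-server].freeze

# Hypothetical, regex-based approximation of ETSY001 and ETSY005.
# Returns an array of [rule_id, line_number] pairs.
def lint_recipe(source)
  warnings = []
  in_package = false
  source.each_line.with_index(1) do |line, lineno|
    # Track whether we are inside a package/yum_package resource block.
    in_package = true if line =~ /^\s*(?:yum_)?package\s/
    warnings << ["ETSY001", lineno] if in_package && line =~ /\baction\s+:upgrade\b/
    in_package = false if line =~ /^\s*end\b/

    # Flag :restart notifications aimed at a core service.
    match = line.match(/notifies\s+:restart,\s*resources\(\s*:service\s*=>\s*"([^"]+)"/)
    warnings << ["ETSY005", lineno] if match && CORE_SERVICES.include?(match[1])
  end
  warnings
end

recipe = <<~'RECIPE'
  template "/etc/httpd/conf/httpd.conf" do
    source "httpd-conf.erb"
    notifies :restart, resources(:service => "httpd")
  end
  package "memcached" do
    action :upgrade
  end
RECIPE

lint_recipe(recipe).each { |rule, lineno| puts "#{rule}: line #{lineno}" }
```

A text scan like this misses nested blocks and alternative notification syntaxes, which is one reason a real lint tool works on the parsed recipe instead.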
  22. Rule Resulting from Image Outage
      • ETSY005 - Action :restart sent to a core service
        • Trippable services include httpd, mysql, memcached, postgresql-server
          % foodcritic -t etsy -I ~/git/chef-repo/config/rules.rb ~/git/chef-repo/cookbooks/apache
          ETSY005: Action :restart sent to a core service: /home/pmcdonnell/git/chef-repo/cookbooks/apache/recipes/default.rb:39
  23. Rule Resulting from Image Outage
      30 template "/etc/httpd/conf/httpd.conf" do
      31   source "httpd-conf.erb"
      32   owner "root"
      33   group "root"
      34   mode 00644
      35   variables(
      36     :fqdn => node[:fqdn],
      37     :port => "80"
      38   )
      39   notifies :restart, resources(:service => "httpd")
      40 end
  24. Memcache Outage
  25. 02:27 < jallspaw> [Sat, 10 Jul 2010 01:45:01 +0000] INFO: Upgrading package[memcached] version from 1.4.2-1.fc10 to 1.4.5-1.el5
  26. Don’t leave “known unknowns” lying in wait
  27. Resulting Foodcritic Rule
      • ETSY001 - Package or yum_package resource used with :upgrade action
        • Enforces always using :install
          % foodcritic -t etsy -I ~/git/chef-repo/config/rules.rb ~/git/chef-repo/cookbooks/memcache
          ETSY001: Package or yum_package resource used with :upgrade action: /home/pmcdonnell/git/chef-repo/cookbooks/memcache/recipes/default.rb:20
  28. Resulting Foodcritic Rule
      20 package "memcached" do
      21   action :upgrade
      22 end

      Changed to:

      20 package "memcached" do
      21   version "1.4.2-1.fc10"
      22   action :install
      23 end
  29. Reporting and Monitoring
  30. Using Handlers
      • Etsy’s handlers (https://github.com/etsy/chef-handlers)
        • Log failures to IRC
            [10:52:03] <irccat> Chef run failed on dev-dbtasks01.ny4dev.etsy.com
            [10:52:03] <irccat> https://github.etsycorp.com/gist/371229
        • Graph aggregated metrics with Graphite
        • Graph chef “deploys”
  31. Graph with Graphite
      • Metrics reporting made possible by knife-lastrun, authored by John Goulah (https://github.com/jgoulah/knife-lastrun)
        • Provides a handler and knife plugin for reporting on the most recent chef run, storing data as node attributes
          • Elapsed, starting, and ending time
          • Exit code status
          • Backtrace/exception information
  32. % dsh -g all -c -M grep "Chef Run complete in" /var/log/chef/client.log | head -n 3 2>&1 | tee /tmp/tee && grep Chef Run complete /tmp/tee | sort -n -k +13 | tail -5
      dn0035.doop: [Mon, 14 May 2012 03:21:07 +0000] INFO: Chef Run complete in 512.936813012 seconds
      dn0004.doop: [Mon, 14 May 2012 04:28:03 +0000] INFO: Chef Run complete in 677.423964906 seconds
      dn0006.doop: [Mon, 14 May 2012 04:29:51 +0000] INFO: Chef Run complete in 770.231469266 seconds
      dn0025.doop: [Mon, 14 May 2012 04:26:13 +0000] INFO: Chef Run complete in 787.183615612 seconds
      dn0030.doop: [Mon, 14 May 2012 04:30:42 +0000] INFO: Chef Run complete in 848.586507872 seconds
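The same "slowest runs" report can be produced from collected log lines in Ruby. This is a sketch under assumptions: it expects lines in the host-prefixed form shown on slide 32, and `RUN_COMPLETE` and `slowest_runs` are hypothetical names, not part of any Etsy tool.

```ruby
# Matches host-prefixed client.log lines of the form shown on the slide.
RUN_COMPLETE = /\A(?<host>\S+): \[[^\]]+\] INFO: Chef Run complete in (?<seconds>\d+(?:\.\d+)?) seconds/

# Hypothetical helper: return [host, seconds] pairs for the slowest runs,
# sorted ascending so the slowest run comes last (like `sort -n | tail`).
def slowest_runs(lines, count = 5)
  lines
    .map { |line| RUN_COMPLETE.match(line) }
    .compact
    .map { |m| [m[:host], m[:seconds].to_f] }
    .sort_by { |_host, seconds| seconds }
    .last(count)
end

lines = [
  "dn0030.doop: [Mon, 14 May 2012 04:30:42 +0000] INFO: Chef Run complete in 848.586507872 seconds",
  "dn0035.doop: [Mon, 14 May 2012 03:21:07 +0000] INFO: Chef Run complete in 512.936813012 seconds",
  "dn0004.doop: [Mon, 14 May 2012 04:28:03 +0000] INFO: Chef Run complete in 677.423964906 seconds"
]

slowest_runs(lines, 2).each { |host, seconds| puts "#{host} #{seconds}" }
```

Parsing the elapsed time as a float avoids depending on the field position that `sort -k` needs.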
  33. Finding Run Time Outliers
      • Knife doesn’t currently support Lucene’s NumericRangeQuery
        • Elapsed time is a floating point number, but we can only match it as a string due to query limitations in knife
        • Work around it with knife search -a
  34. % knife search node elapsed:[200 TO 225] -a lastrun.runtimes.elapsed
      4 items found
      id: cent6-vmtemplate.ny4dev.etsy.com
      lastrun.runtimes.elapsed: 21.642378406
      id: sandboxmisc01.ny4.etsy.com
      lastrun.runtimes.elapsed: 211.749555
      id: smardenfeld.vm.ny4dev.etsy.com
      lastrun.runtimes.elapsed: 22.184596
      id: bob0120.vm.ny4dev.etsy.com
      lastrun.runtimes.elapsed: 21.348335354
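The four results on slide 34 are consistent with the range being compared as strings: "21.64..." sorts between "200" and "225" character by character. A small Ruby sketch shows the difference and the kind of numeric post-filtering one could apply to the attribute values returned by `knife search -a`. The helper names are hypothetical, and the hash simply restates the slide's output.

```ruby
# Elapsed times as returned on the slide, keyed by node name (as strings).
nodes = {
  "cent6-vmtemplate.ny4dev.etsy.com" => "21.642378406",
  "sandboxmisc01.ny4.etsy.com"       => "211.749555",
  "smardenfeld.vm.ny4dev.etsy.com"   => "22.184596",
  "bob0120.vm.ny4dev.etsy.com"       => "21.348335354"
}

# Hypothetical helper: string (lexicographic) range match, which mirrors
# the over-broad result set shown on the slide.
def string_between(nodes, low, high)
  nodes.select { |_name, elapsed| elapsed >= low && elapsed <= high }
end

# Hypothetical helper: numeric range match after converting to Float.
def elapsed_between(nodes, low, high)
  nodes.select { |_name, elapsed| elapsed.to_f.between?(low, high) }
end

puts string_between(nodes, "200", "225").size
puts elapsed_between(nodes, 200, 225).keys
```

Only the numeric comparison isolates sandboxmisc01.ny4.etsy.com, the node whose run actually took between 200 and 225 seconds.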
  35. % knife node lastrun sandboxmisc01.ny4.etsy.com
      Status         failed
      Elapsed Time   211.78604
      Start Time     2012-05-15 07:43:18 +0000
      End Time       2012-05-15 07:46:50 +0000
      Backtrace
      Omitted for brevity
      Exception
      Chef::Exceptions::Package: package[diffutils] (installerz::diffutils line 1) had an error: Yum failed - #<Process::Status: pid 21293 exit 1> - returns: ["yum-dump Repository Error: Cannot retrieve repository metadata (repomd.xml) for repository: PostgreSQL-8.3-x86_64. Please verify its path and try again\n"]
  36. What Did Chef Just Do?
      • chefrecentupdates by Etsy engineer Laurie Denness (https://github.com/lozzd/ChefScripts)
          % chefrecentupdates
          ...
          1 resources updated in /var/log/chef/client.log-20120505.gz:
          [Fri, 04 May 2012 17:49:42 +0000] INFO: cookbook_file[/usr/bin/gist]
          ...
  37. Preventative Measures
  38. Knife Preflight
      • By Jon Cowie (https://github.com/jonlives/knife-preflight)
          % knife preflight memcache::datacache
          Searching for nodes containing memcache::datacache in their expanded run_list...
          4 Nodes found
          datacache03.ny4.etsy.com
          datacache04.ny4.etsy.com
          datacache01.ny4.etsy.com
          datacache02.ny4.etsy.com
          Searching for roles containing memcache::datacache in their run_list...
          1 Roles found
          Datacache
          Found 4 nodes and 1 roles using the specified search criteria
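The idea behind a preflight check, finding everything a recipe change will touch before uploading it, can be sketched without a Chef server. This is not knife-preflight's code: `preflight` is a hypothetical helper, and nodes and roles are assumed to be plain hashes mapping a name to a list of recipe names (the web01 node and Web role are made-up sample data).

```ruby
# Hypothetical sketch: list the nodes and roles whose run lists include a
# given recipe. Run lists are modelled as arrays of recipe names.
def preflight(recipe, nodes, roles)
  {
    nodes: nodes.select { |_name, run_list| run_list.include?(recipe) }.keys.sort,
    roles: roles.select { |_name, run_list| run_list.include?(recipe) }.keys.sort
  }
end

nodes = {
  "datacache01.ny4.etsy.com" => ["base", "memcache::datacache"],
  "datacache02.ny4.etsy.com" => ["base", "memcache::datacache"],
  "web01.example.com"        => ["base", "apache"] # illustrative node
}
roles = {
  "Datacache" => ["memcache::datacache"],
  "Web"       => ["apache"] # illustrative role
}

result = preflight("memcache::datacache", nodes, roles)
puts "Found #{result[:nodes].size} nodes and #{result[:roles].size} roles"
```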
  39. Continuous Chef
      • Using Jenkins and base virtual machine images
  40. “Out-of-Band” Management
      • dsh (distributed shell) works even if Chef server is down
        • Etsy’s dsh groups are managed by Chef and generated from the list of nodes corresponding to each role
  41. Configs Bundled with Packages
      • Be careful with configs distributed with packages overwriting Chef configs
        • They must be replaced by Chef before restarting services, so watch out for resource order
  42. Jon will be at Velocity!
      • Workshop: Michelin Starred Cooking with Chef
        • 11:00am Monday, 06/25/2012
        • Topics
          • Team-wide familiarity and understanding
          • Critical approach and experimentation with workflows
          • Plugin writing 101
  43. We’re Hiring! http://www.etsy.com/careers
      • TONS of engineering positions open!
      • Especially looking for a talented network engineer; referrals welcome!