Advanced Topics in Continuous Deployment

Like what you've read? We're frequently hiring for a variety of engineering roles at Etsy. If you're interested, drop me a line or send me your resume: mike@etsy.com.

http://www.etsy.com/careers

    Advanced Topics in Continuous Deployment: Presentation Transcript

    • Advanced Topics in Continuous Deployment Mike Brittain Engineering Director, Etsy @mikebrittain mikebrittain.com/talks
    • Today’s topics (credit: photobookgirl, flickr):
      - Config flags - this one goes to eleven.
      - Automated deploys - never settle for anything less
      - Release management - who needs it?!?
      - Deploying schema changes - ’cause everybody asks
    • www.etsy.com
    • 20 million items listed. 60+ million monthly unique visitors. 200 countries with annual transactions. 175+ committers, everyone deploys. (Items by anjaysdesigns, betwixxt, OneStarLeatherGoods, mediumcontrol, TheDesignPallet)
    • Architecture stack: Linux, Apache, MySQL, PHP. Memcache, Gearman, Redis, node.js, Postgresql, Solr, Java, Apache Traffic Server, Hadoop, HBase. Git, Jenkins. Chef, Ruby, Python. (credit: Saire Elizabeth, flickr)
    • [Chart: deployments per day, app code vs. config files] @mikebrittain
    • 1st Day Assignment Put your face on etsy.com/about
    • 2nd day Complete tax, insurance, and benefits forms. credit: ktpupp (flickr)
    • I’m not telling you to go do this. (But you can go do this.)
    • Today’s topic: Config flags - this one goes to eleven. (credit: photobookgirl, flickr)
    • Code Deploy ≠ Feature Release (Most deploys are gated by config flags)
    • $cfg['new_search'] = array('enabled' => 'off');
      $cfg['sign_in'] = array('enabled' => 'on');
      $cfg['checkout'] = array('enabled' => 'on');
      $cfg['homepage'] = array('enabled' => 'on');
    • $cfg['new_search'] = array('enabled' => 'off');
      // Meanwhile...
      if ($cfg['new_search']['enabled'] == 'on') {
          # New and fancy search
          $results = do_solr();
      } else {
          # old and boring search
          $results = do_grep();
      }
    • “Config Flags,” “Feature Flags,” “Feature Toggles”
      $cfg['new_search'] = array('enabled' => 'on');
      $cfg['new_search'] = array('enabled' => 'off');
      // or...
      $cfg['new_search'] = array('enabled' => 'staff');
      // or... (etsy.com/prototypes)
      $cfg['new_search'] = array('enabled' => 'users', 'user_list' => 'mike,john,kellan');
      // or...
      $cfg['new_search'] = array('enabled' => '1%');
      $cfg['new_search'] = array('enabled' => '5%');
      $cfg['new_search'] = array('enabled' => '11%');   // “This one goes to eleven.”
      https://github.com/etsy/feature
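The flag values above ('on', 'off', 'staff', 'users', and a percentage) suggest a single lookup that decides whether a feature is on for a given user. What follows is a minimal sketch of such a lookup, not the API of the etsy/feature library: the function name `flag_enabled`, the `$user` array keys (`id`, `name`, `is_staff`), and the crc32-based bucketing are all assumptions made for illustration.

```php
<?php
// Hypothetical helper (not the etsy/feature API): decides whether a flag is
// on for a user, based on the 'enabled' values shown on the slides.
function flag_enabled(array $cfg, string $name, array $user): bool
{
    $flag = $cfg[$name] ?? ['enabled' => 'off'];   // unknown flags default to off
    $enabled = $flag['enabled'] ?? 'off';

    if ($enabled === 'on') {
        return true;
    }
    if ($enabled === 'off') {
        return false;
    }
    if ($enabled === 'staff') {
        return !empty($user['is_staff']);
    }
    if ($enabled === 'users') {
        $allowed = array_map('trim', explode(',', $flag['user_list'] ?? ''));
        return in_array($user['name'] ?? null, $allowed, true);
    }
    // Percentage ramp such as '1%' or '11%'.
    if (preg_match('/^(\d+(?:\.\d+)?)%$/', $enabled, $m)) {
        // Hash flag name + user id so a user stays in the same bucket for a
        // given flag while the percentage ramps up (assumes 64-bit PHP, where
        // crc32() is never negative).
        $bucket = crc32($name . ':' . ($user['id'] ?? 0)) % 10000;   // 0..9999
        return $bucket < (float) $m[1] * 100;
    }
    return false;   // unrecognised value: fail closed
}
```

Because the bucket depends only on the flag name and the user id, raising the ramp from '1%' to '5%' in this sketch keeps the original 1% of users enabled and adds new ones, rather than reshuffling who sees the feature.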
    • Validate in production, hidden from public. Product managers still “release” features.
    • App deploys: small incremental changes to the application. New classes, methods, controllers. Graphics, stylesheets, templates. Copy/content changes. Config deploys: turning flags on, off, or % ramp up.
    • Operator, config flags, metrics. (http://www.flickr.com/photos/flyforfun/2694158656/)
    • “Operating” the site: latent bugs and security holes. Traffic management, load shedding. Adding and removing infrastructure. Tweaking config flags or releasing patches.
    • Favorites “on”
    • Favorites “off”
    • Favorites “on”
    • Promote “dev flags” to “feature flags”
    • // Feature flag
      $cfg['mobilized_pages'] = array('enabled' => 'on');
      // Dev flags
      $cfg['mobile_templates_seller_tools'] = array('enabled' => 'on');
      $cfg['mobile_templates_account_tools'] = array('enabled' => 'on');
      $cfg['mobile_templates_member_profile'] = array('enabled' => 'on');
      $cfg['mobile_templates_search'] = array('enabled' => 'off');
      $cfg['mobile_templates_activity_feed'] = array('enabled' => 'off');
      ...
      if ($cfg['mobilized_pages']['enabled'] == 'on' && $cfg['mobile_templates_search']['enabled'] == 'on') {
          // ...
          // ...
      }
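The combined check on the slide (feature flag and dev flag both on) can be wrapped in one small helper so that call sites stay short. This is only a sketch: `dev_flag_on` is a hypothetical name, and it handles just the 'on'/'off' values used in this example.

```php
<?php
// Hypothetical helper: a dev flag only takes effect while its parent
// feature flag is also on. Missing flags are treated as 'off'.
function dev_flag_on(array $cfg, string $feature, string $dev): bool
{
    $is_on = fn(string $name): bool => ($cfg[$name]['enabled'] ?? 'off') === 'on';
    return $is_on($feature) && $is_on($dev);
}
```

With this shape, turning the single feature flag off hides every page behind it, regardless of how the individual dev flags are set.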
    • // Feature flags
      $cfg['search'] = array('enabled' => 'on');
      $cfg['developer_api'] = array('enabled' => 'on');
      $cfg['seller_tools'] = array('enabled' => 'on');
      $cfg['the_entire_web_site'] = array('enabled' => 'on');
      $cfg['the_entire_web_site_no_really_i_mean_it'] = array('enabled' => 'on');
    • Go add one config flag to your codebase.
    • Today’s topic: Automated deploys - never settle for anything less. (credit: photobookgirl, flickr)
    • PHP & Apache: interpreted language, text files. Opcode cache (Opcache or APC). ~100 servers (web, gearman, api). Rsync (push, not pull). Avoid restarts.
    • Keeping things fast: lots of remote orchestration (ssh and dsh). Push code from a git clone to production network. Splay to a few boxes, each splays to more. Stage files in a temp location on prod boxes. Local rsync (using dsh) into live docroot.
    • What can go wrong with this? 100+ files opened per request. Flushing opcode cache (or graceful restart). Mostly harmless.
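"Splay to a few boxes, each splays to more" describes a fan-out tree: the deploy host pushes to a handful of servers, and each of those pushes on to a handful more. The sketch below only plans who pushes to whom; it runs no rsync or ssh. The function name `plan_splay`, the host names, and the fan-out factor are illustrative assumptions, and `$fanout` is assumed to be at least 1.

```php
<?php
// Hypothetical planner: the origin pushes to up to $fanout hosts, each of
// those pushes to up to $fanout more, and so on until every host is covered.
// Returns a map of source host => list of hosts it pushes to.
function plan_splay(string $origin, array $hosts, int $fanout): array
{
    $plan = [];              // source host => hosts it pushes to
    $sources = [$origin];    // hosts that have (or will have) the code, in order
    $next = 0;               // index of the source currently handing out pushes

    foreach ($hosts as $host) {
        // Move on to the next source once the current one has a full set.
        if (count($plan[$sources[$next]] ?? []) >= $fanout) {
            $next++;
        }
        $plan[$sources[$next]][] = $host;
        $sources[] = $host;  // once it has the code, it can push to others
    }
    return $plan;
}
```

With a fan-out of 2 and six web servers, the plan has three sources (the deploy host and the first two servers), each pushing to two hosts.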
    • “Screwed Users”
    • Atomic deploys: two document roots (“yin” and “yang”). Symbolic link to the right one. Opcache has to use path name, not inode. http://codeascraft.com/2013/07/01/atomic-deploys-at-etsy/
    • Solr and JVM: binaries, not text files. Requires restarts. Requires search index and cache warming. Rsync (push, not pull).
    • Rolling restarts: take boxes out of rotation, deploy, bring back up. Beware capacity management. Multiple versions running for extended period. Rollbacks are a pain (esp. when in mixed-state).
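One common way to flip a symbolic link between two document roots without a moment where the link is missing is to create the new link under a temporary name and rename it over the old one. The sketch below shows that general technique only; it is not taken from the linked post and may differ from Etsy's tooling. The function name and paths are assumptions.

```php
<?php
// Sketch: point a "current" symlink at a new document root in one step.
// The function name and the paths passed in are illustrative assumptions.
function switch_docroot(string $link, string $target): void
{
    $tmp = $link . '.tmp.' . getmypid();
    if (is_link($tmp)) {
        unlink($tmp);            // clean up after an earlier failed attempt
    }
    if (!symlink($target, $tmp)) {
        throw new RuntimeException("could not create symlink $tmp");
    }
    // On POSIX filesystems rename() replaces the old link in a single
    // operation, so readers see either the old target or the new one.
    if (!rename($tmp, $link)) {
        unlink($tmp);
        throw new RuntimeException("could not switch $link");
    }
}
```

A deploy would sync the new code into whichever root is not live, then call something like `switch_docroot('/var/www/current', '/var/www/yang')`. Opcode caches keyed on path name still need attention, as the slide notes.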
    • “Fraught with danger.”
    • “Flip” and “Flop”: one live cluster, one dark cluster. Deploy to dark cluster (indexes, pre-warm, restarts). Define search clusters in app config. Switch cluster traffic via config deploy.
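Defining both clusters in app config and naming one of them as live means the traffic switch is a one-line config change. A minimal sketch, assuming made-up config keys (`search_clusters`, `search_live_cluster`) and placeholder host names:

```php
<?php
// Hypothetical app config: two search clusters, one live and one dark.
$cfg['search_clusters'] = [
    'flip' => ['host' => 'search-flip.example.internal', 'port' => 8983],
    'flop' => ['host' => 'search-flop.example.internal', 'port' => 8983],
];
$cfg['search_live_cluster'] = 'flip';   // a config deploy changes this to 'flop'

// Cluster that serves public search traffic.
function live_search_cluster(array $cfg): array
{
    return $cfg['search_clusters'][$cfg['search_live_cluster']];
}

// Name of the cluster that is safe to deploy to, re-index, and pre-warm.
function dark_search_cluster_name(array $cfg): string
{
    return $cfg['search_live_cluster'] === 'flip' ? 'flop' : 'flip';
}
```

Rolling back in this arrangement is the same config change in reverse, as long as the previously live cluster has been left untouched.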
    • Start with a shell script. Yours will be a unique snowflake.
    • Today’s topic: Release management - who needs it?!? (credit: photobookgirl, flickr)
    • We have a one-button deploy tool. We manage deploys in an IRC channel.
    • #push
    • Thank you. Mike Brittain Engineering Director, Etsy @mikebrittain mikebrittain.com/talks
    • We have a one-button deploy tool. We manage deploys in an IRC channel. I mean, seriously… what else do you want to know?!?
    • Automated deployment run by humans: keep real people in the loop. Queue, with max batch size of seven.
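A queue with a maximum batch size can be modelled as a list of batches: people join the last batch until it is full, and then a new batch starts behind it. The class below is a hypothetical sketch of that idea, not a description of how pushbot or the #push channel actually work.

```php
<?php
// Hypothetical sketch of a push queue with a maximum batch size of seven.
final class PushQueue
{
    const MAX_BATCH_SIZE = 7;

    /** @var string[][] batches of usernames, oldest first */
    private array $batches = [];

    // Adds a user to the last batch, starting a new batch when it is full.
    // Returns the position of the batch the user landed in (0 = next to go).
    public function join(string $user): int
    {
        $last = count($this->batches) - 1;
        if ($last < 0 || count($this->batches[$last]) >= self::MAX_BATCH_SIZE) {
            $this->batches[] = [];
            $last++;
        }
        $this->batches[$last][] = $user;
        return $last;
    }

    // Removes and returns the batch whose turn it is to deploy.
    public function startNext(): array
    {
        return array_shift($this->batches) ?? [];
    }
}
```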
    • Pssst… Want to see how it works?
    • 4 people in this deploy. “I’ve pushed my changes to master.” “Everyone has checked in.”
    • Build QA and pre-prod. Build progress. Status in #push. Git SHA1 for each env. Date, username, deploy log, changeset, link to dashboard from time of deploy.
    • Reporting what’s going on in Deployinator, and who triggered it. Status from build cluster.
    • Pre-prod (“princess”) has been deployed. SHA1 of the change. Time it took to deploy. Link to changeset in GitHub. Log of the deploy script.
    • Btw, there are three bots talking in channel at this point. O_o
    • Queuing for next deploy. Humans talk to other humans from time to time.
    • Talking to pushbot. Pushbot knows some Spanish… because, ya know, why not?
    • Link to test results for CI environment, along with how long the tests took. Alerting by name.
    • 8 minutes have elapsed… We’ve built and tested our release in the CI environment (“QA”). QA build failed our 5 min. SLA for tests.
    • “Try” is our pre-commit testing cluster.
    • Bots help reinforce our values. This is especially helpful for new people on the team.
    • Still 8 minutes elapsed… Pre-prod has been deployed and tested. This ran in parallel with our QA build and tests.
    • Cross-traffic: In a separate channel (#config), our app config files were deployed to pre-prod.
    • Cross-traffic: Ops team deployed a configuration change.
    • Code is live. Link to dashboard.
    • 13 minutes elapsed… Code is now in production with public traffic.
    • Who committed code in the last deploy? And how many lines did each of them change?
    • Handoff for the next deploy.
    • Entire app deploy took 15 minutes. 4 people running the deployment. 8 committers. Config deploy and Chef change deployed in parallel.
    • Optimal queue size. Normalized communication. Improved visibility. Historical record is ideal for post-mortems. Organic evolution.
    • When something goes wrong? Hold up the queue (.hold). Work the issue with the people available in #push. Additional help always available in #sysops. Buddy-system for off-hours deploys. Ops-on-call, dev-on-call.
    • You won’t need an IRC channel.
    • Today’s topic: Deploying schema changes… with code? (a.k.a. Don’t do this!) (credit: photobookgirl, flickr)
    • Code deploys: ~15-20 minutes. Schema changes: THURSDAYS!
    • Today’s topics: Deploying schema changes. Managing versions across services. (credit: photobookgirl, flickr)
    • Our web application is largely monolithic. Etsy.com, Support & Back-office tools, Developer API, Gearman (async work). PHP, Apache, Memcache.
    • External “services” are not deployed with the main application, e.g. Databases: MySQL (schema changes). Search: Solr, JVM (rolling restarts). Photo storage: proxy cache, filers, Amazon S3 (specialized infra.). Payments: PCI (controlled access).
    • For every config flag, there are two states we can support: present and future (... or past and present).
    • $cfg['new_search'] = array('enabled' => 'off');
      // Meanwhile...
      if ($cfg['new_search']['enabled'] == 'on') {
          # New and fancy search
          $results = do_solr();
      } else {
          # old and boring search
          $results = do_grep();
      }
    • “Non-Breaking Expansions” Expose new version in a service interface; support multiple versions in the consumer.
    • Example: Changing a Database Schema. Merging “users” and “user_prefs”
    • RULE OF THUMB: Prefer ADDs over ALTERs (non-breaking expansion)
    • 0. Add new version to schema
      1. Write to both versions
      2. Backfill historical data
      3. Read from new version
      4. Cut-off writes to old version
    • 0. Add new version to schema: Schema change to add prefs columns to “users” table.
      “write_prefs_to_user_prefs_table” => “on”
      “write_prefs_to_users_table” => “off”
      “read_prefs_from_users_table” => “off”
    • 1. Write to both versions: Write code for writing prefs to the “users” table.
      “write_prefs_to_user_prefs_table” => “on”
      “write_prefs_to_users_table” => “on”
      “read_prefs_from_users_table” => “off”
    • 2. Backfill historical data: Offline process to sync existing data from “user_prefs” to new columns in “users”
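The backfill step can be sketched with in-memory arrays standing in for the two tables. Everything here is an assumption for illustration: the function name, the array shapes (`user_id => prefs` for the old table, a `prefs` key on each row of the new one), and the batch size. A real job would read and write MySQL in batches instead.

```php
<?php
// Hypothetical backfill: copies prefs from the old "user_prefs" rows into
// the new prefs column on "users". Rows that dual writes have already
// populated are skipped so newer data is not overwritten.
// $userPrefs: user_id => prefs array (old table)
// $users:     user_id => row array, prefs stored under 'prefs' (new column)
function backfill_prefs(array $userPrefs, array &$users, int $batchSize = 1000): int
{
    $copied = 0;
    foreach (array_chunk($userPrefs, $batchSize, true) as $batch) {
        foreach ($batch as $userId => $prefs) {
            if (!isset($users[$userId]['prefs'])) {
                $users[$userId]['prefs'] = $prefs;
                $copied++;
            }
        }
        // A real job would pause here (or check load) between batches.
    }
    return $copied;
}
```

Skipping rows that already have a value makes the job safe to re-run, which matters because it runs while step 1's dual writes are live.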
    • 3. Read from new version: Data validation tests. Ensure consistency both internally and in production. Ramp up reads one step at a time:
      “write_prefs_to_user_prefs_table” => “on”
      “write_prefs_to_users_table” => “on”
      “read_prefs_from_users_table” => “staff”, then “1%”, “5%”, “11%” (“This one goes to eleven.”), and finally “on” // same as 100%
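A data validation test for this step can be as simple as comparing what the old and new locations hold for each user and reporting the ids that disagree. The sketch below assumes the same illustrative in-memory shapes as a stand-in for the two tables (`user_id => prefs` for “user_prefs”, a `prefs` key per row for “users”); the function name is hypothetical.

```php
<?php
// Hypothetical consistency check: returns the ids of users whose prefs in
// the new "users" column are missing or differ from the old "user_prefs" row.
function find_pref_mismatches(array $userPrefs, array $users): array
{
    $mismatched = [];
    foreach ($userPrefs as $userId => $oldPrefs) {
        $newPrefs = $users[$userId]['prefs'] ?? null;
        // != on arrays compares key/value pairs and ignores key order.
        if ($newPrefs === null || $newPrefs != $oldPrefs) {
            $mismatched[] = $userId;
        }
    }
    return $mismatched;
}
```

An empty result is the signal that it is reasonable to move the read flag to its next step; a non-empty one is worth alerting on.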
    • 4. Cut-off writes to old version: After running on the new table for a significant amount of time, we can cut off writes to the old table.
      “write_prefs_to_user_prefs_table” => “off”
      “write_prefs_to_users_table” => “on”
      “read_prefs_from_users_table” => “on”
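The three flags used in steps 0 through 4 can gate reads and writes inside a single model class, so callers never know which table is in use. This is a minimal sketch under stated assumptions: the class name is invented, the tables are in-memory arrays, and only the “on” and “off” values are handled (the “staff” and percentage steps of the read ramp are left out).

```php
<?php
// Hypothetical model: the three flags from the slides decide where prefs are
// written and read. Arrays stand in for the "users" and "user_prefs" tables.
final class UserPrefsModel
{
    public array $users = [];       // new home: user_id => ['prefs' => [...]]
    public array $userPrefs = [];   // old home: user_id => [...]
    private array $cfg;

    public function __construct(array $cfg)
    {
        $this->cfg = $cfg;          // flag name => 'on' | 'off'
    }

    private function on(string $flag): bool
    {
        return ($this->cfg[$flag] ?? 'off') === 'on';
    }

    public function savePrefs(int $userId, array $prefs): void
    {
        if ($this->on('write_prefs_to_user_prefs_table')) {
            $this->userPrefs[$userId] = $prefs;        // old version
        }
        if ($this->on('write_prefs_to_users_table')) {
            $this->users[$userId]['prefs'] = $prefs;   // new version
        }
    }

    public function loadPrefs(int $userId): ?array
    {
        if ($this->on('read_prefs_from_users_table')) {
            return $this->users[$userId]['prefs'] ?? null;
        }
        return $this->userPrefs[$userId] ?? null;
    }
}
```

Each step of the migration is then a config deploy that changes one flag, and going back a step is the same change in reverse.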
    • “Branch by Abstraction” [Diagram: controllers call a Users model (the abstraction), which sits in front of the old schema (“users” (old) and “user_prefs”) and the new schema (“users”).] http://paulhammant.com/blog/branch_by_abstraction.html http://continuousdelivery.com/2011/05/make-large-scale-changes-incrementally-with-branch-by-abstraction/
    • About our database design… Avoid temptation of putting logic into DB. Async worker queue (Gearman). Get good at alerting on data inconsistencies. Easier to scale out app servers than DBs. Shards limit complexity.
    • Clean up old config flags? No longer valid for the business. No longer stable, valid, or trusted code. Impacting performance or readability. We can afford to spend time.
    • Decouple schema changes from app code. Aim for simplicity.
    • Start small. (We did.) Automated tests and production monitoring. Have a story around maintaining quality. “We can always go back to the old way.” Demonstrate value to leadership.
    • Go write your own story.
    • Thank you. Mike Brittain Engineering Director, Etsy @mikebrittain mikebrittain.com/talks