CONTINUOUS DEPLOYMENT
                          The Dirty Details


Mike Brittain
ENGINEERING DIRECTOR   @mikebrittain
                       mike@etsy.com
CONTINUOUS DEPLOYMENT
       The Dirty Details
“OK, sounds cool. But I have some questions...”
CD 100- & 200-levels
- CI environment for automated tests
- Committing to trunk
- Branching in code
- Config flags (a.k.a. feature flags)
- DevOps mentality
- Metrics and alerting
- Automated deploy scripts




                                        credit: photobookgirl (flickr)
CD 100- & 200-levels
- CI environment for automated tests
- Committing to trunk
- Branching in code
- Config flags (a.k.a. feature flags)
- DevOps mentality                       CD 300 level
- Metrics and alerting                - Deploys vs. releases
- Automated deploy scripts
                                   - Decoupled systems, schema changes
                                   - How we work: Arch. & Process
                                   - Integration and Operations


                                                           credit: photobookgirl (flickr)
www.   .com
GROSS MERCHANDISE SALES




http://www.etsy.com/blog/news/2013/notes-from-chad-2012-year-in-review/
DECEMBER 2012
                                                                                        1.5 Billion page views
                                                                                        $117 Million of goods sold
                                                                                        6 Million items sold




Items by anjaysdesigns, betwixxt, OneStarLeatherGoods, mediumcontrol, TheDesignPallet    http://www.etsy.com/blog/news/2013/etsy-statistics-december-2012-weather-report/
175+ Committers, everyone deploys




credit: martin_heigan (flickr)
DEPLOYMENTS PER DAY




       Very end of 2009
                          Today
Continuous delivery is a pattern language in growing use
          in software development to improve the process of
          software delivery.
                 Techniques such as automated testing, continuous integration,
           and continuous deployment allow software to be developed to a high
           standard and easily packaged and deployed to test environments,
           resulting in the ability to rapidly, reliably and repeatedly push out
           enhancements and bug fixes to customers at low risk and with
           minimal manual overhead.
                                                                            ~wikipedia
credit: Stewart, redgen (flickr)
Architecture Stack
                                  Linux, Apache, MySQL, PHP
                                  Memcache, Gearman, Postgresql, Solr,
                                  Java, Traffic Server, Hadoop, HBase
                                  Git, Jenkins




credit: Saire Elizabeth (flickr)
Then             Now
     2009          2010-today

 Just before we
started using CD
Then                 Now
    6-14 hours            15 mins
“Deployment Army”         1 person

 Special event and    Part of everyday
highly orchestrated      workflow
Then           Now
Blocked for   Blocked for
6-14 hours.   15 minutes.

6+ hours to   15 minutes to
 redeploy       redeploy
Then                 Now
  Release branch,        Mainline,
 database schemas,     minimal linking
  data transforms,      and building,
     packaging,            rsync,
   rolling restarts,       site up
   cache purging,
scheduled downtime
1st day
Put your face on Etsy.com.
2nd day
                         Complete tax, insurance, and
                              benefits forms.




credit: ktpupp (flickr)
WARNING
GROSS MERCHANDISE SALES
Continuous Deployment
      Small, frequent changes.
Constantly integrating into production.
        30+ deploys per day.
“Wow... 30 deploys a day.
How do you build features so quickly?”
Software Deploy ≠ Product Launch
Deploys frequently gated by config flags
             (“dark” releases)
$cfg[‘new_search’]   =   array('enabled'   =>   'off');
$cfg[‘sign_in’]      =   array('enabled'   =>   'on');
$cfg[‘checkout’]     =   array('enabled'   =>   'on');
$cfg[‘homepage’]     =   array('enabled'   =>   'on');
$cfg[‘new_search’] = array('enabled' => 'off');
$cfg[‘new_search’] = array('enabled' => 'off');

// Meanwhile...




# old and boring search
$results = do_grep();
$cfg[‘new_search’] = array('enabled' => 'off');

// Meanwhile...

if ($cfg[‘new_search’] == ‘on’) {
  # New and fancy search
  $results = do_solr();
} else {
  # old and boring search
  $results = do_grep();
}
$cfg[‘new_search’] = array('enabled' => 'on');

// or...

$cfg[‘new_search’] = array('enabled' => 'staff');

// or...

$cfg[‘new_search’] = array('enabled' => '1%');

// or...

$cfg[‘new_search’] = array('enabled' => 'users',
                           'user_list' => 'mike,john,kellan');
Validate in production, hidden from public.
What’s in a deploy?
Small incremental changes to the application
New classes, methods, controllers
Graphics, stylesheets, templates
Copy/content changes

Turning flags on, off, or % ramp up
Low MTTR (response times)
Latent bugs and security holes
Traffic management, load shedding
Adding and removing infrastructure

Tweaking config flags or releasing patches.
http://www.flickr.com/photos/flyforfun/2694158656/
Config flags
                          Operator
                                                        Metrics




http://www.flickr.com/photos/flyforfun/2694158656/
Favorites
 “on”
Favorites
 “off”
Many deploys eventually lead to a product launch.
“How do you continuously deploy
  database schema changes?”
Code deploys ~15-20 minutes
Schema changes
Code deploys ~15-20 minutes
Schema changes THURSDAYS!
Our web application is largely monolithic.
     Etsy.com, Support & Back-office tools,
     Developer API, Gearman (async work)
Our web application is largely monolithic.
     Etsy.com, Support & Back-office tools,
     Developer API, Gearman (async work)


                                PHP, Apache, Memcache
External “services” are not deployed with
          the main application.
e.g. Databases, Search, Photo storage, Payments
External “services” are not deployed with
            the main application.
   e.g. Databases, Search, Photo storage, Payments


    MYSQL                                                        PCI
                                           PROXY CACHE,
(schema changes)                                                 (controlled access)
                                        FILERS, AMAZON S3
                      SOLR, JVM
                                          (specialized infra.)
                   (rolling restarts)
For every config flag, there are two states
  we can support — present and future.
For every config flag, there are two states
  we can support — present and future.
                            ... or past and present.
“Non-Breaking Expansions”
 Expose new version in a service interface;
support multiple versions in the consumer.
Example: Changing a Database Schema
    Merging “users” and “users_prefs”
C
     RULE OF THUMB:
     Prefer ADDs over ALTERs (non-breaking expansion)
1. Write to both versions
2. Backfill historical data
3. Read from new version
4. Cut-off writes to old version
0. Add new version to schema
1. Write to both versions
2. Backfill historical data
3. Read from new version
4. Cut-off writes to old version
0. Add new version to schema
Schema change to add prefs columns to “users” table.

“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “off”
“read_prefs_from_users_table” => “off”
1. Write to both versions
Write code for writing prefs to the “users” table.

“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “off”
2. Backfill historical data
Offline process to sync existing data from “user_prefs”
to new columns in “users”
3. Read from new version
Data validation tests. Ensure consistency both internally
and in production.

“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “staff”
3. Read from new version
Data validation tests. Ensure consistency both internally
and in production.

“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “1%”
3. Read from new version
Data validation tests. Ensure consistency both internally
and in production.

“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “5%”
3. Read from new version
Data validation tests. Ensure consistency both internally
and in production.

“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “on”


(“on” == “100%”)
4. Cut-off writes to old version
After running on the new table for a significant amount
of time, we can cut off writes to the old table.

“write_prefs_to_user_prefs_table” => “off”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “on”
“Branch by Astraction”

                            Controller                               Controller




                                              Users Model                    (Abstraction)




“users” (old)                     “user_prefs”                                          “users”

                  old schema                                                          new schema


                              http://paulhammant.com/blog/branch_by_abstraction.html
     http://continuousdelivery.com/2011/05/make-large-scale-changes-incrementally-with-branch-by-abstraction/
“The Migration 4-Step”
1. Write to both versions
2. Backfill historical data
3. Read from new version
4. Cut-off writes to old version
“When do you clean up all of those config flags?
We might remove config flags for the old version when...
It is no longer valid for the business.
It is no longer stable, maintained, or trusted.
It has poor performance characteristics.
The code is a mess, or difficult to read.
We can afford to spend time on it.
Promote “dev flags” to “feature flags”
// Feature flag
$cfg[‘mobilized_pages’] = array('enabled' => 'on');

// Dev flags
$cfg[‘mobile_templates_seller_tools’]     =   array('enabled'   =>   'on');
$cfg[‘mobile_templates_account_tools’]    =   array('enabled'   =>   'on');
$cfg[‘mobile_templates_member_profile’]   =   array('enabled'   =>   'on');
$cfg[‘mobile_templates_search’]           =   array('enabled'   =>   'off');
$cfg[‘mobile_templates_activity_feed’]    =   array('enabled'   =>   'off');

...

if ($cfg[‘mobilized_pages’] == ‘on’ && $cfg[‘mobile_templates_search’] == ‘on’) {
     // ...
     // ...
}
// Feature flags
$cfg[‘search’]                = array('enabled' => 'on');
$cfg[‘developer_api’]         = array('enabled' => 'on');
$cfg[‘seller_tools’]          = array('enabled' => 'on');

$cfg[‘the_entire_web_site’]                       = array('enabled' => 'on');
// Feature flags
$cfg[‘search’]              = array('enabled' => 'on');
$cfg[‘developer_api’]       = array('enabled' => 'on');
$cfg[‘seller_tools’]        = array('enabled' => 'on');

$cfg[‘the_entire_web_site’]                     = array('enabled' => 'on');
$cfg[‘the_entire_web_site_no_really_i_mean_it’] = array('enabled' => 'on');
Architecture and Process
Deploying is cheap.
Releasing is cheap.
Some philosophies on product development...
Gathering data should be cheap, too.
        staff, opt-in prototypes, 1%
Treat first iterations as experiments.
Get into code as quickly as possible.
“Where a new system concept or new technology is used, one has to build a
system to throw away, for even the best planning is not so omniscient as to
get it right the first time. Hence plan to throw one away; you will, anyhow.”

                                             ~ Fred Brooks, The Mythical Man-Month
Architecture largely doesn’t matter.
Kill things that don’t work.
Your assumptions will be wrong
    once you’ve scaled 10x.
“We don’t optimize for being right. We optimize
  for quickly detecting when we’re wrong.”
                                ~Kellan Elliott-McCrea, CTO
Become really good at changing
     your architecture.
Invest time in architecture by the
       2nd or 3rd iteration.
Integration and Operations
WARNING

          REMEMBER THIS?
Continuous Deployment
      Small, frequent changes.
Constantly integrating into production.
         30 deploys per day.
Code review before commit
Automated tests before deploy
Why Integrate with Production?
Dev ≠ Prod
Verify frequently and in small batches.
Integrating with production is a test in itself.
    We do this frequently and in small batches.
More database servers in prod.
Bigger database hardware in prod.
More web servers.
Various replication schemes.
Different versions of server and OS software.
Schema changes applied at different times.
Physical hardware in prod.
More data in prod.
Legacy data (7 years of odd user states).
More traffic in prod.
Wait, I mean MUCH more traffic in prod.
Fewer elves.
Faster disks (SSDs) in prod.
Using a MySQL database in dev for an application that will be running
on Oracle in production: Priceless
Verify frequently and in small batches.
Dev ⇾ QA ⇾ Staging ⇾ Prod
Dev ⇾ QA ⇾ Staging ⇾ Prod
Dev ⇾ Pre-Prod ⇾ Prod
Test and integrate where you’ll see value.
Config flags (again)
off, on, staff, opt-in prototypes, user list, 0-100%
Config flags (again)
off, on, staff, opt-in prototypes, user list, 0-100%

                    “canary pools”
Automated tests after deploy
Real-time metrics and dashboards
Network & Servers, Application, Business
SERVER METRICS
Apache requests/sec, Busy processes,
CPU utilization, Script exec time (med. & 95th)
APPLICATION METRICS
Logins, Registrations, Checkouts,
Listings created, Forum posts


Time and event correlated.
Real humans reporting trouble!
“Theoretical” vs. “Practical” Engineering
Config flags
                          Operator
                                                        Metrics




http://www.flickr.com/photos/flyforfun/2694158656/
Managing risk during Holiday Shopping season
  Thanksgiving, “Black Friday,” “Cyber Monday” ➔ Christmas
                           (~30 days)



                    Code Freeze?
DEPLOYMENTS PER DAY




                      “Code Slush”
Tighten your feedback cycles
Integrate with production and validate early in cycle.
Use tools that allow you to detect issues early.
Optimize for quick response times.

Applied to both feature development and operability.
Thank you
                                                           ... and questions?


These slides will be available later today at http://mikebrittain.com/talks



    Mike Brittain
     ENGINEERING DIRECTOR       @mikebrittain
                                mike@etsy.com

Continuous Deployment: The Dirty Details

  • 1.
    CONTINUOUS DEPLOYMENT The Dirty Details Mike Brittain ENGINEERING DIRECTOR @mikebrittain mike@etsy.com
  • 2.
    CONTINUOUS DEPLOYMENT The Dirty Details “OK, sounds cool. But I have some questions...”
  • 3.
    CD 100- &200-levels - CI environment for automated tests - Committing to trunk - Branching in code - Config flags (a.k.a. feature flags) - DevOps mentality - Metrics and alerting - Automated deploy scripts credit: photobookgirl (flickr)
  • 4.
    CD 100- &200-levels - CI environment for automated tests - Committing to trunk - Branching in code - Config flags (a.k.a. feature flags) - DevOps mentality CD 300 level - Metrics and alerting - Deploys vs. releases - Automated deploy scripts - Decoupled systems, schema changes - How we work: Arch. & Process - Integration and Operations credit: photobookgirl (flickr)
  • 6.
    www. .com
  • 7.
  • 8.
    DECEMBER 2012 1.5 Billion page views $117 Million of goods sold 6 Million items sold Items by anjaysdesigns, betwixxt, OneStarLeatherGoods, mediumcontrol, TheDesignPallet http://www.etsy.com/blog/news/2013/etsy-statistics-december-2012-weather-report/
  • 9.
    175+ Committers, everyonedeploys credit: martin_heigan (flickr)
  • 10.
    DEPLOYMENTS PER DAY Very end of 2009 Today
  • 11.
    Continuous delivery isa pattern language in growing use in software development to improve the process of software delivery. Techniques such as automated testing, continuous integration, and continuous deployment allow software to be developed to a high standard and easily packaged and deployed to test environments, resulting in the ability to rapidly, reliably and repeatedly push out enhancements and bug fixes to customers at low risk and with minimal manual overhead. ~wikipedia credit: Stewart, redgen (flickr)
  • 12.
    Architecture Stack Linux, Apache, MySQL, PHP Memcache, Gearman, Postgresql, Solr, Java, Traffic Server, Hadoop, HBase Git, Jenkins credit: Saire Elizabeth (flickr)
  • 14.
    Then Now 2009 2010-today Just before we started using CD
  • 15.
    Then Now 6-14 hours 15 mins “Deployment Army” 1 person Special event and Part of everyday highly orchestrated workflow
  • 16.
    Then Now Blocked for Blocked for 6-14 hours. 15 minutes. 6+ hours to 15 minutes to redeploy redeploy
  • 17.
    Then Now Release branch, Mainline, database schemas, minimal linking data transforms, and building, packaging, rsync, rolling restarts, site up cache purging, scheduled downtime
  • 18.
    1st day Put yourface on Etsy.com.
  • 19.
    2nd day Complete tax, insurance, and benefits forms. credit: ktpupp (flickr)
  • 21.
  • 22.
  • 23.
    Continuous Deployment Small, frequent changes. Constantly integrating into production. 30+ deploys per day.
  • 24.
    “Wow... 30 deploysa day. How do you build features so quickly?”
  • 25.
    Software Deploy ≠Product Launch
  • 26.
    Deploys frequently gatedby config flags (“dark” releases)
  • 27.
    $cfg[‘new_search’] = array('enabled' => 'off'); $cfg[‘sign_in’] = array('enabled' => 'on'); $cfg[‘checkout’] = array('enabled' => 'on'); $cfg[‘homepage’] = array('enabled' => 'on');
  • 28.
  • 29.
    $cfg[‘new_search’] = array('enabled'=> 'off'); // Meanwhile... # old and boring search $results = do_grep();
  • 30.
    $cfg[‘new_search’] = array('enabled'=> 'off'); // Meanwhile... if ($cfg[‘new_search’] == ‘on’) { # New and fancy search $results = do_solr(); } else { # old and boring search $results = do_grep(); }
  • 31.
    $cfg[‘new_search’] = array('enabled'=> 'on'); // or... $cfg[‘new_search’] = array('enabled' => 'staff'); // or... $cfg[‘new_search’] = array('enabled' => '1%'); // or... $cfg[‘new_search’] = array('enabled' => 'users', 'user_list' => 'mike,john,kellan');
  • 32.
    Validate in production,hidden from public.
  • 33.
    What’s in adeploy? Small incremental changes to the application New classes, methods, controllers Graphics, stylesheets, templates Copy/content changes Turning flags on, off, or % ramp up
  • 34.
    Low MTTR (responsetimes) Latent bugs and security holes Traffic management, load shedding Adding and removing infrastructure Tweaking config flags or releasing patches.
  • 35.
  • 36.
    Config flags Operator Metrics http://www.flickr.com/photos/flyforfun/2694158656/
  • 37.
  • 38.
  • 39.
    Many deploys eventuallylead to a product launch.
  • 40.
    “How do youcontinuously deploy database schema changes?”
  • 41.
    Code deploys ~15-20minutes Schema changes
  • 42.
    Code deploys ~15-20minutes Schema changes THURSDAYS!
  • 43.
    Our web applicationis largely monolithic. Etsy.com, Support & Back-office tools, Developer API, Gearman (async work)
  • 44.
    Our web applicationis largely monolithic. Etsy.com, Support & Back-office tools, Developer API, Gearman (async work) PHP, Apache, Memcache
  • 45.
    External “services” arenot deployed with the main application. e.g. Databases, Search, Photo storage, Payments
  • 46.
    External “services” arenot deployed with the main application. e.g. Databases, Search, Photo storage, Payments MYSQL PCI PROXY CACHE, (schema changes) (controlled access) FILERS, AMAZON S3 SOLR, JVM (specialized infra.) (rolling restarts)
  • 47.
    For every configflag, there are two states we can support — present and future.
  • 48.
    For every configflag, there are two states we can support — present and future. ... or past and present.
  • 49.
    “Non-Breaking Expansions” Exposenew version in a service interface; support multiple versions in the consumer.
  • 50.
    Example: Changing aDatabase Schema Merging “users” and “users_prefs”
  • 51.
    C RULE OF THUMB: Prefer ADDs over ALTERs (non-breaking expansion)
  • 52.
    1. Write toboth versions 2. Backfill historical data 3. Read from new version 4. Cut-off writes to old version
  • 53.
    0. Add newversion to schema 1. Write to both versions 2. Backfill historical data 3. Read from new version 4. Cut-off writes to old version
  • 54.
    0. Add newversion to schema Schema change to add prefs columns to “users” table. “write_prefs_to_user_prefs_table” => “on” “write_prefs_to_users_table” => “off” “read_prefs_from_users_table” => “off”
  • 55.
    1. Write toboth versions Write code for writing prefs to the “users” table. “write_prefs_to_user_prefs_table” => “on” “write_prefs_to_users_table” => “on” “read_prefs_from_users_table” => “off”
  • 56.
    2. Backfill historicaldata Offline process to sync existing data from “user_prefs” to new columns in “users”
  • 57.
    3. Read fromnew version Data validation tests. Ensure consistency both internally and in production. “write_prefs_to_user_prefs_table” => “on” “write_prefs_to_users_table” => “on” “read_prefs_from_users_table” => “staff”
  • 58.
    3. Read fromnew version Data validation tests. Ensure consistency both internally and in production. “write_prefs_to_user_prefs_table” => “on” “write_prefs_to_users_table” => “on” “read_prefs_from_users_table” => “1%”
  • 59.
    3. Read fromnew version Data validation tests. Ensure consistency both internally and in production. “write_prefs_to_user_prefs_table” => “on” “write_prefs_to_users_table” => “on” “read_prefs_from_users_table” => “5%”
  • 60.
    3. Read fromnew version Data validation tests. Ensure consistency both internally and in production. “write_prefs_to_user_prefs_table” => “on” “write_prefs_to_users_table” => “on” “read_prefs_from_users_table” => “on” (“on” == “100%”)
  • 61.
    4. Cut-off writesto old version After running on the new table for a significant amount of time, we can cut off writes to the old table. “write_prefs_to_user_prefs_table” => “off” “write_prefs_to_users_table” => “on” “read_prefs_from_users_table” => “on”
  • 62.
    “Branch by Astraction” Controller Controller Users Model (Abstraction) “users” (old) “user_prefs” “users” old schema new schema http://paulhammant.com/blog/branch_by_abstraction.html http://continuousdelivery.com/2011/05/make-large-scale-changes-incrementally-with-branch-by-abstraction/
  • 63.
    “The Migration 4-Step” 1.Write to both versions 2. Backfill historical data 3. Read from new version 4. Cut-off writes to old version
  • 64.
    “When do youclean up all of those config flags?
  • 65.
    We might removeconfig flags for the old version when... It is no longer valid for the business. It is no longer stable, maintained, or trusted. It has poor performance characteristics. The code is a mess, or difficult to read. We can afford to spend time on it.
  • 66.
    Promote “dev flags”to “feature flags”
  • 67.
    // Feature flag $cfg[‘mobilized_pages’]= array('enabled' => 'on'); // Dev flags $cfg[‘mobile_templates_seller_tools’] = array('enabled' => 'on'); $cfg[‘mobile_templates_account_tools’] = array('enabled' => 'on'); $cfg[‘mobile_templates_member_profile’] = array('enabled' => 'on'); $cfg[‘mobile_templates_search’] = array('enabled' => 'off'); $cfg[‘mobile_templates_activity_feed’] = array('enabled' => 'off'); ... if ($cfg[‘mobilized_pages’] == ‘on’ && $cfg[‘mobile_templates_search’] == ‘on’) { // ... // ... }
  • 68.
    // Feature flags $cfg[‘search’] = array('enabled' => 'on'); $cfg[‘developer_api’] = array('enabled' => 'on'); $cfg[‘seller_tools’] = array('enabled' => 'on'); $cfg[‘the_entire_web_site’] = array('enabled' => 'on');
  • 70.
    // Feature flags $cfg[‘search’] = array('enabled' => 'on'); $cfg[‘developer_api’] = array('enabled' => 'on'); $cfg[‘seller_tools’] = array('enabled' => 'on'); $cfg[‘the_entire_web_site’] = array('enabled' => 'on'); $cfg[‘the_entire_web_site_no_really_i_mean_it’] = array('enabled' => 'on');
  • 71.
  • 72.
  • 73.
  • 74.
    Some philosophies onproduct development...
  • 75.
    Gathering data shouldbe cheap, too. staff, opt-in prototypes, 1%
  • 76.
    Treat first iterationsas experiments.
  • 77.
    Get into codeas quickly as possible.
  • 78.
    “Where a newsystem concept or new technology is used, one has to build a system to throw away, for even the best planning is not so omniscient as to get it right the first time. Hence plan to throw one away; you will, anyhow.” ~ Fred Brooks, The Mythical Man-Month
  • 79.
  • 80.
    Kill things thatdon’t work.
  • 81.
    Your assumptions willbe wrong once you’ve scaled 10x.
  • 82.
    “We don’t optimizefor being right. We optimize for quickly detecting when we’re wrong.” ~Kellan Elliott-McCrea, CTO
  • 83.
    Become really goodat changing your architecture.
  • 84.
    Invest time inarchitecture by the 2nd or 3rd iteration.
  • 85.
  • 86.
    WARNING REMEMBER THIS?
  • 87.
    Continuous Deployment Small, frequent changes. Constantly integrating into production. 30 deploys per day.
  • 88.
  • 89.
  • 90.
    Why Integrate withProduction?
  • 91.
  • 92.
    Verify frequently andin small batches.
  • 93.
    Integrating with productionis a test in itself. We do this frequently and in small batches.
  • 94.
    More database serversin prod. Bigger database hardware in prod. More web servers. Various replication schemes. Different versions of server and OS software. Schema changes applied at different times. Physical hardware in prod. More data in prod. Legacy data (7 years of odd user states). More traffic in prod. Wait, I mean MUCH more traffic in prod. Fewer elves. Faster disks (SSDs) in prod.
  • 95.
    Using a MySQLdatabase in dev for an application that will be running on Oracle in production: Priceless
  • 96.
    Verify frequently andin small batches.
  • 97.
    Dev ⇾ QA⇾ Staging ⇾ Prod
  • 98.
    Dev ⇾ QA⇾ Staging ⇾ Prod
  • 99.
  • 100.
    Test and integratewhere you’ll see value.
  • 101.
    Config flags (again) off,on, staff, opt-in prototypes, user list, 0-100%
  • 102.
    Config flags (again) off,on, staff, opt-in prototypes, user list, 0-100% “canary pools”
  • 103.
  • 104.
    Real-time metrics anddashboards Network & Servers, Application, Business
  • 106.
    SERVER METRICS Apache requests/sec,Busy processes, CPU utilization, Script exec time (med. & 95th) APPLICATION METRICS Logins, Registrations, Checkouts, Listings created, Forum posts Time and event correlated.
  • 108.
  • 109.
  • 110.
    Config flags Operator Metrics http://www.flickr.com/photos/flyforfun/2694158656/
  • 111.
    Managing risk duringHoliday Shopping season Thanksgiving, “Black Friday,” “Cyber Monday” ➔ Christmas (~30 days) Code Freeze?
  • 112.
    DEPLOYMENTS PER DAY “Code Slush”
  • 113.
    Tighten your feedbackcycles Integrate with production and validate early in cycle. Use tools that allow you to detect issues early. Optimize for quick response times. Applied to both feature development and operability.
  • 114.
    Thank you ... and questions? These slides will be available later today at http://mikebrittain.com/talks Mike Brittain ENGINEERING DIRECTOR @mikebrittain mike@etsy.com