Presented at ALM Summit 3 in Redmond, WA. January 2013.
Like what you've read? We're frequently hiring for a variety of engineering roles at Etsy. If you're interested, drop me a line or send me your resume: mike@etsy.com.
http://www.etsy.com/careers
10+ Deploys Per Day: Dev and Ops Cooperation at Flickr (John Allspaw)
Communications and cooperation between development and operations isn't optional, it's mandatory. Flickr takes the idea of "release early, release often" to an extreme - on a normal day there are 10 full deployments of the site to our servers. This session discusses why this rate of change works so well, and the culture and technology needed to make it possible.
DevOps practices like continuous testing and shifting testing left in the development lifecycle can improve software quality and reduce issues. Continuous testing involves testing early and often at every stage using test automation. This provides rapid feedback to prevent defects. Tools like Puppet, Ansible, and Docker can automate testing, configuration, and deployment to support continuous integration and delivery.
Manual testing interview questions by infotech (suhasreddy1)
The document provides information about manual software testing practices including definitions of priority and severity for defects, examples of high severity low priority defects, bases for test case review, contents of requirements documents, differences between web application and client server testing, examples of defect reporting, bug lifecycles, and approaches to regression testing. Key details covered include assigning priority by developers and severity by testers, focusing regression testing on modules impacted by fixes, and updating test cases based on changes to functionality or code.
The presentation about the fundamentals of DevOps workflow and CI/CD practices I presented at Centroida (https://centroida.ai/) as a back-end development intern.
I gave this presentation on 5/17 to the New Mexico VMUG in Santa Fe. The presentation provides an overview of OpenStack, what it is (and isn't), and some things you might learn to get started with OpenStack.
The Doctrine ORM offers far more flexibility than it appears. In this presentation we look at how it works internally and at its lesser-known features, to discover how to use it better. On the agenda: events and listeners, filters, tracking policies, plus tips on possible architectures for your code...
Test cases are used to systematically test software and verify requirements. A test case contains a set of steps, expected results, and actual results. It has a name, description, prerequisites, and test data. Each test case contains multiple test steps that verify a discrete action. Best practices for writing test cases include avoiding jargon, writing steps independently, and focusing on positive scenarios. Test cases are organized into templates with required fields and naming conventions to facilitate management in testing tools.
This document discusses testing RESTful web services using REST Assured. It provides an overview of REST and HTTP methods like GET, POST, PUT, and DELETE. It explains why API automation is valuable for early defect detection, contract validation, and stopping builds on failure. REST Assured allows testing and validating REST services in Java and integrates with frameworks like JUnit and TestNG. It provides methods to format HTTP requests, send requests, and validate status codes and response data. REST Assured also handles authentication mechanisms. The document provides instructions on adding the REST Assured Maven dependency and writing tests, including an example of a GET request.
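The deck's own examples use REST Assured in Java; as a rough illustration of the same flow (send a GET, assert on the status code and body), here is a minimal sketch in Python with the requests library, against a purely hypothetical endpoint and field names:

```python
import requests

BASE_URL = "https://api.example.com"  # hypothetical service used only for illustration


def test_get_user_returns_200_and_a_name():
    # Send the request (the equivalent of REST Assured's given().when().get(...)).
    response = requests.get(f"{BASE_URL}/users/1", timeout=5)

    # Validate the status code (like .then().statusCode(200)).
    assert response.status_code == 200

    # Validate part of the JSON response body (like .body("name", notNullValue())).
    body = response.json()
    assert "name" in body and body["name"]
```

Run under a test runner such as pytest, a failing assertion like this is what lets CI stop a build before a broken API reaches later stages.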
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD... (DevOpsDays Tel Aviv)
This document discusses best practices for site reliability engineering (SRE). It recommends hiring only coders, establishing service level agreements (SLAs) and measuring performance against them. It also suggests using error budgets, maintaining a common staffing pool for SRE and development teams, ensuring on-call teams have at least 8 people, and conducting post-mortems after every incident. Key reliability metrics like availability, latency, throughput and quality are identified. Objectives, service level objectives (SLOs) and responses if the error budget is exceeded or exhausted are outlined.
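To make the error-budget idea concrete, the usual arithmetic is that an SLO leaves the remaining fraction of the measurement window as budget; a 99.9% availability objective over 30 days allows roughly 43 minutes of downtime. A tiny sketch (the SLO value and window are just example numbers):

```python
# Error budget for an availability SLO over a rolling window.
slo = 0.999                    # 99.9% availability objective (example value)
window_minutes = 30 * 24 * 60  # 30-day window = 43,200 minutes

error_budget_minutes = (1 - slo) * window_minutes
print(f"Allowed downtime per 30 days: {error_budget_minutes:.1f} minutes")  # ~43.2
```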
The document discusses test management for software quality assurance, including defining test management as organizing and controlling the testing process and artifacts. It covers the phases of test management like planning, authoring, execution, and reporting. Additionally, it discusses challenges in test management, priorities and classifications for testing, and the role and responsibilities of the test manager.
Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ... (Kai Wähner)
Talk from Kafka Summit San Francisco 2019 (https://kafka-summit.org/sessions/event-driven-model-serving-stream-processing-vs-rpc-kafka-tensorflow/). Video recording will be available for free on the Summit website.
Event-based stream processing is a modern paradigm for continuously processing incoming data feeds, e.g. for IoT sensor analytics, payment and fraud detection, or logistics. Machine Learning / Deep Learning models can be leveraged in different ways to make predictions and improve business processes. Either analytic models are deployed natively in the application, or they are hosted in a remote model server. In the latter case you combine stream processing with the RPC / request-response paradigm instead of doing inference directly within the application. This talk discusses the pros and cons of both approaches and shows examples of stream processing vs. RPC model serving using Kubernetes, Apache Kafka, Kafka Streams, gRPC and TensorFlow Serving. The trade-offs of using a public cloud service like AWS or GCP for model deployment are also discussed and compared to local hosting for offline predictions directly “at the edge”.
Key takeaways
• Machine Learning / Deep Learning models can be used in different ways to do predictions. Scalability and loose coupling are important success factors
• Stream processing vs. RPC / Request-Response for model serving has many trade-offs – learn about alternatives and best practices for your different scenarios
• Understand the alternatives and trade-offs of model deployment in modern infrastructures like Kubernetes or Cloud Services like AWS or GCP
• See live demos with Java, gRPC, Apache Kafka, KSQL and TensorFlow Serving to understand the trade-offs
Postman is an API development tool that allows users to design, manage, run, test, document, and share APIs. It provides features like request building, documentation, environments, test automation, and collaboration. Alternatives include Paw, Insomnia, command line tools like cURL, and services from Apigee and Apiary. The document recommends using any tool that helps share APIs, especially for complex projects and team collaboration.
Performance Engineering Case Study V1.0 (sambitgarnaik)
This document discusses performance testing solutions and services offered by IonIdea. It provides an overview of IonIdea's performance testing tools for load testing, performance testing, and monitoring application and infrastructure performance. It also describes IonIdea's testing services such as performance testing, test automation consulting, and outsourced testing. Finally, it presents a case study example of how IonIdea used performance triage techniques including profiling and load testing to identify and address performance issues for an online banking application.
In this session, we will learn about the TeamCity CI server. We will look at the different options available and how we can set up a CI pipeline using TeamCity.
Behavior-driven development is the process of exploring, discovering, defining and driving the desired behavior of a software system by using conversation, concrete examples and automated tests.
Load Testing Best Practices: Application complexity is increasing, and the requirements for web performance are growing ever more stringent. Learn more about the three major types of load testing, determine which you need, and how to conduct them.
OpenFaaS is a serverless framework that allows users to build and run applications without managing servers. It uses Docker and Kubernetes to package any process as a serverless function. Some key advantages of OpenFaaS include only paying for necessary resources, not needing to manage servers, and automatic scaling of application resources. Functions as a Service (FaaS) offers the ability to create single use functions with complete server abstraction for developers. The presentation then demonstrates example functions and the project structure used by OpenFaaS.
Service Mesh with Apache Kafka, Kubernetes, Envoy, Istio and Linkerd (Kai Wähner)
Microservice architectures are not a free lunch! Microservices need to be decoupled, flexible, operationally transparent, data aware and elastic. Most material from recent years only discusses point-to-point architectures with inflexible and non-scalable technologies like REST / HTTP. This video takes a look at cutting-edge technologies like Apache Kafka, Kubernetes, Envoy, Linkerd and Istio to implement a cloud-native service mesh that solves these challenges and brings microservices to the next level of scale, speed and efficiency.
Key takeaways:
- Apache Kafka decouples services, including event streams and request-response
- Kubernetes provides a cloud-native infrastructure for the Kafka ecosystem
- Service Mesh helps with security and observability at ecosystem / organization scale
- Envoy and Istio sit in the layer above Kafka and are orthogonal to the goals Kafka addresses
Blog post: http://www.kai-waehner.de/blog/2019/09/24/cloud-native-apache-kafka-kubernetes-envoy-istio-linkerd-service-mesh
Video recording of this slide deck: https://youtu.be/Us_C4RFOUrA
Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys... (StreamNative)
The document discusses Toast's adoption and use of Apache Pulsar for asynchronous messaging in their microservices architecture. It describes how they built a "Pulsar Toggle" leveraging Envoy proxy to enable blue/green deployments of Pulsar consumers. The Pulsar Toggle allows consumers to be paused and resumed based on their status in the Envoy control plane, improving the reliability and usability of deploying changes to Pulsar-based services. Toast has seen increased adoption of Pulsar and benefits from its stability and scalability.
Since its first 1.12 release in July 2016, Docker Swarm Mode has matured into a capable clustering and scheduling tool with which IT administrators and developers can easily establish and manage a cluster of Docker nodes as a single virtual system. Swarm mode integrates the orchestration capabilities of Docker Swarm into Docker Engine itself and lets administrators and developers add or remove container instances as computing demands change. With sophisticated but easy-to-implement features like built-in service discovery, routing mesh, secrets, a declarative service model, service scaling, desired state reconciliation, scheduling, filters, a multi-host networking model, load balancing, rolling updates and more, Docker 17.06 is ready for production use today. Join this webinar organised by Docker Izmir to get familiar with the current Swarm Mode capabilities and functionality across heterogeneous environments.
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka... (confluent)
In the financial industry, losing data is unacceptable. Financial firms are adopting Kafka for their critical applications. Kafka provides the low latency, high throughput, high availability, and scale that these applications require. But can it also provide complete reliability? As a system architect, when asked “Can you guarantee that we will always get every transaction,” you want to be able to say “Yes” with total confidence.
In this session, we will go over everything that happens to a message – from producer to consumer, and pinpoint all the places where data can be lost – if you are not careful. You will learn how developers and operation teams can work together to build a bulletproof data pipeline with Kafka. And if you need proof that you built a reliable system – we’ll show you how you can build the system to prove this too.
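On the producer side, the settings most often associated with not losing acknowledged messages can be sketched briefly. This is a generic example using the confluent-kafka Python client with made-up broker addresses and topic name, not code from the talk; broker-side settings such as replication factor and min.insync.replicas matter just as much but live in the cluster configuration:

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker1:9092,broker2:9092",  # placeholder brokers
    "acks": "all",               # wait for all in-sync replicas before acknowledging
    "enable.idempotence": True,  # retries won't introduce duplicates
    "retries": 5,                # retry transient errors instead of dropping the message
})


def on_delivery(err, msg):
    # If err is set, the write was never durably acknowledged and must be handled.
    if err is not None:
        print(f"Delivery failed for key {msg.key()}: {err}")


producer.produce("transactions", key=b"txn-42", value=b'{"amount": 100}',
                 callback=on_delivery)
producer.flush()  # block until outstanding messages are delivered or have failed
```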
"API Testing: The heart of functional testing" with Bj Rollison (TEST Huddle)
View webinar: http://www.eurostarconferences.com/community/member/webinar-archive/webinar-81-api-testing-the-heart-of-functional-testing
An API, or Application Programming Interface, is a collection of functions that provide much of the functional capability in complex software systems. Most customers are accustomed to interacting with a graphical user interface on the computer, but many do not realize that much of a program's functionality comes from APIs in the operating system or the program's dynamic-link libraries (DLLs). So, if the business logic or core functionality is exposed via an API, and we want to find functional bugs sooner, then API testing may be an approach that provides additional value in your overall test strategy. Additionally, API testing can start even before the user interface is complete, so functional capabilities can be tested while designers are still hashing out the "look and feel." API testing will not replace testing through the user interface, but it can augment your test strategy and provide a solid foundation of automated tests that increase your confidence in the functional quality of your product.
The document discusses migrating a database table's column from storing percentage values in the range of 0 to 100 to storing decimal values in the range of 0 to 1, without downtime or changing the database schema. It proposes doing this by adding a new column to store the decimal values, populating it from the existing percentage column, then dropping the original column and modifying the new one to be the main value column. This allows applications to continue reading and writing as normal during the migration.
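The add / backfill / switch / drop sequence described above is the classic expand-and-contract migration. A toy sketch with SQLite in Python, using invented table and column names; in practice each step ships as its own deploy so readers and writers are never broken mid-migration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (id INTEGER PRIMARY KEY, discount_pct REAL)")
conn.execute("INSERT INTO listings VALUES (1, 25.0), (2, 60.0)")

# Step 1 (expand): add the new column alongside the old one.
conn.execute("ALTER TABLE listings ADD COLUMN discount_rate REAL")

# Step 2 (backfill): derive the new values from the old ones (0-100 -> 0-1).
conn.execute("UPDATE listings SET discount_rate = discount_pct / 100.0")

# Step 3: application code is switched to read and write discount_rate.

# Step 4 (contract): once nothing references the old column, drop it.
# (DROP COLUMN needs SQLite 3.35+; older versions require a table rebuild.)
conn.execute("ALTER TABLE listings DROP COLUMN discount_pct")

print(conn.execute("SELECT id, discount_rate FROM listings").fetchall())
# [(1, 0.25), (2, 0.6)]
```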
Continuous Deployment at Etsy: A Tale of Two Approaches (Ross Snyder)
1. Etsy has transitioned from infrequent deployments that took weeks of work and often broke the site, to deploying up to 25 times per day with near effortless deploys.
2. By deploying frequently with small code changes and thorough testing, the probability and severity of degradations is reduced, allowing issues to be detected and resolved quickly.
3. Etsy's continuous deployment approach enables rapid experimentation and improvement through frequent analysis of deployment outcomes and re-examination of assumptions.
Principles and Practices in Continuous Deployment at Etsy (Mike Brittain)
This document discusses principles and practices of continuous deployment at Etsy. It describes how Etsy moved from deploying code changes every 2-3 weeks with stressful release processes, to deploying over 30 times per day. The key principles that enabled this are innovating continuously, resolving scaling issues quickly, minimizing recovery time from failures, and prioritizing employee well-being over stressful releases. Automated testing, deployment to staging environments, dark launches, and extensive monitoring allow for frequent, low-risk deployments to production.
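Dark launches of the sort mentioned above are usually driven by configuration flags that ramp a new code path up gradually. The sketch below is illustrative only (the flag name, ramp percentage, and hashing scheme are invented), not Etsy's actual flag system:

```python
import hashlib

# Hypothetical flag configuration, deployed like any other config change.
FLAGS = {
    "new_checkout_flow": {"enabled": True, "ramp_percent": 5},
}


def feature_enabled(flag_name: str, user_id: int) -> bool:
    flag = FLAGS.get(flag_name)
    if not flag or not flag["enabled"]:
        return False
    # Hash the user id so each user gets a stable answer as the ramp percentage grows.
    bucket = int(hashlib.sha1(f"{flag_name}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < flag["ramp_percent"]


if feature_enabled("new_checkout_flow", user_id=12345):
    pass  # run the dark-launched code path
else:
    pass  # keep serving the existing path
```

Ramping a flag from 1% to 100% over several small deploys is what keeps each individual change low-risk.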
Databases create a real challenge for automation and dealing with database deployments is a complex process. Databases contain our most valuable information, business data, which must be preserved and protected at all costs and yet the automation processes for database deployment are not widely adopted.
I'm going to cover something which could be seen as essential for Cassandra but which hasn't gotten much attention in the Cassandra community and literature. It's schema migrations--how you go about pushing out and versioning changes to your keyspace and table definitions across environments. This is an area that has established solutions in the relational database world, with tools like Liquibase(http://www.liquibase.org/) and Flyway (http://flywaydb.org/) and in web frameworks like Rails and Grails.
I'll explain the different types of migrations but then focus, for most of the talk, on schema migrations. I'll explain how schema migrations have been done in the Cassandra community and the roadblocks teams have faced trying to use Liquibase and Flyway to manage Cassandra migrations.
Then I'll share an elegant, lightweight schema migrations system that we at GridPoint built on top of Flyway. I'll use our system as a context for discussing schema migration best practices for Cassandra and the various choices teams have for their migrations and table definitions, including when NOT to use a tool like Flyway. I'll also touch on the other types of migrations besides keyspace and table definitions that can be versioned and driven off source control.
How to Get to Second Base with Your CDN (Mike Brittain)
Tips on how to improve how you use your CDN. Condensed from a lot of material, this talk was crammed into 20 minutes.
More info available at http://mikebrittain.com.
Details on how we capture application data in our access and error logs, as well as how to generate quick reports and graphs from these logs.
This talk was presented at O'Reilly's Velocity Online Conference on October 26, 2011.
Michael Kjellman, Software Engineer at Barracuda Networks, has offered to present on his experiences with Apache Cassandra.
Come learn about:
• Continuous Deployments with Cassandra
• Upgrading Cassandra
• When Upgrades Go Wrong
• Coding Complexity Moved to Operations (How to Prepare and Plan)
• Why 'Apt-get/Yum Install Cassandra' is a bad idea
• Why You Should Treat Cassandra’s Code like it's Your Own
Advanced Topics in Continuous Deployment (Mike Brittain)
Like what you've read? We're frequently hiring for a variety of engineering roles at Etsy. If you're interested, drop me a line or send me your resume: mike@etsy.com.
http://www.etsy.com/careers
The document discusses how exponential growth and decay models can be applied to many real-world phenomena, from the spread of bacteria and fungus to the growth of social networks and e-commerce companies. It provides examples of how concepts like doubling time, half-life, word of mouth effects, and time delays can exponentially impact various systems over time if not managed properly. The key message is that exponential behavior is more common and influential than often realized.
How do you continue to ship 50 times a day when you're constantly hiring more engineers? How can you continue when every day you write more tests that need to be run on every commit? This talk will cover how to scale up continuous integration and continuous deployment infrastructure, for teams as small as a handful of engineers and as large as hundreds of engineers.
This document summarizes a workshop about DevOps practices like continuous integration, infrastructure as code, and automated deployment. It discusses:
1. Using continuous integration to automatically build and test code changes.
2. Defining infrastructure using configuration files to enable consistent and automated environments from development to production.
3. Automating deployment through tools that rebuild servers from code and data to enable rapid, repeatable releases with minimal downtime.
The document provides examples using tools like Puppet, AWS, and CloudFormation to demonstrate these practices on a sample Twitter application. It emphasizes how automation enables faster delivery of features while maintaining production stability.
Metrics-driven engineering is practiced at Etsy. Engineers build applications and also manage monitoring tools like Graphite and Ganglia to track metrics and visualize logs and events. Over 16,000 metrics are tracked in Graphite, along with logs, to provide visibility into application health and correlate it with deployments and other events. Dashboards are used to mix and match metrics and provide a high-level view of site performance and validation.
This document discusses metrics-driven engineering practices at Etsy including collecting and visualizing business, application, and system metrics to gain visibility and make data-driven decisions. Key points include using tools like Ganglia, Graphite, Splunk, Logster, and StatsD to monitor metrics on clusters, applications, logs, and more. The metrics provide insights on site traffic, feature usage, server health, code deployments, and errors to help optimize performance, detect and address issues, and plan infrastructure needs.
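StatsD itself just accepts plain-text metrics over UDP, which is what makes instrumenting application code this cheap. A minimal sketch using only the Python standard library; the host, port, and metric names are illustrative:

```python
import socket
import time

STATSD_ADDR = ("127.0.0.1", 8125)  # StatsD's conventional UDP port
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


def incr(metric: str, value: int = 1) -> None:
    sock.sendto(f"{metric}:{value}|c".encode(), STATSD_ADDR)    # counter


def timing(metric: str, ms: float) -> None:
    sock.sendto(f"{metric}:{ms:.1f}|ms".encode(), STATSD_ADDR)  # timer


start = time.monotonic()
# ... handle a request ...
incr("web.checkout.requests")
timing("web.checkout.response_time", (time.monotonic() - start) * 1000)
```

Because the sends are fire-and-forget UDP, instrumentation adds effectively no latency to the request path, which is why teams can afford to track thousands of metrics.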
Web Performance Culture and Tools at Etsy (Mike Brittain)
Mike Brittain presented on web performance culture and tools at Etsy. He discussed how Etsy focuses on performance to improve business metrics like conversion rates and page views. Engineers use tools like logging, Logster, Graphite, StatsD, and custom dashboards to measure performance. They have processes for continuous deployment, data-driven development and prioritizing optimizations. The tools and focus on measurement help Etsy engineers improve site stability and user experience.
This talk presents a comprehensive analysis of TLS in the SMTP world. We scanned over 20 million unique email recipient domains and analyzed TLS (X.509) certificates to measure overall STARTTLS deployment quality. We discovered a wealth of information that was previously unknown. The analysis will provide a good baseline in terms of STARTTLS and TLS certificates used in SMTP.
Scan tool: https://prbinu.github.io/tls-scan
The Hard Problems of Continuous Deployment (Timothy Fitz)
This document discusses the challenges of continuous deployment (CD) for data, mobile, and scale. For data, CD is made difficult by slow data updates/movement and schemas living outside code. Solutions include applying only cheap changes, applying changes to standby databases, and blue/green deploys. For mobile, users must opt-in to updates and app store submissions take time, but tools exist to help. For scale, availability, performance, and developer happiness must be maintained as user counts and tests increase. Techniques include fast tests, parallel testing, hardware scaling, and defining CD roles and processes.
AppSec++ Take the best of Agile, DevOps and CI/CD into your AppSec Program (Matt Tesauro)
This document discusses how to incorporate Agile, DevOps, and CI/CD principles into an application security (AppSec) program through the use of AppSec pipelines. It describes how Pearson created an AppSec pipeline to help optimize their AppSec team's resources, drive consistency, increase visibility, and reduce friction between development and security teams. The document advocates experimenting with AppSec pipelines to continuously improve processes through techniques like integrating Docker containers and writing security tests.
Infrastructure Continuous Delivery using CloudFormation (joehack3r)
How we continually update our CloudFormation stacks utilizing GitHub, Jenkins, and a custom Python script. This allows us to follow the practice of treating infrastructure as code and continuous delivery.
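The custom script itself isn't included, but the core of such a flow is an update-stack call followed by waiting for the result. A minimal hypothetical sketch with boto3 (the stack name, region, and template path are placeholders, and this is not the script described above):

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# The template lives in version control (GitHub) alongside application code.
with open("infrastructure/app-stack.yaml") as f:
    template_body = f.read()

# A CI job (e.g. Jenkins) runs this after a template change is merged.
# Note: update_stack raises an error if there are no changes to apply.
cfn.update_stack(
    StackName="app-stack",
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_IAM"],
)

# Block until the update succeeds or the stack rolls back.
cfn.get_waiter("stack_update_complete").wait(StackName="app-stack")
print("Stack update complete")
```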
The practical implementation of Continuous Delivery at Etsy, and how it enables the engineering team to build features quickly, refactor and change architecture, and respond to problems in production.
Presented at GOTO Aarhus 2012.
Like what you've read? We're frequently hiring for a variety of engineering roles at Etsy. If you're interested, drop me a line or send me your resume: mike@etsy.com.
http://www.etsy.com/careers
(ARC402) Deployment Automation: From Developers' Keyboards to End Users' Scre... (Amazon Web Services)
Some of the best businesses today are deploying their code dozens of times a day. How? By making heavy use of automation, smart tools, and repeatable patterns to get process out of the way and keep the workflow moving. Come to this session to learn how you can do this too, using services such as AWS OpsWorks, AWS CloudFormation, Amazon Simple Workflow Service, and other tools. We'll discuss a number of different deployment patterns, and what aspects you need to focus on when working toward deployment automation yourself.
RightScale Webinar: January 13, 2011 – Watch this webinar for a look behind the scenes as we discuss ServerTemplates and how they differ from alternative approaches.
Working Software Over Comprehensive Documentation (Andrii Dzynia)
This document provides information on various tools that can be used for agile software development and testing. It discusses tools for user stories, project planning, documentation, testing, reports, and session-based test management. Various options are presented for each category such as Excel, JIRA, Confluence, and specialized agile tools.
[RHFSeoul2017] 6 Steps to Transform Enterprise Applications (Daniel Oh)
The document provides a 6 step approach to transforming enterprise applications:
1. Re-organizing to DevOps;
2. Implementing self-service, on-demand infrastructure;
3. Automating deployments using tools like Puppet, Chef, and Kubernetes;
4. Establishing continuous integration and deployment pipelines;
5. Adopting advanced deployment techniques like blue-green deployments;
6. Moving to a microservices architecture.
This document provides best practices for developing applications in the cloud. It discusses recommendations such as limiting HTTP traffic by optimizing assets, using persistent storage instead of treating the file system as persistent, pushing state to clients or centralized services instead of relying on server-side state, automating deployments, and performing zero-downtime upgrades through techniques like blue-green deployments. The document also recommends avoiding vendor lock-in, separating environments, communicating asynchronously, and scaling applications dynamically based on metrics.
Deploy and Destroy: Testing Environments - Michael Arenzon - DevOpsDays Tel A... (DevOpsDays Tel Aviv)
One of the critical factors for development velocity is software correctness. Our ability to develop and ship new features fast is bounded by our ability to validate several aspects of each change: Does the feature meet the requirements? How does the feature affect existing code, and how might it affect the production environment? As the codebase grows and new features are added, our productivity naturally decreases, and our need for stronger guarantees of quality and correctness increases.
In this talk, I’ll focus on testing environments: why developers need a self-serve platform to create a full functioning environment on-demand, how such environments should be managed, and how can one restore part of the lost velocity. I’ll cover an internal system we use at AppsFlyer called ‘Namespaces’ that addresses the issue with the help of Mesos / Marathon, Docker, Traefik, and Consul.
This document discusses various topics related to developing web apps, including HTML5, responsive design, touch events, offline capabilities, and debugging tools. It provides links to resources on HTML5 features like media queries, SVG, web workers, and the page visibility API. It also covers techniques for adapting content like responsive web design, progressive enhancement, and server-side adaptation. Mobile browser stats and popular devices on Douban are mentioned. Frameworks like Bootstrap and tools like Weinre for debugging mobile apps are referenced.
This document discusses various techniques for making web applications work offline and with unreliable network connections, including:
- The application cache manifest which allows specifying cached resources to work offline
- Issues with the current manifest specification and potential enhancements
- The window.applicationCache API for caching resources and monitoring cache status
- Detecting online/offline status using the navigator.onLine property
In short, it covers approaches for offline web applications using the application cache manifest, the applicationCache API, and the navigator.onLine property.
AD113 Speed Up Your Applications w/ Nginx and PageSpeed (edm00se)
My slide deck from my session, AD113: Speed Up Your Applications with Nginx + PageSpeed, at MWLUG 2015 in Atlanta, GA at the Ritz-Carlton.
For more, see:
- https://edm00se.io/self-promotion/mwlug-ad113-success
- https://github.com/edm00se/AD113-Speed-Up-Your-Apps-with-Nginx-and-PageSpeed
The Ember.js Framework - Everything You Need To Know (All Things Open)
All Things Open 2014 - Day 2
Thursday, October 23rd, 2014
Yehuda Katz
Founder of Tilde
Front Dev 1
The Ember.js Framework - Everything You Need To Know
Just In Time Scalability Agile Methods To Support Massive Growth Presentation (Timothy Fitz)
Eric Ries and Chris Hondl's MySQL conference presentation on Just In Time Scalability. http://startuplessonslearned.blogspot.com/2008/09/just-in-time-scalability.html
Just In Time Scalability Agile Methods To Support Massive Growth Presentation (Eric Ries)
The document discusses techniques for scaling web applications to support massive growth using agile methods. It describes IMVU's architecture transformation from a small site built on open source tools to a large scaled architecture. Key techniques discussed include continuous integration and deployment, incremental changes, monitoring metrics, caching, sharding databases, and evolving data designs without downtime.
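Of the techniques listed, sharding is the easiest to show in a few lines: route every row for a given key to the same database by a stable function of that key. A toy sketch (the shard names and simple modulo scheme are illustrative; real systems often prefer ranges or consistent hashing so shards can be added later):

```python
SHARDS = ["users_db_0", "users_db_1", "users_db_2", "users_db_3"]


def shard_for(user_id: int) -> str:
    # All reads and writes for a given user land on the same shard.
    return SHARDS[user_id % len(SHARDS)]


assert shard_for(42) == shard_for(42)   # routing is stable
print(shard_for(42), shard_for(1001))   # users_db_2 users_db_1
```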
How to measure everything - a million metrics per second with minimal develop... (Jos Boumans)
Krux is an infrastructure provider for many of the websites you use online today, like NYTimes.com, WSJ.com, Wikia and NBCU. For every request on those properties, Krux will get one or more as well. We grew from zero traffic to several billion requests per day in the span of 2 years, and we did so exclusively in AWS.
To make the right decisions in such a volatile environment, we knew that data is everything; without it, you can't possibly make informed decisions. However, collecting it efficiently, at scale, at minimal cost and without burdening developers is a tremendous challenge.
Join me in this session to learn how we overcame this challenge at Krux; I will share with you the details of how we set up our global infrastructure, entirely managed by Puppet, to capture over a million data points every second on virtually every part of the system, including inside the web server, user apps and Puppet itself, for under $2000/month using off-the-shelf Open Source software and some code we've released as Open Source ourselves. In addition, I'll show you how you can take (a subset of) these metrics and send them to advanced analytics and alerting tools like Circonus or Zabbix.
This content will be applicable for anyone collecting or desiring to collect vast amounts of metrics in a cloud or datacenter setting and making sense of them.
Real-World Pulsar Architectural Patterns (Devin Bost)
This presentation covers Real-World Pulsar Architectural Patterns involving Distributed Caching and Distributed Tracing. We also cover the use of Apache Ignite, Jaeger, Apache Flink, and many other technologies, as well as industry best-practices.
Everything is Awesome - Cutting the Corners off the Web (James Rakich)
The web is awesome despite its detractors. But we can't forget our fundamentals when we're trying to forge ahead with new tech. This talk is about how to approach the building blocks of the web in a way that takes advantage of their strengths and avoids their weaknesses.
Do you need Ops in your new startup? If not now, then when? And...what is Ops?
Learn how to scale ruby-based distributed software infrastructure in the cloud to serve 4,000 requests per second, handle 400 updates per second, and achieve 99.97% uptime – all while building the product at the speed of light.
Unimpressed? Now try doing the above altogether without the Ops team, while growing your traffic 100x in 6 months and deploying 5-6 times a day!
It could be a dream, but luckily it's a reality that could be yours.
The document provides an overview of Google App Engine, a platform for developing and hosting web applications on Google's infrastructure. It discusses the different language runtimes, services, and development tools available on App Engine and highlights some example applications that have been built on the platform. The document also shares experiences from Latin American users and details some new features recently added to App Engine like cursors, task queues, and cron jobs.
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014 (Amazon Web Services)
Log data contains some of the most valuable raw information you can gather and analyze about your infrastructure and applications. Amid the mess of confusing lines of seemingly random text can be hints about performance, security, flaws in code, user access patterns, and other operational data. Without the proper tools, finding insights in these logs can be like searching for a hay-colored needle in a haystack. In this session you learn what practices and patterns you can easily implement that can help you better understand your log files. You see how you can customize web logs to add more information to them, how to digest logs from around your infrastructure, and how to analyze your log files in near real time.
8. DECEMBER 2012
1.5 Billion page views
$117 Million of goods sold
6 Million items sold
Items by anjaysdesigns, betwixxt, OneStarLeatherGoods, mediumcontrol, TheDesignPallet http://www.etsy.com/blog/news/2013/etsy-statistics-december-2012-weather-report/
11. Continuous delivery is a pattern language in growing use in software development to improve the process of software delivery. Techniques such as automated testing, continuous integration, and continuous deployment allow software to be developed to a high standard and easily packaged and deployed to test environments, resulting in the ability to rapidly, reliably and repeatedly push out enhancements and bug fixes to customers at low risk and with minimal manual overhead.
~ Wikipedia
credit: Stewart, redgen (flickr)
14. Then vs. Now
Then: 2009, just before we started using CD
Now: 2010-today
15. Then vs. Now
Then: 6-14 hours; a “Deployment Army”; a special event, highly orchestrated
Now: 15 mins; 1 person; part of everyday workflow
16. Then vs. Now
Then: blocked for 6-14 hours; 6+ hours to redeploy
Now: blocked for 15 minutes; 15 minutes to redeploy
17. Then vs. Now
Then: release branch, database schemas, data transforms, packaging, rolling restarts, cache purging, scheduled downtime
Now: mainline, minimal linking and building, rsync, site up
33. What’s in a deploy?
Small incremental changes to the application
New classes, methods, controllers
Graphics, stylesheets, templates
Copy/content changes
Turning flags on, off, or % ramp up
34. Low MTTR (response times)
Latent bugs and security holes
Traffic management, load shedding
Adding and removing infrastructure
Tweaking config flags or releasing patches.
43. Our web application is largely monolithic.
Etsy.com, Support & Back-office tools,
Developer API, Gearman (async work)
44. Our web application is largely monolithic.
Etsy.com, Support & Back-office tools,
Developer API, Gearman (async work)
PHP, Apache, Memcache
45. External “services” are not deployed with
the main application.
e.g. Databases, Search, Photo storage, Payments
46. External “services” are not deployed with
the main application.
e.g. Databases, Search, Photo storage, Payments
MYSQL (schema changes)
PCI PROXY CACHE (controlled access)
FILERS, AMAZON S3 (specialized infra.)
SOLR, JVM (rolling restarts)
47. For every config flag, there are two states
we can support — present and future.
48. For every config flag, there are two states
we can support — present and future.
... or past and present.
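The slides don't show the flag mechanism itself, so here is a minimal sketch of what a flag check supporting both states might look like, assuming a hypothetical $config array and helper function (not Etsy's actual code):

```php
<?php
// Hypothetical flag store (not Etsy's actual config system): in practice the
// flag values live in config that can be changed independently of feature code.
$config = ['read_prefs_from_users_table' => 'off'];

function flag_is_on(array $config, string $flag): bool
{
    return ($config[$flag] ?? 'off') === 'on';
}

// Both states ship in the same deploy; flipping the flag later needs no new code.
if (flag_is_on($config, 'read_prefs_from_users_table')) {
    echo "future path: read prefs from the new columns on users\n";
} else {
    echo "present path: read prefs from the legacy user_prefs table\n";
}
```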
51. RULE OF THUMB:
Prefer ADDs over ALTERs (non-breaking expansion)
52. 1. Write to both versions
2. Backfill historical data
3. Read from new version
4. Cut-off writes to old version
53. 0. Add new version to schema
1. Write to both versions
2. Backfill historical data
3. Read from new version
4. Cut-off writes to old version
54. 0. Add new version to schema
Schema change to add prefs columns to “users” table.
“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “off”
“read_prefs_from_users_table” => “off”
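The DDL itself isn't shown in the slides; a minimal sketch of step 0, assuming two made-up pref columns and MySQL, could look like this. Per the "prefer ADDs over ALTERs" rule of thumb, it only ADDs nullable columns, so code that has never heard of them keeps working. The column names and the MIGRATION_DSN environment variable are illustrative:

```php
<?php
// Hypothetical step-0 migration (illustrative names, not Etsy's real schema).
// Non-breaking expansion: only ADD nullable columns to the existing table.
$ddl = "ALTER TABLE users
          ADD COLUMN pref_language VARCHAR(8)  NULL DEFAULT NULL,
          ADD COLUMN pref_timezone VARCHAR(32) NULL DEFAULT NULL";

if ($dsn = getenv('MIGRATION_DSN')) {
    // Apply the change when a connection string is provided ...
    $pdo = new PDO($dsn, getenv('DB_USER') ?: null, getenv('DB_PASS') ?: null);
    $pdo->exec($ddl);
} else {
    // ... otherwise just print it for review.
    echo $ddl . "\n";
}
```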
55. 1. Write to both versions
Write code for writing prefs to the “users” table.
“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “off”
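A rough sketch of step 1, assuming hypothetical table, column, and flag names: while the migration is in flight, every prefs write goes to both the old and the new location, each guarded by its own flag so either side can be switched off independently:

```php
<?php
// Step 1 sketch (hypothetical names): dual-write behind two separate flags.
function save_prefs(PDO $db, array $config, int $userId, string $language): void
{
    if (($config['write_prefs_to_user_prefs_table'] ?? 'off') === 'on') {
        // Old location: dedicated user_prefs table.
        $stmt = $db->prepare(
            'INSERT INTO user_prefs (user_id, pref_language) VALUES (?, ?)
             ON DUPLICATE KEY UPDATE pref_language = VALUES(pref_language)'
        );
        $stmt->execute([$userId, $language]);
    }

    if (($config['write_prefs_to_users_table'] ?? 'off') === 'on') {
        // New location: the columns added to users in step 0.
        $stmt = $db->prepare('UPDATE users SET pref_language = ? WHERE id = ?');
        $stmt->execute([$language, $userId]);
    }
}
```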
56. 2. Backfill historical data
Offline process to sync existing data from “user_prefs”
to new columns in “users”
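The talk doesn't show the backfill code; one way step 2 could be sketched, with hypothetical names, is an offline, batched job that copies existing rows from user_prefs into the new columns on users. Small batches keep lock time and replication lag manageable:

```php
<?php
// Step 2 sketch (hypothetical names): batched offline backfill of historical data.
function backfill_prefs(PDO $db, int $batchSize = 1000): void
{
    $lastId = 0;
    do {
        $stmt = $db->prepare(
            'SELECT user_id, pref_language FROM user_prefs
              WHERE user_id > ? ORDER BY user_id LIMIT ' . $batchSize
        );
        $stmt->execute([$lastId]);
        $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

        foreach ($rows as $row) {
            $update = $db->prepare('UPDATE users SET pref_language = ? WHERE id = ?');
            $update->execute([$row['pref_language'], $row['user_id']]);
            $lastId = (int) $row['user_id'];
        }

        usleep(50000); // brief pause between batches (an assumption, not from the talk)
    } while (count($rows) === $batchSize);
}
```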
57. 3. Read from new version
Data validation tests. Ensure consistency both internally
and in production.
“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “staff”
58. 3. Read from new version
Data validation tests. Ensure consistency both internally
and in production.
“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “1%”
59. 3. Read from new version
Data validation tests. Ensure consistency both internally
and in production.
“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “5%”
60. 3. Read from new version
Data validation tests. Ensure consistency both internally
and in production.
“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “on”
(“on” == “100%”)
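The slides show the flag moving through “staff”, “1%”, “5%”, and “on”, but not how those values are interpreted. One possible (hypothetical) reading of such a flag, bucketing users by id so each user stays on a consistent side of the ramp:

```php
<?php
// Step 3 sketch (hypothetical): interpret "off", "staff", "N%", or "on".
function read_from_new_table(array $config, int $userId, bool $isStaff): bool
{
    $value = $config['read_prefs_from_users_table'] ?? 'off';

    if ($value === 'on')    { return true; }
    if ($value === 'off')   { return false; }
    if ($value === 'staff') { return $isStaff; }

    // Percentage ramp: deterministic bucket by user id, so a given user
    // doesn't flip between old and new reads from request to request.
    if (preg_match('/^(\d+)%$/', $value, $m)) {
        return ($userId % 100) < (int) $m[1];
    }
    return false;
}

var_dump(read_from_new_table(['read_prefs_from_users_table' => '5%'], 3, false));
// bool(true): user 3 falls in the first 5% of buckets
```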
61. 4. Cut-off writes to old version
After running on the new table for a significant amount
of time, we can cut off writes to the old table.
“write_prefs_to_user_prefs_table” => “off”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “on”
62. “Branch by Abstraction”
[Diagram: controllers talk only to the Users Model (abstraction), which sits in front of both the old schema (“users” (old), “user_prefs”) and the new schema (“users”).]
http://paulhammant.com/blog/branch_by_abstraction.html
http://continuousdelivery.com/2011/05/make-large-scale-changes-incrementally-with-branch-by-abstraction/
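A minimal sketch of what the abstraction layer in the diagram could look like, with hypothetical class, table, and flag names (not Etsy's actual model code):

```php
<?php
// Branch-by-abstraction sketch: controllers depend only on UsersModel, and the
// config flag decides whether a read is served by the old user_prefs table or
// by the new columns on users.
class UsersModel
{
    /** @var PDO */
    private $db;

    /** @var array */
    private $config;

    public function __construct(PDO $db, array $config)
    {
        $this->db = $db;
        $this->config = $config;
    }

    public function getPrefLanguage(int $userId): ?string
    {
        $useNew = ($this->config['read_prefs_from_users_table'] ?? 'off') === 'on';
        $sql = $useNew
            ? 'SELECT pref_language FROM users WHERE id = ?'             // new schema
            : 'SELECT pref_language FROM user_prefs WHERE user_id = ?';  // old schema

        $stmt = $this->db->prepare($sql);
        $stmt->execute([$userId]);
        $value = $stmt->fetchColumn();

        return $value === false ? null : (string) $value;
    }
}
```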
63. “The Migration 4-Step”
1. Write to both versions
2. Backfill historical data
3. Read from new version
4. Cut-off writes to old version
65. We might remove config flags for the old version when...
It is no longer valid for the business.
It is no longer stable, maintained, or trusted.
It has poor performance characteristics.
The code is a mess, or difficult to read.
We can afford to spend time on it.
78. “Where a new system concept or new technology is used, one has to build a system to throw away, for even the best planning is not so omniscient as to get it right the first time. Hence plan to throw one away; you will, anyhow.”
~ Fred Brooks, The Mythical Man-Month
94. More database servers in prod.
Bigger database hardware in prod.
More web servers.
Various replication schemes.
Different versions of server and OS software.
Schema changes applied at different times.
Physical hardware in prod.
More data in prod.
Legacy data (7 years of odd user states).
More traffic in prod.
Wait, I mean MUCH more traffic in prod.
Fewer elves.
Faster disks (SSDs) in prod.
95. Using a MySQL database in dev for an application that will be running
on Oracle in production: Priceless
106. SERVER METRICS
Apache requests/sec, Busy processes,
CPU utilization, Script exec time (med. & 95th)
APPLICATION METRICS
Logins, Registrations, Checkouts,
Listings created, Forum posts
Time and event correlated.
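The slides name the metrics but not the collection pipeline. As a purely hypothetical illustration, application events like logins and checkouts can be counted with a fire-and-forget UDP packet in a StatsD-style "name:value|type" format, which is cheap enough to emit from the request path:

```php
<?php
// Hypothetical application-metric emitter (the talk doesn't specify the tooling):
// best-effort UDP, so counting an event never slows down or fails the request.
function count_event(string $metric, int $value = 1): void
{
    $socket = @fsockopen('udp://127.0.0.1', 8125, $errno, $errstr, 0.1);
    if ($socket === false) {
        return; // metrics are best-effort; drop silently if the collector is away
    }
    fwrite($socket, "{$metric}:{$value}|c");
    fclose($socket);
}

// e.g. alongside the business logic for the events named on the slide:
count_event('registrations');
count_event('checkouts');
```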
113. Tighten your feedback cycles
Integrate with production and validate early in cycle.
Use tools that allow you to detect issues early.
Optimize for quick response times.
Applied to both feature development and operability.
114. Thank you
... and questions?
These slides will be available later today at http://mikebrittain.com/talks
Mike Brittain
ENGINEERING DIRECTOR
@mikebrittain
mike@etsy.com