Go or No-Go
Operability and Contingency Planning
               John Allspaw, Etsy.com
Etsy as of now
Total Members:          over 5.7 million
Total Sellers:           over 400,000
Items Currently Listed:      6.5 million
Page Views per month:       775 million
Total $ sold (gross merchandise sales)
2010 = $179.4 million (through August)
New Features
Delivering Operable Software
Arch Review -> Development/Ops -> Go or No-Go -> Launch*
             Feedback Loop
CONTINUOUS DEPLOYMENT
             !=
deploying new features without
  coordination and planning
Operability Review
Contingency Checklist
Not An Innovative Idea
        http://en.wikipedia.org/wiki/Launch_status_check
10 minute get-together
    •   Product
    •   Development
    •   Operations
    •   Design
    •   Community
    •   Support
Consensus
Informally Codifies “OK”
             Dev               “We all understand/agree/
             Ops              accept that we are OK here!”
             Product
             Community
             Support



Buggy / Sloppy / Unfinished    |    Stable / Finished Enough For Launch    |    Perfect!
Yes
 or
No
Has the feature been tested enough to
deploy to production?


Is there any final functional QA still needed?
Is communication (blog post/forums/etc)
about the feature ready to go out with the
feature?
Does everyone know:


  when it will go live, and
  who will push the feature?
Has the feature been in production for staff
(or some other specific subset of the users)
already?


If not, could it have been?
Is it possible to dark launch this feature?



Will this feature be dark launched?
(or, has it already?)
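A dark launch runs the new code path against production traffic while users keep seeing the existing behavior. A minimal sketch in Python, assuming two hypothetical stand-ins, `render_current_page` and `load_coffee_favorites`, that are not taken from Etsy's codebase:

```python
import logging
import time

log = logging.getLogger("dark_launch")


def render_current_page(user_id):
    """Hypothetical stand-in for the existing, user-visible code path."""
    return f"profile page for user {user_id}"


def load_coffee_favorites(user_id):
    """Hypothetical stand-in for the new feature's backend work."""
    return ["espresso", "drip"]


def handle_request(user_id, dark_launch_enabled=True):
    """Serve the existing page; optionally exercise the new path in the dark."""
    if dark_launch_enabled:
        start = time.perf_counter()
        try:
            load_coffee_favorites(user_id)  # result is discarded on purpose
        except Exception:
            # A failure in the dark path must never reach the user.
            log.exception("dark-launched path failed")
        finally:
            log.info("dark path took %.4fs", time.perf_counter() - start)
    return render_current_page(user_id)
```

The point of the sketch is that the new path produces load and timing data before launch, and the user-facing response is the same whether the dark path is on, off, or failing.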
Is it possible to turn up this feature on a
percentage basis?


If so: will we?
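One common way to turn a feature up on a percentage basis is to hash each user into a stable bucket and compare it with the current ramp value. The sketch below is an assumption about how that could look, not a description of Etsy's implementation; `bucket_for`, `is_enabled`, and `staff_ids` are illustrative names.

```python
import hashlib


def bucket_for(user_id, feature):
    """Map a user to a stable bucket in [0, 100) for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id, feature, percent, staff_ids=frozenset()):
    """True if the feature is on for this user at the given ramp percentage."""
    if user_id in staff_ids:  # staff (or another chosen subset) see it first
        return True
    return bucket_for(user_id, feature) < percent
```

Because the bucket depends only on the feature name and user id, a user who has the feature at 10% still has it at 50%, and ramping back down to 0 turns it off for everyone except the staff subset.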
Does it involve any new infrastructure?
If so: are those pieces in monitoring and
metrics collection?



(this answer can’t be “no” before launch)
Do we have on/off switches for this feature?


If so: are those switches documented?


(this answer can’t be “no” before launch)
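One way to keep switches and their documentation from drifting apart is to refuse to register a switch without a description. This is a sketch under that assumption; `Switch`, `SwitchRegistry`, and the switch name used in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Switch:
    name: str
    description: str  # what turning it off does, in plain language
    owner: str        # who to ask before flipping it
    enabled: bool = True


class SwitchRegistry:
    """In-memory registry of on/off switches that requires documentation."""

    def __init__(self):
        self._switches = {}

    def register(self, name, description, owner, enabled=True):
        if not description.strip():
            raise ValueError(f"switch {name!r} needs a description")
        self._switches[name] = Switch(name, description, owner, enabled)

    def set(self, name, enabled):
        self._switches[name].enabled = enabled

    def is_on(self, name):
        return self._switches[name].enabled

    def documentation(self):
        """One line per switch, sorted by name, suitable for a wiki page."""
        ordered = sorted(self._switches.values(), key=lambda s: s.name)
        return "\n".join(f"{s.name} ({s.owner}): {s.description}" for s in ordered)
```

Usage: `registry.register("coffee_favorites_page", "Hides the page and shows an 'unavailable' message.", "ops")`, then `registry.set("coffee_favorites_page", False)` during an incident.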
Are all the leads (Dev, Ops, Product,
Community, Support, etc.) available for the
launch and in communication?


(this answer can’t be “no” before launch)
Is there a single and easy place for users to
report bugs or concerns about the feature?
(forum topic, etc.)
Have all leads agreed upon a post-launch
“it’s all DONE” time to declare the launch
was successful?
Have we done a Contingency Checklist™
and everyone reviewed it?


(this answer can’t be “no” before launch)
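The questions above fall into two groups: ones that inform the discussion, and ones marked "this answer can't be 'no' before launch." The mechanical part of that rule can be sketched as follows; the decision itself remains a consensus among the leads, and `Question` and `go_or_no_go` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Question:
    text: str
    answer: bool            # True for "yes", False for "no"
    blocking: bool = False  # True if "no" is not allowed before launch


def go_or_no_go(questions):
    """Return ("GO" or "NO-GO", list of blocking questions answered "no")."""
    blockers = [q.text for q in questions if q.blocking and not q.answer]
    return ("NO-GO" if blockers else "GO", blockers)


checklist = [
    Question("Is it possible to dark launch this feature?", answer=False),
    Question("Are the on/off switches documented?", answer=True, blocking=True),
    Question("Has everyone reviewed the Contingency Checklist?", answer=False, blocking=True),
]

decision, blockers = go_or_no_go(checklist)
```

A "no" on a non-blocking question is a talking point for the get-together; a "no" on a blocking question stops the launch until it is resolved.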
Contingency Checklist
“What could possibly go wrong?”

   “When it does go wrong,
       WTF will we do?!”
NOTE:
This is worked out BEFORE launch, normally by product and
development, involving others where needed.




                                    (when we have saner heads)
Issue
Likelihood
Comment(s)
Impact on Users
Engineering Response
Onsite Messaging
Forums
Blog
PR
Issue | Likelihood | Comment(s) | Impact on Users | Engineering Response | Onsite Messaging | Forums | Blog | PR
Example: Coffee!
AWESOME NEW FEATURE
 •   add coffee (like a tag) to your profile
 •   others can favorite coffees
 •   page showing all coffee favorites
 •   bulk-add coffees to your profile
 •   search people by their coffee
Issue
 What could possibly go wrong with the feature launched in
 production?




Example:
   “The Coffees-You’ve-Favorited page is too expensive.”
Likelihood
 How likely is this issue to come up?




Example:
   “Low to mid.”
Comment(s)
 Any extra info about this issue here.


Example:
   “Because of how we paginate the coffee favorites page, they are
   somewhat harder than normal favorites. If we do have to turn
   this off, we’re saying that we need to re-design it, or it needs
   to stay off until the initial burst of traffic from the launch has passed.”
Impact
 How much is this going to impact the experience of the feature, if
 it does become a concern?



Example:
   “High”
Engineering Response
 What will we do to mitigate the issue (e.g. can we gracefully
 degrade?)



Example:
   “Set disable_coffee_favorites_page = 1”
Onsite Messaging
 What is the messaging to the community in the forums/blog/etc.,
 if this needs graceful degradation?

Example:
   “‘The Coffee Favorites page is currently unavailable.’ Or, in the
   forums: ‘We’re working through some issues with displaying
   Coffee Favorites; we’ll let you know the status as time goes
   on.’”
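Graceful degradation here means the switch and the onsite message work together: when the switch is flipped, the page skips the expensive work and shows the agreed wording. A minimal sketch, assuming a plain dictionary as the config store and a hypothetical `fetch_favorites` query; only the flag name and the message text come from the example above.

```python
# Flag name taken from the Engineering Response example; the dict is an assumption.
config = {"disable_coffee_favorites_page": 0}

UNAVAILABLE_MESSAGE = "The Coffee Favorites page is currently unavailable."


def fetch_favorites(user_id):
    """Hypothetical stand-in for the expensive paginated query."""
    return ["espresso", "cold brew"]


def coffee_favorites_page(user_id):
    """Render the page, or the agreed message when the switch is flipped."""
    if config["disable_coffee_favorites_page"]:
        return UNAVAILABLE_MESSAGE  # expensive query is skipped entirely
    return ", ".join(fetch_favorites(user_id))
```

Setting `config["disable_coffee_favorites_page"] = 1` changes the response without a code deploy, which is what makes the contingency usable during an incident.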
PR
 Is the issue so severe that we need PR involved?


Example:
   “The CEO sends a press release, apologizing to Folger’s,
   Peet’s, and Starbucks with a witty yet calming voice of
   explanation and a humble request for patience.”
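Put together, the Coffee example forms one row of the checklist. The sketch below records it as structured data so gaps are easy to spot before the Go or No-Go meeting. `ContingencyRow` and `incomplete_fields` are illustrative names, and the choice of which columns count as required is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ContingencyRow:
    issue: str
    likelihood: str
    comments: str
    impact_on_users: str
    engineering_response: str
    onsite_messaging: str
    forums: str = ""
    blog: str = ""
    pr: str = ""


# Assumption: these columns must be filled in for every row.
REQUIRED = ("issue", "likelihood", "impact_on_users",
            "engineering_response", "onsite_messaging")


def incomplete_fields(row):
    """Names of required columns that are still empty."""
    return [name for name in REQUIRED if not getattr(row, name).strip()]


coffee_row = ContingencyRow(
    issue="The Coffees-You've-Favorited page is too expensive.",
    likelihood="Low to mid.",
    comments="Pagination makes coffee favorites harder than normal favorites.",
    impact_on_users="High",
    engineering_response="Set disable_coffee_favorites_page = 1",
    onsite_messaging="The Coffee Favorites page is currently unavailable.",
)
```

An empty result from `incomplete_fields(coffee_row)` means the row is ready for review; any names it returns are columns someone still has to think through.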
* afterwards...
* successful launch...


Metrics?
Are we there yet?
OMG! Who to call if it breaks later?
* non-successful launch...


Metrics?
What’d we miss?
Post Mortem?
Ramp down?
Photos
http://www.flickr.com/photos/jliba/3783269078/

http://www.flickr.com/photos/mybloodyself/2072928376/

http://www.flickr.com/photos/jacy/360020853/

http://www.flickr.com/photos/f-l-e-x/2319852529/

http://www.flickr.com/photos/16230215@N08/3023061528/

http://www.flickr.com/photos/proimos/4199675334/

http://www.flickr.com/photos/askal_bosch/2579320395/
