In the startup world, the most pressing issue usually isn't "how do we prepare for a once-in-a-lifetime storm?" Risk management is always stressed at Velocity, but businesses don't always buy into dedicating resources to low-probability events. Let our comedy of errors during Hurricane Sandy help convince your bosses that infrastructure and redundancy are actually worthwhile investments.
How we survived Hurricane Sandy
1. How we survived Hurricane Sandy
a look at what not to do
Dan White
@airplanedan
Mike Zaic
VP, Engineering, CafeMom
@focustrate
Photo: TenSafeFrogs (flickr)
7. Architecture
• Hosting
– Single physical datacenter
– Redundancy? pffff
• Cloud Presence
– Limited to specific parts of the app
– No database required
Photo: Arthur40A (flickr)
20. Technical Takeaways
what we did
• Redundancies
• Code without data
• DNS propagation
• Cloud replication
– Codebase
– Database
Photo: Robert S. Donovan (flickr)
21. Technical Takeaways
next steps
• Architecting for degraded service?
• Simplify application?
• Second physical site?
• Geographic redundancy?
• Undoing "fixes" of failovers?
Photo: paul bica (flickr)
22. Overarching Problems
• Don't trust vendor/provider assurances!
• Managing expectations during an outage
• Feelings of helplessness
• Scrambling doesn't necessarily get things up faster
• Must know downtime tolerance
• Cost of DR planning vs. lost opportunity of outage
• DR planning is insurance
23. Know when to say when
• When is it appropriate to plan for DR?
• When is it appropriate to build for DR?
• When is it appropriate to retrofit DR for an existing app?
• When is it appropriate to dictate product requirements based on ease of DR?
• During an outage, when is it appropriate to do nothing (and just sleep)?
24. When's not a great time to think about DR?
During a disaster.
(unless you’re thinking about the next one)
25. Moving forward – inertia sucks!
• Change is tough
• Mature infrastructure/app resists drastic change
• Pace of development: inertia against stepping back and revisiting
• Pace of business: inertia against non-revenue projects
• No recent disasters: inertia against DR necessity
This conference is in NYC – assuming people remember “Hurricane Sandy” – end of October, last year?
This is what Manhattan looked like.
We work for CafeMom.com – Dan was the Senior Vice President of Technology at the time, Mike ran the development team.
Our datacenter was in lower Manhattan, and our site went down for about a week. We’re hoping that you can learn from some of our mistakes, and use us as a case study to bring to your organizations to emphasize the need for business continuity and disaster recovery.
Let’s get started.
Other site mentions
- MamasLatinas – launched Jan 2012
- The Stir / Que Mas
User base
- Rabid
- Expectation of Realtime
- Very Vocal
- Not shy about complaining – CHANGE IS BAD!!!
breadth of application
pieces are both loosely and tightly coupled
- lots of unrelated functionality
- reliance on common data (friends activity!)
- interconnected code
complex logic
Hosting:
“Who’s got multiple datacenters?”
Cloud presence:
* EC2, S3, Cloudfront
* Basically for offloading bits from the app
Resiliency:
- built into the application
- can handle outages at
- application
- server
- network device
- power circuit
- internet pipe
Sentence sums up prior mindset with regards to a single datacenter in NYC.
Almost 1 year ago – end of October, into November
Daily account
Frame in terms of:
1. Communication
2. Technical / Physical
Communication
10/29 9:07am - DG email: we're aware of the storm and all is good
Physical locations of Dan/Mike
5:46pm - DG says ready for the storm
hour later - site's down
communicated to DG
communicated to company
9:25pm - everything's back!
communicated to company...
... prematurely
9:43pm - out for good
let the company know that night
Why? Fuel pumps in flooded basement! Re-tell 9/11 stuff
Technical
Static EC2 fail page (see the sketch after this list)
Switch nameservers
caused problems
ssh access to production
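On the fail page above: a minimal stand-in for a storm-time maintenance page, using only Python's standard library. This is not the actual page we served from EC2; the wording and port are illustrative.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    PAGE = (b"<html><body><h1>We'll be back soon</h1>"
            b"<p>The site is temporarily offline due to storm-related outages.</p>"
            b"</body></html>")

    class FailPage(BaseHTTPRequestHandler):
        def do_GET(self):
            # A 503 plus Retry-After tells crawlers the outage is temporary.
            self.send_response(503)
            self.send_header("Content-Type", "text/html")
            self.send_header("Retry-After", "3600")
            self.end_headers()
            self.wfile.write(PAGE)

    if __name__ == "__main__":
        # Port 8080 for the sketch; a real deployment sits behind port 80 or a load balancer.
        HTTPServer(("", 8080), FailPage).serve_forever()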
Communication
From DG re: pumping water
Army Corps of Engineers help
Seemingly misleading/contradictory reports
Set up script to monitor DG page (see the sketch after this list)
Us to company throughout with updates
Communication with Gawker, eyes on ground - stuff is happening!
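The status-page monitor mentioned above was nothing fancy; roughly speaking it polled the provider's page and flagged changes. A sketch along those lines, with a placeholder URL rather than DG's real status page:

    import time
    import urllib.request

    STATUS_URL = "https://status.example-provider.com/"  # placeholder, not DG's real page
    last_body = None

    while True:
        try:
            body = urllib.request.urlopen(STATUS_URL, timeout=30).read()
        except OSError as exc:
            print("fetch failed:", exc)
            body = None
        if body is not None and body != last_body:
            # This is where you alert yourself (email, IM, whatever is still working).
            print("status page changed, go read it")
            last_body = body
        time.sleep(300)  # poll every 5 minutes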
Technical
Inbound email - cloud-hosted email, but still routed through the DG mail server, which meant no inbound external email... needed to repoint Postini directly at Office 365 (MX check sketch after this list)
Even after the switch, it took a while to unspool
We did get it working
thestir wordpress -- dns thestirlive.cafemom.com
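On the mail repoint above: a quick way to confirm the MX change has actually taken effect is to look up the live records. A small sketch using dnspython; the domain here is just an example.

    import dns.resolver  # dnspython 2.x: pip install dnspython

    # Example domain; the goal is to confirm mail now routes straight to the
    # hosted service instead of through the dead on-prem relay.
    answers = dns.resolver.resolve("cafemom.com", "MX")
    for record in sorted(answers, key=lambda r: r.preference):
        print(record.preference, record.exchange)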
Communication
Frequent DG updates
- Fiber optic cables severed?
- Some of our datacenters are up - we're not so bad
- Water still being pumped
- Mobile generator on the way
Lots of false hope:
- Only 2 feet of water left to pump
- Fiber connectivity good
- ETA on generator = 5pm tomorrow
Finally reached out to the team to make sure everyone was OK
San Diego contingent was ok
Get a statement from CEO
Continuous stream of idiotic emails
BI emails
Sales tickets
Reporting questions
"My mouse is broken"
Technical / Physical
Around midnight, Barry started driving into Manhattan for a snatch-and-grab
Technical
Barry got the server
- 7 floors, pitch black stairwell up AND down
- brought it home
- racked it on his coffee table
- started to copy codebase / config to EC2 (see the copy sketch below)
Start re-architecting app to work in the cloud
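The copy itself was ad hoc; something in the spirit of the sketch below, with made-up paths, hostname, and key name:

    import os
    import subprocess

    # Illustrative paths, hostname, and key name only.
    SRC = "/var/www/app/"
    DEST = "ec2-user@ec2-203-0-113-10.compute-1.amazonaws.com:/var/www/app/"
    KEY = os.path.expanduser("~/.ssh/recovery-key.pem")

    # Push the codebase and configs to the EC2 host over ssh.
    subprocess.run(
        ["rsync", "-az", "--delete", "-e", f"ssh -i {KEY}", SRC, DEST],
        check=True,
    )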
Communication
All generator based - stuck in traffic
ETA of 5pm reiterated up until 3pm
8:05pm find out that it's 100 miles away in stop and go traffic - but almost here!
BUT ... thank god the generator was delayed, because the cables didn't get there until after 10:30
Note: the time from when we started getting updates about the generator to when it actually arrived felt like an ETERNITY. AN ETERNITY.
Communication
All generator updates
Generator in traffic, 4 hours to go 13 blocks
12:45pm - another generator on its way
2pm - good thing we have another entering the city
- first one is unsafe, so we can't use it anyway
2:07pm - it arrived
took 7 hours for final connection
Technical
EC2
- servers set up, configs set up
- reverse engineered schemas, still no data
Communication
11/03 1:31am - DG site update: GENERATOR IS RUNNING!
11/03 8:00am - DG site update: EMERGENCY GENERATOR SHUTDOWN!
11/03 10:31am - DG site update: GENERATOR IS RUNNING AGAIN!
11/03 5:50pm - DG site update: EMERGENCY GENERATOR SHUTDOWN AGAIN!
11/03 11:20pm - DG site update: generator techs on site working to repair
IT: just go get servers
Technical / Physical
Sent 2 IT guys to be onsite, hands on
Started testing feasibility of using RDS => EC2 turned out better
Technical/Physical
Cesar/Eric go up 25 flights, grab 3 servers
(all dedicated backup slaves, so site integrity wasn't compromised if power was restored)
Set up shop at Barry's house
Get to work on bringing DBs to EC2
took hours
Messages table - huge
row chunks for bigger tables
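A sketch of what dumping a big table in id-range chunks can look like; the database name, table name, and sizes here are stand-ins, not our real values.

    import subprocess

    DB, TABLE = "communitydb", "messages"   # illustrative names
    CHUNK = 1_000_000                       # rows per dump file
    MAX_ID = 50_000_000                     # illustrative upper bound on the primary key

    for start in range(0, MAX_ID, CHUNK):
        where = f"id >= {start} AND id < {start + CHUNK}"
        outfile = f"{TABLE}_{start:09d}.sql.gz"
        # Dump one id range, gzip it, and write it out for transfer.
        dump = subprocess.Popen(
            ["mysqldump", "--single-transaction", "--no-create-info",
             "--where", where, DB, TABLE],
            stdout=subprocess.PIPE,
        )
        with open(outfile, "wb") as fh:
            subprocess.run(["gzip", "-c"], stdin=dump.stdout, stdout=fh, check=True)
        dump.wait()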
Communication
Generator not just broken, but irreparably broken. Needed to be replaced
New generator onsite within an hour and a half
Up and running at 10:16am
Site came back mid-afternoon
Issues once it gets up?
- DB corruption, fixing slaves
- Priming (see the warm-up sketch below)
- Switching DNS back over a lower TTL (300s)
- Fresh cache
- Some slaves missing (on Barry’s kitchen counter)
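On the priming item above: warming caches before pointing real traffic back at the site can be as simple as replaying the hottest URLs. A minimal sketch, with a hypothetical URL list rather than our actual hot pages:

    import requests

    # Hypothetical hot pages; the real list would come from traffic logs.
    HOT_URLS = [
        "http://www.cafemom.com/",
        "http://thestir.cafemom.com/",
    ]

    for url in HOT_URLS:
        try:
            resp = requests.get(url, timeout=10)
            print(url, resp.status_code, len(resp.content))
        except requests.RequestException as exc:
            print(url, "failed:", exc)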
11/04 6:31pm - @cafemom officially tweets that the site is back up. They have a contest with prizes to try and get people coming back.
For purposes of this talk, we’ll cut it off here. Lingering issues for a few days.
This is what our traffic looked like, per GA.
So what did we learn?
We’ll look at:
- the tech end of things
- the business end of things
- inertia
In Hindsight
Redundancies
Servers removed from the architecture weren't critical... no impact to site infrastructure integrity when power was restored
Some servers suffered damage, but redundancies meant we were OK
Code without data
code without data doesn't help for most of CafeMom/MamasLatinas (The Stir/Que Mas functionality could have been replicated, but with no post history)
DNS propagation
a strategy to avoid DNS propagation delays would have been nice (reverse proxy?)
DG -> GoDaddy -> Route 53
Watch your TTLs!
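Lowering TTLs ahead of any planned cutover is cheap insurance. A sketch of dropping a record to a 300-second TTL in Route 53 via boto3; the hosted zone id, record name, and address are placeholders.

    import boto3

    route53 = boto3.client("route53")

    # Placeholder hosted zone id, record name, and address.
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000000000",
        ChangeBatch={
            "Comment": "lower TTL ahead of a planned cutover",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "www.example.com.",
                    "Type": "A",
                    "TTL": 300,
                    "ResourceRecords": [{"Value": "203.0.113.10"}],
                },
            }],
        },
    )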
Cloud replication
- Codebase
did DB and full request load testing after we had data
- Database
set up an EC2-based slave for a realtime offsite DB
had to determine if RDS could give the required I/O performance (dedicated high IOPS could)... BUT RDS couldn't be set up as a slave of an external master
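For the EC2-based slave, pointing a self-managed MySQL instance at the on-prem master is standard MySQL 5.x replication. A sketch of the replica side using mysql-connector-python; the hosts, credentials, and binlog coordinates are placeholders, not our real values.

    import mysql.connector  # pip install mysql-connector-python

    # Placeholder connection details for the EC2-hosted replica.
    replica = mysql.connector.connect(host="10.0.0.5", user="root", password="CHANGE_ME")
    cur = replica.cursor()

    # Point the replica at the on-prem master, starting from the binlog position
    # recorded when the dump was taken (all values here are illustrative).
    cur.execute(
        "CHANGE MASTER TO"
        " MASTER_HOST='master.example.com',"
        " MASTER_USER='repl',"
        " MASTER_PASSWORD='CHANGE_ME',"
        " MASTER_LOG_FILE='mysql-bin.000123',"
        " MASTER_LOG_POS=4"
    )
    cur.execute("START SLAVE")
    replica.close()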
Next Steps
investigated ASAP hosting (e.g., Rackspace) – would have taken too long
once service was restored, shifted the investigation to an alternate primary datacenter and potential secondary physical DR hosting
any potential to simplify the application for easier DR planning? (maybe a less realtime, read-only version for nonmembers and emergencies?)
- HOW do you pick? Who decides?
How do you back out of the failover if you have to use it?
All the scrambling during the outage didn't get us up faster; it just got us an EC2-hosted fail page and the WordPress sites (after the DNS nameserver changes finally propagated)
Scrambling could have caused more issues if costly "undoing" had been necessary (bringing servers back to DG, repointing DNS, undoing workarounds, etc.)
managing expectations and keeping people informed of (lack of) progress, especially when you're operating in the dark and relaying poor information
POOR information in general.
Gawker was on site, but the details weren’t especially helpful other than just knowing that people were, in fact, doing things
After the retrofit question:
Building DR into a new application is easier than retrofitting an existing one
The more complex the application, the more complex the DR solution usually is
Know when to say when (no sleep for no reason?) – should we have exhausted ourselves, potentially opening ourselves up to big mistakes, when we knew we couldn't do anything to help the current situation?
Last day:
2 concurrent paths
Dan and I working on different problems.
Dan: working on getting the site up
Me: working on having a fallback
Here we are, a year later – and not much has changed beyond what happened in the weeks right after Sandy.
We finally put in a ticket for a fully fleshed-out DR site on March 4 of this year – as you can see above. It's normal priority, with no due date, and has sat there because it's never been the same priority as it was right after the storm. This screenshot is from this past weekend.
We are making progress, though. We have a fallback solution, and we’re meeting this week to talk specifically about DR. We’re building out our engineering some, so we might have the resources to allocate to it.
We’re taking baby steps.
Conclude with:
We’re hoping that you can learn from some of our mistakes, and use us as a case study to bring to your organizations to emphasize the need for business continuity and disaster recovery.
Thanks for listening.
We’d like to open the floor up for questions.
Thanks!