In the startup world, the most pressing issue usually isn't "how do we prepare for a once-in-a-lifetime storm?" Risk management is always stressed at Velocity, but businesses don't always buy into dedicating resources to low-probability events. Let our comedy of errors during Hurricane Sandy help convince your bosses that infrastructure and redundancy are actually worthwhile investments.
How we survived Hurricane Sandy
1. How we survived Hurricane Sandy
a look at what not to do
Dan White
@airplanedan
Mike Zaic
VP, Engineering, CafeMom
@focustrate
Photo: TenSafeFrogs (flickr)
7. Architecture
• Hosting
– Single physical datacenter
– Redundancy? pffff
• Cloud Presence
– Limited to specific parts of the app
– No database required
Photo: Arthur40A (flickr)
20. Technical Takeaways
what we did
• Redundancies
• Code without data
• DNS propagation
• Cloud replication
– Codebase
– Database
Photo: Robert S. Donovan (flickr)
21. Technical Takeaways
next steps
• Architecting for degraded service?
• Simplify application?
• Second physical site?
• Geographic redundancy?
• Undoing "fixes" of failovers?
Photo: paul bica (flickr)
22. Overarching Problems
• Don't trust vendor/provider assurances!
• Managing expectations during an outage
• Feelings of helplessness
• Scrambling doesn't necessarily get things up faster
• Must know downtime tolerance
• Cost of DR planning vs. lost opportunity of outage
• DR planning is insurance
23. Know when to say when
• When is it appropriate to plan for DR?
• When is it appropriate to build for DR?
• When is it appropriate to retrofit DR for an existing app?
• When is it appropriate to dictate product requirements based on ease of DR?
• During an outage, when is it appropriate to do nothing (and just sleep)?
24. When's not a great time to think about DR?
During a disaster.
(unless you’re thinking about the next one)
25. Moving forward – inertia sucks!
• Change is tough
• Mature infrastructure/app resists drastic change
• Pace of development: inertia against stepping back and revisiting
• Pace of business: inertia against non-revenue projects
• No recent disasters: inertia against DR necessity
This conference is in NYC – assuming people remember “Hurricane Sandy” – end of October, last year?
This is what Manhattan looked like.
We work for CafeMom.com – Dan was the Senior Vice President of Technology at the time, Mike ran the development team.
Our datacenter was in lower Manhattan, and our site went down for about a week. We’re hoping that you can learn from some of our mistakes, and use us as a case study to bring to your organizations to emphasize the need for business continuity and disaster recovery.
Let’s get started.
Other site mentions
- MamasLatinas – launched Jan 2012
- The Stir / Que Mas
User base
- Rabid
- Expectation of Realtime
- Very Vocal
- Not shy about complaining – CHANGE IS BAD!!!
breadth of application
pieces are both loosely and tightly coupled
- lots of unrelated functionality
- reliance on common data (friends activity!)
- interconnected code
complex logic
Hosting:
“Who’s got multiple datacenters?”
Cloud presence:
* EC2, S3, Cloudfront
* Basically for offloading bits from the app
Resiliency:
- built into the application
- can handle outages at
- application
- server
- network device
- power circuit
- internet pipe
Sentence sums up prior mindset with regards to a single datacenter in NYC.
Almost 1 year ago – end of October, into November
Daily account
Frame in terms of:
1. Communication
2. Technical / Physical
Communication
10/29 9:07am - DG email: we're aware of the storm and all is good
Physical locations of Dan/Mike
5:46pm - DG says ready for the storm
hour later - site's down
communicated to DG
communicated to company
9:25pm - everything's back!
communicated to company...
... prematurely
9:43pm - out for good
let the company know that night
Why? Fuel pumps in flooded basement! Re-tell 9/11 stuff
Technical
Static EC2 fail page (see the sketch after this list)
Switch nameservers
caused problems
ssh access to production
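On the fail page above: a minimal stand-in for a storm-time maintenance page, using only Python's standard library. This is not the actual page we served from EC2; the wording and port are illustrative.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    PAGE = (b"<html><body><h1>We'll be back soon</h1>"
            b"<p>The site is temporarily offline due to storm-related outages.</p>"
            b"</body></html>")

    class FailPage(BaseHTTPRequestHandler):
        def do_GET(self):
            # A 503 plus Retry-After tells crawlers the outage is temporary.
            self.send_response(503)
            self.send_header("Content-Type", "text/html")
            self.send_header("Retry-After", "3600")
            self.end_headers()
            self.wfile.write(PAGE)

    if __name__ == "__main__":
        # Port 8080 for the sketch; a real deployment sits behind port 80 or a load balancer.
        HTTPServer(("", 8080), FailPage).serve_forever()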
Communication
From DG re: pumping water
Army Corps of Engineers help
Seemingly misleading/contradictory reports
Set up script to monitor DG page (see the sketch after this list)
Us to company throughout with updates
Communication with Gawker, eyes on ground - stuff is happening!
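The status-page monitor mentioned above was nothing fancy; roughly speaking it polled the provider's page and flagged changes. A sketch along those lines, with a placeholder URL rather than DG's real status page:

    import time
    import urllib.request

    STATUS_URL = "https://status.example-provider.com/"  # placeholder, not DG's real page
    last_body = None

    while True:
        try:
            body = urllib.request.urlopen(STATUS_URL, timeout=30).read()
        except OSError as exc:
            print("fetch failed:", exc)
            body = None
        if body is not None and body != last_body:
            # This is where you alert yourself (email, IM, whatever is still working).
            print("status page changed, go read it")
            last_body = body
        time.sleep(300)  # poll every 5 minutes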
Technical
Inbound email - cloud-hosted email, but still routed through the DG mail server, which meant no inbound external email... needed to repoint Postini directly at Office 365 (MX check sketch after this list)
Even after the switch, it took a while to unspool
We did get it working
thestir wordpress -- dns thestirlive.cafemom.com
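On the mail repoint above: a quick way to confirm the MX change has actually taken effect is to look up the live records. A small sketch using dnspython; the domain here is just an example.

    import dns.resolver  # dnspython 2.x: pip install dnspython

    # Example domain; the goal is to confirm mail now routes straight to the
    # hosted service instead of through the dead on-prem relay.
    answers = dns.resolver.resolve("cafemom.com", "MX")
    for record in sorted(answers, key=lambda r: r.preference):
        print(record.preference, record.exchange)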
Communication
Frequent DG updates
- Fiber optic cables severed?
- Some of our datacenters are up - we're not so bad
- Water still being pumped
- Mobile generator on the way
Lots of false hope:
- Only 2 feet of water left to pump
- Fiber connectivity good
- ETA on generator = 5pm tomorrow
Finally reached out to the team to make sure everyone was OK
San Diego contingent was ok
Get a statement from CEO
Continuous stream of idiotic emails
BI emails
Sales tickets
Reporting questions
"My mouse is broken"
Technical / Physical
Around midnight, Barry started driving into Manhattan for a snatch-and-grab
Technical
Barry got the server
- 7 floors, pitch black stairwell up AND down
- brought it home
- racked it on his coffee table
- started to copy codebase / config to EC2 (see the copy sketch below)
Start re-architecting app to work in the cloud
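The copy itself was ad hoc; something in the spirit of the sketch below, with made-up paths, hostname, and key name:

    import os
    import subprocess

    # Illustrative paths, hostname, and key name only.
    SRC = "/var/www/app/"
    DEST = "ec2-user@ec2-203-0-113-10.compute-1.amazonaws.com:/var/www/app/"
    KEY = os.path.expanduser("~/.ssh/recovery-key.pem")

    # Push the codebase and configs to the EC2 host over ssh.
    subprocess.run(
        ["rsync", "-az", "--delete", "-e", f"ssh -i {KEY}", SRC, DEST],
        check=True,
    )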
Communication
All generator based - stuck in traffic
ETA of 5pm reiterated up until 3pm
8:05pm find out that it's 100 miles away in stop and go traffic - but almost here!
BUT ... thank god the generator was delayed, because the cables didn't get there until after 10:30
Note: the time from when we started getting updates about the generator to when it actually arrived felt like an ETERNITY. AN ETERNITY.
Communication
All generator updates
Generator in traffic, 4 hours to go 13 blocks
12:45pm - another generator on its way
2pm - good thing we have another entering the city
- first one is unsafe, so we can't use it anyway
2:07pm - it arrived
took 7 hours for final connection
Technical
EC2
- servers set up, configs set up
- reverse engineered schemas, still no data
Communication
11/03 1:31am - DG site update: GENERATOR IS RUNNING!
11/03 8:00am - DG site update: EMERGENCY GENERATOR SHUTDOWN!
11/03 10:31am - DG site update: GENERATOR IS RUNNING AGAIN!
11/03 5:50pm - DG site update: EMERGENCY GENERATOR SHUTDOWN AGAIN!
11/03 11:20pm - DG site update: generator techs on site working to repair
IT: just go get servers
Technical / Physical
Sent 2 IT guys to be onsite, hands on
Started testing feasibility of using RDS => EC2 turned out better
Technical/Physical
Cesar/Eric go up 25 flights, grab 3 servers
(all dedicated backup slaves, so site integrity wasn't compromised if power was restored)
Set up shop at Barry's house
Get to work on bringing DBs to EC2
took hours
Messages table - huge
row chunks for bigger tables
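A sketch of what dumping a big table in id-range chunks can look like; the database name, table name, and sizes here are stand-ins, not our real values.

    import subprocess

    DB, TABLE = "communitydb", "messages"   # illustrative names
    CHUNK = 1_000_000                       # rows per dump file
    MAX_ID = 50_000_000                     # illustrative upper bound on the primary key

    for start in range(0, MAX_ID, CHUNK):
        where = f"id >= {start} AND id < {start + CHUNK}"
        outfile = f"{TABLE}_{start:09d}.sql.gz"
        # Dump one id range, gzip it, and write it out for transfer.
        dump = subprocess.Popen(
            ["mysqldump", "--single-transaction", "--no-create-info",
             "--where", where, DB, TABLE],
            stdout=subprocess.PIPE,
        )
        with open(outfile, "wb") as fh:
            subprocess.run(["gzip", "-c"], stdin=dump.stdout, stdout=fh, check=True)
        dump.wait()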
Communication
Generator not just broken, but irreparably broken. Needed to be replaced
New generator onsite within an hour and a half
Up and running at 10:16am
Site came back mid-afternoon
Issues once it gets up?
- DB corruption, fixing slaves
- Priming (see the warm-up sketch below)
- Switching DNS back over a lower TTL (300s)
- Fresh cache
- Some slaves missing (on Barry’s kitchen counter)
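On the priming item above: warming caches before pointing real traffic back at the site can be as simple as replaying the hottest URLs. A minimal sketch, with a hypothetical URL list rather than our actual hot pages:

    import requests

    # Hypothetical hot pages; the real list would come from traffic logs.
    HOT_URLS = [
        "http://www.cafemom.com/",
        "http://thestir.cafemom.com/",
    ]

    for url in HOT_URLS:
        try:
            resp = requests.get(url, timeout=10)
            print(url, resp.status_code, len(resp.content))
        except requests.RequestException as exc:
            print(url, "failed:", exc)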
11/04 6:31pm - @cafemom officially tweets that the site is back up. They have a contest with prizes to try and get people coming back.
For purposes of this talk, we’ll cut it off here. Lingering issues for a few days.
This is what our traffic looked like, per GA.
So what did we learn?
We’ll look at:
- the tech end of things
- the business end of things
- inertia
In Hindsight
Redundancies
Servers removed from the architecture weren't critical... no impact to site infrastructure integrity when power was restored
Some servers suffered damage, but redundancies meant we were OK
Code without data
code without data doesn't help for most of CafeMom/MamasLatinas (The Stir/Que Mas functionality could have been replicated, but with no post history)
DNS propagation
a strategy to avoid DNS propagation delays would have been nice (reverse proxy?)
DG -> GoDaddy -> Route 53
Watch your TTLs!
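Lowering TTLs ahead of any planned cutover is cheap insurance. A sketch of dropping a record to a 300-second TTL in Route 53 via boto3; the hosted zone id, record name, and address are placeholders.

    import boto3

    route53 = boto3.client("route53")

    # Placeholder hosted zone id, record name, and address.
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000000000",
        ChangeBatch={
            "Comment": "lower TTL ahead of a planned cutover",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "www.example.com.",
                    "Type": "A",
                    "TTL": 300,
                    "ResourceRecords": [{"Value": "203.0.113.10"}],
                },
            }],
        },
    )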
Cloud replication
- Codebase
did DB and full request load testing after we had data
- Database
set up an EC2-based slave for a realtime offsite DB
had to determine if RDS could give the required I/O performance (dedicated high IOPS could)... BUT RDS couldn't be set up as a slave of an external master
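For the EC2-based slave, pointing a self-managed MySQL instance at the on-prem master is standard MySQL 5.x replication. A sketch of the replica side using mysql-connector-python; the hosts, credentials, and binlog coordinates are placeholders, not our real values.

    import mysql.connector  # pip install mysql-connector-python

    # Placeholder connection details for the EC2-hosted replica.
    replica = mysql.connector.connect(host="10.0.0.5", user="root", password="CHANGE_ME")
    cur = replica.cursor()

    # Point the replica at the on-prem master, starting from the binlog position
    # recorded when the dump was taken (all values here are illustrative).
    cur.execute(
        "CHANGE MASTER TO"
        " MASTER_HOST='master.example.com',"
        " MASTER_USER='repl',"
        " MASTER_PASSWORD='CHANGE_ME',"
        " MASTER_LOG_FILE='mysql-bin.000123',"
        " MASTER_LOG_POS=4"
    )
    cur.execute("START SLAVE")
    replica.close()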
Next Steps
investigated ASAP hosting (e.g., Rackspace) – would have taken too long
once service was restored, shifted the investigation to an alternate primary datacenter and potential secondary physical DR hosting
any potential to simplify the application for easier DR planning? (maybe a less realtime, read-only version for nonmembers and emergencies?)
- HOW do you pick? Who decides?
How do you back out of the failover if you have to use it?
All the scrambling during the outage didn't get us up faster; it just got us an EC2-hosted fail page and the WordPress sites (after the DNS nameserver changes finally propagated)
Scrambling could have caused more issues if costly "undoing" had been necessary (bringing servers back to DG, repointing DNS, undoing workarounds, etc.)
managing expectations and keeping people informed of (lack of) progress, especially when you're operating in the dark and relaying poor information
POOR information in general.
Gawker was on site, but the details weren’t especially helpful other than just knowing that people were, in fact, doing things
After the retrofit question:
Building DR into a new application is easier than retrofitting an existing one
The more complex the application, the more complex the DR solution usually is
Know when to say when (no sleep for no reason?) – should we have exhausted ourselves, potentially opening ourselves up to big mistakes, when we knew we couldn't do anything to help the current situation?
Last day:
2 concurrent paths
Dan and I working on different problems.
Dan: working on getting the site up
Me: working on having a fallback
Here we are, a year later – and not much has changed beyond what happened in the weeks right after Sandy.
We finally put in a ticket for a fully fleshed-out DR site on March 4 of this year – as you can see above. It's normal priority, with no due date, and has sat there because it's never been the same priority as it was right after the storm. This screenshot is from this past weekend.
We are making progress, though. We have a fallback solution, and we’re meeting this week to talk specifically about DR. We’re building out our engineering some, so we might have the resources to allocate to it.
We’re taking baby steps.
Conclude with:
We’re hoping that you can learn from some of our mistakes, and use us as a case study to bring to your organizations to emphasize the need for business continuity and disaster recovery.
Thanks for listening.
We’d like to open the floor up for questions.
Thanks!