  1. Retrospective from a startup built in the cloud: top 3 big lessons from the AWS outage on 04.21.2011, plus 4,369 other smaller ones (5/27/2011)
  2. What a country: entrepreneurial resiliency
  3. (true story) “robust systems: highly fault-tolerant, on or off grid. e.g.: our culture wrt entrepreneurs, AWS, the BD API”
  4. Boom
  5. good to be home! Go Buffs
  6. me: previous startup: teams in 3 countries; highly transactional system; MS tech: IIS/MS SQL Server; co-located, leased/owned hardware; 0% in cloud; $75M/yearly rev
  7. me: current startup: systems 100% on AWS; 99% free/open-source software; standing on the shoulders of giants
  8. fault tolerance: 3 to 47 important failearnings and 4,369 less important ones
  9. in the context of our startup, of course; YMMV depending on velocity
  10. Ruger
  11. The Ruger Fault Equivalency: time = money; fault tolerance = time² - risk tolerance. Also known as: 'Fast, good and cheap: pick two'
  12. system design philosophy: leverage proven, open-source tech in the cloud to build a scalable, reliable, secure operational foundation quickly
  13. So how do you achieve the right level of fault tolerance in the cloud? 3 tenets
  14. Tenet #1: Scripted Repeatability; Tenet #2: SPOF Elimination; Tenet #3: Clear-Cut Communication
  15. who here has used AWS?
  16. Tenet #1: prepare a fault-tolerant foundation with scripted repeatability, aka automation
  17. from the start: script the non-interactive install of your tools and OS; custom AMI; Debian: great package management; based on Eric Hammond’s work: http://alestic.com/
  18. which will allow you to script the setup/tear-down of your stack
  19. which will allow you to script system tests: integrity (3-4K tests); performance (30-40K tests); load, capacity (2-4M requests)
  20. A/B system test results: MySQL Percona Upgrade
  21. That’s how 1 person set up and managed a network comprised of 90+/- server instances for 1.5 years, while serving various other roles, without having to leave their chair. try that with real hardware
  22. Tenet #2: SPOF Elimination. We don’t need no stinkin single points of failure.
  23. SPOF Examples: Cloud Provider; Region; Zone; Load Balancer; App Server; Database; Fred
  24. Cloud Provider fail-over? e.g. AWS -> Rackspace
  25. Region fail-over? e.g. useast -> uswest within AWS. Nah.
  26. Zone fail-over? Yes. (US-WEST, US-EAST)
  27. Zone fail-over best practices: are you using auto-scaling? No: distribute server instances evenly between 2 or more zones. Yes: trigger scaling on network I/O or custom metrics.
  28. Load-balancer (ELB), app server, database fail-over? Yes.
  29. So it’s actually all about reduction of the right SPOFs for your business context. Just adding the ability to fail-over and have backups within a region is huge! Probably enough for most. What about Fred?
  30. Tenet #3: Clear-Cut Communication. transparency is soooo 2010
  31. During an outage, communicating the right things at the right time: hard. But not that hard.
  32. Three Tenets Revisited: Tenet #1: Scripted Repeatability; Tenet #2: SPOF Elimination; Tenet #3: Clear-Cut Communication
  33. Notes
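The zone fail-over advice on slide 27 can be sketched in a few lines of Python. This is only an illustration, not BigDoor's code: the function names, zone names, metric names and thresholds are all assumptions. It shows an even spread of instances across zones, what capacity survives the loss of one zone, and why a CPU-based trigger may stay quiet for an efficient application while a network I/O trigger fires.

```python
def spread_across_zones(instance_count, zones):
    """Return {zone: count} so that per-zone counts differ by at most one."""
    zones = list(dict.fromkeys(zones))  # drop duplicate zone names, keep order
    if len(zones) < 2:
        raise ValueError("need at least two zones to survive a zone failure")
    base, extra = divmod(instance_count, len(zones))
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


def surviving_capacity(placement, failed_zone):
    """Number of instances still running if one zone disappears."""
    return sum(count for zone, count in placement.items() if zone != failed_zone)


def should_scale_out(metrics, trigger_metric, threshold):
    """True when the chosen trigger metric exceeds its threshold."""
    return metrics[trigger_metric] > threshold


if __name__ == "__main__":
    # Zone names are examples only.
    placement = spread_across_zones(6, ["us-east-1a", "us-east-1b", "us-east-1c"])
    print(placement)
    print(surviving_capacity(placement, "us-east-1a"))

    # Hypothetical sample: a resource-efficient app under heavy traffic.
    sample = {"cpu_percent": 35.0, "network_in_mb_per_s": 90.0}
    print(should_scale_out(sample, "cpu_percent", 70.0))          # CPU trigger stays quiet
    print(should_scale_out(sample, "network_in_mb_per_s", 60.0))  # network trigger fires
```

In a real deployment the placement and the trigger would be configured in the auto-scaling service itself; the sketch only makes the reasoning behind the two recommendations concrete.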


BigDoor's Jeff Malek Gluecon Presentation


Editor's Notes

  1. Nothing to see here, move along
  2. ‘What a country’: my dad always says this, and I like it. So, Brad Feld was in our offices recently and was asking how AWS was working out for us. I'd replied very much in the positive, with a few exceptions regarding their support services. That night at dinner Brad was talking about how resilient our culture is for entrepreneurs; how we can fail and retry here in the United States, doing things that folks might get strung up for in other countries. The following night, I found myself exploring analogies between that idea and computing systems, and wound up pulling out my phone and typing up a Twitter post.
  3. It went something like this. This was going to be the brilliant culmination of my Twitter career to date. I was almost ready to hit the send button when I started getting alerts from our systems. The alerts were appearing literally right above what I had written: ‘system DOWN’. Oh, the irony. Wish I had a screenshot from my phone.
  4. That was the evening of 4/20, morning of 4/21: the AWS outage. It lasted for a number of days; our API was intermittently affected for about 12 hours; that could have been mitigated. That outage totally sucked for so many reasons. I’m hoping that by sharing some of my experience with AWS, you’ll gain some insights that may help you prepare adequately. Also hoping that this can turn into a conversation toward the end, so you can share your experiences as well.
  5. So who am I? My name is Jeff Malek. I grew up here, folks still here; in fact today is their 43rd anniversary. Graduated from CU in 93 after 6 long years and a suspension, during which time I hitch-hiked around the country, winding up in Hawaii. Graduated, moved around, met some great friends, helped to start up a company.
  6. Was at Zango for 10 years, responsible for engineering, QA and product development teams distributed across three countries: 50+ people who built and maintained the high-transaction system that resulted in $75M yearly revenue at its peak. It leveraged the client-side software I wrote in the C programming language, which talked to backend systems built on Windows technology (IIS, MS SQL Server, etc.), which was sitting on co-located, purchased hardware.
  7. BigDoor: over 2 years old. Platform that powers game mechanics and social loyalty programs for digital communities. Free RESTful API that you can brand any way you want. Built in the cloud on AWS using 99.99% free/open-source software. Even after the outage, still a huge fan of AWS, generally very impressed with what they’ve built and their speed of innovation. After the outage some great folks from the AWS team visited us from across the street (just realized they were across the street). My team gave me shit for bringing sodas into the conference room for them (bitter). I just figured they’d get dry from all the explaining they were about to do. When was the last time you got a newsletter letting you know that a vendor’s pricing was going down? Funded by Foundry and Brad Feld in 2010, who as you know also do a lot to make this event happen. You guys are money.
  8. So that’s what I’m here to talk about: fault tolerance in the cloud, AWS and hopefully some API stuff of interest. I’ll share this presentation, including notes and supporting material, later on our blog at www.bigdoor.com.
  9. I want to talk about all of this in the context of our startup, of course. Ultimately the AWS outage didn’t result in any major changes to the way we do things. While there were a few smaller things that we bumped up the priority chain, there’s a certain level of risk that a startup is willing to live with.
  10. My girlfriend Jenny and I got Ruger as a puppy, right when BigDoor started. Raised him from a puppy while building out our operational infrastructure, working out of our house. So he’s kind of our mascot, and to help put things in context, I came up with a formula: the Ruger Fault Equivalency.
  11. IOW, given a low tolerance for risk, you can create a highly fault-tolerant system if you have a lot of time and/or money. That’s not BigDoor. Conversely, executing with a higher tolerance for risk gets you to market faster with less money, but with lower fault tolerance. For us, scalability is more important than extremely high fault tolerance. Startup = time^2 is low (little time and money). So, fun and interesting, but what does it mean in the context of BigDoor system design?
  12. I designed the BigDoor systems at a high level with this philosophy in mind. A bit more regarding our context: Django/Python; week-long sprints that end in production code release; 260G+ and growing transactional database, so still not that big; peak so far: 18MM API requests/day, so still a ways to go; response times need to be faster than 500ms.
  13. OK, given that context, how do you achieve the right amount of fault tolerance in the cloud? Three basic tenets, and in the context of the AWS outage: the first sets a foundation for fault tolerance; the second leverages the first to improve fault tolerance; and the third will help keep your customers around when you are in crisis mode, ultimately also improving fault tolerance.
  14. Scripted repeatability; SPOF elimination; Clear-Cut Communication
  15. Get count of audience who are using/have used AWS. Low count? Give more background around what the various services are/do. High count? Give less explanation around what things are, and ask for others' best practices.
  16. Nothing to see here, move along
  17. AMIs (Amazon Machine Images; install images; OS blueprints): these are used to start new server instances. Leverage pre-built AMIs. Debian has great package management; packages are verified and tested before making it into the main line, so there is less to think about. Thank you, Eric Hammond. A good best practice: use a single master AMI, re-built regularly via automation with new software, new package patches (apt) and your application code. We then tag per environment (test, staging, production) and switch services (Apache, MySQL) on and off during boot via init scripts. Another good practice: all app code and software config is checked out via SVN and baked into the AMI; svn up during boot via init scripts. This enables fast initialization during auto-scaling activities.
  18. AWS has CloudFormation. They came out with that a few days after I’d finished pretty much doing the same. I wrapped the AWS command line tools in shell scripts. Since we’re a Python shop, we’re likely going to be using boto (which has matured quite a bit in the last two years) and fabric.
  19. Nothing to see here, move along
  20. Nothing to see here, move along
  21. That’s a picture of the IBM RAMAC, built in 1956, which had 5M of storage and weighed a ton. We’ve come a long way, baby!
  22. For anyone unfamiliar: if a system stops working when a part of it fails, that part is a single point of failure. So in every system there’s potential for many single points of failure, proportional to system complexity. Because of the Ruger Fault Equivalency, the idea is to pick the right SPOFs and eliminate (or at least mitigate) them. I used the word ‘elimination’ here, hoping that it would make some folks chuckle; it’s really not possible to eliminate all SPOFs. You can mitigate them, though. So here are some examples, and I’ll drill into which ones are critical in our context.
  23. If your cloud provider goes out of business, you’re hosed. SPOF. In AWS, a region is…etc. If a region disappears, you’re hosed. SPOF. Within regions are zones. If an entire zone fails, you’re hosed. SPOF. Same with load balancers, application servers, databases. And even Fred. If Fred is the only guy who knows your operational systems, and he trips over the extension cord, knocking himself out in the process: you’re hosed. SPOF. The critical ones in our context, and likely in many others: zones and everything below.
  24. Should you attempt to achieve high fault tolerance through cloud-cloud failover? Ruger Fault Equivalency says: cost prohibitive (time squared). RightScale, who provides a very cool cloud management system, apparently has some of this functionality, and will likely be the place to go for cloud-cloud fail-over in the future. They will also add roughly 30% to your overall AWS costs. Their scripts are also in Ruby. Blah. In time my arguments against doing this will sound similar to the current arguments you hear against going to the cloud.
  25. Ruger Fault Equivalency says: ditto, cost prohibitive. If you try to migrate an ELB-balanced tech stack from one region to the next, you’ll learn: EIPs can’t be pointed from an instance in one region to another (at least not easily; I’ve heard you can ask to have it done); your custom useast (for example) AMI can’t be used in the new region; your useast security groups can’t be used in the new region; your snapshots can’t be used to create new volumes in the new region. Sure, this can all be worked around, but do you have the time and money? Do set up a DB replica in another region, if possible.
  26. Ruger says : yes, even in light of the recent outage, that affected the entire useast region. It’s not cost-prohibitive, and you get data-center fail-over.At my last company, we co-located in a downtown Seattle data center that also hosted MS, amazon, expedia servers. at the time seemed like a fortress, but it is in fact a single building. Contrast that to an AWS region, which contain zones that are four separate data centers, separate buildings. The Seattle data center caught fire a couple of years ago, causing a major outage for our last company (after we left). Many years previous, we had spent a lot of time and money creating our own data-center fail-over, within our Seattle office, even backed up by a generator.Did they fail over to the backup data center? I don’t think so. What about the recent AWS outage? A human error caused a major problem in one zone that had a ripple effect into the other zones, to a large degree caused by folks failing over. But ultimately, downtime suffered was in proportion to how well you were already leveraging other zones, and how dependent you were on EBS volumes. If all of your eggs were in the wrong zone, or didn’t have the right backup strategy in place – totally screwed. Otherwise – not so bad! I’ve often heard that VCs like entrepreneurs who have failed; what’s the cloud analogy? Something about lightening striking twice…
  27. Our zone scenario and why were were down intermittently for 12 hours during the AWS outagebefore the outage we had auto-scaling groups in two zones within a single regionat some point I brought everything into a single zone, while debugging odd performance between the twoconscientiously de-prioritized revisiting that, in light of other priorities, figuring the single-zone group would at least scale with trafficbut I’d configured the groups with a trigger to auto-scale when CPU spikedover time our application grew more resource efficient, which meant CPU wasn’t spiking, which meant we weren’t scaling with trafficled to the learning that it’s better to scale on network IO, or now that AWS supports them, custom scaling triggerswe’re in multiple zones again now
  28. Ruger says : don’t even think about not doing it. What’s generally worked for us: ELBs for same-region traffic distribution; auto-scaling groups to allow application server fail-over, within a zone and across them; replication to put secondary fail-over database servers in other zones within a region.
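One way to keep the single-zone situation from creeping back in is to check the instance inventory for it automatically. The following is a minimal sketch in Python under stated assumptions: the inventory is a hand-written list of (tier, zone) pairs, the tier names are hypothetical, and nothing here calls an AWS API. It reports every tier whose instances all live in one zone.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (tier, availability zone) for each running instance; hypothetical data.
Instance = Tuple[str, str]


def zones_by_tier(instances: Iterable[Instance]) -> Dict[str, Set[str]]:
    """Group the zones in use under each tier name."""
    tiers: Dict[str, Set[str]] = defaultdict(set)
    for tier, zone in instances:
        tiers[tier].add(zone)
    return dict(tiers)


def single_zone_tiers(instances: Iterable[Instance]) -> List[str]:
    """Return the tiers that would go down entirely if one zone failed."""
    return sorted(
        tier for tier, zones in zones_by_tier(instances).items()
        if len(zones) < 2
    )


inventory = [
    ("app", "us-east-1a"),
    ("app", "us-east-1b"),
    ("db", "us-east-1a"),        # primary
    ("db", "us-east-1b"),        # replicated secondary
    ("log-host", "us-east-1a"),  # only one zone: a single point of failure
]

print(single_zone_tiers(inventory))  # ['log-host']
```

Run from a scheduled job against a real inventory, a non-empty result is the cue to revisit zone placement before the next outage rather than during it.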
  29. What about Fred? Cut Fred some slack for tripping over the extension cord, we all make mistakes. You need Fred. That is, assuming he communicates what happened widely. If he doesn’t, he’s going to suffer the wrath of his internal and external customers.
  30. Transparency is so 2010, and it hints at over-communication. Your customers don’t need to know that your only DBA is out sick today. They don’t need a ton of detail; they need status updates and anything actionable. Does open communication increase fault tolerance? I’d argue yes. As I’ve tried to point out, people are core parts of our systems. Your customers will be more tolerant of your faults if you’re open and clear about them.
  31. At BigDoor, if there’s a crisis, our standard operating procedure identifies a single person responsible for stopping the team on an hourly basis to get status and determine what should be communicated externally, if anything. As much as we love him, we don’t involve our lawyer in that conversation, by the way.
  32. In summary, these are the three tenets that I’m hoping will help you achieve the right amount of fault tolerance in the cloud: scripted repeatability, SPOF elimination, and clear-cut communication. Thanks again Gluecon, I’ll be at the BigDoor pod in the lobby if anyone wants to talk more about this stuff later. I also have some notes that describe the good and bad about AWS; they will be available on our blog at www.bigdoor.com. Thanks again.
  33. Tools : the good and bad

      ELBs
      Good:
      - quick to configure, auto-scaling load balancers
      - can be used for fail-over within a region
      Bad:
      - no logging
      - return 503s on error; you won't know unless you can monitor every request end to end
        - e.g. if there aren't instances that can service requests
      - name servers disregarding TTLs + auto-scaling = traffic routing issues
      - best practice: return custom HTTP headers in your response so that you can distinguish calls during support incidents
      - can't be used for fail-over between AWS regions; need a separate DNS solution for funneling traffic

      AMIs (Amazon Machine Images: install images, OS blueprints)
      Good:
      - leveraged a pre-built Debian AMI
        - Debian has great package management, which can be scripted
        - packages are verified and tested before making it into the main line; less to think about
        - thank you Eric Hammond, http://alestic.com/
      - scripted repeatability: script the non-interactive install of your tools
      - can be used to stand up instances within a region
      - best practice: a single master AMI built on top of a pre-existing one, re-built regularly with new software, app code and patches, via automation. Tagged.
      - best practice: put app code and package configuration into SVN and include it in your AMI; svn-up regularly or during instance start-up
        - faster for things like auto-scaling
      Bad:
      - can't copy/port AMIs from region to region easily
      - not having the entire process scripted from the kernel up means loss of flexibility (regional AMIs) and security
      - pitfall: easy to get off track. Didn't start out with a single script that installs everything, or stay diligent about including everything? Have fun re-doing all that!

      EC2 instances
      Good:
      - leverages AMIs
      - obviously, script-able automated instance creation
      - EIPs allow for easy, dependable service re-routing from one instance to another
      - security groups are an easy way to firewall (and tag, before they came out with those)
      - zones allow easy fail-over within a geographic region (most of the time)
      - regions provide the promise of fail-over between data centers that are more geographically separated (Virginia vs. California)
      - init scripts allow you to create/update on a per-instance basis
      Bad:
      - security groups can't be added to or removed from an instance once it's running
        - best practice: use a different group for each narrower category
          - e.g. instead of 'database group', create groups for 'primary transactional db server in production', 'replicant...' etc
        - best practice: use a group that whitelists trusted IPs to give access to otherwise un-needed ports and services
      - regions don't allow easy fail-over; EIPs can't be mapped between them (at least not programmatically)
      - can't port AMIs from region to region easily, so setting up region-to-region fail-over is difficult

      EBS
      Good:
      - provides redundant storage for instances that can be snapshotted for easy backup and volume duplication within a region
      Bad:
      - volumes can't be created from snapshots across regions
      - data loss: it happened (not to us, fortunately), so be prepared and apply the amount of resources your risk tolerance allows
      - poor I/O in general, specifically writes; typically only an issue for us on our primary tx DB servers
        - best practice: RAID 0 array for the MySQL data directory, but make sure it's replicated and backed up

      Auto-scaling
      Good:
      - n scaling groups in 1-4 zones behind an ELB; provides same-region fail-over
      - n# of instances in a scaling group
      - CloudWatch monitors provide great stats
      - previously, limited scaling triggers were provided; the latest integrate CloudWatch much better, including custom metrics you define
      Bad:
      - learning: we had no baselines for when to scale on anything other than CPU utilization, which at the time was easy to differentiate; we spiked
        - application improvements fixed the spikes, which in turn stopped the auto-scaling triggers from firing
      - need monitoring/alerting via Nagios or another tool? Figure out how to (de-)register new instances during scaling activities
        - this is changing; CloudWatch is getting better. Do you trust Amazon's monitoring/alerting on Amazon's monitoring/alerting?

      EMR
      Good:
      - great for async log analysis
      - what's worked for us: centralized log hosts
        - Apache logs rotated via logrotate and rsync'd via cron, pre-processed, sync'd to S3 and drawn into an EMR/Hive cluster for aggregations and reporting
      - Hive/HQL is very similar to SQL
      Bad:
      - asynchronous; takes a fair amount of time to munge data

      S3
      Good:
      - available from anywhere, any region
      - s3cmd is a great tool, for the most part
      Bad:
      - no full support for standard paths and directories
      - …TBD

      CloudWatch
      Good:
      - can monitor various services and trigger/alert when thresholds are crossed (e.g. ELB network in)
      - new: auto-scaling can leverage triggers more broadly, including custom metrics
      Bad:
      - no built-in ability to trigger/alert based on % change from previous measurements
      - console reports/graphs need a decoder tool and, most recently, appear buggy. But they've made big steps forward.

      AWS APIs
      Good:
      - API wrappers provided; allow for command-line scripting
      - DRY: can (and should) script most things that repeat
      - all done via scripts: a bit about our process and how the cloud fits well
        - 1-week sprints: lockdown Tuesdays, test overnight (uTEST), release Wednesday
        - test-first methodology
        - system tests for backend, other big changes, our API changes
        - tested a new version of MySQL (Percona, recommended)
          - http://screencast.com/t/yVf5RnaUN9
          - http://screencast.com/t/WJaL2qiSR
        - performance, integrity, load, capacity
        - these require full-stack stand-up/tear-down, including a 230G+ db backend
      Bad:
      - keep your eye out for library updates (why not open-source these things? Verify they’re not already…)
      - scripts and wrappers trail AWS innovation, which is fast
      - BASH isn't as well-known or readable as Python, for example; a maintainability concern
      - scripted stuff bakes you in a bit; no way around this without baking yourself into RightScale or some other solution anyway, though
      - API key management: not straightforward
        - API keys aren't portable between regions; region-to-region fail-over is not as easy as it sounds. Not rocket science, either.
        - bake region 1’s keys into region 2’s new AMI

      APIs - General
      - build things test first; run integrity tests before pushing out changes to your API
      - don't version; make it backwards-compatible

      We try to keep away from anything that’s going to lock us in too much
      - we continue to shy away from SQS (Simple Queue Service), RDS (Relational Database Service) and SimpleDB (non-relational datastore)
      - SQS and SimpleDB are proprietary; we would prefer to avoid lock-in for these things, and the need for them hasn't been high enough for us yet
      - RDS: doesn't provide enough flexibility for us. Would love to use it as a replicant pool for reads/reporting, though. Can't.
      - multi-zone RDS suffered one of the biggest hits during the recent AWS outage

      What we're looking forward to leveraging
      - new CW status, PUTs, scaling triggers from them
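The ELB best practice in the tools notes (returning custom HTTP headers so that calls can be told apart during support incidents) can be sketched as a small piece of WSGI middleware. This is an illustrative sketch in Python, not the implementation used at BigDoor: the header names `X-Served-By` and `X-Request-Id`, the class name, and the toy application are all assumptions.

```python
import socket
import uuid


class DiagnosticHeaders:
    """WSGI middleware that tags each response with the serving host and a request id."""

    def __init__(self, app, hostname=None):
        self.app = app
        # Default to this machine's hostname; injectable for tests.
        self.hostname = hostname or socket.gethostname()

    def __call__(self, environ, start_response):
        request_id = uuid.uuid4().hex  # unique per request, 32 hex characters

        def tagged_start_response(status, headers, exc_info=None):
            # Append the diagnostic headers to whatever the app set.
            tagged = list(headers) + [
                ("X-Served-By", self.hostname),   # hypothetical header name
                ("X-Request-Id", request_id),     # hypothetical header name
            ]
            return start_response(status, tagged, exc_info)

        return self.app(environ, tagged_start_response)


def hello_app(environ, start_response):
    """Stand-in application used only to demonstrate the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]


application = DiagnosticHeaders(hello_app)
```

With headers like these in place, a 503 that carries no such headers most likely came from the load balancer itself rather than from an application instance, which narrows the search during an incident. Logging the same request id on the server side would let a customer-reported call be matched to a specific host and log line.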