GlueCon 2011: Jeff Malek from BigDoor

  1. Retrospective from a startup built in the cloud: top 3 big lessons from the AWS outage on 04.21.2011, plus 4,369 other smaller ones (5/27/2011)
  2. What a country: entrepreneurial resiliency
  3. (true story) “Robust systems: highly fault-tolerant, on or off grid, e.g. our culture wrt entrepreneurs, AWS, the BD API”
  4. Boom
  5. Good to be home! Go Buffs
  6. Me, previous startup: teams in 3 countries; highly transactional system; MS tech (IIS/MS SQL Server); co-located, leased/owned hardware; 0% in cloud; $75M yearly revenue
  7. Me, current startup: systems 100% on AWS; 99% free/open-source software. Standing on the shoulders of giants.
  8. Fault tolerance: 3 to 47 important failearnings, and 4,369 less important ones
  9. In the context of our startup, of course. YMMV depending on velocity.
  10. Ruger
  11. The Ruger Fault Equivalency: time = money; fault tolerance = time² - risk tolerance. Also known as: 'Fast, good and cheap: pick two'
  12. System design philosophy: leverage proven, open-source tech in the cloud to build a scalable, reliable, secure operational foundation quickly
  13. So how do you achieve the right level of fault tolerance in the cloud? 3 tenets
  14. Tenet #1: Scripted Repeatability; Tenet #2: SPOF Elimination; Tenet #3: Clear-Cut Communication
  15. Who here has used AWS?
  16. Tenet #1: prepare a fault-tolerant foundation with scripted repeatability, aka automation
  17. From the start: script the non-interactive install of your tools and OS. Custom AMI; Debian (great package management); based on Eric Hammond’s work: http://alestic.com/
  18. Which will allow you to script the setup/tear-down of your stack
  19. Which will allow you to script system tests: integrity (3-4K tests); performance (30-40K tests); load, capacity (2-4M requests)
  20. A/B system test results: MySQL Percona upgrade
  21. That’s how 1 person set up and managed a network of 90+/- server instances for 1.5 years, while serving various other roles, without having to leave their chair. Try that with real hardware.
  22. Tenet #2: SPOF Elimination. We don’t need no stinkin single points of failure.
  23. SPOF examples: cloud provider, region, zone, load balancer, app server, database, Fred
  24. Cloud provider fail-over? e.g. AWS -> Rackspace
  25. Region fail-over? e.g. US-EAST -> US-WEST within AWS. Nah.
  26. Zone fail-over? Yes. (Diagram labels: US-WEST, US-EAST)
  27. Zone fail-over best practices: are you using auto-scaling? No: distribute server instances evenly between 2 or more zones. Yes: trigger scaling on network I/O or custom metrics.
  28. Load-balancer (ELB), app server, database fail-over? Yes.
  29. So it’s actually all about reduction of the right SPOFs for your business context. Just adding the ability to fail over and have backups within a region is huge! Probably enough for most. What about Fred?
  30. Tenet #3: Clear-Cut Communication. Transparency is soooo 2010.
  31. During an outage, communicating the right things at the right time: hard. But not that hard.
  32. Three Tenets Revisited. Tenet #1: Scripted Repeatability; Tenet #2: SPOF Elimination; Tenet #3: Clear-Cut Communication
  33. Notes
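
The zone fail-over slide advises that, without auto-scaling, server instances should be distributed evenly between 2 or more zones. A minimal sketch of that idea follows, as a round-robin placement. It is an illustration only, not the speaker's tooling: the function name `assign_zones`, the instance names, and the zone names are all hypothetical, and no AWS API is called.

```python
from collections import Counter
from itertools import cycle


def assign_zones(instance_names, zones):
    """Assign instances to zones round-robin.

    Round-robin placement keeps per-zone counts within one of each other,
    so losing any single zone removes roughly 1/len(zones) of the fleet.
    """
    if len(zones) < 2:
        # A single zone would itself be a single point of failure.
        raise ValueError("need at least 2 zones for zone fail-over")
    zone_cycle = cycle(zones)
    return {name: next(zone_cycle) for name in instance_names}


# Hypothetical fleet and zone names, for illustration only.
instances = [f"app-{i:02d}" for i in range(1, 8)]
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

placement = assign_zones(instances, zones)
per_zone = Counter(placement.values())

for zone in zones:
    print(zone, per_zone[zone])
```

With 7 instances across 3 zones, no zone holds more than 3 instances, so a single-zone failure leaves at least 4 running. A real deployment would feed a placement like this into whatever scripted setup creates the instances.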
