Gluecon 2011: Jeff Malek from BigDoor

Transcript

  • 1. Retrospective from a startup built in the cloud : top 3 big lessons from the AWS outage on 04.21.2011 plus 4,369 other smaller ones
    5/27/2011
  • 2. What a country : entrepreneurial resiliency
  • 3. (true story)
    “robust systems: highly fault-tolerant, on or off grid. eg: our culture wrt entrepreneurs, AWS, the BD API”
  • 4. Boom
  • 5. good to be home!
    Go Buffs
  • 6. me: previous startup
    teams in 3 countries
    highly transactional system
    MS tech : IIS/MS SQL Server
    co-located, leased/owned hardware
    0% in cloud
    $75M/yearly rev
  • 7. me : current startup
    systems 100% on AWS
    99% free/open-source software
    standing on the shoulders of giants
  • 8. fault tolerance: 3 to 47 important failearnings
    and 4,369 less important ones
  • 9. in the context of our startup, of course
    YMMV depending on velocity
  • 10. Ruger
  • 11. The Ruger Fault Equivalency
    time = money
    fault tolerance = time² - risk tolerance
    Also known as:
    'Fast, good and cheap : pick two'
  • 12. system design philosophy:
    leverage proven, open-source tech
    in the cloud
    to build a
    scalable
    reliable
    secure
    operational foundation
    quickly
  • 13. So how do you achieve the right level of fault tolerance in the cloud?
    3 tenets
  • 14. Tenet #1
    Scripted Repeatability
    Tenet #2
    SPOF Elimination
    Tenet #3
    Clear-Cut Communication
  • 15. who here has used AWS?
  • 16. Tenet #1
    prepare a fault-tolerant foundation with scripted repeatability
    aka automation
  • 17. from the start :
    script the non-interactive install of your tools and OS
    custom AMI
    Debian : great package management
    based on Eric Hammond’s work
    http://alestic.com/
  • 18. which will allow you to
    script the setup/tear-down of your stack
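As a rough illustration of scripted setup and tear-down, here is a minimal Python sketch. The step names and the `run_step` helper are hypothetical and are not taken from the talk; the sketch only records what would run instead of calling any cloud API. The point it shows is that tear-down replays the completed setup steps in reverse, even when something fails partway.

```python
from contextlib import contextmanager

# Hypothetical stack steps, in the order they would be brought up.
STACK_STEPS = ["database", "app-server", "load-balancer"]


def run_step(action, name, log):
    """Stand-in for a real provisioning call; it only records what would run."""
    log.append(f"{action} {name}")


@contextmanager
def stack(steps, log):
    """Bring a stack up step by step and always tear it down in reverse."""
    started = []
    try:
        for name in steps:
            run_step("setup", name, log)
            started.append(name)
        yield started
    finally:
        # Tear down only what actually started, newest first.
        for name in reversed(started):
            run_step("teardown", name, log)


if __name__ == "__main__":
    log = []
    with stack(STACK_STEPS, log):
        log.append("run system tests")
    print("\n".join(log))
```

Wrapping the stack in a context manager keeps setup and tear-down paired, which is what makes repeated test runs cheap.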
  • 19. which will allow you to
    script system tests
    integrity (3-4K tests)
    performance (30-40K tests)
    load, capacity (2-4M requests)
  • 20. A/B system test results : MySQL Percona Upgrade
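The slide's chart is not in the transcript, so the following is only a sketch of how an A/B comparison of two scripted test runs might be summarized. The `compare_runs` helper and the latency figures are made up for illustration and are not the talk's results.

```python
from statistics import median


def compare_runs(baseline_ms, candidate_ms):
    """Return the median latency of each run and the relative change."""
    base = median(baseline_ms)
    cand = median(candidate_ms)
    change = (cand - base) / base
    return {"baseline": base, "candidate": cand, "change": change}


if __name__ == "__main__":
    # Made-up latencies in milliseconds, for illustration only.
    before = [12.0, 15.0, 11.0, 14.0, 13.0]
    after = [10.0, 12.0, 9.0, 11.0, 10.0]
    result = compare_runs(before, after)
    print(f"median before: {result['baseline']} ms")
    print(f"median after:  {result['candidate']} ms")
    print(f"relative change: {result['change']:.1%}")
```

The median is used here because a few slow outliers would skew a mean; that is a design choice of this sketch, not something the slides specify.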
  • 21. That’s how 1 person set up and managed a network comprised of 90+/- server instances for 1.5 years while serving various other roles without having to leave their chair
    try that with real hardware
  • 22. Tenet #2
    SPOF Elimination
    We don’t need no stinkin single points of failure.
  • 23. SPOF Examples:
    Cloud Provider
    Region
    Zone
    Load Balancer
    App Server
    Database
    Fred
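One way to make the list above checkable is to keep an inventory of each layer and flag any layer with only one member. This is a minimal sketch: the `find_spofs` helper and every name in the inventory are hypothetical.

```python
def find_spofs(inventory, minimum=2):
    """Return the layers that have fewer than `minimum` independent members."""
    return sorted(
        layer for layer, members in inventory.items() if len(members) < minimum
    )


if __name__ == "__main__":
    # Hypothetical inventory; names are illustrative only.
    inventory = {
        "zone": ["us-east-1a", "us-east-1b"],
        "load balancer": ["elb-1"],
        "app server": ["app-1", "app-2", "app-3"],
        "database": ["db-master"],
        "person who knows the deploy": ["Fred"],
    }
    for layer in find_spofs(inventory):
        print(f"single point of failure: {layer}")
```

People count as a layer too, which is the point of "Fred" in the slide.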
  • 24. Cloud Provider fail-over?
    e.g. AWS –> Rackspace
  • 25. Region fail-over?
    e.g. useast->uswest within AWS
    Nah.
  • 26. Zone fail-over?
    Yes.
    US-WEST
    US-EAST
  • 27. Zone fail-over best practices:
    are you using auto-scaling?
    no : distribute server instances evenly between 2 or more zones
    yes : trigger scaling on network I/O or custom metrics
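For the "no auto-scaling" case, even distribution can be sketched as a round-robin assignment. The `assign_zones` helper and the instance and zone names below are assumptions for illustration; a real setup would pass the chosen zone to whatever launches the instance.

```python
from collections import Counter
from itertools import cycle


def assign_zones(instances, zones):
    """Assign instances to zones round-robin so counts differ by at most one."""
    if len(zones) < 2:
        raise ValueError("need at least 2 zones to survive a zone failure")
    return dict(zip(instances, cycle(zones)))


if __name__ == "__main__":
    # Hypothetical instance and zone names.
    instances = [f"app-{i}" for i in range(1, 6)]
    placement = assign_zones(instances, ["us-east-1a", "us-east-1b"])
    for name, zone in placement.items():
        print(name, "->", zone)
    print(Counter(placement.values()))
```

With round-robin placement, losing one of two zones takes out at most about half of the instances (three of five in the example above).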
  • 28. Load-balancer (ELB), app server, database fail-over?
    Yes.
  • 29. So it’s actually all about reduction of the right SPOFs for your business context
    Just adding the ability to fail-over and have backups within a region is huge!
    Probably enough for most.
    What about Fred?
  • 30. Tenet #3
    Clear-Cut Communication
    transparency is soooo 2010
  • 31. During an outage, communicating the right things at the right time: hard.
    But not that hard.
  • 32. Three Tenets Revisited
    Tenet #1
    Scripted Repeatability
    Tenet #2
    SPOF Elimination
    Tenet #3
    Clear-Cut Communication
  • 33. Notes