Failure Happens
F***, the F***ing thing is F***king F***ed*
            *Official WebOps term from Artur Bergman




      ...
This will be on the test:

 FAILURE HAPPENS!
25% Pyromaniac
75% Paranoid
Good
Book!
“multiple and unexpected
interactions of failures are
        inevitable”
                -Charles Perrow
Failure Happens
define:
 Nines (roughly)
   99%        5256 min (3.65 days)
   99.9%      526 min (8.8 hours)
   99.99%     53 min
   99.999%    5 min
   99.9999%   30 seconds
   99.99999%  3 seconds
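
The downtime figures above come from multiplying the allowed unavailability by the length of a year. A minimal sketch of that arithmetic, assuming a 365-day year; the helper names `downtime_per_year` and `describe` are illustrative and do not come from the slides:

```python
# Sketch: yearly downtime budget for a given availability ("nines").
# Assumption: a 365-day year. Function names are hypothetical.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def downtime_per_year(availability_percent: float) -> float:
    """Return the allowed downtime in seconds per year."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return SECONDS_PER_YEAR * (100 - availability_percent) / 100


def describe(seconds: float) -> str:
    """Format a duration using the largest sensible unit."""
    if seconds >= 86400:
        return f"{seconds / 86400:.2f} days"
    if seconds >= 3600:
        return f"{seconds / 3600:.1f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.0f} min"
    return f"{seconds:.0f} s"


if __name__ == "__main__":
    # Print one row per level of "nines".
    for nines in ("99", "99.9", "99.99", "99.999", "99.9999", "99.99999"):
        print(f"{nines}%\t{describe(downtime_per_year(float(nines)))}")
```

Each extra nine divides the budget by ten, so the gap between 99.9% and 99.999% is the difference between most of a working day and about five minutes per year. The slide's figures are rounded, so small differences from the computed values are expected.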
Internet Routing... won’t.
#googlefail
YOU
Continuous Power...
       isn’t
365 Main SF
365 364.96 Main SF
Failure happens

 A single datacenter is the
 problem
 • Since they all fail at some point

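 Recovery procedures after
 failure
 • Power was gone ~45 minutes
 • Most services took hours to come back
 • Some unnamed ones more than 12 hours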
Truck 1, Rackspace 0
Geography is a
Single Point of Failure
Taser-wielding robbers

C I Host's Chicago facility
robbed twice!

(the other two times were
merely "break-ins where things were stolen")
Providers are
baskets too.
Failure Happens.
Anyone promising otherwise
 is either foolish or lying
          (or both).
Go Here!

June 22-24, 2009


         Jesse Robbins
       jesse@oreilly.com
Failure Happens: CloudCamp Interop
  • Firefighters are usually considered to be about 75% paranoid and about 25% pyromaniac.
  • Which means this sort of thing made perfect sense to me at the time.



















  • The 365 Main site does not have a typical battery backup system. Instead, it relies on Continuous Power Supplies (CPS), which use a flywheel-driven alternator to generate electricity. The flywheel is connected to both a large diesel motor and an electric motor that runs on utility power. The flywheel is normally turned by the electric motor, and stores enough kinetic energy to power the alternator for up to 15 seconds. When utility power fails, the diesel motor is supposed to start in under 5 seconds, well before the flywheel's kinetic energy is exhausted, providing uninterrupted electrical power. The advantage of a CPS over a battery-based system is that the power going to the datacenter is decoupled from the utility power. This eliminates the complex electrical switching required by most battery-based systems, making many CPS systems simpler and sometimes more reliable.
  • In this incident, latent defects caused three generators to fail during start-up. No customers were affected until a fourth generator failed 30 seconds later, which overloaded the surviving backup system and caused power failures in 3 of 8 customer areas. What's most interesting is that the redundant design of the system is what caused it to fail so completely. The failure of the fourth generator should have brought down only one area instead of three. This kind of cascade failure is common in complex and tightly coupled systems. In my experience, these sorts of failure modes are often identified and then promptly dismissed as being "nearly impossible". Unfortunately, the impossible often becomes reality. To put it another way... Failure Happens.








  • Hurricane Katrina landed, and like many people I wanted to help.





  • Failure Happens: CloudCamp Interop

    1. Failure Happens F***, the F***ing thing is F***king F***ed* *Official WebOps term from Artur Bergman jesse@oreilly.com
    2. This will be on the test: FAILURE HAPPENS!
    3. 25% 75%
    4. 25% 75% Paranoid
    5. 25% Pyromaniac 75% Paranoid
    6. Good Book!
    7. “multiple and unexpected interactions of failures are inevitable” -Charles Perrow
    8. Failure Happens
    9. define: Nines (roughly)
    10. define: Nines (roughly) 99% 5256 min (3.65 days)
    11. define: Nines (roughly) 99% 5256 min (3.65 days) 99.9% 526 min (8.8 hours)
    12. define: Nines (roughly) 99% 5256 min (3.65 days) 99.9% 526 min (8.8 hours) 99.99% 53 min
    13. define: Nines (roughly) 99% 5256 min (3.65 days) 99.9% 526 min (8.8 hours) 99.99% 53 min 99.999% 5 min
    14. define: Nines (roughly) 99% 5256 min (3.65 days) 99.9% 526 min (8.8 hours) 99.99% 53 min 99.999% 5 min 99.9999% 30 seconds
    15. define: Nines (roughly) 99% 5256 min (3.65 days) 99.9% 526 min (8.8 hours) 99.99% 53 min 99.999% 5 min 99.9999% 30 seconds 99.99999% 3 seconds
    16. Internet Routing... won’t.
    17. (slide text unrecoverable)
    18. #googlefail
    19. YOU
    20. Continuous Power... isn’t
    21. 365 Main SF
    22. 365 364.96 Main SF
    23. Failure happens A single datacenter is the problem • Since they all fail at some point Recovery procedures after failure • Power was gone ~45 minutes • Most services took hours to come back • Some unnamed ones more than 12 hours
    24. Truck 1, Rackspace 0
    25. Geography is a Single Point of Failure
    26. (slide text unrecoverable)
    27. Taser-wielding robbers C I Host's Chicago facility robbed twice! (the other two times were merely "break-ins where things were stolen")
    28. Providers are baskets too.
    29. Failure Happens. Anyone promising otherwise is either foolish or lying (or both).
    30. Go Here! June 22-24, 2009 Jesse Robbins jesse@oreilly.com