MTBF / MTTR - Energized Work TekTalk, Mar 2012

Published in: Technology, Business

  1. MTBF / MTTR: Availability or recoverability?
       Presented by Michael Richardson, Energized Work, 21 March 2012
       ENERGIZED WORK, 25 MACKLIN STREET, LONDON WC2B 5NN, +44 (0)20 7691 8933, WWW.ENERGIZEDWORK.COM
  2. Michael Richardson
       Twitter: @mr_spb
       Email: michael@energizedwork.com
       #ewtektalk
       © 2012 Energized Work - www.energizedwork.com
  3. So what is high availability?
       • Five nines?
       • No single points of failure?
       • Multiple data centres?
       • Fault tolerance?
       • Load balancing?
       • Uptime?
  4. Nines of availability
  5. Nines of availability
       Availability            Downtime per year
       One nine (90%)          36.5 days
       Two nines (99%)         3.65 days
       Three nines (99.9%)     8.76 hours
       Four nines (99.99%)     52.56 minutes
       Five nines (99.999%)    5.26 minutes
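
The downtime figures on slide 5 follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (which is what the slide's numbers imply); the function name `downtime_per_year` is illustrative and not from the talk:

```python
from datetime import timedelta

# Assumption: a non-leap year of 365 days, matching the figures on the slide.
DAYS_PER_YEAR = 365


def downtime_per_year(availability_percent: float) -> timedelta:
    """Return the yearly downtime budget implied by an availability percentage."""
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(days=DAYS_PER_YEAR * unavailable_fraction)


if __name__ == "__main__":
    for pct in (90, 99, 99.9, 99.99, 99.999):
        minutes = downtime_per_year(pct).total_seconds() / 60
        print(f"{pct}% availability allows about {minutes:.2f} minutes of downtime per year")
```

Each extra nine divides the downtime budget by ten, which is why the table moves from days to hours to minutes.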
  6. Problem with the nines
       • What do they mean?
       • Guaranteed or just an SLA?
       • Multiplicity (99.9% * 99.9% * 99.9% = 99.7%)
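
The multiplicity point on slide 6 can be checked in a few lines. This is a sketch that assumes the components fail independently and that the service needs all of them at once; `serial_availability` is a hypothetical helper name:

```python
from math import prod


def serial_availability(component_availabilities):
    """Availability of a chain in which every component must be up.

    Assumption: component failures are independent, so the individual
    availabilities (as fractions between 0 and 1) simply multiply.
    """
    return prod(component_availabilities)


if __name__ == "__main__":
    chain = [0.999, 0.999, 0.999]  # three components at three nines each
    print(f"Combined availability: {serial_availability(chain) * 100:.1f}%")
```

Three components at three nines each already fall short of three nines as a whole, and every additional dependency in the chain lowers the figure further.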
  7. SLA availability numbers just aim to provide a level of confidence in a website’s service
  8. No single point of failure (SPOF)
  9. Two of everything?
  10. Start with this
       Users
       Index.html
  11. End with this
       Users
       Firewall 1, Firewall 2
       Switch 1, Switch 2
       WEB1, WEB2
       APP1, APP2
       DB1, DB2
  12. Problems with eliminating SPOF
       • It’s expensive
       • Where do you draw the line?
       • Are failures independent?
       • Can you guarantee no SPOF?
       • Increased complexity
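
To see what "two of everything" buys on paper, the sketch below models the five duplicated tiers from slide 11 (firewall, switch, web, app, DB). It rests on the assumption that slide 12 questions: that failures are independent. The 99% per-node figure and both function names are hypothetical, chosen only for illustration:

```python
def redundant_availability(node_availability: float, copies: int = 2) -> float:
    """Availability of a tier that is up while at least one copy is up.

    Assumption: copies fail independently. Shared power, shared software
    bugs, or operator error break this assumption in practice.
    """
    return 1 - (1 - node_availability) ** copies


def stack_availability(node_availability: float, tiers: int = 5, copies: int = 2) -> float:
    """Availability of several redundant tiers chained in series."""
    return redundant_availability(node_availability, copies) ** tiers


if __name__ == "__main__":
    node = 0.99  # hypothetical per-node availability
    print(f"Single copy of each tier: {stack_availability(node, copies=1):.6f}")
    print(f"Two copies of each tier:  {stack_availability(node, copies=2):.6f}")
```

The calculated gain only holds while the independence assumption does. Correlated failures and the added complexity of failover are exactly what the model leaves out.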
  13. Problem: Data centres fail
  14. Solution: Get a second data centre
  15. Hot – Hot multisite
       • Full range of services available in multiple locations
       • Easy to automate failover of sites
       • Data consistency is hard
       • Capacity planning concerns
  16. Hot – Warm multisite
       • Simpler than hot – hot
       • Read / Write ratio dependent
       • Synchronously or asynchronously replicate data?
  17. Hot – Cold multisite
       • Easy to set up
       • Will it work?
       • Can it be trusted?
       • Cold site rapidly becomes stale
       • Is it actually valuable?
  18. DR multisite
       • Fingers crossed you never need it
       • How can / should you test it?
       • Cloud?
  19. Problems with multiple sites
       • It’s expensive
       • Managing more systems
       • Managing data consistency
       • Managing capacity
       • Is it still fail proof?
       • Unless you test it, it’s just a plan
  20. We now have a complex system
  21. Complex systems
       • More redundancy and automation leads to more complexity
       • More complexity often adds more points of failure
  22. How complex systems fail - Dr. Richard Cook
       • Catastrophe is always just around the corner
       • Human operators have dual roles
       • Change introduces new forms of failure
  23. Failure and recovery
  24. Questions for the business
       • What is the cost of downtime?
       • What are the Recovery Time Objectives (RTO)?
       • What are the Recovery Point Objectives (RPO)?
  25. Aggressive RTO and RPO are expensive and have a performance impact
  26. RTO / RPO example
       Problem:
       • Simple DB
       • Business can tolerate up to 15 minutes downtime
       • 10-minute window of data loss
  27. RTO / RPO example
       Possible solution:
       • Continuously replicate data to second host
       • Continue with nightly backups and also copy DB transaction logs from the primary host to another system
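
The RTO / RPO example on slides 26 and 27 amounts to comparing a plan's worst-case restore time and worst-case data loss against the business targets. A minimal sketch follows. The 15-minute RTO and 10-minute RPO come from slide 26, while the `RecoveryPlan` type and the per-plan timings are hypothetical numbers for illustration, not measurements from the talk:

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryPlan:
    name: str
    worst_case_data_loss: timedelta     # age of the newest recoverable data at failure time
    worst_case_restore_time: timedelta  # time from failure until service is back


def meets_objectives(plan: RecoveryPlan, rto: timedelta, rpo: timedelta) -> bool:
    """Return True if the plan satisfies both the RTO and the RPO."""
    return plan.worst_case_restore_time <= rto and plan.worst_case_data_loss <= rpo


if __name__ == "__main__":
    rto = timedelta(minutes=15)  # from slide 26: up to 15 minutes of downtime
    rpo = timedelta(minutes=10)  # from slide 26: 10-minute window of data loss

    # Hypothetical timings, chosen only to illustrate the comparison.
    plans = [
        RecoveryPlan("nightly backup only", timedelta(hours=24), timedelta(hours=2)),
        RecoveryPlan("replica plus shipped logs", timedelta(minutes=5), timedelta(minutes=10)),
    ]
    for plan in plans:
        verdict = "meets" if meets_objectives(plan, rto, rpo) else "misses"
        print(f"{plan.name}: {verdict} RTO/RPO")
```

The real inputs have to come from restore drills rather than estimates, in line with the earlier point that an untested plan is just a plan.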
  28. So what is more important – increasing availability or reducing recovery time?
  29. MTBF or MTTR? What about MTTD?
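
The trade-off on slides 28 and 29 can be made concrete with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The sketch below treats time to detect (MTTD) as downtime added on top of MTTR, which is an assumption, since some definitions already fold detection into MTTR. The hour values are hypothetical:

```python
def availability(mtbf_hours: float, mttr_hours: float, mttd_hours: float = 0.0) -> float:
    """Steady-state availability from mean time between failures and mean downtime.

    Assumption: detection time (MTTD) is counted separately from repair
    time (MTTR), so both add to the downtime of each failure.
    """
    downtime_per_failure = mttd_hours + mttr_hours
    return mtbf_hours / (mtbf_hours + downtime_per_failure)


if __name__ == "__main__":
    baseline = availability(mtbf_hours=1000, mttr_hours=1)
    doubled_mtbf = availability(mtbf_hours=2000, mttr_hours=1)
    halved_mttr = availability(mtbf_hours=1000, mttr_hours=0.5)
    slow_detection = availability(mtbf_hours=1000, mttr_hours=1, mttd_hours=1)

    print(f"baseline:        {baseline:.6f}")
    print(f"double MTBF:     {doubled_mtbf:.6f}")
    print(f"halve MTTR:      {halved_mttr:.6f}")
    print(f"1h to detect:    {slow_detection:.6f}")
```

Under this formula, doubling MTBF and halving MTTR give the same availability, so the choice between them comes down to which is cheaper to achieve. Slow detection eats into availability just as slow repair does.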
  30. The answer is: It depends
  31. Failure is inevitable
  32. Ask anyone
  33. License
       This presentation is provided under the Creative Commons Attribution Share Alike 3.0 Unported License.
       You are free:
       • To share – to copy, distribute and transmit the work
       • To remix – to adapt the work
       Under the following conditions:
       • Attribution – You must attribute the work in the manner specified by Energized Work (but not in any way that suggests that Energized Work endorse you or your use of the work).
       • Share Alike – If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.
