System Availability Talk

A talk I gave on HA, resiliency and recovery of systems.

Published in: Technology, Business

  • 28/10/10 © Energized Work Limited 2010 Agile Evangelists - LEANING
  • Slide 3: Ask any business how much downtime is acceptable and you will get a consistent answer.
  • Slide 5: Found more in marketing literature than in technical literature.
  • Slide 6: An SLA is just an instrument that makes business people comfortable (just like insurance).
  • Slide 11: Points 1 & 2: diminishing returns. Paradoxically, adding more components to an overall system design can undermine efforts to achieve high availability. Cascading failures.
  • Slide 14: Read & write anywhere. Global Server Load Balancing with DNS.
  • Slide 15: Read-intensive apps are well suited to this: reads are Hot/Hot.
  • Slides 16 and 17: The cold site is so untrusted that perhaps spending hours restoring the primary DC is a better and safer bet.
  • Slide 18: Talk about capacity planning. Hot/Hot: config switches. Most companies don’t thoroughly test DC failover. When a failure occurs, many companies will focus on restoring the primary DC rather than attempt a failover. So why bother having a 2nd DC at all? If you plan on having multiple DCs or DR, then test your procedures when you’re not in an emergency situation. Game Day events.
  • Slide 21: Mention John Allspaw’s QCon talk. 2. Dual roles of humans: defenders against failure and producers of failure. 3. Introducing a technology change to prevent low-consequence but high-frequency failures may introduce low-frequency but high-consequence failures, and new pathways to large-scale, catastrophic failures. The focus of humans is on the beneficial characteristics of the change; new failures may be difficult to foresee. Give config management example (Knife, resolv.conf). 3. Also covers maintenance and why many find it difficult: build-and-forget mentality.
  • Slide 23: Cost of downtime: easy or difficult to measure? Can downtime actually be equated to lost revenue? Give online shopping example.
  • Slide 24: RTO and RPO are often in competition. Give an example of replication lag between 2 sites. Zero RPO example: if replication lags between systems and you have an aggressive RPO, you may be better off taking a few hours of outage and focusing on restoring your primary site. Zero RTO example: if replication lags between DCs, you may decide to fail over immediately and take the data loss for some in-flight transactions. Aggressive RTO & RPO is expensive and has a performance impact.
  • Slide 26: Typical nightly backups aren’t going to cut it. Common practice is to back up systems nightly. Is your business happy to lose up to 24 hours of data? Probably not.
  • Slide 27: 1. Covers you for any catastrophic hardware failure; the 2nd host has independent storage infrastructure. Data corruption would, however, result in 2 copies of crap. 2. Covers you for data corruption. Playing back transaction logs will also allow you to identify the place where corruption occurred.
  • Slide 29: What about MTTD?
  • Slide 30: My experience tells me most companies focus on availability. How many companies take nightly tape backups but have never bothered trying to restore or test them? If you think you can build a completely fail-proof system, you are kidding yourself. How many companies have game days?
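
The RTO/RPO notes above can be made concrete with a small comparison. The sketch below checks a few backup strategies against the talk's example targets (up to 15 minutes of downtime, a 10-minute window of data loss). The 24-hour exposure of nightly backups comes from the notes; every other figure (restore times, log-shipping interval, replication lag) and all names such as `Strategy` and `meets_objectives` are illustrative assumptions, not measurements from the talk.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    worst_case_data_loss_min: float  # age of the newest recoverable copy
    recovery_time_min: float         # ASSUMED time to restore service


def meets_objectives(s: Strategy, rto_min: float, rpo_min: float) -> bool:
    """True only if both the downtime and the data-loss targets are met."""
    return (s.recovery_time_min <= rto_min
            and s.worst_case_data_loss_min <= rpo_min)


# Targets from the talk's RTO / RPO example slide.
RTO_MIN = 15   # business tolerates up to 15 minutes of downtime
RPO_MIN = 10   # business tolerates a 10-minute window of data loss

strategies = [
    # Nightly backups can lose up to 24 hours of data (from the notes);
    # the 4-hour restore time is a hypothetical figure.
    Strategy("nightly backup only", 24 * 60, 240),
    # Hypothetical: transaction logs copied off-host every 5 minutes.
    Strategy("nightly backup + log shipping", 5, 12),
    # Hypothetical: about 1 minute of replication lag, 5-minute failover.
    Strategy("continuous replication to 2nd host", 1, 5),
]

for s in strategies:
    verdict = "meets" if meets_objectives(s, RTO_MIN, RPO_MIN) else "misses"
    print(f"{s.name}: {verdict} RTO={RTO_MIN}min / RPO={RPO_MIN}min")
```

Only the data-loss and recovery-time numbers you actually measure in a restore test should go into a check like this; until a restore has been rehearsed, the recovery time is a guess.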
  • Transcript of "System Availability Talk"

    1. Availability and Recoverability
       Michael Richardson
       Twitter: @Mr_SPB
       © 2011 Energized Work - www.energizedwork.com
    2. So what is High Availability?
       • Five 9s?
       • No single point of failure?
       • Multiple data centres?
       • Fault tolerance?
       • Load balancing?
       • Uptime?
       © 2012 Energized Work - www.energizedwork.com
    3. The 9s of Availability
       99
    4. The 9s of Availability
       Availability              Downtime per year
       One nine (90%)            36.5 days
       Two nines (99%)           3.65 days
       Three nines (99.9%)       8.76 hours
       Four nines (99.99%)       52.56 minutes
       Five nines (99.999%)      5.26 minutes
    5. Problem with the 9s
       • What do they mean?
       • Guaranteed or just an SLA?
       • Multiplicity (99.9% * 99.9% * 99.9% = 99.7%)
    6. SLA availability numbers: just aim to provide a level of confidence in a website’s service
    7. No Single Point of Failure (SPOF)
    8. Two of everything?
    9. Start with this
       (diagram labels: Index.html, Users)
    10. End with this
       (diagram labels: WEB1, switch 1, switch 2, WEB2, APP1, APP2, DB1, DB2, Firewall 1, Firewall 2, Users)
    11. Problems with eliminating SPOF
       • It’s expensive ££
       • Where do you draw the line?
       • Are failures independent?
       • Can you guarantee no SPOF?
       • Increased complexity
    12. Problem: Data Centres Fail
    13. Solution: Get a 2nd Data Centre
    14. Hot/Hot Multisite
       • Full range of services available in multiple locations.
       • Easy to automate failover of sites
       • Data consistency is hard.
       • Capacity planning concerns
    15. Hot/Warm Multisite
       • Simpler than Hot/Hot
       • Read/write ratio dependent
       • Synchronously or asynchronously replicate data?
    16. Hot/Cold Multisite
       • Easy to set up
       • Will it work?
       • Can it be trusted?
       • Cold sites rapidly become stale
       • Is it actually valuable?
    17. DR Multisite
       • Fingers crossed you never need it.
       • How can/should you test it?
       • Cloud?
    18. Problems with Multiple Sites
       • ££ - it’s expensive
       • Managing more systems
       • Managing consistency of data
       • Managing capacity
       • Is it still fail-proof?
       • Unless you test it, it’s just a plan
    19. We now have a Complex System
    20. Complex Systems
       • More redundancy and automation leads to more complexity.
       • More complexity often adds more points of failure.
    21. “How Complex Systems Fail”
       Author: Dr. Richard Cook
       • Catastrophe is always just around the corner.
       • Human operators have dual roles.
       • Change introduces new forms of failure
    22. Failure and Recovery
    23. Questions for the Customer
       • What is the cost of downtime?
       • What are the RTO and RPO?
    24. RTO = Recovery Time Objective
       RPO = Recovery Point Objective
    25. Aggressive RTO & RPO is expensive and has a performance impact.
    26. RTO / RPO example: problem
       • Simple DB
       • Business can tolerate up to 15 minutes of downtime
       • 10-minute window of data loss.
    27. RTO / RPO example: possible solution
       1. Continuously replicate data to a 2nd host
       2. Continue with nightly backups and also copy DB transaction logs from the primary host to another system.
    28. So what’s more important?
       Increasing Availability
       Or
       Reducing Recovery Time
    29. MTBF
       Or
       MTTR
       What about MTTD??
    30. Answer? It Depends
    31. Failure is inevitable
    32. Ask anyone
    33. Thank you
       The End
       Twitter - @Mr_SPB
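
The downtime table on slide 4 and the multiplicity figure on slide 5 follow from simple arithmetic, sketched below. The function names are made up for illustration. The `redundant` helper assumes component failures are fully independent, which is exactly the assumption slide 11 questions, so treat its result as an upper bound rather than a prediction.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime per year for an availability fraction (0.999 = 99.9%)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def serial(*availabilities: float) -> float:
    """Availability of components that must ALL be up (a chain of dependencies)."""
    product = 1.0
    for a in availabilities:
        product *= a
    return product


def redundant(availability: float, copies: int = 2) -> float:
    """Availability of N redundant copies where any one suffices.

    ASSUMPTION: failures are independent. Shared power, network,
    configuration or operators break this assumption in practice.
    """
    return 1.0 - (1.0 - availability) ** copies


# Reproduce the slide 4 table.
for label, a in [("90%", 0.9), ("99%", 0.99), ("99.9%", 0.999),
                 ("99.99%", 0.9999), ("99.999%", 0.99999)]:
    hours = downtime_hours_per_year(a)
    print(f"{label}: {hours:.4f} hours/year ({hours * 60:.2f} minutes)")

# Slide 5: three 99.9% components in series.
print(f"serial: {serial(0.999, 0.999, 0.999) * 100:.1f}%")

# "Two of everything": two independent 99.9% copies.
print(f"redundant pair: {redundant(0.999) * 100:.4f}%")
```

The serial case is why a stack of individually "three nines" layers delivers less than three nines end to end, and the redundant case is only as good as the independence of the two copies.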