Architecting for High-Availability and Multi-Cloud Environments<br />Brian Adler - Sr. Professional Services Architect, Ri...
Agenda<br /><ul><li>Terminology/Level-Setting
Takeaways
Designing for Failure
Cloud and component definitions (more terminology)
Architectural options and considerations to protect against cloud failures
Conclusions / Q&A</li></li></ul><li>Terminology<br /><ul><li>Fault Tolerance
Designs incorporating redundancy and replication to enable systems to continue operating properly (perhaps at a degraded l...
High Availability (HA)
Fault Tolerant systems are measured by their Availability in terms of planned and unplanned service outages for end users
99% Availability = 3.65 days of downtime per year
99.9% Availability = 8.76 hours of downtime per year
99.99% Availability = 53 minutes of downtime per year
99.999% Availability = 5.26 minutes of downtime per year
Disaster Recovery (DR)
The process, policies and procedures related to restoring critical systems after a catastrophic event</li></li></ul><li>Ag...
Takeaways
Designing for Failure
Cloud and component definitions (more terminology)
Architectural options and considerations to protect against cloud failures
Conclusions / Q&A</li></li></ul><li>Takeaways<br /><ul><li>Introduction to architectural options for designing highly-avai...
Best Practices for implementation of these architectural options
Multi-Availability Zone (AZ), Multi-Region and Multi-Cloud
Architectural options
Considerations / pros and cons of these options</li></li></ul><li>Agenda<br /><ul><li>Terminology/Level-Setting
Takeaways
Designing for Failure
Upcoming SlideShare
Loading in …5
×

Architecting for High-Availability and Multi-Cloud Environments

3,939 views

Published on

RightScale User Conference NYC 2011 -

Brian Adler - Solutions Architect, RightScale

As a result of assisting many customers with complex multi-tiered deployments across diverse industries and use cases, RightScale Professional Services has amassed a set of best practices for deploying applications in highly available, fault-tolerant environments. These configurations may span multiple zones, regions, and clouds from multiple vendors, or they may simply be designed for portability so they can be readily redeployed from one cloud to another for performance, price, geo-location, or other drivers. This session will cover best practices and tips and tricks for architecting multi-zone and multi-cloud deployments.

Published in: Technology
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
3,939
On SlideShare
0
From Embeds
0
Number of Embeds
142
Actions
Shares
0
Downloads
112
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide
  • Cold DR(Most common... hours) Staged Server Configuration and generally no staged data. Bring up the servers and load the data to failover. Cold DR failover is typically manual.Warm DR(Recommended... &gt;hour) Staged Server Configuration, pre-staged data and running Database Slave Server. Warm DR failover is typically manual but can be automated.Hot DR(Least common... but needed if &lt;5 min) Parallel Deployment with all servers running but all traffic going to primary. Hot DR failover is normally automated.Hot HALive/Live configuration. May use Geo-target IP services to direct traffic to regional load balancers. Failover to other region if one has problems. Hot HA is normally seamlessly automated.
  • Architecting for High-Availability and Multi-Cloud Environments

    1. 1. Architecting for High-Availability and Multi-Cloud Environments<br />Brian Adler - Sr. Professional Services Architect, RightScale<br />June 8th, 2011<br />
    2. 2. Agenda<br /><ul><li>Terminology/Level-Setting
    3. 3. Takeaways
    4. 4. Designing for Failure
    5. 5. Cloud and component definitions (more terminology)
    6. 6. Architectural options and considerations to protect against cloud failures
    7. 7. Conclusions / Q&A</li></li></ul><li>Terminology<br /><ul><li>Fault Tolerance
    8. 8. Designs incorporating redundancy and replication to enable systems to continue operating properly (perhaps at a degraded level) if one or more components fails
    9. 9. High Availability (HA)
    10. 10. Fault Tolerant systems are measured by their Availability in terms of planned and unplanned service outages for end users
    11. 11. 99% Availability = 3.65 days of downtime per year
    12. 12. 99.9% Availability = 8.76 hours of downtime per year
    13. 13. 99.99% Availability = 53 minutes of downtime per year
    14. 14. 99.999% Availability = 5.26 minutes of downtime per year
    15. 15. Disaster Recovery (DR)
    16. 16. The process, policies and procedures related to restoring critical systems after a catastrophic event</li></li></ul><li>Agenda<br /><ul><li>Terminology/Level-Setting
    17. 17. Takeaways
    18. 18. Designing for Failure
    19. 19. Cloud and component definitions (more terminology)
    20. 20. Architectural options and considerations to protect against cloud failures
    21. 21. Conclusions / Q&A</li></li></ul><li>Takeaways<br /><ul><li>Introduction to architectural options for designing highly-available, fault-tolerant applications
    22. 22. Best Practices for implementation of these architectural options
    23. 23. Multi-Availability Zone (AZ), Multi-Region and Multi-Cloud
    24. 24. Architectural options
    25. 25. Considerations / pros and cons of these options</li></li></ul><li>Agenda<br /><ul><li>Terminology/Level-Setting
    26. 26. Takeaways
    27. 27. Designing for Failure
    28. 28. Cloud and component definitions (more terminology)
    29. 29. Architectural options and considerations to protect against cloud failures
    30. 30. Conclusions / Q&A</li></li></ul><li>Designing for Failure (in a good way)<br /><ul><li>Large scale failures in the cloud are rare but do happen
    31. 31. Application owners are ultimately responsible for availability and recoverability
    32. 32. Need to balance cost and complexity of HA efforts against risk(s) you are willing to bear
    33. 33. Cloud infrastructure has made DR and HA remarkably affordable versus past options
    34. 34. Multi-Server
    35. 35. Multi-AZ
    36. 36. Multi-Region
    37. 37. Multi-Cloud</li></li></ul><li>Agenda<br /><ul><li>Terminology/Level-Setting
    38. 38. Takeaways
    39. 39. Designing for Failure
    40. 40. Cloud and component definitions (more terminology)
    41. 41. Architectural options and considerations to protect against cloud failures
    42. 42. Conclusions / Q&A</li></li></ul><li>What do we mean by “Cloud”?<br /><ul><li>A cloud is a physical datacenter entity behind an API endpoint
    43. 43. What does that really mean?
    44. 44. Amazon Web Services is not a cloud
    45. 45. EC2 is not a cloud
    46. 46. Eucalyptus, Cloud.com, OpenStack are not clouds
    47. 47. EC2 US-East, Rackspace, ‘my private cloud’… these are clouds
    48. 48. An availability zone is not a cloud (but it is part of one)</li></ul>Think of a cloud as a “resource pool” accessed via an API<br />
    49. 49. Regions & Availability Zones<br /><ul><li> Zones within a region share a LAN (high bandwidth, low latency, private IP access)
    50. 50. Zones utilize separate power sources, are physically segregated
    51. 51. Regions are “islands”, and share no resources. </li></li></ul><li>Overcoming Multi-Cloud Pain Points<br /><ul><li>APIs differ
    52. 52. Different sets of available resources
    53. 53. Different formats, encodings and versions
    54. 54. Abstractions and features differ
    55. 55. Network architectures differ: VLANs, security groups, NAT, IPs, ACLs, …
    56. 56. Storage architectures differ: local/attachable disks, backup, snapshots, …
    57. 57. Hypervisors, machine images, cost models, billing, reporting… etc.</li></ul>Each cloud is unique in some/many/all respects, with different<br />access mechanisms and varying functionalities provided<br />by the managed resources.<br />
    58. 58. Overcoming Multi-Cloud Pain Points<br /><ul><li>Navigating the obstacles
    59. 59. Design using generic concepts (“durable storage”) yet deploy using cloud specifics (“EBS volumes”)
    60. 60. Have tools that translate your concepts to cloud-specific ones (e.g. scripts/recipes that choose the correct provider for the desired resource)
    61. 61. Think of how to share resources across clouds (i.e. data sharing)</li></li></ul><li>Agenda<br /><ul><li>Terminology/Level-Setting
    62. 62. Takeaways
    63. 63. Designing for Failure
    64. 64. Cloud and component definitions (more terminology)
    65. 65. Infrastructure abstraction and automation as building blocks for highly available applications
    66. 66. Architectural options and considerations to protect against cloud failures
    67. 67. Conclusions / Q&A</li></li></ul><li>HA/DR Checklist for Risk Mitigation<br /><ul><li>Determine who owns the architecture, DR process and testing.
    68. 68. Develop expertise in-house and / or get outside help.
    69. 69. Conduct a risk assessment for each application.
    70. 70. Specify your target Recovery Time Objective and Recovery Point Objective.
    71. 71. Design for failure starting with application architecture. This will help drive the infrastructure architecture.
    72. 72. Implement HA best practices balancing cost, complexity and risk.
    73. 73. Automate infrastructure for consistency and reliability.
    74. 74. Document operational processes and automations.
    75. 75. Test the failover... then test it again.
    76. 76. Release the Chaos Monkey.</li></li></ul><li>Application Architecture Deployment Options<br />
    77. 77. General HA Best Practices<br /><ul><li>Avoid single points of failure.
    78. 78. Always place (at least) one of each component (load balancers, app servers, databases) in at least two AZs.
    79. 79. Maintain sufficient capacity to absorb AZ / cloud failures.
    80. 80. Reserved Instances – guarantee capacity is available in a separate region/cloud
    81. 81. Replicate data across AZs and backup or replicate across clouds/regions for failover.
    82. 82. Setup monitoring, alerts and operations to identify and automate problem resolution or failover process.
    83. 83. Design stateless applications for resilience to reboot / relaunch.</li></li></ul><li>Multi-Availability Zone<br />Consider distributed NoSQL databases with the same distribution considerations. Spread primary and replica nodes across multiple AZs. Place as many as you need for required resiliency.<br />Place Slave databases in one or more AZs for failover.<br />Snapshot data volume for backups so the database can be readily recovered within the region.<br />Consider local storage for additional slave database to remove dependency on attached volume <br />(Use LVM snapshots to create backups) <br />
    84. 84. Multi-Cloud Cold / Warm / Hot DR Options<br />No Downtime<br />Multi-Cloud HA<br />(Live/Live Config)<br />Hot DR<br />> 5 Minutes<br />(Least Common)<br />> 1 Hour<br />Warm DR<br />(Recommended)<br />Cold DR<br />> Few Hours<br />(Most Common)<br />$<br />$$<br />$$$<br />$$$$<br />
    85. 85. Multi-Cloud Cold DRStaged Server Configuration and generally no staged data<br />Data Copy Mechanism<br /><ul><li> Not recommended if rapid recovery is required
    86. 86. Slow to replicate data to other cloud
    87. 87. Slow to bring database to an operational state</li></li></ul><li>Multi-Cloud Warm DRStaged Server Configuration, pre-staged data and running Slave Database Server<br /><ul><li> Generally recommended DR solution
    88. 88. Minimal additional cost
    89. 89. Allows fairly rapid recovery</li></li></ul><li>Multi-Cloud Hot DRParallel Deployment with all servers running but all traffic going to primary<br /><ul><li> Not recommended. Very high additional cost
    90. 90. Allows rapid recovery, but not significantly faster than “warm” configuration</li></li></ul><li>Multi-Cloud HALive/Live configuration. May use Geo-target IP services to direct traffic to regional load balancers.<br /><ul><li> Possible, but not recommended (more to follow…). Maximum additional cost.
    91. 91. Provides high availability, but complex to implement and manage.</li></li></ul><li>Multi-Cloud HAMulti-Cloud looks similar to Multi-AZ… but there are additional problems to solve as some resources are not shared across clouds<br />Security is an issue as security groups are Region-specific.<br />You need DNS management or a global load balancer.<br />Machine Images are specific to the cloud /region.<br />You need to copy or replicate data yourself as snapshots are specific to the source Region.<br />Data migration requires manual synchronization or taking LVM snapshots and transferring the data.<br />
    92. 92. Agenda<br /><ul><li>Terminology/Level-Setting
    93. 93. Takeaways
    94. 94. Designing for Failure
    95. 95. Cloud and component definitions (more terminology)
    96. 96. Infrastructure abstraction and automation as building blocks for highly available applications
    97. 97. Architectural options and considerations to protect against cloud failures
    98. 98. Conclusions / Q&A</li></li></ul><li>So What’s Best?<br /><ul><li>Design for failure
    99. 99. Assume everything will fail and architect a solution capable of handing each and every failure condition
    100. 100. No one size fits all solution is available
    101. 101. Every application has its own architecture
    102. 102. Not all infrastructure building blocks fit well with all applications
    103. 103. Tradeoffs between levels of resiliency and cost
    104. 104. The options available in the cloud today are unprecedented
    105. 105. Capabilities for global redundancy
    106. 106. Time to access
    107. 107. Investment required
    108. 108. Follow our High Availability Checklist (or create your own)
    109. 109. Multi-AZ configurations with a solid DR plan are generally the most viable and cost-conscious solutions</li></li></ul><li>Questions?<br />Brian Adler - Sr. Professional Services Architect, RightScale<br />June 8th, 2011<br />
    110. 110. We hope to see you at our next RightScale User Conference!<br />See all presentations and videos at RightScale.com/Conference.<br />

    ×