Cloud Immortality - Architecting for High Availability & Disaster Recovery

Like this? Share it with your network

Share

Cloud Immortality - Architecting for High Availability & Disaster Recovery

  • 978 views
Uploaded on

RightScale Conference Santa Clara 2011: RightScale is involved in building diverse cloud environments for our customers. Want to deploy applications in highly available, fault-tolerant......

RightScale Conference Santa Clara 2011: RightScale is involved in building diverse cloud environments for our customers. Want to deploy applications in highly available, fault-tolerant environments? Across multiple zones, regions, and clouds providers? We’ve walked the walk. And we have five years of best practices to share. This session will cover best practices and tips and tricks for architecting highly available, fault-tolerant multi-zone and multi-cloud deployments.

More in: Technology , Business
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
978
On Slideshare
903
From Embeds
75
Number of Embeds
2

Actions

Shares
Downloads
44
Comments
0
Likes
1

Embeds 75

http://www.rightscale.com 73
http://localhost 2

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. 1Cloud Immortality: Architecting forHigh Availability & Disaster RecoveryBrian Adler, Sr. Professional Services Architect, RightScaleJosep Blanquer, Sr. Systems Architect, RightScale Watch the video of this presentation
  • 2. 2#Agenda• Terminology/Level-Setting• Takeaways• Designing for Failure• Cloud and component definitions (more terminology)• HA and DR Checklists and Best Practices• Building a complete HA example across multiple clouds• Conclusions / Q&AReal Cloud Experience. Shared.
  • 3. 3#Terminology• Fault Tolerance • Designs incorporating redundancy and replication to enable systems to continue operating properly (perhaps at a degraded level) if one or more components fails• High Availability (HA) • Fault Tolerant systems are measured by their Availability in terms of planned and unplanned service outages for end users • 99% Availability = 3.65 days of downtime per year • 99.9% Availability = 8.76 hours of downtime per year • 99.99% Availability = 53 minutes of downtime per year • 99.999% Availability = 5.26 minutes of downtime per year• Disaster Recovery (DR) • The process, policies and procedures related to restoring critical systems after a catastrophic eventReal Cloud Experience. Shared.
  • 4. 4#Agenda• Terminology/Level-Setting• Takeaways• Designing for Failure• Cloud and component definitions (more terminology)• HA and DR Checklists and Best Practices• Building a complete HA example across multiple clouds• Conclusions / Q&AReal Cloud Experience. Shared.
  • 5. 5#Takeaways• Introduction to architectural options for designing highly- available, fault-tolerant applications• Best Practices for implementation of these architectural options• Multi-Availability Zone (AZ), Multi-Region, and Multi-Cloud • Architectural options • Considerations/pros and cons of these optionsReal Cloud Experience. Shared.
  • 6. 6#Agenda• Terminology/Level-Setting• Takeaways• Designing for Failure• Cloud and component definitions (more terminology)• HA and DR Checklists and Best Practices• Building a complete HA example across multiple clouds• Conclusions / Q&AReal Cloud Experience. Shared.
  • 7. 7#Designing for Failure• Large scale failures in the cloud are rare but do happen• Application owners are ultimately responsible for availability and recoverability• Need to balance cost and complexity of HA efforts against risk(s) you are willing to bear• Cloud infrastructure has made DR and HA remarkably affordable versus past options • Multi-server • Multi-AZ • Multi-Region • Multi-CloudReal Cloud Experience. Shared.
  • 8. 8#Agenda• Terminology/Level-Setting• Takeaways• Designing for Failure• Cloud and component definitions (more terminology)• HA and DR Checklists and Best Practices• Building a complete HA example across multiple clouds• Conclusions / Q&AReal Cloud Experience. Shared.
  • 9. 9#What do we mean by “Cloud”?• A cloud is a physical datacenter entity behind an API endpoint• What does that really mean? • Amazon Web Services is not a cloud • EC2 is not a cloud • Eucalyptus, Cloud.com, OpenStack are not clouds • EC2 US-East, Rackspace, „my private cloud‟… these are clouds • An availability zone is not a cloud (but it is part of one) Think of a cloud as a “resource pool” accessed via an APIReal Cloud Experience. Shared.
  • 10. 10#Regions & Availability Zones •Zones within a region share a LAN (high bandwidth, low latency, private IP access) •Zones utilize separate power sources, are physically segregated •Regions are “islands”, and share no resourcesReal Cloud Experience. Shared.
  • 11. 11#Overcoming Multi-Cloud Pain Points• APIs differ • Different sets of available resources • Different formats, encodings and versions• Abstractions and features differ • Network architectures differ: VLANs, security groups, NAT, IPs, ACLs, … • Storage architectures differ: local/attachable disks, backup, snapshots, … • Hypervisors, machine images, cost models, billing, reporting…etc. Each cloud is unique in some/many/all respects, with different access mechanisms and varying functionalities provided by the managed resources.Real Cloud Experience. Shared.
  • 12. 12#Overcoming Multi-Cloud Pain Points• Navigating the obstacles • Design using generic concepts (“durable storage”) yet deploy using cloud specifics (“EBS volumes”) • Have tools that translate your concepts to cloud-specific ones (e.g. scripts/recipes that choose the correct provider for the desired resource) • Think of how to share resources across clouds (i.e. data sharing)Real Cloud Experience. Shared.
  • 13. 13#Agenda• Terminology/Level-Setting• Takeaways• Designing for Failure• Cloud and component definitions (more terminology)• Infrastructure abstraction and automation as building blocks for highly available applications• HA and DR Checklists and Best Practices• Building a complete HA example across multiple clouds• Conclusions / Q&AReal Cloud Experience. Shared.
  • 14. 14#HA/DR Checklist for Risk Mitigation Determine who owns the architecture, DR process and testing. Develop expertise in-house and / or get outside help. Conduct a risk assessment for each application. Specify your target Recovery Time Objective and Recovery Point Objective. Design for failure starting with application architecture. This will help drive the infrastructure architecture. Implement HA best practices balancing cost, complexity and risk.  Automate infrastructure for consistency and reliability. Document operational processes and automations. Test the failover... then test it again. Release the Chaos Monkey.Real Cloud Experience. Shared.
  • 15. 15#General HA Best Practices Avoid single points of failure Always place (at least) one of each component (load balancers, app servers, databases) in at least two AZs Maintain sufficient capacity to absorb AZ / cloud failures  Reserved Instances – guarantee capacity is available in a separate region/cloud Replicate data across AZs and backup or replicate across clouds/regions for failover Setup monitoring, alerts and operations to identify and automate problem resolution or failover process Design stateless applications for resilience to reboot / relaunchReal Cloud Experience. Shared.
  • 16. 16#Agenda• Terminology/Level-Setting• Takeaways• Designing for Failure• Cloud and component definitions (more terminology)• Infrastructure abstraction and automation as building blocks for highly available applications• HA and DR Checklists and Best Practices• Building a complete HA example across multiple clouds• Conclusions / Q&AReal Cloud Experience. Shared.
  • 17. 17#The Start: A Typical Single-Cloud Application • Three tiered Application Cloud 1 • RR DNS Load Balancers Zone1 Zone2 • Array of Application Servers LB LB • Master – Slave Databases App App … App • Isolation Features • Faults: using Availability Zones Master Slave • Security: Security Groups / Firewalls • Plus • Monitoring • Alerts • AutomationReal Cloud Experience. Shared.
  • 18. 18#Scaling: Bursting App Capacity to another Cloud Private Cloud Public Cloud LB LB App App … App Master SlaveReal Cloud Experience. Shared.
  • 19. 19#Scaling: Bursting App Capacity to another Cloud Private Cloud Public Cloud LB LB App App … App App App … App CPU Scale Up Scale Down ServerArray Master SlaveReal Cloud Experience. Shared.
  • 20. 20#Scaling: Bursting App Capacity to another Cloud Register with LB’s: • Using SSH Private Cloud • Using RightNet Public Cloud • Central repo LB LB App App … App Public Internet App App … App CPU Scale Up Scale Down ServerArray Master Slave Security: Contact DB: • Access Control • Using DNS / IP • Encryption • Executing a scriptReal Cloud Experience. Shared.
  • 21. 21#DR: Replicating Data in another Cloud Private Cloud Public Cloud LB LB App App … App Public Internet App App … App CPU Scale Up Scale Down ServerArray snap snap snap snap Master Slave snap snap snap snap snap snap Slave snap snap Access and encryption Where’s the data?Real Cloud Experience. Shared.
  • 22. 22#DR: Replicating Data in another Cloud Private Cloud Public Cloud LB LB App App … App Public Internet App App … App CPU Scale Up Scale Down ServerArray snap snap snap snap Master Slave snap snap snap snap snap snap Slave snap snap S3/Cloudfiles • Global storage • Live replication + rsync • NoSQLReal Cloud Experience. Shared.
  • 23. 23#Full distribution: Splitting the Load Balancers Private Cloud Public Cloud LB LB App App … App Public Internet App App … App CPU Scale Up Scale Down ServerArray snap snap snap snap Master Slave snap snap snap snap snap snap Slave snap snapReal Cloud Experience. Shared.
  • 24. 24#Agenda• Terminology/Level-Setting• Takeaways• Designing for Failure• Cloud and component definitions (more terminology)• Infrastructure abstraction and automation as building blocks for highly available applications• Building a complete HA example across multiple clouds• Conclusions / Q&AReal Cloud Experience. Shared.
  • 25. 25#So how do I make my service immortal?• Be pessimistic: Design for failure • Assume everything will fail and architect a solution capable of handling it • Trade off between levels of resiliency and cost• Embrace the Cloud mentality: Unprecedented features • Build architectural building blocks you can reuse, build up and teardown • Use the powerful provider capabilities within a cloud • Build the glue across clouds for global redundancy (RightScale helps a lot) • Automate every single last detail• Have it your way: No single architecture that fits all cases • Reuse all patterns that fit your use case • Customize where they don‟t fit• Crawl, then walk: Build HA within cloud, then expand • Cost exponentially increases, the 80-20 rule. • Multi-AZ HA, solid DR plan, then full multi-cloud HAReal Cloud Experience. Shared.