High Availability and Fault Tolerance: AWS + RightScale - RightScale Compute 2013


Speaker: Miles Ward - Solutions Architect, Amazon Web Services

Today’s technology systems deliver ever more critical capabilities to enterprises, startups, and all users in between. Amazon Web Services, the leader in Infrastructure-as-a-Service, has delivered several solutions that provide unique value for your efforts towards high availability and fault tolerance. Learn best practices for delivering these innovations to your operations from experienced HA innovator and AWS Solutions Architect Manager Miles Ward.

Published in: Technology, Business
  • Cloud computing is a better way to run your business. The cloud helps companies of all sizes become more agile. Instead of running your applications yourself, you can run them on the cloud, where IT infrastructure is offered as a service, like a utility. With the cloud, your company saves money: there are no up-front capital expenses, as you don’t have to buy hardware for your projects. The massive scale and fast pace of innovation of the cloud drive costs down for you. In the cloud, you pay only for what you use, just like electricity. The cloud can also help your company save time and improve agility. It’s faster to get started: you can build new environments in minutes, as you don’t need to wait for new servers to arrive. The elastic nature of the cloud makes it easy to scale up and down as needed. At the end of the day you have more resources left for innovation, which allows you to focus on projects that can really impact your business, like building and deploying more applications. “With the high growth nature of our business, we were looking for a cloud solution to enable us to scale fast. Think twice before buying your next server. Cloud computing is the way forward.” - Sami Lababidi, CTO, Playfish
  • AWS is useful for low-end traditional DR to high-end HA, but… AWS encourages a rethinking of traditional DR / HA practices
    • Everything in the cloud is “off-site” and (potentially) “multi-site”
    • Using multiple sites (multiple AZs) comes largely for free
    • Using multiple geographically-distributed sites (multiple Regions) is significantly cheaper and easier
    • Tends to move the default design point away from “cold” Disaster Recovery toward “hot” High Availability
    • Makes it easier to stack multiple mechanisms, e.g., Basic HA within one Region, DR site in second Region
  • DR and HA options
    • Cold DR (Most common... hours): Staged Server Configuration and generally no staged data. Bring up the servers and load the data to failover. Cold DR failover is typically manual.
    • Warm DR (Recommended... >hour): Staged Server Configuration, pre-staged data and running Database Slave Server. Warm DR failover is typically manual but can be automated.
    • Hot DR (Least common... but needed if <5 min): Parallel Deployment with all servers running but all traffic going to primary. Hot DR failover is normally automated.
    • Hot HA: Live/Live configuration. May use Geo-target IP services to direct traffic to regional load balancers. Failover to other region if one has problems. Hot HA is normally seamlessly automated.
  • Note: Other costs such as IOPS, volumes, other bandwidth, object storage, and snapshot storage are additional.
  • Transcript

    • 1. Safeguard Your Cloud Applications: High Availability and Fault Tolerance
    • 2. Agenda
      • Terminology/Level-Setting
      • Takeaways
      • Cloud and Component Definitions
      • Designing for Failure
      • Architectural Options and Considerations: High Availability, Disaster Recovery
      • Conclusions / Q&A
    • 3. Faults?
      • Facilities
      • Hardware
      • Networking
      • Code
      • People
    • 4. What is “Fault-Tolerant”?
      • Degrees of risk mitigation - not binary
      • Automated
      • Tested!
    • 5. Old School Fault-Tolerance: Build Two
    • 6. Cloud Computing Benefits
      • No Up-Front Capital Expense
      • Pay Only for What You Use
      • Self-Service Infrastructure
      • Easily Scale Up and Down
      • Improve Agility & Time-to-Market
      • Low Cost
      • Diagram label: Deploy
    • 7. Cloud Computing Fault-Tolerance Benefits
      • No Up-Front HA Capital Expense
      • Pay for DR Only When You Use It
      • Self-Service DR Infrastructure
      • Easily Deliver Fault-Tolerant Applications
      • Improve Agility & Time-to-Recovery
      • Low Cost Backups
      • Diagram label: Deploy
    • 8. AWS Cloud allows Overcast Redundancy
      • Have the shadow duplicate of your infrastructure ready to go when you need it…
      • …but only pay for what you actually use
    • 9. Old Barriers to HA are now Surmountable
      • Cost
      • Complexity
      • Expertise
    • 10. AWS Building Blocks: Two Strategies
      • Inherently fault-tolerant services: Amazon S3, Amazon SimpleDB, Amazon DynamoDB, Amazon CloudFront, Amazon SWF, Amazon SQS, Amazon SNS, Amazon SES, Amazon Route 53, Elastic Load Balancing, AWS Elastic Beanstalk, Amazon ElastiCache, Amazon Elastic MapReduce, AWS Identity and Access Management (IAM)
      • Services that are fault-tolerant with the right architecture: Amazon EC2, Amazon Virtual Private Cloud (Amazon VPC), Amazon Elastic Block Store (EBS), Amazon Relational Database Service (Amazon RDS)
    • 11. The Stack
      • Resources
      • Deployment
      • Management
      • Configuration
      • Networking
      • Facilities
      • Geographies
    • 12. Terminology
      • Ability of a system to continue operating properly (perhaps at a degraded level) if one or more components fails.
      • The process, policies and procedures related to restoring critical systems after a catastrophic event. Goal is to get application back up and running within a defined time period (RTO) and within a certain data loss window (RPO).
      • Fault Tolerant systems are measured by their Availability in terms of planned and unplanned service outages for end users.
    • 13. Terminology - continued
      • RTO (Recovery Time Objective): Time period in which service must be restored to meet BCP (Business Continuity Planning) objectives
      • RPO (Recovery Point Objective): Acceptable data loss as a result of recovering from a disaster/catastrophic event
      • RTO and RPO are often at odds, and tradeoffs need to be made in order to find an acceptable middle ground
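
As a rough illustration of the RTO/RPO tradeoff on this slide, the sketch below checks a recovery plan against its targets. The `RecoveryPlan` class, its fields, and the example numbers are hypothetical, chosen only for illustration; they are not figures from the talk.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    """Hypothetical recovery plan; all durations are in minutes."""
    replication_interval: float  # time between data copies to the DR site
    detection_time: float        # time needed to notice the outage
    restore_time: float          # time needed to bring the application back up

    def worst_case_data_loss(self) -> float:
        # Everything written since the last copy may be lost.
        return self.replication_interval

    def worst_case_downtime(self) -> float:
        return self.detection_time + self.restore_time


def meets_objectives(plan: RecoveryPlan, rto: float, rpo: float) -> bool:
    """True when the plan's worst cases fit inside the RTO and RPO targets."""
    return (plan.worst_case_downtime() <= rto
            and plan.worst_case_data_loss() <= rpo)


# Assumed example plans: a nightly backup versus a continuously fed replica.
nightly_backup = RecoveryPlan(replication_interval=24 * 60,
                              detection_time=10, restore_time=180)
streaming_replica = RecoveryPlan(replication_interval=1,
                                 detection_time=5, restore_time=20)

print(meets_objectives(nightly_backup, rto=60, rpo=15))
print(meets_objectives(streaming_replica, rto=60, rpo=15))
```

Tightening either target usually means paying for more frequent replication or more pre-staged infrastructure, which is the tradeoff the slide refers to.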
    • 14. Takeaways
      • Understand core concepts behind HA and DR
      • Introduction to architectural options for designing HA, fault-tolerant applications and DR environments and procedures
      • Best Practices for implementation of these architectural options within AWS (independent of RightScale)
      • Multi-Availability Zone (AZ) and Multi-Region
      • Architectural options and Considerations / pros and cons of these options
      • Understanding of the tools RightScale brings to AWS to simplify the creation of these HA and DR environments
    • 15. Regions & Availability Zones
      • Zones within a region share a LAN (high bandwidth, low latency, private IP access)
      • Zones utilize separate power sources, are physically segregated
      • Regions are “islands”, and share no resources.
      • Diagram (Source: AWS): Japan (Availability Zones A, B); EU West Region (Availability Zones A, B); US East Region (Availability Zones A, B, C); US West Region (Availability Zones A, B); Singapore (Availability Zones A, B)
    • 16. Designing for Failure
      • Large scale failures in the cloud are rare but do happen
      • Application owners are ultimately responsible for availability and recoverability
      • Balance cost and complexity of HA efforts against risk(s) you are willing to bear
      • Cloud infrastructure has made DR and HA remarkably affordable versus past options
        • Multi-Server
        • Multi-AZ (Availability Zone)
        • Multi-Region
      • “Everything fails, all the time.” Werner Vogels, CTO
    • 17. Designing for Failure: Basic Concepts
      • Fault tolerance is the goal. Degradation of service may occur, but application continues to function.
      • Avoid single points of failure (SPOF)
      • Assume everything fails (remember Werner’s mantra) and design accordingly
      • Plan and practice your recovery process (both for HA and DR)
      • Remember that better HA and DR equals more $$$. So find that acceptable balance.
    • 18. High Availability
      • Don’t sweat the small stuff. And it’s all small stuff* (*until it’s not)
      • Follow a few general best practices to absorb application component outages…
    • 19. General HA Best Practices
      • Avoid single points of failure.
      • Always place one of each component (load balancers, app servers, databases) in at least two AZs.
      • Replicate data across AZs (HA) and backup or replicate across regions for failover (DR)
      • Set up monitoring, alerts and operations to identify and automate problem resolution or failover process.
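
The "at least two AZs" rule can be checked mechanically. The sketch below is a minimal illustration, assuming a deployment is described as a list of (component type, zone) pairs; the function name and the inventory data are hypothetical and do not come from any AWS or RightScale API.

```python
from collections import defaultdict


def single_zone_components(placements):
    """Return component types deployed in fewer than two Availability Zones.

    `placements` is an iterable of (component_type, zone) pairs.
    """
    zones_by_component = defaultdict(set)
    for component, zone in placements:
        zones_by_component[component].add(zone)
    return sorted(component
                  for component, zones in zones_by_component.items()
                  if len(zones) < 2)


# Assumed example inventory: the databases share one zone.
deployment = [
    ("load_balancer", "us-east-1a"),
    ("load_balancer", "us-east-1b"),
    ("app_server", "us-east-1a"),
    ("app_server", "us-east-1b"),
    ("database", "us-east-1a"),
    ("database", "us-east-1a"),  # master and slave in the same zone
]

print(single_zone_components(deployment))
```

Here the database tier is flagged, because running two database servers in one zone still leaves a zone-level single point of failure.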
    • 20. Customer example
      • High availability for top web properties with 270M visitors/month
      • Migration from datacenter to AWS
      • RightScale provides
        • Self-service access to developers
        • Consistency and low maintenance
        • Usage and cost accounting
        • Multi-region architectures to avoid downtime
    • 21. Multi-Zone HA
      • Diagram labels: DNS, load balancers, app servers (autoscale), master DB, slave DB, replicate, snapshots, EBS, S3, US-EAST 1a, US-EAST 1b, 172.168.7.31
      • Data volume for backups so the database can be readily recovered within the region.
      • Place Slave databases in one or more zones for failover.
      • Consider local storage for additional slave database to remove dependency on attached volume
      • Consider distributed NoSQL databases with the same distribution considerations.
    • 22. Disaster Recovery
      • DR presents a few new wrinkles compared to HA, but there are multiple options depending on your needs and budget…
      • Don’t sweat the small stuff. And it’s all small stuff* (*until it’s not)
    • 23. HA/DR Checklist for Risk Mitigation
      • Determine who owns the architecture, DR process and testing.
      • Develop expertise in-house and / or get outside help.
      • Conduct a risk assessment for each application.
      • Specify your target RTO and RPO.
      • Design for failure starting with application architecture. This will help drive the infrastructure architecture.
    • 24. HA/DR Checklist for Risk Mitigation
      • Implement HA best practices balancing cost, complexity and risk.
        • Automate infrastructure for consistency and reliability.
      • Document operational processes and automations.
      • Test the failover... then test it again.
      • Release the Chaos Monkey.
    • 25. Multi-Region/Cloud DR Options
      • Options: Cold DR (Most Common), Warm DR (Recommended), Hot DR (Least Common), Multi-Cloud HA (Live/Live Config)
      • Cost: $, $$, $$$, $$$$
      • Downtime axis: 0, < 5 Mins, < 1 Hour, > 1 Hour
      • Availability axis: 99.999%, 99.9%, 99.5%, 99%
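
To put the availability percentages on this slide in perspective, an availability figure can be converted into a yearly downtime budget. This is standard arithmetic rather than something taken from the talk, and the sketch assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Maximum yearly downtime allowed by an availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for availability in (99.0, 99.5, 99.9, 99.999):
    minutes = downtime_minutes_per_year(availability)
    print(f"{availability}% -> {minutes:.1f} min/year ({minutes / 60:.2f} h)")
```

At 99% the budget is roughly 87.6 hours per year, while 99.999% leaves only about 5.3 minutes, which is why the higher tiers depend on automated failover.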
    • 26. Multi-Region Cold DR
      • Diagram labels: DNS, load balancers, app servers, master DB, slave DB, replicate, snapshots, EBS, S3, US EAST, US WEST, 172.168.7.31
      • Staged Server Configuration and generally no staged data
      • Not recommended if rapid recovery is required
      • Slow to replicate data to other cloud and bring database online
    • 27. Multi-Region Warm DR
      • Diagram labels: DNS, load balancers, app servers, master DB, slave DB, replicate, snapshots, EBS, S3, US EAST, US WEST, 172.168.7.31
      • Staged Server Configuration, pre-staged data and running Slave Database Server
      • Generally recommended DR solution
      • Minimal additional cost and allows fairly rapid recovery
    • 28. Multi-Region Hot DR
      • Diagram labels: DNS, load balancers, app servers, master DB, slave DB, replicate, snapshots, EBS, S3, US EAST, US WEST, 172.168.7.31
      • Parallel Deployment with all servers running but all traffic going to primary
      • Not recommended
      • Very high additional cost to allow rapid recovery
    • 29. Hybrid HA
      • Diagram labels: DNS, load balancers, app servers, master DB, slave DB, replicate, snapshots, EBS, SWIFT, CHICAGO, 172.168.7.31
      • Live/Live configuration. Geo-target IP services to direct traffic to regional LBs.
      • Possible, but not recommended (more to follow…)
      • Max additional cost and max availability, but complex to implement and manage
    • 30. HA
      • Diagram labels: DNS, load balancers, app servers, master DB, slave DB, replicate, snapshots, EBS volume, SWIFT, CHICAGO, 172.168.7.31
      • You need DNS management or a global load balancer.
      • Security requires addt’l effort as security groups are Region-specific.
      • Machine Images are specific to the cloud/region.
      • Looks similar to Multi-Zone… but additional problems to solve as some resources are not shared
    • 31. Customer example: Coupa
      • Procurement software
      • SLAs to their customers require HA
      • Subway chain is a customer that procures perishable goods through Coupa
    • 32. In the Dashboard
      • Screenshot callouts: Multi-region or cloud; Multi-region Warm DR; Staged servers; Cost forecasting for DR environment
    • 33. Automating HA and DR
      • Use dynamic DNS for your database servers
        • Allow app servers to use a single FQDN.
        • Use a low TTL to allow rapid failover in the case of a change in master database
      • Automatic connection of app servers to load balancing servers
        • App servers can connect to all load balancers automatically at launch
        • No manual intervention
        • No DNS modifications
      • Automated promotion of slave to master
        • Process is automated
        • Decision to run process is manual
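
The effect of a low TTL on database failover can be seen with a toy model of a caching resolver. This is a simplified sketch, not a real DNS client: the `CachingResolver` class, the host name, and the IP addresses are all hypothetical, and time is passed in explicitly so the behavior is deterministic.

```python
class CachingResolver:
    """Toy resolver that caches answers for a fixed TTL (in seconds)."""

    def __init__(self, authoritative_records, ttl_seconds):
        self.records = authoritative_records  # name -> current address
        self.ttl = ttl_seconds
        self._cache = {}                      # name -> (address, expiry time)

    def resolve(self, name, now):
        cached = self._cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]                  # still fresh: serve cached answer
        address = self.records[name]          # expired or missing: ask again
        self._cache[name] = (address, now + self.ttl)
        return address


# Assumed names and addresses, for illustration only.
records = {"db-master.example.internal": "10.0.0.10"}
resolver = CachingResolver(records, ttl_seconds=60)

print(resolver.resolve("db-master.example.internal", now=0))   # old master
records["db-master.example.internal"] = "10.0.1.20"            # slave promoted at t=5
print(resolver.resolve("db-master.example.internal", now=30))  # cache still holds old master
print(resolver.resolve("db-master.example.internal", now=60))  # TTL expired, new master
```

In this model an app server can keep using the old master for up to one TTL after the record changes, so the TTL adds directly to the failover time.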
    • 34. How RightScale makes it possible: MultiCloud Images
      • MultiCloud Images can be launched across regions and hybrid without modification
      • Diagram: MultiCloud Images (Cloud A, RightImage 1; Cloud B, RightImage 2; Cloud C, RightImage 3)
        • 1. ServerTemplate contains a list of MultiCloud Images (MCIs)
        • 2. When the Server is created, a specific MCI is chosen (Cloud A, RightImage 1).
        • 3. The appropriate RightImage is used at launch (Cloud A, Image 1).
      • RightImage: Stability across clouds
    • 35. How RightScale makes it possible: ServerTemplates, Tags, and Inputs
      • Automated load balancer registration and database connections
      • Autoscaling across zones
      • Dynamic configuration
    • 36. DR Cost Comparison Example
      • Multi-Region Cold DR: Total $4480 / month
        • Running, $4470 / month: 3 Load Balancers (Large), 6 App Servers (XLarge), 1 Master DB (2XLarge), 1 Slave DB (2XLarge)
        • Staged, $0 / month: 3 Load Balancers (Large), 6 App Servers (XLarge), 1 Slave DB (2XLarge)
        • Replication, $10 / month: 25GB / day cross-zone
      • Multi-Region Warm DR: Total $5630 / month
        • Running, $5540 / month: 3 Load Balancers (Large), 6 App Servers (XLarge), 1 Master DB (2XLarge), 2 Slave DB (2XLarge)
        • Staged, $0 / month: 3 Load Balancers (Large), 6 App Servers (XLarge)
        • Replication, $90 / month: 25GB / day cross-region
      • Multi-Region Hot DR: Total $8800 / month
        • Running, $8440 / month: 6 Load Balancers (Large), 12 App Servers (XLarge), 1 Master DB (2XLarge), 2 Slave DB (2XLarge)
        • Replication, $360 / month: 100GB / day cross-region
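
The totals on this slide are the sum of the running, staged, and replication lines. The short sketch below reproduces that arithmetic from the slide's figures and compares each option with Cold DR; the dictionary layout is an illustrative choice, and the staged cost for Hot DR is assumed to be $0 because the slide lists no staged servers for it.

```python
# Monthly costs in USD, taken from the slide.
dr_costs = {
    "cold": {"running": 4470, "staged": 0, "replication": 10, "total": 4480},
    "warm": {"running": 5540, "staged": 0, "replication": 90, "total": 5630},
    # Assumption: no staged servers are listed for Hot DR, so staged = 0.
    "hot": {"running": 8440, "staged": 0, "replication": 360, "total": 8800},
}


def computed_total(parts):
    """Sum the cost components of one DR option."""
    return parts["running"] + parts["staged"] + parts["replication"]


def ratio_to_cold(option):
    """Monthly cost of an option relative to Cold DR."""
    return dr_costs[option]["total"] / dr_costs["cold"]["total"]


for name, parts in dr_costs.items():
    print(f"{name}: ${computed_total(parts)} / month, "
          f"{ratio_to_cold(name):.2f}x the cost of cold DR")
```

On these numbers Warm DR costs about a quarter more than Cold DR, while Hot DR costs nearly twice as much.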
    • 37. Outage-Proofing Best Practices
      • Place in >1 zone: load balancers, app servers, databases
      • Maintain capacity to absorb zone or region failures
      • Replicate data across zones
      • Design stateless apps for resilience to reboot / relaunch
      • Backup across regions
      • Monitor, alert, and automate operations to speed up failover
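
"Maintain capacity to absorb zone failures" can be turned into a sizing rule: if the load needs N servers and one of Z zones may disappear, the remaining Z - 1 zones must still hold N servers. The sketch below is a minimal illustration assuming identical servers spread evenly across zones; the function names are hypothetical.

```python
import math


def servers_per_zone(required_servers: int, zones: int) -> int:
    """Servers to run in each zone so that losing one zone still leaves
    at least `required_servers` running in the remaining zones."""
    if zones < 2:
        raise ValueError("need at least two zones to survive a zone failure")
    return math.ceil(required_servers / (zones - 1))


def total_servers(required_servers: int, zones: int) -> int:
    """Total fleet size across all zones under the same rule."""
    return servers_per_zone(required_servers, zones) * zones


for zones in (2, 3, 4):
    print(f"{zones} zones: {servers_per_zone(6, zones)} per zone, "
          f"{total_servers(6, zones)} total")
```

For a load that needs 6 app servers, two zones require 12 servers in total while three zones require 9, so spreading across more zones lowers the cost of the spare capacity.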
    • 38. … and Q&A
      • RightScale
      • Try: RightScale Free
      • Free: 1.866.720.0208
      • Int’l: 1.805.855.0265