Outage-Proof Your Applications - RightScale Compute 2013

Speakers:
Brian Adler - Sr. Services Architect, RightScale
Sanket Naik - VP of Cloud Operations, Coupa

Design for failure. Easier said than done? RightScale experts will review best practices for application architectures that survive cloud outages, reduce mean time to recovery, and meet your SLAs.

Published in: Technology

  1. Outage-Proof Your Applications
     Brian Adler, Sr. Services Architect, RightScale
     Sanket Naik, VP Cloud Operations, Coupa
     April 25-26, San Francisco. Cloud success starts here.
  2. Agenda (#RightscaleCompute)
     • Introductions
     • Terminology/Level-Setting
     • Takeaways
     • Cloud and Component Definitions
     • Designing for Failure
     • Architectural Options and Considerations
       - High Availability
       - Disaster Recovery
     • Conclusions / Q&A
  3. Our Mission
     Delivering software innovation that breeds responsible spending while impacting the company bottom line
  4. What does success look like?
     • Manual Companies: 10% Spend Under Management
     • Most Companies: 30%-50% Spend Under Management
     • Spend Under Management: 80%
     Source: Aberdeen Group
  5. Spend Happens
     • Pre-Approved
     • Un-Approved
     • Post-Approved
     We are the only solution that captures all three
     • Procurement
     • Expense Management
     • Invoice Management
  6. Over 300 Customers and growing… You
     smarter spending simplified
  7. Terminology
     • Ability of a system to continue operating properly (perhaps at a degraded level) if one or more components fail.
     • The process, policies and procedures related to restoring critical systems after a catastrophic event. Goal is to get application back up and running within a defined time period (RTO) and within a certain data loss window (RPO).
     • Fault Tolerant systems are measured by their Availability in terms of planned and unplanned service outages for end users.
  8. Terminology - continued
     • RTO: Time period in which service must be restored to meet BCP (Business Continuity Planning) objectives
     • RPO: Acceptable data loss as a result of recovering from a disaster/catastrophic event
     RTO and RPO are often at odds, and tradeoffs need to be made in order to find an acceptable middle ground
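The RTO/RPO tradeoff can be made concrete with a rough check of a recovery plan against its targets. The sketch below is illustrative only: the `DrPlan` fields, the step durations, and the assumption that worst-case data loss equals the backup interval are simplifications that do not come from the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DrPlan:
    # Hypothetical plan description; field names are assumptions for this sketch.
    backup_interval_min: float      # minutes between snapshots or replication syncs
    restore_steps_min: List[float]  # estimated minutes for each recovery step


def worst_case_data_loss(plan: DrPlan) -> float:
    # Simplification: a failure just before the next backup loses one full interval.
    return plan.backup_interval_min


def estimated_recovery_time(plan: DrPlan) -> float:
    # Simplification: recovery steps run one after another.
    return sum(plan.restore_steps_min)


def meets_objectives(plan: DrPlan, rto_min: float, rpo_min: float) -> dict:
    return {
        "rto_ok": estimated_recovery_time(plan) <= rto_min,
        "rpo_ok": worst_case_data_loss(plan) <= rpo_min,
    }


# Hourly snapshots and a 40-minute restore, checked against a 60-minute RTO
# and a 15-minute RPO.
plan = DrPlan(backup_interval_min=60, restore_steps_min=[10, 25, 5])
result = meets_objectives(plan, rto_min=60, rpo_min=15)
print(result)
```

In this made-up case the plan recovers quickly enough but loses too much data, so meeting the RPO would need more frequent backups or continuous replication, which usually costs more.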
  9. Takeaways
     • Understand core concepts behind HA and DR
     • Introduction to architectural options for designing HA, fault-tolerant applications and DR environments and procedures
     • Best Practices for implementation of these architectural options within AWS (independent of RightScale)
     • Multi-Availability Zone (AZ) and Multi-Region
     • Architectural options and Considerations / pros and cons of these options
     • Understanding of the tools RightScale brings to AWS to simplify the creation of these HA and DR environments
  10. Regions & Zones
      • Zones within a region share a LAN (high bandwidth, low latency, private IP access)
      • Zones utilize separate power sources, are physically segregated
      • Regions are “islands”, and share no resources.
      [Diagram: Regions 1 to 5; Region 1 has Availability Zones A, B and C; Regions 2 to 5 each have Availability Zones A and B]
  11. Designing for Failure
      • Large scale failures in the cloud are rare but do happen
      • Application owners are ultimately responsible for availability and recoverability
      • Balance cost and complexity of HA efforts against risk(s) you are willing to bear
      • Cloud infrastructure has made DR and HA remarkably affordable versus past options
        - Multi-Server
        - Multi-AZ (Availability Zone)
        - Multi-Region
      “Everything fails, all the time.” Werner Vogels, CTO Amazon.com
  12. Designing for Failure – Basic Concepts
      • Fault tolerance is the goal. Degradation of service may occur, but application continues to function.
      • Avoid single points of failure (SPOF)
      • Assume everything fails (remember Werner’s mantra) and design accordingly
      • Plan and practice your recovery process (both for HA and DR)
      • Remember that better HA and DR equals more $$$. So find that acceptable balance.
  13. High Availability
      Don’t sweat the small stuff. And it’s all small stuff*
      *(until it’s not)
      Follow a few general best practices to absorb application component outages…
  14. General HA Best Practices
      • Avoid single points of failure.
      • Always place one of each component (load balancers, app servers, databases) in at least two AZs.
      • Replicate data across AZs (HA) and back up or replicate across regions for failover (DR)
      • Set up monitoring, alerts and operations to identify and automate problem resolution or failover process.
  15. Multi-Zone HA
      [Diagram labels: DNS; LOAD BALANCERS; APP SERVERS; AUTOSCALE; MASTER DB; SLAVE DB; REPLICATE; SNAPSHOTS; EBS; S3; US-EAST 1a; US-EAST 1b; 172.168.7.31; 172.168.8.62]
      • Snapshot data volume for backups so the database can be readily recovered within the region.
      • Place Slave databases in one or more zones for failover.
      • Consider local storage for additional slave database to remove dependency on attached volume
      • Consider distributed NoSQL databases with the same distribution considerations.
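The rule of placing each component in at least two AZs is easy to check mechanically. The following is a minimal sketch, assuming a deployment is described as a plain list of dictionaries; the `tier` and `zone` keys and the sample inventory are hypothetical and not taken from any RightScale or AWS API.

```python
from collections import defaultdict


def tiers_missing_redundancy(servers, min_zones=2):
    """Return the tiers that run in fewer than `min_zones` distinct zones."""
    zones_by_tier = defaultdict(set)
    for server in servers:
        zones_by_tier[server["tier"]].add(server["zone"])
    return sorted(
        tier for tier, zones in zones_by_tier.items() if len(zones) < min_zones
    )


# Hypothetical inventory: the database tier lives in a single zone.
deployment = [
    {"tier": "load_balancer", "zone": "us-east-1a"},
    {"tier": "load_balancer", "zone": "us-east-1b"},
    {"tier": "app_server", "zone": "us-east-1a"},
    {"tier": "app_server", "zone": "us-east-1b"},
    {"tier": "database", "zone": "us-east-1a"},
]

print(tiers_missing_redundancy(deployment))  # ['database']
```

A check like this could run as part of deployment review; adding a slave database in a second zone would clear the finding.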
  16. Disaster Recovery
      DR presents a few new wrinkles compared to HA, but there are multiple options depending on your needs and budget…
      Don’t sweat the small stuff. And it’s all small stuff*
      *(until it’s not)
  17. HA/DR Checklist for Risk Mitigation
      • Determine who owns the architecture, DR process and testing.
      • Develop expertise in-house and / or get outside help.
      • Conduct a risk assessment for each application.
      • Specify your target RTO and RPO.
      • Design for failure starting with application architecture. This will help drive the infrastructure architecture.
  18. HA/DR Checklist for Risk Mitigation
      • Implement HA best practices balancing cost, complexity and risk.
        - Automate infrastructure for consistency and reliability.
      • Document operational processes and automations.
      • Test the failover... then test it again.
      • Release the Chaos Monkey.
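Failover testing in the Chaos Monkey spirit can be rehearsed on paper before touching real servers. The sketch below only simulates failures against an in-memory inventory; it is not the actual Chaos Monkey tool, and the server names and `tier` labels are hypothetical.

```python
import random


def single_points_of_failure(servers):
    """Return names of servers whose loss would leave some tier with no server."""
    required = {s["tier"] for s in servers}
    spofs = []
    for victim in servers:
        remaining = {s["tier"] for s in servers if s["name"] != victim["name"]}
        if remaining != required:
            spofs.append(victim["name"])
    return spofs


def chaos_round(servers, rng=random):
    """Pick one server at random and report which tiers disappear with it."""
    victim = rng.choice(servers)
    required = {s["tier"] for s in servers}
    remaining = {s["tier"] for s in servers if s["name"] != victim["name"]}
    return victim["name"], sorted(required - remaining)


# Hypothetical fleet with a lone database server.
fleet = [
    {"name": "lb-1", "tier": "load_balancer"},
    {"name": "lb-2", "tier": "load_balancer"},
    {"name": "app-1", "tier": "app_server"},
    {"name": "app-2", "tier": "app_server"},
    {"name": "db-1", "tier": "database"},
]

print(single_points_of_failure(fleet))  # ['db-1']
```

`single_points_of_failure` is the exhaustive version of the test, while `chaos_round` mimics one random failure; a real exercise would terminate an actual instance and watch monitoring and failover respond.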
  19. Multi-Region/Cloud DR Options
      • Cold DR: $ (Most Common)
      • Warm DR: $$ (Recommended)
      • Hot DR: $$$ (Least Common)
      • Multi-Cloud HA: $$$$ (Live/Live Config)
      Downtime scale: 0, < 5 Mins, < 1 Hour, > 1 Hour
      Availability scale: 99.999%, 99.9%, 99.5%, 99%
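An availability percentage is easier to reason about as a yearly downtime budget. The short calculation below converts the four availability figures from this slide; it assumes a 365-day year and that all downtime counts against the budget, which are choices made for this sketch rather than anything stated in the deck.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(availability_percent):
    """Minutes per year a service may be down and still meet the availability target."""
    return MINUTES_PER_YEAR * (100.0 - availability_percent) / 100.0


for pct in (99.999, 99.9, 99.5, 99.0):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% -> {minutes:.1f} min/year ({minutes / 60:.1f} hours)")
```

Under these assumptions, 99.999% allows roughly 5.3 minutes of downtime per year, 99.9% about 8.8 hours, 99.5% about 43.8 hours, and 99% about 87.6 hours.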
  20. Multi-Region Cold DR
      [Diagram labels: DNS; LOAD BALANCERS; APP SERVERS; MASTER DB; SLAVE DB; REPLICATE; SNAPSHOTS; Block Storage; Persistent Storage; US EAST; US WEST; 172.168.7.31]
      Staged Server Configuration and generally no staged data
      • Not recommended if rapid recovery is required
      • Slow to replicate data to other cloud and bring database online
  21. Multi-Region Warm DR
      [Diagram labels: DNS; LOAD BALANCERS; APP SERVERS; MASTER DB; SLAVE DB; REPLICATE; SNAPSHOTS; Block Storage; Persistent Storage; US EAST; US WEST; 172.168.7.31]
      Staged Server Configuration, pre-staged data and running Slave Database Server
      • Generally recommended DR solution
      • Minimal additional cost and allows fairly rapid recovery
  22. Multi-Region Hot DR
      [Diagram labels: DNS; LOAD BALANCERS; APP SERVERS; MASTER DB; SLAVE DB; REPLICATE; SNAPSHOTS; Block Storage; Persistent Storage; US EAST; US WEST; 172.168.7.31]
      Parallel Deployment with all servers running but all traffic going to primary
      • Not recommended
      • Very high additional cost to allow rapid recovery
  23. Hybrid HA
      [Diagram labels: DNS; LOAD BALANCERS; APP SERVERS; MASTER DB; SLAVE DB; REPLICATE; SNAPSHOTS; SWIFT; Block Storage; Persistent Storage; US-EAST; CHICAGO; 172.168.7.31; 172.168.8.62]
      Live/Live configuration. Geo-target IP services to direct traffic to regional LBs.
      • Possible, but not recommended (more to follow…)
      • Max additional cost and max availability, but complex to implement and manage
  24. Hybrid HA
      [Diagram labels: DNS; LOAD BALANCERS; APP SERVERS; MASTER DB; SLAVE DB; REPLICATE; SNAPSHOTS; SWIFT; VOLUME; Block Storage; Persistent Storage; US-EAST; CHICAGO; 172.168.7.31; 172.168.8.62]
      Looks similar to Multi-Zone… but additional problems to solve as some resources are not shared
      • You need DNS management or a global load balancer.
      • Security requires addt’l effort as security groups are Region-specific.
      • Machine Images are specific to the cloud/region.
  25. Designed for Failure
  26. Multi-Region Warm DR
      [Diagram labels: DNS; LOAD BALANCERS; APP SERVERS; MASTER DB; SLAVE DB; REPLICATE; SNAPSHOTS; Block Storage; Persistent Storage; US EAST; US WEST; 172.168.7.31]
      • Zero Data Loss
      • 99.99% Uptime
      • Enterprise Software
      • True Cloud
  27. In the Dashboard
      • Multi-region or cloud
      • Multi-region Warm DR
      • Staged servers
      • Cost forecasting for DR environment
  28. Automating HA and DR
      • Use dynamic DNS for your database servers
        - Allow app servers to use a single FQDN.
        - Use a low TTL to allow rapid failover in the case of a change in master database
      • Automatic connection of app servers to load balancing servers
        - App servers can connect to all load balancers automatically at launch
        - No manual intervention
        - No DNS modifications
      • Automated promotion of slave to master
        - Process is automated
        - Decision to run process is manual
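The dynamic DNS point depends on app servers looking up the database FQDN again whenever they reconnect, instead of holding on to an old IP address. Below is a minimal sketch of that pattern using only the Python standard library; the helper names and the example hostname in the comment are hypothetical, and how quickly a DNS change becomes visible still depends on the record's TTL and on any caching done by the operating system resolver.

```python
import socket


def resolve_current_master(fqdn, port):
    """Look up the FQDN at call time so a DNS change is seen on the next attempt."""
    infos = socket.getaddrinfo(fqdn, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return [info[4][0] for info in infos]


def connect_to_master(fqdn, port, timeout=3.0):
    """Resolve the name afresh, then try each returned address until one accepts."""
    last_error = None
    for address in resolve_current_master(fqdn, port):
        try:
            return socket.create_connection((address, port), timeout=timeout)
        except OSError as exc:
            last_error = exc
    raise ConnectionError(f"no reachable address for {fqdn}") from last_error


# Hypothetical usage; the hostname below is a placeholder, not from the slides:
#   conn = connect_to_master("db-master.example.com", 3306)
```

Calling `connect_to_master` on every reconnect, rather than caching the resolved address in the application, is what lets a promoted slave take over once the DNS record is repointed.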
  29. MultiCloud Images
      • MultiCloud Images can be launched across regions and hybrid without modification
      How RightScale makes it possible
      [Diagram: a MultiCloud Image lists Cloud A, RightImage 1; Cloud B, RightImage 2; Cloud C, RightImage 3]
      1. ServerTemplate contains a list of MultiCloud Images (MCIs)
      2. When the Server is created, a specific MCI is chosen.
      3. The appropriate RightImage is used at launch.
      RightImage: Stability across clouds
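The lookup this slide describes (a template lists MCIs, one MCI is chosen, and the image matching the target cloud is used at launch) can be pictured as a nested mapping. The sketch below is a plain-Python illustration only; it is not the RightScale API, and the dictionary layout, the `image_for_launch` helper, and the names `app-server` and `mci-1` are assumptions made for this example.

```python
# Hypothetical data layout: a template holds MCIs, each MCI maps a cloud to an image.
server_template = {
    "name": "app-server",
    "multicloud_images": [
        {
            "name": "mci-1",
            "images": {
                "Cloud A": "RightImage 1",
                "Cloud B": "RightImage 2",
                "Cloud C": "RightImage 3",
            },
        },
    ],
}


def image_for_launch(template, mci_name, cloud):
    """Pick the named MCI from the template, then the image for the target cloud."""
    for mci in template["multicloud_images"]:
        if mci["name"] == mci_name:
            if cloud not in mci["images"]:
                raise KeyError(f"{mci_name} has no image for {cloud}")
            return mci["images"][cloud]
    raise KeyError(f"template has no MCI named {mci_name}")


print(image_for_launch(server_template, "mci-1", "Cloud B"))  # RightImage 2
```

The point of the indirection is that the same template can be launched in a different cloud or region by changing only the `cloud` argument.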
  30. How RightScale makes it possible
      ServerTemplates, Tags, and Inputs
      • Automated load balancer registration and database connections
      • Autoscaling across zones
      • Dynamic configuration
  31. DR Cost Comparison Example
      • Multi-Region Cold DR: Total $4480 / month
        - Running, $4470 / month: 3 Load Balancers (Large), 6 App Servers (XLarge), 1 Master DB (2XLarge), 1 Slave DB (2XLarge)
        - Staged, $0 / month: 3 Load Balancers (Large), 6 App Servers (XLarge), 1 Slave DB (2XLarge)
        - Replication, $10 / month: 25GB / day cross-zone
      • Multi-Region Warm DR: Total $5630 / month
        - Running, $5540 / month: 3 Load Balancers (Large), 6 App Servers (XLarge), 1 Master DB (2XLarge), 2 Slave DB (2XLarge)
        - Staged, $0 / month: 3 Load Balancers (Large), 6 App Servers (XLarge)
        - Replication, $90 / month: 25GB / day cross-region
      • Multi-Region Hot DR: Total $8800 / month
        - Running, $8440 / month: 6 Load Balancers (Large), 12 App Servers (XLarge), 1 Master DB (2XLarge), 2 Slave DB (2XLarge)
        - Replication, $360 / month: 100GB / day cross-region
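The totals in this cost comparison are the sum of the running, staged, and replication lines, which the short check below reproduces. The monthly figures are copied from the slide; treating the Hot DR staged cost as $0 is an assumption, since the slide lists no staged servers for that option.

```python
# Monthly costs in USD, copied from the slide.
dr_costs = {
    "Cold DR": {"running": 4470, "staged": 0, "replication": 10},
    "Warm DR": {"running": 5540, "staged": 0, "replication": 90},
    # Assumption: Hot DR has nothing staged, so its staged cost is taken as 0.
    "Hot DR": {"running": 8440, "staged": 0, "replication": 360},
}

totals = {name: sum(parts.values()) for name, parts in dr_costs.items()}
baseline = totals["Cold DR"]

for name, total in totals.items():
    extra = total - baseline
    print(f"{name}: ${total} / month (+${extra}, {100 * extra / baseline:.1f}% over Cold DR)")
```

With these figures, Warm DR costs $1150 more per month than Cold DR (about 25.7%), while Hot DR costs $4320 more (about 96.4%), consistent with the deck's recommendation of Warm DR as the usual compromise.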
  32. Outage-Proofing Best Practices
      • Place in >1 zone:
        - Load balancers
        - App servers
        - Databases
      • Maintain capacity to absorb zone or region failures
      • Replicate data across zones
      • Design stateless apps for resilience to reboot / relaunch
      • Back up across regions
      • Monitor, alert, and automate operations to speed up failover
  33. Resources
      • White Papers
        - http://www.rightscale.com/info_center/white-papers.php
      • Webinars
        - http://www.rightscale.com/info_center/webinars.php
        - http://www.rightscale.com/info_center/webinars/safeguard-cloud-apps-aws.php
      You can reach Sanket at: sanket.naik@coupa.com
      Coupa is hiring: jobs.coupa.com
  34. Questions?
