Outage-Proof Your Applications - RightScale Compute 2013


Speakers: ...

Brian Adler - Sr. Services Architect, RightScale
Sanket Naik - VP of Cloud Operations, Coupa

Design for failure. Easier said than done? RightScale experts will review best practices for application architectures that survive cloud outages, reduce mean time to recovery, and meet your SLAs.



  • Coupa has a straightforward mission. [CLICK] Delivering software INNOVATION: we drive innovation everywhere in the company. We don’t do things the way they have always been done. We look for the most effective, most innovative ways to do things, particularly in the processes and technology that we design and deliver for our customers. [CLICK] That breeds RESPONSIBLE SPENDING, internally here at Coupa and at the hundreds of customers we have around the world. People recognize the actual cost of their purchases, which drives responsible spending. [CLICK] Impact the BOTTOM LINE: to drive bottom-line impact on your organization, to drive profitability, to drive earnings per share. [CLICK] To drive results for your company. That is what Coupa is really all about.
  • Cold DR (Most common... hours): Staged Server Configuration and generally no staged data. Bring up the servers and load the data to fail over. Cold DR failover is typically manual.
    Warm DR (Recommended... >hour): Staged Server Configuration, pre-staged data and running Database Slave Server. Warm DR failover is typically manual but can be automated.
    Hot DR (Least common... but needed if <5 min): Parallel Deployment with all servers running but all traffic going to primary. Hot DR failover is normally automated.
    Hot HA: Live/Live configuration. May use Geo-target IP services to direct traffic to regional load balancers. Fail over to the other region if one has problems. Hot HA is normally seamlessly automated.
  • I am going to read a touching account of what one of our customers told his Coupa account manager during Superstorm Sandy. [Take out paper and read]
    "Our building in New York is under five feet of water and all systems (including email) are down. I am on my way to a neighbor's house (one that still has power) so I can get online and get what I need from Coupa. We need to access our contracts in Coupa to get in touch with our vendors and to initiate emergency protocols with suppliers. If it weren't for Coupa, we would be in even worse shape. I am so thankful that Coupa is up when all of our other systems are down. I know it doesn't make the devastation any less impactful but it is great to be able to have this one bit of stability and normalcy." [Fold and put paper away]
    Taking on-premise software that was running in a data center and hosting it on the Amazon cloud or any other cloud does not make it resilient. The software needs to be designed from the ground up to run on the cloud, which is what we have done over the last 7 years with a true cloud solution.
    Designing for the cloud requires a completely different mindset than a traditional data center mindset. We had to take a 180 degree view and design for failure. We had to assume that any component of the system can fail at any time. Entire data centers might be impacted by hurricanes or floods, and the service should be resilient to that.
  • Note: Other costs such as IOPS, volumes, other bandwidth, object storage, and snapshot storage are additional.

Outage-Proof Your Applications - RightScale Compute 2013: Presentation Transcript

  • April 25-26, San Francisco. Cloud success starts here. Outage-Proof Your Applications. Brian Adler, Sr. Services Architect, RightScale; Sanket Naik, VP Cloud Operations, Coupa
  • #2 #RightscaleCompute Agenda
    • Introductions
    • Terminology/Level-Setting
    • Takeaways
    • Cloud and Component Definitions
    • Designing for Failure
    • Architectural Options and Considerations
      - High Availability
      - Disaster Recovery
    • Conclusions / Q&A
  • #3 Our Mission: Delivering software innovation that breeds responsible spending while impacting the company bottom line
  • #4 What does success look like?
    • Manual companies: 10% spend under management
    • Most companies: 30%-50% spend under management
    • Spend under management: 80%
    Source: Aberdeen Group
  • #5 Spend Happens. We are the only solution that captures all three:
    • Pre-Approved: Procurement
    • Un-Approved: Expense Management
    • Post-Approved: Invoice Management
  • Over 300 Customers and growing… You. Smarter spending simplified
  • #7 Terminology
    • Ability of a system to continue operating properly (perhaps at a degraded level) if one or more components fail.
    • The process, policies and procedures related to restoring critical systems after a catastrophic event. Goal is to get the application back up and running within a defined time period (RTO) and within a certain data loss window (RPO).
    • Fault Tolerant systems are measured by their Availability in terms of planned and unplanned service outages for end users.
  • #8 Terminology - continued
    • RTO (Recovery Time Objective): Time period in which service must be restored to meet BCP (Business Continuity Planning) objectives
    • RPO (Recovery Point Objective): Acceptable data loss as a result of recovering from a disaster/catastrophic event
    RTO and RPO are often at odds, and tradeoffs need to be made in order to find an acceptable middle ground
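The RTO and RPO definitions above can be turned into a quick sanity check on a recovery plan. The following is a minimal sketch, not part of the presentation: the function names, the recovery steps, and all durations are hypothetical. It assumes worst-case data loss equals the backup interval unless live replication is running, in which case it is bounded by the replication lag.

```python
from datetime import timedelta
from typing import Dict, Optional, Tuple


def worst_case_data_loss(backup_interval: timedelta,
                         replication_lag: Optional[timedelta] = None) -> timedelta:
    """Assumed model: loss is bounded by replication lag if a replica
    exists, otherwise by the time between backups."""
    return replication_lag if replication_lag is not None else backup_interval


def meets_objectives(recovery_steps: Dict[str, timedelta],
                     backup_interval: timedelta,
                     rto: timedelta,
                     rpo: timedelta,
                     replication_lag: Optional[timedelta] = None) -> Tuple[bool, bool]:
    """Return (meets_rto, meets_rpo) for a recovery plan."""
    # Estimated recovery time is the sum of the sequential recovery steps.
    estimated_recovery = sum(recovery_steps.values(), timedelta())
    loss = worst_case_data_loss(backup_interval, replication_lag)
    return estimated_recovery <= rto, loss <= rpo


if __name__ == "__main__":
    # Hypothetical plan: launch servers, then restore data from a backup.
    plan = {"launch servers": timedelta(minutes=20),
            "restore data": timedelta(minutes=50)}
    print(meets_objectives(plan, timedelta(hours=24),
                           rto=timedelta(hours=1), rpo=timedelta(hours=4)))
```

In this hypothetical plan, tightening the RPO means adding replication (more running cost), while tightening the RTO means pre-staging servers and data, which is the tradeoff the slide describes.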
  • #9 Takeaways
    • Understand core concepts behind HA and DR
    • Introduction to architectural options for designing HA, fault-tolerant applications and DR environments and procedures
    • Best Practices for implementation of these architectural options within AWS (independent of RightScale)
    • Multi-Availability Zone (AZ) and Multi-Region
    • Architectural options and Considerations / pros and cons of these options
    • Understanding of the tools RightScale brings to AWS to simplify the creation of these HA and DR environments
  • #10 Regions & Zones
    • Zones within a region share a LAN (high bandwidth, low latency, private IP access)
    • Zones utilize separate power sources, are physically segregated
    • Regions are “islands”, and share no resources.
    [Diagram: five regions. Region 1 contains Availability Zones A, B, and C; Regions 2 through 5 each contain Availability Zones A and B.]
  • #11 Designing for Failure
    • Large scale failures in the cloud are rare but do happen
    • Application owners are ultimately responsible for availability and recoverability
    • Balance cost and complexity of HA efforts against risk(s) you are willing to bear
    • Cloud infrastructure has made DR and HA remarkably affordable versus past options
      - Multi-Server
      - Multi-AZ (Availability Zone)
      - Multi-Region
    “Everything fails, all the time.” (Werner Vogels, CTO, Amazon.com)
  • #12 Designing for Failure: Basic Concepts
    • Fault tolerance is the goal. Degradation of service may occur, but application continues to function.
    • Avoid single points of failure (SPOF)
    • Assume everything fails (remember Werner’s mantra) and design accordingly
    • Plan and practice your recovery process (both for HA and DR)
    • Remember that better HA and DR equals more $$$. So find that acceptable balance.
  • #13 High Availability. Don’t sweat the small stuff. And it’s all small stuff* (*until it’s not). Follow a few general best practices to absorb application component outages…
  • #14 General HA Best Practices
    • Avoid single points of failure.
    • Always place one of each component (load balancers, app servers, databases) in at least two AZs.
    • Replicate data across AZs (HA) and back up or replicate across regions for failover (DR)
    • Set up monitoring, alerts and operations to identify and automate problem resolution or failover process.
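The "at least two AZs per component" rule above lends itself to an automated check. This is a minimal sketch under stated assumptions: the inventory format (a list of role and zone pairs) and the function name are hypothetical, and a real check would pull the inventory from your cloud provider or management tool.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def find_single_points_of_failure(servers: Iterable[Tuple[str, str]],
                                  min_zones: int = 2) -> List[str]:
    """Return the roles that run in fewer than `min_zones` distinct zones."""
    zones_by_role = defaultdict(set)
    for role, zone in servers:
        zones_by_role[role].add(zone)
    return sorted(role for role, zones in zones_by_role.items()
                  if len(zones) < min_zones)


if __name__ == "__main__":
    # Hypothetical inventory: the database lives in only one zone.
    inventory = [
        ("load_balancer", "us-east-1a"),
        ("load_balancer", "us-east-1b"),
        ("app_server", "us-east-1a"),
        ("app_server", "us-east-1b"),
        ("database", "us-east-1a"),
    ]
    print(find_single_points_of_failure(inventory))
```

A check like this only catches placement gaps; it says nothing about whether the surviving zone has enough capacity to absorb the load.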
  • #15 Multi-Zone HA
    [Diagram: DNS in front of load balancers and autoscaling app servers in US-EAST 1a and US-EAST 1b; master DB replicating to slave DB; snapshots to S3 and EBS; sample address 172.168.7.31.]
    • … data volume for backups so the database can be readily recovered within the region.
    • Place Slave databases in one or more zones for failover.
    • Consider local storage for additional slave database to remove dependency on attached volume
    • Consider distributed NoSQL databases with the same distribution considerations.
  • #16 Disaster Recovery. DR presents a few new wrinkles compared to HA, but there are multiple options depending on your needs and budget… Don’t sweat the small stuff. And it’s all small stuff* (*until it’s not)
  • #17 HA/DR Checklist for Risk Mitigation
    • Determine who owns the architecture, DR process and testing.
    • Develop expertise in-house and / or get outside help.
    • Conduct a risk assessment for each application.
    • Specify your target RTO and RPO.
    • Design for failure starting with application architecture. This will help drive the infrastructure architecture.
  • #18 HA/DR Checklist for Risk Mitigation
    • Implement HA best practices balancing cost, complexity and risk.
      - Automate infrastructure for consistency and reliability.
    • Document operational processes and automations.
    • Test the failover... then test it again.
    • Release the Chaos Monkey.
  • #19 Multi-Region/Cloud DR Options
    • Cold DR (Most Common): downtime > 1 hour; cost $; availability 99%
    • Warm DR (Recommended): downtime < 1 hour; cost $$; availability 99.5%
    • Hot DR (Least Common): downtime < 5 mins; cost $$$; availability 99.9%
    • Multi-Cloud HA (Live/Live Config): downtime 0; cost $$$$; availability 99.999%
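To put the availability percentages on this slide in perspective, each one can be converted into the downtime it allows per year. The sketch below is illustrative arithmetic only; pairing each tier with an availability figure follows the column order of the flattened slide, which is an assumption, and the names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

# Availability per tier as read from the slide (the pairing is an assumption
# based on the order of the flattened columns).
TIER_AVAILABILITY_PERCENT = {
    "Cold DR": 99.0,
    "Warm DR": 99.5,
    "Hot DR": 99.9,
    "Multi-Cloud HA": 99.999,
}


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_YEAR


if __name__ == "__main__":
    for tier, pct in TIER_AVAILABILITY_PERCENT.items():
        minutes = allowed_downtime_minutes(pct)
        print(f"{tier}: {pct}% -> {minutes:.1f} min/year ({minutes / 60:.1f} h)")
```

Under this arithmetic, 99% allows about 87.6 hours of downtime per year, while 99.999% allows a little over 5 minutes, which is why the higher tiers need automated failover.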
  • #20 Multi-Region Cold DR
    [Diagram: DNS; US EAST with load balancers, app servers, master DB replicating to slave DB, and snapshots; US WEST with load balancers, app servers, and slave DB; block storage and persistent storage; sample address 172.168.7.31.]
    Staged Server Configuration and generally no staged data
    • Not recommended if rapid recovery is required
    • Slow to replicate data to other cloud and bring database online
  • #21 Multi-Region Warm DR
    [Diagram: DNS; US EAST with load balancers, app servers, master DB replicating to slave DB, and snapshots; US WEST with load balancers, app servers, replicated slave DB, and snapshots; block storage and persistent storage; sample address 172.168.7.31.]
    Staged Server Configuration, pre-staged data and running Slave Database Server
    • Generally recommended DR solution
    • Minimal additional cost and allows fairly rapid recovery
  • #22 Multi-Region Hot DR
    [Diagram: DNS; US EAST with load balancers, app servers, master DB replicating to slave DB, and snapshots; US WEST with load balancers, app servers, replicated slave DB, and snapshots; block storage and persistent storage; sample address 172.168.7.31.]
    Parallel Deployment with all servers running but all traffic going to primary
    • Not recommended
    • Very high additional cost to allow rapid recovery
  • #23 Hybrid HA
    [Diagram: DNS; load balancers, app servers, master DB replicating to slave DB; second site (CHICAGO) with load balancers, app servers, replicated slave DB, and snapshots; block storage and persistent storage; sample address 172.168.7.31.]
    … configuration. Geo-target IP services to direct traffic to regional LBs.
    • Possible, but not recommended (more to follow…)
    • Max additional cost and max availability, but complex to implement and manage
  • #24 … HA
    [Diagram: DNS; load balancers, app servers, master DB replicating to slave DB; second site (CHICAGO) with load balancers, app servers, replicated slave DB, snapshots, Swift, and volume; block storage and persistent storage; sample address 172.168.7.31.]
    Looks similar to Multi-Zone… but additional problems to solve as some resources are not shared
    • You need DNS management or a global load balancer.
    • Security requires addt’l effort as security groups are Region-specific.
    • Machine Images are specific to the cloud/region.
  • #25 Designed for Failure
  • #26 Multi-Region Warm DR
    [Diagram: DNS; US EAST with load balancers, app servers, master DB replicating to slave DB, and snapshots; US WEST with load balancers, app servers, replicated slave DB, and snapshots; block storage and persistent storage; sample address 172.168.7.31.]
    • Zero Data Loss
    • 99.99% Uptime
    • Enterprise Software
    • True Cloud
  • #27 In the Dashboard
    • Multi-region or cloud
    • Multi-region Warm DR
    • Staged servers
    • Cost forecasting for DR environment
  • #28 Automating HA and DR
    • Use dynamic DNS for your database servers
      - Allow app servers to use a single FQDN.
      - Use a low TTL to allow rapid failover in the case of a change in master database
    • Automatic connection of app servers to load balancing servers
      - App servers can connect to all load balancers automatically at launch
      - No manual intervention
      - No DNS modifications
    • Automated promotion of slave to master
      - Process is automated
      - Decision to run process is manual
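The interplay of a single database FQDN, a low TTL, and a manually triggered but automated promotion can be illustrated with a small in-memory simulation. This is a sketch only: it does not call any real DNS or RightScale API, and the class names, hostname, and addresses are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class DnsRecord:
    fqdn: str
    address: str
    ttl_seconds: int = 60  # low TTL so clients pick up changes quickly


@dataclass
class Database:
    address: str
    role: str  # "master", "slave", or "failed"


class CachingResolver:
    """Simulates an app server that caches DNS answers for the record's TTL."""

    def __init__(self) -> None:
        self._cache: Dict[str, Tuple[str, int]] = {}  # fqdn -> (address, expiry)

    def resolve(self, record: DnsRecord, now: int) -> str:
        cached = self._cache.get(record.fqdn)
        if cached is not None and now < cached[1]:
            return cached[0]
        self._cache[record.fqdn] = (record.address, now + record.ttl_seconds)
        return record.address


def promote_slave(record: DnsRecord, master: Database, slave: Database,
                  operator_confirmed: bool) -> bool:
    """Automated promotion steps, gated on a manual decision."""
    if not operator_confirmed:
        return False  # a human decides whether to fail over
    master.role = "failed"
    slave.role = "master"
    record.address = slave.address  # app servers keep using the same FQDN
    return True


if __name__ == "__main__":
    record = DnsRecord("db-master.example.com", "10.0.0.10")
    master, slave = Database("10.0.0.10", "master"), Database("10.1.0.10", "slave")
    app = CachingResolver()
    print(app.resolve(record, now=0))
    promote_slave(record, master, slave, operator_confirmed=True)
    print(app.resolve(record, now=30), app.resolve(record, now=60))
```

In this model an app server can keep connecting to the old master for up to one TTL after promotion, which is why the slide recommends a low TTL.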
  • #29 How RightScale makes it possible: MultiCloud Images
    • MultiCloud Images can be launched across regions and hybrid without modification
    1. ServerTemplate contains a list of MultiCloud Images (MCIs): Cloud A, RightImage 1; Cloud B, RightImage 2; Cloud C, RightImage 3.
    2. When the Server is created, a specific MCI is chosen (Cloud A, RightImage 1).
    3. The appropriate RightImage is used at launch (Cloud A, Image 1).
    RightImage: Stability across clouds
  • #30 How RightScale makes it possible: ServerTemplates, Tags, and Inputs
    • Automated load balancer registration and database connections
    • Autoscaling across zones
    • Dynamic configuration
  • #31 DR Cost Comparison Example
    • Multi-Region Cold DR: Total $4480 / month
      - Running, $4470 / month: 3 Load Balancers (Large), 6 App Servers (XLarge), 1 Master DB (2XLarge), 1 Slave DB (2XLarge)
      - Staged, $0 / month: 3 Load Balancers (Large), 6 App Servers (XLarge), 1 Slave DB (2XLarge)
      - Replication, $10 / month: 25GB / day cross-zone
    • Multi-Region Warm DR: Total $5630 / month
      - Running, $5540 / month: 3 Load Balancers (Large), 6 App Servers (XLarge), 1 Master DB (2XLarge), 2 Slave DB (2XLarge)
      - Staged, $0 / month: 3 Load Balancers (Large), 6 App Servers (XLarge)
      - Replication, $90 / month: 25GB / day cross-region
    • Multi-Region Hot DR: Total $8800 / month
      - Running, $8440 / month: 6 Load Balancers (Large), 12 App Servers (XLarge), 1 Master DB (2XLarge), 2 Slave DB (2XLarge)
      - Replication, $360 / month: 100GB / day cross-region
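The totals in this comparison are the sum of the running, staged, and replication lines, and the relative premium of each option over Cold DR follows directly. The sketch below reproduces that arithmetic from the slide's figures; the dictionary layout and function names are hypothetical, and the staged cost for Hot DR is assumed to be zero because the slide lists no staged servers for it.

```python
# Monthly costs in USD, taken from the slide.
DR_MONTHLY_COSTS = {
    "Multi-Region Cold DR": {"running": 4470, "staged": 0, "replication": 10},
    "Multi-Region Warm DR": {"running": 5540, "staged": 0, "replication": 90},
    # Assumption: no staged servers are listed for Hot DR, so staged is 0.
    "Multi-Region Hot DR": {"running": 8440, "staged": 0, "replication": 360},
}


def monthly_total(option: str) -> int:
    """Sum of running, staged, and replication costs for one DR option."""
    return sum(DR_MONTHLY_COSTS[option].values())


def premium_over(baseline: str, option: str) -> float:
    """Fractional extra cost of `option` relative to `baseline`."""
    base = monthly_total(baseline)
    return (monthly_total(option) - base) / base


if __name__ == "__main__":
    for name in DR_MONTHLY_COSTS:
        extra = premium_over("Multi-Region Cold DR", name)
        print(f"{name}: ${monthly_total(name)} / month (+{extra:.1%} vs. Cold DR)")
```

With these figures, Warm DR costs roughly 26% more than Cold DR and Hot DR roughly 96% more, before the additional IOPS, volume, bandwidth, and storage costs mentioned in the speaker notes.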
  • #32 Outage-Proofing Best Practices
    • Place in >1 zone: load balancers, app servers, databases
    • Maintain capacity to absorb zone or region failures
    • Design stateless apps for resilience to reboot / relaunch
    • Replicate data across zones
    • Back up across regions
    • Monitor, alert, and automate operations to speed up failover
  • #33 Resources
    • White Papers
      - http://www.rightscale.com/info_center/white-papers.php
    • Webinars
      - http://www.rightscale.com/info_center/webinars.php
      - http://www.rightscale.com/info_center/webinars/safeguard-cloud-apps-aws.php
    You can reach Sanket at: sanket.naik@coupa.com
    Coupa is hiring: jobs.coupa.com
  • April 25-26, San Francisco. Cloud success starts here. Questions?