What happens when your mission-critical application is unavailable because of a cloud outage? Do you have a disaster recovery plan? Are you prepared to fail over to an alternate cloud, or are you just crossing your fingers that you'll make it through unscathed?
The cloud isn't magic, it's a data center. And it's not "what if" the cloud goes down -- it's "when".
Join RightScale for a webinar to learn from the experts how to outage-proof your cloud applications. At RightScale, we have seen the good, the bad, and the ugly of cloud outages. Now is the time to build for failure and plan for the worst-case scenario.
In this webinar, we will:
- Level-set critical concepts: Fault tolerance, high availability and disaster recovery
- Show you how to design for failure
- Talk you through disaster recovery options that you can tailor based on your uptime requirements
- Share best practices for outage-proofing your cloud applications
RightScale Webinar: Outage-Proof Your Cloud Applications
1. Outage-Proof Your Cloud Applications
Brian Adler, Sr. Services Architect
Roberto Monge, Cloud Solutions Engineer
RightScale
December 18, 2012
Watch the video of this webinar
2. Your Panel Today
Presenting
• Brian Adler, Sr. Services Architect, RightScale
• Roberto Monge, Cloud Solutions Engineer, RightScale
Q&A
• Spencer Adams, Account Manager, RightScale
• Noel Cohen, Account Manager, RightScale
Please use the “Questions” window to ask questions any time!
Cloud Management #rightscale
3. Agenda
• High Availability and Disaster Recovery
• Terminology/Level-Setting
• Designing for Failure
• Cloud and component definitions
• HA and DR configurations
• Conclusions / Q&A
4. Terminology
• Fault Tolerance: Ability of a system to continue operating properly (perhaps at a degraded level) if one or more components fails
• High Availability: Fault Tolerant systems are measured by their Availability in terms of planned and unplanned service outages for end users
• Disaster Recovery: The process, policies and procedures related to restoring critical systems after a catastrophic event
5. Designing for Failure
1 Large scale failures in the cloud are rare but do happen
2 Application owners are ultimately responsible for availability and recoverability
3 Need to balance cost and complexity of HA efforts against risks you are willing to bear
4 Cloud infrastructure has made DR and HA remarkably affordable
  • Multi-server
  • Multi-Zone
  • Multi-Region
  • Multi-Cloud
6. Cloud Isolation Definitions
                             Region                                       Zone
Resources                    One or more geographically proximate Zones   Datacenter with separate power source
API endpoint, control plane  Shared                                       Shared
Local Area Network           Shared                                       Shared

Clouds                       Region term                                  Zone term
Amazon Web Services          Region                                       Availability Zone
Rackspace                    Region
Windows Azure                Region
Google Cloud Platform        Region                                       Availability Group
CloudStack                   Region                                       Zone
OpenStack                    Zone                                         Availability Zone
7. Multi-Zone HA
[Diagram: DNS (172.168.7.31, 172.168.8.62) → load balancers in US-EAST 1a and US-EAST 1b → autoscaling app servers → Master DB replicating to Slave DB; EBS snapshots stored in S3]
• Consider distributed NoSQL databases with the same distribution considerations. Spread primary and replica nodes across multiple zones.
• App servers: place as many as you need for required resiliency.
• Snapshot data volume for backups so the database can be readily recovered within the region.
• Place Slave databases in one or more zones for failover.
• Consider local storage for additional slave database to remove dependency on attached volume.
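The placement advice above (spread primary and replica nodes across multiple zones) can be sketched as a simple round-robin assignment. This is a minimal illustration, not RightScale functionality: the helper names `spread_across_zones` and `survives_zone_loss`, the node names, and the zone labels are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import cycle


def spread_across_zones(nodes, zones):
    """Assign nodes to zones round-robin so no zone holds more than its share.

    Hypothetical helper for illustration; `nodes` and `zones` are plain lists.
    """
    if not zones:
        raise ValueError("at least one zone is required")
    placement = defaultdict(list)
    for node, zone in zip(nodes, cycle(zones)):
        placement[zone].append(node)
    return dict(placement)


def survives_zone_loss(placement, failed_zone):
    """Return True if at least one node remains after losing `failed_zone`."""
    return any(nodes for zone, nodes in placement.items() if zone != failed_zone)


# Illustrative node and zone names (assumptions, not taken from a real deployment).
placement = spread_across_zones(
    ["master-db", "slave-db-1", "slave-db-2"],
    ["us-east-1a", "us-east-1b"],
)
print(placement)
print(survives_zone_loss(placement, "us-east-1a"))
```

With everything placed in a single zone, `survives_zone_loss` returns False for that zone, which is the situation the slide warns against.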
8. Multi-Region/Cloud DR Options
Option                             Availability  Downtime  Cost
Multi-Cloud HA (Live/Live Config)  99.999%       0         $$$$
Hot DR (Least Common)              99.9%         < 5 Mins  $$$
Warm DR (Recommended)              99.5%         < 1 Hour  $$
Cold DR (Most Common)              99%           > 1 Hour  $
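To put the availability percentages above in perspective, an availability target can be converted into a yearly downtime budget. This is a sketch of the standard arithmetic only. The function name `downtime_hours_per_year` is an assumption, and the yearly budget it returns is not the same quantity as the per-incident Downtime figures on the slide.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Hours of downtime per year permitted by an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (100 - availability_percent) / 100


for pct in (99.999, 99.9, 99.5, 99.0):
    print(f"{pct}% -> {downtime_hours_per_year(pct):.2f} hours/year")
```

For example, 99.9% allows roughly 8.76 hours of downtime per year, while 99% allows roughly 87.6 hours.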
9. Multi-Region Cold DR
Staged Server Configuration and generally no staged data
• Not recommended if rapid recovery is required
• Slow to replicate data to other cloud and bring database online
[Diagram: DNS (172.168.7.31) in front of the CHICAGO and DALLAS regions, each with load balancers and app servers; a Master DB and two Slave DBs, with replication; CBS snapshots stored in Cloud Files]
10. Multi-Region Warm DR
Staged Server Configuration, pre-staged data and running Slave Database Server
• Generally recommended DR solution
• Minimal additional cost and allows fairly rapid recovery
[Diagram: DNS (172.168.7.31) in front of the CHICAGO and DALLAS regions, each with load balancers and app servers; a Master DB replicating to two Slave DBs; CBS snapshots stored in Cloud Files]
11. Multi-Region Hot DR
Parallel Deployment with all servers running but all traffic going to primary
• Not recommended
• Very high additional cost to allow rapid recovery
[Diagram: DNS (172.168.7.31) in front of the CHICAGO and DALLAS regions, each with load balancers and app servers; a Master DB replicating to two Slave DBs; CBS snapshots stored in Cloud Files]
12. Multi-Cloud HA
Live/Live configuration. Geo-target IP services to direct traffic to regional LBs.
• Possible, but not recommended (more to follow…)
• Max additional cost and max availability, but complex to implement and manage
[Diagram: DNS (172.168.7.31, 172.168.8.62) in front of US-EAST and CHICAGO, each with load balancers and app servers; a Master DB replicating to two Slave DBs; EBS snapshots stored in S3 and snapshots stored in Swift]
13. Multi-Cloud HA
Looks similar to Multi-Zone… but additional problems to solve as some resources are not shared
• You need DNS management or a global load balancer.
• Security is an issue as security groups are Region-specific.
• Machine Images are specific to the cloud/region.
[Diagram: DNS (172.168.7.31, 172.168.8.62) in front of US-EAST and CHICAGO, each with load balancers and app servers; a Master DB replicating to two Slave DBs; EBS snapshots stored in S3 and volume snapshots stored in Swift]
14. In the Dashboard
[Screenshot of the RightScale Dashboard with callouts: Cost forecasting; Multi-region or cloud environment for DR; Multi-region Warm DR; Staged servers]
15. Automating HA and DR
• Use dynamic DNS for your database servers
  • Allow app servers to use a single FQDN.
  • Use a low TTL to allow rapid failover in the case of a change in master database
• Automatic connection of app servers to load balancing servers
  • App servers can connect to all load balancers automatically at launch
  • No manual intervention
  • No DNS modifications
• Automated promotion of slave to master
  • Process is automated
  • Decision to run process is manual
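The dynamic DNS and promotion points above can be sketched with an in-memory stand-in for a DNS record. Everything here is an assumption for illustration: `DnsRecord`, `promote_slave`, `worst_case_failover_seconds`, the FQDN, and the addresses are hypothetical and do not correspond to any real DNS provider or RightScale API. The failover bound also assumes that clients honor the record's TTL.

```python
from dataclasses import dataclass


@dataclass
class DnsRecord:
    """In-memory stand-in for a dynamic DNS record (hypothetical, not a real API)."""
    fqdn: str
    address: str
    ttl_seconds: int


def promote_slave(record, slave_address, operator_confirmed):
    """Point the master FQDN at the slave once an operator approves.

    Mirrors the slide: the process is automated, the decision to run it is manual.
    """
    if not operator_confirmed:
        return False
    record.address = slave_address
    return True


def worst_case_failover_seconds(detection_seconds, promotion_seconds, ttl_seconds):
    """Upper bound on how long app servers may keep using the old master address."""
    return detection_seconds + promotion_seconds + ttl_seconds


# Illustrative values only.
master = DnsRecord(fqdn="db-master.example.com", address="10.0.1.10", ttl_seconds=60)
promoted = promote_slave(master, "10.0.2.20", operator_confirmed=True)
print(promoted, master.address)
print(worst_case_failover_seconds(30, 45, master.ttl_seconds))
```

Because app servers only ever look up the single FQDN, lowering `ttl_seconds` directly shrinks the last term of the bound, which is why a low TTL matters for rapid failover.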
16. How RightScale makes it possible
MultiCloud Images
• MultiCloud Images can be launched across regions and clouds without modification
1 ServerTemplate contains a list of MultiCloud Images (MCIs)
2 When the Server is created, a specific MCI is chosen.
3 The appropriate RightImage is used at launch.
[Diagram: MultiCloud Images map clouds (Cloud A, Cloud B, Cloud C) to images (Image 1, Image 2); the RightImage provides stability across clouds]
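The idea behind the three steps above, one logical image that resolves to a concrete image per cloud at launch, can be sketched as a lookup table. This is not the RightScale API: `choose_image`, the cloud keys, and the image IDs are hypothetical names used only to illustrate the mapping.

```python
def choose_image(multicloud_image, cloud):
    """Return the image registered for `cloud` in a MultiCloud Image mapping."""
    try:
        return multicloud_image[cloud]
    except KeyError:
        raise LookupError(f"no image registered for cloud {cloud!r}") from None


# Hypothetical MCI: one logical image, with a concrete image ID per cloud/region.
mci = {
    "aws-us-east": "image-aaa",
    "rackspace-chicago": "image-bbb",
}

print(choose_image(mci, "rackspace-chicago"))
```

Launching in a cloud that the mapping does not cover fails early with a `LookupError`, rather than starting a server from the wrong image.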
17. How RightScale makes it possible
ServerTemplates, Tags, and Inputs
• Automated load balancer registration and database connections
• Autoscaling across zones
• Dynamic configuration
18. DR Cost Comparison Example
             Multi-Region Cold DR      Multi-Region Warm DR      Multi-Region Hot DR
Total        $4480 / month             $5630 / month             $8800 / month
Running      $4470 / month             $5540 / month             $8440 / month
             3 Load Balancers (Large)  3 Load Balancers (Large)  6 Load Balancers (Large)
             6 App Servers (XLarge)    6 App Servers (XLarge)    12 App Servers (XLarge)
             1 Master DB (2XLarge)     1 Master DB (2XLarge)     1 Master DB (2XLarge)
             1 Slave DB (2XLarge)      2 Slave DB (2XLarge)      2 Slave DB (2XLarge)
Staged       $0 / month                $0 / month
             3 Load Balancers (Large)  3 Load Balancers (Large)
             6 App Servers (XLarge)    6 App Servers (XLarge)
             1 Slave DB (2XLarge)
Replication  $10 / month               $90 / month               $360 / month
             25GB / day cross-zone     25GB / day cross-region   100GB / day cross-region
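The totals in the comparison above are the sum of the running, staged, and replication lines. A short sketch of that arithmetic follows. The figures are copied from the slide, and treating the Hot DR staged cost as 0 is an assumption, since the slide lists no staged servers for that option.

```python
# Monthly figures copied from the cost comparison above (USD).
# Assumption: Hot DR has no staged servers, so its staged cost is taken as 0.
costs = {
    "Cold DR": {"running": 4470, "staged": 0, "replication": 10},
    "Warm DR": {"running": 5540, "staged": 0, "replication": 90},
    "Hot DR": {"running": 8440, "staged": 0, "replication": 360},
}


def monthly_total(parts):
    """Sum the cost components for one DR option."""
    return sum(parts.values())


totals = {name: monthly_total(parts) for name, parts in costs.items()}
premium_over_cold = {name: total - totals["Cold DR"] for name, total in totals.items()}

print(totals)
print(premium_over_cold)
```

On these numbers, Warm DR costs $1150 per month more than Cold DR, and Hot DR costs $4320 per month more.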
19. Most Common Observed Cloud Outages
• Outage of specific services in a zone
  • Degraded performance
  • E.g. EBS, ELB, RDS
• Outage of specific services in a region
  • Control plane error or cascading problems
  • E.g. EBS
• Outage of power or network in a zone
  • No connectivity
  • E.g. EC2, Azure
• Capacity availability in a region during an outage
  • Not possible to provision instances, volumes, or other services
20. Outage-Proofing Best Practices
• Place in >1 zone:
  • Load balancers
  • App servers
  • Databases
• Maintain capacity to absorb zone or region failures
• Replicate data across zones
• Backup across regions & clouds
• Monitor, alert, and automate operations to speed up failover
• Design stateless apps for resilience to reboot / relaunch
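"Maintain capacity to absorb zone or region failures" comes down to a sizing calculation: each zone must run enough servers that the surviving zones can carry peak load. A minimal sketch, assuming load is spread evenly across zones; the function name `servers_per_zone` and the example numbers are illustrative.

```python
import math


def servers_per_zone(servers_needed_at_peak, zones, zones_lost=1):
    """Servers to run in each zone so peak load still fits after `zones_lost` zones fail."""
    surviving = zones - zones_lost
    if surviving < 1:
        raise ValueError("at least one zone must survive")
    return math.ceil(servers_needed_at_peak / surviving)


# Example: 6 app servers are needed at peak.
print(servers_per_zone(6, zones=2))
print(servers_per_zone(6, zones=3))
```

With 2 zones, each zone needs all 6 servers (12 in total). With 3 zones, each needs 3 (9 in total), so adding zones reduces the overhead of absorbing a single zone failure.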
21. Next Steps
• Learn: Building Scalable Applications in the Cloud Whitepaper
  • http://www.rightscale.com/info_center/white-papers/building-scalable-applications-in-the-cloud.php
• Analyze: Deployment review of your environment
  • http://www.rightscale.com/about_us/contact_us.php
• Try: Free Edition
  • www.rightscale.com/free
Contact RightScale
(866) 720-0208
sales@rightscale.com
www.rightscale.com
Editor's Notes
Cold DR (Most common... hours): Staged Server Configuration and generally no staged data. Bring up the servers and load the data to fail over. Cold DR failover is typically manual.
Warm DR (Recommended... >hour): Staged Server Configuration, pre-staged data and running Database Slave Server. Warm DR failover is typically manual but can be automated.
Hot DR (Least common... but needed if <5 min): Parallel Deployment with all servers running but all traffic going to primary. Hot DR failover is normally automated.
Hot HA: Live/Live configuration. May use Geo-target IP services to direct traffic to regional load balancers. Failover to other region if one has problems. Hot HA is normally seamlessly automated.
Note: Other costs such as IOPS, volumes, other bandwidth, object storage, and snapshot storage are additional.