High Availability and Fault Tolerance: AWS + RightScale - RightScale Compute 2013

#1#
Safeguard Your Cloud Applications:
High Availability and Fault Tolerance

#2#
Agenda
• Terminology/Level-Setting
• Takeaways
• Cloud and Component Definitions
• Designing for Failure
• Architectural Options and Considerations
-High Availability
-Disaster Recovery
• Conclusions / Q&A

#3#
Faults?
• Facilities
• Hardware
• Networking
• Code
• People

#4#
What is “Fault-Tolerant”?
• Degrees of risk mitigation - not binary
• Automated
• Tested!

#5#
Old School Fault-Tolerance: Build Two

#6#
Cloud Computing Benefits
• No Up-Front Capital Expense
• Pay Only for What You Use
• Self-Service Infrastructure
• Easily Scale Up and Down
• Improve Agility & Time-to-Market
• Low Cost

#7#
Cloud Computing Fault-Tolerance Benefits
• No Up-Front HA Capital Expense
• Pay for DR Only When You Use It
• Self-Service DR Infrastructure
• Easily Deliver Fault-Tolerant Applications
• Improve Agility & Time-to-Recovery
• Low Cost Backups

#8#
AWS Cloud allows Overcast Redundancy
Have the shadow duplicate of your infrastructure ready to go when you need it…
…but only pay for what you actually use

#9#
Old Barriers to HA are now Surmountable
• Cost
• Complexity
• Expertise

#10#
AWS Building Blocks: Two Strategies
Inherently fault-tolerant services:
• Amazon S3
• Amazon SimpleDB
• Amazon DynamoDB
• Amazon CloudFront
• Amazon SWF
• Amazon SQS
• Amazon SNS
• Amazon SES
• Amazon Route 53
• Elastic Load Balancing
• AWS Elastic Beanstalk
• Amazon ElastiCache
• Amazon Elastic MapReduce
• AWS Identity and Access Management (IAM)
Services that are fault-tolerant with the right architecture:
• Amazon EC2
• Amazon Virtual Private Cloud (Amazon VPC)
• Amazon Elastic Block Store (EBS)
• Amazon Relational Database Service (Amazon RDS)

#11#
The Stack:
Resources
Deployment
Management
Configuration
Networking
Facilities
Geographies

#12#
Terminology
• Fault Tolerance: Ability of a system to continue operating properly (perhaps at a degraded level) if one or more components fail.
• Disaster Recovery (DR): The process, policies and procedures related to restoring critical systems after a catastrophic event. The goal is to get the application back up and running within a defined time period (RTO) and within a certain data loss window (RPO).
• High Availability (HA): Fault-tolerant systems are measured by their availability in terms of planned and unplanned service outages for end users.

#13#
Terminology - continued
• Recovery Time Objective (RTO): Time period in which service must be restored to meet BCP (Business Continuity Planning) objectives
• Recovery Point Objective (RPO): Acceptable data loss as a result of recovering from a disaster/catastrophic event
RTO and RPO are often at odds, and tradeoffs need to be made in order to find an acceptable middle ground
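A minimal sketch of checking a backup strategy against RTO and RPO targets. The targets, intervals and the function name are hypothetical, and worst-case data loss is assumed to equal one full backup interval (a failure just before the next backup).

```python
from datetime import timedelta


def meets_objectives(backup_interval, restore_time, rpo, rto):
    """Return (rpo_ok, rto_ok) for a backup-and-restore strategy.

    Assumption: worst-case data loss is one full backup interval.
    """
    return backup_interval <= rpo, restore_time <= rto


# Hypothetical targets: lose at most 15 minutes of data, recover within 1 hour.
rpo = timedelta(minutes=15)
rto = timedelta(hours=1)

hourly = meets_objectives(timedelta(hours=1), timedelta(minutes=45), rpo, rto)
frequent = meets_objectives(timedelta(minutes=10), timedelta(minutes=45), rpo, rto)
print(hourly)    # (False, True)
print(frequent)  # (True, True)
```

Tightening the backup interval satisfies the RPO here, but more frequent backups usually cost more, which is the kind of tradeoff this slide refers to.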

#14#
Takeaways
• Understand core concepts behind HA and DR
• Introduction to architectural options for designing HA, fault-tolerant
applications and DR environments and procedures
• Best Practices for implementation of these architectural
options within AWS (independent of RightScale)
-Multi-Availability Zone (AZ) and Multi-Region
-Architectural options and considerations / pros and cons of these options
• Understanding of the tools RightScale brings to AWS to
simplify the creation of these HA and DR environments

#15#
Regions & Availability Zones
• Zones within a region share a LAN (high bandwidth, low latency, private IP access)
• Zones utilize separate power sources, are physically segregated
• Regions are “islands”, and share no resources.
[Figure: AWS regions and their Availability Zones. Japan: Zones A, B. EU West Region: Zones A, B. US East Region: Zones A, B, C. US West Region: Zones A, B. Singapore: Zones A, B. Source: AWS]

#16#
Designing for Failure
• Large scale failures in the cloud are rare but do happen
• Application owners are ultimately responsible for
availability and recoverability
• Balance cost and complexity of HA efforts against
risk(s) you are willing to bear
• Cloud infrastructure has made DR and HA remarkably
affordable versus past options
-Multi-Server
-Multi-AZ (Availability Zone)
-Multi-Region
“Everything fails, all the time.”
Werner Vogels, CTO Amazon.com

#17#
Designing for Failure – Basic Concepts
• Fault tolerance is the goal. Degradation of service may occur,
but application continues to function.
• Avoid single points of failure (SPOF)
• Assume everything fails (remember Werner’s mantra) and
design accordingly
• Plan and practice your recovery process (both for HA and DR)
• Remember that better HA and DR equals more $$$. So find
that acceptable balance.

#18#
High Availability
Don’t sweat the small stuff.
And it’s all small stuff*
*(until it’s not)
Follow a few general best practices to absorb
application component outages…

#19#
General HA Best Practices
• Avoid single points of failure.
• Always place at least one of each component (load balancers,
app servers, databases) in each of at least two AZs.
• Replicate data across AZs (HA) and back up or replicate
across regions for failover (DR)
• Set up monitoring, alerts and operations to identify problems and
automate their resolution or the failover process.
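A rough sketch of why a second AZ helps. It assumes zone failures are independent and that the service stays up as long as any one zone is up. The 99% per-zone figure is a hypothetical input, not an AWS number.

```python
def combined_availability(per_zone: float, zones: int) -> float:
    """Availability when the service survives as long as any one zone is up.

    Assumption: zones fail independently of each other.
    """
    return 1 - (1 - per_zone) ** zones


one_zone = combined_availability(0.99, 1)
two_zones = combined_availability(0.99, 2)
print(f"{one_zone:.4%}")   # 99.0000%
print(f"{two_zones:.4%}")  # 99.9900%
```

Correlated failures (shared code, people, or a region-wide event) break the independence assumption, which is why the deck also covers multi-region DR.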

#20#
• High availability for top web properties
with 270M visitors/month
• Migration from datacenter to AWS
• RightScale provides
-Self-service access to developers
-Consistency and low maintenance
-Usage and cost accounting
-Multi-region architectures to avoid downtime

#21#
Multi-Zone HA
[Diagram: DNS in front of load balancers (172.168.7.31, 172.168.8.62) and autoscaling app servers in US-EAST 1a and US-EAST 1b; a master DB replicating to a slave DB; snapshots of the EBS volume stored in S3.]
• Snapshot the data volume for backups so the database can be readily recovered within the region.
• Place slave databases in one or more zones for failover.
• Consider local storage for an additional slave database to remove the dependency on an attached volume.
• Consider distributed NoSQL databases, with the same distribution considerations.

#22#
Disaster Recovery
DR presents a few new wrinkles compared to HA,
but there are multiple options depending on your
needs and budget…
Don’t sweat the small stuff.
And it’s all small stuff*
*(until it’s not)

#23#
HA/DR Checklist for Risk Mitigation
• Determine who owns the architecture, DR process and testing.
• Develop expertise in-house and / or get outside help.
• Conduct a risk assessment for each application.
• Specify your target RTO and RPO.
• Design for failure starting with application architecture. This
will help drive the infrastructure architecture.

#24#
HA/DR Checklist for Risk Mitigation
• Implement HA best practices balancing cost, complexity and
risk.
-Automate infrastructure for consistency and reliability.
• Document operational processes and automations.
• Test the failover... then test it again.
• Release the Chaos Monkey.

#25#
Multi-Region/Cloud DR Options
Option          Note                Downtime   Availability  Cost
Cold DR         (Most Common)       > 1 Hour   99%           $
Warm DR         (Recommended)       < 1 Hour   99.5%         $$
Hot DR          (Least Common)      < 5 Mins   99.9%         $$$
Multi-Cloud HA  (Live/Live Config)  0          99.999%       $$$$
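The availability percentages on this slide can be translated into allowed downtime per year. A small sketch, assuming a 365-day year:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a service may be down at the given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for pct in (99.0, 99.5, 99.9, 99.999):
    print(f"{pct}% -> {downtime_hours_per_year(pct):.2f} hours/year")
```

At 99% that is roughly 87.6 hours a year, while 99.999% leaves only about five minutes.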

#26#
Multi-Region Cold DR
[Diagram: deployment across US EAST and US WEST. Labels: DNS, load balancers (172.168.7.31), app servers, master DB, slave DBs, replicate, snapshots, EBS, S3.]
Staged Server Configuration and generally no staged data
• Not recommended if rapid recovery is required
• Slow to replicate data to other cloud and bring database online

#27#
Multi-Region Warm DR
[Diagram: deployment across US EAST and US WEST. Labels: DNS, load balancers (172.168.7.31), app servers, master DB, slave DBs, replicate (within and across regions), snapshots, EBS, S3.]
Staged Server Configuration, pre-staged data and running Slave Database Server
• Generally recommended DR solution
• Minimal additional cost and allows fairly rapid recovery

#28#
Multi-Region Hot DR
[Diagram: deployment across US EAST and US WEST. Labels: DNS, load balancers (172.168.7.31), app servers, master DB, slave DBs, replicate (within and across regions), snapshots, EBS, S3.]
Parallel Deployment with all servers running but all traffic going to primary
• Not recommended
• Very high additional cost to allow rapid recovery

#29#
Hybrid HA
[Diagram: deployment across US-EAST and CHICAGO. Labels: DNS, load balancers (172.168.7.31, 172.168.8.62), app servers, master DB, slave DBs, replicate, snapshots, EBS, S3, SWIFT.]
Live/Live configuration. Geo-target IP services to direct traffic to regional LBs.
• Possible, but not recommended (more to follow…)
• Max additional cost and max availability, but complex to implement and manage

#30#
Hybrid HA
[Diagram: deployment across US-EAST and CHICAGO. Labels: DNS, load balancers (172.168.7.31, 172.168.8.62), app servers, master DB, slave DBs, replicate, snapshots, EBS volume, S3, SWIFT.]
Looks similar to Multi-Zone… but additional problems to solve as some resources are not shared
• You need DNS management or a global load balancer.
• Security requires additional effort as security groups are Region-specific.
• Machine Images are specific to the cloud/region.

#31#
• Procurement software
• SLAs with their customers require HA
• Subway chain is a customer that procures perishable goods
through Coupa

#32#
In the Dashboard
• Multi-region or cloud
• Multi-region Warm DR
• Staged servers
• Cost forecasting for DR environment

#33#
Automating HA and DR
• Use dynamic DNS for your database servers
-Allow app servers to use a single FQDN.
-Use a low TTL to allow rapid failover in the case of a change in master database
• Automatic connection of app servers to load balancing servers
-App servers can connect to all load balancers automatically at launch
-No manual intervention
-No DNS modifications
• Automated promotion of slave to master
-Process is automated
-Decision to run process is manual
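A minimal sketch of the "automated process, manual decision" idea for database failover behind a single FQDN. The record type, function name, hostname and IP addresses are hypothetical and do not reflect any RightScale or AWS API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DnsRecord:
    fqdn: str          # single name the app servers connect to
    target_ip: str     # address of the current master database
    ttl_seconds: int   # low TTL so clients re-resolve quickly


def promote_slave(record: DnsRecord, slave_ip: str, operator_confirmed: bool) -> DnsRecord:
    """Repoint the master FQDN at the slave, but only after a human decision."""
    if not operator_confirmed:
        return record  # the decision to run the process stays manual
    # A real promotion would also stop replication and make the slave writable.
    return replace(record, target_ip=slave_ip)


# Hypothetical values for illustration only.
master = DnsRecord(fqdn="db-master.example.com", target_ip="10.0.1.10", ttl_seconds=60)
unchanged = promote_slave(master, "10.0.2.10", operator_confirmed=False)
failed_over = promote_slave(master, "10.0.2.10", operator_confirmed=True)
print(unchanged.target_ip, failed_over.target_ip)
```

Because app servers only know the FQDN, they pick up the new master once cached lookups expire, which the low TTL keeps short.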

#34#
MultiCloud Images
How RightScale makes it possible
• MultiCloud Images can be launched across regions and hybrid clouds without modification
1. ServerTemplate contains a list of MultiCloud Images (MCIs)
-Cloud A, RightImage 1
-Cloud B, RightImage 2
-Cloud C, RightImage 3
2. When the Server is created, a specific MCI is chosen (Cloud A, RightImage 1).
3. The appropriate RightImage is used at launch (Cloud A, Image 1).
RightImage: Stability across clouds
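The image selection described on this slide can be pictured as a lookup from cloud to image. This is a plain-dictionary illustration using the slide's labels, not RightScale's actual data model or API. The function name is hypothetical.

```python
# Hypothetical model: a MultiCloud Image maps each cloud to its RightImage.
mci = {
    "Cloud A": "RightImage 1",
    "Cloud B": "RightImage 2",
    "Cloud C": "RightImage 3",
}


def image_for_launch(mci: dict, cloud: str) -> str:
    """Pick the image to boot for the cloud the server is launched in."""
    if cloud not in mci:
        raise ValueError(f"MultiCloud Image has no image for {cloud!r}")
    return mci[cloud]


print(image_for_launch(mci, "Cloud A"))  # RightImage 1
```

The same server definition can then be launched in another cloud or region simply by looking up a different key.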

#35#
How RightScale makes it possible
ServerTemplates, Tags, and Inputs
• Automated load balancer registration and database connections
• Autoscaling across zones
• Dynamic configuration

#36#
DR Cost Comparison Example
             Multi-Region Cold DR    Multi-Region Warm DR      Multi-Region Hot DR
Total        $4480 / month           $5630 / month             $8800 / month
Running      $4470 / month           $5540 / month             $8440 / month
Staged       $0 / month              $0 / month                -
Replication  $10 / month             $90 / month               $360 / month
             25GB / day cross-zone   25GB / day cross-region   100GB / day cross-region
Running servers listed: 3 Load Balancers (Large), 6 App Servers (XLarge), 1 Master DB (2XLarge), 1 Slave DB (2XLarge)
Staged servers listed: 6 App Servers (XLarge)
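The totals on this slide can be checked by adding up the running, staged and replication figures. A short sketch using the slide's numbers. The Hot DR staged cost is not listed, so it is assumed to be 0 here.

```python
# Monthly figures (USD) taken from the cost comparison slide.
# Assumption: Hot DR has no staged cost because all servers are running.
costs = {
    "Cold DR": {"running": 4470, "staged": 0, "replication": 10},
    "Warm DR": {"running": 5540, "staged": 0, "replication": 90},
    "Hot DR": {"running": 8440, "staged": 0, "replication": 360},
}

totals = {option: sum(parts.values()) for option, parts in costs.items()}
baseline = totals["Cold DR"]

for option, total in totals.items():
    print(f"{option}: ${total} / month (+${total - baseline} vs Cold DR)")
```

On these figures Warm DR costs $1150 a month more than Cold DR, while Hot DR costs $4320 more.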

#37#
Outage-Proofing Best Practices
• Place in >1 zone: load balancers, app servers, databases
• Maintain capacity to absorb zone or region failures
• Replicate data across zones
• Design stateless apps for resilience to reboot / relaunch
• Back up across regions
• Monitor, alert, and automate operations to speed up failover
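A small sketch of what "maintain capacity to absorb zone failures" implies for sizing. It assumes servers are spread evenly across zones. The peak requirement of 6 app servers is a hypothetical input and the function name is illustrative.

```python
import math


def servers_per_zone(required: int, zones: int, zones_lost: int = 1) -> int:
    """Servers to run in each zone so the surviving zones still meet demand."""
    surviving = zones - zones_lost
    if surviving < 1:
        raise ValueError("at least one zone must survive")
    return math.ceil(required / surviving)


# Hypothetical peak requirement: 6 app servers.
for zones in (2, 3):
    per_zone = servers_per_zone(6, zones)
    print(f"{zones} zones: {per_zone} per zone, {per_zone * zones} total")
```

With two zones each zone must carry the full load (12 servers in total), while three zones need only 9, so spreading wider reduces the cost of the spare capacity.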

#38#
Resources and Q&A
AWS
Contact: aws.amazon.com/contact-us
RightScale
Try: RightScale Free Edition
www.rightscale.com/free
Contact:
Toll Free: 1.866.720.0208
Int’l: 1.805.855.0265
