Stacking up with OpenStack: Building for High Availability discusses designing applications for high availability (HA) when using OpenStack. It recommends eliminating single points of failure in the OpenStack infrastructure, designing applications to withstand server, zone, and cloud failures through techniques like replication, auto-scaling, and keeping management layers separate from infrastructure. The document also discusses different disaster recovery strategies and the tradeoffs between availability and cost.
Deployment topologies for high availability (HA), by Deepak Mane
The document discusses different deployment topologies for OpenStack high availability configurations. It describes the types of nodes in an OpenStack deployment including endpoint, controller, compute, and cinder volume nodes. It then examines several specific topology examples: one using a hardware load balancer with API services on compute nodes, another with a dedicated endpoint node and API services on controller nodes, and a third with simple controller redundancy and API services on controller nodes. Across all the examples, the key is distributing OpenStack services across nodes in a redundant and highly available manner.
The primary requirement for OpenStack-based clouds (public, private, or hybrid) is that they must be massively scalable and highly available. A number of interrelated concepts make understanding and implementing HA complex, and the consequences of not implementing HA correctly would be disastrous.
This session was presented at the OpenStack Meetup in Boston in February 2014. We discussed the interrelated concepts underpinning HA, along with examples of HA for MySQL, RabbitMQ, and the OpenStack APIs, primarily using Keepalived, VRRP, and HAProxy, to reinforce the concepts and show how to connect the dots.
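The HAProxy side of that setup boils down to round-robin balancing with health checks: requests rotate across the API backends, and any backend failing its check is skipped. A minimal sketch of that behavior in Python (backend names and the health-check function are hypothetical, for illustration only):

```python
from itertools import cycle

class RoundRobinBalancer:
    """Toy model of an HAProxy-style frontend: rotate across backends,
    skipping any that fail a health check."""

    def __init__(self, backends, health_check):
        self.backends = list(backends)
        self.health_check = health_check
        self._ring = cycle(self.backends)

    def next_backend(self):
        """Return the next healthy backend, or None if all are down."""
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if self.health_check(candidate):
                return candidate
        return None

# Simulate three API nodes, one of which is down.
up = {"api-1": True, "api-2": False, "api-3": True}
lb = RoundRobinBalancer(up, health_check=lambda b: up[b])
print([lb.next_backend() for _ in range(4)])  # api-2 is always skipped
```

In the real deployment, Keepalived and VRRP add the missing piece: a virtual IP that floats between redundant HAProxy instances so the balancer itself is not a single point of failure.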
This document discusses various approaches to implementing high availability (HA) in OpenStack including active/active and active/passive configurations. It provides an overview of HA techniques used at Deutsche Telekom and eBay/PayPal including load balancing APIs and databases, replicating RabbitMQ and MySQL, and configuring Pacemaker/Corosync for OpenStack services. It also discusses lessons learned around testing failures, placing services across availability zones, and having backups for HA infrastructures.
High availability and fault tolerance of OpenStack, by Deepak Mane
This document discusses building a fault tolerant and highly available architecture for OpenStack. It proposes:
1. A master-master cluster architecture for MySQL and session-level replication for RabbitMQ to provide high availability for the database and message broker components.
2. Disk-level replication using DRBD for Glance, Swift, and Cinder to provide redundancy at the storage level.
3. Ensuring high availability for networking and the Horizon dashboard.
4. Developing predictive and reactive models to detect failures in Nova, Swift, and compute instances and enable recovery of all components.
The document recommends using Pacemaker for cluster-level management and Corosync for reliable messaging between cluster nodes.
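A central rule in Pacemaker/Corosync clusters is quorum: a node may only run resources while it can see a strict majority of the cluster's votes, which prevents split-brain after a partition. A minimal sketch of that majority rule (the real vote accounting lives in Corosync's votequorum):

```python
def has_quorum(votes_present: int, total_votes: int) -> bool:
    """True when a strict majority of the cluster's votes is present."""
    return votes_present > total_votes // 2

# In a 3-node cluster, 2 nodes are enough; 1 is not.
print(has_quorum(2, 3), has_quorum(1, 3))  # True False

# A 2-node cluster cannot survive losing either node without extra
# tie-breaking (e.g. a quorum device), which is why clusters of
# three or more nodes are preferred.
print(has_quorum(1, 2))  # False
```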
The document discusses high availability (HA) techniques in OpenStack. It covers HA concepts for both stateless and stateful services. For compute HA, it discusses server evacuation and instance migration without and with shared storage. It then covers different HA options for OpenStack controllers, including Pacemaker/Corosync/DRBD for active-passive HA and Galera for active-active MySQL HA. It also discusses using Keepalived, HAProxy and VRRP for load balancing and failover of API services. Finally, it presents a sample highly available OpenStack architecture and lists additional resources.
Technical overview of how SUSE OpenStack Cloud uses Chef to implement highly available OpenStack infrastructure services.
Target audience: curious developers in the upstream openstack-chef community
These slides were extracted from internal HA training for SUSE OpenStack Cloud developers, and slightly modified for the benefit of the openstack-chef community.
A study and practice of an OpenStack Kilo release HA deployment. The Kilo documentation has some errors, and it is hard to find a detailed document describing how to deploy an HA cloud based on the Kilo release. Hopefully these slides can provide some clues.
Discover Kafka on OpenShift: Processing Real-Time Financial Events at Scale (..., by confluent
To provide exceptional customer experiences at scale, the data pipelines that move data reliably across systems and applications in real time must be seamlessly scalable. For the past several years, we relied on message-queue-based data pipelines to transfer data across applications. However, as the number of use cases requiring real-time data transfer grew rapidly, the messaging platform became difficult to scale. Moving to Kafka helped us resolve the pipeline scaling issues and reduce publisher/subscriber onboarding time from several weeks to a few days. To support on-demand scaling of Kafka clusters, we run them on Red Hat OpenShift, an enterprise Kubernetes distribution. While managing Kafka clusters that handle critical financial events, we have learned some lessons and developed efficient strategies for running production-grade Kafka on OpenShift. In this talk, we will:
1. Describe some of the challenges we faced with Kafka on OpenShift and how we evolved our infrastructure to overcome them.
2. Share our experiences operating Kafka clusters at scale in production.
3. Present our strategy for automated Kafka deployment and rollback in OpenShift.
4. Explain our failover strategy using Confluent's Replicator to ensure service availability during cluster failures.
You’ve heard all of the hype, but how can SMACK work for you? In this all-star lineup, you will learn how to create a reactive, scalable, resilient, and performant data processing powerhouse. Bringing Akka, Kafka, and Mesos together provides a foundation to develop and operate an elastically scalable actor system. We will go through the basics of Akka, Kafka, and Mesos and then dive deep into putting them together in an end-to-end (and back again) distributed transaction. Distributed transactions mean producers waiting for one or more consumers to respond. We'll also go through automated ways to induce failures in these systems (using LinkedIn's Simoorg) and trace them from start to stop through each component (using Twitter's Zipkin). Finally, you will see how Apache Cassandra and Spark can be combined to add the massively scalable storage and data analysis needed in fast data pipelines. With these technologies as a foundation, you have the assurance that scale is never a problem and uptime is the default.
Kafka on Kubernetes: Keeping It Simple (Nikki Thean, Etsy), Kafka Summit SF 2019, by confluent
Cloud migration: it's practically a rite of passage for anyone who's built infrastructure on bare metal. When we migrated our 5-year-old Kafka deployment from the datacenter to GCP, we were faced with the task of making our highly mutable server infrastructure more cloud-friendly. This led to a surprising decision: we chose to run our Kafka cluster on Kubernetes. I'll share war stories from our Kafka migration journey, explain why we chose Kubernetes over arguably simpler options like GCP VMs, and present the lessons we learned while making our way toward a stable and self-healing Kubernetes deployment. I'll also go through some improvements in the more recent Kafka releases that make upgrades crucial for any Kafka deployment on immutable and ephemeral infrastructure. You'll learn what happens when you try to run one complex distributed system on top of another, and come away with some handy tricks for automating cloud cluster management, plus some migration pitfalls to avoid. And if you're not sure whether running Kafka on Kubernetes is right for you, our experiences should provide some extra data points that you can use as you make that decision.
Scalable Persistent Storage for Erlang: Theory and Practice, by Amir Ghaffari
The RELEASE project at Glasgow University aims to improve the scalability of Erlang onto commodity architectures with 100,000 cores.
Such architectures require scalable and available persistent storage on up to 100 hosts. The talk describes the provision of scalable persistent storage options for Erlang.
We outline the theory and apply it to popular Erlang distributed database management systems (DBMSs): Mnesia, CouchDB, Riak, and Cassandra. We identify Dynamo-style NoSQL DBMSs as suitable scalable persistent storage technologies. To evidence the scalability, we benchmark Riak in practice, measuring the scalability and elasticity of Riak on a 100-node cluster with 800 cores.
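The defining trait of Dynamo-style stores like Riak is consistent hashing: each key maps to a position on a hash ring and is replicated to the next N distinct nodes clockwise, so adding or removing a node moves only a small fraction of keys. A toy sketch of that placement (node names are illustrative; real systems add virtual nodes and vector clocks on top):

```python
import hashlib
from bisect import bisect

def ring_position(name: str) -> int:
    """Map a node name or key to a position on the hash ring."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

def preference_list(key: str, nodes: list, n: int = 3) -> list:
    """The n nodes responsible for `key`, walking clockwise on the ring."""
    ring = sorted((ring_position(node), node) for node in nodes)
    start = bisect(ring, (ring_position(key), ""))
    return [ring[(start + i) % len(ring)][1] for i in range(n)]

nodes = [f"riak-{i}" for i in range(5)]
print(preference_list("user:42", nodes, n=3))  # three distinct replica owners
```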
Capacity planning is a difficult challenge faced by most companies. If you have too few machines, you will not have enough compute resources available to deal with heavy loads. On the other hand, if you have too many machines, you are wasting money. This is why companies have started investing in automatically scaling services and infrastructure to minimize the amount of wasted money and resources.
In this talk, Nathan will describe how Yelp is using PaaSTA, a PaaS built on top of open source tools including Docker, Mesos, Marathon, and Chronos, to automatically and gracefully scale services and the underlying cluster. He will go into detail about how this functionality was implemented and the design decisions that were made while architecting the system. He will also provide a brief comparison of how this approach differs from existing solutions.
Running a distributed system across Kubernetes clusters - KubeCon North Ameri..., by Alex Robinson
Kubernetes makes it easy to run distributed applications, even those that manage persistent state, within the confines of a single cluster. Running the same applications in a multi-region or multi-cloud fashion across multiple Kubernetes clusters, however, is considerably more difficult due to the networking and service discovery problems involved.
In this talk, Alex will walk through his team’s experience over the last six months of running a distributed database across Kubernetes clusters in different regions and their attempts to make the process repeatable on different cloud providers and on-prem environments. He’ll cover common problems they encountered, solutions they’ve tried, how they’re running things today, and the future improvements he’s most excited about from community projects like Istio.
Running Galera Cluster in Microsoft Azure involves setting up virtual machines and installing Galera Cluster software. This provides more control than Azure Database for MySQL, which uses asynchronous replication. While Azure Database for MySQL is fully managed, Galera Cluster in VMs supports the virtually synchronous replication that is its core feature. Cost estimates show running three Galera Cluster nodes in VMs costs less monthly than three hosted MySQL instances in Azure Database for MySQL.
This document discusses how Pulsar operators can be used to automate lifecycle management of Pulsar clusters on Kubernetes. It describes how operators use custom resource definitions and controllers to reconcile the actual cluster state with the desired state. Specific examples are provided for how operators can perform controlled cluster upgrades, scale bookies, and clean up after cluster deletion. The integration of operators with Helm charts is also covered.
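The heart of any such operator is the reconcile loop: compare the desired state declared in the custom resource with the actual cluster state and compute the actions that close the gap. A schematic sketch with hypothetical field names (real operators act on the Kubernetes API rather than returning strings):

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Return the actions needed to move `actual` toward `desired`."""
    actions = []
    # Scale the bookie (storage) tier up or down to the declared count.
    if actual.get("bookies", 0) < desired["bookies"]:
        actions.append(f"scale bookies up to {desired['bookies']}")
    elif actual.get("bookies", 0) > desired["bookies"]:
        actions.append(f"scale bookies down to {desired['bookies']}")
    # A version mismatch triggers a controlled rolling upgrade.
    if actual.get("version") != desired["version"]:
        actions.append(f"rolling upgrade to {desired['version']}")
    return actions

print(reconcile({"bookies": 5, "version": "2.8"},
                {"bookies": 3, "version": "2.7"}))
```

The controller runs this comparison continuously, so drift (a crashed bookie, a manual change) is corrected the same way as an intentional spec update.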
Galera Cluster for MySQL vs MySQL (NDB) Cluster: A High Level Comparison, by Severalnines
Galera Cluster for MySQL, Percona XtraDB Cluster, and MariaDB Cluster (the three “flavours” of Galera Cluster) make use of the Galera WSREP libraries to handle synchronous replication. MySQL Cluster is the official clustering solution from Oracle, while Galera Cluster for MySQL is slowly but surely establishing itself as the de facto clustering solution in the wider MySQL ecosystem.
In this webinar, we will look at all these alternatives and present an unbiased view on their strengths/weaknesses and the use cases that fit each alternative.
This webinar will cover the following:
MySQL Cluster architecture: strengths and limitations
Galera Architecture: strengths and limitations
Deployment scenarios
Data migration
Read and write workloads (Optimistic/pessimistic locking)
WAN/Geographical replication
Schema changes
Management and monitoring
Jakub Pavlik discusses high availability versus disaster recovery in OpenStack clouds. He describes four types of high availability in OpenStack: physical infrastructure, OpenStack control services, virtual machines, and applications. For each type, he outlines concepts like active/passive and active/active configurations, specific technologies used like Pacemaker, Corosync, HAProxy, and MySQL Galera, and considerations for shared and non-shared storage. Finally, he provides examples of high availability architectures and methods used by different OpenStack vendors.
Troubleshooting Kafka's socket server: from incident to resolution, by Joel Koshy
LinkedIn’s Kafka deployment is nearing 1,300 brokers that move close to 1.3 trillion messages a day. While operating Kafka smoothly even at this scale is a testament to both Kafka’s scalability and the operational expertise of LinkedIn SREs, we occasionally run into some very interesting bugs at this scale. In this talk I will dive into a production issue we recently encountered as an example of how even a subtle bug can suddenly manifest at scale and cause a near meltdown of the cluster. We will go over how we detected and responded to the situation, how we investigated it after the fact, and summarize some lessons learned and best practices from this incident.
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc..., by Data Con LA
Abstract:-
Tracking user events as they happen can challenge anyone providing real-time user interaction. It can demand both huge scale and a lot of processing to support dynamic adjustment of targeting for products and services. As an operational data store, Couchbase data services are capable of processing tens of millions of updates a day. Streamed through systems such as Apache Spark and Kafka into Hadoop, information about these key events can be turned into deeper knowledge. We will review Lambda architectures deployed at sites like PayPal, LivePerson, and LinkedIn that leverage a Couchbase data pipeline.
Bio:-
Justin Michaels. With over 20 years of experience deploying mission-critical systems, Justin's industry experience covers capacity planning, architecture, and industry verticals. He brings his passion for architecting, implementing, and improving Couchbase to the community as a Solution Architect. His expertise spans both conventional application platforms and distributed data management systems. He regularly engages with existing and new Couchbase customers on performance reviews, architecture planning, and best-practice guidance.
The document discusses using OpenStack and VMware vSphere together to build a hybrid cloud solution. It describes how traditional and cloud native workloads have different characteristics that influence infrastructure design. Traditionally, infrastructure was designed for resilience while clouds are designed for rapid scale and assume failures will occur. The document advocates designing applications to handle resiliency through loose coupling, horizontal scaling, and treating instances as cattle not pets. OpenStack can provide automation and orchestration of resources like vSphere hypervisors to deliver self-service IT capabilities at scale. An integrated OpenStack and vSphere hybrid solution provides customers the best of both worlds.
How Pulsar Stores Your Data - Pulsar Summit NA 2021, by StreamNative
In order to leverage the best performance characteristics of your stream backend, it is important to understand the nitty-gritty details of how Pulsar stores your data. Understanding this empowers you to design your solution to make the best use of the resources at hand and to get the optimal balance of consistency, availability, latency, and throughput for a given amount of resources.
With this underlying philosophy, in this talk we will get to the bottom of Pulsar's storage tier (Apache BookKeeper): the barebones of the BookKeeper storage semantics, how it is used in different use cases (even beyond Pulsar), the object models of storage in Pulsar, the different kinds of data structures and algorithms Pulsar uses, and how they map to the semantics of the storage class shipped with Pulsar by default. Oh yes, you can change the storage backend too with some additional code!
This session will empower you with the right background to map your data right with pulsar.
Friends don't let friends do dual writes: Outbox pattern with OpenShift Strea..., by Red Hat Developers
Dual writes are a common source of issues in distributed event-driven applications. A dual write occurs when an application has to change data in two different systems - for instance, when an application needs to persist data in the database and send a Kafka message to notify other systems. If one of these two operations fails, you might end up with inconsistent data, which can be hard to detect and fix.
OpenShift Streams for Apache Kafka is Red Hat's fully hosted and managed Apache Kafka service targeting development teams that want to incorporate streaming data and scalable messaging in their applications, without the burden of setting up and maintaining a Kafka cluster infrastructure. Debezium is an open source distributed platform for change data capture. Built on top of Apache Kafka, it allows applications to react to inserts, updates, and deletes in your databases.
In this session you will learn how you can leverage OpenShift Streams for Apache Kafka and Debezium to avoid the dual write issue in an event-driven application using the outbox pattern. More specifically, we will show you how to:
Provision a Kafka cluster on OpenShift Streams for Apache Kafka.
Deploy and configure Debezium to use OpenShift Streams for Apache Kafka.
Refactor an application to leverage Debezium and OpenShift Streams for Apache Kafka to avoid the dual write problem.
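The essence of the outbox pattern is that the business row and the event row are written in the same database transaction, so neither can exist without the other; a separate process (Debezium, in this session) then streams the outbox table into Kafka. A minimal sketch using SQLite, with illustrative table and column names:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         aggregate TEXT, type TEXT, payload TEXT);
""")

def place_order(order_id: int, item: str) -> None:
    with conn:  # one atomic transaction covers BOTH writes
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, item))
        conn.execute(
            "INSERT INTO outbox (aggregate, type, payload) VALUES (?, ?, ?)",
            ("order", "OrderCreated",
             json.dumps({"id": order_id, "item": item})),
        )

place_order(1, "widget")
# Both rows exist; had the transaction failed, neither would.
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0],
      conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0])
```

Because the event reaches Kafka only via change data capture on the outbox table, the application never performs the risky second write itself.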
It’s no news that containers represent a portable unit of deployment, and OpenStack has proven an ideal environment for running container workloads. However, things usually become more complex because an application is often built out of multiple containers, and spans hybrid environments: diverse clouds, bare metal, and even non-virtualized infrastructure. What’s more, setting up a cluster of container images can be fairly cumbersome, because you need to make one container aware of another and expose the intimate details required for them to communicate, which is not trivial, especially if they’re not on the same host.
These scenarios have instigated the demand for some kind of orchestrator. The list of container orchestrators is growing fairly fast. This session will compare the different orchestration projects out there - from Heat to Kubernetes to Mesos & Cloudify - and help you choose the right tool for the job.
Deep dive into highly available OpenStack architecture - OpenStack Summit Va..., by Arthur Berezin
This document summarizes a presentation on highly available OpenStack architecture. It discusses using Pacemaker and HAProxy to make the enabling services highly available. Shared databases such as MariaDB Galera and message queues such as RabbitMQ are made highly available. Individual OpenStack services like Keystone, Glance, Cinder, Nova, Neutron, and Horizon are made highly available through active-active clustering, load balancing, and fencing. The presentation covers topologies for controller, compute, network, and storage nodes. It provides examples of making individual services highly available and discusses ongoing work and future plans to improve high availability in OpenStack.
Rackspace aims to deploy code from the OpenStack trunk on demand to its multi-cell regions with minimal customer impact. It discusses strategies for merging and branching code, packaging and distributing releases, and deploying and testing in development, QA, and production environments. Challenges include managing code conflicts, disruptive database migrations, testing at scale, and aligning continuous integration/delivery processes with OpenStack release methodology. Rackspace is working to address these challenges to keep OpenStack trunk continuously deployable.
This document discusses using Hadoop for OpenStack log analysis to address challenges of operating OpenStack at scale. It proposes collecting logs continuously into Hadoop, parsing and indexing them intelligently, and defining a storage schema. The current development status includes batch loading of logs converted to AVRO format and indexed in SOLR. Next steps discussed include documenting patterns, collaborating on schema design, and getting sample logs to Hadoop experts.
A study and practice of OpenStack release Kilo HA deployment. The Kilo document has some errors, and it's hardly find a detailed document to describe how to deploy a HA cloud based on Kilo release. Hope this slides can provide some clues.
Discover Kafka on OpenShift: Processing Real-Time Financial Events at Scale (...confluent
"To provide exceptional customer experiences at scale, the data pipelines that can move data reliably across the systems and applications in real-time should be seamlessly scalable. For the past several years, we relied on Message Queue based data pipelines to facilitate the transfer of data across the applications. However, as the number of use cases that require real-time data transfer increased rapidly, it became difficult to scale the messaging platform. Moving to Kafka helped us to resolve the data pipeline scaling issues and reduce the Publisher/Subscriber on-boarding time from several weeks to a few days. To support the on-demand scaling of Kafka clusters, we run them on RedHat OpenShift, an Enterprise Kubernetes. While managing Kafka that handles critical financial events, we have learned some lessons and developed efficient strategies to manage production-grade Kafka clusters on OpenShift. In this talk, we will present:
1. Some of the challenges that we faced with Kafka on OpenShift and how we evolved our infrastructure to overcome them.
2. Share our experiences from operating Kafka clusters at Scale in Production.
3. Our strategy for performing automated Kafka deployment and rollback in OpenShift.
4. Explain our fail-over strategy using Confluent’s Replicator to ensure service availability during cluster failures."
You’ve heard all of the hype, but how can SMACK work for you? In this all-star lineup, you will learn how to create a reactive, scaling, resilient and performant data processing powerhouse. Bringing Akka, Kafka and Mesos together provides a foundation to develop and operate an elastically scalable actor system. We will go through the basics of Akka, Kafka and Mesos and then deep dive into putting them together in an end2end (and back again) distrubuted transaction. Distributed transactions mean producers waiting for one or more of consumers to respond. We'll also go through automated ways to failure induce these systems (using LinkedIn Simoorg) and trace them from start to stop through each component (using Twitters Zipkin). Finally, you will see how Apache Cassandra and Spark can be combined to add the incredibly scaling storage and data analysis needed in fast data pipelines. With these technologies as a foundation, you have the assurance that scale is never a problem and uptime is default.
Kafka on Kubernetes: Keeping It Simple (Nikki Thean, Etsy) Kafka Summit SF 2019confluent
Cloud migration: it's practically a rite of passage for anyone who's built infrastructure on bare metal. When we migrated our 5-year-old Kafka deployment from the datacenter to GCP, we were faced with the task of making our highly mutable server infrastructure more cloud-friendly. This led to a surprising decision: we chose to run our Kafka cluster on Kubernetes. I'll share war stories from our Kafka migration journey, explain why we chose Kubernetes over arguably simpler options like GCP VMs, and present the lessons we learned while making our way toward a stable and self-healing Kubernetes deployment. I'll also go through some improvements in the more recent Kafka releases that make upgrades crucial for any Kafka deployment on immutable and ephemeral infrastructure. You'll learn what happens when you try to run one complex distributed system on top of another, and come away with some handy tricks for automating cloud cluster management, plus some migration pitfalls to avoid. And if you're not sure whether running Kafka on Kubernetes is right for you, our experiences should provide some extra data points that you can use as you make that decision.
Scalable Persistent Storage for Erlang: Theory and PracticeAmir Ghaffari
The RELEASE project at Glasgow University aims to improve the scalability of Erlang onto commodity architectures with 100,000 cores.
Such architectures require scalable and available persistent storage on up to 100 hosts. The talk describes the provision of scalable persistent storage options for Erlang.
We outline the theory and apply it to popular Erlang distributed database management systems (DBMS): Mnesia, CouchDB, Riak and Cassandra. We identify Dynamo-style NoSQL DBMS as suitable scalable persistent storage technologies. To evidence the scalability we benchmark Riak in practice, measuring the scalability and elasticity of Riak on 100-node cluster with 800 cores.
Capacity planning is a difficult challenge faced by most companies. If you have too few machines, you will not have enough compute resources available to deal with heavy loads. On the other hand, if you have too many machines, you are wasting money. This is why companies have started investing in automatically scaling services and infrastructure to minimize the amount of wasted money and resources.
In this talk, Nathan will describe how Yelp is using PaaSTA, a PaaS built on top of open source tools including Docker, Mesos, Marathon, and Chronos, to automatically and gracefully scale services and the underlying cluster. He will go into detail about how this functionality was implemented and the design designs that were made while architecting the system. He will also provide a brief comparison of how this approach differs from existing solutions.
Running a distributed system across kubernetes clusters - Kubecon North Ameri...Alex Robinson
Kubernetes makes it easy to run distributed applications, even those that manage persistent state, within the confines of a single cluster. Running the same applications in a multi-region or multi-cloud fashion across multiple Kubernetes clusters, however, is considerably more difficult due to the networking and service discovery problems involved.
In this talk, Alex will walk through his team’s experience over the last six months of running a distributed database across Kubernetes clusters in different regions and their attempts to make the process repeatable on different cloud providers and on-prem environments. He’ll cover common problems they encountered, solutions they’ve tried, how they’re running things today, and the future improvements he’s most excited about from community projects like Istio.
Running Galera Cluster in Microsoft Azure involves setting up virtual machines and installing Galera Cluster software. This provides more control than Azure Database for MySQL, which uses asynchronous replication. While Azure Database for MySQL is fully managed, Galera Cluster in VMs supports the virtually synchronous replication that is its core feature. Cost estimates show running three Galera Cluster nodes in VMs costs less monthly than three hosted MySQL instances in Azure Database for MySQL.
This document discusses how Pulsar operators can be used to automate lifecycle management of Pulsar clusters on Kubernetes. It describes how operators use custom resource definitions and controllers to reconcile the actual cluster state with the desired state. Specific examples are provided for how operators can perform controlled cluster upgrades, scale bookies, and clean up after cluster deletion. The integration of operators with Helm charts is also covered.
Galera Cluster for MySQL vs MySQL (NDB) Cluster: A High Level Comparison Severalnines
Galera Cluster for MySQL, Percona XtraDB Cluster and MariaDB Cluster (the three “flavours” of Galera Cluster) make use of the Galera WSREP libraries to handle synchronous replication. MySQL Cluster is the official clustering solution from Oracle, while Galera Cluster for MySQL is slowly but surely establishing itself as the de-facto clustering solution in the wider MySQL eco-system.
In this webinar, we will look at all these alternatives and present an unbiased view on their strengths/weaknesses and the use cases that fit each alternative.
This webinar will cover the following:
MySQL Cluster architecture: strengths and limitations
Galera Architecture: strengths and limitations
Deployment scenarios
Data migration
Read and write workloads (Optimistic/pessimistic locking)
WAN/Geographical replication
Schema changes
Management and monitoring
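To make the optimistic/pessimistic distinction from the topic list concrete, here is a toy Python sketch (not Galera code) of optimistic locking with a version counter. It mirrors how Galera's certification-based replication detects write conflicts at commit time rather than blocking writers up front:

```python
# Toy illustration of optimistic locking: each row carries a version, and a
# write commits only if the version it read is still current; otherwise the
# transaction fails certification and must be retried.

class ConflictError(Exception):
    pass

class Row:
    def __init__(self, value):
        self.value = value
        self.version = 0

def optimistic_update(row: Row, read_version: int, new_value) -> None:
    """Commit only if nobody else updated the row since we read it."""
    if row.version != read_version:
        raise ConflictError("certification failed; retry the transaction")
    row.value = new_value
    row.version += 1

row = Row("a")
v = row.version                      # transaction 1 reads version 0
optimistic_update(row, v, "b")       # commits, version becomes 1
try:
    optimistic_update(row, v, "c")   # stale read: conflicts with the commit above
except ConflictError:
    optimistic_update(row, row.version, "c")  # re-read and retry
```

Pessimistic locking would instead take a lock before reading, trading retry logic for blocking; that trade-off is exactly what differentiates Galera's write behaviour from NDB's.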
Jakub Pavlik discusses high availability versus disaster recovery in OpenStack clouds. He describes four types of high availability in OpenStack: physical infrastructure, OpenStack control services, virtual machines, and applications. For each type, he outlines concepts like active/passive and active/active configurations, specific technologies used like Pacemaker, Corosync, HAProxy, and MySQL Galera, and considerations for shared and non-shared storage. Finally, he provides examples of high availability architectures and methods used by different OpenStack vendors.
Troubleshooting Kafka's socket server: from incident to resolutionJoel Koshy
LinkedIn’s Kafka deployment is nearing 1300 brokers that move close to 1.3 trillion messages a day. While operating Kafka smoothly even at this scale is a testament to both Kafka’s scalability and the operational expertise of LinkedIn SREs, we occasionally run into some very interesting bugs at this scale. In this talk I will dive into a production issue that we recently encountered as an example of how even a subtle bug can suddenly manifest at scale and cause a near meltdown of the cluster. We will go over how we detected and responded to the situation, investigated it after the fact, and summarize some lessons learned and best practices from this incident.
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Data Con LA
Abstract:-
Tracking user events as they happen can challenge anyone providing real time user interaction. It can demand both huge scale and a lot of processing to support dynamic adjustment to targeting products and services. As the operational data store Couchbase data services are capable of processing tens of millions of updates a day. Streaming through systems such as Apache Spark and Kafka into Hadoop, information about these key events can be turned into deeper knowledge. We will review Lambda architectures deployed at sites like PayPal, Live Person and LinkedIn that leverage a Couchbase Data Pipeline.
Bio:-
Justin Michaels. With over 20 years of experience deploying mission-critical systems, Justin Michaels has industry experience covering capacity planning, architecture, and multiple industry verticals. Justin brings his passion for architecting, implementing, and improving Couchbase to the community as a Solution Architect. His expertise spans both conventional application platforms and distributed data management systems. He regularly engages with existing and new Couchbase customers in performance reviews, architecture planning, and best-practice guidance.
The document discusses using OpenStack and VMware vSphere together to build a hybrid cloud solution. It describes how traditional and cloud native workloads have different characteristics that influence infrastructure design. Traditionally, infrastructure was designed for resilience while clouds are designed for rapid scale and assume failures will occur. The document advocates designing applications to handle resiliency through loose coupling, horizontal scaling, and treating instances as cattle not pets. OpenStack can provide automation and orchestration of resources like vSphere hypervisors to deliver self-service IT capabilities at scale. An integrated OpenStack and vSphere hybrid solution provides customers the best of both worlds.
How Pulsar Stores Your Data - Pulsar Summit NA 2021StreamNative
In order to get the best performance characteristics out of your stream backend, it is important to understand the nitty-gritty details of how Pulsar stores your data. Understanding this empowers you to design your use-case solution to make the best use of the resources at hand, and to get the optimum amount of consistency, availability, latency, and throughput for a given amount of resources.
With this underlying philosophy, in this talk we will get to the bottom of Pulsar's storage tier (Apache BookKeeper): the barebones of the BookKeeper storage semantics, how it is used in different use cases (even beyond Pulsar), the object models of storage in Pulsar, the different kinds of data structures and algorithms Pulsar uses, and how those map to the semantics of the storage class shipped with Pulsar by default. Oh yes, you can change the storage backend too with some additional code!
This session will empower you with the right background to map your data right with pulsar.
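As a rough mental model of that storage tier (toy Python, not the BookKeeper API): a topic's log is a chain of ledgers, and each message is addressed by a (ledger_id, entry_id) pair, so reading a position means locating the right ledger and then the entry inside it.

```python
# Simplified object model of Pulsar's storage layout: a topic's log is an
# ordered chain of ledgers; every stored message gets a (ledger_id, entry_id)
# address. This is a conceptual sketch, not BookKeeper code.

class Ledger:
    def __init__(self, ledger_id: int):
        self.ledger_id = ledger_id
        self.entries: list[bytes] = []

    def append(self, payload: bytes) -> tuple[int, int]:
        self.entries.append(payload)
        return (self.ledger_id, len(self.entries) - 1)

class TopicLog:
    """A topic's log: an ordered chain of ledgers."""
    def __init__(self):
        self.ledgers: dict[int, Ledger] = {}
        self.current = None   # the one writable ledger
        self.next_id = 0

    def roll_ledger(self) -> None:
        # Pulsar rolls to a fresh ledger periodically; closed ledgers
        # become immutable, which is what enables tiered offload.
        self.current = Ledger(self.next_id)
        self.ledgers[self.next_id] = self.current
        self.next_id += 1

    def append(self, payload: bytes) -> tuple[int, int]:
        if self.current is None:
            self.roll_ledger()
        return self.current.append(payload)

    def read(self, ledger_id: int, entry_id: int) -> bytes:
        return self.ledgers[ledger_id].entries[entry_id]

log = TopicLog()
pos1 = log.append(b"m1")
log.roll_ledger()
pos2 = log.append(b"m2")   # lands in a different ledger than m1
```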
Friends don't let friends do dual writes: Outbox pattern with OpenShift Strea...Red Hat Developers
Dual writes are a common source of issues in distributed event-driven applications. A dual write occurs when an application has to change data in two different systems - for instance, when an application needs to persist data in the database and send a Kafka message to notify other systems. If one of these two operations fail, you might end up with inconsistent data which can be hard to detect and fix.
OpenShift Streams for Apache Kafka is Red Hat's fully hosted and managed Apache Kafka service targeting development teams that want to incorporate streaming data and scalable messaging in their applications, without the burden of setting up and maintaining a Kafka cluster infrastructure. Debezium is an open source distributed platform for change data capture. Built on top of Apache Kafka, it allows applications to react to inserts, updates, and deletes in your databases.
In this session you will learn how you can leverage OpenShift Streams for Apache Kafka and Debezium to avoid the dual write issue in an event-driven application using the outbox pattern. More specifically, we will show you how to:
Provision a Kafka cluster on OpenShift Streams for Apache Kafka.
Deploy and configure Debezium to use OpenShift Streams for Apache Kafka.
Refactor an application to leverage Debezium and OpenShift Streams for Apache Kafka to avoid the dual write problem.
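The outbox pattern itself can be sketched independently of the Red Hat tooling. In this minimal Python example, sqlite3 stands in for the application database and a polling function stands in for Debezium's change data capture: the business row and the event are written in a single local transaction, so the dual write never happens.

```python
# Minimal outbox-pattern sketch: one atomic local transaction writes both the
# business row and the event row; a CDC relay (Debezium in the real setup)
# later forwards outbox rows to Kafka.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, topic TEXT, payload TEXT);
""")

def place_order(item: str) -> None:
    # One transaction: either both rows commit or neither does.
    with db:
        cur = db.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        event = {"order_id": cur.lastrowid, "item": item}
        db.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders.created", json.dumps(event)),
        )

def relay_events() -> list:
    """Stand-in for the CDC relay that publishes outbox rows to Kafka."""
    rows = db.execute("SELECT topic, payload FROM outbox ORDER BY id").fetchall()
    return [(topic, json.loads(payload)) for topic, payload in rows]

place_order("keyboard")
```

Because the event is derived from the committed outbox row rather than sent directly to Kafka by the application, a crash between the two writes can no longer leave the systems inconsistent.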
It’s no news that containers represent a portable unit of deployment, and OpenStack has proven an ideal environment for running container workloads. Complexity grows, however, when an application is built out of multiple containers, or spans hybrid environments: diverse clouds, bare metal, and even non-virtualized infrastructure. What’s more, setting up a cluster of container images can be fairly cumbersome, because you need to make one container aware of another and expose the intimate details they require to communicate, which is not trivial, especially if they’re not on the same host.
These scenarios have instigated the demand for some kind of orchestrator. The list of container orchestrators is growing fairly fast. This session will compare the different orchestration projects out there - from Heat to Kubernetes to Mesos & Cloudify - and help you choose the right tool for the job.
Deep dive into highly available open stack architecture openstack summit va...Arthur Berezin
This document summarizes a presentation on highly available OpenStack architecture. It discusses using Pacemaker and HAProxy for high availability enabling services. Shared databases like MariaDB Galera and message queues like RabbitMQ are made highly available. Individual OpenStack services like Keystone, Glance, Cinder, Nova, Neutron, and Horizon are made highly available through active-active clustering, load balancing, and fencing. The presentation covers topologies for controller, compute, network, and storage nodes. It provides examples of making individual services highly available and discusses ongoing work and future plans to improve high availability in OpenStack.
Rackspace aims to deploy code from the OpenStack trunk on demand to its multi-cell regions with minimal customer impact. It discusses strategies for merging and branching code, packaging and distributing releases, and deploying and testing in development, QA, and production environments. Challenges include managing code conflicts, disruptive database migrations, testing at scale, and aligning continuous integration/delivery processes with OpenStack release methodology. Rackspace is working to address these challenges to keep OpenStack trunk continuously deployable.
This document discusses using Hadoop for OpenStack log analysis to address challenges of operating OpenStack at scale. It proposes collecting logs continuously into Hadoop, parsing and indexing them intelligently, and defining a storage schema. The current development status includes batch loading of logs converted to AVRO format and indexed in SOLR. Next steps discussed include documenting patterns, collaborating on schema design, and getting sample logs to Hadoop experts.
Jesus Gonzalez-Barahona presented on using Grimoire tools to analyze OpenStack. Grimoire includes tools to extract data from source code repositories, issue trackers, and mailing lists. This data is stored in SQL databases and can then be queried, analyzed, and visualized. Specifically, Bitergia has deployed an OpenStack activity dashboard that visualizes metrics on contributions over time. Future work includes tracking additional parameters like time-to-close issues and developer demographics. The goal is to provide transparency into OpenStack development and support data-driven community decisions.
The document discusses 10 things learned from implementing OpenStack. It covers topics like cloud geography, industry and technology diversity in OpenStack implementations, hybrid cloud models focusing on storage, continuous vs staged integration, the duality of OpenStack storage, diversity of development and operations models, distributed vs centralized control, using erasure coding vs RAID for resiliency, and the idea of a shared service proposal.
Blue host using openstack in a traditional hosting environmentOpenStack Foundation
Using OpenStack in a traditional hosting environment posed scaling challenges that required automating provisioning across multiple data centers. OpenStack was chosen for its open source support, typical cloud features, and ability to transition to a future cloud offering. Bluehost implemented OpenStack across over 10,000 servers, addressing issues like unstable messaging, overloaded MySQL, and premature networking plugins. Solutions involved read-only databases, optimized configurations, and custom scheduler, quantum, and nova components.
The document discusses Bluehost's experience transitioning to OpenStack for hosting over 10,000 physical servers. Key points:
- Bluehost needed an automated system to manage scaling rapidly from 1,000 to over 10,000 servers across data centers. OpenStack was chosen for its scalability, standard APIs, and momentum.
- Major challenges included scaling messaging, database, and APIs. Customizations were made to OpenStack components like Nova, Quantum, and MySQL to address these issues.
- Operational issues around reboots, monitoring, and network abstraction were solved through workarounds and enhanced plugins. Overall the experience highlighted areas like scalable databases and networking that need improvement for true OpenStack success at large scales.
Mark Collier, COO of the OpenStack Foundation, gave the opening keynote at the OpenStack Day London event in June 2014.
Much of the content was presented at the recent Summit in Atlanta as well: https://www.youtube.com/watch?v=H4j-Mnxenc4
Canonical transitioned its internal IT infrastructure to use OpenStack in their private cloud (CanoniStack) to practice what they preach to customers. This was challenging due to heterogeneous hardware, deciding on OpenStack software versions, and managing the cloud platform. They overcame these challenges and now run two OpenStack regions for internal systems. Looking forward, Canonical aims to run more internal services on CanoniStack and improve areas like high availability and live upgrades.
The document discusses considerations for building a private cloud using OpenStack Folsom. It covers topics such as the definition of a private cloud, sizing instances and flavors, network architecture including multiple networks, image storage and performance, and architecture examples for different sizes of private clouds. The document provides guidance on capacity planning, performance bottlenecks, and best practices for building a private cloud with OpenStack.
This document discusses the use of cloud computing in high energy physics (particle physics). It describes how clouds are being used to preserve long-term software and data from particle physics experiments. Clouds are also being used to provide distributed computing for exceptional computing demands. Private clouds have been enabled using the existing high-level trigger farms of the ATLAS and CMS experiments during periods when the accelerators are idle. Both private and commercial clouds are playing an important role in high energy physics.
The document outlines enhancements to the Trove database service from the Icehouse to Juno releases of OpenStack. Key additions in Juno include support for asynchronous MySQL replication, integration with Neutron networking, expanded configuration groups, additional datastore support like PostgreSQL and Vertica, cross-region backups, and improved testing. The goal is to provide a scalable, reliable database as a service with a fully-featured open source framework.
Canonical transitioned its internal IT infrastructure to use OpenStack in their private cloud (CanoniStack) to practice what they preach about cloud technologies. This transition was challenging due to organizational expectations for increased efficiency, heterogeneous hardware, and decisions around OpenStack software configuration and service management. They eventually implemented a production-ready private cloud (ProdStack) using specific Ubuntu OpenStack releases, hardware resource management with MAAS and Juju, and further improvements are planned around high availability, live upgrades, and resilience testing.
This document discusses how clouds are used in high energy physics research. It describes how clouds provide computing resources for experiments like the Large Hadron Collider. Clouds help process massive amounts of data from particle collisions and allow global collaboration between researchers. They also help preserve data and software from past experiments for long-term analysis. Clouds enable high energy physics to further understanding of fundamental questions about the universe.
1. The document discusses 10 things learned from implementing OpenStack including cloud geography, industry diversity, and technology diversity.
2. It explores the variety of consumption models for OpenStack including rack appliances, controllers, and software instances.
3. Integration approaches are discussed ranging from continuous integration to staged integration for different environments like surgery, air traffic control, or military systems.
Chef for OpenStack provides a framework for automatically deploying and managing OpenStack using Chef. It includes cookbooks for common OpenStack components like Keystone, Glance, Nova, etc. that allow OpenStack to be deployed in an automated, repeatable way. The project has many corporate contributors and aims to reduce fragmentation and encourage collaboration in deploying OpenStack. It provides a Chef repository, documentation, and community support through various channels. The goal is to make deploying and managing OpenStack infrastructure as code a reality.
Best Practices for Integrating a Third party Portal with OpenStackOpenStack Foundation
This document discusses best practices for integrating a third party portal with OpenStack. It notes that while Horizon allows resource consumption, it does not provide full enterprise management integration. The document outlines example OpenStack service provider and enterprise architectures that incorporate a third party portal. It also discusses user experience considerations like simplifying common operations and providing transparent cost management.
This document provides a brief conceptual overview of the OpenStack architecture:
OpenStack is an open source cloud operating system that consists of a set of interrelated services that are written in Python and provide APIs to interact with components like compute, networking, storage and identity. The core components include compute (Nova), object storage (Swift), block storage (Cinder), image service (Glance), identity (Keystone), networking (Quantum), and dashboard (Horizon), which provides a web-based user interface. Each component communicates with others via APIs to provide infrastructure as a service capabilities.
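As a small concrete example of that API-driven interaction: every call to a component like Nova or Cinder starts with a token from the identity service. The standard Keystone v3 password-authentication request is plain JSON; the user, password, and project names below are placeholders.

```python
# Build the standard Keystone v3 password-auth request body. POSTing this
# JSON to the identity endpoint (e.g. http://controller:5000/v3/auth/tokens)
# returns a token in the X-Subject-Token response header, which then
# authenticates calls to the other service APIs.

def keystone_auth_body(user: str, password: str, project: str,
                       domain: str = "Default") -> dict:
    return {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": user,
                        "domain": {"name": domain},
                        "password": password,
                    }
                },
            },
            # Scope the token to a project so it is usable against
            # project-owned resources (servers, volumes, networks).
            "scope": {
                "project": {"name": project, "domain": {"name": domain}}
            },
        }
    }

body = keystone_auth_body("demo", "secret", "demo-project")
```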
Cloud Immortality - Architecting for High Availability & Disaster RecoveryRightScale
RightScale Conference Santa Clara 2011: RightScale is involved in building diverse cloud environments for our customers. Want to deploy applications in highly available, fault-tolerant environments? Across multiple zones, regions, and clouds providers? We’ve walked the walk. And we have five years of best practices to share. This session will cover best practices and tips and tricks for architecting highly available, fault-tolerant multi-zone and multi-cloud deployments.
The presentation discussed moving applications to the cloud for scalability, flexibility, and pay-as-you-go pricing, noting key differences between RSAWEBCloud and AWS. Challenges for developers include optimizing applications for production environments and handling scaling, which requires separating concerns such as data types and using caching, load balancing, and autoscaling tools.
CloudSave is an object-relational mapping tool for the cloud that provides distributed transactions and optimal data placement across cloud resources while reducing unintended consequences, aiming to be as safe as a traditional database but as fast as the cloud. It is being designed and built using the GigaSpaces application platform to enable big and fast data applications to run reliably at cloud speeds. A demonstration of CloudSave is planned for April.
Tour de Clouds: Understanding Multi-Cloud IntegrationRightScale
Whether you are new to the cloud or a power user, join our discussion about the range of public and private clouds that RightScale supports. We will provide an overview of how and why we integrate with certain clouds, the capabilities of each cloud within RightScale, and how you can leverage these clouds for a variety of use cases.
RightScale overview and why I find it elegantGiri Fox
- RightScale is a cloud management platform that provides abstraction, automation, and governance across public and private clouds.
- It has around 250 staff across offices in several countries, including a new office opened in Australia, and has raised $47M in venture capital.
- RightScale provides a single pane of glass for provisioning, monitoring, and managing infrastructure across multiple cloud providers.
These are the slides of my presentation at the NYC MySQL Meetup on Sep 21 2012. There are tips and tricks about MySQL in the cloud and the SkySQL cloud data suite
RightScale is a cloud management platform company founded in 2006 and headquartered in Santa Barbara, CA with operating subsidiaries around the world. It helps customers easily consume and automate cloud resources through its cloud management platform and professional services. RightScale pioneered cloud management and now manages global cloud deployments for many large customers across multiple cloud providers.
RightScale is a startup cloud management platform provider with 250 employees. It provides tools to manage deployments on multiple public clouds through a single control plane that abstracts differences between clouds. This allows for remote control of configuration, automation, and governance. RightScale has various clients such as Zynga and PBS that use its tools to scale workloads, achieve high availability, and gain visibility into their cloud infrastructure.
WeLab Reaps Advantages of Multi-Cloud Capabilities. You Can Too.NuoDB
Traditional financial institutions are beginning to move critical core banking applications to the cloud, while new challengers in the form of digital-only banks are gaining millions of new accounts. These digital banks must meet their customers’ real-time demands while complying with new and changing regulatory requirements to ensure data privacy, security, and availability. Join us for this webinar to explore a case study featuring WeLab, a new Hong Kong digital bank. Learn how they combined Temenos Transact, a cloud-native, cloud-agnostic core banking solution, with NuoDB’s revolutionary distributed SQL database to:
- Deploy a fault tolerant multi-cluster environment across multiple clouds
- Use microservices, containers, and Kubernetes to increase speed to market
- Reduce TCO with on-demand scalability and built-in continuous availability
RightScale Conference Santa Clara 2011: When getting started with a new technology, it’s helpful to hear the war stories and successes of those who have gone before us. We’re excited that several RightScale customers will share their experiences of how they have achieved agility in the cloud.
Cloud Networking: Network aspects of the cloudSAIL
This document discusses networking aspects of cloud computing. It notes that connectivity between compute and storage resources in clouds is often overlooked. It proposes two approaches: 1) distributing cloud resources across the network and 2) assessing which network domain or data center to place resources in. The document introduces SAIL concepts like OCNI and DCP that aim to provide interfaces for requesting resources and establishing connectivity between distributed clouds in a way that ensures interoperability.
Building cross-region and cross-cloud high availability into your app, a real-life use case by Gigaspaces. Nati Shalom, Founder & CTO, Gigaspaces
Achieving high levels of availability and disaster recovery in a cloud environment requires the implementation of patterns and practices that introduce redundancy through multi-zone, multi-region, and multi-cloud deployments. As we move towards implementing higher availability, we cannot escape the direct increase in the accidental complexity of the deployment architecture resulting from lack of cloud portability and deployment lifecycle automation. We present how high availability and disaster recovery were achieved in reality by using the Cloudify open source framework on top of AWS. This approach applies to not just AWS but also other public clouds and private cloud environments such as Eucalyptus. The resulting reference architecture provides portable PostgreSQL replication and disaster recovery as well as application tier scalability across zones, regions, and public/private clouds through a unified deployment workflow.
This document discusses building private and hybrid clouds using RightScale cloud management tools. It describes RightScale as the #1 cloud management system, managing deployments for over 4 years globally. It defines what a cloud is, where RightScale fits in managing applications and infrastructure across public and private clouds. It highlights benefits like control, flexibility, and performance that customers gain from deploying private and hybrid cloud solutions with RightScale. Finally, it provides an overview of how to easily get started with a RightScale account and deploy server templates across diverse resource pools.
Challenge: The recent success of Docker containers signals the arrival of a new era: the number of CPUs is exploding 10-100 fold, and cloud networking is entering a new round of scalability upgrades.
Question: To scale UP or OUT? I.e., UPgrade or OUTgrade?
Answer from DaoliCloud’s practice: Better to scale OUT, and OpenFlow can help.
The document provides an overview of a Cloud Foundry bootcamp presented by Alvaro Videla. It includes information about the presenter such as his role as a Developer Advocate for Cloud Foundry, his blog and Twitter account. It also outlines the topics that will be covered in the bootcamp, including the basics of how Cloud Foundry works, Micro Cloud Foundry, the capabilities and services offered by Cloud Foundry, and a demo of deploying an application from the command line.
Nimble Storage - The Predicitive Multicloud Flash FabricVITO - Securitas
This document provides information about Nimble Storage and its Predictive MultiCloud Flash Fabric storage solutions. It highlights key features such as predictive analytics, flash storage arrays, data protection, multi-cloud capabilities, and an all-inclusive business model. The document is intended to promote Nimble Storage's products and solutions to potential customers.
Similar to Stacking up with OpenStack: Building for High Availability
In this webinar, we will review all important information for sponsors packages, add-ons, venue details, and how to become a sponsor.
Webinar recording: https://youtu.be/kUjMTNoX6yM
A few quick points for those who may be attending an OpenStack Summit for the first time. We are excited to see you in Barcelona, Spain October 25-28, 2016.
An overview of the 1H2016 OpenStack Marketing Plan shared with the marketing community during our regular calls. Learn more at https://wiki.openstack.org/wiki/Governance/Foundation/Marketing#Open_Marketing_Meetings_2016
The document lists the birthdays of various cities around the world, with dates ranging from June 30 to July 18. Cities celebrating on June 30 include Paris, France and Bucharest, Romania. Cities celebrating on July 1 include Sevilla, Spain, Athens, Greece, and Manila, Philippines. The list continues with over 30 cities across Europe, Asia, Africa, North America, South America, and Oceania celebrating their birthdays on subsequent dates throughout the month of July.
The Foundation marketing team put together a high level overview of 2H 2015 plans in order to get input from the marketing community and provide more information on how marketers can take advantage of the work, as well as get involved and contribute.
This is a content overview of the important information and details for sponsors of the upcoming OpenStack Summit in Tokyo, Japan taking place October 27 - 30.
You can watch a recording of the webinar here: https://openstack.webex.com/openstack/ldr.php?RCID=d48605b7ca9fdccd990ab20eb9334be8
This document provides an update on the OpenStack Cinder Liberty release. It outlines that 19 new volume drivers were added with CI testing, 29 blueprints and 134 bug fixes were completed. New features discussed include nested quotas to manage descendant project quotas, force detach to safely detach stuck volumes, a generic image cache to speed up volume creation from images, and improved migrations. It encourages reviewing the full specifications and provides contacts for more information.
The document summarizes updates to OpenStack Glance from the Kilo to Liberty releases. In Kilo, Glance added features like artifact repository, catalog indexing, image conversion/introspection, and support for multiple datastores in storage drivers. Liberty priorities included additional image listing filters, store refactor/cleaner API, encrypted/authenticated image support, and tag metadata CLI support. It also focused on Glance v3 API evolution and increasing adoption of the v2 API within OpenStack. The presenter invites questions by IRC, email, the OpenStack mailing list tagged [Glance], or the weekly Glance meeting.
The OpenStack Heat project update from July 2015 summarizes the Kilo release and previews plans for the upcoming Liberty release. Key accomplishments of the Kilo release included 74 implemented blueprints, 389 fixed bugs, and over 1100 code commits. New Kilo features improved nested stacks and added template functions, while new resources included alarms, volumes, and identity services. Upcoming changes in Liberty will add resources for encryption, monitoring, and containers, move tests and documentation into the Heat project, and focus on convergence and role-based availability of resources.
Neutron will focus on plugin decomposition in Liberty, improving the API, and enabling quality of service bandwidth limiting. Additional priorities include making the Linuxbridge driver ready for the gate, implementing role-based access control for networks, and integrating NFV and load balancing as a service features.
The summary discusses OpenStack Nova project updates post Liberty-1 release. Key points include:
- Kilo release focused on major architecture evolution, release of API v2.1, and reduced API downtime during upgrades.
- Plans for Liberty include continuing architectural evolution, improvements to API v2.1, reducing upgrade downtime, and making Cells v2 the default configuration.
- Scaling the Nova community focuses on better communication, renewed mentoring and onboarding, continued innovation within scope, and goals of supporting the Nova API ecosystem and improving stability, scalability and upgradability.
4. My relationship with HA (2001)
"How many 9s can your product do?"
Cloud Management #rightscale
5. So what did they mean by 5-9s?
Availability   Allowed Downtime per Year
99%            3.65 days
99.9%          8.76 hours
99.99%         52.56 minutes
99.999%        5.26 minutes
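The downtime figures above follow directly from the availability percentage. A minimal sketch, using the same 365-day year the table assumes:

```python
# Convert an availability percentage into allowed downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year

def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime permitted per year at the given availability."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {allowed_downtime_minutes(pct):.2f} minutes/year")
```

Running this reproduces the table: 99% allows 5256 minutes (3.65 days), while 99.999% allows only about 5.26 minutes.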
11. Golden Age of Cloud Computing
• No up-front capital expense
• Low-cost, self-service infrastructure
• Pay only for what you use
• Easily scale up and down
• Improve agility & time-to-market
12. Golden Age for Fault-Tolerance
• No up-front HA capital expense
• Low-cost, self-service DR infrastructure
• Pay for DR only when you use it
• Easily deliver fault-tolerant applications
• Improve agility & time-to-recovery
13. Yeah, but … what about my private cloud?
Applications deployed in private clouds have to worry about:
• The private cloud infrastructure being HA
• Application architecture HA / DR
• With public clouds, well, you get what your provider gives you
14. Private Cloud Infrastructure HA
There are several single points of failure in an OpenStack deployment:
• OpenStack API services
• MySQL
• RabbitMQ
These are solved in various ways:
• Pacemaker cluster management
• Keepalived (e.g. RAX Private Cloud)
• MySQL (Galera), RabbitMQ (active-active mirrored queues)
Eliminate SPoFs as best as you can.
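To make the API-service case concrete, here is a minimal sketch of fronting one OpenStack API (Keystone here) with HAProxy behind a Keepalived-managed virtual IP. The VIP (10.0.0.100) and controller addresses are illustrative assumptions, not values from the deck:

```
# Minimal sketch: load-balance the Keystone API across two controllers.
# 10.0.0.100 is an assumed VIP that Keepalived/VRRP floats between the
# HAProxy nodes; 10.0.0.11/.12 are assumed controller addresses.
listen keystone_api
    bind 10.0.0.100:5000
    balance roundrobin
    option httpchk GET /
    server controller1 10.0.0.11:5000 check inter 2000 rise 2 fall 3
    server controller2 10.0.0.12:5000 check inter 2000 rise 2 fall 3
```

The same pattern repeats per API endpoint; the health checks take a failed controller out of rotation so the VIP itself stops being a single point of failure.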
15. What about my app?
Design for failure:
• If your application relies on the cloud infrastructure's SLA for its HA needs, you are STUCK with that vendor / infrastructure.
• You need to balance cost and complexity against your risk tolerance.
• Design your application so that it is:
  - Built for server failure
  - Built for zone failure
  - Built for cloud failure
• Keep the management layer separate from the infrastructure.
16. Build for Server Failure
• Set up auto-scaling.
• Set up database mirroring in a master/slave configuration.
• Use static public IPs.
• Use dynamic DNS for private IPs.
17. Build for Zone Failure
[Diagram: DNS with static public IPs (e.g. 172.168.7.31, 172.168.8.62) routes traffic to load balancers in Zone 1 and Zone 2; auto-scaled app servers sit behind them; a master DB in one zone replicates to a slave DB in the other; block-storage snapshots are saved to the object store.]
• Where possible, use a NoSQL DB like Cassandra or MongoDB.
• Place slave databases in one or more zones for failover.
• Snapshot the data volume for backups so the database can be readily recovered within the region.
A creative deployment model would be to make your private cloud an "AZ" by placing it in close physical proximity to a public cloud provider.
18. Build for Cloud Failure (Cold DR) ($)
Staged server configuration and generally no staged data.
• Not recommended if rapid recovery is required.
• Slow to replicate data to the other cloud and bring the database online.
[Diagram: DNS points at the primary (private) cloud, which runs load balancers, app servers, and a replicating master/slave DB; a second cloud (Dallas) holds an idle, staged copy of the stack, with block snapshots shipped to cloud files.]
19. Build for Cloud Failure (Warm DR) ($$)
Staged server configuration, pre-staged data, and a running slave database server.
• Generally the recommended DR solution.
• Minimal additional cost, and allows fairly rapid recovery.
[Diagram: as in the Cold DR setup, but the master DB also replicates to a running slave DB in the secondary cloud, with snapshots taken in both clouds.]
20. Build for Cloud Failure (Hot DR) ($$$)
Parallel deployment with all servers running, but all traffic going to the primary.
• Not recommended.
• Very high additional cost to allow rapid recovery.
[Diagram: the full stack of load balancers, app servers, and replicating databases runs in both clouds, with snapshots in each; only DNS directs traffic to the primary.]
23. Automate and test everything
• Automate backups of your data.
• Set up monitoring and alerts.
• Run fire drills! Plan and practice your recovery procedures!
24. Separate the Management Layer from the Infrastructure
• Keep the keys to the car outside the car.
25. Automating HA and DR
• Use dynamic DNS for your database servers:
  - Allows app servers to use a single FQDN.
  - Use a low TTL to allow rapid failover when the master database changes.
• Automatic connection of app servers to load-balancing servers:
  - App servers connect to all load balancers automatically at launch.
  - No manual intervention, no DNS modifications.
• Automated promotion of slave to master:
  - The process is automated.
  - The decision to run the process is manual.
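The "automated process, manual decision" split for slave promotion can be sketched as follows. The `promote_slave` and `update_dynamic_dns` helpers are hypothetical stand-ins for your own tooling, not RightScale APIs; the FQDN is the example used in the speaker notes:

```python
# Sketch of "automated process, manual decision" slave promotion.
# promote_slave() and update_dynamic_dns() are hypothetical placeholders
# for site-specific tooling.

def promote_slave(slave_host: str) -> None:
    # e.g. stop replication on the slave and make it writable
    print(f"promoting {slave_host} to master")

def update_dynamic_dns(fqdn: str, new_host: str, ttl: int = 60) -> None:
    # a low TTL lets app servers re-resolve the FQDN quickly after failover
    print(f"pointing {fqdn} -> {new_host} (TTL {ttl}s)")

def failover(fqdn: str, slave_host: str, operator_confirmed: bool) -> bool:
    """Run the automated failover steps, but only on explicit operator approval."""
    if not operator_confirmed:
        # The decision stays manual: once you fail over, there is no going back.
        return False
    promote_slave(slave_host)
    update_dynamic_dns(fqdn, slave_host)
    return True

failover("mymaster.mydomain.com", "db-slave-1", operator_confirmed=True)
```

The key design point is the `operator_confirmed` gate: every step after it is automated, but a human still pulls the trigger, which guards against promoting a slave when the master was not actually down.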
28. How RightScale makes it possible: RightScale ServerTemplates™
• Reproducible: predictable deployment.
• Dynamic: configuration from scripts at boot time.
• Multi-cloud: cloud-agnostic and portable.
• Modular: role and behavior abstracted from the cloud infrastructure.
29. How RightScale makes it possible: MultiCloud Images
MultiCloud Images (MCIs) can be launched across regions and clouds without modification.
1. The ServerTemplate contains a list of MCIs.
2. When the server is created, a specific MCI is chosen.
3. The appropriate RightImage is used at launch.
[Diagram: one ServerTemplate maps, via its MCIs, to the matching RightImage in each cloud, giving stability across clouds.]
30. Outage-Proofing Best Practices
• Place load balancers, app servers, and databases in more than one zone.
• Replicate data across zones.
• Back up across regions & clouds.
• Maintain capacity to absorb zone or region failures.
• Monitor, alert, and automate operations to speed up failover.
• Design stateless apps for resilience to reboot / relaunch.
31. Thank you!
Sign up for a free account at: www.rightscale.com
Check out job postings at: www.rightscale.com/jobs
We are hiring!
Editor's Notes
Good afternoon, folks. I hope you are here for the high-availability discussion. In case of an emergency, we have specially arranged a highly available pair of exits to your left and behind you. So, let me tell you a bit about myself and what HA means to me: I am a product manager at RightScale.
My relationship with HA goes all the way back to my kindergarten years, growing up in India. Going into my first big kindergarten exam, I recall worrying about having more than one sharpened pencil in my pencil box, ready to go. (And yes, kindergarteners have exams in India, but that's an entirely different discussion.) Fast forward to my college days, taking my first big 747 flight to California: yes, you guessed it, I worried about the plane having enough engines so that if one of them failed, I wouldn't become fish food in the Pacific Ocean. Fast forward a few more years to my telecommunications days, visiting KDDI and NTT DoCoMo in Japan to discuss our messaging product. They got to the topic of "how many 9s does your product do?" almost immediately, and anything less than 5-9s would not have been an acceptable answer in the heavily regulated Japanese telecommunications market.
A quick definition of how the number of "9s" of availability translates into allowed downtime each year.
Leap forward to 2012: the cloud era is in full swing. Behemoth cloud providers are stamping out VMs like Oreo cookies while preaching the mantra "everything fails all the time". And rightfully so: in 2012 we saw 27 sizable outages across public, private, hosting, and SaaS providers. This is not restricted to cloud computing; the average company has 1 major and 3 minor data-center outages per year, at an average cost of $5K per minute of downtime. Outages are also becoming more public as more people move to the cloud: the first big one was in May 2010, and April 2011 brought one that got a lot of press.
Among the top five causes of outages were power loss, natural disasters, software bugs that cascaded, and operator errors. Even though large-scale outages are rare, they do happen and will continue to happen in the future.
In the aftermath of outages, you see these. Outages are expensive: there is nothing more frustrating to a modern-day consumer than going to a website and seeing that it's down. Every minute of downtime affects your revenue and your brand reputation. Computer Associates did a study last year estimating the cost of outages at about $26 billion a year.
We are in the golden age of cloud computing.
At the end of the day, you are responsible for the HA of your application; the cloud infrastructure only provides tools. Relying on the cloud infrastructure for HA is a recipe for trouble, as it locks you into that cloud. You need portability, so that when you move your application to another cloud it stands on its own merit. Weigh the complexity of HA against the risk, much like auto and home insurance: the cost of HA goes up exponentially as you reduce your tolerance for downtime (recovery time objective) and your tolerance for data loss (recovery point objective).
This is what we generally recommend when someone comes to us and says "I want HA": a three-tiered application with round-robin DNS, load balancers, an array of application servers, and master/slave databases, with at least one of each component in each AZ. Place the slave database in a different zone, so that if one of the zones goes down you will not have an outage; granted, there will be some performance degradation.
During emergencies, time is precious – make sure it works
If both go down, you have nowhere to go. If the disaster hits the management layer, you still have the app; if the disaster hits the app, you can execute your DR scenarios.
Which parts should you automate, and which parts shouldn't you? We always recommend using dynamic DNS for your DB servers. This allows app servers to use a single FQDN (e.g. mymaster.mydomain.com) resolved by the dynamic DNS: in case of a failover, the dynamic DNS is updated automatically and the servers discover the new DB once the (low) TTL expires. We also recommend automating the process of connecting app servers to load balancers, so that when a new app server fires up it registers itself with the load balancer without manual intervention. For slave promotion, the process is automated but the decision to run it is manual: once you push that button there is no going back, so make sure you are certain before you fail over. We have seen a promotion run in a case where the master wasn't really down.
I am representing RightScale today, so a little bit on how RightScale can help. ServerTemplates allow you to pre-configure servers by starting from a base image and adding scripts that run during the boot, operational, and shutdown phases of a server instance. The key benefit of a ServerTemplate is that it gives you an easily reproducible server setup, and this can be done across multiple clouds. Through the configuration mechanism built into ServerTemplates, servers can automatically join load-balancer pools, autoscale across zones, and so on.
A ServerTemplate contains a list of multi-cloud images; when a server is created, the appropriate image is used: quickly, efficiently, and repeatably.