OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Redundancy Doesn't Always
Mean "HA" or "Cluster"
A cautionary tale against using hammers to solve all redundancy and resiliency problems ...

OpenStack Design Summit – Oct 2012

Randy Bias (@randybias), CTO, Cloudscaling
Dan Sneddon (@dxs), Sr. Engineer, Cloudscaling

CCA - NoDerivs 3.0 Unported License - Usage OK, no modifications, full attribution*
* All unlicensed or borrowed works retain their original licenses
Thursday, October 18, 2012

1

Our Journey Today

1. “HA” pairs are not the only type of redundancy

2. Alternative redundancy patterns for HA

3. Redundancy patterns in Open Cloud System*

* Cloudscaling’s OpenStack-powered cloud operating system (“distribution”)

2

What Do We Mean By “HA”?
We mean what most people mean ...

Two servers or network devices that look like one

3

“HA HA”?
HA pairs come in a couple flavors

Active / Passive

4

“HA HA”?
People like this flavor best, but it’s not always possible...

Active / Active

5

“HA HA HA HA HA”??
Many people wish they could get it more like this ...

HA cluster aka ‘massive operational nightmare’

6

Cluster<bleep>!
Imagine this was 4 or 6 nodes in the cluster

• 4 network technologies
• 7 NICs / node
• A million different ways to break

7

“HA” Pairs Are One Type of Redundancy
Herein lies the problem ...

8

The Problem With “HA”-mmers
There are many, but these two matter most ...

• Catastrophic failures
• No scale out

9

HA Pairs Have Binary Failures
Either working or dead, nothing in-between

10

What is Scale-out?

Scale-up: make boxes bigger (usually an HA pair)
  A B  ->  bigger A B
Scale-out: make moar boxes
  A B  ->  A B C D ... N

11

Scaling out is a mindset
Scaling up is like treating your servers as pets

Pet: bowzer.company.com    Cattle: web001.company.com

Servers *are* cattle
12

HA Pair Failures* - 100% down
Hardware rarely fails, operators fail, software fails
Who          Type        Year  Why      Duration
Apple        Switch      2005  Bug      2 hrs
Flexiscale   SAN         2007  Ops Err  24 hrs
Vendio       NAS         2008  Ops Err  8 hrs
UOL Brazil   SAN         2011  Bug      72 hrs
Twitter      Datacenter  2012  Bug+Ops  2 hrs

* This is a handful of examples as a baseline; I’m sure you can find many more

13

“HA” Pairs Are an All-in Move
They better not fail ...

14

Risk Reduction
Many small failure domains are usually better

15

Big failure domains vs. small
Would you rather have the whole cloud down or just a
small bit for a short period of time?

Still a scale-up pattern ...
wouldn’t you rather scale-out?
16

Pair vs. Scale-out Load Balancing
No scale-out

State sync (HA pair): 100% loss on failure
Shared-nothing architecture (scale-out): 20% loss on failure
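
The loss figures above follow from counting failure domains. A minimal sketch, assuming a state-synced HA pair behaves as a single failure domain and the shared-nothing tier has five equal nodes (the node count is inferred from the 20% figure, and `capacity_lost` is a hypothetical helper, not part of any product):

```python
def capacity_lost(failure_domains: int, failed: int = 1) -> float:
    """Fraction of total capacity lost when `failed` equal-sized domains go down."""
    if failure_domains < 1 or not 0 <= failed <= failure_domains:
        raise ValueError("failed must be between 0 and failure_domains")
    return failed / failure_domains


# Assumption: a state-synced HA pair fails as a unit, so it is one failure domain.
ha_pair_loss = capacity_lost(failure_domains=1)

# Assumption: five shared-nothing nodes, each an independent failure domain.
scale_out_loss = capacity_lost(failure_domains=5)

print(f"HA pair: {ha_pair_loss:.0%} loss, scale-out: {scale_out_loss:.0%} loss")
```

The more failure domains a tier is split into, the smaller the share of capacity any single failure can take out.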

17

What’s Usually an “HA” Pair in OpenStack?
Everything ...

• Service Endpoints (APIs)
• Messaging System (RPC)
• Worker Threads (e.g. Scheduler, Networking)
• Database (MySQL)

18

What needs to be an HA pair?
Not much needs state synchronization

• Service Endpoints (APIs)
• Messaging System (RPC)
• Worker Threads (e.g. Scheduler, Networking)
• Database (MySQL)

19

Fault Tolerance Methodologies

20

Fault Tolerance in OCS

21

Service Distribution
High Availability Without Compromise

Resilient Stateless Scale-out

22

Service Distribution
Combines Standard Networking Technologies
OSPF: /etc/quagga/ospfd.conf
    router ospf
     ospf router-id 10.1.1.1
     network 10.1.255.1/32 area 0.0.0.0

Anycast: /etc/quagga/zebra.conf
    interface lo:2
     description Pound listening address
     ip address 10.1.255.1/32

Load-Balancing Proxy: /etc/pound/pound.conf
    ListenHTTP
        Address 10.1.255.1
        Port 8774
        xHTTP 1
        Service
            BackEnd
                Address 10.1.1.1
                Port 8774
            End
            BackEnd
                Address 10.1.1.2
                Port 8774
            End
        End
    End
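
Since every proxy node carries the same listener stanza and only the back-end list varies, the Pound section can be generated rather than hand-edited. A rough sketch, assuming only the directives shown on this slide (`ListenHTTP`, `Address`, `Port`, `xHTTP`, `Service`, `BackEnd`, `End`); `render_pound_listener` is a hypothetical helper, not a Pound or OCS tool:

```python
from typing import Sequence


def render_pound_listener(vip: str, port: int, backends: Sequence[str]) -> str:
    """Render one ListenHTTP stanza with a BackEnd block per back-end address."""
    lines = [
        "ListenHTTP",
        f"    Address {vip}",
        f"    Port {port}",
        "    xHTTP 1",
        "    Service",
    ]
    for address in backends:
        lines += [
            "        BackEnd",
            f"            Address {address}",
            f"            Port {port}",
            "        End",
        ]
    lines += ["    End", "End"]  # close Service, then ListenHTTP
    return "\n".join(lines)


# Values taken from the slide: anycast address, API port, two back ends.
config = render_pound_listener("10.1.255.1", 8774, ["10.1.1.1", "10.1.1.2"])
print(config)
```

Adding a back-end server then means adding one address to the list on every proxy node.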

23

Resilient OpenStack
Horizontally Scalable, No Single Point Of Failure

• Service Endpoints (APIs): Service Distribution
• Messaging System (RPC): ZeroMQ
• Worker Threads (e.g. Scheduler, Networking): Service Distribution
• Database (MySQL): MMR + HA

24

Service Distribution Advantages
What Makes This a Superior Solution?

• True horizontal scalability with no centralized controller
• Services are always running, failover is nearly instant
• Reduced complexity, fewer idle resources
• No need for separate load balancers

Failover vs. Distributed Services (diagram: a row of servers)

25

Perfect For Site Resiliency
Service Distribution Works With Multiple Sites
• Traditional HA pairs do not support cross-site resiliency

• Service Distribution fails over across sites without DNS redirection

26

Service Distribution in Action
Example: Distributed Load Balancing
1) OSPF: Quagga on each HTTP proxy node sends OSPF advertisements to the OSPF router(s)

27

1) OSPF: Quagga on each proxy node advertises to the OSPF router(s)
2) ECMP per-flow load balancing at the router(s)
3) Load-balancing HTTP proxy on each node

28

1) OSPF: Quagga on each proxy node advertises to the OSPF router(s)
2) ECMP per-flow load balancing at the router(s)
3) Load-balancing HTTP proxy on each node
4) Unlimited # of back-end servers
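
Per-flow ECMP can be pictured as hashing each flow's 5-tuple onto the set of next hops currently advertising the anycast address. The sketch below is an illustration only: real routers use vendor-specific hash functions, SHA-256 here is a stand-in, and `pick_next_hop` and the client address are hypothetical.

```python
import hashlib
from typing import List, Tuple

# Source IP, source port, destination IP, destination port, protocol.
Flow = Tuple[str, int, str, int, str]


def pick_next_hop(flow: Flow, next_hops: List[str]) -> str:
    """Map a flow onto one of the equal-cost next hops, deterministically."""
    if not next_hops:
        raise ValueError("no next hop is advertising the anycast address")
    digest = hashlib.sha256(repr(flow).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# Proxy nodes from the slide's config; both advertise 10.1.255.1 via OSPF.
proxies = ["10.1.1.1", "10.1.1.2"]
flow = ("192.0.2.10", 40001, "10.1.255.1", 8774, "tcp")

first = pick_next_hop(flow, proxies)
again = pick_next_hop(flow, proxies)  # same flow, same next hop

# If that proxy stops advertising, the router hashes over the survivors only.
survivors = [p for p in proxies if p != first]
rerouted = pick_next_hop(flow, survivors)

print(first, again, rerouted)
```

Packets of one flow stay on one proxy while it is advertised, and a withdrawn route simply shrinks the set that flows are hashed over.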

29

Failure Resiliency
Clients: 1 2 3 4
Load balancer/proxy nodes: 1 2 3 4 (one client flow each)
Servers: 10% load each
30

Failure Resiliency

Clients: 1 2 3 4
Load balancer/proxy nodes: one has failed (X); flows 1 and 2 now share a surviving node
Servers: 10% load each

31

Failure Resiliency

Clients: 1 2 3 4
Load balancer/proxy nodes: 1 2 3 4
Servers: one has failed (X); the survivors carry increased load (up from 10% each)
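
The size of the "increased load" is easy to estimate. A small worked example, assuming ten servers that each carry 10% of total traffic before the failure (the server count is inferred from the 10% figure) and an even spread afterwards; `load_per_server` is a hypothetical helper:

```python
def load_per_server(total_load_pct: float, servers: int) -> float:
    """Share of the total load each server carries when load is spread evenly."""
    if servers < 1:
        raise ValueError("at least one server must be up")
    return total_load_pct / servers


TOTAL = 100.0  # all traffic, as a percentage

before = load_per_server(TOTAL, 10)  # assumption: ten healthy servers
after = load_per_server(TOTAL, 9)    # one server has failed

print(f"before: {before:.1f}% each, after one failure: {after:.1f}% each")
```

Under these assumptions each survivor picks up roughly one extra percentage point, rather than one standby absorbing everything at once.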

32

OCS NAT Service
Example: Scale-out Network Address Translation
BGP: multiple ISP providers
NAT: Service Distribution
VMs

33

Brokerless Messaging With ZeroMQ
Avoiding RabbitMQ’s Single Point Of Failure
RabbitMQ (brokered): Nova-Compute, Nova-Scheduler, and Nova-API all talk through one RabbitMQ broker, a single point of failure
34

Brokerless Messaging With ZeroMQ
Avoiding RabbitMQ’s Single Point Of Failure
RabbitMQ (brokered): Nova-Compute, Nova-Scheduler, and Nova-API all talk through one RabbitMQ broker, a single point of failure
vs.
ZeroMQ (peer to peer): Nova-Compute, Nova-Scheduler, and Nova-API talk to each other directly
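
The difference between the two topologies can be checked with a toy graph model: remove each node in turn and see whether the remaining ones can still reach each other. This is a sketch of the topology argument only, assuming a single unclustered broker as drawn on the slide; it does not use the RabbitMQ or ZeroMQ APIs, and all function names are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Topology = Dict[str, Set[str]]


def build(edges: Iterable[Tuple[str, str]]) -> Topology:
    """Build an undirected adjacency map from (a, b) links."""
    topo: Topology = {}
    for a, b in edges:
        topo.setdefault(a, set()).add(b)
        topo.setdefault(b, set()).add(a)
    return topo


def is_connected(topo: Topology, removed: Optional[str] = None) -> bool:
    """True if all nodes except `removed` can still reach each other."""
    nodes = [n for n in topo if n != removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        current = stack.pop()
        for neighbour in topo[current]:
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(nodes)


def single_points_of_failure(topo: Topology) -> List[str]:
    """Nodes whose loss cuts the remaining nodes off from each other."""
    return sorted(n for n in topo if not is_connected(topo, removed=n))


services = ["nova-api", "nova-scheduler", "nova-compute"]

# Brokered: every service talks only to the broker.
brokered = build((s, "rabbitmq-broker") for s in services)

# Peer to peer: every service talks directly to every other service.
peer_to_peer = build(
    (a, b) for i, a in enumerate(services) for b in services[i + 1:]
)

print(single_points_of_failure(brokered))      # ['rabbitmq-broker']
print(single_points_of_failure(peer_to_peer))  # []
```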
35

What did we learn today?

1. HA-mmers are for nails

2. Scale-out rules for redundancy

3. Design-for-failure is a mentality, not a pair

4. Resiliency over redundancy

36

Q&A
Randy Bias (@randybias), CTO, Cloudscaling
Dan Sneddon (@dxs), Sr. Engineer, Cloudscaling

OCS 2.0
Public Cloud Benefits | Private Cloud Control | Open Cloud Economics

37
