OpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdf

  1. Redundancy Doesn't Always Mean "HA" or "Cluster": A cautionary tale against using hammers to solve all redundancy and resiliency problems. OpenStack Design Summit, Oct 2012 (Thursday, October 18, 2012). Randy Bias (@randybias), CTO, Cloudscaling; Dan Sneddon (@dxs), Sr. Engineer, Cloudscaling. CCA - NoDerivs 3.0 Unported License - Usage OK, no modifications, full attribution* (* All unlicensed or borrowed works retain their original licenses)
  2. Our Journey Today: (1) “HA” pairs are not the only type of redundancy; (2) Alternative redundancy patterns for HA; (3) Redundancy patterns in Open Cloud System* (* Cloudscaling’s OpenStack-powered cloud operating system, or “distribution”)
  3. What Do We Mean By “HA”? We mean what most people mean: two servers or network devices that look like one.
  4. “HA HA”? HA pairs come in a couple of flavors: Active / Passive
  5. “HA HA”? People like this flavor best, but it’s not always possible: Active / Active
  6. “HA HA HA HA HA”?? Many people wish they could get it more like this: HA cluster, aka ‘massive operational nightmare’
  7. Cluster<bleep>! Imagine this was 4 or 6 nodes in the cluster:
     • 4 network tech.
     • 7 NICs / node
     • A million different ways to break
  8. “HA” Pairs Are One Type of Redundancy: Herein lies the problem ...
  9. The Problem With “HA”-mmers: There are many, but these two matter most:
     • Catastrophic failures
     • No scale-out
  10. HA Pairs Have Binary Failures: Either working or dead, nothing in between
  11. What is Scale-out? Scale-up: make boxes bigger (usually an HA pair) [diagram: A, B]. Scale-out: make moar boxes [diagram: A, B, C, D ... N].
  12. Scaling out is a mindset: Scaling up is like treating your servers as pets (bowzer.company.com). Servers *are* cattle (web001.company.com).
  13. HA Pair Failures* - 100% down: Hardware rarely fails, operators fail, software fails
      Who         Type        Year  Why      Duration
      Apple       Switch      2005  Bug      2 hrs
      Flexiscale  SAN         2007  Ops Err  24 hrs
      Vendio      NAS         2008  Ops Err  8 hrs
      UOL Brazil  SAN         2011  Bug      72 hrs
      Twitter     Datacenter  2012  Bug+Ops  2 hrs
      * This is a handful of examples as a baseline; I’m sure you can find many more
  14. “HA” Pairs Are an All-in Move: They better not fail ...
  15. Risk Reduction: Many small failure domains are usually better
  16. Big failure domains vs. small: Would you rather have the whole cloud down or just a small bit for a short period of time? Still a scale-up pattern ... wouldn’t you rather scale-out?
  17. Pair vs. Scale-out Load Balancing: HA pair: no scale-out, state sync (100% loss on failure). Scale-out: shared-nothing architecture (20% loss on failure).
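The 100% and 20% figures on the slide above can be reproduced with a little arithmetic. This is a minimal Python sketch, not something from the deck: it assumes the scale-out diagram shows five independent nodes (inferred from the 20% figure) and that a state-synced pair fails as a single unit. The function name `capacity_lost` is hypothetical.

```python
def capacity_lost(total_nodes: int, failed_nodes: int = 1) -> float:
    """Fraction of serving capacity lost when `failed_nodes` fail together."""
    if total_nodes <= 0 or not 0 <= failed_nodes <= total_nodes:
        raise ValueError("need 0 <= failed_nodes <= total_nodes and total_nodes > 0")
    return failed_nodes / total_nodes


# Assumption: state sync makes the pair one failure domain, so a bug or
# operator error takes out both nodes at once.
ha_pair_loss = capacity_lost(total_nodes=2, failed_nodes=2)

# Assumption: five shared-nothing nodes, and a failure stays inside one node.
scale_out_loss = capacity_lost(total_nodes=5, failed_nodes=1)

print(f"HA pair loss:   {ha_pair_loss:.0%}")
print(f"Scale-out loss: {scale_out_loss:.0%}")
```

Adding nodes shrinks the failure domain further: with ten nodes the same single failure costs 10% of capacity.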
  19. What’s Usually an “HA” Pair in OpenStack? Everything ... Service Endpoints (APIs), Messaging System (RPC), Worker Threads (e.g. Scheduler, Networking), Database (MySQL)
  20. What needs to be an HA pair? Not much needs state synchronization. Service Endpoints (APIs), Messaging System (RPC), Worker Threads (e.g. Scheduler, Networking), Database (MySQL)
  21. Fault Tolerance Methodologies
  22. Fault Tolerance in OCS
  23. Service Distribution: High Availability Without Compromise. Resilient, Stateless, Scale-out
  24. Service Distribution: Combines Standard Networking Technologies
      OSPF (/etc/quagga/ospfd.conf):
          router ospf
           ospf router-id 10.1.1.1
           network 10.1.255.1 area 0.0.0.0
      Anycast (/etc/quagga/zebra.conf):
          interface lo:2
           description Pound listening address
           ip address 10.1.255.1/32
      Load-Balancing Proxy (/etc/pound/pound.conf):
          ListenHTTP
            Address 10.1.255.1
            Port 8774
            xHTTP 1
            Service
              BackEnd
                Address 10.1.1.1
                Port 8774
              End
              BackEnd
                Address 10.1.1.2
                Port 8774
              End
            End
          End
  25. Resilient OpenStack: Horizontally Scalable, No Single Point Of Failure. Service Endpoints (APIs): Service Distribution. Messaging System (RPC): ZeroMQ. Worker Threads (e.g. Scheduler, Networking): Service Distribution. Database (MySQL): MMR + HA.
  26. Service Distribution Advantages: What Makes This a Superior Solution?
      • True horizontal scalability with no centralized controller
      • Services are always running, failover is nearly instant
      • Reduced complexity, fewer idle resources
      • No need for separate load balancers
      [Diagram: Failover vs. Distributed Services]
  27. Perfect For Site Resiliency: Service Distribution Works With Multiple Sites
      • Traditional HA pairs do not support cross-site resiliency
      • Service Distribution fails over across sites without DNS redirection
  28. Service Distribution in Action. Example: Distributed Load Balancing. 1) OSPF: HTTP proxies running Quagga send OSPF advertisements to the OSPF router(s).
  29. Service Distribution in Action. Example: Distributed Load Balancing. 1) OSPF: HTTP proxies running Quagga send OSPF advertisements to the OSPF router(s). 2) ECMP: per-flow load balancing at the router(s). 3) Load-balancing: HTTP proxies.
  30. Service Distribution in Action. Example: Distributed Load Balancing. 1) OSPF: HTTP proxies running Quagga send OSPF advertisements to the OSPF router(s). 2) ECMP: per-flow load balancing at the router(s). 3) Load-balancing: HTTP proxies. 4) Unlimited # of back-end servers.
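Step 2 above depends on the router keeping each flow on one proxy while spreading different flows across all proxies that advertise the anycast address. A minimal Python sketch of that per-flow idea follows. It assumes a SHA-256 hash over the flow 5-tuple purely for illustration; real routers use their own vendor-specific hash functions. `pick_next_hop` is a hypothetical name, the client address 192.0.2.10 is a made-up documentation address, and the proxy and anycast addresses are the ones from the slide 24 configuration.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Map a flow 5-tuple to one of the equal-cost next hops.

    The same flow always hashes to the same next hop, so a TCP
    connection stays on one proxy for as long as the hop list is stable.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# Proxies advertising the anycast address 10.1.255.1 (from slide 24).
proxies = ["10.1.1.1", "10.1.1.2"]

# Flow 5-tuple: (source IP, source port, destination IP, destination port, protocol)
flow = ("192.0.2.10", 40001, "10.1.255.1", 8774, "tcp")
first = pick_next_hop(flow, proxies)
again = pick_next_hop(flow, proxies)

# Many flows from the same client, differing only in source port.
used = {
    pick_next_hop(("192.0.2.10", port, "10.1.255.1", 8774, "tcp"), proxies)
    for port in range(40000, 41000)
}
print(first, again, sorted(used))
```

Plain modulo hashing like this remaps many existing flows whenever a proxy is added or withdrawn, which is one reason the proxies are kept stateless.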
  31. Failure Resiliency. [Diagram: client flows 1 to 4 pass through five load balancer/proxies to ten servers; 10% server load each]
  32. Failure Resiliency. [Diagram: same layout with a failure (X) marked at the load balancer/proxy tier; 10% server load each]
  33. Failure Resiliency. [Diagram: same layout with a failure (X) marked at the server tier; increased server load on the remaining servers]
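The "10% load each" and "increased server load" labels in the failure diagrams can be made concrete. This is a rough Python sketch, assuming ten identical servers that share traffic evenly and that the survivors absorb a failed server's share equally; `per_server_load` is a hypothetical helper, not part of OCS.

```python
def per_server_load(total_servers: int, failed: int = 0) -> float:
    """Share of total traffic carried by each healthy server."""
    healthy = total_servers - failed
    if failed < 0 or healthy <= 0:
        raise ValueError("need at least one healthy server")
    return 1.0 / healthy


before = per_server_load(total_servers=10)           # all ten servers healthy
after = per_server_load(total_servers=10, failed=1)  # one server marked X

print(f"before failure: {before:.1%} per server")
print(f"after failure:  {after:.1%} per server")
```

Under these assumptions a single server failure raises the load on each survivor by about one percentage point, with no failover event at all.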
  34. OCS NAT Service. Example: Scale-out Network Address Translation. [Diagram: multiple ISP providers, BGP, NAT Service Distribution, VMs]
  35. Brokerless Messaging With ZeroMQ: Avoiding RabbitMQ’s Single Point Of Failure. [Diagram: RabbitMQ (brokered): Nova-Compute, Nova-Scheduler, and Nova-API all connect through the RabbitMQ broker, a single point of failure]
  36. Brokerless Messaging With ZeroMQ: Avoiding RabbitMQ’s Single Point Of Failure. [Diagram: RabbitMQ (brokered), where Nova-Compute, Nova-Scheduler, and Nova-API all connect through the RabbitMQ broker, a single point of failure, vs. ZeroMQ (peer to peer), where Nova-Compute, Nova-Scheduler, and Nova-API connect to each other directly]
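The difference between the two messaging diagrams can be shown with a toy reachability model. This Python sketch is an illustration only: it uses no RabbitMQ or ZeroMQ API, assumes a single unclustered broker, and models one instance of each Nova service. All function names are hypothetical.

```python
from itertools import permutations

SERVICES = ["nova-api", "nova-scheduler", "nova-compute"]


def brokered_paths(services, broker="broker"):
    """Brokered topology: every message goes sender -> broker -> receiver."""
    return {(a, b): [a, broker, b] for a, b in permutations(services, 2)}


def peer_to_peer_paths(services):
    """Brokerless topology: every message goes sender -> receiver directly."""
    return {(a, b): [a, b] for a, b in permutations(services, 2)}


def surviving_pairs(paths, failed_node):
    """Sender/receiver pairs whose path avoids the failed node."""
    return {pair for pair, hops in paths.items() if failed_node not in hops}


# Losing the broker cuts every path in the brokered topology.
brokered_after = surviving_pairs(brokered_paths(SERVICES), "broker")

# Losing one peer only cuts the paths that start or end at that peer.
p2p_after = surviving_pairs(peer_to_peer_paths(SERVICES), "nova-scheduler")

print(len(brokered_after), "brokered paths survive a broker failure")
print(len(p2p_after), "peer-to-peer paths survive a scheduler failure")
```

In this model the broker failure leaves no working paths, while the peer failure leaves Nova-API and Nova-Compute still able to reach each other.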
  37. What did we learn today?
      1. HA-mmers are for nails
      2. Scale-out rules for redundancy
      3. Design-for-failure is a mentality, not a pair
      4. Resiliency over redundancy
  38. Q&A. Randy Bias (@randybias), CTO, Cloudscaling; Dan Sneddon (@dxs), Sr. Engineer, Cloudscaling. OCS 2.0: Public Cloud Benefits | Private Cloud Control | Open Cloud Economics