OpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdf
Transcript

  • 1. Redundancy Doesn't Always Mean "HA" or "Cluster": A cautionary tale against using hammers to solve all redundancy and resiliency problems ... OpenStack Design Summit – Oct 2012. Randy Bias (@randybias), CTO, Cloudscaling; Dan Sneddon (@dxs), Sr. Engineer, Cloudscaling. CCA - NoDerivs 3.0 Unported License - Usage OK, no modifications, full attribution* (* All unlicensed or borrowed works retain their original licenses). Thursday, October 18, 2012
  • 2. Our Journey Today: 1. “HA” pairs are not the only type of redundancy 2. Alternative redundancy patterns for HA 3. Redundancy patterns in Open Cloud System* (* Cloudscaling’s OpenStack-powered cloud operating system (“distribution”))
  • 3. What Do We Mean By “HA”? We mean what most people mean ... Two servers or network devices that look like one
  • 4. “HA HA”? HA pairs come in a couple of flavors: Active / Passive
  • 5. “HA HA”? People like this flavor best, but it’s not always possible... Active / Active
  • 6. “HA HA HA HA HA”?? Many people wish they could get it more like this ... HA cluster aka ‘massive operational nightmare’
  • 7. Cluster<bleep>! Imagine this was 4 or 6 nodes in the cluster • 4 network tech. • 7 NICs / node • A million different ways to break
  • 8. “HA” Pairs Are One Type of Redundancy. Herein lies the problem ...
  • 9. The Problem With “HA”-mmers: There are many, but these two matter most ... • Catastrophic failures • No scale out
  • 10. HA Pairs Have Binary Failures: Either working or dead, nothing in-between
  • 11. What is Scale-out? Scale-up - Make boxes bigger (usually an HA pair). Scale-out - Make moar boxes. [Diagram: boxes A, B for scale-up; boxes A, B, C, D ... N for scale-out]
  • 12. Scaling out is a mindset. Scaling up is like treating your servers as pets (bowzer.company.com). Servers *are* cattle (web001.company.com).
  • 13. HA Pair Failures* - 100% down. Hardware rarely fails, operators fail, software fails.
      Who         Type        Year  Why      Duration
      Apple       Switch      2005  Bug      2 hrs
      Flexiscale  SAN         2007  Ops Err  24 hrs
      Vendio      NAS         2008  Ops Err  8 hrs
      UOL Brazil  SAN         2011  Bug      72 hrs
      Twitter     Datacenter  2012  Bug+Ops  2 hrs
      * This is a handful of examples as a baseline; I’m sure you can find many more
  • 14. “HA” Pairs Are an All-in Move: They better not fail ...
  • 15. Risk Reduction: Many small failure domains are usually better
  • 16. Big failure domains vs. small: Would you rather have the whole cloud down or just a small bit for a short period of time? Still a scale-up pattern ... wouldn’t you rather scale-out?
  • 17. Pair vs. Scale-out. HA pair: State Sync, No scale-out (100% loss). Scale-out: Load Balancing, Shared-nothing Architecture (20% loss).
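
The 100% versus 20% figures in the comparison above can be reproduced with a little arithmetic. The following is a minimal sketch, assuming equally loaded, independent failure domains; the function name `capacity_lost` and the choice of five shared-nothing nodes (which yields the 20% figure) are illustrative assumptions, not something stated on the slide.

```python
from fractions import Fraction


def capacity_lost(failure_domains: int, failed: int = 1) -> Fraction:
    """Fraction of capacity lost when `failed` of `failure_domains`
    equally loaded, independent failure domains go down.

    Hypothetical helper for illustration only.
    """
    if failure_domains <= 0 or not 0 <= failed <= failure_domains:
        raise ValueError("need 0 <= failed <= failure_domains and failure_domains > 0")
    return Fraction(failed, failure_domains)


# An HA pair with synchronized state behaves as ONE failure domain:
# when the pair fails, everything behind it is down.
ha_pair_loss = capacity_lost(failure_domains=1)

# Assumed scale-out layout: five shared-nothing nodes, one of which fails.
scale_out_loss = capacity_lost(failure_domains=5)

print(f"HA pair failure:   {float(ha_pair_loss):.0%} of capacity lost")
print(f"Scale-out failure: {float(scale_out_loss):.0%} of capacity lost")
```

The point of the sketch is only that the loss per failure shrinks as the number of independent failure domains grows.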
  • 19. What’s Usually an “HA” Pair in OpenStack? Everything ... Service Endpoints (APIs); Messaging System (RPC); Worker Threads (e.g. Scheduler, Networking); Database (MySQL)
  • 20. What needs to be an HA pair? Not much needs state synchronization. Service Endpoints (APIs); Messaging System (RPC); Worker Threads (e.g. Scheduler, Networking); Database (MySQL)
  • 21. Fault Tolerance Methodologies
  • 22. Fault Tolerance in OCS
  • 23. Service Distribution: High Availability Without Compromise. Resilient, Stateless, Scale-out
  • 24. Service Distribution Combines Standard Networking Technologies
      OSPF (/etc/quagga/ospfd.conf):
        router ospf
          ospf router-id 10.1.1.1
          network 10.1.255.1 area 0.0.0.0
      Anycast (/etc/quagga/zebra.conf):
        interface lo:2
          description Pound listening address
          ip address 10.1.255.1/32
      Load-Balancing Proxy (/etc/pound/pound.conf):
        ListenHTTP
          Address 10.1.255.1
          Port 8774
          xHTTP 1
          Service
            BackEnd
              Address 10.1.1.1
              Port 8774
            End
            BackEnd
              Address 10.1.1.2
              Port 8774
            End
          End
        End
  • 25. Resilient OpenStack: Horizontally Scalable, No Single Point Of Failure. Service Endpoints (APIs): Service Distribution; Messaging System (RPC): ZeroMQ; Worker Threads (e.g. Scheduler, Networking): Service Distribution; Database (MySQL): MMR + HA
  • 26. Service Distribution Advantages: What Makes This a Superior Solution? • True horizontal scalability with no centralized controller • Services are always running, failover is nearly instant • Reduced complexity, fewer idle resources • No need for separate load balancers [Diagram: Failover vs. Distributed Services]
  • 27. Perfect For Site Resiliency: Service Distribution Works With Multiple Sites • Traditional HA pairs do not support cross-site resiliency • Service Distribution fails over across sites without DNS redirections
  • 28. Service Distribution in Action. Example: Distributed Load Balancing. 1) OSPF [Diagram: each HTTP proxy runs Quagga and sends an OSPF advertisement to the OSPF router(s)]
  • 29. Service Distribution in Action. Example: Distributed Load Balancing. 1) OSPF 2) ECMP per-flow load balancing 3) Load-balancing [Diagram: the OSPF router(s) balance per flow across the HTTP proxies, each of which runs Quagga and sends an OSPF advertisement]
  • 30. Service Distribution in Action. Example: Distributed Load Balancing. 1) OSPF 2) ECMP per-flow load balancing 3) Load-balancing 4) Unlimited # of back-end servers [Diagram: the HTTP proxies from the previous slide, now fronting a row of back-end servers]
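
Step 2 above, ECMP per-flow load balancing, can be illustrated with a small sketch. It assumes a simple hash-modulo choice over the flow 5-tuple; real routers use vendor-specific hash functions, so `pick_next_hop` and the client addresses below are hypothetical and only show the idea that every packet of a flow reaches the same proxy. The proxy addresses 10.1.1.1 and 10.1.1.2 and port 8774 are taken from the configuration slide.

```python
import hashlib
from typing import Sequence, Tuple

# (src_ip, src_port, dst_ip, dst_port, protocol)
Flow = Tuple[str, int, str, int, str]


def pick_next_hop(flow: Flow, next_hops: Sequence[str]) -> str:
    """Choose one equal-cost next hop for a flow (illustrative only).

    The 5-tuple is hashed, so all packets of one flow map to the same
    proxy for as long as the set of next hops is unchanged.
    """
    if not next_hops:
        raise ValueError("no next hops are advertising the anycast address")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


proxies = ["10.1.1.1", "10.1.1.2"]                       # proxies advertising 10.1.255.1
flow = ("192.0.2.10", 40001, "10.1.255.1", 8774, "tcp")  # hypothetical client flow

first = pick_next_hop(flow, proxies)
second = pick_next_hop(flow, proxies)
print(first == second)  # the same flow always maps to the same proxy

# If a proxy stops advertising the route, the router simply chooses
# among the remaining next hops.
survivors = [p for p in proxies if p != first]
print(pick_next_hop(flow, survivors))
```

With a plain modulo choice, removing a next hop can also move flows that were not on the failed proxy; that is a property of this simplified sketch and not a claim about any particular router.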
  • 31. Failure Resiliency [Diagram: clients 1-4 reach five load balancer/proxies, which front ten servers at 10% server load each]
  • 32. Failure Resiliency [Diagram: same layout with one load balancer/proxy marked failed (X); the ten servers remain at 10% server load each]
  • 33. Failure Resiliency [Diagram: same layout with one server marked failed (X); the remaining servers carry increased server load]
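
The "10% load each" and "increased server load" labels in the failure diagrams follow from spreading the same total work over fewer servers. A minimal sketch, assuming the load is divided evenly and the total stays constant; `per_server_share` is a hypothetical helper, not part of any tool named in the deck.

```python
from fractions import Fraction


def per_server_share(total_servers: int, failed: int = 0) -> Fraction:
    """Share of the total load carried by each surviving server,
    assuming the load is divided evenly among survivors."""
    surviving = total_servers - failed
    if total_servers <= 0 or failed < 0 or surviving <= 0:
        raise ValueError("at least one server must survive")
    return Fraction(1, surviving)


before = per_server_share(total_servers=10)           # all ten servers healthy
after = per_server_share(total_servers=10, failed=1)  # one server marked X

print(f"before failure: {float(before):.1%} per server")
print(f"after failure:  {float(after):.1%} per server")
```

Under these assumptions a single server failure raises each survivor's share from one tenth to one ninth of the load, rather than taking the service down.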
  • 34. OCS NAT Service. Example: Scale-out Network Address Translation [Diagram: multiple ISP providers, BGP, NAT Service Distribution, VMs]
  • 35. Brokerless Messaging With ZeroMQ: Avoiding RabbitMQ’s Single Point Of Failure. RabbitMQ (Brokered) [Diagram: Nova-Compute, Nova-Scheduler and Nova-API all connect through the RabbitMQ Broker, the Single Point Of Failure]
  • 36. Brokerless Messaging With ZeroMQ: Avoiding RabbitMQ’s Single Point Of Failure. RabbitMQ (Brokered) vs. ZeroMQ (Peer To Peer) [Diagram: on the RabbitMQ side, Nova-Compute, Nova-Scheduler and Nova-API connect through the RabbitMQ Broker, the Single Point Of Failure; on the ZeroMQ side, the same three services connect to each other directly]
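
The brokered versus peer-to-peer contrast above can be made concrete with a toy reachability model. This is a sketch under simplifying assumptions: a single unclustered broker, services that can talk whenever both are alive, and hypothetical function names. It does not use the RabbitMQ or ZeroMQ APIs.

```python
from itertools import permutations
from typing import Iterable, Set, Tuple

SERVICES = ["nova-api", "nova-scheduler", "nova-compute"]
BROKER = "broker"  # stands in for a single, unclustered RabbitMQ broker


def _alive_pairs(services: Iterable[str], failed: Set[str]) -> Set[Tuple[str, str]]:
    """All ordered (sender, receiver) pairs among services that are still up."""
    alive = [s for s in services if s not in failed]
    return set(permutations(alive, 2))


def reachable_brokered(services: Iterable[str], failed: Set[str]) -> Set[Tuple[str, str]]:
    """Every message passes through the broker; if it is down, nothing flows."""
    if BROKER in failed:
        return set()
    return _alive_pairs(services, failed)


def reachable_peer_to_peer(services: Iterable[str], failed: Set[str]) -> Set[Tuple[str, str]]:
    """Services connect directly; a failure only removes pairs that
    involve the failed service."""
    return _alive_pairs(services, failed)


print(len(reachable_brokered(SERVICES, failed={BROKER})))
print(len(reachable_peer_to_peer(SERVICES, failed={"nova-scheduler"})))
```

In this model a broker failure leaves no service pairs able to communicate, while the loss of one peer leaves the remaining peers connected to each other.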
  • 37. What did we learn today? 1. HA-mmers are for nails 2. Scale-out rules for redundancy 3. Design-for-failure is a mentality, not a pair 4. Resiliency over redundancy
  • 38. Q&A. Randy Bias (@randybias), CTO, Cloudscaling; Dan Sneddon (@dxs), Sr. Engineer, Cloudscaling. OCS 2.0: Public Cloud Benefits | Private Cloud Control | Open Cloud Economics