1) "HA" pairs are commonly used but are not the only way to achieve redundancy. They have limitations around catastrophic failures and lack of scale out.
2) Alternative patterns like distributed load balancing and brokerless messaging can provide redundancy without single points of failure and allow for scale out.
3) Service distribution is presented as a superior approach that combines standard networking technologies to provide resilient, stateless and scale out services for OpenStack.
OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"
1. Redundancy Doesn't Always Mean "HA" or "Cluster"
A cautionary tale against using hammers to solve all redundancy and resiliency problems ...
OpenStack Design Summit – Oct 2012
Randy Bias, CTO, Cloudscaling (@randybias)
Dan Sneddon, Sr. Engineer, Cloudscaling (@dxs)
CCA - NoDerivs 3.0 Unported License - Usage OK, no modifications, full attribution*
* All unlicensed or borrowed works retain their original licenses
Thursday, October 18, 12
2. Our Journey Today
1. “HA” pairs are not the only type of redundancy
2. Alternative redundancy patterns for HA
3. Redundancy patterns in Open Cloud System*
* Cloudscaling’s OpenStack-powered cloud operating system (“distribution”)
3. What Do We Mean By “HA”?
We mean what most people mean ...
Two servers or network devices that look like one
4. “HA HA”?
HA pairs come in a couple flavors
Active / Passive
5. “HA HA”?
People like this flavor best, but it’s not always possible...
Active / Active
6. “HA HA HA HA HA”??
Many people wish they could get it more like this ...
HA cluster aka ‘massive operational nightmare’
7. Cluster<bleep>!
Imagine this was 4 or 6 nodes in the cluster
• 4 network tech.
• 7 NICs / node
• A million different ways to break
8. “HA” Pairs Are One Type of Redundancy
Herein lies the problem ...
9. The Problem With “HA”-mmers
There are many, but these two matter most ...
• Catastrophic failures
• No scale out
10. HA Pairs Have Binary Failures
Either working or dead, nothing in-between
11. What is Scale-out?
Scale-up - Make boxes bigger (usually an HA pair) [diagram: A B]
Scale-out - Make moar boxes [diagram: A B C D ... N]
12. Scaling out is a mindset
Scaling up is like treating your servers as pets
Pet: bowzer.company.com | Cattle: web001.company.com
Servers *are* cattle
13. HA Pair Failures* - 100% down
Hardware rarely fails, operators fail, software fails
Who         Type        Year  Why      Duration
Apple       Switch      2005  Bug      2 hrs
Flexiscale  SAN         2007  Ops Err  24 hrs
Vendio      NAS         2008  Ops Err  8 hrs
UOL Brazil  SAN         2011  Bug      72 hrs
Twitter     Datacenter  2012  Bug+Ops  2 hrs
* This is a handful of examples as a baseline; I’m sure you can find many more
14. “HA” Pairs Are an All-in Move
They better not fail ...
15. Risk Reduction
Many small failure domains is usually better
16. Big failure domains vs. small
Would you rather have the whole cloud down or just a small bit for a short period of time?
Still a scale-up pattern ... wouldn’t you rather scale-out?
17. Pair vs. Scale-out Load Balancing
HA pair: state sync, no scale-out (100% loss)
Scale-out: shared-nothing architecture (20% loss)
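The two loss figures on this slide are simple arithmetic over failure domains. A minimal sketch, assuming equal-capacity nodes and assuming the 20% figure corresponds to one node out of five (the slide does not state the node count; the function name is illustrative):

```python
from fractions import Fraction


def capacity_lost(total_nodes: int, nodes_in_failed_domain: int) -> Fraction:
    """Fraction of total capacity lost when one failure domain goes down.

    Assumes every node contributes the same capacity.
    """
    if not 0 < nodes_in_failed_domain <= total_nodes:
        raise ValueError("failed domain must contain between 1 and total_nodes nodes")
    return Fraction(nodes_in_failed_domain, total_nodes)


# A state-synced HA pair fails as a single unit: both nodes are one failure domain.
ha_pair_loss = capacity_lost(total_nodes=2, nodes_in_failed_domain=2)

# Five shared-nothing nodes (assumed count): each node is its own failure domain.
scale_out_loss = capacity_lost(total_nodes=5, nodes_in_failed_domain=1)

print(f"HA pair:   {float(ha_pair_loss):.0%} loss")
print(f"Scale-out: {float(scale_out_loss):.0%} loss")
```

Adding nodes to the shared-nothing pool shrinks the loss per failure further, which is the case for many small failure domains.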
19. What’s Usually an “HA” Pair in OpenStack?
Everything ...
• Service Endpoints (APIs)
• Messaging System (RPC)
• Worker Threads (e.g. Scheduler, Networking)
• Database (MySQL)
20. What needs to be an HA pair?
Not much needs state synchronization
• Service Endpoints (APIs)
• Messaging System (RPC)
• Worker Threads (e.g. Scheduler, Networking)
• Database (MySQL)
23. Service Distribution
High Availability Without Compromise
Resilient | Stateless | Scale-out
24. Service Distribution
Combines Standard Networking Technologies
OSPF (/etc/quagga/ospfd.conf):
    router ospf
     ospf router-id 10.1.1.1
     network 10.1.255.1 area 0.0.0.0

Anycast (/etc/quagga/zebra.conf):
    interface lo:2
     description Pound listening address
     ip address 10.1.255.1/32

Load-Balancing Proxy (/etc/pound/pound.conf):
    ListenHTTP
        Address 10.1.255.1
        Port 8774
        xHTTP 1
        Service
            BackEnd
                Address 10.1.1.1
                Port 8774
            End
            BackEnd
                Address 10.1.1.2
                Port 8774
            End
        End
    End
25. Resilient OpenStack
Horizontally Scalable, No Single Point Of Failure
• Service Endpoints (APIs): Service Distribution
• Messaging System (RPC): ZeroMQ
• Worker Threads (e.g. Scheduler, Networking): Service Distribution
• Database (MySQL): MMR + HA
26. Service Distribution Advantages
What Makes This a Superior Solution?
• True horizontal scalability with no centralized controller
• Services are always running, failover is nearly instant
• Reduced complexity, fewer idle resources
• No need for separate load balancers
[Diagram: Failover vs. Distributed Services, each shown across a row of servers]
27. Perfect For Site Resiliency
Service Distribution Works With Multiple Sites
• Traditional HA pairs do not support cross-site resiliency
• Service Distribution fails over across sites without DNS redirection
28. Service Distribution in Action
Example: Distributed Load Balancing
1) OSPF: Quagga on each HTTP proxy sends OSPF advertisements to the OSPF router(s)
29. Service Distribution in Action
Example: Distributed Load Balancing
1) OSPF: Quagga on each HTTP proxy sends OSPF advertisements to the OSPF router(s)
2) ECMP per-flow load balancing at the OSPF router(s)
3) Load balancing at the HTTP proxies
30. Service Distribution in Action
Example: Distributed Load Balancing
1) OSPF: Quagga on each HTTP proxy sends OSPF advertisements to the OSPF router(s)
2) ECMP per-flow load balancing at the OSPF router(s)
3) Load balancing at the HTTP proxies
4) Unlimited # of back-end servers
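Per-flow load balancing (step 2) can be pictured as hashing a flow's 5-tuple to choose among equal-cost next hops, so every packet of one flow reaches the same proxy. This is a minimal sketch of the idea only: real routers use their own hash functions, and the `Flow` type, `pick_next_hop` function, proxy names, and client address below are hypothetical.

```python
import hashlib
from typing import NamedTuple, Sequence


class Flow(NamedTuple):
    """The classic 5-tuple that identifies a flow."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


def pick_next_hop(flow: Flow, next_hops: Sequence[str]) -> str:
    """Choose a next hop deterministically from the flow's 5-tuple."""
    if not next_hops:
        raise ValueError("no next hops available")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical proxies that all advertise the same anycast address.
proxies = ["proxy-a", "proxy-b", "proxy-c"]
flow = Flow("192.0.2.10", 40001, "10.1.255.1", 8774, "tcp")

first = pick_next_hop(flow, proxies)
again = pick_next_hop(flow, proxies)   # same flow, same proxy

# If the chosen proxy stops advertising, the flow hashes onto a survivor.
survivors = [p for p in proxies if p != first]
rerouted = pick_next_hop(flow, survivors)
print(first, again, rerouted)
```

Note that a plain modulo hash like this one can also remap flows that were not on the failed proxy when the set of next hops changes.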
31. Failure Resiliency
[Diagram: clients 1-4 connect through a tier of load balancer/proxies to ten servers; 10% server load each]
32. Failure Resiliency
[Diagram: one load balancer/proxy fails (X); the affected client traffic moves to a surviving load balancer/proxy; servers remain at 10% server load each]
33. Failure Resiliency
[Diagram: one of the ten servers fails (X); the remaining servers take on increased server load]
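The "increased server load" after a single failure is easy to quantify. A small sketch, assuming traffic is spread evenly over the survivors and total demand stays constant (the function name is illustrative); an active/active pair is included for contrast:

```python
from fractions import Fraction


def load_per_survivor(total_servers: int, failed_servers: int) -> Fraction:
    """Share of total traffic each surviving server carries, assuming an even spread."""
    survivors = total_servers - failed_servers
    if total_servers < 1 or failed_servers < 0 or survivors < 1:
        raise ValueError("need at least one surviving server")
    return Fraction(1, survivors)


# Ten-server pool from the slides: 10% each before the failure.
pool_before = load_per_survivor(10, 0)
pool_after = load_per_survivor(10, 1)

# Active/active pair for comparison: the survivor must carry everything.
pair_before = load_per_survivor(2, 0)
pair_after = load_per_survivor(2, 1)

print(f"pool: {float(pool_before):.1%} -> {float(pool_after):.1%} per server")
print(f"pair: {float(pair_before):.1%} -> {float(pair_after):.1%} per server")
```

With ten servers, each survivor picks up roughly one extra percentage point of traffic; in a pair, the survivor's load doubles.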
34. OCS NAT Service
Example: Scale-out Network Address Translation
[Diagram: multiple ISP providers (BGP) -> NAT Service Distribution -> VMs]
35. Brokerless Messaging With ZeroMQ
Avoiding RabbitMQ’s Single Point Of Failure
[Diagram: RabbitMQ (brokered): Nova-Compute, Nova-Scheduler, and Nova-API all connect through the RabbitMQ broker, a single point of failure]
36. Brokerless Messaging With ZeroMQ
Avoiding RabbitMQ’s Single Point Of Failure
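[Diagram: RabbitMQ (brokered) vs. ZeroMQ (peer to peer). Brokered: Nova-Compute, Nova-Scheduler, and Nova-API all connect through the RabbitMQ broker, a single point of failure. Peer to peer: Nova-Compute, Nova-Scheduler, and Nova-API connect to each other directly]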
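The difference between the two topologies can be illustrated with a toy graph model. This sketch uses neither the RabbitMQ nor the ZeroMQ API; it assumes a single unclustered broker, and the node names and helper functions are hypothetical. It counts how many pairs of services can still exchange messages after one node fails.

```python
from itertools import combinations

SERVICES = ["nova-api", "nova-scheduler", "nova-compute"]


def brokered_links(services, broker="broker"):
    """Brokered topology: every service talks only to the broker."""
    return {frozenset((service, broker)) for service in services}


def peer_to_peer_links(services):
    """Peer-to-peer topology: every service talks directly to every other."""
    return {frozenset(pair) for pair in combinations(services, 2)}


def reachable_pairs(services, links, failed):
    """Count service pairs that can still exchange messages after `failed` dies."""
    neighbours = {}
    for link in links:
        if failed in link:
            continue  # links touching the failed node are gone
        a, b = tuple(link)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    def connected(src, dst):
        # Depth-first search over the surviving links.
        seen, stack = {src}, [src]
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            for nxt in neighbours.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    alive = [s for s in services if s != failed]
    return sum(connected(a, b) for a, b in combinations(alive, 2))


broker_down = reachable_pairs(SERVICES, brokered_links(SERVICES), "broker")
peer_down = reachable_pairs(SERVICES, peer_to_peer_links(SERVICES), "nova-api")
print(f"brokered, broker fails:   {broker_down} of 3 pairs still connected")
print(f"peer to peer, peer fails: {peer_down} of 3 pairs still connected")
```

In this model, losing the broker disconnects every service pair, while losing any single peer only removes the pairs that included that peer.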
37. What did we learn today?
1. HA-mmers are for nails
2. Scale-out rules for redundancy
3. Design-for-failure is a mentality, not a pair
4. Resiliency over redundancy
38. Q&A
Randy Bias, CTO, Cloudscaling (@randybias)
Dan Sneddon, Sr. Engineer, Cloudscaling (@dxs)
OCS 2.0
Public Cloud Benefits | Private Cloud Control | Open Cloud Economics