The document discusses the dangers of complexity, interconnectedness, and drift in complex adaptive systems like cloud computing and stock markets. It notes that the 2010 "Flash Crash" was caused by an automated trading algorithm exacerbating volatility in an interconnected market. It advocates embracing complexity but with systems thinking, designing for resilience over stability, focusing on relationships, and releasing failures to avoid drift over time.
4. "[A] large fundamental trader chose to execute [a
$4.1B] sell program via an automated execution
algorithm ('Sell Algorithm')."
- Findings Regarding The Market Events of May 6, 2010
http://www.sec.gov/news/studies/2010/marketevents-report.pdf
5. "...the Sell Algorithm…executed the sell program extremely
rapidly in just 20 minutes."
The market responded, and trading volume increased…
"... [The Sell Algorithm] responded to the increased volume by
increasing the rate at which it was feeding the orders into the
market."
- Findings Regarding The Market Events of May 6, 2010
http://www.sec.gov/news/studies/2010/marketevents-report.pdf
16. "May 6 was…an important reminder of the
interconnectedness of our derivatives and
securities markets, particularly with respect to
index products."
- Findings Regarding The Market Events of May 6, 2010
http://www.sec.gov/news/studies/2010/marketevents-report.pdf
54. “The goal of producing a maximum
sustained yield may result in a more stable
system of reduced resilience.”
“Command and Control and the Pathology of Natural Resource Management”,
C. S. Holling and Gary K. Meffe
http://landscape.forest.wisc.edu/courses/Landscape565spr01/Holling_Meffe1996.pdf
55. “[T]he resilience-stability tradeoff is more
than just a simple transformation in
distribution. …[A]gents adapt to a
prolonged period of stability in such a
manner that the system cannot ‘withstand
even modest adverse shocks.’”
“The Euro and the Resilience-Stability Tradeoff”, Ashwin Parameswaran
http://www.macroresilience.com/2011/11/14/the-euro-and-the-resilience-stability-tradeoff/
67. App-centric vs. System-centric
App-centric:
• Monitor each app separately
• Dig for root cause
• Attempt to stabilize
System-centric:
• Monitor system at many levels, and use as feedback
• Search for system weakness
• Focus on resilience
69. Don’t let this happen to you!
Cloud, complexity and drift
Editor's Notes
Instead, let me take you back to May 6, 2010…
A typical morning on the stock market. No major breaking news, and all boards trading normally. A slight drop in the indexes, but nothing special. But something went very wrong that afternoon.
According to a joint report written later about that day by the Securities and Exchange Commission and the Commodity Futures Trading Commission, a single trading algorithm was used to execute $4.1 billion in trades. The algorithm metered out individual trades over time, attempting to represent no more than 10% of trading volume at any given time. To achieve this, the algorithm adjusted each trade’s volume based on overall market volume in the previous minute.
Unfortunately, for one reason or another…a simple bug, perhaps, or human error…trades that were meant to be metered out over days or weeks were actually executed within 20 minutes. As might be expected, this resulted in some pretty big trade executions in a very short period of time, especially for the relatively small electronic exchange on which they were executed. The market responded, and other automatic trading algorithms sensed a “sell” signal and started executing sell trades in response. This increased market volume. The original sell algorithm then responded to the increased trade volume, and increased its own trading volume.
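To make that feedback loop concrete, here is a toy sketch in Python. It is not the actual Sell Algorithm, and every number in it (baseline volume, how strongly other traders react, the function and parameter names) is a made-up assumption for illustration. It only shows the shape of the loop: the algorithm sizes each order as a fraction of the previous minute's volume, and the rest of the market adds volume in response to that order.

```python
# Toy model only: all names and numbers are hypothetical, not from the SEC/CFTC report.

def simulate_sell_feedback(total_to_sell, participation=0.10,
                           baseline_volume=1000.0, reaction=0.0, minutes=20):
    """Return the sell algorithm's order size for each simulated minute.

    participation:   fraction of the previous minute's volume the algorithm targets
    baseline_volume: volume traded by everyone else in a calm minute
    reaction:        extra volume other traders add per unit the algorithm sold
                     in the previous minute (the feedback path)
    """
    remaining = total_to_sell
    prev_volume = baseline_volume
    orders = []
    for _ in range(minutes):
        if remaining <= 0:
            break
        # Size the order from last minute's volume, ignoring price and time.
        order = min(remaining, participation * prev_volume)
        remaining -= order
        orders.append(order)
        # Other algorithms react to the selling, which raises total volume,
        # which the sell algorithm will read as room to sell even more.
        prev_volume = baseline_volume + order + reaction * order
    return orders


calm = simulate_sell_feedback(1e9, reaction=0.0)      # nobody reacts
runaway = simulate_sell_feedback(1e9, reaction=15.0)  # strong reaction
print(f"calm, last minute:    {calm[-1]:.1f}")
print(f"runaway, last minute: {runaway[-1]:.1f}")
```

With no reaction from other traders the order size settles down almost immediately. Once the reaction is strong enough that each order generates more than its own size in new "permitted" volume, the order size grows every minute, even though no single rule in the loop looks dangerous on its own.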
The result was a drop of about 4% in the Dow in about 20 minutes, for a total of 8.1% down from the opening bell. To put that into context, CNN noted that it was the biggest intraday point drop in Dow Jones history at that time.
So, to fix this, we need the root cause, right? I mean, it had to be that original trading company’s bug or bad logic or whatever. Case closed.
Well, it was that errant trading algorithm, right? I mean it initiated the trades, and feedback from that market was then used to initiate further trades.
Then, for whatever reason, other trading algorithms quickly saw this larger than normal activity as a sell sign…
…so they in turn initiated trades on both the original exchange, and (in a few cases) on other exchange mechanisms.
Those trades were either frequent enough or large enough to get the attention of yet more trading algorithms.
A few of those algorithms probably triggered increased trading on the original exchange, further exacerbating the problem. Now, this is an extremely simplified view of the changing state of the exchange system that day, and one that appears much more serial than events really unfolded.
But at this point, how can we say that first algorithm was the “problem” to be “fixed”? In fact, the truth is that it’s the way the other algorithms reacted to the initial trades that made those trades a problem. In theory, any one of a number of large trade sources could have triggered the same series of events. Or a similar series of events. Or, perhaps, an even more devastating series of events. Another important truth is that many, many decisions were made in parallel, often affecting large numbers of trades made by entirely unrelated parties. And there were a huge number of parties. The actual trading graph probably looked more like…
…this. A huge web of players interacting over a variety of paths via a broad range of rules and protocols.
In systems like this, small actions can trigger huge consequences. And most of the time, it’s not the initial trigger that is the problem. In this case, the root cause is interesting, but it’s not the problem.
The Flash Crash post mortem itself noted that “May 6 was an important reminder of the interconnectedness of our derivatives and securities markets”.
“Interconnectedness”. I love that word. Isn’t that what we are all here to talk about this week? Isn’t that what the world of computing is working towards with such fervor and focus?
I mean, the Internet is all about interconnectedness…from the early days of The World Wide Web through human social networks to—increasingly—system to system, device-to-device, thing-to-thing connectedness.
APIs are also a good example of what drives interconnectedness in computing. Simple interfaces to powerful services are driving intense acceleration in the linking of software systems.
And cloud computing is an interesting use case for interconnectedness. Cloud is bringing together computing systems at a scale never before seen. We are defining perhaps the most interconnected knowledge system in the history of humanity. The cloud is starting to look like…
…this. A huge web of players interacting over a variety of paths via a broad range of rules and protocols. Oh, my. Computing is increasingly looking like the stock market. (How long until we see “certified capacity planners” cold calling us to “manage our portfolios”? Perhaps it’s already begun.) OK, so what’s going on here?
Both the cloud and the stock market are examples of a phenomenon that has received a tremendous amount of scientific interest in the past few decades: complex adaptive systems.
Wait. What? That word makes most people cringe like a Dell PR rep…
…when they hear the name Mads Christiansen. Don’t worry. In this context, it’s not nearly as obscene.
So, what the hell are complex adaptive systems?
You experience complex adaptive systems every day. In ecosystems…
…in societies,…
…and in economies. These are just a few examples. Complex adaptive systems are often big, mysterious systems that just seem to work (except when they don’t…more on that later). However, as large and complicated as they appear, complex adaptive systems are made up of just a few basic elements.
The first is a large number of individual agents…
Agents are defined as entities that operate individually, responding to feedback from the world around them and taking action as their own internal state and rules dictate. So, in computing, an example might be a component that responds to input and takes action as its own program dictates. There can be a variety of agent types in a system. In computing, for example, some may be mobile clients, services, management agents, servers, disk filers, and so on. When systems are adaptive, the agents can also learn from the feedback they receive, adjusting rules in an attempt to achieve whatever results are optimal for the agent. So, in ecology, organisms adapt to survive and reproduce. In markets, businesses adapt to profit and grow. The critical thing here is that these agents are not directly controlled by any outside force. Each agent makes its own “decisions” on what action to take in response to input.
The second element is the way these agents interact with each other.
Relationships between agents change over time, sometimes quite frequently, getting created or destroyed as the agents decide which relationships benefit them the most, or (occasionally) when the system itself forces a relationship to change. If you were to graph out the relationships between agents over the system as a whole, however, you’d probably discover some patterns emerging, depending on the nature of the system. Some systems organize into complex hub-and-spoke networks, like the one above. Others organize into fields in which there are few central players, but a large number of connections between individual agents on an ad hoc basis. There are hierarchical graphs, and more. Regardless, the agents find each other, and begin to interact in ways that begin to create a cohesive system of agents.
Finally, the agents work off of some (usually) simple set of rules.
Now, these rules could be defined differently for different types of agents, and even an individual agent may vary slightly from its peers. Nonetheless, these rules are typically defined to benefit the class of agent or the individual agent. Some rules are about how the agent should send signals to other agents. Other rules are about how the agent should evaluate and adjust its behavior. Still others are about how the agent itself should change in response to its environment. I’m simplifying here, but it’s critical to understand that there are no central controllers in complex adaptive systems. It’s all about how the agents interact.
All of this results in something spectacular.
With independent agents, dynamic interaction and rules for that interaction, you get a system that demonstrates emergent behavior.
An excellent example of this emergent behavior is a flock of birds. There are no leaders in most bird flocks. Rather, each bird is responding to a simple set of rules ingrained in its DNA. Those rules create a beautiful, fluid movement that almost behaves like a single entity. A system is born from individual agents, dynamic interaction and simple rules.
Another interesting thing about these systems is that they can typically be modeled. Maybe not perfectly, but certainly in ways that are helpful in evaluating the system itself.
Going back to our bird flock example, there is a phenomenal model that was introduced by computer scientist Craig Reynolds in 1986. Three elements are managed for any given bird: separation from other birds (not too close), cohesion (not too far), and alignment (try to head in the same direction as neighboring birds). The resulting model is amazing to watch in action. The flocking behavior seems so “real” that you can’t help but believe you’re seeing natural flocking. The financial industry, by the way, is famous for modeling large parts of the stock markets and the overall economy itself. Those models are actually getting pretty good, but they won’t create any magic way to predict future markets. Why?
Because the last trait of complex adaptive systems is that their sheer scale and decentralized nature make it almost impossible to predict the future state of the system.
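As an aside on the Reynolds flocking model mentioned above, the three rules are small enough to sketch in a few lines of Python. This is a minimal 2D sketch under my own assumptions, not Reynolds' original code: the `Bird` class, the `step` function, the neighbor radius and the rule weights are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Bird:
    x: float
    y: float
    vx: float
    vy: float


def step(flock, radius=10.0, min_dist=2.0, w_sep=0.05, w_coh=0.01, w_ali=0.05):
    """Advance every bird one tick using only what its nearby neighbors are doing."""
    updated = []
    for b in flock:
        near = [o for o in flock if o is not b
                and (o.x - b.x) ** 2 + (o.y - b.y) ** 2 <= radius ** 2]
        vx, vy = b.vx, b.vy
        if near:
            n = len(near)
            # Cohesion (not too far): steer toward the neighbors' center.
            vx += w_coh * (sum(o.x for o in near) / n - b.x)
            vy += w_coh * (sum(o.y for o in near) / n - b.y)
            # Alignment: nudge velocity toward the neighbors' average velocity.
            vx += w_ali * (sum(o.vx for o in near) / n - b.vx)
            vy += w_ali * (sum(o.vy for o in near) / n - b.vy)
            # Separation (not too close): push away from crowding neighbors.
            for o in near:
                if (o.x - b.x) ** 2 + (o.y - b.y) ** 2 < min_dist ** 2:
                    vx += w_sep * (b.x - o.x)
                    vy += w_sep * (b.y - o.y)
        # Every new state is computed from the old flock, so update order does not matter.
        updated.append(Bird(b.x + vx, b.y + vy, vx, vy))
    return updated


flock = [Bird(0.0, 0.0, 1.0, 0.0), Bird(5.0, 0.0, 0.0, 1.0)]
for _ in range(3):
    flock = step(flock)
```

Notice that nothing in `step` knows about "the flock" as a whole. Each bird only looks at its neighbors, and whatever flock-level behavior shows up comes from repeating those local rules.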
In late 2010 and early 2011, hundreds of birds were found dead in states ranging from California to Georgia. The cause? No one knows. In some cases, people witnessed birds divebombing semi-trucks. One witness said “it looked like they were committing suicide.” Could anyone have predicted this before it happened?
The Deepwater Horizon disaster is a classic example of failure caused by systems flaws that were unseen or ignored until the disaster happened. The decisions about safety and technologies leading up to the explosion were made by people who believed those decisions were smart and safe. Hell, they were running with those technologies for years in tests and the early life of the platform without problems. What reason did they have to predict a failure, except in retrospect?
The flash crash, of course, is a great example of the unpredictability of systems automation and computing when a large number of individual programs interact over a network against the same data. No matter how well you tighten each individual algorithm, the system as a whole can have hidden traps that are only discovered in extreme conditions.
All of this means that, if you are responsible for a critical component of your company’s IT infrastructure running in this complex systems environment, you look a little like this guy.
So…what can you do?
Well, the first thing you can do is embrace complexity. The typical response to this is…
“Hell no! Complexity is evil!”
So, how about if we get a little more specific. Embrace complex systems. Specifically embrace complex adaptive systems, and all that we have learned about how they work and how we can take advantage of their properties.
If you want a keyword to search for solutions that target applying complex systems science to everything from organizations to processes to computing, the term commonly used is “Systems Thinking”.
At this point, it would probably be good for me to get out of theory a little bit, and start talking about practical steps you can take to embrace complexity…er, complex systems.
First, please, please, please, take some time to learn about complex adaptive systems and systems thinking.
Here are three excellent works to start with. “Complexity”, on the left, is an amazing telling of the story of the Santa Fe Institute, an academic research institution that gathers the best and the brightest in a variety of disciplines to explore the effect of complex adaptive systems on their respective fields, and it gives a phenomenal overview of the science along the way. “Thinking in Systems” is one of the most respected introductions to Systems Thinking, and “Drift into Failure” makes clear the dangers and pitfalls that await us all when working with complex systems.
Second, pay attention to one of the most important tradeoffs in complex systems, and choose resilience.
A seminal paper about the pathology of direct management of natural ecosystems proved pretty conclusively that targeting a stable high yield will often result in reduced resilience.
This tradeoff between stability and resilience is critical to understand. If you work towards stability (an environment where any form of change is discouraged in favor of completely predictable outcomes), you’ll find your applications and services actually becoming more susceptible to conditions outside of anticipated norms. When that happens, “stable” systems tend to collapse entirely. Systems built for resilience, on the other hand, have failures all the time, but are built so that those failures have a minimum impact on the system as a whole. So the system remains resilient.
One of the best examples of design for resilience succeeding is the Amazon.com home page. Each of those features you see up there (featured offers, highlighted ads, personalized “what’s hot” sections) is a separate component working within a larger system. If one component fails, either other components fill the need, or the feature is just not displayed. The result is obvious…when was the last time you went to Amazon.com and it just wasn’t there?
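Here is a rough sketch of that "fill the need or just don't display it" idea in Python. To be clear, this is not how Amazon actually builds its home page. The `render_page` function and the component names are hypothetical, and a real system would isolate components with timeouts and separate services rather than a simple `try`/`except`.

```python
def render_page(components):
    """Render each component independently so one failure cannot take down the page.

    components: list of (name, render, fallback) tuples, where render and
    fallback are zero-argument callables and fallback may be None.
    """
    sections = []
    for name, render, fallback in components:
        try:
            sections.append(render())
        except Exception:
            if fallback is not None:
                # Another component fills the need.
                sections.append(fallback())
            # Otherwise the feature is simply not displayed.
    return sections


def personalized_whats_hot():
    # Stand-in for a call to a service that happens to be down.
    raise RuntimeError("recommendation service unavailable")


page = render_page([
    ("featured_offers", lambda: "Featured offers", None),
    ("whats_hot", personalized_whats_hot, lambda: "Generic bestsellers"),
    ("highlighted_ads", personalized_whats_hot, None),
])
print(page)  # ['Featured offers', 'Generic bestsellers']
```

The page that comes back is smaller than the ideal one, but it is still a page. That is the resilience side of the tradeoff: individual failures are expected and absorbed instead of being treated as impossible.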
Third, understand that, to build agents that thrive in complex systems, you need to focus much of your engineering not on the agent itself, but on how that agent relates to the world around it.
Remember the flash crash model I talked about earlier? What were the things that could be changed that would reduce the likelihood of a similar incident a few months or years later? Not the individual algorithms (hell, the financial services companies would never share those, anyways). Nope…
…it’s the relationships between those components (including many components not included in this model). The first thing the exchanges looked at (as well as the SEC and the Commodity Futures Trading Commission) was what mechanisms they could put in place to capture the “system” going haywire and stop a crisis before it starts.
In computing, believe it or not, there is already an excellent pattern that does exactly that and has proven itself in large-scale computing environments. It comes from a book called “Release It!”, another read I highly recommend. The Circuit Breaker pattern is all about intercepting traffic at an API or on a network, applying rules to that traffic, and triggering certain actions, like forcing a failure of the API call, when it sees something it doesn’t like.
I wish I had more time to tell you about Circuit Breakers, but luckily one of the pioneers of its use at high scale has shared most of what they did in implementing them. I highly recommend checking out this link and the rest of the Netflix techblog if you want to learn more.
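For a feel of what a Circuit Breaker does, here is a minimal sketch in Python. It is a common textbook form of the pattern, not the implementation from the book or from Netflix. The class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all my own assumptions for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without even trying it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before tripping
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: force a fast failure instead of hitting a sick dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down has elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip, or re-trip after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

You would wrap each call to a remote dependency in `breaker.call(...)` and treat `CircuitOpenError` as the signal to use a fallback. The design choice worth noticing is that the breaker protects the relationship between two components, which is exactly where the flash crash lesson says the engineering effort should go.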
Finally, one of the hardest things you’ll have to do as a developer operating in a complex adaptive system is to stay disciplined and aware enough of the system to avoid a drift into critical failures.
Drift is everywhere in complex systems. It is a gradual shift in the structure of a system that actually makes failure MORE likely, even if the goal was to make it less likely. The Deepwater Horizon explosion and subsequent oil spill is an example that Sidney Dekker calls out in his book, “Drift into Failure”, which I mentioned earlier. The BP employees interviewed after the accident often mentioned they were uncomfortable with specific decisions, given other factors they knew about, but felt it wasn’t worth the political cost to call them out. And besides, those decisions were often in place for months or years (on a variety of platforms) without negative consequence, so perhaps it wasn’t so risky after all…
Here’s how you avoid disaster in your production systems: beat the hell out of them…
The concept of “Chaos Monkeys” was made famous in part through the practices of companies like Amazon and Netflix. They actually continuously run test systems that stress the system from a usage perspective, perhaps degrade the performance of one component or another, or even outright kill components just to see what happens. The result is that developers KNOW there will be challenges to the components they are building, and they begin to architect them to be resilient…not stable.
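The core of the idea fits in a few lines. What follows is a toy sketch, not Netflix's actual Chaos Monkey tool: `Replica`, `serve` and `unleash_monkey` are hypothetical names, and "killing" here just flips a flag on an in-memory object instead of terminating a real instance.

```python
import random


class Replica:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def serve(replicas, request):
    """Try replicas in turn so a dead replica does not fail the request."""
    for replica in replicas:
        try:
            return replica.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("no replica available")


def unleash_monkey(replicas, rng):
    """Kill one randomly chosen live replica and return it (None if none are left)."""
    live = [r for r in replicas if r.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim


replicas = [Replica("a"), Replica("b"), Replica("c")]
rng = random.Random(42)  # seeded so a run can be replayed
for _ in range(2):
    unleash_monkey(replicas, rng)
    print(serve(replicas, "GET /"))  # still answered by a surviving replica
```

The interesting part is not the monkey, it is `serve`. Because developers know something will be killed, the retry-across-replicas logic gets written up front instead of after the first outage.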
Another great idea to keep in mind to avoid drift: remember that the component you are building is NOT the system, but an agent in a larger system.
This is really quite different from the “build the most stable application you can” mentality of the siloed world. In an app-centric world, the focus is on the component (the application). The result is all about stabilizing the design with assumptions that the world around the application can be controlled. In cloud, that won’t work. You HAVE to take time to instrument as much of the world around your app or service or whatever as you can. Use that data as feedback, both in component design and operation, as well as in tweaking the system as much as you can to help your stuff thrive. Not survive. Thrive.
One last thing. The hardest thing to remember when working with complex systems is that everything that touches the system is (generally) a part of that system. That means you are a part of the system, and can only have, at best, a limited understanding of how the system works as a whole. It’s as if you were standing in a vast forest of tall trees. Quick. How big is the forest? Embrace complexity…er, complex adaptive systems. Don’t let this happen to you…