Don't let this happen to you! Cloud, complexity and drift


The presentation discusses the dangers of complexity, interconnectedness, and drift in complex adaptive systems such as cloud computing and stock markets. It notes that in the 2010 "Flash Crash" an automated trading algorithm interacted with other algorithms across interconnected markets, amplifying volatility. It advocates embracing complex systems through systems thinking: designing for resilience over stability, focusing on relationships between components, and deliberately injecting failures in production to avoid drift over time.


Don’t let this happen to you!
Cloud, complexity and drift

James Urquhart
@jamesurquhart http://gigaom.com/cloud

"[A] large fundamental trader chose to execute [a
$4.1B] sell program via an automated execution
algorithm ('Sell Algorithm')."

- Findings Regarding The Market Events of May 6, 2010
http://www.sec.gov/news/studies/2010/marketevents-report.pdf

"...the Sell Algorithm…executed the sell program extremely
rapidly in just 20 minutes."

The market responded, and trading volume increased…

"... [The Sell Algorithm] responded to the increased volume by
increasing the rate at which it was feeding the orders into the
market."
- Findings Regarding The Market Events of May 6, 2010
http://www.sec.gov/news/studies/2010/marketevents-report.pdf

[Diagram, built up over several slides: Automatic Trading Algorithm 1 and Market A; then Automatic Trading Algorithms 2, 3 and 4; then Market B; then Automatic Trading Algorithms 5 and 6; finally further automatic trading algorithms, each labeled “Algorithm N”.]
“Root cause” is not an
answer—it’s a clue.

"May 6 was…an important reminder of the
interconnectedness of our derivatives and
securities markets, particularly with respect to
index products."

- Findings Regarding The Market Events of May 6, 2010
http://www.sec.gov/news/studies/2010/marketevents-report.pdf

The Internet is about
interconnectedness.

“The cloud” is about
interconnectedness.

Both cloud computing
and stock markets are
complex adaptive
systems.

[Diagram: an agent (state, rules, learning) exchanging information/actions with other agents]

A large number of individual agents

+ dynamic interactions between agents

A large number of individual agents

+ dynamic interactions between agents

+ rules for reacting to/interacting with
other agents

A system that:

demonstrates emergent behavior,

A system that:

demonstrates emergent behavior,

can be modeled,

Models

Time

Flocking model from
NetLogo 4.1.3
http://ccl.northwestern.edu/netlogo/

A system that:

demonstrates emergent behavior,

can be modeled,

but

makes precise prediction of future
behavior impossible.

“The goal of producing a maximum
sustained yield may result in a more stable
system of reduced resilience.”

“Command and Control and the Pathology of Natural Resource Management”,
C. S. Holling and Gary K. Meffe
http://landscape.forest.wisc.edu/courses/Landscape565spr01/Holling_Meffe1996.pdf

“[T]he resilience-stability tradeoff is more
than just a simple transformation in
distribution. …[A]gents adapt to a
prolonged period of stability in such a
manner that the system cannot ‘withstand
even modest adverse shocks.’”

“The Euro and the Resilience-Stability Tradeoff”, Ashwin Parameswaran
http://www.macroresilience.com/2011/11/14/the-euro-and-the-resilience-stability-tradeoff/

http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html

Release the
monkeys!!!
• Failure
• Degradation
• Usage

Avoiding Drift:
The agent is not the
system

App-centric
• Monitor each app separately
• Dig for root cause
• Attempt to stabilize

System-centric
• Monitor system at many levels, and use as feedback
• Search for system weakness
• Focus on resilience

Avoiding Drift:
YOU
are part of the system

Don’t let this happen to you!
Cloud, complexity and drift


1. Don’t let this happen to you! Cloud, complexity and drift James Urquhart @jamesurquhart http://gigaom.com/cloud

2. May 6, 2010

4. "[A] large fundamental trader chose to execute [a $4.1B] sell program via an automated execution algorithm ('Sell Algorithm')." - Findings Regarding The Market Events of May 6, 2010 http://www.sec.gov/news/studies/2010/marketevents-report.pdf

5. "...the Sell Algorithm…executed the sell program extremely rapidly in just 20 minutes." The market responded, and trading volume increased… "... [The Sell Algorithm] responded to the increased volume by increasing the rate at which it was feeding the orders into the market." - Findings Regarding The Market Events of May 6, 2010 http://www.sec.gov/news/studies/2010/marketevents-report.pdf

7. Quick! What was the root cause?

8. [Diagram: Automatic Trading Algorithm 1 and Market A]

9. [Diagram: Automatic Trading Algorithm 1 and Market A, plus Automatic Trading Algorithms 2, 3 and 4]

10. [Diagram: as slide 9, plus Market B]

11. [Diagram: as slide 10, plus Automatic Trading Algorithms 5 and 6]

12. [Diagram: same elements as slide 11]

13. [Diagram: as slide 12, plus further automatic trading algorithms, each labeled “Algorithm N”]

14.

15. “Root cause” is not an answer—it’s a clue.

16. "May 6 was…an important reminder of the interconnectedness of our derivatives and securities markets, particularly with respect to index products." - Findings Regarding The Market Events of May 6, 2010 http://www.sec.gov/news/studies/2010/marketevents-report.pdf

17. Interconnectedness

18. The Internet is about interconnectedness.

19. APIs are about interconnectedness.

20. “The cloud” is about interconnectedness.

21.

22. Both cloud computing and stock markets are complex adaptive systems.

23. COMPLEX?!?

24.

25. What are Complex Adaptive Systems?

26.

27.

28.

29. A large number of individual agents

30. [Diagram: an agent (state, rules, learning) exchanging information/actions with other agents]

31. A large number of individual agents + dynamic interactions between agents

32. Dynamic Interaction From NetLogo 4.1.3

33. A large number of individual agents + dynamic interactions between agents + rules for reacting to/interacting with other agents

34. Rules

35. Equals

36. A system that: demonstrates emergent behavior,

37. Emergent Behavior

38. A system that: demonstrates emergent behavior, can be modeled,

39. Models Time Flocking model from NetLogo 4.1.3 http://ccl.northwestern.edu/netlogo/

40. A system that: demonstrates emergent behavior, can be modeled, but makes precise prediction of future behavior impossible.

45. So…what can you do?

46. EMBRACE COMPLEXITY

47. Never!

48. EMBRACE COMPLEXITY SYSTEMS

49. Embrace Systems Thinking

50. Practical advice?

51. Do your homework!

52.

53. Design for resilience

54. “The goal of producing a maximum sustained yield may result in a more stable system of reduced resilience.” “Command and Control and the Pathology of Natural Resource Management”, C. S. Holling and Gary K. Meffe http://landscape.forest.wisc.edu/courses/Landscape565spr01/Holling_Meffe1996.pdf

55. “[T]he resilience-stability tradeoff is more than just a simple transformation in distribution. …[A]gents adapt to a prolonged period of stability in such a manner that the system cannot ‘withstand even modest adverse shocks.’” “The Euro and the Resilience-Stability Tradeoff”, Ashwin Parameswaran http://www.macroresilience.com/2011/11/14/the-euro-and-the-resilience-stability-tradeoff/

56.

57. Focus on relationships

58. [Diagram: the full web of Markets A and B and automatic trading algorithms from slide 13]

59. [Diagram: the full web of Markets A and B and automatic trading algorithms from slide 13]

60. Circuit Breaker

61. http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html

62. Avoid Drift

63. Drift

64. Avoiding Drift: Test production

65. Release the monkeys!!! • Failure • Degradation • Usage

66. Avoiding Drift: The agent is not the system

67. App-centric: • Monitor each app separately • Dig for root cause • Attempt to stabilize. System-centric: • Monitor system at many levels, and use as feedback • Search for system weakness • Focus on resilience

68. Avoiding Drift: YOU are part of the system

69. Don’t let this happen to you! Cloud, complexity and drift

Editor's Notes

Instead, let me take you back to May 6, 2010…
A typical morning on the stock market. No major breaking news, and all boards trading normally. A slight drop in the indexes, but nothing special. But something went very wrong that afternoon.
According to a joint report written later about that day by the Securities and Exchange Commission and the Commodity Futures Trading Commission, a single trader used an automated algorithm to execute a $4.1 billion sell program. The algorithm metered out individual trades over time, targeting roughly 9% of trading volume at any given time. To achieve this, the algorithm adjusted each trade’s volume based on overall market volume in the previous minute.
Unfortunately, for one reason or another…a simple bug, perhaps, or human error…trades that were meant to be metered out over days or weeks were actually executed within 20 minutes. As might be expected, this resulted in some pretty big trade executions in a very short period of time, especially for the relatively small electronic exchange on which they were executed. The market responded, and other automatic trading algorithms sensed a “sell” signal, and started executing sell trades in response. This increased market volume. The original sell algorithm then responded to the increased trade volume, and increased its own trading volume.
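The feedback loop described here can be sketched as a toy model. This is only an illustration under stated assumptions, not the actual Sell Algorithm: the function name `simulate`, the `participation` share, and the `reaction` factor (how much extra volume other traders add per unit sold) are all hypothetical.

```python
def simulate(steps=10, base_volume=1000.0, participation=0.09, reaction=1.5):
    """Toy feedback loop: a sell algorithm sizes each order as a fixed share
    of the previous interval's volume, and other traders react to its selling.

    All parameters are illustrative assumptions, not figures from the report.
    """
    volume = base_volume
    sells = []
    for _ in range(steps):
        # The sell algorithm looks only at the previous interval's volume.
        sell = participation * volume
        sells.append(sell)
        # Hypothetical reaction: every unit sold draws `reaction` extra
        # units of trading from other algorithms in the next interval.
        volume = base_volume + sell * (1.0 + reaction)
    return sells


calm = simulate(reaction=1.5)      # loop gain 0.09 * 2.5 < 1: orders level off
runaway = simulate(reaction=15.0)  # loop gain 0.09 * 16 > 1: orders keep growing
```

In this sketch the sell algorithm is identical in both runs. Only the way the other participants react differs, which is the point the notes go on to make about root cause.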
The result was a drop of about 4% in the Dow in about 20 minutes, for a total of 8.1% down from the opening bell. To put that into context, CNN noted that it was the biggest intraday point drop in Dow Jones history at that time.
So, to fix this, we need the root cause, right? I mean, it had to be that original trading company’s bug or bad logic or whatever. Case closed.
Well, it was that errant trading algorithm, right? I mean it initiated the trades, and feedback from that market was then used to initiate further trades.
Then, for whatever reason, other trading algorithms quickly saw this larger than normal activity as a sell sign…
…so they in turn initiated trades on both the original exchange, and (in a few cases) on other exchange mechanisms.
Those trades were either frequent enough or large enough to get the attention of yet more trading algorithms.
A few of those algorithms probably triggered increased trading on the original exchange, further exacerbating the problem. Now, this is an extremely simplified view of the changing state of the exchange system that day, and one that appears much more serial than events really unfolded.
But at this point, how can we say that first algorithm was the “problem” to be “fixed”? In fact, the truth is that it’s the way the other algorithms reacted to the initial trades that made those trades a problem. In theory, any one of a number of large trade sources could have triggered the same series of events. Or a similar series of events. Or, perhaps, an even more devastating series of events. Another important truth is that many, many decisions were made in parallel, often affecting large numbers of trades made by entirely unrelated parties. And there were a huge number of parties. The actual trading graph probably looked more like…
…this. A huge web of players interacting over a variety of paths via a broad range of rules and protocols.
In systems like this, small actions can trigger huge consequences. And most of the time, it’s not the initial trigger that is the problem. In this case, the root cause is interesting, but it’s not the problem.
The Flash Crash post mortem itself noted that “May 6 was an important reminder of the interconnectedness of our derivatives and securities markets”.
“Interconnectedness”. I love that word. Isn’t that what we are all here to talk about this week? Isn’t that what the world of computing is working towards with such fervor and focus?
I mean, the Internet is all about interconnectedness…from the early days of The World Wide Web through human social networks to—increasingly—system to system, device-to-device, thing-to-thing connectedness.
APIs are also a good example of what drives interconnectedness in computing. Simple interfaces to powerful services are driving intense acceleration in the linking of software systems.
And cloud computing is an interesting use case for interconnectedness. Cloud is bringing together computing systems at a scale never before seen. WE are defining perhaps the most interconnected knowledge system in the history of humanity. The cloud is starting to look like…
…this. A huge web of players interacting over a variety of paths via a broad range of rules and protocols. Oh, my. Computing is increasingly looking like the stock market. (How long until we see “certified capacity planners” cold calling us to “manage our portfolios”? Perhaps it’s already begun.) OK, so what’s going on here?
Both the cloud and the stock market are examples of a phenomenon that has received a tremendous amount of scientific interest in the past few decades: complex adaptive systems.
Wait. What? That word makes most people cringe like a Dell PR rep…
…when they hear the name Mads Christiansen. Don’t worry. In this context, it’s not nearly as obscene.
So, what the hell are complex adaptive systems?
You experience complex adaptive systems every day. In ecosystems…
…in societies,…
…and in economies. These are just a few examples. Complex adaptive systems are often big, mysterious systems that just seem to work (except when they don’t…more on that later). However, for as large and complicated as they appear, complex adaptive systems are made up of just a few basic elements.
The first is a large number of individual agents…
Agents are defined as entities that operate individually, responding to feedback from the world around them and taking action as their own internal state and rules dictate. So, in computing, an example might be a component that responds to input and takes action as its own program dictates. There can be a variety of agent types in a system; in computing, for example, some may be mobile clients, services, management agents, servers, disk filers, and so on. When systems are adaptive, the agents can also learn from the feedback they receive, adjusting rules in an attempt to achieve whatever results are optimal for the agent. So, in ecology, organisms adapt to survive and reproduce. In markets, businesses adapt to profit and grow. The critical thing here is that these agents are not directly controlled by any outside force. Each agent makes its own “decisions” on action to take in response to input.
The second element is the way these agents interact with each other.
Relationships between agents change over time, sometimes quite frequently, getting created or destroyed as the agents decide which relationships benefit them the most, or, occasionally, when the system itself forces a relationship to change. If you were to graph out the relationships between agents over the system as a whole, however, you’d probably discover some patterns emerging, depending on the nature of the system. Some systems organize into complex hub-and-spoke networks, like the one above. Others organize into fields in which there are few central players, but a large number of connections between individual agents on an ad hoc basis. There are hierarchical graphs, and more. Regardless, the agents find each other, and begin to interact in ways that begin to create a cohesive system of agents.
Finally, the agents work off of some (usually) simple set of rules.
Now, these rules could be defined differently for different types of agents, and even individual agents may see slight variances from their peers. Nonetheless, these rules are typically defined to benefit the class of agent or the individual agent. Some rules are about how the agent should send signals to other agents. Other rules are about how the agent should evaluate and adjust its behavior. Still others are about how the agent itself should change in response to its environment. I’m simplifying here, but it’s critical to understand that there are no central controllers in complex adaptive systems. It’s all about how the agents interact.
All of this results in something spectacular.
With independent agents, dynamic interaction and rules for that interaction, you get a system that demonstrates emergent behavior.
An excellent example of this emergent behavior is a flock of birds. There are no leaders in most bird flocks. Rather each bird is responding to a simple set of rules ingrained in its DNA. Those rules create a beautiful, fluid movement that almost behaves like a single entity. A system is born from individual agents, dynamic interaction and simple rules.
Another interesting thing about these systems is that they can typically be modeled. Maybe not perfectly, but certainly in ways that are helpful in evaluating the system itself.
Going back to our bird flock example, there is a phenomenal example model that was introduced by computer scientist Craig Reynolds in 1986. Three elements are managed for any given bird: separation from other birds (not too close), cohesion (not too far), and alignment (try to head in the same direction as neighboring birds). The resulting model is amazing to watch in action. A flocking behavior that seems so “real”, that you can’t help but believe you’re seeing a natural flocking behavior. The financial industry, by the way, is famous for modeling large parts of the stock markets and the overall economy itself. Those models are actually getting pretty good, but they won’t create any magic way to predict future markets. Why?
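The three flocking rules can be sketched in a few lines. This is a minimal illustration in the spirit of the Reynolds model, not the NetLogo code shown on the slide; the `Boid` class, the `step` function, and every weight and radius below are assumed values chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Boid:
    x: float
    y: float
    vx: float
    vy: float


def step(boids, radius=10.0, min_dist=2.0,
         w_sep=0.05, w_coh=0.01, w_ali=0.05, max_speed=2.0):
    """Return the flock one tick later. Each boid reacts only to boids
    within `radius`; there is no leader and no central controller."""
    moved = []
    for b in boids:
        near = [o for o in boids
                if o is not b and math.hypot(o.x - b.x, o.y - b.y) < radius]
        vx, vy = b.vx, b.vy
        if near:
            n = len(near)
            # Cohesion (not too far): steer toward the neighbors' center.
            vx += w_coh * (sum(o.x for o in near) / n - b.x)
            vy += w_coh * (sum(o.y for o in near) / n - b.y)
            # Alignment: steer toward the neighbors' average velocity.
            vx += w_ali * (sum(o.vx for o in near) / n - b.vx)
            vy += w_ali * (sum(o.vy for o in near) / n - b.vy)
            # Separation (not too close): push away from crowding neighbors.
            for o in near:
                if math.hypot(o.x - b.x, o.y - b.y) < min_dist:
                    vx += w_sep * (b.x - o.x)
                    vy += w_sep * (b.y - o.y)
        # Cap the speed so the toy model stays bounded.
        speed = math.hypot(vx, vy)
        if speed > max_speed:
            vx, vy = vx / speed * max_speed, vy / speed * max_speed
        moved.append(Boid(b.x + vx, b.y + vy, vx, vy))
    return moved
```

Calling `step` repeatedly on a list of randomly placed boids is enough to watch headings converge. Nothing in the code describes the flock as a whole; the flock-level behavior comes only from the per-boid rules.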
Because the last trait of complex adaptive systems is that their sheer scale and decentralized nature make it almost impossible to predict the future state of the system.
In late 2010 and early 2011, hundreds of birds were found dead in states ranging from California to Georgia. The cause? No one knows. In some cases, people witnessed birds divebombing semi-trucks. One witness said “it looked like they were committing suicide.” Could anyone have predicted this before it happened?
The Deepwater Horizon disaster is a classic example of failure caused by systems flaws that were unseen or ignored until the disaster happened. The decisions about safety and technologies leading up to the explosion were made by people who believed those decisions were smart and safe. Hell, they were running with those technologies for years in tests and the early life of the platform without problems. What reason did they have to predict a failure, except in retrospect?
The flash crash, of course, is a great example of the unpredictability of systems automation and computing when a large number of individual programs interact over a network against the same data. No matter how well you tighten each individual algorithm, the system as a whole can have hidden traps that are only discovered in extreme conditions.
All of this means that, if you are responsible for a critical component of your company’s IT infrastructure running in this complex systems environment, you look a little like this guy.
So…what can you do?
Well, the first thing you can do is embrace complexity. The typical response to this is…
“Hell no! Complexity is evil!”
So, how about if we get a little more specific. Embrace complex systems. Specifically embrace complex adaptive systems, and all that we have learned about how they work and how we can take advantage of their properties.
If you want a keyword to search for solutions that target applying complex systems science to everything from organizations to processes to computing, the term commonly used is “Systems Thinking”.
At this point, it would probably be good for me to get out of theory a little bit, and start talking about practical steps you can take to embrace complexity…er, complex systems.
First, please, please, please, take some time to learn about complex adaptive systems and systems thinking.
Here are three excellent works to start with. “Complexity”, on the left, is an amazing telling of the story of the Santa Fe Institute, an academic research institution that gathers the best and the brightest in a variety of disciplines to explore the effect of complex adaptive systems on their respective fields, and it gives a phenomenal overview of the science along the way. “Thinking in Systems” is one of the most respected introductions to Systems Thinking, and “Drift into Failure” makes clear the dangers and pitfalls that await us all when working with complex systems.
Second, pay attention to one of the most important tradeoffs in complex systems, and choose resilience.
A seminal paper about the pathology of direct management of natural ecosystems proved pretty conclusively that targeting a stable high yield will often result in reduced resilience.
This tradeoff between stability and resilience is critical to understand. If you work towards stability (an environment where any form of change is discouraged in favor of completely predictable outcomes), you’ll find your applications and services actually becoming more susceptible to conditions outside of anticipated norms. When that happens, “stable” systems tend to collapse entirely. Systems built for resilience, on the other hand, have failures all the time, but are built so that those failures have a minimum impact on the system as a whole. So the system remains resilient.
One of the best examples of design for resilience succeeding is the Amazon.com home page. Each of those features you see up there: featured offers, highlighted ads, personalized “what’s hot” sections, are all separate components working within a larger system. If one component fails, either other components fill the need, or the feature is just not displayed. The result is obvious…when was the last time you went to Amazon.com and it just wasn’t there?
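The "if one component fails, the feature is just not displayed" idea might look like the following sketch. It is a hypothetical illustration, not a description of how Amazon.com is built; `render_page` and the component names are made up for the example.

```python
def render_page(components):
    """Render each component independently.

    `components` is a list of (name, render, fallback) tuples, where
    `render` and `fallback` are zero-argument callables and `fallback`
    may be None. A failing component is replaced by its fallback or
    dropped, so one failure never takes down the whole page.
    """
    sections = []
    for name, render, fallback in components:
        try:
            sections.append(render())
        except Exception:
            # A real system would also log or emit a metric for `name` here.
            if fallback is not None:
                sections.append(fallback())
    return "\n".join(sections)


def featured_offers():
    return "Featured offers"


def highlighted_ads():
    raise RuntimeError("ad service is down")


def whats_hot():
    raise TimeoutError("personalization service timed out")


page = render_page([
    ("offers", featured_offers, None),
    ("ads", highlighted_ads, None),                 # dropped on failure
    ("hot", whats_hot, lambda: "Bestsellers"),      # generic fallback
])
```

The page still renders with two of three components failing, which is the tradeoff the notes describe: frequent small failures with limited impact rather than rare total collapse.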
Third, understand that, to build agents that thrive in complex systems, you need to focus much of your engineering not on the agent itself, but on how that agent relates to the world around it.
Remember the flash crash model I talked about earlier? What were the things that could be changed that would avoid the likelihood of a similar incident a few months or years later? Not the individual algorithms (hell, the financial services companies would never share those, anyways). Nope…
…it’s the relationships between those components (including many components not included in this model). The first thing the exchanges looked at (as well as the SEC and the Commodity Futures Trading Commission) was what mechanisms they could put in place to capture the “system” going haywire and stop a crisis before it starts.
In computing, believe it or not, there is already an excellent pattern that has proven itself in large scale computing environments that does exactly that. From a book called “Release It!”, which is another read I highly recommend, the Circuit Breaker pattern is all about intercepting traffic at an API or on a network, applying rules to that traffic and triggering certain actions, like forcing a failure of the API call, when it sees something it doesn’t like.
I wish I had more time to tell you about Circuit Breakers, but luckily one of the pioneers of its use at high scale has shared most of what they did in implementing them. I highly recommend checking out this link and the rest of the Netflix techblog if you want to learn more.
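For readers who want a feel for the pattern before following the link, here is a minimal sketch of a circuit breaker. It is a simplified illustration under assumed thresholds, not the Netflix implementation or the exact design from the book; the class and parameter names are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast instead of piling load on a sick dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Otherwise half-open: let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Trip the breaker, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_recommendations, user_id)` with a hypothetical `fetch_recommendations`, and treat `CircuitOpenError` as a signal to use a fallback. The breaker protects the relationship between components rather than either component on its own.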
Finally, one of the hardest things you’ll have to do as a developer operating in a complex adaptive system is to stay disciplined and aware enough of the system to avoid a drift into critical failures.
Drift is everywhere in complex systems. It is a gradual shift in the structure of a system that actually makes failure MORE likely, even if the goal was to make it less likely. The Deepwater Horizon explosion and subsequent oil spill is an example that Sidney Dekker calls out in his book, which I mentioned earlier. The BP employees interviewed after the accident often mentioned they were uncomfortable with specific decisions, given other factors they knew about, but felt it wasn’t worth the political cost to call them out, and besides, those decisions were often in place for months or years (on a variety of platforms) without negative consequence, so perhaps it wasn’t so risky after all…
Here’s how you avoid disaster in your production systems: beat the hell out of them…
The concept of “Chaos Monkeys” was made famous in part through the practices of companies like Amazon and Netflix. They actually continuously run test systems that stress the system from a usage perspective, perhaps degrade the performance of one component or another, or even outright kill components just to see what happens. The result is that developers KNOW there will be challenges to the components they are building, and they begin to architect them to be resilient…not stable.
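A toy version of "kill components just to see what happens" could look like this. It is a sketch under simple assumptions, not any company's actual tooling: services are modeled as a name-to-alive mapping, and the system counts as healthy when every role keeps at least one live replica. All names are hypothetical.

```python
import random


def chaos_round(services, rng, kill_probability=0.3):
    """Randomly kill running services; return the names that were killed."""
    killed = []
    for name, alive in services.items():
        if alive and rng.random() < kill_probability:
            services[name] = False
            killed.append(name)
    return killed


def system_healthy(services, roles):
    """Healthy means every role still has at least one live replica."""
    return all(any(services[name] for name in names)
               for names in roles.values())


# Hypothetical topology: two replicas per role.
roles = {"api": ["api-1", "api-2"], "db": ["db-1", "db-2"]}
rng = random.Random(42)  # seeded so a run can be repeated
outages = 0
for _ in range(1000):
    services = {name: True for names in roles.values() for name in names}
    chaos_round(services, rng)
    if not system_healthy(services, roles):
        outages += 1
```

Comparing `outages` for one replica per role against two or three gives a rough feel for how much resilience redundancy buys, and running such rounds continuously is what keeps that property from drifting away unnoticed.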
Another great idea to keep in mind to avoid drift: remember that the component you are building is NOT the system, but an agent in a larger system.
This is really quite different from the “build the most stable application you can” mentality of the siloed world. In an app-centric world, the focus is on the component, the application. The result is all about stabilizing the design with assumptions that the world around the application can be controlled. In cloud, that won’t work. You HAVE to take time to instrument as much of the world around your app or service or whatever as you can. Use that data as feedback, both in component design and operation, as well as in tweaking the system as much as you can to help your stuff thrive. Not survive. Thrive.
One last thing. The hardest thing to remember when working with complex systems is that everything that touches the system is (generally) a part of that system. That means you are a part of the system, and can only have, at best, a limited understanding of how the system works as a whole. It’s as if you were standing in a vast forest of tall trees. Quick. How big is the forest? Embrace complexity…er, complex adaptive systems. Don’t let this happen to you…
Thank you.
