There is no Root Cause - Emergent Behavior in Complex Systems

What went wrong? Why does this always happen? How can we ensure it Never Happens Again? For most of the internet age, engineering teams have focused on finding a cause of an outage. A belief existed, and persists, that all errors or behaviors can be traced back to a single causal entity. The Root Cause Analysis is conducted in service of finding that entity, and correcting it. By doing so, we have been taught, we prevent recurrence of the error in question.

Much of RCA thinking comes from manufacturing and electrical systems, where simple causality can exist. An oft-failing fuse is caused by poor wiring. In computing environments, there is rarely so simple a cause. Within even the simplest application nest dependencies, logic, bottlenecks, and inefficiency. By wrapping that application in an operating system, on a server, on a network, on the internet, managed by process, and actioned by people, we add enough complexity to force us to reconsider the Root Cause Analysis approach.

Modern tools and practices, like DevOps, enable engineering teams to adopt significant complexity at relatively low operational cost. Once unthinkable, microservice architecture in a public cloud environment is now a common choice for new software projects. Consider, for a moment, the layers of complexity captured in that decision. Now consider how opaque the agents in those systems are to the operators (us).

Emergence is a phenomenon whereby larger entities arise through interactions among smaller or simpler entities. In theory, complex systems exhibit highly unpredictable behavior and generate surprising patterns. In practice, teams operating complex engineering systems always see deeply interrelated causality: a blend of people, process, and the systems themselves. So why do we still focus our after-action analysis on a Single Cause?

In this talk, we’ll explore these conflicting realities for incident management teams. Attendees will learn about the differences between Root Cause Analysis and alternative techniques such as the postmortem. While this is a technical talk with examples of both simple and complex infrastructures, much time will be spent considering the impact of people and process on those same systems. Attendees will leave with some actionable ideas to bring back to their teams to improve their own after-action analysis activities.

Speaker

Matthew Boeckman

Matthew is an 18-year veteran of building infrastructure and leading engineering teams. Despite his heavy Ops background, Matthew has been a longtime friend of Developers and considers DevOps his primary passion and focus. Most recently VP of Infrastructure at Craftsy, Matthew now owns Dryas.io, a consulting practice focused on DevOps, Cloud adoption, and startup growth strategy.

Published in: Technology


  1. There is No Root Cause: Emergent behavior in complex systems. Matthew Boeckman
  2. Matthew Boeckman, @matthewboeckman. Developer Advocate, VictorOps. Technology Strategist, Dryas.io. 18 years (dev)Ops.
  3. VictorOps: Incident Management for DevOps teams. Incident ingestion, scheduling, routing, escalations, chatops, transformation, reporting.
  4. Root Cause Analysis: What went wrong? Why did that happen? Who was responsible? How can we prevent this from recurring?
  5. Fishbone! It’s like a tree, but sideways. People, Process, Pipeline, Code, Systems, Data. Something Bad Happened!
  6. Simple Systems: “3 tiers should be enough tiers for anybody” - some guy, probably
  7. Simple Systems: “I guess there’s more tiers?” - that guy
  8. Simple Systems: “We can easily identify the cause of faults in our digital offerings” - same guy
  9. Deployment Schedules: Let’s change once a year, then it will be easier to point fingers at Dev.
  10. Playing the long game: There were some good reasons
      ● It took a long time to create requirements
      ● It took a long time to write software
      ● It took a very long time to deploy applications
      ● It took a really, really long time to test software
      ● Testing patches was hard
      ● Deploying patches was all or nothing
      ● Managing Hardware was an entire department’s job
      ● Software and Hardware changes often required orchestration (that was hard)
  11. Playing the long game: There aren’t anymore
      ● It took a long time to create requirements
      ● It took a long time to write software
      ● It took a very long time to deploy applications
      ● It took a really, really long time to test software
      ● Testing patches was hard
      ● Deploying patches was all or nothing
      ● Managing Hardware was an entire department’s job
      ● Software and Hardware changes often required orchestration (that was hard)
  12. Root Cause = Static model, Binary Thinking
      GOOD: Working, Expected, Certain, Understood, Responsible, Uptime
      BAD: Broken, Problem, Disaster, Confused, Wrong, FAILURE
  13. Simple Systems: “3 tiers should be enough tiers for anybody” - some guy, probably
  14. So many tiers: “I thought we agreed on 3 tiers?”
  15. What’s traceability, precious?
  16. It’s not a tree...
  17. … it’s a forest
  18. Emergence and Complex Systems: “... refers to the existence or formation of collective behaviors — what parts of a system do together that they would not do alone.” [1] Properties and behaviors of systems arise from both the fine structures that compose those systems, and the interrelationships between the systems’ discrete parts.
      [1] Bar-Yam, Concepts: Emergence
  19. Root Cause Language
  20. Emergence Language
  21. Subtlety and Nuance. Our shared reality: High Complexity + Dramatic Change Vectors = Emergent Behavior
  22. Are we doomed?
  23. Cynefin
  24. Cynefin
      ● Created by Dave Snowden @snowded
      ● Originally for managing IBM Intellectual Capital
      ● Draws on research in systems, complexity, network and learning theories
  25. Simple: Sense, Categorize, Respond (*known)
  26. Complicated: Sense, Analyze, Respond (*probable)
  27. Complex: Probe, Sense, Respond (*emergent)
  28. Chaotic: Act, Sense, Respond (*buckle up)
  29. Disorder: Reduce, Analyze, Iterate (*culture)
  30. Analysis changes the game: Knowledge and practice move patterns towards more favorable quadrants.
  31. Complacency erodes progress: Slacking off walks back progress
  32. Adopting Cynefin
      ● In the moment: What Quadrant does this map to?
      ● In the PIR: How did we manage the pattern?
      ● In your sprint planning: What patterns can we manage clockwise?
  33. Root Cause Analysis: Simple Causality, Static Model, Binary Thinking, After-Action, Focus on Blame
      Cynefin: Dynamic, Expects Change, Embraces Emergence, Present in the Moment, Call to Action
  34. Subtlety and Nuance. Binary no more: There is no broken.
  35. “Uncertainty is an uncomfortable position. But certainty is an absurd one.” -Voltaire
  36. Thank you! @matthewboeckman
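
For teams that want to try the "In the moment: What Quadrant does this map to?" prompt from slide 32, the domain-to-response pairings on slides 25 to 29 could be kept as a small lookup. The following is only a sketch: the names `Domain`, `RESPONSES`, and `playbook` are hypothetical and do not come from the talk, and deciding which domain an incident belongs to remains a human judgment that this code does not attempt.

```python
from enum import Enum


class Domain(Enum):
    """Cynefin domains as named on slides 25 to 29 (identifiers are illustrative)."""
    SIMPLE = "simple"
    COMPLICATED = "complicated"
    COMPLEX = "complex"
    CHAOTIC = "chaotic"
    DISORDER = "disorder"


# Response sequences copied from the slides; nothing here is inferred.
RESPONSES = {
    Domain.SIMPLE: ("sense", "categorize", "respond"),
    Domain.COMPLICATED: ("sense", "analyze", "respond"),
    Domain.COMPLEX: ("probe", "sense", "respond"),
    Domain.CHAOTIC: ("act", "sense", "respond"),
    Domain.DISORDER: ("reduce", "analyze", "iterate"),
}


def playbook(domain: Domain) -> str:
    """Return the response sequence for a domain as a readable string."""
    return " -> ".join(RESPONSES[domain])


if __name__ == "__main__":
    # Print every domain with its response sequence.
    for d in Domain:
        print(f"{d.value}: {playbook(d)}")
```

An on-call responder might call `playbook(Domain.COMPLEX)` as a reminder to probe before analyzing. The lookup only restates the slides; it is not a substitute for the judgment the talk asks for.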
