Orchestrated Chaos: Applying Failure Testing Research at Scale.

Large-scale distributed systems must be built to anticipate and mitigate a variety of hardware and software failures. To build confidence that fault-tolerant systems are correctly implemented, an increasing number of large-scale sites practice Chaos Engineering, running regular failure drills in which faults are deliberately injected into their production systems. Fault injection infrastructures are becoming relatively mature, but the combinatorial space of failure scenarios is too large to search exhaustively, so existing approaches either explore the space of potential failures randomly or rely on the “hunches” of domain experts to guide the search. Random strategies waste resources testing “uninteresting” faults, while programmer-guided approaches are only as good as the programmer's intuition and scale only with human effort.
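To make "too large to search exhaustively" concrete, the sketch below reproduces the search-space sizes quoted on the talk's slides. It assumes a simplified model in which each of 100 services is either faulty or healthy in a given execution; the function names are illustrative and not taken from the talk.

```python
from math import comb

SERVICES = 100  # service count quoted on the slides


def fault_combinations(k: int, n: int = SERVICES) -> int:
    """Number of ways to choose exactly k faulty services out of n."""
    return comb(n, k)


def all_fault_sets(n: int = SERVICES) -> int:
    """Every subset of the n services is a candidate failure scenario."""
    return 2 ** n


if __name__ == "__main__":
    for k in (1, 4, 7):
        print(f"exactly {k} fault(s): {fault_combinations(k):,} executions")
    print(f"any subset of faults: {all_fault_sets():.3e} executions")
```

Under this model, combinations of 4 faults number about 3.9 million and combinations of 7 faults about 16 billion, in line with the slides' rounded "3M" and "16B". Allowing any subset gives 2^100, roughly 1.27 × 10^30.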
In this talk, I will present intuition, experience and research directions related to lineage-driven fault injection (LDFI), a novel approach to automating failure testing. LDFI utilizes existing tracing or logging infrastructures to work backwards from good outcomes, identifying redundant computations that allow it to aggressively prune the space of faults that must be explored via fault injection. I will describe LDFI’s theoretical roots in the database research notion of provenance, present results from the lab as well as the field, and present a call to arms for the reliability community to improve our understanding of when and how our fault-tolerant systems actually tolerate faults.
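The loop described above can be sketched on the replicated-write example from the slides. This is a minimal toy under stated assumptions, not the implementation behind the talk: `run_write` is a hypothetical system under test whose client sends two broadcasts and retries once if both are lost, `within_budget` is an assumed failure model (any message loss, at most one replica crash), and a brute-force hitting-set search stands in for the boolean solver that the recipe mentions.

```python
from itertools import combinations

REPLICAS = ("RepA", "RepB")


def run_write(faults):
    """Hypothetical system under test: return the supports of a stable write.

    Each support is a {replica, broadcast} pair that alone sufficed to store
    the write. An empty result means the write was lost.
    """
    sent = ["Bcast1", "Bcast2"]
    if all(b in faults for b in sent):  # no broadcast got through: retry once
        sent.append("Bcast3")
    return {
        frozenset((rep, b))
        for b in sent if b not in faults
        for rep in REPLICAS if rep not in faults
    }


def minimal_hitting_sets(clauses):
    """Brute force, smallest first: fault sets that intersect every support."""
    universe = sorted(set().union(*clauses))
    found = []
    for size in range(1, len(universe) + 1):
        for cand in map(frozenset, combinations(universe, size)):
            if any(prev <= cand for prev in found):
                continue  # contains a smaller hitting set, so not minimal
            if all(cand & clause for clause in clauses):
                found.append(cand)
    return found


def within_budget(faults, max_crashes=1):
    """Assumed failure model: any message loss, at most one replica crash."""
    return sum(f in REPLICAS for f in faults) <= max_crashes


def ldfi(run, budget=within_budget):
    """Return (violating fault set or None, list of hypotheses tried)."""
    clauses = set(run(frozenset()))  # start from a successful outcome
    assert clauses, "need a good outcome to work backwards from"
    tried = []
    while True:
        hypotheses = [h for h in minimal_hitting_sets(clauses)
                      if budget(h) and h not in tried]
        if not hypotheses:
            return None, tried       # nothing left to test within the budget
        hypothesis = hypotheses[0]
        tried.append(hypothesis)
        supports = run(hypothesis)   # inject the hypothesized faults
        if not supports:
            return hypothesis, tried # the good outcome was prevented
        clauses |= supports          # new redundancy observed: refine and repeat
```

In this toy, the first hypothesis is {Bcast1, Bcast2}. Injecting it reveals the Bcast3 retry, which adds supports to the lineage; the next hypothesis, {Bcast1, Bcast2, Bcast3}, prevents the write. Each experiment either finds a failing scenario or adds clauses that shrink the remaining search space.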

  1. PETER ALVARO. Orchestrated Chaos, with a prelude of vignettes and an appendix of fairy tales
  2. Mythology
  3. About me
  4. About me
  5. About me
  6. Platitudes
  7. “Managing complexity”
  8. Easy: removing complexity
  9. Much harder: moving complexity around
  10. Much harder: moving complexity around
  11. Much harder: moving complexity around
  12. Much harder: moving complexity around
  13. Much harder: moving complexity around
  14. Nontrivial systems problems always require tradeoffs. Productivity / Convenience; Purity / Correctness
  15. Vignettes
  16. Vignette 1: teaching myself docker
  17. Vignette 2: a DBA tale
  18. Vignette 3: selling lovely languages
  19. Vignette 4: Microservices. The UNIX philosophy: Do one thing and do it well.
  20. > man ls
  21. Ease of release wins
  22. Ease of release wins
  23. Ease of release wins
  24. Ease of release wins
  25. Ease of release wins
  26. Ease of release wins
  27. Ease of release wins
  28. Ease of release wins
  29. Ease of release wins
  30. Ease of release wins
  31. Ease of release wins
  32. Ease of release wins
  33. Ease of release wins
  34. The profound solipsism of the microservice
  35. The profound solipsism of the microservice
  36. The profound solipsism of the microservice
  37. The profound solipsism of the microservice
  38. Every microservice is a piece of the continent
  39. Every microservice is a piece of the continent
  40. What could possibly go wrong? Consider computation involving 100 services. Search space: 2^100 executions
  41. “Depth” of bugs. Single faults. Search space: 100 executions
  42. “Depth” of bugs. Combination of 4 faults. Search space: 3M executions
  43. “Depth” of bugs. Combination of 7 faults. Search space: 16B executions
  44. Reflections: 1. Managing complexity can be a zero-sum game 2. Productivity trumps purity 3. Chaos results… and gives rise to a new order
  45. Opportunity
  46. What the hell is going on? (Observability) [Diagram: Call graph tracing (e.g. Zipkin)]
  47. What could possibly go wrong? (Fault injection) [Diagram: A fault injection framework (e.g. FIT)]
  48. Random search [Diagram: A fault injection framework (e.g. FIT); Call graph tracing (e.g. Zipkin)]
  49. Random search. Search space: 2^100 executions
  50. Engineer-guided search [Diagram: A fault injection framework (e.g. FIT); Call graph tracing (e.g. Zipkin)]
  51. Engineer-guided search. Search space: ???
  52. …? [Diagram: A fault injection framework (e.g. FIT); Call graph tracing (e.g. Zipkin)]
  53. A cunning malevolent sentience? [Diagram: A fault injection framework (e.g. FIT); Call graph tracing (e.g. Zipkin)]
  54. A cunning malevolent sentience? [Diagram: A fault injection framework (e.g. FIT); Call graph tracing (e.g. Zipkin)]
  55. Lineage-driven Fault Injection [Diagram: A fault injection framework (e.g. FIT); LDFI; Call graph tracing (e.g. Zipkin)]
  56. Fault-tolerance “is just” redundancy
  57. But how do we know redundancy when we see it? Hard question: “Could a bad thing ever happen?” Easier: “Exactly why did a good thing happen?” “What could have gone wrong?”
  58. Lineage-driven fault injection. Why did a good thing happen? Consider its lineage. [Lineage graph nodes: The write is stable; Stored on RepA; Stored on RepB; Bcast1; Bcast2; Client; Client]
  59. Lineage-driven fault injection. Why did a good thing happen? Consider its lineage. What could have gone wrong? Faults are cuts in the lineage graph. Is there a cut that breaks all supports? [Lineage graph nodes: The write is stable; Stored on RepA; Stored on RepB; Bcast1; Bcast2; Client; Client]
  60. Lineage-driven fault injection. Why did a good thing happen? Consider its lineage. What could have gone wrong? Faults are cuts in the lineage graph. Is there a cut that breaks all supports? [Lineage graph nodes: The write is stable; Stored on RepA; Stored on RepB; Bcast1; Bcast2; Client; Client]
  61. What would have to go wrong? (RepA OR Bcast1) [Lineage graph nodes: The write is stable; Stored on RepA; Stored on RepB; Bcast1; Bcast2; Client; Client]
  62. What would have to go wrong? (RepA OR Bcast1) AND (RepA OR Bcast2) [Lineage graph nodes: The write is stable; Stored on RepA; Stored on RepB; Bcast1; Bcast2; Client; Client]
  63. What would have to go wrong? (RepA OR Bcast1) AND (RepA OR Bcast2) AND (RepB OR Bcast2) [Lineage graph nodes: The write is stable; Stored on RepA; Stored on RepB; Bcast1; Bcast2; Client; Client]
  64. What would have to go wrong? (RepA OR Bcast1) AND (RepA OR Bcast2) AND (RepB OR Bcast2) AND (RepB OR Bcast1) [Lineage graph nodes: The write is stable; Stored on RepA; Stored on RepB; Bcast1; Bcast2; Client; Client]
  65. Lineage-driven fault injection. Hypothesis: {Bcast1, Bcast2} [Lineage graph nodes: The write is stable; Stored on RepA; Stored on RepB; Bcast1; Bcast2; Client; Client]
  66. Lineage-driven fault injection. (RepA OR Bcast1) AND (RepA OR Bcast2) AND (RepB OR Bcast2) AND (RepB OR Bcast1) [Lineage graph nodes: The write is stable; Stored on RepA; Stored on RepB; Bcast1; Bcast2; Bcast3; Client; Client; Client]
  67. Lineage-driven fault injection. (RepA OR Bcast1) AND (RepA OR Bcast2) AND (RepB OR Bcast2) AND (RepB OR Bcast1) AND (RepA OR Bcast3) AND (RepB OR Bcast3) [Lineage graph nodes: The write is stable; Stored on RepA; Stored on RepB; Bcast1; Bcast2; Bcast3; Client; Client; Client]
  68. Search Space Reduction. Each experiment finds a bug, OR reduces the search space
  69. Lineage-driven Fault Injection. Recipe: 1. Start with a successful outcome. Work backwards. 2. Ask why it happened: Lineage 3. Convert lineage to a boolean formula and solve 4. Lather, rinse, repeat [Cycle diagram labels: 1. Success; Why?; 2. Lineage; Encode; 3. CNF; Solve; Fail; 4. REPEAT]
  70. Minimal requirements: 1. Fault injection infrastructure 2. Mechanism for collecting lineage 3. Ability to replay interactions
  71. Lineage
  72. Request Tracing
  73. Request Tracing
  74. Alternate Execution
  75. Redundancy through History
  76. Case study: “Netflix AppBoot”. Services: ~100. Search space (executions): 2^100 (1,000,000,000,000,000,000,000,000,000,000). Experiments performed: 200. Critical bugs found: 11
  77. Fairy tale
  78. Growing Research. Don’t: “Throw it over the wall”. Do: Deep embeddings; Trading shoes
  79. Growing Research
  80. Work with us: Search prioritization; Input generation; Richer lineage collection
  81. Search prioritization: minimal hitting sets. (A ∨ B ∨ C) ∧ (C ∨ D ∨ E ∨ F) ∧ (D ∨ E ∨ F ∨ G) ∧ (H ∨ I) ⇩ (A, B, C), (C, D, E, F), (D, E, F, G), (H, I)
  82. Search prioritization: minimal hitting sets. (A ∨ B ∨ C) ∧ (C ∨ D ∨ E ∨ F) ∧ (D ∨ E ∨ F ∨ G) ∧ (H ∨ I) ⇩ (A, B, C), (C, D, E, F), (D, E, F, G), (H, I) ⇩ e.g. (C, E, H) ✔
  83. Measuring FT by counting alternatives
  84. Measuring fault tolerance by counting alternatives
  85. Most likely combination of faults [Figure: X marks]
  86. Most likely combination of faults [Figure: X marks]
  87. Most likely combination of faults [Figure: X marks]
  88. Input generation. “Using lightweight modeling to understand Chord”, Pamela Zave
  89. The importance of being inputs. “Using lightweight modeling to understand Chord”, Pamela Zave
  90. The importance of being inputs. “Using lightweight modeling to understand Chord”, Pamela Zave
  91. The importance of being inputs. “Using lightweight modeling to understand Chord”, Pamela Zave
  92. Richer lineage collection
  93. Where we are [Diagram: A fault injection framework (e.g. FIT); Call graph tracing (e.g. Zipkin)]
  94. Where we’re headed [Diagram: A fault injection framework (e.g. FIT); Lineage-driven fault injection; Call graph tracing (e.g. Zipkin)]
  95. Thanks to our hosts, benefactors and collaborators!
  96. References
      ● ‘Automating Failure Testing at Internet Scale’ [ACM SoCC’16] https://people.ucsc.edu/~palvaro/fit-ldfi.pdf
      ● ‘Lineage Driven Fault Injection’ [ACM SIGMOD’15] http://people.ucsc.edu/~palvaro/molly.pdf
      ● Netflix Tech Blog on ‘Automated Failure Testing’ http://techblog.netflix.com/2016/01/automated-failure-testing.html
  97. FOLD
  98. The profound solipsism of the microservice
  99. UGLY
  100. GOOD RAW
  101. GOOD RAW
  102. GOOD RAW
  103. GOOD RAW
  104. True Silicon Valley Stories: 1. Crazy legwork 2. The “what the hell does our site do” project 3. Offsite => online
  105. Replay
  106. Bins and Balls [Figure labels: Request; Class 1; Class 2; Class 3; [...]; Class n; r’; r]
  107. Predicting Request Graphs [Figure labels: Request; Class n]
  108. Predicting Request Graphs. Some function f: Requests → Classes [Figure labels: Request; Class n]
  109. Predicting Request Graphs. F(Request) = Class n
  110. The profound solipsism of the microservice
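
The hitting-set example from the "Search prioritization" slides can be checked with a short brute-force sketch. The clause sets come from the slide; the helper names are illustrative, and ordering candidates by size is only one assumed way to prioritize (the slides do not say which ordering is used).

```python
from itertools import combinations

# Clause sets from the slide: (A, B, C), (C, D, E, F), (D, E, F, G), (H, I)
CLAUSES = [set("ABC"), set("CDEF"), set("DEFG"), set("HI")]


def is_hitting_set(candidate, clauses):
    """True if the candidate shares at least one element with every clause."""
    return all(candidate & clause for clause in clauses)


def is_minimal(candidate, clauses):
    """A hitting set is minimal if dropping any single element breaks it."""
    return is_hitting_set(candidate, clauses) and not any(
        is_hitting_set(candidate - {x}, clauses) for x in candidate
    )


def minimal_hitting_sets(clauses):
    """Enumerate all minimal hitting sets, smallest first (exponential search)."""
    universe = sorted(set().union(*clauses))
    return [
        set(combo)
        for size in range(1, len(universe) + 1)
        for combo in combinations(universe, size)
        if is_minimal(set(combo), clauses)
    ]


if __name__ == "__main__":
    for fault_set in minimal_hitting_sets(CLAUSES):
        print(sorted(fault_set))
```

The slide's example (C, E, H) is one of the minimal hitting sets: C covers the first two clauses, E covers the second and third, and H covers the last, and removing any one of them leaves some clause uncovered. Exhaustive enumeration is only practical for small formulas like this one.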
