Resiliency, Failures vs Errors, Isolation, Delegation and Replication in Reactive Systems
Jonas Bonér
CTO, Typesafe
@jboner
Akka Actors
Akka Actors
case class Greeting(who: String)
class GreetingActor extends Actor with ActorLogging {
def receive = {
case Gr...
Bulkhead
Pattern
Enter Supervision
The
Vending
Machine
Pattern
Think Vending Machine
Diagram 1 (Programmer, Coffee Machine): Inserts coins; Add more coins; Gets coffee.
Diagram 2 (Programmer, Coffee Machine): Inserts coins; Out of coffee beans error. WRONG
Diagram 3 (Programmer, Coffee Machine, Service Guy): Inserts coins; Out of coffee beans failure; Adds more beans; Gets coffee.
Think Vending Machine
Diagram 4 (Client, Service, Supervisor): Request; Response; Validation Error; Application Failure; Manages Failure.
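The vending machine slides separate errors, which go back to the client, from failures, which go to someone who can fix them. A minimal sketch of that split in plain Scala follows. All names here (`CoffeeMachine`, `Supervisor`, `OutOfBeans`, the price of 2 coins) are hypothetical, and the supervisor is called synchronously only to keep the sketch short; it is not Akka's supervision API.

```scala
object VendingMachineSketch {
  // What the client may get back: a result or a validation error.
  final case class Coffee(cups: Int)
  final case class ValidationError(reason: String)

  // A failure is reified as a value and sent to the supervisor, never to the client.
  final case class OutOfBeans(machine: CoffeeMachine)

  // Hypothetical supervisor: the "service guy" who knows how to fix the machine.
  final class Supervisor {
    var handled: List[OutOfBeans] = Nil
    def report(failure: OutOfBeans): Unit = {
      handled = failure :: handled
      failure.machine.refill(10) // manage the failure: add more beans
    }
  }

  final class CoffeeMachine(supervisor: Supervisor, private var beans: Int) {
    val price = 2
    def refill(amount: Int): Unit = beans += amount

    // Errors (bad input) go back to the client; failures go to the supervisor.
    def insertCoins(coins: Int): Either[ValidationError, Coffee] =
      if (coins < price) Left(ValidationError(s"add ${price - coins} more coins"))
      else {
        if (beans == 0) supervisor.report(OutOfBeans(this))
        beans -= 1
        Right(Coffee(1))
      }
  }

  def main(args: Array[String]): Unit = {
    val supervisor = new Supervisor
    val machine = new CoffeeMachine(supervisor, 0)
    println(machine.insertCoins(1)) // too few coins: the client's problem
    println(machine.insertCoins(2)) // no beans: the supervisor's problem
    println(supervisor.handled.size)
  }
}
```

The client only ever sees `Left(ValidationError)` or `Right(Coffee)`; it never learns that the machine ran out of beans.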
“Accidents come from relationships 
not broken parts.”
- Sidney Dekker
Drift into Failure - Sidney Dekker
Error Kernel
Pattern
Onion-layered state & Failure management
Making reliable distributed systems in the presence of softw...
Onion Layered State Management
Diagram labels: Client, Thread boundary, Object, Critical state that needs protection, Error Kernel, Supervision.
Supervision in Akka
Every actor has a default supervisor strategy.
Which can, and often should, be overridden.
class Super...
Monitor through
Death Watch
class Watcher extends Actor {
val child = context.actorOf(Props.empty, "child")
context.watch(...
Maintain Diversity
and Redundancy
Akka Routing
akka {
actor {
deployment {
/service/router {
router = round-robin-pool
resizer {
lower-bound = 12
upper-boun...
Akka Cluster
akka {
actor {
deployment {
/service/router {
router = round-robin-pool
resizer {
lower-bound = 12
upper-boun...
The Network
is Reliable
NAT
We are living in the
Looming
Shadow of
Impossibility
Theorems
CAP: Consistency is impossible
FLP: Consensus is impossible
Strong
Consistency
Is the wrong default
We need
Decoupling in
Time and Space
Resilient Protocols
Depend on
Asynchronous Communication
Eventual Consistency
• are tolerant to
• Message loss
• Message reordering
• Message duplication
• Embrace ACID 2.0
• Assoc...
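One way to tolerate lost, reordered and duplicated messages is to exchange state whose merge is associative, commutative and idempotent. ACID 2.0 is commonly expanded as Associative, Commutative, Idempotent, Distributed. The sketch below is a hand-written grow-only counter in plain Scala, not the Akka Distributed Data API; the node names and the `GCounter` type are assumptions made for illustration.

```scala
object GCounterSketch {
  // Grow-only counter: one non-negative slot per node id.
  final case class GCounter(slots: Map[String, Long] = Map.empty) {
    def increment(node: String, by: Long = 1L): GCounter =
      GCounter(slots.updated(node, slots.getOrElse(node, 0L) + by))

    // Merge takes the per-node maximum: associative, commutative and idempotent.
    def merge(that: GCounter): GCounter =
      GCounter((slots.keySet ++ that.slots.keySet).map { node =>
        node -> math.max(slots.getOrElse(node, 0L), that.slots.getOrElse(node, 0L))
      }.toMap)

    def value: Long = slots.values.sum
  }

  def main(args: Array[String]): Unit = {
    val a = GCounter().increment("a").increment("a")
    val b = GCounter().increment("b")
    // Deliver the same updates in a different order, with a duplicate, and compare.
    val inOrder   = GCounter().merge(a).merge(b)
    val reordered = GCounter().merge(b).merge(a).merge(b)
    println(inOrder.value)
    println(inOrder == reordered)
  }
}
```

Because a later state subsumes earlier ones, a lost message is repaired by the next merge that arrives.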
Akka Distributed Data
class DataBot extends Actor with ActorLogging {
val replicator = DistributedData(context.system).rep...
“An escalator can never break: it can only
become stairs. You should never see an
Escalator Temporarily Out Of Order sign,...
Graceful
Degradation
Always Rely on
Asynchronous
Communication
…And if not possible
Always use
Timeouts
Circuit Breaker
Circuit Breaker in Akka
val breaker =
new CircuitBreaker(
context.system.scheduler,
maxFailures = 5,
callTimeout = 10.seco...
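To make the circuit breaker idea concrete without guessing at the truncated snippet above, here is a minimal state machine in plain Scala. It is a sketch and not Akka's `CircuitBreaker`: it has no call timeout and is not thread-safe, and the names `SimpleBreaker`, `resetTimeoutMs` and `clock` are assumptions made for illustration.

```scala
object CircuitBreakerSketch {
  sealed trait State
  case object Closed extends State
  case object Open extends State
  case object HalfOpen extends State

  final class BreakerOpenException extends RuntimeException("circuit breaker is open")

  // Not thread-safe; `clock` is injectable so the reset timeout can be tested.
  final class SimpleBreaker(maxFailures: Int, resetTimeoutMs: Long,
                            clock: () => Long = () => System.currentTimeMillis()) {
    private var failures = 0
    private var openedAt = 0L
    private var current: State = Closed

    // After the reset timeout an open breaker lets one trial call through.
    def state: State = {
      if (current == Open && clock() - openedAt >= resetTimeoutMs) current = HalfOpen
      current
    }

    def call[A](body: => A): A = state match {
      case Open => throw new BreakerOpenException // fail fast, do not touch the service
      case _ =>
        try {
          val result = body
          failures = 0
          current = Closed
          result
        } catch {
          case e: Exception =>
            failures += 1
            if (current == HalfOpen || failures >= maxFailures) {
              current = Open
              openedAt = clock()
            }
            throw e
        }
    }
  }

  def main(args: Array[String]): Unit = {
    val breaker = new SimpleBreaker(maxFailures = 2, resetTimeoutMs = 1000L)
    def flaky(): String = throw new RuntimeException("service down")
    for (_ <- 1 to 3) {
      // The caller takes "no" for an answer and degrades to a fallback value.
      val answer = try breaker.call(flaky()) catch { case _: Exception => "fallback" }
      println(s"$answer (${breaker.state})")
    }
  }
}
```

Once the breaker is open, callers are rejected immediately instead of queueing behind a service that is already struggling.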
Little’s Law
L: Queue Length
W: Response Time
Queue Length = Arrival Rate * Response Time
Response Time = Queue Length / Arrival Rate
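A small worked example of Little's Law in Scala follows. The numbers (100 requests per second, 0.25 seconds, a queue of 50) are made up for illustration, and the function names are assumptions.

```scala
object LittlesLawSketch {
  // L = arrival rate * W: mean queue length from arrival rate and mean response time.
  def queueLength(arrivalRatePerSec: Double, responseTimeSec: Double): Double =
    arrivalRatePerSec * responseTimeSec

  // W = L / arrival rate: mean response time from queue length and arrival rate.
  def responseTime(queueLength: Double, arrivalRatePerSec: Double): Double =
    queueLength / arrivalRatePerSec

  def main(args: Array[String]): Unit = {
    // 100 requests/s, each taking 0.25 s on average.
    println(queueLength(100.0, 0.25))
    // A queue held at 50 items under 100 requests/s.
    println(responseTime(50.0, 100.0))
  }
}
```

Read the second formula as a design hint: at a given arrival rate, capping the queue length caps the response time.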
Flow Control
Always Apply Back Pressure
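The simplest form of the idea is a bounded buffer that tells the producer to slow down instead of growing without limit. The sketch below uses `java.util.concurrent.ArrayBlockingQueue`; the `BoundedInbox` wrapper is hypothetical. Full back pressure, as in Reactive Streams, propagates demand upstream, which this sketch does not attempt.

```scala
import java.util.concurrent.ArrayBlockingQueue

object BackPressureSketch {
  // A bounded inbox: `offer` returns false instead of growing without limit.
  final class BoundedInbox[A](capacity: Int) {
    private val queue = new ArrayBlockingQueue[A](capacity)
    def offer(item: A): Boolean = queue.offer(item) // false means "slow down"
    def poll(): Option[A] = Option(queue.poll())    // None when the inbox is empty
    def size: Int = queue.size
  }

  def main(args: Array[String]): Unit = {
    val inbox = new BoundedInbox[Int](2)
    // The producer learns immediately which items were refused.
    val accepted = (1 to 5).map(inbox.offer)
    println(accepted)
    inbox.poll()            // the consumer makes room
    println(inbox.offer(6)) // and the producer may continue
  }
}
```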
Akka Streams
in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out
bcast ~> f4 ~> merge
val g = FlowGraph.closed() {
implicit buil...
Testing
What can we learn from Arnold?
Blow things up
Shoot
Your App
Down
Pull the Plug
…and see what happens
Executive
Summary
“Complex systems run in degraded mode.”
“Complex systems run as broken systems.”
- Richard Cook
How Complex Systems Fail -...
Resilience
is by
Design
Photo courtesy of FEMA/Joselyne Augustino
References
Antifragile: Things That Gain from Disorder - http://www.amazon.com/Antifragile-Things-that-Gain-Disorder-ebook/...
Thank
You
Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Delegation and Replication in Reactive Systems

Part 3: What you should know about Resiliency, Errors vs Failures, Isolation (and Containment), Delegation and Replication in Reactive systems

In the final webinar with live Q/A in the Reactive Revealed series, we explore how Reactive systems maintain resiliency with an infrastructural approach designed to welcome failure and recover gracefully. Jonas Bonér, Reactive Manifesto co-author, Akka creator and CTO at Typesafe, Inc., presents what you should know about:

Maintaining resiliency in monolithic systems compared to distributed systems
How Reactive systems handle errors and prevent catastrophic failures with isolation and containment, delegation and replication
How isolation (and containment) of error state and behavior works to block the ripple effect of cascading failures
How delegation of failure management and replication lets Reactive systems continue running in the face of failures using a different error handling context, on a different thread or thread pool, in a different process, or on a different network node or computing center
Previous

Part 1 - Asynchronous I/O, Back-pressure and the Message-driven vs. Event-driven approach in Reactive systems | presented by Konrad Malawski

Part 2 - Elasticity, Scalability and Location Transparency in Reactive Systems | presented by Viktor Klang

Published in: Software

  1. Resiliency, Failures vs Errors, Isolation, Delegation and Replication in Reactive Systems. Jonas Bonér, CTO, Typesafe, @jboner
  2-3. “But it ain’t how hard you’re hit; it’s about how hard you can get hit, and keep moving forward. How much you can take, and keep moving forward. That’s how winning is done.” - Rocky Balboa. This is Fault Tolerance
  4. Resilience is Beyond Fault Tolerance
  5. Resilience: “The ability of a substance or object to spring back into shape. The capacity to recover quickly from difficulties.” - Merriam-Webster
  6-7. Without Resilience Nothing Else Matters
  8. “We can model and understand in isolation. But, when released into competitive nominally regulated societies, their connections proliferate, their interactions and interdependencies multiply, their complexities mushroom. And we are caught short.” - Sidney Dekker (Drift into Failure)
  9. We Need to Study Resilience in Complex Systems
  10-11. Complicated System
  12-14. Complex System
  15. Complicated ≠ Complex
  16. “Counterintuitive. That’s [Jay] Forrester’s word to describe complex systems. Leverage points are not intuitive. Or if they are, we intuitively use them backward, systematically worsening whatever problems we are trying to solve.” - Donella Meadows (Leverage Points: Places to Intervene in a System)
  17-38. Operating at the Edge of Failure (one diagram built up over several slides). Labels: Economic Failure Boundary, Unacceptable Workload Boundary, Accident Boundary, Operating Point, FAILURE, Management Pressure Towards Economic Efficiency, Gradient Towards Least Effort, Counter Gradient For More Resilience, Error Margin, Marginal Boundary. Sources: “Going solid”: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen; Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
  39. Embrace Failure
  40. Resilience in Social Systems
  41. Simple Critical Infrastructure Map: Understanding vital services, and how they keep you safe. 6 ways to die, 3 sets of essential services, 7 layers of PROTECTION. Dealing in Security - Mike Bennet, Vinay Gupta
  42. 7 Principles for Building Resilience in Social Systems: 1. Maintain diversity & Redundancy 2. Manage connectivity 3. Manage slow variables & feedback 4. Foster complex adaptive systems thinking 5. Encourage learning 6. Broaden participation 7. Promote polycentric governance. Principles for Building Resilience: Sustaining Ecosystem Services in Social-Ecological Systems - Reinette Biggs et al.
  43. Resilience in Biological Systems
  44. Meerkats (Puppies! Now that I’ve got your attention, complexity theory - Nicolas Perony, TED talk)
  45. What We Can Learn From Biological Systems: 1. Feature Diversity and redundancy 2. Inter-Connected network structure 3. Wide distribution across all scales 4. Capacity to self-adapt & self-organize. Toward Resilient Architectures 1: Biology Lessons - Michael Mehaffy, Nikos A. Salingaros
  46. “Animals show extraordinary social complexity, and this allows them to adapt and respond to changes in their environment. In three words, in the animal kingdom, simplicity leads to complexity which leads to resilience.” - Nicolas Perony (Puppies! Now that I’ve got your attention, complexity theory, TED talk)
  47. Resilience in Computer Systems
  48. “Complex systems run in degraded mode.” “Complex systems run as broken systems.” - Richard Cook (How Complex Systems Fail)
  49. Resilience is by Design. Photo courtesy of FEMA/Joselyne Augustino
  50. We Need to Manage Failure
  51. “Post-accident attribution to a ‘root cause’ is fundamentally wrong: Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident.” - Richard Cook (How Complex Systems Fail)
  52. There is No Root Cause
  53. Crash Only Software: Stop = Crash Safely, Start = Recover Fast. Crash-Only Software - George Candea, Armando Fox
  54. “To make a system of interconnected components crash-only, it must be designed so that components can tolerate the crashes and temporary unavailability of their peers. This means we require: [1] strong modularity with relatively impermeable component boundaries, [2] timeout-based communication and lease-based resource allocation, and [3] self-describing requests that carry a time-to-live and information on whether they are idempotent.” - George Candea, Armando Fox (Crash-Only Software)
  55. Recursive Restartability: Turning the Crash-Only Sledgehammer into a Scalpel. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel - George Candea, Armando Fox
  56. Services need to accept NO for an answer
  57. Learn to take NO for an answer. “Software components should be designed such that they can deny service for any request or call. Then, if an underlying component can say No, apps must be designed to take No for an answer and decide how to proceed: give up, wait and retry, reduce fidelity, etc.” - George Candea, Armando Fox (Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel)
  58. “The explosive growth of software has added greatly to systems’ interactive complexity. With software, the possible states that a system can end up in become mind-boggling.” - Sidney Dekker (Drift into Failure)
  59. 59. Classification of State • Static Data • Scratch Data • Dynamic Data • Recomputable • not recomputable
  60. 60. Classification of State • Static Data • Scratch Data • Dynamic Data • Recomputable • not recomputable Critical
  61. 61. We Need a Way Out of the State Tar Pit Out of the Tar Pit - Ben Moseley , Peter Marks
  62. 62. Essential State We Need a Way Out of the State Tar Pit Out of the Tar Pit - Ben Moseley , Peter Marks
  63. 63. Essential State We Need a Way Out of the State Tar Pit Out of the Tar Pit - Ben Moseley , Peter Marks Essential Logic
  64. 64. Essential State We Need a Way Out of the State Tar Pit Out of the Tar Pit - Ben Moseley , Peter Marks Essential Logic Accidental State and Control
  65. 65. Traditional State Management Object Critical state that needs protection Client Thread boundary
  68. 68. Traditional State Management Object Critical state that needs protection Client Thread boundary Synchronous dispatch Thread boundary
  69. 69. Traditional State Management Object Critical state that needs protection Client Thread boundary Synchronous dispatch Thread boundary
  70. 70. Traditional State Management Object Critical state that needs protection Client Thread boundary Synchronous dispatch Thread boundary ?
  71. 71. Traditional State Management Object Critical state that needs protection Client Thread boundary Synchronous dispatch Thread boundary ? Utterly broken
  72. 72. Requirements for a Sane Failure Mode 1. Contained 2. Reified—as messages 3. Signalled—Asynchronously 4. Observed—by 1-N 5. Managed Failures need to be
  73. 73. Akka Actors
  74. 74. Akka Actors case class Greeting(who: String) class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s"Hello ${who}") } }
  75. 75. Akka Actors case class Greeting(who: String) class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s"Hello ${who}") } } Define the message(s) the Actor should be able to respond to
  76. 76. Akka Actors case class Greeting(who: String) class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s"Hello ${who}") } } Define the message(s) the Actor should be able to respond to Define the Actor class
  77. 77. Akka Actors case class Greeting(who: String) class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s"Hello ${who}") } } Define the message(s) the Actor should be able to respond to Define the Actor class Define the Actor’s behavior
  78. 78. Akka Actors case class Greeting(who: String) class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s"Hello ${who}") } }
  79. 79. Akka Actors case class Greeting(who: String) class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s"Hello ${who}") } } val system = ActorSystem("MySystem") val greeter = system.actorOf(Props[GreetingActor], "greeter")
  80. 80. Akka Actors case class Greeting(who: String) class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s"Hello ${who}") } } val system = ActorSystem("MySystem") val greeter = system.actorOf(Props[GreetingActor], "greeter") Create an Actor system
  81. 81. Akka Actors case class Greeting(who: String) class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s"Hello ${who}") } } val system = ActorSystem("MySystem") val greeter = system.actorOf(Props[GreetingActor], "greeter") Create an Actor system Actor configuration
  82. 82. Akka Actors case class Greeting(who: String) class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s"Hello ${who}") } } val system = ActorSystem("MySystem") val greeter = system.actorOf(Props[GreetingActor], "greeter") Give it a name Create an Actor system Actor configuration
  83. 83. Akka Actors case class Greeting(who: String) class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s"Hello ${who}") } } val system = ActorSystem("MySystem") val greeter = system.actorOf(Props[GreetingActor], "greeter") Give it a name Create the Actor Create an Actor system Actor configuration
  84. 84. Akka Actors case class Greeting(who: String) class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s"Hello ${who}") } } val system = ActorSystem("MySystem") val greeter = system.actorOf(Props[GreetingActor], "greeter") Give it a name Create the Actor You get an ActorRef back Create an Actor system Actor configuration
  85. 85. Akka Actors case class Greeting(who: String) class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s"Hello ${who}") } } val system = ActorSystem("MySystem") val greeter = system.actorOf(Props[GreetingActor], "greeter")
  86. 86. Akka Actors case class Greeting(who: String) class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s"Hello ${who}") } } val system = ActorSystem("MySystem") val greeter = system.actorOf(Props[GreetingActor], "greeter") greeter ! Greeting("Charlie Parker")
  87. 87. Akka Actors case class Greeting(who: String) class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s"Hello ${who}") } } val system = ActorSystem("MySystem") val greeter = system.actorOf(Props[GreetingActor], "greeter") greeter ! Greeting("Charlie Parker") Send the message asynchronously
  88. 88. Akka Actors case class Greeting(who: String) class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s"Hello ${who}") } } val system = ActorSystem("MySystem") val greeter = system.actorOf(Props[GreetingActor], "greeter") greeter ! Greeting("Charlie Parker")
  89. 89. Bulkhead Pattern
  92. 92. Enter Supervision
  94. 94. The Vending Machine Pattern
  95. 95. Think Vending Machine Coffee Machine Programmer
  96. 96. Think Vending Machine Coffee Machine Programmer Inserts coins
  97. 97. Think Vending Machine Coffee Machine Programmer Inserts coins Add more coins
  98. 98. Think Vending Machine Coffee Machine Programmer Inserts coins Gets coffee Add more coins
  99. 99. Think Vending Machine Programmer Coffee Machine
  100. 100. Think Vending Machine Programmer Inserts coins Coffee Machine
  101. 101. Think Vending Machine Programmer Inserts coins Out of coffee beans error Coffee Machine
  102. 102. Think Vending Machine Programmer Inserts coins Out of coffee beans error WRONG Coffee Machine
  103. 103. Think Vending Machine Programmer Inserts coins Coffee Machine
  104. 104. Think Vending Machine Programmer Inserts coins Out of coffee beans failure Coffee Machine
  105. 105. Think Vending Machine Programmer Service Guy Inserts coins Out of coffee beans failure Coffee Machine
  106. 106. Think Vending Machine Programmer Service Guy Inserts coins Out of coffee beans failure Adds more beans Coffee Machine
  107. 107. Think Vending Machine Programmer Service Guy Inserts coins Gets coffee Out of coffee beans failure Adds more beans Coffee Machine
  108. 108. Think Vending Machine ServiceClient
  109. 109. Think Vending Machine ServiceClient Request
  110. 110. Think Vending Machine ServiceClient Request Response
  111. 111. Think Vending Machine ServiceClient Request Response Validation Error
  112. 112. Think Vending Machine ServiceClient Request Response Validation Error Application Failure
  113. 113. Think Vending Machine ServiceClient Supervisor Request Response Validation Error Application Failure
  114. 114. Think Vending Machine ServiceClient Supervisor Request Response Validation Error Application Failure Manages Failure
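The split between validation errors and application failures can be sketched in plain Scala, without actors. Everything here is hypothetical and chosen only to mirror the vending machine slides: `CoffeeMachine`, `ValidationError`, `OutOfBeans` and `supervised` are not Akka APIs, and the failure is signalled with a synchronous exception purely to keep the sketch short, whereas the deck argues for asynchronous signalling.

```scala
object VendingMachine {
  // An error is part of the protocol and goes back to the client.
  final case class ValidationError(reason: String)

  // A failure is not the client's concern; it goes to the supervisor.
  final class OutOfBeans extends RuntimeException("out of coffee beans")

  final class CoffeeMachine(price: Int, private var beans: Int) {
    def refill(amount: Int): Unit = beans += amount

    // Left is a validation error for the client; OutOfBeans is a
    // failure that the client is never expected to handle.
    def brew(coins: Int): Either[ValidationError, String] =
      if (coins < price) Left(ValidationError(s"add ${price - coins} more coin(s)"))
      else if (beans == 0) throw new OutOfBeans
      else { beans -= 1; Right("coffee") }
  }

  // The supervisor (the "service guy") deals with the failure out of
  // the client's sight, then lets the original request proceed.
  def supervised(machine: CoffeeMachine, coins: Int): Either[ValidationError, String] =
    try machine.brew(coins)
    catch { case _: OutOfBeans => machine.refill(10); machine.brew(coins) }
}
```

The client only ever sees a response or a validation error; running out of beans is resolved between the machine and its supervisor.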
  115. 115. “Accidents come from relationships not broken parts.” - Sidney Dekker Drift into Failure - Sidney Dekker
  116. 116. Error Kernel Pattern Onion-layered state & Failure management Making reliable distributed systems in the presence of software errors - Joe Armstrong On Erlang, State and Crashes - Jesper Louis Andersen
  117. 117. Onion Layered State Management Object Critical state that needs protection Client Thread boundary
  118. 118. Onion Layered State Management Object Critical state that needs protection Client Thread boundary
  119. 119. Onion Layered State Management Error Kernel Object Critical state that needs protection Client Thread boundary
  120. 120. Onion Layered State Management Error Kernel Object Critical state that needs protection Client Thread boundary
  121. 121. Onion Layered State Management Error Kernel Object Critical state that needs protection Client Supervision Thread boundary
  122. 122. Onion Layered State Management Error Kernel Object Critical state that needs protection Client Supervision Supervision Thread boundary
  131. 131. Supervision in Akka Every actor has a default supervisor strategy. Which can, and often should, be overridden. class Supervisor extends Actor { override val supervisorStrategy = OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1 minute) { case _: ArithmeticException => Resume case _: NullPointerException => Restart case _: Exception => Escalate } val worker = context.actorOf(Props[Worker], name = "worker") def receive = { case number: Int => worker.forward(number) } }
  132. 132. Supervision in Akka Every actor has a default supervisor strategy. Which can, and often should, be overridden. class Supervisor extends Actor { override val supervisorStrategy = OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1 minute) { case _: ArithmeticException => Resume case _: NullPointerException => Restart case _: Exception => Escalate } val worker = context.actorOf(Props[Worker], name = "worker") def receive = { case number: Int => worker.forward(number) } } Parent actor
  133. 133. Supervision in Akka Every actor has a default supervisor strategy. Which can, and often should, be overridden. class Supervisor extends Actor { override val supervisorStrategy = OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1 minute) { case _: ArithmeticException => Resume case _: NullPointerException => Restart case _: Exception => Escalate } val worker = context.actorOf(Props[Worker], name = "worker") def receive = { case number: Int => worker.forward(number) } } Parent actor All its children have their life-cycle managed through this declarative supervision strategy
  134. 134. Supervision in Akka Every actor has a default supervisor strategy. Which can, and often should, be overridden. class Supervisor extends Actor { override val supervisorStrategy = OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1 minute) { case _: ArithmeticException => Resume case _: NullPointerException => Restart case _: Exception => Escalate } val worker = context.actorOf(Props[Worker], name = "worker") def receive = { case number: Int => worker.forward(number) } } Create a supervised child actor Parent actor All its children have their life-cycle managed through this declarative supervision strategy
  135. 135. Monitor through Death Watch class Watcher extends Actor { val child = context.actorOf(Props.empty, "child") context.watch(child) def receive = { case Terminated(`child`) => … // handle child termination } }
  136. 136. Monitor through Death Watch class Watcher extends Actor { val child = context.actorOf(Props.empty, "child") context.watch(child) def receive = { case Terminated(`child`) => … // handle child termination } } Create a child actor
  137. 137. Monitor through Death Watch class Watcher extends Actor { val child = context.actorOf(Props.empty, "child") context.watch(child) def receive = { case Terminated(`child`) => … // handle child termination } } Create a child actor Watch it
  138. 138. Monitor through Death Watch class Watcher extends Actor { val child = context.actorOf(Props.empty, "child") context.watch(child) def receive = { case Terminated(`child`) => … // handle child termination } } Create a child actor Watch it Receive termination signal
  139. 139. Maintain Diversity and Redundancy
  141. 141. Akka Routing akka { actor { deployment { /service/router { router = round-robin-pool resizer { lower-bound = 12 upper-bound = 15 } } } } }
  142. 142. Akka Cluster akka { actor { deployment { /service/router { router = round-robin-pool resizer { lower-bound = 12 upper-bound = 15 } } } provider = "akka.cluster.ClusterActorRefProvider" } cluster { seed-nodes = [ "akka.tcp://ClusterSystem@127.0.0.1:2551", "akka.tcp://ClusterSystem@127.0.0.1:2552" ] } }
  143. 143. The Network is Reliable
  144. 144. The Network is Reliable NAT
  145. 145. We are living in the Looming Shadow of Impossibility Theorems
  146. 146. We are living in the Looming Shadow of Impossibility Theorems CAP: Consistency is impossible
  147. 147. We are living in the Looming Shadow of Impossibility Theorems CAP: Consistency is impossible FLP: Consensus is impossible
  148. 148. Strong Consistency Is the wrong default
  149. 149. We Need Decoupling in Time and Space
  150. 150. Resilient Protocols Depend on Asynchronous Communication Eventual Consistency
  151. 151. Resilient Protocols • are tolerant to • Message loss • Message reordering • Message duplication Depend on Asynchronous Communication Eventual Consistency
  152. 152. Resilient Protocols • are tolerant to • Message loss • Message reordering • Message duplication • Embrace ACID 2.0 • Associative • Commutative • Idempotent • Distributed Depend on Asynchronous Communication Eventual Consistency
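To make the ACID 2.0 properties concrete, here is a minimal sketch of a grow-only counter whose merge is associative, commutative and idempotent. It is a hand-rolled illustration in plain Scala, not the Akka Distributed Data implementation; `GCounter`, `increment`, `merge` and `value` are names assumed for this example.

```scala
object GCounterDemo {
  // Hypothetical grow-only counter: one count per node.
  type GCounter = Map[String, Long]

  // Each node only ever increments its own entry.
  def increment(counter: GCounter, node: String): GCounter =
    counter.updated(node, counter.getOrElse(node, 0L) + 1)

  // Merge takes the per-node maximum, so replicas can be merged in any
  // grouping, in any order, and any number of times with the same result.
  def merge(a: GCounter, b: GCounter): GCounter =
    (a.keySet ++ b.keySet)
      .map(node => node -> math.max(a.getOrElse(node, 0L), b.getOrElse(node, 0L)))
      .toMap

  // The counter's value is the sum over all nodes.
  def value(counter: GCounter): Long = counter.values.sum
}
```

Because the merge tolerates reordering and duplication, a replica can apply the same update message twice, or apply two updates in either order, and still converge.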
  153. 153. Akka Distributed Data class DataBot extends Actor with ActorLogging { val replicator = DistributedData(context.system).replicator implicit val node = Cluster(context.system) val tickTask = context.system.scheduler.schedule( 5.seconds, 5.seconds, self, Tick) val DataKey = ORSetKey[String]("key") replicator ! Subscribe(DataKey, self)
  154. 154. Akka Distributed Data class DataBot extends Actor with ActorLogging { val replicator = DistributedData(context.system).replicator implicit val node = Cluster(context.system) val tickTask = context.system.scheduler.schedule( 5.seconds, 5.seconds, self, Tick) val DataKey = ORSetKey[String]("key") replicator ! Subscribe(DataKey, self) def receive = { case Tick => val s = ThreadLocalRandom.current().nextInt(97, 123).toChar.toString if (ThreadLocalRandom.current().nextBoolean()) replicator ! Update(DataKey, ORSet.empty[String], WriteLocal)(_ + s) else replicator ! Update(DataKey, ORSet.empty[String], WriteLocal)(_ - s) case c @ Changed(DataKey) => val data = c.get(DataKey) case _: UpdateResponse[_] => // ignore } }
  155. 155. “An escalator can never break: it can only become stairs. You should never see an Escalator Temporarily Out Of Order sign, just Escalator Temporarily Stairs. Sorry for the convenience.” - Mitch Hedberg
  156. 156. Graceful Degradation
  157. 157. Always Rely on Asynchronous Communication
  158. 158. Always Rely on Asynchronous Communication …And If Not Possible Always Use Timeouts
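When a call has to be made synchronously, it can at least be bounded in time. A minimal sketch using only the Scala standard library follows; the helper name `callWithTimeout` and its fallback parameter are assumptions made for this example. Note that the timeout only stops the caller from waiting: the underlying computation keeps running in the background.

```scala
import scala.concurrent.{Await, Future, TimeoutException}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

object BoundedCall {
  // Run `call` on the global execution context and wait at most `timeout`.
  // On timeout return `fallback`; any other failure is rethrown.
  def callWithTimeout[A](timeout: FiniteDuration, fallback: A)(call: => A): A =
    Try(Await.result(Future(call), timeout)) match {
      case Success(value)               => value
      case Failure(_: TimeoutException) => fallback
      case Failure(other)               => throw other
    }
}
```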
  159. 159. Circuit Breaker
  160. 160. Circuit Breaker in Akka val breaker = new CircuitBreaker( context.system.scheduler, maxFailures = 5, callTimeout = 10.seconds, resetTimeout = 1.minute ).onOpen(…) .onHalfOpen(…) val result = breaker.withCircuitBreaker(Future(dangerousCall()))
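The behaviour behind those parameters can be sketched as a small state machine. This is a simplified, single-threaded illustration and not Akka's implementation: the `Breaker` class, its injected `now` clock and its `Either` result are assumptions made for this example, and there is no call timeout here.

```scala
object SimpleBreaker {
  sealed trait State
  case object Closed extends State
  case object Open extends State
  case object HalfOpen extends State

  // Not thread-safe. `now` is an injected clock in milliseconds.
  final class Breaker(maxFailures: Int, resetTimeoutMillis: Long, now: () => Long) {
    private var failures = 0
    private var openedAt = 0L
    private var current: State = Closed

    // An open breaker becomes half-open once the reset timeout has elapsed.
    def state: State = {
      if (current == Open && now() - openedAt >= resetTimeoutMillis) current = HalfOpen
      current
    }

    def call[A](body: => A): Either[String, A] = state match {
      case Open => Left("circuit open: failing fast") // body is not run at all
      case _ =>
        try {
          val result = body
          failures = 0
          current = Closed // a success closes the breaker again
          Right(result)
        } catch {
          case e: Exception =>
            failures += 1
            // A failed trial call, or too many failures, opens the breaker.
            if (current == HalfOpen || failures >= maxFailures) {
              current = Open
              openedAt = now()
            }
            Left(String.valueOf(e.getMessage))
        }
    }
  }
}
```

While the breaker is open, callers get an immediate answer instead of queuing up behind a dependency that is already struggling.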
  161. 161. Little’s Law W: Response Time L: Queue Length
  162. 162. Little’s Law Queue Length = Arrival Rate * Response Time W: Response Time L: Queue Length
  163. 163. Little’s Law Response Time = Queue Length / Arrival Rate W: Response Time L: Queue Length
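A worked instance of the two forms of Little's Law above, with made-up numbers. The object and method names are assumptions for this example.

```scala
object LittlesLaw {
  // Queue Length = Arrival Rate * Response Time
  def queueLength(arrivalRatePerSecond: Double, responseTimeSeconds: Double): Double =
    arrivalRatePerSecond * responseTimeSeconds

  // Response Time = Queue Length / Arrival Rate
  def responseTime(queueLength: Double, arrivalRatePerSecond: Double): Double =
    queueLength / arrivalRatePerSecond
}
```

At 100 requests per second and a response time of 0.5 seconds, `LittlesLaw.queueLength(100, 0.5)` gives 50 requests in the system. Read the other way, if the queue is allowed to grow to 200 at the same arrival rate, `LittlesLaw.responseTime(200, 100)` gives 2 seconds, which is why an unbounded queue turns overload into latency.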
  164. 164. Flow Control
  165. 165. Flow Control Always Apply Back Pressure
  166. 166. Akka Streams
  168. 168. Akka Streams in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out bcast ~> f4 ~> merge
  169. 169. Akka Streams in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out bcast ~> f4 ~> merge val g = FlowGraph.closed() { implicit builder: FlowGraph.Builder => import FlowGraph.Implicits._ val in = Source(1 to 10) val out = Sink.ignore val bcast = builder.add(Broadcast[Int](2)) val merge = builder.add(Merge[Int](2)) val f1, f2, f3, f4 = Flow[Int].map(_ + 10) }
  170. 170. Akka Streams in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out bcast ~> f4 ~> merge val g = FlowGraph.closed() { implicit builder: FlowGraph.Builder => import FlowGraph.Implicits._ val in = Source(1 to 10) val out = Sink.ignore val bcast = builder.add(Broadcast[Int](2)) val merge = builder.add(Merge[Int](2)) val f1, f2, f3, f4 = Flow[Int].map(_ + 10) } Set up the context
  171. 171. Akka Streams in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out bcast ~> f4 ~> merge val g = FlowGraph.closed() { implicit builder: FlowGraph.Builder => import FlowGraph.Implicits._ val in = Source(1 to 10) val out = Sink.ignore val bcast = builder.add(Broadcast[Int](2)) val merge = builder.add(Merge[Int](2)) val f1, f2, f3, f4 = Flow[Int].map(_ + 10) } Set up the context Create the Source and Sink
  172. 172. Akka Streams in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out bcast ~> f4 ~> merge val g = FlowGraph.closed() { implicit builder: FlowGraph.Builder => import FlowGraph.Implicits._ val in = Source(1 to 10) val out = Sink.ignore val bcast = builder.add(Broadcast[Int](2)) val merge = builder.add(Merge[Int](2)) val f1, f2, f3, f4 = Flow[Int].map(_ + 10) } Set up the context Create the Source and Sink Create the fan out and fan in stages
  173. 173. Akka Streams in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out bcast ~> f4 ~> merge val g = FlowGraph.closed() { implicit builder: FlowGraph.Builder => import FlowGraph.Implicits._ val in = Source(1 to 10) val out = Sink.ignore val bcast = builder.add(Broadcast[Int](2)) val merge = builder.add(Merge[Int](2)) val f1, f2, f3, f4 = Flow[Int].map(_ + 10) } Set up the context Create the Source and Sink Create the fan out and fan in stages Create a set of transformations
  174. 174. Akka Streams in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out bcast ~> f4 ~> merge val g = FlowGraph.closed() { implicit builder: FlowGraph.Builder => import FlowGraph.Implicits._ val in = Source(1 to 10) val out = Sink.ignore val bcast = builder.add(Broadcast[Int](2)) val merge = builder.add(Merge[Int](2)) val f1, f2, f3, f4 = Flow[Int].map(_ + 10) } Set up the context Create the Source and Sink Create the fan out and fan in stages Create a set of transformations Define the graph processing blueprint
  175. 175. Testing
  176. 176. What can we learn from Arnold?
  178. 178. What can we learn from Arnold? Blow things up
  179. 179. Shoot Your App Down
  180. 180. Pull the Plug …and see what happens
  181. 181. Executive Summary
  182. 182. “Complex systems run in degraded mode.” “Complex systems run as broken systems.” - Richard Cook How Complex Systems Fail - Richard Cook
  183. 183. Resilience is by Design Photo courtesy of FEMA/Joselyne Augustino
  184. 184. References
Antifragile: Things That Gain from Disorder - http://www.amazon.com/Antifragile-Things-that-Gain-Disorder-ebook/dp/B009K6DKTS
Drift into Failure - http://www.amazon.com/Drift-into-Failure-Components-Understanding-ebook/dp/B009KOKXKY
How Complex Systems Fail - http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf
Leverage Points: Places to Intervene in a System - http://www.donellameadows.org/archives/leverage-points-places-to-intervene-in-a-system/
Going Solid: A Model of System Dynamics and Consequences for Patient Safety - http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1743994/
Resilience in Complex Adaptive Systems: Operating at the Edge of Failure - https://www.youtube.com/watch?v=PGLYEDpNu60
Dealing in Security - http://resiliencemaps.org/files/Dealing_in_Security.July2010.en.pdf
Principles for Building Resilience: Sustaining Ecosystem Services in Social-Ecological Systems - http://www.amazon.com/Principles-Building-Resilience-Sustaining-Social-Ecological/dp/110708265X
Puppies! Now that I’ve got your attention, Complexity Theory - https://www.ted.com/talks/nicolas_perony_puppies_now_that_i_ve_got_your_attention_complexity_theory
How Bacteria Becomes Resistant - http://www.abc.net.au/science/slab/antibiotics/resistance.htm
Towards Resilient Architectures: Biology Lessons - http://www.metropolismag.com/Point-of-View/March-2013/Toward-Resilient-Architectures-1-Biology-Lessons/
Crash-Only Software - https://www.usenix.org/legacy/events/hotos03/tech/full_papers/candea/candea.pdf
Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel - http://roc.cs.berkeley.edu/papers/recursive_restartability.pdf
Out of the Tar Pit - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.93.8928
Bulkhead Pattern - http://skife.org/architecture/fault-tolerance/2009/12/31/bulkheads.html
Making Reliable Distributed Systems in the Presence of Software Errors - http://www.erlang.org/download/armstrong_thesis_2003.pdf
On Erlang, State and Crashes - http://jlouisramblings.blogspot.be/2010/11/on-erlang-state-and-crashes.html
Akka Supervision - http://doc.akka.io/docs/akka/snapshot/general/supervision.html
Release It!: Design and Deploy Production-Ready Software - https://pragprog.com/book/mnee/release-it
Hystrix - https://github.com/Netflix/Hystrix
Akka Circuit Breaker - http://doc.akka.io/docs/akka/snapshot/common/circuitbreaker.html
Reactive Streams - http://reactive-streams.org
Akka Streams - http://doc.akka.io/docs/akka-stream-and-http-experimental/1.0/scala/stream-introduction.html
RxJava - https://github.com/ReactiveX/RxJava
Feedback Control for Computer Systems - http://www.amazon.com/Feedback-Control-Computer-Systems-Philipp/dp/1449361692
Simian Army - https://github.com/Netflix/SimianArmy
Gatling - http://gatling.io
Akka MultiNode Testing - http://doc.akka.io/docs/akka/snapshot/dev/multi-node-testing.html
  186. 186. Thank You
