MANCHESTER LONDON NEW YORK
Petr Zapletal @petr_zapletal
#scaladays
@cakesolutions
Top Mistakes When Writing Reactive Applications
Agenda
● Motivation
● Actors vs Futures
● Serialization
● Flat Actor Hierarchies
● Graceful Shutdown
● Distributed Transactions
● Longtail Latencies
● Quick Tips
Actors vs Futures
Constraints Liberate, Liberties Constrain
Pick the Right Tool for The Job
[Diagram: tools arranged along two axes, Constraints → Power and Local Abstractions → Distribution: Scala Future[T], Akka Stream, Akka Typed, Akka Actors]
Actor Use Cases
● State management
● Location transparency
● Resilience mechanisms
● Single writer
● In-memory lock-free cache
● Sharding
Future Use Cases
● Local Concurrency
● Simplicity
● Composition
● Type safety
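A minimal sketch of the composition point, assuming two hypothetical async calls fetchUser and fetchOrders: Futures give typed local concurrency that composes with a plain for-comprehension.

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical async calls, simulated here with immediately completed Futures
def fetchUser(id: Long): Future[String] = Future(s"user-$id")
def fetchOrders(user: String): Future[Int] = Future(user.length)

// Composition: the for-comprehension sequences the calls and stays typed
val summary: Future[String] =
  for {
    user   <- fetchUser(42L)
    orders <- fetchOrders(user)
  } yield s"$user has $orders orders"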
Avoid Java Serialization
Java serialization is the default in Akka because it is easy to start with, but it is very slow and has a heavy footprint
Sending Data Through Network
[Diagram: two actors exchanging messages across the network, with serialization and deserialization on each side]
Persisting Data
[Diagram: an actor persisting its data through serialization]
Java Serialization - Round Trip
[Benchmark chart: round-trip serialization times across JVM serializers]
Java Serialization - Footprint
[Benchmark chart: serialized payload sizes across JVM serializers]
Java Serialization - Footprint
case class Order (id: Long, description: String, totalCost: BigDecimal, orderLines: ArrayList[OrderLine], customer: Customer)
Java Serialization:
----sr--model.Order----h#-----J--idL--customert--Lmodel/Customer;L--descriptiont--Ljava/lang/String;L--orderLinest--Ljava/util
/List;L--totalCostt--Ljava/math/BigDecimal;xp--------ppsr--java.util.ArrayListx-----a----I--sizexp----w-----sr--model.OrderLine--
&-1-S----I--lineNumberL--costq-~--L--descriptionq-~--L--ordert--Lmodel/Order;xp----sr--java.math.BigDecimalT--W--(O---I--s
caleL--intValt--Ljava/math/BigInteger;xr--java.lang.Number-----------xp----sr--java.math.BigInteger-----;-----I--bitCountI--bitLe
ngthI--firstNonzeroByteNumI--lowestSetBitI--signum[--magnitudet--[Bxq-~----------------------ur--[B------T----xp----xxpq-~--x
q-~--
XML:
<order id="0" totalCost="0"><orderLines lineNumber="1" cost="0"><order>0</order></orderLines></order>
JSON:
{"order":{"id":0,"totalCost":0,"orderLines":[{"lineNumber":1,"cost":0,"order":0}]}}
Java Serialization Implementation
● Serializes
○ Data
○ Entire class definition
○ Definitions of all referenced classes
● It just “works”
○ Serializes almost everything (anything that implements Serializable)
○ Works with different JVMs
● Performance was not the main requirement
Points of Interest
● Performance
● Footprint
● Schema evolution
● Implementation effort
● Human readability
● Language bindings
● Backwards & forwards compatibility
● ...
JSON
● Advantages:
○ Human readability
○ Simple & well known
○ Many good libraries for all platforms
● Disadvantages:
○ Slow
○ Large
○ Object names included
○ No schema (except e.g. JSON Schema)
○ Format and precision issues
● json4s, circe, µPickle, spray-json, argonaut, rapture-json, play-json, …
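A minimal sketch with json4s (one of the libraries above), assuming a simplified Order model without the circular order back-reference from the earlier slide:

import org.json4s.{Formats, NoTypeHints}
import org.json4s.native.Serialization

final case class OrderLine(lineNumber: Int, cost: BigDecimal)
final case class Order(id: Long, totalCost: BigDecimal, orderLines: List[OrderLine])

// Reflection-based (de)serialization of case classes, no hand-written codecs
implicit val formats: Formats = Serialization.formats(NoTypeHints)

val json = Serialization.write(Order(0L, 0, List(OrderLine(1, 0))))
// e.g. {"id":0,"totalCost":0,"orderLines":[{"lineNumber":1,"cost":0}]}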
Binary formats [Schema-less]
● Metadata is sent together with the data
● Advantages:
○ Implementation effort
○ Performance
○ Footprint *
● Disadvantages:
○ No human readability
● Kryo, Binary JSON (MessagePack, BSON, ... )
Binary formats [Schema]
● Schema defined by some kind of DSL
● Advantages:
○ Performance
○ Footprint
○ Schema evolution
● Disadvantages:
○ Implementation effort
○ No human readability
● Protobuf (+ projects like Flatbuffers, Cap’n Proto, etc.), Thrift, Avro
Summary
● The default (Java serialization) should always be changed
● The right choice depends on the particular use case
● Quick tips:
○ json4s
○ kryo
○ protobuf
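A minimal sketch of actually swapping the default, assuming the akka-kryo-serialization library from the references; the binding below routes Serializable messages away from Java serialization:

import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

val config = ConfigFactory.parseString(
  """
  akka.actor {
    serializers {
      kryo = "com.romix.akka.serialization.kryo.KryoSerializer"
    }
    serialization-bindings {
      # route anything Serializable to Kryo instead of Java serialization
      "java.io.Serializable" = kryo
    }
  }
  """).withFallback(ConfigFactory.load())

val system = ActorSystem("app", config)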
Flat Actor Hierarchies
Errors should be handled out of band in a
parallel process - they are not part of the
main app
Top Level Actors
The Actor Hierarchy
[Diagram: the root guardian / at the top, with /system and /user beneath it; top-level actors /a1 and /a2 live under /user, with child actors /b1, /b2 and /c1, /c2, /c3, /c4 below them]
Two Different Battles to Win
● Separate business logic and failure handling
○ Less complexity
○ Better supportability
● Getting our application back to life after something bad happened
○ Failure isolation
○ Recovery
○ No more midnight calls :)
Errors & Failures
Errors
● Common events
● The current request is affected
● Will be communicated to the client/caller
● Incorrect requests, errors during validations, ...
Failures
● Unexpected events
● Service/actor is not able to operate normally
● Reported to the supervisor
● Client can’t do anything, might be notified
● Database failures, network partitions, hardware
malfunctions, ...
Error Kernel Pattern
● Actor’s state is lost during restart and may not be recovered
● Delegate dangerous tasks to child actors and supervise them
[Diagram: /user/a1 delegates risky work to a child /user/a1/w1; when w1 fails it is restarted by a1, whose own state survives]
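A minimal sketch of the pattern, with hypothetical names (A1, Worker, RiskyWork): the parent keeps the valuable state and supervises a disposable child that does the dangerous work.

import akka.actor.{Actor, ActorRef, OneForOneStrategy, Props}
import akka.actor.SupervisorStrategy.Restart

case class RiskyWork(payload: String)

// Disposable child: may crash, holds no state worth protecting
class Worker extends Actor {
  def receive = {
    case RiskyWork(payload) => sender() ! payload.toUpperCase // may throw
  }
}

// Error kernel: keeps the state, delegates the danger
class A1 extends Actor {
  override val supervisorStrategy = OneForOneStrategy() {
    case _: Exception => Restart // w1 restarts; a1's state survives
  }
  private val worker: ActorRef = context.actorOf(Props[Worker], "w1")
  private var processed = 0 // state that must not be lost

  def receive = {
    case work: RiskyWork =>
      processed += 1
      worker forward work
  }
}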
Backoff Supervisor
● Restarts the actor each time with a growing delay between restarts

import akka.pattern.{Backoff, BackoffSupervisor}
import scala.concurrent.duration._

BackoffSupervisor.props(
  Backoff.onFailure(
    childProps,
    childName = "foo",
    minBackoff = 3.seconds,
    maxBackoff = 30.seconds,
    randomFactor = 0.2
  ))
Summary
● Create rich actor hierarchies
● Separate business logic and failure handling
● Backoff Supervisor
Graceful Shutdown
We have thousands of sharded actors on
multiple nodes and we want to shut one of
them down
High-level Procedure
1. JVM gets the shutdown signal
2. Coordinator tells all local ShardRegions to shut down gracefully
3. Node leaves cluster
4. Coordinator gives singletons a grace period to migrate
5. Actor System & JVM Termination
Integration with Sharded Actors
● Handling of the added shutdown messages
○ Passivate() message for graceful stop
○ context.stop() for immediate stop
● Priority mailbox
○ Priority message handling
○ Message retrying support
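A minimal sketch of the graceful-stop side, assuming a sharded entity that passivates itself when idle (the 2-minute threshold is an assumption); the parent shard buffers incoming messages while the entity stops:

import akka.actor.{Actor, PoisonPill, ReceiveTimeout}
import akka.cluster.sharding.ShardRegion.Passivate
import scala.concurrent.duration._

class Entity extends Actor {
  context.setReceiveTimeout(2.minutes) // idle threshold

  def receive = {
    case ReceiveTimeout =>
      // Graceful stop: ask the parent shard to stop us; it buffers new
      // messages until we are gone (context.stop(self) would lose them)
      context.parent ! Passivate(stopMessage = PoisonPill)
    case _ => // handle business messages
  }
}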
CoordinatedShutdown Extension
● Stops actors/services in a specific order
● Lets you register tasks to be executed during shutdown
● More generic approach
● Added in Akka 2.5 (~ a week ago)
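A minimal sketch of the extension, assuming Akka 2.5: tasks are registered into predefined phases, which the extension runs in order when shutdown starts.

import akka.Done
import akka.actor.{ActorSystem, CoordinatedShutdown}
import scala.concurrent.Future

val system = ActorSystem("app")

// Runs in the before-service-unbind phase of the coordinated shutdown
CoordinatedShutdown(system).addTask(
  CoordinatedShutdown.PhaseBeforeServiceUnbind, "log-shutdown") { () =>
  system.log.info("Shutting down...")
  Future.successful(Done)
}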
Summary
● We don’t want to lose data (usually)
● Shutdown coordinator on every node & integration with sharded actors
● Akka’s CoordinatedShutdown
Distributed Transactions
Any situation where a single event results in
the mutation of two separate sources of data
which cannot be committed atomically
What’s Wrong With Them
● Only the happy paths are simple
● Fallacies of Distributed Computing
○ The network is reliable.
○ Latency is zero.
○ Bandwidth is infinite.
○ The network is secure.
○ Topology doesn't change.
○ There is one administrator.
○ Transport cost is zero.
○ The network is homogeneous.
Two-phase commit (2PC)
[Diagram: a Transaction Manager coordinating two Resource Managers. Stage 1 - Prepare: the Transaction Manager sends Prepare to each Resource Manager and collects Prepared replies. Stage 2 - Commit: it sends Commit and collects Committed replies]
Saga Pattern
[Diagram: a sequence of transactions T1, T2, T3, T4, each paired with a compensating transaction C1, C2, C3, C4]
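A minimal sketch of the pattern (not a production saga runner), assuming Scala 2.12: each transaction Ti is paired with a compensation Ci, and on failure the compensations of the already-completed steps run in reverse order.

import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

final case class SagaStep(name: String,
                          transaction: () => Future[Unit],   // Ti
                          compensation: () => Future[Unit])  // Ci

def runSaga(steps: List[SagaStep])(implicit ec: ExecutionContext): Future[Unit] = {
  def loop(remaining: List[SagaStep], completed: List[SagaStep]): Future[Unit] =
    remaining match {
      case Nil => Future.unit
      case step :: rest =>
        step.transaction().transformWith {
          case Success(_)   => loop(rest, step :: completed)
          case Failure(err) =>
            // completed is already newest-first, so this runs Ci in reverse order
            completed
              .foldLeft(Future.unit)((acc, s) => acc.flatMap(_ => s.compensation()))
              .flatMap(_ => Future.failed(err))
        }
    }
  loop(steps, Nil)
}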
The Big Trade-Off
● Distributed transactions can usually be avoided
○ They are hard, expensive, fragile and do not scale
● Every business event needs to result in a single synchronous commit
● Other data sources should be updated asynchronously
● Introducing eventual consistency
Longtail Latencies
Consider a system where each service
typically responds in 10ms but with a 99th
percentile latency of one second
Longtail Latencies
[Chart: latency (ms) vs. percentile (25, 50, 75, 90, 99, 99.9) for a normal and a longtail distribution; the longtail curve climbs steeply at the highest percentiles]
Longtails really matter
● Latency accumulation
● Not just noise
● Don’t have to be power users
● Real problem
Investigating Longtail Latencies
● Narrow the problem
● Isolate in a test environment
● Measure & monitor everything
● Tackle the problem
● Pretty hard job
Tolerating Longtail Latencies
● Hedging your bet (see the sketch below)
● Tied requests
● Selectively increase replication factors
● Put slow machines on probation
● Consider ‘good enough’ responses
● Hardware upgrades
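A minimal sketch of the first technique, hedging your bet: fire the request, and if it has not completed within a small delay, fire a backup and take whichever finishes first (call is a hypothetical request function).

import akka.actor.ActorSystem
import akka.pattern.after
import scala.concurrent.{ExecutionContext, Future}
import scala.concurrent.duration._

def hedged[T](call: () => Future[T], hedgeAfter: FiniteDuration)
             (implicit system: ActorSystem, ec: ExecutionContext): Future[T] = {
  val primary = call()                                      // first attempt
  val backup  = after(hedgeAfter, system.scheduler)(call()) // delayed second attempt
  Future.firstCompletedOf(Seq(primary, backup))             // first answer wins
}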
Quick Tips
● Monitoring
● Network partitions
○ Split Brain Resolver
● Blocking (see the sketch below)
● Too many actor systems
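For the blocking tip, a minimal sketch: isolate blocking calls on a dedicated dispatcher instead of starving the default one ("blocking-dispatcher" is a hypothetical dispatcher defined in application.conf).

import akka.actor.ActorSystem
import scala.concurrent.Future

val system = ActorSystem("app")

// Dedicated thread pool for blocking work, configured in application.conf
val blockingEc = system.dispatchers.lookup("blocking-dispatcher")

def readFromLegacyDb(): String = { Thread.sleep(1000); "row" } // blocking call

// Runs on the blocking pool; actors and streams keep their threads
val result: Future[String] = Future(readFromLegacyDb())(blockingEc)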
Questions
MANCHESTER LONDON NEW YORK
@petr_zapletal @cakesolutions
347 708 1518
petrz@cakesolutions.net
We are hiring
http://www.cakesolutions.net/careers
References
● http://www.reactivemanifesto.org/
● http://www.slideshare.net/ktoso/zen-of-akka
● http://eishay.github.io/jvm-serializers/prototype-results-page/
● http://java-persistence-performance.blogspot.com/2013/08/optimizing-java-serialization-java-vs.html
● https://github.com/romix/akka-kryo-serialization
● http://gotocon.com/dl/goto-chicago-2015/slides/CaitieMcCaffrey_ApplyingTheSagaPattern.pdf
● http://www.grahamlea.com/2016/08/distributed-transactions-microservices-icebergs/
● http://www.cs.duke.edu/courses/cps296.4/fall13/838-CloudPapers/dean_longtail.pdf
● https://engineering.linkedin.com/performance/who-moved-my-99th-percentile-latency
● http://doc.akka.io/docs/akka/rp-15v09p01/scala/split-brain-resolver.html
● http://manuel.bernhardt.io/2016/08/09/akka-anti-patterns-flat-actor-hierarchies-or-mixing-business-logic-and-failure-handling/
Backup Slides
MANCHESTER LONDON NEW YORK
Adding Shutdown Hook
// Coordinator FSM that drives the node's graceful shutdown sequence
val nodeShutdownCoordinatorActor = system.actorOf(Props(
  new NodeGracefulShutdownCoordinator(...)))

// The JVM shutdown hook only kicks off the coordinated sequence
sys.addShutdownHook {
  nodeShutdownCoordinatorActor ! StartNodeShutdown(shardRegions)
}
Tell Local Regions to Shutdown
when(AwaitNodeShutdownInitiation) {
  case Event(StartNodeShutdown(shardRegions), _) =>
    if (shardRegions.nonEmpty) {
      // starts watching every shard region and sends it the GracefulShutdown msg
      stopShardRegions(shardRegions)
      goto(AwaitShardRegionsShutdown) using ManagedRegions(shardRegions)
    } else {
      // registers OnMemberRemoved and leaves the cluster
      leaveCluster()
      goto(AwaitClusterExit)
    }
}
Node Leaves the Cluster
when(AwaitShardRegionsShutdown, stateTimeout = ...) {
  case Event(Terminated(actor), ManagedRegions(regions)) =>
    if (regions.contains(actor)) {
      val remainingRegions = regions - actor
      if (remainingRegions.isEmpty) {
        leaveCluster()
        goto(AwaitClusterExit)
      } else {
        goto(AwaitShardRegionsShutdown) using ManagedRegions(remainingRegions)
      }
    } else {
      stay()
    }
  case Event(StateTimeout, _) =>
    leaveCluster()
    goto(AwaitNodeTerminationSignal)
}
Wait for Singletons to Migrate
when(AwaitClusterExit, stateTimeout = ...) {
  case Event(NodeLeftCluster | StateTimeout, _) =>
    // Waiting on cluster singleton migration
    goto(AwaitClusterSingletonMigration)
}
when(AwaitClusterSingletonMigration, stateTimeout = ...) {
  case Event(StateTimeout, _) =>
    goto(AwaitNodeTerminationSignal)
}
onTransition {
  case AwaitClusterSingletonMigration -> AwaitNodeTerminationSignal =>
    self ! TerminateNode
}
Actor System & JVM Termination
when(AwaitNodeTerminationSignal, stateTimeout = ...) {
  case Event(TerminateNode | StateTimeout, _) =>
    // This is NOT an Akka thread-pool (since we're shutting those down)
    val ec = scala.concurrent.ExecutionContext.global
    // Calls context.system.terminate with the registered onComplete block
    terminateSystem {
      case Success(_) =>
        System.exit(...)
      case Failure(ex) =>
        System.exit(...)
    }(ec)
    stop(Shutdown)
}
