Reactive applications are becoming a de facto industry standard and, if employed correctly, toolkits like Lightbend Reactive Platform make their implementation easier than ever. Designing these systems can still be challenging, because it requires a shift in mindset to tackle problems we might not be used to.
In this talk, we’re going to discuss the most common issues I’ve seen in the field that prevented applications from working as expected: typical pitfalls that cause problems, trade-offs that are not fully understood and important choices that are often overlooked. These include persistent actor pitfalls, handling of network partitions, proper implementations of graceful shutdown and distributed transactions, the trade-offs of microservices and actors, and more.
This talk should be interesting for anyone who is thinking about, implementing, or has already deployed a reactive application. My goal is to explain common problems thoroughly enough that fellow developers won’t repeat them. The talk focuses somewhat on the Lightbend platform, but understanding the concepts we are going to talk about should be beneficial for everyone interested in this field.
8. Pick the Right Tool for The Job
[Diagram: Scala Future[T], Akka Typed, Akka Actors and Akka Streams positioned between two trade-offs: power vs. constraints, and local abstractions vs. distribution]
9. Actor Use Cases
● State management
● Location transparency
● Resilience mechanisms
● Single writer
● In-memory lock-free cache
● Sharding
10. Future Use Cases
● Local Concurrency
● Simplicity
● Composition
● Typesafety
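The composition and type-safety points can be illustrated with a minimal sketch. The names `fetchPriceCents`, `fetchDiscountPercent` and `finalPriceCents` are hypothetical and stand in for real non-blocking calls; the point is only that two independent futures combine into one typed result without any shared mutable state.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureComposition {
  // Hypothetical lookups; a real service would perform non-blocking I/O here.
  def fetchPriceCents(item: String): Future[Int] = Future(2000)
  def fetchDiscountPercent(user: String): Future[Int] = Future(25)

  // Both futures are started before the for-comprehension, so they run
  // concurrently; the comprehension only combines their results.
  def finalPriceCents(item: String, user: String): Future[Int] = {
    val price = fetchPriceCents(item)
    val discount = fetchDiscountPercent(user)
    for {
      p <- price
      d <- discount
    } yield p * (100 - d) / 100
  }

  def main(args: Array[String]): Unit = {
    // Blocking with Await is acceptable only at the very edge of a program.
    val result = Await.result(finalPriceCents("book", "alice"), 2.seconds)
    println(s"final price in cents: $result")
  }
}
```

If the two calls were placed directly inside the for-comprehension, the second would start only after the first completed, which is a common source of accidental sequential execution.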
11. Avoid Java Serialization
Java serialization is the default in Akka because it is easy to start with, but it is very slow and has a heavy footprint
17. Java Serialization Implementation
● Serializes
○ Data
○ Class descriptor (class name, serialVersionUID, field names and types)
○ Descriptors of all referenced classes
● It just “works”
○ Serializes almost anything that implements Serializable
○ Works across different JVMs
● Performance was not the main requirement
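The footprint cost is easy to observe with the standard library alone. The sketch below uses a hypothetical `Position` message and compares Java serialization with a hand-written encoding of the same two fields; the hand-written encoder is only a stand-in for a compact binary format, not a recommendation to write encoders by hand.

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream, ObjectOutputStream}

object SerializationFootprint {
  // Hypothetical message type; case classes are Serializable by default.
  final case class Position(x: Int, y: Int)

  // Java serialization writes a stream header and a class descriptor
  // in addition to the field values.
  def javaSerialized(p: Position): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(p)
    out.close()
    bytes.toByteArray
  }

  // Hand-written encoding: just the two Int fields, 4 bytes each.
  def handWritten(p: Position): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeInt(p.x)
    out.writeInt(p.y)
    out.close()
    bytes.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val p = Position(1, 2)
    println(s"Java serialization: ${javaSerialized(p).length} bytes")
    println(s"Hand-written:       ${handWritten(p).length} bytes")
  }
}
```

The exact size of the Java-serialized form depends on the class and package names, which is itself part of the problem: renaming a class changes the wire format.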
18. Points of Interest
● Performance
● Footprint
● Schema evolution
● Implementation effort
● Human readability
● Language bindings
● Backwards & forwards compatibility
● ...
19. JSON
● Advantages:
○ Human readability
○ Simple & well known
○ Many good libraries for all platforms
● Disadvantages:
○ Slow
○ Large
○ Object names included
○ No schema (except e.g. JSON Schema)
○ Format and precision issues
● json4s, circe, µPickle, spray-json, argonaut, rapture-json, play-json, …
20. Binary formats [Schema-less]
● Metadata sent together with data
● Advantages:
○ Implementation effort
○ Performance
○ Footprint *
● Disadvantages:
○ No human readability
● Kryo, Binary JSON (MessagePack, BSON, ... )
21. Binary formats [Schema]
● Schema defined by some kind of DSL
● Advantages:
○ Performance
○ Footprint
○ Schema evolution
● Disadvantages:
○ Implementation effort
○ No human readability
● Protobuf (+ projects like Flatbuffers, Cap’n Proto, etc.), Thrift, Avro
22. Summary
● The default (Java serialization) should always be changed
● Depends on particular use case
● Quick tips:
○ json4s
○ kryo
○ protobuf
23. Flat Actor Hierarchies
Errors should be handled out of band in a
parallel process - they are not part of the
main app
26. Top Level Actors
The Actor Hierarchy
[Diagram: the root actor / has two children, /user and /system; the application actors (a1, a2, b1, b2, c1 to c4) form a tree under /user]
28. Two Different Battles to Win
● Separate business logic and failure handling
○ Less complexity
○ Better supportability
● Getting our application back to life after something bad happened
○ Failure isolation
○ Recovery
○ No more midnight calls :)
29. Errors & Failures
Errors
● Common events
● The current request is affected
● Will be communicated to the client/caller
● Incorrect requests, errors during validations, ...
Failures
● Unexpected events
● Service/actor is not able to operate normally
● Reports to supervisor
● Client can’t do anything, might be notified
● Database failures, network partitions, hardware malfunctions, ...
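One way to keep the two categories apart in code is to model errors as ordinary return values and failures as exceptions. The sketch below is an assumption about how that split might look, not the speaker's implementation; `Order`, `ValidationError`, `validate` and `save` are hypothetical names.

```scala
import scala.util.Try

object ErrorsVsFailures {
  // Errors are expected outcomes, so they are plain values the caller can inspect.
  sealed trait ValidationError
  case object EmptyName extends ValidationError
  final case class NegativeAmount(amount: Int) extends ValidationError

  final case class Order(name: String, amount: Int)

  // An incorrect request is reported back to the client as a Left.
  def validate(name: String, amount: Int): Either[ValidationError, Order] =
    if (name.isEmpty) Left(EmptyName)
    else if (amount < 0) Left(NegativeAmount(amount))
    else Right(Order(name, amount))

  // A failure is unexpected: a hypothetical store that is down throws.
  // Inside an actor the exception would be left uncaught so that the
  // supervisor decides what to do; Try is used here only to show the result.
  def save(order: Order, storeAvailable: Boolean): Try[Unit] =
    Try {
      if (!storeAvailable) throw new IllegalStateException("store unavailable")
    }

  def main(args: Array[String]): Unit = {
    println(validate("", 10))
    println(validate("book", 10).map(order => save(order, storeAvailable = false)))
  }
}
```

The business logic only ever sees `ValidationError` values; nothing in `validate` needs to know how a broken database connection is recovered.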
30. Error Kernel Pattern
● Actor’s state is lost during restart and may not be recovered
● Delegate dangerous tasks to child actors and supervise them
[Diagram: actor /user/a1 creates the child worker /user/a1/w1 and supervises it]
31. Summary
● Create rich actor hierarchies
● Separate business logic and failure handling
● Properly written, your application will be self-healing and incredibly resilient
32. Graceful Shutdown
We have thousands of sharded actors on multiple nodes and we want to shut one of the nodes down
39. High-level Procedure
1. JVM gets the shutdown signal
2. Coordinator tells all local ShardRegions to shut down gracefully
3. Node leaves cluster
4. Coordinator gives singletons a grace period to migrate
5. Actor System & JVM Termination
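The five steps can be read as a small state machine. The sketch below expresses them as a pure transition function in plain Scala, borrowing the state names from the coordinator code shown in the deck; the signal names `ShutdownSignal`, `RegionsStopped` and `GracePeriodElapsed` and the terminal state `Terminated` are hypothetical, and timeouts are left out.

```scala
object ShutdownSteps {
  sealed trait State
  case object AwaitNodeShutdownInitiation extends State
  case object AwaitShardRegionsShutdown extends State
  case object AwaitClusterExit extends State
  case object AwaitClusterSingletonMigration extends State
  case object AwaitNodeTerminationSignal extends State
  case object Terminated extends State // hypothetical terminal state

  sealed trait Signal
  case object ShutdownSignal extends Signal     // step 1: JVM shutdown signal
  case object RegionsStopped extends Signal     // step 2: local shard regions stopped
  case object NodeLeftCluster extends Signal    // step 3: node left the cluster
  case object GracePeriodElapsed extends Signal // step 4: singletons had time to migrate
  case object TerminateNode extends Signal      // step 5: terminate actor system and JVM

  // Each step is only reachable from the previous one; any other signal is ignored.
  def next(state: State, signal: Signal): State = (state, signal) match {
    case (AwaitNodeShutdownInitiation, ShutdownSignal)        => AwaitShardRegionsShutdown
    case (AwaitShardRegionsShutdown, RegionsStopped)          => AwaitClusterExit
    case (AwaitClusterExit, NodeLeftCluster)                  => AwaitClusterSingletonMigration
    case (AwaitClusterSingletonMigration, GracePeriodElapsed) => AwaitNodeTerminationSignal
    case (AwaitNodeTerminationSignal, TerminateNode)          => Terminated
    case (other, _)                                           => other
  }

  def main(args: Array[String]): Unit = {
    val signals = List(ShutdownSignal, RegionsStopped, NodeLeftCluster, GracePeriodElapsed, TerminateNode)
    // scanLeft keeps every intermediate state, so the whole path is printed.
    signals.scanLeft(AwaitNodeShutdownInitiation: State)(next).foreach(println)
  }
}
```

Keeping the ordering in one function makes it easy to see that the node never terminates before the shard regions have been asked to stop.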
40. Adding Shutdown Hook
val nodeShutdownCoordinatorActor = system.actorOf(Props(
  new NodeGracefulShutdownCoordinator(...)))

sys.addShutdownHook {
  nodeShutdownCoordinatorActor ! StartNodeShutdown(shardRegions)
}
43. Tell Local Regions to Shutdown
when(AwaitNodeShutdownInitiation) {
  case Event(StartNodeShutdown(shardRegions), _) =>
    if (shardRegions.nonEmpty) {
      // starts watching every shard region and sends each of them a GracefulShutdown message
      stopShardRegions(shardRegions)
      goto(AwaitShardRegionsShutdown) using ManagedRegions(shardRegions)
    } else {
      // registers OnMemberRemoved and leaves the cluster
      leaveCluster()
      goto(AwaitClusterExit)
    }
}
47. Node Leaves the Cluster
when(AwaitShardRegionsShutdown, stateTimeout = ... ) {
  case Event(Terminated(actor), ManagedRegions(regions)) =>
    if (regions.contains(actor)) {
      val remainingRegions = regions - actor
      if (remainingRegions.isEmpty) {
        leaveCluster()
        goto(AwaitClusterExit)
      } else {
        goto(AwaitShardRegionsShutdown) using ManagedRegions(remainingRegions)
      }
    } else {
      stay()
    }
  case Event(StateTimeout, _) =>
    leaveCluster()
    goto(AwaitNodeTerminationSignal)
}
50. Wait for Singletons to Migrate
when(AwaitClusterExit, stateTimeout = ...) {
  case Event(NodeLeftCluster | StateTimeout, _) =>
    // Waiting on cluster singleton migration
    goto(AwaitClusterSingletonMigration)
}

when(AwaitClusterSingletonMigration, stateTimeout = ... ) {
  case Event(StateTimeout, _) =>
    goto(AwaitNodeTerminationSignal)
}

onTransition {
  case AwaitClusterSingletonMigration -> AwaitNodeTerminationSignal =>
    self ! TerminateNode
}
54. Actor System & JVM Termination
when(AwaitNodeTerminationSignal, stateTimeout = ...) {
  case Event(TerminateNode | StateTimeout, _) =>
    // This is NOT an Akka thread pool (since we're shutting those down)
    val ec = scala.concurrent.ExecutionContext.global
    // Calls context.system.terminate and registers the block below as its onComplete callback
    terminateSystem {
      case Success(_) =>
        System.exit(...)
      case Failure(ex) =>
        System.exit(...)
    }(ec)
    stop(Shutdown)
}
58. Integration with Sharded Actors
● Handling of added messages
○ Passivate() message for graceful stop
○ context.stop() for immediate stop
● Priority mailbox
○ Priority message handling
○ Message retrying support
59. Summary
● We don’t want to lose data (usually)
● Shutdown coordinator on every node
● Integration with sharded actors
60. Distributed Transactions
Any situation where a single event results in
the mutation of two separate sources of data
which cannot be committed atomically
61. What’s Wrong With Them
● Simple happy paths
● 8 Fallacies of Distributed Computing
○ The network is reliable.
○ Latency is zero.
○ Bandwidth is infinite.
○ The network is secure.
○ Topology doesn't change.
○ There is one administrator.
○ Transport cost is zero.
○ The network is homogeneous.
62. Two-phase commit (2PC)
Stage 1 - Prepare
● Transaction Manager sends Prepare to every Resource Manager
● Each Resource Manager replies Prepared
Stage 2 - Commit
● Transaction Manager sends Commit to every Resource Manager
● Each Resource Manager replies Committed
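The happy path of the protocol fits in a few lines. The sketch below is an in-memory illustration under strong assumptions: `ResourceManager`, its `canPrepare` flag and `run` are hypothetical names, there is no network, and nobody crashes between the two stages.

```scala
object TwoPhaseCommit {
  // Hypothetical resource manager: votes in stage 1, applies the decision in stage 2.
  final class ResourceManager(val name: String, canPrepare: Boolean) {
    var state: String = "idle"

    def prepare(): Boolean = {
      state = if (canPrepare) "prepared" else "aborted"
      canPrepare
    }
    def commit(): Unit = state = "committed"
    def rollback(): Unit = state = "aborted"
  }

  // Transaction manager: commits only if every participant voted yes.
  def run(participants: List[ResourceManager]): Boolean = {
    val votes = participants.map(_.prepare()) // stage 1: ask every participant
    val allPrepared = votes.forall(identity)
    if (allPrepared) participants.foreach(_.commit()) // stage 2: commit everywhere
    else participants.foreach(_.rollback())           // stage 2: roll back everywhere
    allPrepared
  }

  def main(args: Array[String]): Unit = {
    val rms = List(new ResourceManager("orders", true), new ResourceManager("payments", false))
    println(s"committed: ${run(rms)}")
    rms.foreach(rm => println(s"${rm.name}: ${rm.state}"))
  }
}
```

What the sketch leaves out is exactly what makes the protocol hard in practice: a participant that has answered Prepared must hold its locks until it hears the decision, and a transaction manager that fails between the stages leaves every participant blocked.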
64. The Big Trade-Off
● Distributed transactions can usually be avoided
○ They are hard, expensive, fragile and do not scale
● Every business event needs to result in a single synchronous commit
● Other data sources should be updated asynchronously
● Introducing eventual consistency
65. Longtail Latencies
Consider a system where each service
typically responds in 10ms but with a 99th
percentile latency of one second
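To see why this matters, assume a request has to wait for every service it touches and that the services' latencies are independent (both are simplifying assumptions). The chance that at least one call lands in the slowest 1% is then 1 - 0.99^n for n services, which the sketch below computes; `probabilityOfSlowRequest` is a hypothetical helper name.

```scala
object LongtailLatency {
  // Probability that at least one of `services` independent calls is slower
  // than the given percentile (0.99 means the 99th percentile).
  def probabilityOfSlowRequest(services: Int, percentile: Double = 0.99): Double =
    1.0 - math.pow(percentile, services.toDouble)

  def main(args: Array[String]): Unit =
    Seq(1, 10, 100).foreach { n =>
      val percent = probabilityOfSlowRequest(n) * 100
      println(f"$n%3d services: $percent%.1f%% of requests take at least one second")
    }
}
```

With a single service only 1 request in 100 is slow, but with 100 services in the path more than half of all requests hit at least one slow call.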
67. Longtails really matter
● Latency accumulation
● Not just noise
● Don’t have to be power users
● Real problem
68. Investigating Longtail Latencies
● Narrow the problem
● Isolate in a test environment
● Measure & monitor everything
● Tackle the problem
● Pretty hard job