Air traffic controller - Streams Processing meetup
Air Traffic Controller
Using Samza to manage communications with members
By: Cameron Lee and Shubhanshu Nagar
What problem are we trying to solve?
In the past, LinkedIn provided a poor communications experience to some of its members:
Too much email, low-quality email, fired on multiple channels at once
How ATC solves it
Our goal was to build a system which could apply some common
functionality across many different communication types and use cases in
order to improve the member experience.
Handle thousands of communications per second
A good understanding of the state of members on the site, in near-real-time
How does ATC think about a delightful member experience?
Useful to the member
The member shouldn’t have seen it before
Kafka: publish-subscribe messaging system
Used to send input to ATC to trigger communications
Many actions and signals in the LinkedIn ecosystem are tracked as Kafka events. We can
consume these signals to better understand the state of the ecosystem.
Databus: change capture system for databases
Produces an event whenever an entry in a database changes
By default, whenever a Samza app is deployed, the task instances can be
moved to any host in the cluster, regardless of where the instances were
previously running.
If there was any state saved (e.g. RocksDB), then the new instances would
have to rebuild that state off of the changelog. This bootstrapping can take
some time depending on the amount of data to reload. Task instances
can’t process new input until bootstrapping is complete.
We have some use cases which can’t be delayed for the amount of time it
takes to bootstrap.
Host affinity (continued)
Host affinity is a Samza feature which allows us to deploy task instances
back to the same hosts from the previous deployment, so state does not
need to be reloaded.
In case of failures of individual instances, Samza can fall back to moving the
instance elsewhere and bootstrapping off of the changelog.
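For reference, host affinity is enabled per job in Samza’s configuration. A sketch of the relevant properties (the store and topic names below are placeholders, not ATC’s actual names):

```properties
# Ask the cluster manager to place task instances back on their previous hosts
job.host-affinity.enabled=true

# A changelog is still configured so state can be rebuilt if an instance
# must be moved to a new host after a failure
stores.my-store.changelog=kafka.my-store-changelog
```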
Samza does not currently support replicating persistent application state
(e.g. RocksDB) across multiple clusters which are running the same app.
We need ATC to run in multiple datacenters for redundancy.
We need to have state in each datacenter so that if we have to move
processing between datacenters, we can continue to properly handle
communications.
We rely on the input streams to replicate the main input so that we can do
processing and build up state in all datacenters.
The side effects (e.g. triggering the actual email send) will then only get emitted by
one of the datacenters. We can dynamically choose where side effects are
emitted.
When we deploy changes to ATC, we can deploy to a single datacenter at a
time in order to test new versions on only a fraction of traffic.
In some cases, we shift all side effects out of a datacenter to do an upgrade.
Since we still process all input, we can validate almost all of our
functionality and ensure performance doesn’t take an unexpected hit.
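The multi-datacenter setup above can be sketched as a simple gate: every datacenter processes input and builds state, but only the currently active datacenter emits side effects. This is a minimal illustration; the class and method names are invented, not ATC’s actual API:

```java
// Gate side effects so only one datacenter emits them, while all
// datacenters keep processing input and building state.
class SideEffectGate {
    private final String localDatacenter;
    private volatile String activeDatacenter; // can be switched dynamically

    SideEffectGate(String localDatacenter, String activeDatacenter) {
        this.localDatacenter = localDatacenter;
        this.activeDatacenter = activeDatacenter;
    }

    // Called when operators shift side effects to another datacenter,
    // e.g. to upgrade this one on a fraction of traffic.
    void setActiveDatacenter(String dc) {
        this.activeDatacenter = dc;
    }

    // State updates always run; the actual send happens only in the active DC.
    boolean shouldEmit() {
        return localDatacenter.equals(activeDatacenter);
    }
}
```

Because the inactive datacenter still runs the full pipeline, shifting side effects away from it leaves almost all functionality exercised and observable.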
In some cases, we need to migrate our system to use a new instance of a
store.
For example, when support was added to use RocksDB TTL, we needed to migrate some of
our existing stores.
Since we only needed the last X days of data, we could use the following
strategy for the migration:
Write to both the old and new store for X days, but continue to read from the old store.
After X days, read from the new store, but continue writing to both stores so we could fall
back if needed.
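The dual-write migration strategy above can be sketched as follows, with in-memory maps standing in for the old and new RocksDB stores (the class and method names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the dual-write store migration: write both stores for X days,
// then flip reads to the new store while keeping the old one as a fallback.
class MigratingStore {
    private final Map<String, String> oldStore = new HashMap<>();
    private final Map<String, String> newStore = new HashMap<>();
    private boolean readFromNew = false; // flipped after X days of dual writes

    void put(String key, String value) {
        // Both phases: always write both stores so either can serve reads.
        oldStore.put(key, value);
        newStore.put(key, value);
    }

    String get(String key) {
        // Phase 1: read the old store. Phase 2 (after X days): read the new
        // store, keeping the old store warm in case we need to fall back.
        return readFromNew ? newStore.get(key) : oldStore.get(key);
    }

    void switchReadsToNewStore() {
        readFromNew = true;
    }
}
```

After X days of dual writes the new store holds everything the pipeline still needs, so the flip is safe and reversible.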
Personalization through relevance
We work closely with a relevance team in order to make better decisions
about the communications we send out.
e.g. channel selection, delivery time, aggregation thresholds
Every day, scores for different decisions are computed offline (Hadoop) by the
relevance team. Those scores are pushed to ATC through Kafka, and then
ATC stores the scores in RocksDB.
Scores are generated per member, so we can personalize the decisions we
make for each member.
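The score flow above (offline scores arriving via Kafka, stored per member per decision, looked up at send time) can be sketched like this; a HashMap stands in for RocksDB, and all names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: store per-member relevance scores keyed by member and decision
// type, then look them up when making a communication decision.
class ScoreStore {
    private final Map<String, Double> scores = new HashMap<>();

    // Invoked for each score event arriving on Kafka after the offline
    // (Hadoop) relevance job runs.
    void onScoreEvent(long memberId, String decision, double score) {
        scores.put(memberId + ":" + decision, score);
    }

    // Looked up at decision time to personalize channel, delivery time, etc.
    double getScore(long memberId, String decision) {
        return scores.getOrDefault(memberId + ":" + decision, 0.0);
    }
}
```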
Some data is not available on a Kafka stream in a practical way
We make REST requests to fetch that data
Done at the beginning of the pipeline
Make remote calls and decorate event
Process decorated event
Making remote calls efficiently
ParSeq: a framework for writing asynchronous code in Java
ParSeq uses a thread pool for making remote calls
Rest of processing happens serially
Checkpointing handled by application
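ParSeq itself is LinkedIn’s library, but the pattern described — remote calls on a thread pool, with the rest of the processing serial — can be approximated with the JDK’s CompletableFuture. A minimal sketch; the fetch is faked and all names are invented:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Rough analogue of the ParSeq pattern: the remote call runs asynchronously
// on a dedicated thread pool, then processing continues serially.
class AsyncDecorator {
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r);
        t.setDaemon(true); // let the JVM exit without an explicit shutdown
        return t;
    });

    // Stand-in for a REST call made on the pool.
    static CompletableFuture<String> fetchRemote(String memberId) {
        return CompletableFuture.supplyAsync(() -> "data-for-" + memberId, POOL);
    }

    // Decorate the event with the remote data, then process it serially.
    static String decorateAndProcess(String memberId) {
        String remote = fetchRemote(memberId).join();
        return "processed(" + remote + ")";
    }
}
```

In a real pipeline many fetches would be in flight at once, which is where the thread pool (and ParSeq’s composition of tasks) pays off; checkpointing then has to be handled by the application since completion is out of order.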
Some messages require real-time latency
Tuned Kafka’s batching configuration to achieve sub-second pre-ATC latency
Can be tuned even more aggressively!
ATC/Samza processes most events in 2-3 ms
No remote calls for these messages