5. Kafka for real-time notifications
Some initial requirements
● Listen to any database event that could translate into a notification
● Process and aggregate those events into transactions
● Translate transactions into notifications
● Push those notifications to mobile devices
○ In real-time
○ In a sane order (for the User)
○ With an accurate badge count (number of unread notifications)
6. Kafka for real-time notifications
Architecture of the system
[Architecture diagram: MySQL → Maxwell → Events stream → Transactions stream → Notifications stream → Notifications Service]
7. Kafka for real-time notifications
Architecture of the system
[Architecture diagram: same pipeline, annotated “Partitioned by database === account” and “Partitioned by account”]
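The per-account partitioning above comes from the message key: with Kafka's default partitioner, records that share a key always go to the same partition, which is what preserves ordering per account. A minimal sketch of a producer keyed by account id (broker address, topic name and payload are made up for illustration):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class AccountKeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // hypothetical broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String accountId = "account-42";                // hypothetical account id
            String event = "{\"type\":\"ticket_created\"}"; // hypothetical event payload

            // Keying by account id sends every event for this account to the
            // same partition, so per-account ordering is preserved.
            producer.send(new ProducerRecord<>("transactions", accountId, event));
        }
    }
}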
8. Kafka for real-time notifications
Architecture of the system
[Architecture diagram: same pipeline, annotated “Partitioned by user”]
9. Kafka for real-time notifications
Architecture of the system
[Architecture diagram: same pipeline, plus an API Server; labels: Mark as “read”, Badge Update, Notification]
10. Kafka for real-time notifications
Conclusions
● Crazy fast!!!
○ Easily streams 7-10K database events/s
○ We’ve seen it handle up to 20-25K/s (> 1M/min)
● The ordering guarantees matched our needs really well
● Highly configurable
● Easily scalable (horizontally)
○ We can “easily” add nodes to or remove them from the Kafka cluster
○ We can easily add and remove consumer instances
13. Monitoring
How to monitor a service like this?
● We need to be able to answer a few questions
○ Are we getting messages from Kafka?
○ How fast are we reading those messages?
○ How much are we lagging?
● We need to capture metrics!
● We can use those metrics to create alerts!
14. Monitoring
Application metrics
● Some metrics can be easily captured in the application and then reported to a monitoring service
○ Examples: events produced/s, events consumed/s, etc.
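As an illustration, an events-consumed rate could be captured with a metrics library such as Dropwizard Metrics; the deck doesn't say which library or backend was actually used, so this is only a sketch:

import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;
import java.util.concurrent.TimeUnit;

public class ConsumerMetrics {
    private static final MetricRegistry registry = new MetricRegistry();
    private static final Meter eventsConsumed = registry.meter("events.consumed");

    public static void main(String[] args) {
        // Report per-second rates every 10 seconds; a real setup would push to
        // a monitoring service (StatsD, Datadog, ...) instead of the console.
        ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
                .convertRatesTo(TimeUnit.SECONDS)
                .build();
        reporter.start(10, TimeUnit.SECONDS);

        // Inside the consumer loop: mark the meter once per consumed event.
        eventsConsumed.mark();
    }
}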
15. Monitoring
Kafka metrics
● Other metrics are hard to get
● Consumer lag: probably the most important metric
○ lag = size of the partition (last offset) - consumer offset (last committed)
○ Brokers know about partition sizes
○ The consumer owns the consumed offsets
● Consumers acknowledge offsets by “committing” them back into a special Kafka topic
○ What do we do with this?
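To make the formula concrete, here is a sketch of computing lag with the Java consumer API; it only illustrates “last offset minus committed offset”, not how Burrow does it, and it assumes a client version that has endOffsets (topic and group names are made up):

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import java.util.*;

public class LagCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // hypothetical broker
        props.put("group.id", "notifications-service");     // hypothetical group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = new ArrayList<>();
            consumer.partitionsFor("events").forEach(
                    p -> partitions.add(new TopicPartition(p.topic(), p.partition())));

            // Last offsets known by the brokers (log end offsets).
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);

            for (TopicPartition tp : partitions) {
                OffsetAndMetadata committed = consumer.committed(tp);
                long committedOffset = (committed == null) ? 0L : committed.offset();
                long lag = endOffsets.get(tp) - committedOffset;
                System.out.printf("%s lag=%d%n", tp, lag);
            }
        }
    }
}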
16. Monitoring
Burrow
● Burrow to the rescue!!!
○ Monitoring application open sourced by LinkedIn
https://github.com/linkedin/Burrow
○ Can be deployed as a sidecar application for Kafka
○ Keeps track of committed offsets, as well as the last offsets known by the brokers
○ Exposes a REST API to gather all this information
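For example, a consumer group's lag can be pulled over HTTP; the exact path depends on the Burrow version, so this sketch assumes the v3-style layout and made-up host, cluster and group names:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BurrowLagCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical Burrow host, cluster and group; verify the endpoint
        // against the Burrow version you run.
        String url = "http://burrow:8000/v3/kafka/local/consumer/notifications-service/lag";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // The JSON body contains the group status and per-partition lag.
        System.out.println(response.body());
    }
}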
17. Monitoring
Datadog
● Open source plugin for Burrow
○ https://github.com/packetloop/datadog-agent-burrow
○ Uses Burrow’s API to fetch some metrics (including the consumer lag)
○ Publishes the metrics to Datadog
19. Some challenges
The problem of (not understanding) Kafka horizontal scaling
● The partition is considered the unit of parallelism in Kafka
○ So… we created our topics with 200 partitions!!! The more the merrier!
○ How many consumer instances?
■ The service ran on 2 hosts with 4 cores each, so maybe 4 instances per machine?
■ That gives a total of 8 consumer instances for 200 partitions (?!?)
○ Did we need 200 partitions? No!
https://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
● The problem: we can add partitions to a topic, but we can’t remove them
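The asymmetry shows up in the admin API: partitions can be increased but there is no call to shrink them. A sketch with the newer Java AdminClient (topic name and counts are illustrative):

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;
import java.util.Collections;
import java.util.Properties;

public class AddPartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // hypothetical broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Grows the topic to 60 partitions; there is no equivalent
            // operation to reduce the count again.
            admin.createPartitions(
                    Collections.singletonMap("events", NewPartitions.increaseTo(60)))
                 .all().get();
        }
    }
}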
21. Some challenges
Reducing the number of partitions
[Diagram: Producers → Topic V1 / Topic V2 → Consumers]
Process followed to reduce the partitions:
1. Create topic V2
2. Make the consumer read from both V1 and V2 (see the sketch after this list)
3. Make the producer write only to V2
4. Wait until V1 is drained
5. Re-create V1 with the new number of partitions
6. Make the producer write only to V1
7. When V2 is drained, make the consumer read only from V1
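Step 2 relies on a consumer group being able to subscribe to several topics at once. A sketch of the dual subscription during the migration (topic and group names are illustrative, and this uses the newer poll(Duration) signature):

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

public class MigrationConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // hypothetical broker
        props.put("group.id", "notifications-service");     // hypothetical group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // During the migration the same group reads from both topics;
            // once V1 is recreated, switch the subscription back to V1 only.
            consumer.subscribe(Arrays.asList("notifications-v1", "notifications-v2"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.printf("%s-%d %s%n",
                        r.topic(), r.partition(), r.value()));
            }
        }
    }
}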
22. Some challenges
Unbalanced partitions
● Advantages of partitioning the data per account:
○ Account data gets load balanced across the cluster
○ Ordering is guaranteed per account
● Disadvantages:
○ Partitions can become heavily unbalanced
■ Some accounts are bigger than others
■ We can’t guarantee that big accounts won’t end up in the same partition
■ Some accounts may be under heavy load
● The problem: this can slow down the assigned consumer a lot!!!
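The imbalance comes from how keyed records are placed: the default partitioner hashes the key and takes it modulo the partition count, so nothing prevents two big accounts from hashing to the same partition. A small sketch that mirrors that hashing (account ids and partition count are made up):

import org.apache.kafka.common.utils.Utils;
import java.nio.charset.StandardCharsets;

public class PartitionForKey {
    public static void main(String[] args) {
        int numPartitions = 30;                              // hypothetical partition count
        String[] accounts = {"account-17", "account-4242"};  // hypothetical big accounts

        for (String account : accounts) {
            byte[] keyBytes = account.getBytes(StandardCharsets.UTF_8);
            // Same key hashing as Kafka's default partitioner for keyed records:
            // murmur2 of the key, modulo the number of partitions.
            int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
            System.out.printf("%s -> partition %d%n", account, partition);
        }
    }
}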
23. Some challenges
Increasing the number of consumers
● On some rare occasions (load spikes), we needed to increase the throughput
● Recommended solution: add more consumer instances
● The problem: we only had two hosts, so adding more consumer instances had to be done at the application level
○ Our implementation with Akka Streams was not helping
○ To add a new consumer we had to replicate the entire stream (a significant increase in the number of threads)
○ This was causing us issues and actually reducing the throughput
● The solution: keep one or two consumer instances per host, within the app, but consider adding more nodes
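For context, running several consumer instances of the same group inside one process looks roughly like this; each instance needs its own KafkaConsumer (they are not thread-safe) and its own thread, which is why scaling inside the app gets expensive and adding hosts is the cleaner option (names and counts are illustrative):

import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConsumerPool {
    // Hypothetical helper: start N consumers of the same group in one process.
    static void start(int instances, Properties props) {
        ExecutorService pool = Executors.newFixedThreadPool(instances);
        for (int i = 0; i < instances; i++) {
            pool.submit(() -> {
                try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                    consumer.subscribe(Collections.singletonList("transactions"));
                    while (!Thread.currentThread().isInterrupted()) {
                        consumer.poll(Duration.ofMillis(500))
                                .forEach(r -> { /* process the record */ });
                    }
                }
            });
        }
    }
}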
24. Some challenges
Upgrading Kafka clients
● A few surprises upgrading to 0.9
○ Processing big chunks of data can lead to delays in the heartbeat mechanism, which may cause constant rebalances
○ Possible solutions:
■ Reduce the maximum amount of data consumed per fetch
■ Increase the consumer session timeout (prevent it from expiring before the data is processed)
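In configuration terms, the two workarounds map to consumer properties along these lines (property names are from the 0.9-era consumer; the values are only examples, not the ones actually used):

import java.util.Properties;

public class RebalanceTuning {
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // hypothetical broker

        // Workaround 1: fetch less data per partition per request, so each
        // poll() returns sooner and heartbeats are not delayed as much.
        props.put("max.partition.fetch.bytes", "262144");    // example value

        // Workaround 2: give the consumer more time before the coordinator
        // considers it dead and triggers a rebalance.
        props.put("session.timeout.ms", "60000");            // example value

        return props;
    }
}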