Building Zero Data Loss Pipelines with Apache Kafka

www.clairvoyantsoft.com
By: Avinash Ramineni

| 3
About
Background Awards & Recognition
Boutique consulting firm centered on building data solutions and
products
All things Web and Data Engineering, Analytics, ML and User
Experience to bring it all together
Support core Hadoop platform, data engineering pipelines and provide
administrative and devops expertise focused on Hadoop

| 4
● Introduction and terminologies
● How can we lose data?
● Zero Data Loss Pipelines
○ Producer
○ Kafka Cluster
○ Consumer
● Monitoring for Data Loss
● Summary
Agenda

| 5
● An open source distributed stream processing platform
○ High Throughput
○ Scalable
○ Low Latency
○ Real-time
○ Fault Tolerant
● Messages are organized as topics
● Producers push messages
● Consumers pull messages
● Kafka does not track per-message acknowledgments from consumers; it assumes each consumer keeps track of what it has consumed so far.
Kafka - Basics

| 6
● Each topic has multiple partitions
● Unit of parallelism
● Message IDs (offsets) are unique only within a partition of a topic
● Makes sure partitions within the same topic are roughly the same size
● Too many partitions?
Kafka - Partitions

| 7
Consuming from a Kafka Topic
Kafka Reads - https://fizalihsan.github.io/technology/kafka-partition-consumer.png

| 8
● Replication factor
○ Unit of replication is partition
○ Producer ack
● A partition is owned by a single broker in the cluster, and that broker is called the leader for the partition.
● Producers and Consumers are only served by Leaders
● A Zookeeper cluster is called an “ensemble”.
● Kafka spreads leader partitions evenly across the brokers in the cluster
Kafka - Cluster

| 10
● Any Kafka client (a producer or consumer) communicates only with the leader partition for data
○ All other partitions exist for redundancy and failover.
○ Follower partitions are responsible for copying new records from their leader partitions.
○ Follower partitions that are fully caught up with the leader hold an exact copy of its contents. Such partitions are called in-sync replicas (ISR).
Kafka - ISR (In Sync Replica)

| 11
● Failures Happen
● Systems need to be designed to tolerate failure
● Expect failures and design systems to handle them
Distributed Systems

| 12
● Producer
○ acks = 0, 1, all
○ Batch size, linger time
○ Broker is down
○ block.on.buffer.full = false
● Kafka Cluster
○ Disk writes/flush are asynchronous
■ Hardware crashes
○ Metadata corruption
○ Kafka Bugs
● Consumer
○ Offset management
■ Offsets are committed before processing the messages completely
How can we lose data?

| 13
● At-least once delivery semantics
● Out-of-order delivery semantics
● Kafka Cluster
● Each component in the pipeline makes sure that it is not losing messages
○ Producer
■ Takes the responsibility of making sure that messages are successfully delivered to Kafka Brokers
○ Kafka Cluster
■ Reliable and Resilient
○ Consumers
■ Takes the responsibility for consuming all messages from the Kafka cluster, with the understanding that:
● Messages can be delivered more than once, so the consumer needs idempotency logic
● It operates within the retention policy of the brokers, the number of partitions for the topics, etc.
● It manages checkpointing and committing offsets
Zero Data loss pipelines
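
The producer's share of that responsibility can be sketched as "retry, then spool locally, then replay". This is only an illustration under stated assumptions: `send` is a hypothetical callable standing in for a real Kafka client call that raises `ConnectionError` when the broker cannot be reached, and `Envelope` and `SpoolingProducer` are names invented for this sketch.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Callable


@dataclass
class Envelope:
    """Hypothetical standard message schema (field names are assumptions)."""
    sequence_id: int   # increasing per originating service
    timestamp: float   # producer-side time, epoch seconds
    service: str       # originating service
    payload: dict


class SpoolingProducer:
    """Wraps a hypothetical send function and spools to disk when delivery fails."""

    def __init__(self, send: Callable[[bytes], None], spool_path: Path, retries: int = 3):
        self._send = send
        self._spool_path = spool_path
        self._retries = retries

    def publish(self, msg: Envelope) -> bool:
        """Return True if delivered, False if the message went to the local spool."""
        data = json.dumps(asdict(msg)).encode("utf-8")
        for _ in range(self._retries):
            try:
                self._send(data)
                return True
            except ConnectionError:
                continue  # a real producer would back off between attempts
        # JSON escapes newlines, so one message per line is unambiguous.
        with self._spool_path.open("ab") as spool:
            spool.write(data + b"\n")
        return False

    def replay_spool(self) -> int:
        """Resend spooled messages in order; keep whatever still fails."""
        if not self._spool_path.exists():
            return 0
        pending = self._spool_path.read_bytes().splitlines()
        sent = 0
        for line in pending:
            try:
                self._send(line)
            except ConnectionError:
                break
            sent += 1
        remaining = pending[sent:]
        self._spool_path.write_bytes(b"".join(line + b"\n" for line in remaining))
        return sent
```

Replaying can deliver a message that the broker had in fact already received, which is one reason the consumer side has to tolerate duplicates.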

| 14
● Handle cases where the broker is unavailable or down, or there is a network issue
○ Local spooling
○ Low-latency store
○ Backup cluster
● acks = all?
○ Consistency vs. availability
■ min ISR <= R
● (e.g., replication factor = 3, min ISR = 2)
● Latency vs. throughput
○ Batch size and linger time
● Standardize on a message schema
○ Sequence ID, timestamp, originating service
Producers
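
The consistency vs. availability trade-off behind acks and min ISR can be modeled in a few lines. This is a simplified sketch, not broker code: `write_accepted` and `check_settings` are hypothetical helpers, and the values in `PRODUCER_SETTINGS` are illustrative only (the keys are standard producer property names).

```python
# Illustrative producer settings; the values are assumptions, not recommendations.
PRODUCER_SETTINGS = {
    "acks": "all",        # wait for all in-sync replicas
    "retries": 5,         # retry transient send failures
    "linger.ms": 20,      # wait briefly to fill a batch (latency vs. throughput)
    "batch.size": 65536,  # bytes per partition batch
}


def check_settings(replication_factor: int, min_insync_replicas: int) -> None:
    """Enforce min ISR <= R; raise ValueError otherwise."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min.insync.replicas must be between 1 and the replication factor")


def write_accepted(acks: str, in_sync_replicas: int, min_insync_replicas: int) -> bool:
    """Model whether a produce request is accepted.

    in_sync_replicas counts the leader itself, so at least 1 is needed in every case.
    """
    if in_sync_replicas < 1:
        return False  # no leader available
    if acks == "all":
        # Consistency first: reject writes when too few replicas are in sync.
        return in_sync_replicas >= min_insync_replicas
    # acks=0 or acks=1: availability first, only the leader is needed.
    return True
```

With replication factor 3 and min ISR 2, losing one broker still allows writes with acks=all. Losing two blocks them, while acks=1 would keep accepting writes that exist on a single broker only.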

| 15
● Configure the brokers so that no messages are lost at the broker end and data survives hardware failures, using the aspects below:
○ Partitions
■ Spread leader partitions evenly on brokers throughout the cluster
● Balances the load
■ Partition replication
● Replication factor (R <= size of the cluster) (network congestion?)
● Sufficient replication sets to preserve data
● Rack awareness
○ Data Retention
■ Too much retention can be a problem
■ Leader election scenarios (unclean leader election)
○ Zookeeper availability
● DR / Backup
○ MirrorMaker to replicate the messages to another cluster
Kafka Cluster

| 16
● Failure Detection
○ Reassign the replicas hosted on dead brokers
○ Monitor for under-replicated partitions
● Kafka Cluster sizing
○ Disk
○ Network
Kafka Cluster
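
The failure-detection checks above come down to comparing each partition's assigned replicas with its in-sync replicas. The sketch below works on a plain dictionary. `PartitionState`, `under_replicated` and `partitions_on_broker` are hypothetical names, and how the state is fetched from a real cluster is left out.

```python
from typing import Dict, List, NamedTuple, Tuple

TopicPartition = Tuple[str, int]


class PartitionState(NamedTuple):
    replicas: List[int]  # broker ids assigned to the partition
    isr: List[int]       # broker ids currently in sync with the leader


def under_replicated(state: Dict[TopicPartition, PartitionState]) -> List[TopicPartition]:
    """Partitions whose ISR is smaller than their assigned replica set."""
    return sorted(tp for tp, s in state.items() if len(s.isr) < len(s.replicas))


def partitions_on_broker(state: Dict[TopicPartition, PartitionState],
                         broker_id: int) -> List[TopicPartition]:
    """Partitions with a replica on the given broker, i.e. candidates for
    reassignment if that broker is dead."""
    return sorted(tp for tp, s in state.items() if broker_id in s.replicas)
```

A non-empty result from `under_replicated` that does not clear on its own is a reasonable alert condition, since those partitions have less redundancy than configured.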

| 17
Kafka Sizing
Recommended approach: simulate load using the load-generation tools that ship with Kafka
Simplest: size based on disk space (data ingest rate * data retention)
A little better: size based on disk and network throughput requirements
Example: 1 TB/hour -> 277 MB/s, replication 3, consumers 10
● Disk throughput = ingest * replication * 2 (for lagging readers) = 277 * 3 * 2 = 1662 MB/s
● Network read throughput = ingest * (replication - 1 + consumers) = 277 * (3 - 1 + 10) = 3324 MB/s
● Network write throughput = ingest * replication = 277 * 3 = 831 MB/s
Network (e.g. 10 GbE, about 1250 MB/s per node): (3324 + 831) / 1250 = 3.3 nodes
Disk (e.g. 6 drives at 70 MB/s each): 1662 / (6 * 70) = 3.96 nodes
Plan for double, so 8 Kafka nodes
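
The sizing arithmetic above is easy to parameterize. A minimal sketch follows, using the slide's inputs as defaults: `size_cluster` is a hypothetical helper, and the NIC and drive figures are the example's assumptions (10 GbE taken as 1250 MB/s, 6 drives at 70 MB/s).

```python
import math


def size_cluster(ingest_mb_s: float, replication: int, consumers: int,
                 nic_mb_s: float = 1250.0, drives: int = 6,
                 drive_mb_s: float = 70.0, headroom: int = 2) -> dict:
    """Estimate node count from disk and network throughput requirements."""
    disk = ingest_mb_s * replication * 2  # x2 for lagging readers
    net_read = ingest_mb_s * (replication - 1 + consumers)
    net_write = ingest_mb_s * replication
    nodes_for_network = (net_read + net_write) / nic_mb_s
    nodes_for_disk = disk / (drives * drive_mb_s)
    # Take the tighter constraint, round up, then apply the headroom factor.
    planned = math.ceil(max(nodes_for_network, nodes_for_disk)) * headroom
    return {
        "disk_mb_s": disk,
        "net_read_mb_s": net_read,
        "net_write_mb_s": net_write,
        "nodes_for_network": nodes_for_network,
        "nodes_for_disk": nodes_for_disk,
        "planned_nodes": planned,
    }


result = size_cluster(ingest_mb_s=277, replication=3, consumers=10)
print(result["planned_nodes"])
```

With an ingest rate of 277 MB/s, replication 3 and 10 consumers, disk is the tighter constraint (just under 4 nodes), and doubling for headroom gives the 8 nodes from the example.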

| 18
● Understand the data retention period and design accordingly (need to keep up with the producers)
○ Parallelism = number of partitions
■ Max consumers = partitions
○ Proper checkpointing and offset management
■ Disable auto-commit and commit the offset only after the message has been processed
■ Commit less often, and asynchronously
■ Checkpoint frequency
○ Expect at least once delivery of a message and have idempotency logic
Consumers
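
Committing after processing means a crash between two commits causes redelivery, which is why the idempotency logic matters. The simulation below uses an in-memory stand-in rather than a Kafka client. `InMemoryPartition`, `consume` and the `crash_at` switch are invented for this sketch, and `seen` stands for a durable de-duplication store that a real system would update together with the side effect.

```python
from typing import Callable, List, Optional, Set


class InMemoryPartition:
    """Stand-in for one partition plus its committed offset (not a Kafka API)."""

    def __init__(self, messages: List[dict]):
        self.messages = messages
        self.committed = 0  # offset to resume from after a restart


def consume(partition: InMemoryPartition, handle: Callable[[dict], None],
            seen: Set[int], commit_every: int = 2,
            crash_at: Optional[int] = None) -> None:
    """Process from the last committed offset, committing only after processing."""
    offset = partition.committed
    processed_since_commit = 0
    while offset < len(partition.messages):
        if crash_at is not None and offset == crash_at:
            raise RuntimeError("simulated crash before the next commit")
        msg = partition.messages[offset]
        if msg["sequence_id"] not in seen:  # idempotency: skip redelivered messages
            handle(msg)
            seen.add(msg["sequence_id"])
        offset += 1
        processed_since_commit += 1
        if processed_since_commit >= commit_every:  # commit less often
            partition.committed = offset
            processed_since_commit = 0
    partition.committed = offset  # final commit on clean shutdown


partition = InMemoryPartition([{"sequence_id": i} for i in range(5)])
handled: List[int] = []
seen_ids: Set[int] = set()
try:
    consume(partition, lambda m: handled.append(m["sequence_id"]), seen_ids, crash_at=3)
except RuntimeError:
    pass  # message 2 was processed but its offset was never committed
consume(partition, lambda m: handled.append(m["sequence_id"]), seen_ids)
```

After the restart, message 2 is read again from the committed offset but skipped by the `seen` check, so every message is handled exactly once even though delivery was at-least-once.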

| 19
● QC Check Topic
○ Message sequence Id
○ Message counts per time bucket (10 mins?)
○ Reconcile the number of messages produced against the number of messages consumed
● Kafka Audit System
○ Chaperone
■ https://github.com/uber/chaperone
○ Kafka Monitor
● Continuously monitor for issues
○ Monitor for producer errors:
■ number of retries and counts in retry database
○ Monitor Consumer Lag, fetch rate, fetch latency, records per request
○ Monitor for workload skews
● Capture Metrics
Monitor for Data Loss
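
The QC checks above can be sketched as two small functions: one reconciles produced and consumed counts per time bucket, the other looks for gaps in message sequence IDs. The function names are hypothetical, and the sketch assumes both sides bucket by the producer-side timestamp carried in the message, so the same message always lands in the same bucket.

```python
from collections import Counter
from typing import Dict, Iterable, List

BUCKET_SECONDS = 600  # 10-minute buckets


def bucket_counts(timestamps: Iterable[float]) -> Counter:
    """Count messages per bucket, keyed by the bucket's start time."""
    return Counter(int(ts // BUCKET_SECONDS) * BUCKET_SECONDS for ts in timestamps)


def reconcile(produced: Iterable[float], consumed: Iterable[float]) -> Dict[int, int]:
    """Return {bucket_start: produced - consumed} for buckets that disagree.

    A positive value suggests loss (or lag); a negative value suggests duplicates.
    """
    p, c = bucket_counts(produced), bucket_counts(consumed)
    return {b: p[b] - c[b] for b in sorted(set(p) | set(c)) if p[b] != c[b]}


def missing_sequence_ids(seen_ids: Iterable[int]) -> List[int]:
    """Sequence IDs absent between the smallest and largest ID seen."""
    ids = set(seen_ids)
    if not ids:
        return []
    return [i for i in range(min(ids), max(ids) + 1) if i not in ids]
```

A bucket should only be reconciled once it is old enough that consumers have had time to catch up. Otherwise normal consumer lag shows up as apparent loss.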

| 20
● Guaranteeing zero data loss is not just Kafka’s problem
● Zero data loss pipelines require operations (Kafka cluster) and development teams (producer / consumer) working together
● Anticipate failures and design code to handle them
● Shut down your application gracefully
Summary

Thank You!
| 21
Questions?
avinash@clairvoyantsoft.com
