Seek and Destroy: Kafka Under Replication

Edoardo Comar <ecomar@uk.ibm.com>
Event Streams for IBM Cloud
Apache Kafka committer

Contents
• Replication basics
• Why monitoring for URP is important
• How we investigated for replication improvements
• Orchestrating the good old Kafka performance producer
• Monitoring some handy Kafka server metrics
• Our results

Kafka data replication basics
• Producers write to partition leaders (always)
• Followers fetch from leaders (usually)
• Typical replication factor 3x
• Allows redundancy even when one broker is being updated
• Works well with min.insync.replicas = 2
• We want to prioritize durability
• Producer controls acks mode: all (default), 1, 0
• ack’d responses allow for a backpressure mechanism

Kafka data replication basics
• acks=all isn’t always all replicas!
• It means all replicas currently in the ISR, and the ISR size must be >= min.insync.replicas
• URP (under-replicated partitions) is a risk condition
  • not an outage (yet)
  • limits operations
• UMI (under-min-ISR) is an outage for clients
  • Producers fail with acks=all
  • Consumers can’t commit offsets (if __consumer_offsets goes UMI)
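
The URP and UMI conditions above can be expressed as a small model. This is a minimal sketch rather than Kafka code: the `PartitionState` class and its field names are made up for illustration, and it only encodes the rules stated on this slide (ISR smaller than the replica set means URP, ISR smaller than min.insync.replicas means UMI and acks=all produces are rejected).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical model of one partition's replication state."""
    replicas: frozenset  # broker ids assigned to the partition
    isr: frozenset       # broker ids currently in sync (leader included)
    min_isr: int         # the min.insync.replicas setting

    @property
    def under_replicated(self) -> bool:
        # URP: at least one assigned replica has fallen out of the ISR.
        return len(self.isr) < len(self.replicas)

    @property
    def under_min_isr(self) -> bool:
        # UMI: the ISR has shrunk below min.insync.replicas.
        return len(self.isr) < self.min_isr

    def acks_all_accepted(self) -> bool:
        # With acks=all the leader rejects writes while the partition is UMI.
        return not self.under_min_isr

    def replicas_waited_for(self) -> int:
        # acks=all waits for the current ISR, not for every assigned replica.
        return len(self.isr)


healthy = PartitionState(frozenset({1, 2, 3}), frozenset({1, 2, 3}), min_isr=2)
at_risk = PartitionState(frozenset({1, 2, 3}), frozenset({1, 2}), min_isr=2)
outage = PartitionState(frozenset({1, 2, 3}), frozenset({1}), min_isr=2)
```

With replication factor 3 and min.insync.replicas=2, losing one follower gives `at_risk` (URP, still writable, acks=all waits for 2 replicas), while losing two gives `outage` (UMI, acks=all producers fail).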

A simple game:
• On a fixed reference infrastructure
  • 3-broker cluster
  • either in ZK or KRaft mode
• We generated increasing levels of client workload
  • tracking URP and UMI metrics
• And measured the effects of changing:
  • broker settings
  • client (producer) settings
  • topic partitioning

Significant kafka.server metrics
• Number of under-replicated partitions
  kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
• Number of under-min-ISR partitions
  kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount
• Max lag in messages between follower and leader replicas
  kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=cid
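
As a rough illustration of how these metrics might feed a monitoring check, the sketch below splits an MBean name into its domain and key properties and maps sampled URP/UMI values to a status. Both helper functions are hypothetical, the parser handles only simple unquoted names like the ones listed above, and the status labels simply follow the "risk condition" versus "outage" distinction made earlier.

```python
from typing import Dict, Tuple


def parse_object_name(name: str) -> Tuple[str, Dict[str, str]]:
    """Split 'domain:k1=v1,k2=v2' into (domain, {k1: v1, k2: v2}).

    Simplified: quoted values containing ',' or '=' are not handled.
    """
    domain, _, props = name.partition(":")
    pairs = dict(p.split("=", 1) for p in props.split(","))
    return domain, pairs


def replication_status(urp_count: int, umi_count: int) -> str:
    """Map sampled metric values to a coarse status (labels are assumptions)."""
    if umi_count > 0:
        return "outage"   # clients using acks=all are failing
    if urp_count > 0:
        return "at risk"  # redundancy reduced, not an outage yet
    return "ok"


domain, keys = parse_object_name(
    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"
)
```

How the values are actually sampled (a JMX exporter, a metrics reporter, or something else) is left open here.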

Generating client workload
• Executing on a serverless platform
  • We ❤ IBM Cloud Code Engine (shameless plug!)
  • easier than setting up clusters
• Containerized Java Performance producer
  • From the standard Kafka distribution
  • Configured with CLI parameters
  • Easily reusable

e.g. an aggressive producer
Easily customizable
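
To make "configured with CLI parameters" concrete, here is a minimal sketch that assembles a command line for the `kafka-producer-perf-test.sh` tool shipped with Kafka. The helper function, the script path, the topic name, the bootstrap address, and every numeric value are assumptions for illustration, not the settings used in the talk.

```python
from typing import Dict, List


def perf_producer_command(
    topic: str,
    num_records: int,
    record_size: int,
    throughput: int,
    producer_props: Dict[str, str],
) -> List[str]:
    """Build an argv list for kafka-producer-perf-test.sh (hypothetical helper)."""
    cmd = [
        "bin/kafka-producer-perf-test.sh",  # path assumes a standard Kafka layout
        "--topic", topic,
        "--num-records", str(num_records),
        "--record-size", str(record_size),
        "--throughput", str(throughput),    # -1 disables throttling
    ]
    if producer_props:
        cmd.append("--producer-props")
        cmd.extend(f"{key}={value}" for key, value in producer_props.items())
    return cmd


# An "aggressive" producer: unthrottled; all values below are made up.
aggressive = perf_producer_command(
    topic="perf-test",
    num_records=10_000_000,
    record_size=1024,
    throughput=-1,
    producer_props={"bootstrap.servers": "broker-0:9092", "acks": "all"},
)
print(" ".join(aggressive))
```

Changing only `producer_props` (for example `acks`) gives a family of workloads that can be run side by side from the same container image.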

Running the workloads
On a 3-node Kubernetes cluster
• default.replication.factor=3
• min.insync.replicas=2
• settings affecting replication:
  • num.network.threads
  • num.io.threads
  • num.replica.fetchers
  • replica.fetch.min.bytes
  • replica.fetch.wait.max.ms
  • replica.lag.time.max.ms
  • replica.socket.receive.buffer.bytes
  • replica.socket.timeout.ms
  • replica.fetch.backoff.ms
  • replica.fetch.max.bytes
  • replica.fetch.response.max.bytes

acks=all vs acks=1
Lower replica lag with acks=all

acks=1 vs acks=0
Lower replica lag with acks=1

Topic Partitioning (imbalanced)
1 partition vs 9 partitions
same set of producers
acks=all
(charts not at the same scale)

Topic Partitioning (balanced)
3 partitions vs 12 partitions
Both balanced
Replica lag reduced by more than 4x
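
One possible reading of "balanced" is that partition leaders spread evenly over the three brokers. The sketch below counts leaders per broker under a plain round-robin placement; that placement is a simplifying assumption (Kafka's real assignment also randomizes the starting broker and considers racks), and the function name is hypothetical.

```python
from typing import List


def leaders_per_broker(num_partitions: int, num_brokers: int) -> List[int]:
    """Count leaders per broker, assuming simple round-robin placement."""
    counts = [0] * num_brokers
    for partition in range(num_partitions):
        counts[partition % num_brokers] += 1
    return counts


def is_balanced(num_partitions: int, num_brokers: int) -> bool:
    """True when every broker leads the same number of partitions."""
    counts = leaders_per_broker(num_partitions, num_brokers)
    return max(counts) == min(counts)


for partitions in (1, 3, 12):
    print(partitions, leaders_per_broker(partitions, 3), is_balanced(partitions, 3))
```

Under this assumption a single partition puts all leader load on one broker, while 3 and 12 partitions both spread evenly across 3 brokers.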

num.replica.fetchers
Multiple fetchers reduce lag
6 fetchers vs 1 (default)
same set of producers
acks=all

num.io.threads, num.network.threads
io=8 & network=3 (defaults)
vs
io=16 & network=32

replica.socket.receive.buffer.bytes
64k (default)
vs 512k

Result: tuned broker settings
• num.network.threads=32
  • The number of threads that the server uses for receiving requests from the network and sending responses to the network (default: 3)
• num.io.threads=16
  • The number of threads that the server uses for processing requests, which may include disk I/O (default: 8)
• num.replica.fetchers=6
  • Number of fetcher threads (per broker) used to replicate messages from a source broker (default: 1)
• replica.socket.receive.buffer.bytes=262144
  • The socket receive buffer for network requests to the leader for replicating data (default: 64k)
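
The tuned values can be kept next to the defaults and rendered as broker property overrides. This is only a sketch: the two dictionaries copy the numbers from this slide (with 64k written out as 65536 bytes), and the `render_overrides` helper is a hypothetical convenience, not part of any Kafka tooling.

```python
from typing import Dict

# Defaults and tuned values as listed on the slide.
DEFAULTS: Dict[str, int] = {
    "num.network.threads": 3,
    "num.io.threads": 8,
    "num.replica.fetchers": 1,
    "replica.socket.receive.buffer.bytes": 65536,  # 64k
}

TUNED: Dict[str, int] = {
    "num.network.threads": 32,
    "num.io.threads": 16,
    "num.replica.fetchers": 6,
    "replica.socket.receive.buffer.bytes": 262144,
}


def render_overrides(defaults: Dict[str, int], tuned: Dict[str, int]) -> str:
    """Return key=value lines for every setting that differs from its default."""
    lines = [
        f"{key}={tuned[key]}"
        for key in sorted(tuned)
        if tuned[key] != defaults.get(key)
    ]
    return "\n".join(lines)


overrides = render_overrides(DEFAULTS, TUNED)
print(overrides)
```

The printed lines are in the `key=value` form used by a broker's properties file; whether these particular values suit another cluster depends on its hardware and workload.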

Putting it all together
default broker config
vs
cumulative tuning

Summary
• Set up a client workload framework
• Monitor metrics
• Tune broker configs
  • Network and I/O threads
  • Replica fetchers
  • Replica socket buffer size

Q & A
Replication is complicated :)
KIP-966 - Eligible Leader Replicas (Fixing the Last Replica Standing issue)
Hofstadter's Law:
"It always takes longer than you expect, even when you take into account Hofstadter's Law."
[Douglas Hofstadter, "Gödel, Escher, Bach: an Eternal Golden Braid"]
