"It's important that even under load, Apache Kafka ensures user topics are fully replicated in synch.
Replication is essential to endure resilience to data loss, so both users and operators care about it.
If a topic partition falls out of the ISR (In-Synch-replicas) set, a user experiences unavailability (when producing with the default acknowledgment setting).
Users may use non-default acks mode to work around it, but the effect on a Kafka cluster is to make the under-replication worse.
Even simple Under replication with no Under Min Isr is to be avoided as a cluster update may cause the dreaded Under Min ISR.
There are a number of settings that can be used, from quotas to number of replication threads to more low-level settings.
This session wants to show how we successfully measured and evolved our Kafkas configuration, with the goal of giving the best possible user experience (and resilience to their data).
Hofstadter's Law applied!
""It always takes longer than you expect, even when you take into account Hofstadter's Law."""
Seek and Destroy Kafka Under Replication
1. Seek and Destroy:
Kafka Under Replication
Edoardo Comar <ecomar@uk.ibm.com>
Event Streams for IBM Cloud
Apache Kafka committer
2. Contents
• Replication basics
• Why monitoring for URP is important
• How we investigated for replication improvements
• Orchestrating the good old Kafka performance producer
• Monitoring some handy Kafka server metrics
• Our results
3. Kafka data replication basics
• Producers write to partition leaders (always)
• Followers fetch from leaders (usually)
• Typical replication factor 3x
• Allows redundancy even when one broker is being updated
• Works well with min.insync.replicas = 2
• We want to prioritize durability
• Producer controls acks mode: all (default), 1, 0
• ack’d responses allow for a backpressure mechanism
4. Kafka data replication basics
• acks=all isn’t always all replicas!
• It means all replicas currently in the ISR, and the ISR size must be >= min.insync.replicas
• URP (under-replicated partitions) is a risk condition
• not an outage (yet)
• limits operations
• UMI (under-min-ISR) is an outage for clients
• Producers fail with acks=all
• Consumers can’t commit offsets (if __consumer_offsets goes UMI)
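The distinction between URP and UMI on the two slides above can be sketched in a few lines. This is a minimal illustration, not Kafka's own code: the `PartitionState` type and both function names are hypothetical, and the rules are simply the ones stated in the bullets (URP when the ISR is smaller than the replica set, UMI when the ISR is smaller than min.insync.replicas).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical snapshot of one partition, for illustration only."""
    replicas: int     # configured replication factor
    isr: int          # replicas currently in the in-sync set
    min_insync: int   # effective min.insync.replicas


def classify(p: PartitionState) -> str:
    """Return 'UMI', 'URP' or 'OK' for one partition."""
    if p.isr < p.min_insync:
        return "UMI"  # outage for clients: acks=all produce requests fail
    if p.isr < p.replicas:
        return "URP"  # risk condition: still writable, redundancy reduced
    return "OK"


def acks_all_accepted(p: PartitionState) -> bool:
    """acks=all waits for the current ISR, which must be >= min.insync.replicas."""
    return p.isr >= p.min_insync
```

With replication factor 3 and min.insync.replicas = 2, losing one replica gives `classify(...) == "URP"` while acks=all still succeeds; losing a second one gives "UMI".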
5. A simple game:
• On a fixed reference infrastructure
• 3-broker cluster
• either in ZK or KRaft mode
• We generated increasing levels of client workload
• tracking URP and UMI metrics
• And measured the effects of changing:
• broker settings
• client (producer) settings
• topic partitioning
6. Significant kafka.server metrics
• # of under replicated partitions
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
• # of under min-Isr partitions
kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount
• Max lag in messages between follower and leader replicas
kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=cid
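One way to use these three metrics while ramping up the workload is to record them at each load level and look for the first level where URP or UMI becomes non-zero. The sketch below assumes the samples have already been collected by some JMX scraper; the `Sample` type, its field names, and the MB/s unit for the workload are hypothetical choices for illustration, not part of the talk.

```python
from typing import NamedTuple, Optional, Sequence


class Sample(NamedTuple):
    """Hypothetical record of one measurement during the workload ramp."""
    workload_mb_s: float  # offered producer load (assumed unit: MB/s)
    urp: int              # UnderReplicatedPartitions
    umi: int              # UnderMinIsrPartitionCount
    max_lag: int          # MaxLag, in messages


def first_breach(samples: Sequence[Sample], metric: str) -> Optional[float]:
    """Lowest workload at which `metric` ('urp' or 'umi') was non-zero, else None."""
    breached = [s.workload_mb_s for s in samples if getattr(s, metric) > 0]
    return min(breached) if breached else None
```

Comparing `first_breach(samples, "urp")` before and after a configuration change gives a simple number to track: a change helps if URP first appears at a higher workload.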
7. Generating client workload
• Executing on a serverless platform
• We ❤ IBM Cloud Code Engine - shameless plug!
• easier than setting up clusters
• Containerized Java Performance producer
• From the standard Kafka distribution
• Configured with CLI parameters
• Easily reusable
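Because the performance producer is driven entirely by CLI parameters, each workload level can be described as an argument list and handed to whatever launches the container. The helper below is a hypothetical sketch: the function name and the script path are assumptions, and the flags shown (`--topic`, `--num-records`, `--record-size`, `--throughput`, `--producer-props`) are those of `kafka-producer-perf-test.sh` in Kafka 3.x, so they should be checked against the distribution in use.

```python
from typing import Dict, List


def perf_producer_args(
    topic: str,
    num_records: int,
    record_size: int,
    throughput: int,
    producer_props: Dict[str, str],
) -> List[str]:
    """Build the argv for one kafka-producer-perf-test.sh run."""
    args = [
        "bin/kafka-producer-perf-test.sh",  # assumed path inside the image
        "--topic", topic,
        "--num-records", str(num_records),
        "--record-size", str(record_size),
        "--throughput", str(throughput),  # records/sec; -1 disables throttling
        "--producer-props",
    ]
    # --producer-props takes space-separated key=value pairs
    args += [f"{key}={value}" for key, value in producer_props.items()]
    return args
```

Raising the workload then amounts to changing `throughput`, `record_size`, or the number of parallel container instances, while client settings such as `acks` travel in `producer_props`.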
18. Result: tuned broker settings
• num.network.threads=32
• The number of threads that the server uses for receiving requests from the
network and sending responses to the network (default: 3)
• num.io.threads=16
• The number of threads that the server uses for processing requests, which may
include disk I/O (default: 8)
• num.replica.fetchers=6
• Number of fetcher threads (per broker) used to replicate messages from a source
broker (default: 1)
• replica.socket.receive.buffer.bytes=262144
• The socket receive buffer for network requests to the leader for replicating data
(default: 64k)
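The tuned values above can be kept next to the defaults and turned into `server.properties` override lines. This is only a sketch: the dictionary and function names are hypothetical, the values are the ones listed on the slide, and 65536 is the byte value of the 64k default.

```python
from typing import Dict

# Defaults and tuned values as listed on the slide above.
DEFAULTS: Dict[str, int] = {
    "num.network.threads": 3,
    "num.io.threads": 8,
    "num.replica.fetchers": 1,
    "replica.socket.receive.buffer.bytes": 65536,
}
TUNED: Dict[str, int] = {
    "num.network.threads": 32,
    "num.io.threads": 16,
    "num.replica.fetchers": 6,
    "replica.socket.receive.buffer.bytes": 262144,
}


def render_overrides(defaults: Dict[str, int], tuned: Dict[str, int]) -> str:
    """Return server.properties lines for every setting that differs from its default."""
    lines = [
        f"{key}={value}"
        for key, value in tuned.items()
        if defaults.get(key) != value
    ]
    return "\n".join(lines)
```

Calling `render_overrides(DEFAULTS, TUNED)` yields one `key=value` line per tuned setting, which makes it easy to apply the changes one at a time and measure each step.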
19. Putting it all together
default broker config vs cumulative tuning
20. Summary
• Set up a client workload framework
• Monitor metrics
• Tune broker configs
• Network and I/O Threads
• Replica fetchers
• Replica socket buffers size
21. Q & A
Replication is complicated :)
KIP 966 - Eligible Leader Replicas (Fixing the Last Replica Standing issue)
Hofstadter's Law:
"It always takes longer than you expect, even when you take into
account Hofstadter's Law."
[Douglas Hofstadter, “Gödel, Escher, Bach: an Eternal Golden Braid”]