- Upgrade often to pick up bug fixes and improvements, following the upgrade guide carefully. Start with a healthy cluster and upgrade components outward, from ZooKeeper to Kafka brokers to clients. Don't rush the process, and never proceed with any under-replicated partitions outstanding.
- Collect JMX metrics to monitor the cluster; without that visibility, outages can be prolonged. The Kafka defaults suit single-node deployments, but replication factor, thread counts, and broker configuration should be tuned for larger clusters.
- Use quotas (replication throttling, plus bandwidth and request limits per client) to protect both the cluster and its clients. Log files should be separated per component and retained for a few days. Consider running multiple clusters, bucketed by SLA.
From: Learnings from the Field. Lessons from Working with Dozens of Small & Large Deployments (Mitchell Henderson, Confluent), Kafka Summit 2020
6. How to upgrade?
● Read the upgrade guide 3 times.
  ○ Do you understand the API/protocol versions? This is important.
● Start with a healthy cluster!
  ○ No URP (under-replicated partitions)! Seriously, NONE!
● Work outward: ZooKeeper -> Kafka brokers -> Connect/Streams/Schema Registry -> clients.
● One node (JVM instance) at a time!
● Upgrade binaries.
● Wait for URP to return to zero!
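The "no URP" gate above can be checked with the stock tooling; a minimal sketch, assuming a broker reachable at localhost:9092 and the Kafka scripts in bin/:

```shell
# List any under-replicated partitions; empty output means none.
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions

# Simple gate for an upgrade script: refuse to continue while URP > 0.
if [ -n "$(bin/kafka-topics.sh --bootstrap-server localhost:9092 \
      --describe --under-replicated-partitions)" ]; then
  echo "URP detected - do not proceed with the upgrade" >&2
  exit 1
fi
```

Running this between each node's restart enforces the "one node at a time, wait for URP to clear" rule automatically.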
7. What not to do?
● Replace old brokers with new brokers, unless you have to.
● Upgrade multiple components at the same time.
● Make multiple changes at once.
● Start with an unhealthy cluster.
● Rush the process.
● Do not move on to the next step with any URP!!!!!
12. Common Questions
● What tool to use?
● How often to poll the JMX interface?
● Will this cause performance issues?
● How long do I need to keep these metrics?
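For the "what tool" question, one stock option is the JmxTool class that ships with Kafka; a sketch, assuming the broker was started with JMX enabled on port 9999 (the port and polling interval are illustrative):

```shell
# Poll the broker's UnderReplicatedPartitions MBean every 30 seconds.
bin/kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --object-name kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions \
  --reporting-interval 30000
```

In practice most teams feed the same MBeans into a metrics system (Prometheus JMX exporter, Datadog, etc.) rather than polling by hand.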
17. Logging - Can't know where you're going without knowing where you've been
18. Each component should go to its own log files.
org.apache.log4j.RollingFileAppender is your friend; use it! Without it you will fill up your logging disk, and bad things will happen!
You should plan to keep at least a few days of logs.
Do not be afraid to turn on debug-level logging. There is a JMX bean for this, so there is no longer any need to restart brokers.
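A sketch of what the rolling appender looks like in log4j.properties; the file path, size cap, and backup count are illustrative values, not recommendations from the talk:

```properties
# Size-based rolling for the broker's server log: disk usage is capped at
# roughly MaxFileSize * (MaxBackupIndex + 1) per appender.
log4j.appender.kafkaAppender=org.apache.log4j.RollingFileAppender
log4j.appender.kafkaAppender.File=/var/log/kafka/server.log
log4j.appender.kafkaAppender.MaxFileSize=100MB
log4j.appender.kafkaAppender.MaxBackupIndex=10
log4j.appender.kafkaAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.kafkaAppender.layout.ConversionPattern=[%d] %p %m (%c)%n
```

Kafka's stock log4j.properties ships with time-based rolling appenders, which roll daily but put no cap on total disk usage; the size-based variant above is what bounds it.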
20. Mandatory Quotas!
Replication quota!
This prevents a recovering broker from overwhelming the leaders! It will also prevent a rebalance from stealing all the cluster resources! It will save your butt at 3am!
bin/kafka-configs … --alter
  --add-config 'leader.replication.throttled.rate=10000'
  --entity-type brokers
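Note that the rate alone takes effect only on replicas marked as throttled; a fuller sketch of both halves (the broker id, topic name, and ~10 MB/s rate are illustrative):

```shell
# Throttle leader- and follower-side replication traffic on broker 1 to ~10 MB/s.
bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --add-config 'leader.replication.throttled.rate=10485760,follower.replication.throttled.rate=10485760' \
  --entity-type brokers --entity-name 1

# The rate only applies to replicas listed as throttled; '*' marks all of them.
bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --add-config 'leader.replication.throttled.replicas=*,follower.replication.throttled.replicas=*' \
  --entity-type topics --entity-name my-topic
```

Remember to remove the throttle configs once the reassignment or recovery finishes, or replication will stay capped.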
21. Two types of client quotas:
● Bandwidth: bytes in/out.
● Request-based: everything in Kafka is a request.
22. Bandwidth quotas
● Easy to reason about.
● Easy to implement.
● Easy to monitor.
  ○ There is a per-client metric that indicates throttle times.
● A great way to capacity-plan your cluster!
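A sketch of setting a bandwidth quota with the stock tooling; the client id and the ~10 MB/s figures are illustrative:

```shell
# Cap the client id "my-client" at ~10 MB/s produced and ~10 MB/s consumed.
bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --add-config 'producer_byte_rate=10485760,consumer_byte_rate=10485760' \
  --entity-type clients --entity-name my-client
```

Use --entity-default instead of --entity-name to set a default quota for every client that has no explicit override.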
23. Request quotas
● Added in KIP-124.
● Motivation was to keep clients from overwhelming the network threads and request threads.
● Defined as a percentage of thread utilization: ((num.io.threads + num.network.threads) * 100%).
● More difficult to reason about, but very useful in environments where clients are concerned about latency.
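To make the percentage concrete: with the default num.io.threads=8 and num.network.threads=3, the total pool is (8 + 3) * 100% = 1100%, so a quota of 50% caps a client at half of one thread's time. A sketch of setting it (the 50% value is illustrative):

```shell
# Cap every client without an explicit override at 50% of one request-handler thread.
bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --add-config 'request_percentage=50' \
  --entity-type clients --entity-default
```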
24. Storage quotas (also called retention)
retention.ms & retention.bytes
If you're not setting BOTH of these on every single topic, you're asking for trouble.
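A sketch of setting both limits on an existing topic; the topic name and the 7-day / 100 GiB values are illustrative:

```shell
# retention.ms = 7 days (604,800,000 ms); retention.bytes = 100 GiB, per partition.
bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name my-topic \
  --add-config 'retention.ms=604800000,retention.bytes=107374182400'
```

Note that retention.bytes applies per partition, so a 12-partition topic with this setting can still hold up to ~1.2 TiB.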
27. Answer: Many clusters!
Bucket by SLA or criticality.
Easier maintenance. Easier tuning. Better monitoring. Safer!
Why not? More sprawl.
It's a balance.