"Consumer scaling is a crucial element for many Apache Kafka users. Who doesn't want to save money by managing resources efficiently: shutting down unnecessary instances when there is no traffic, scaling up quickly during peak hours and, while doing all of that, avoiding annoying and often unnecessary rebalancing?
To achieve this you need to understand how consumer assignment works, how nodes are affected by data load and what the common causes of rebalancing are. But most importantly, you need to know which assignors to choose for your use case and which metrics to use to measure your data load.
How do we know what the good and bad practices are? At Aiven, we've seen firsthand both successful and not-so-great approaches to consumer scaling and rebalancing. The insights we're sharing with you come directly from our experience working on many Apache Kafka projects.
We'll discuss metrics that are essential for understanding data load and deciding when to scale. We'll cover a variety of approaches, from commonly used lag exporters, to Knative scalers based on concurrent requests, to insights from developing our own speed lag predictor, which goes beyond the basics by calculating the velocity of data load changes. We'll highlight the advantages and disadvantages of each approach and when you should use it.
Next, we'll look at the available assignors and guide you on choosing the most suitable one for your scenario. We'll pay special attention to the challenges faced by stateful applications and the potential pitfalls of frequent scaling, such as overloaded brokers.
Armed with this knowledge, you'll have what you need to build scalable systems, minimize downtime and save costs when working with Apache Kafka. Let's make your Kafka experience as smooth and efficient as possible!"
Mastering Kafka Consumer Distribution: A Guide to Efficient Scaling and Resource Optimization
1. olena@aiven.io
@OlenaKutsenko aiven.io Olena Babenko:
Mastering Kafka consumer distribution
A guide to efficient scaling and resource optimization
Olena Kutsenko
Sr. Developer Advocate
Aiven
Olena Babenko
Staff Software Engineer
Aiven
2.
Mastering Kafka consumer distribution
A guide to efficient scaling and resource optimization
➔ why scaling consumers is not always desirable
➔ why consumer lag isn't the metric you want to rely on
➔ how not to scale stateful consumers
➔ what is the most anticipated change in the rebalancing protocol
➔ how to find the right balance between latency, durability and costs
13.
[Diagram: a consumer group of Consumers 1–4 reading Partitions 1–4 spread across Brokers 1 and 2]
Moving partition ownership from one consumer to another is called a rebalance.
We need efficient rebalancing for:
● Scalability
● Elasticity
● Fault tolerance
Side effects of rebalancing:
● Increased consumer lag, latency and reduced throughput
● Increased resource utilization
● Potential data duplication or data loss
● Increased complexity
15.
We want to scale efficiently and effectively:
1. Know when to scale (and when not)
2. Minimize unnecessary data movement
3. Avoid unnecessary rebalancing
38.
Drama points
1. Group coordinator is too slow
2. Group leader is too slow
3. Some of the consumers are too slow

# of consumers | probability of success per instance | overall probability
6              | 99%                                 | 0.99^6 ≈ 0.94 = 94%
100            | 99%                                 | 0.99^100 ≈ 0.366 = 37%
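The numbers in the table follow from simple independence: if every instance succeeds with probability p, all n of them must succeed for the rebalance to go smoothly. A quick sketch (function name is ours):

```python
def overall_success_probability(n_consumers: int, p_instance: float) -> float:
    """Probability that ALL n consumers complete a rebalance step,
    assuming each succeeds independently with probability p_instance."""
    return p_instance ** n_consumers

# The larger the group, the less likely a rebalance completes cleanly:
small = overall_success_probability(6, 0.99)    # ≈ 0.94
large = overall_success_probability(100, 0.99)  # ≈ 0.37
```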
41.
Drama points
1. Group coordinator is too slow
2. Group leader is too slow
3. Some of the consumers are too slow
4. A new node is stuck in rebalancing
5. The onPartitionsRevoked dark hole
Consumers apply the new assignment plan:
1. Determine which partitions are newly assigned and which are now revoked
2. Start reading from the newly assigned partitions
3. If any existing partitions are revoked, trigger a new rebalance
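Step 1 of applying the plan is just a set difference between the old and new assignments; under the eager protocol, any revoked partition means another rebalance round. A plain-Python illustration (not actual client code):

```python
def apply_assignment(current: set, new_assignment: set):
    """Diff one consumer's old and new partition assignment.
    Returns (newly_assigned, revoked, needs_new_rebalance)."""
    newly_assigned = new_assignment - current  # start reading these
    revoked = current - new_assignment         # stop owning these
    return newly_assigned, revoked, bool(revoked)

# Consumer previously owned t-0 and t-1; the new plan gives it t-1 and t-2:
added, revoked, again = apply_assignment({"t-0", "t-1"}, {"t-1", "t-2"})
# added == {"t-2"}, revoked == {"t-0"}, again == True
```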
44.
Scale horizontally vs vertically
Vertical scaling pros:
- Less rebalancing, if static members are used (the group.instance.id config)
- More flexible when you run out of resources (CPU, RAM, disk, etc.)
Vertical scaling cons:
- Lots of partitions on one node is not always good either: one hot partition can hog all resources
- Bigger machines are not always possible
- State might be lost
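Static membership is enabled per consumer via the group.instance.id setting. A minimal config sketch, using key names in the dotted style accepted by confluent-kafka-style clients; the broker address, group name and instance ids are placeholders:

```python
def consumer_config(instance_id: str) -> dict:
    """Sketch of a consumer config with static membership enabled.

    A unique, stable group.instance.id per instance makes the broker treat
    a restarted process as the same member, avoiding a rebalance on short
    bounces (within session.timeout.ms)."""
    return {
        "bootstrap.servers": "kafka:9092",   # placeholder address
        "group.id": "orders-processor",      # placeholder group name
        "group.instance.id": instance_id,    # static member id, e.g. the pod name
        "session.timeout.ms": 30000,         # how long a static member may be away
    }

cfg = consumer_config("orders-processor-0")
```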
46.
Built-in consumer metrics
records-lag-max
Lag is a very common metric to identify that there is too much going on, especially if the lag spans ALL or most partitions. If only one partition is lagging, that might be an error or a problem within the consumer group.
records-consumed-rate
The average number of records consumed per second.
join-rate
Helps to monitor whether something is wrong with rebalancing.
47.
Built-in consumer metrics
Pros:
- The simplest option to start with
Cons:
- Collected on the consumer side, so they depend on the consumer's health and state
49.
Generic Kafka cluster metrics
- Also include lag info
- Less biased
- More info about producers and event production
- Additional important info about group coordinator health
51.
KEDA (Apache Kafka scaler)
lagThreshold
Can be tuned to scale instances based on lag.
activationLagThreshold
The activating (or deactivating) phase is the moment when the KEDA operator has to decide whether the workload should be scaled from/to zero.
+ many more, and a lot more if you choose the Prometheus trigger (custom metrics)
52.
Knative
- Scales up and down faster, based on the number of events
- You can scale a Kafka Source using KEDA
- Great at handling spikes
- Reusability of resources
- Keeps the same pod identity while replacing nodes (reduces the amount of rebalancing during failures)
- More complicated
54.
Autoscaling on lag is not always optimal
- Lots of joins/rebalancing can make event consumption slower, so even more nodes get requested as a result
- Too much pressure on one leader node
- The lag metric doesn't answer the question of WHAT CAUSED THE LAG (it is not always a lack of resources)
As a result:
- Fast autoscaling might be problematic
55.
We want to scale efficiently and effectively:
1. Know when to scale (and when not)
2. Minimize unnecessary data movement
3. Avoid unnecessary rebalancing
58.
Time is a more universal unit
- Lag depends on message sizes, on batch.size and on linger.ms
- Time is a more universal unit for many businesses: you probably know what it costs to delay an order by 2 hours, or what 5 minutes of website downtime costs
- AWS, Confluent, Aiven, etc. usually provide server-side, time-related metrics like estimated time lag or latency
59.
The simplest way to calculate time lag
- Take the latest committed offset of the consumer group
- Read that committed/consumed message's timestamp from the topic
- Compare it with the current time
Pros:
- Accurate
Cons:
- You need to fetch the whole message (which might be big)
- You need to do this quite often
- It does not scale well across multiple producers, consumer groups, topics and partitions
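The comparison step can be sketched like this, assuming you have already fetched the timestamp of the last committed message (the function name is ours):

```python
import time

def time_lag_seconds(committed_msg_timestamp_ms: int, now_ms: int = None) -> float:
    """Time lag = how long ago the last consumed message was produced."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Clamp at zero: clock skew can make the message look like the future.
    return max(now_ms - committed_msg_timestamp_ms, 0) / 1000.0

# A message produced at t=1_000 ms, observed at t=11_000 ms, is 10 s behind:
lag = time_lag_seconds(1_000, now_ms=11_000)  # 10.0
```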
60.
Serglo
- Builds an interpolation table to eliminate the disadvantages of the simple method
- The latest committed/consumed message gets an approximated (predicted) timestamp
- The predicted timestamp is compared with the current time to return the time lag
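The interpolation-table idea can be sketched as follows: sample (offset, timestamp) pairs periodically, then predict a timestamp for any committed offset by linear interpolation. This is our own minimal illustration, not Serglo's actual implementation:

```python
from bisect import bisect_right

def predict_timestamp(samples, offset):
    """samples: sorted list of (offset, timestamp_ms) observations.
    Returns a linearly interpolated timestamp for `offset`."""
    offsets = [o for o, _ in samples]
    i = bisect_right(offsets, offset)
    if i == 0:                      # before the first sample
        return samples[0][1]
    if i == len(samples):           # past the last sample
        return samples[-1][1]
    (o0, t0), (o1, t1) = samples[i - 1], samples[i]
    return t0 + (t1 - t0) * (offset - o0) / (o1 - o0)

# Offsets 0 and 100 were observed at t=0 ms and t=10_000 ms;
# a committed offset of 50 gets a predicted timestamp of 5_000 ms:
ts = predict_timestamp([(0, 0), (100, 10_000)], 50)  # 5000.0
```

No message fetching is needed: the table is built from offsets and wall-clock samples alone, which is what removes the "fetch the whole message" cost of the simple method.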
65.
Aiven lag predictor
Consumption speed:
100 - 70 = 30 records per 5 seconds
30 / 5 = 6 records per second
Left to consume:
180 - 100 = 80 records
80 / 6 ≈ 13.3 seconds to catch up
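The arithmetic above, as a function (names are ours):

```python
def predicted_lag_seconds(consumed_prev: int, consumed_now: int,
                          produced_now: int, window_s: float) -> float:
    """Seconds needed to catch up at the current consumption speed."""
    speed = (consumed_now - consumed_prev) / window_s   # records per second
    remaining = produced_now - consumed_now             # records still to read
    return remaining / speed

# 70 -> 100 records consumed over 5 s, 180 produced so far:
eta = predicted_lag_seconds(70, 100, 180, 5)  # ≈ 13.33 s
```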
66.
Aiven lag predictor
kafka_lag_predictor_group_lag_predicted_seconds estimates how much time you need to catch up at the current producing and consuming speed.
OR:
Estimates WHEN an event that was published right NOW will be consumed.
More data points give more precise results.
67.
Compare
- Serglo works more at the individual message level (it deduces a timestamp for the one last message), while the Aiven lag predictor is more about overall speed. Both can be useful in the right context
- Either option usually works well
- They can give you slightly different results, so expectations might differ
70.
Aiven lag predictor
Server-side metrics, defined at the same time:
`kafka_lag_predictor_topic_produced_records_total` represents the total count of records produced (per partition)
`kafka_lag_predictor_group_consumed_records_total` represents the total count of records consumed (per partition)
72.
We want to scale efficiently and effectively:
1. Know when to scale (and when not)
2. Minimize unnecessary data movement
3. Avoid unnecessary rebalancing - scale multiple instances at once
73.
Scaling ratio
Aiven lag predictor (per partition):
- Changes over time of AVG(kafka_lag_predictor_topic_produced_records_total / kafka_lag_predictor_group_consumed_records_total)
Client-side alternative (per topic):
- Changes over time of AVG(record-send-total / records-consumed-total)
- OR, per second: record-send-rate / records-consumed-rate
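One way to turn the ratio into a single-step scaling decision; this proportional rule and both function names are our own sketch, not a prescription from the slides:

```python
import math

def scaling_ratio(produced_total: float, consumed_total: float) -> float:
    """Produced/consumed ratio; > 1 means producers outpace consumers."""
    return produced_total / consumed_total

def instances_to_add(current_instances: int, ratio: float) -> int:
    """Naive sketch: scale the group proportionally to the ratio, all at once,
    instead of adding one instance per rebalance."""
    target = math.ceil(current_instances * ratio)
    return max(target - current_instances, 0)

# Producers wrote 180 records while consumers read only 90 (ratio 2.0);
# with 3 instances, add 3 more in one step rather than one at a time:
n = instances_to_add(3, scaling_ratio(180, 90))  # 3
```

Scaling several instances in one step means one rebalance instead of several, which is exactly the "scale multiple instances at once" point above.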
75.
We want to scale efficiently and effectively:
1. Know when to scale (and when not) - identify other issues that caused lag
2. Minimize unnecessary data movement
3. Avoid unnecessary rebalancing
76.
Rebalancing issues
Aiven lag predictor (per partition):
- Max over time, for all consumers in a topic, of kafka_lag_predictor_group_consumed_records_total == 0
New server-side metrics (KIP-714):
- consumer.coordinator.assigned.partitions != 0
- consumer.coordinator.rebalance.latency.max
Client-side alternative:
- join-rate (consumer)
77.
Consumption issues
Aiven lag predictor (per partition):
- Max over time, per partition and per topic, of kafka_lag_predictor_group_consumed_records_total == 0
New server-side metrics (KIP-714):
- consumer.fetch.manager.fetch.latency.max
- consumer.node.request.latency.max
- consumer.connection.creation.total
78.
Production issues (scale down)
Aiven lag predictor (per partition):
- Max over time of kafka_lag_predictor_topic_produced_records_total == 0
- Hot partition:
MAX(kafka_lag_predictor_topic_produced_records_total) /
AVG(kafka_lag_predictor_topic_produced_records_total)
New server-side metrics (KIP-714):
- producer.record.queue.time.max
- producer.node.request.latency.max
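The hot-partition check compares the busiest partition's throughput against the average; a value far above 1 flags a hot partition. A sketch (function name is ours):

```python
def hot_partition_ratio(produced_per_partition: list) -> float:
    """MAX/AVG of records produced across a topic's partitions."""
    avg = sum(produced_per_partition) / len(produced_per_partition)
    return max(produced_per_partition) / avg

# One partition receiving 70 of 100 records stands out clearly:
ratio = hot_partition_ratio([10, 10, 10, 70])  # 2.8
```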
80.
What is important for your business?
- Define relevant business rules to predict problems
- Rules can be a combination of different metrics:
  - Time lag estimation
  - Producer health/speed
  - Consumer health/speed
  - Group coordinator metrics
Example: Burrow takes offsets and other metrics and transforms them into a status.
81.
Conclusion
- Use static groups when possible
- Include/try server-side metrics + the new broker metrics
- Reactive scaling is not always good (better to predict lag than to act when the consumer group is already lagging)
- Scale based on your business needs
- Scale X instances at once and don't overload your partition leader, to reduce rebalancing <- Could we do better?
87.
New consumer group protocol
- Consumers are not responsible for keeping state
- The leader is not responsible for calculating the assignment
- Simpler
- Not all problems are gone
88.
New consumer group protocol
- Pay more attention to broker health. This might be another important dimension for your metrics
- Life for consumers should become easier, and some metrics become obsolete
- Life for Kafka providers like Confluent, AWS and Aiven becomes harder, but that is not your problem ;)
89.
Conclusion
- Use static groups when possible
- Include/try server-side metrics + the new broker metrics
- Reactive scaling is not always good (better to predict lag than to act when the consumer group is already lagging)
- Scale based on your business needs
- Scale X instances at once and don't overload your partition leader <- Not a problem anymore
- You can try the new protocol version soon (preview in 3.7)
90.
We want to scale efficiently and effectively:
1. Know when to scale (and when not)
2. Minimize unnecessary data movement
3. Avoid unnecessary rebalancing
101.
Remember
● Scaling infinitely is not possible
● Use static groups and the CooperativeStickyAssignor
● Pay attention to broker and consumer health
● Predict lag; don't act only when the consumer group is already lagging
● Define business rules and control them with metrics
103.
Olena Kutsenko
twitter.com/OlenaKutsenko
linkedin.com/in/olenakutsenko
Olena Babenko
linkedin.com/in/melhelen/
https://github.com/anelook/mastering-kafka-consumer-distribution
Register for Aiven for Apache Kafka and get extra credits:
104.
Olena Kutsenko
twitter.com/OlenaKutsenko
linkedin.com/in/olenakutsenko
Olena Babenko
linkedin.com/in/melhelen/
Find us at #108
https://github.com/anelook/mastering-kafka-consumer-distribution
Register for Aiven for Apache Kafka and get extra credits: