olena@aiven.io
@OlenaKutsenko aiven.io Olena Babenko:
Mastering Kafka consumer distribution
A guide to efficient scaling and resource optimization
Olena Kutsenko
Sr. Developer Advocate
Aiven
Olena Babenko
Staff Software Engineer
Aiven
➔ why scaling consumers is not always desirable
➔ why consumer lag isn’t the metric you want to rely on
➔ how not to scale stateful consumers
➔ what the most anticipated change in the rebalancing protocol is
➔ how to find the right balance between latency, durability and costs
1 Definition
● What is rebalancing?
● Why do we need it?
[Diagram: producers write to a topic and consumers read from it. The topic is split into Partitions 1-4, which are hosted on Broker 1 and Broker 2. Consumers 1-3 form a consumer group that shares the partitions; Consumer 4 then joins the group.]
We need efficient rebalancing for:
● Scalability
● Elasticity
● Fault tolerance
Moving partition ownership from one consumer to another is called a rebalance
Side effects of rebalancing:
● Increased consumer lag, latency and reduced throughput
● Increased resource utilization
● Potential data duplication or data loss
● Increased complexity
Rebalancing has a lot in common with
cooking
We want to scale efficiently and effectively:
1. Know when to scale (and when not)
2. Minimize unnecessary data movement
3. Avoid unnecessary rebalancing
Rebalancing is teamwork
[Diagram: the consumer group (Consumers 1-4) reading Partitions 1-4 from Broker 1 and Broker 2. One broker acts as the group coordinator and one consumer acts as the group leader.]
Group coordinator - broker
Group leader - consumer
3 Status quo
Incremental cooperative rebalance
Group coordinator
Consumer 1: member.id=c1, partitions 1 & 2
Consumer 2: member.id=c2, partition 3
Consumer 3: new
1. Consumer 3 → Group coordinator: JoinGroupRequest {member.id=unknown}
2. Group coordinator → Consumer 1, Consumer 2: HeartbeatResponse {REBALANCE_IN_PROGRESS}
3. Consumer 1, Consumer 2 → Group coordinator: JoinGroupRequest {member.id=c1}, JoinGroupRequest {member.id=c2}
4. Group coordinator → all consumers: JoinGroupResponse; the group leader's copy carries {memberId, member list & subscriptions}
5. All consumers → Group coordinator: SyncGroupRequest; the group leader's copy carries {assignment plan}
6. Group coordinator → all consumers: SyncGroupResponse {assignment plan}
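To make the order of messages easier to follow, here is a minimal sketch that replays the exchange in plain Python. It does not use any Kafka client: the function name, the `c<N>` id scheme for the new member, the choice of the first member as group leader, and the round-robin plan are all assumptions made for illustration.

```python
def simulate_rebalance(existing_members, partitions):
    """Replay the join/sync exchange for one new consumer joining the group."""
    log = ["new -> coordinator: JoinGroupRequest {member.id=unknown}"]
    # Existing members learn about the rebalance from their next heartbeat.
    for m in existing_members:
        log.append(f"coordinator -> {m}: HeartbeatResponse {{REBALANCE_IN_PROGRESS}}")
    # They rejoin with the member ids they already hold.
    for m in existing_members:
        log.append(f"{m} -> coordinator: JoinGroupRequest {{member.id={m}}}")
    # Hypothetical id scheme for the new member: c<N>.
    members = list(existing_members) + [f"c{len(existing_members) + 1}"]
    leader = members[0]  # assumption: the first member acts as group leader
    for m in members:
        extra = " {member list & subscriptions}" if m == leader else ""
        log.append(f"coordinator -> {m}: JoinGroupResponse{extra}")
    # The leader computes the plan; here a plain round-robin over partitions.
    plan = {m: [] for m in members}
    for i, p in enumerate(partitions):
        plan[members[i % len(members)]].append(p)
    for m in members:
        extra = " {assignment plan}" if m == leader else ""
        log.append(f"{m} -> coordinator: SyncGroupRequest{extra}")
    for m in members:
        log.append(f"coordinator -> {m}: SyncGroupResponse {{partitions={plan[m]}}}")
    return plan, log


plan, log = simulate_rebalance(["c1", "c2"], [1, 2, 3])
print("\n".join(log))
```

Every member takes part in every phase, which is why one slow participant can hold up the whole group.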
You probably already see some
bottlenecks….
Drama points
1. Group coordinator is too slow
2. Group leader is too slow
3. Some consumers are too slow

# of consumers | Probability of success per instance | Overall probability
6 | 99% | 0.99^6 ≈ 0.94 = 94%
100 | 99% | 0.99^100 ≈ 0.366 ≈ 37%
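The numbers in the table can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; it assumes every consumer succeeds independently with the same probability, and the function name is illustrative.

```python
def overall_success(per_instance: float, consumers: int) -> float:
    """Probability that all consumers succeed, assuming independence."""
    return per_instance ** consumers


for n in (6, 100):
    print(f"{n} consumers: {overall_success(0.99, n):.3f}")
```

The overall probability falls off exponentially with group size, so large groups are much more likely to have at least one slow member during a rebalance.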
@OlenaKutsenko aiven.io Olena Babenko:
Drama points
1. Group coordinator is too slow
2. Group leader is too slow
3. Some of consumers are too slow
4. A new node is stuck in rebalancing
5. onPartitionsRevoked dark hole
Consumers apply the new assignment plan:
1. What partitions are newly assigned and what are now revoked
2. Start reading from newly assigned partitions
3. If any existing partitions are revoked trigger a new rebalance
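The three steps above amount to a set difference. The sketch below is an illustration only, not the consumer client's actual code; the function name and return shape are assumptions.

```python
def apply_assignment(owned, assigned):
    """Compare the currently owned partitions with the new plan.

    Returns (newly_assigned, revoked, needs_follow_up_rebalance).
    """
    owned, assigned = set(owned), set(assigned)
    newly_assigned = sorted(assigned - owned)  # step 2: start reading these
    revoked = sorted(owned - assigned)         # given up by this consumer
    # Step 3: a revocation means another rebalance round is needed
    # before the revoked partitions can be handed to their new owner.
    return newly_assigned, revoked, bool(revoked)


# Consumer 1 owned partitions 1 and 2 and keeps only partition 1.
print(apply_assignment([1, 2], [1]))
```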
4 Metrics
When to scale
Scale horizontally vs vertically
Vertical scaling Pros:
- Less rebalancing if static members are used (group.instance.id config).
- More flexible when you run out of resources (CPU, RAM, disk, etc.).
Vertical scaling Cons:
- Lots of partitions on one node is not always good either: one hot
partition could hog all resources.
- Bigger machines are not always possible.
- State might be lost
Horizontal scaling is time-consuming and risky.
How to make it more efficient?
Built-in consumer metrics
records-lag-max
Lag is a very common metric for spotting that too much is going on,
especially if the lag is across ALL or a majority of partitions.
records-consumed-rate
The average number of records consumed per second
join-rate
If only one partition is lagging, that might be an error or a problem with
group joins. join-rate helps to monitor whether something is wrong with rebalancing
Built-in consumer metrics
Pros:
- The simplest option to start with
Cons:
- Measured on the consumer side, so they depend on the consumer's health and state
Kafka cluster metrics
Generic Kafka Cluster metrics
- Also have lag info
- Less biased
- More info about producers and events production
- Additional important info about group coordinator health
Autoscale
KEDA (Apache Kafka scaler)
lagThreshold
Can be tuned to scale instances based on lag.
activationLagThreshold
The activating (or deactivating) phase is the moment when KEDA
(operator) has to decide if the workload should be scaled from/to zero
+ many more
+ a lot more if you choose the Prometheus trigger (custom metrics)
Knative
- Scales up and down faster, based on the number of events
- You can scale Kafka Source using KEDA
- Great at handling spikes
- Reusability of resources
- Keeps the same pod identity while replacing nodes (reduces the amount of
rebalancing during failures)
- More complicated
Lag only grows after autoscale
Autoscale using lag is not always optimal
- Lots of joins/rebalancing can make event consumption slower, so even more
nodes will be requested as a result
- Too much pressure on the one leading node
- The lag metric doesn’t answer the question of WHAT CAUSED THE LAG (it is not
always a lack of resources)
As a result:
- Fast autoscaling might be problematic
Lag == Money 💰 ?
Lag == Money 💰 !!
Time
⌛
Time is a more universal unit
- Lag depends on message sizes, on batch.size, and on linger.ms
- Time is a more universal unit for many businesses: you probably
know how much it costs to delay an order by 2 hours, or to pay for 5 minutes
of website downtime.
- AWS, Confluent, Aiven, etc. usually provide server-side
time-related metrics such as Estimated Time Lag or Latency
Simplest way to calculate time lag
- Take the latest offset from a consumer group
- Read the timestamp of the committed/consumed message from the topic
- Compare it with the current time
Pros:
- Accurate
Cons:
- Need to fetch the whole message (might be big)
- Need to do this quite often
- Does not scale well across multiple producers, consumer groups, topics
and partitions
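The comparison step itself is tiny; the expensive part is fetching the record. A minimal sketch, assuming the record at the group's latest committed offset has already been read with whatever Kafka client you use, and that its timestamp is in milliseconds since the epoch (the function name is illustrative):

```python
import time


def time_lag_seconds(record_timestamp_ms, now_ms=None):
    """Time lag = current time minus the timestamp of the last consumed record."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Clamp at zero in case of clock skew between producer and observer.
    return max(0.0, (now_ms - record_timestamp_ms) / 1000.0)


# A record stamped 5 seconds before "now" gives a 5 second time lag.
print(time_lag_seconds(1_000, now_ms=6_000))
```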
Serglo
- Builds an interpolation table to eliminate the disadvantages of the simple
method
- The latest committed/consumed message gets an
approximated (predicted) timestamp
- The predicted timestamp is compared with the current time to return the time lag
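One way to picture the interpolation table is as a sorted list of (offset, timestamp) samples. The sketch below assumes linear interpolation between neighbouring samples with strictly increasing offsets; the slide does not say which interpolation is used, so treat this as one possible reading, with hypothetical names throughout.

```python
from bisect import bisect_left


def predict_timestamp_ms(table, offset):
    """Estimate the timestamp of `offset` from sorted (offset, timestamp_ms) samples."""
    offsets = [o for o, _ in table]
    i = bisect_left(offsets, offset)
    if i == 0:
        return table[0][1]       # at or before the first sample
    if i == len(table):
        return table[-1][1]      # after the last sample
    (o0, t0), (o1, t1) = table[i - 1], table[i]
    # Linear interpolation between the two surrounding samples.
    return t0 + (t1 - t0) * (offset - o0) / (o1 - o0)


table = [(100, 1_000), (200, 3_000)]
predicted = predict_timestamp_ms(table, 150)
print(predicted)
```

The time lag is then the current time minus the predicted timestamp, so no message has to be fetched for each measurement.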
Aiven time lag predictor
Aiven Lag predictor
Checkpoint 1: 09:00
Checkpoint 2: 09:05
Consumption speed:
100 - 70 = 30 records per 5 seconds
= 30 / 5 = 6 records per second
Left to consume:
180 - 100 = 80 records
= 80 / 6 = 13.333 seconds to catch up
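The same arithmetic as a small function, using the numbers from the slide. This is a sketch of the calculation only, not the Aiven implementation; the parameter names are illustrative, and it ignores records produced while the consumer is catching up.

```python
def catch_up_seconds(consumed_before, consumed_now, produced_now, interval_s):
    """Estimate how long the consumer needs to reach the latest produced record."""
    rate = (consumed_now - consumed_before) / interval_s  # records per second
    if rate <= 0:
        return float("inf")  # no progress between checkpoints
    return (produced_now - consumed_now) / rate


# 70 -> 100 records consumed in 5 seconds, 180 records produced so far.
print(round(catch_up_seconds(70, 100, 180, 5), 3))
```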
Aiven Lag predictor
kafka_lag_predictor_group_lag_predicted_seconds estimates how much
time you need to catch up at the current producing and consuming speed.
OR
Estimates WHEN an event that was published right NOW will be
consumed.
More data points give more precise results.
Compare
- Serglo works at the individual message level (it deduces the timestamp of
the one last message), while the Aiven lag predictor is about overall
speed. Both can be useful in the right context
- Either option usually works well
- They can give slightly different results, and expectations might
differ
More metrics for better conclusion
Aiven Lag predictor
Server-side metrics, defined at the same time:
`kafka_lag_predictor_topic_produced_records_total`: the total
count of records produced (per partition)
`kafka_lag_predictor_group_consumed_records_total`: the
total count of records consumed (per partition)
We want to scale efficiently and effectively:
1. Know when to scale (and when not)
2. Minimize unnecessary data movement
3. Avoid unnecessary rebalancing - Scale multiple instances at once
Scaling ratio
Aiven lag predictor (per partition):
- Change over time of
AVG(kafka_lag_predictor_topic_produced_records_total /
kafka_lag_predictor_group_consumed_records_total)
Client-side alternative (per topic):
- Change over time of AVG(record-send-total / records-consumed-total)
- OR, per second, record-send-rate / records-consumed-rate
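As a rough illustration of how such a ratio could drive "scale multiple instances at once", the sketch below turns produce and consume rates into a target instance count. The function names are hypothetical, and the sizing rule assumes throughput grows linearly with the number of instances, which is a simplification. The cap reflects that consumers beyond the partition count stay idle.

```python
import math


def scaling_ratio(produce_rate, consume_rate):
    """Ratio > 1 means producers are outpacing the consumer group."""
    if consume_rate <= 0:
        return float("inf")
    return produce_rate / consume_rate


def instances_needed(current_instances, ratio, partitions):
    """Target instance count in one step, capped at the partition count."""
    if math.isinf(ratio):
        return partitions
    return min(partitions, max(1, math.ceil(current_instances * ratio)))


ratio = scaling_ratio(900, 600)
print(ratio, instances_needed(4, ratio, partitions=12))
```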
We want to scale efficiently and effectively:
1. Know when to scale (and when not) - Identify other issues that caused lag
2. Minimize unnecessary data movement
3. Avoid unnecessary rebalancing
Rebalancing issues
Aiven lag predictor (per partition):
- Max over time, for all consumers of the topic, of
kafka_lag_predictor_group_consumed_records_total == 0
New server-side metrics (KIP-714):
- consumer.coordinator.assigned.partitions != 0
- consumer.coordinator.rebalance.latency.max
Client-side alternative:
- join-rate (Consumer)
Consumption issues
Aiven lag predictor (per partition):
- Max over time per partition, per topic
kafka_lag_predictor_group_consumed_records_total == 0
New server side metrics (KIP-714):
- consumer.fetch.manager.fetch.latency.max
- consumer.node.request.latency.max
- consumer.connection.creation.total
Production issues (Scale down)
Aiven lag predictor (per partition):
- Max over time of
kafka_lag_predictor_topic_produced_records_total == 0
- Hot partition: compare
MAX(kafka_lag_predictor_topic_produced_records_total) with
AVG(kafka_lag_predictor_topic_produced_records_total)
New server-side metrics (KIP-714):
- producer.record.queue.time.max
- producer.node.request.latency.max
So! When to Scale?
What is important for your business?
- Define relevant business rules to predict problems
- Rules can be a combination of different metrics:
- Time lag estimation
- Producer health/speed
- Consumer health/speed
- Group Coordinator metrics
Example: Burrow. It takes offsets and other metrics and transforms them
into a status.
Conclusion
- Use static group membership when possible
- Include/try server-side metrics + new broker metrics
- Reactive scaling is not always good (better to predict lag than to act
when the consumer group is already lagging)
- Scale based on your business needs
- Scale X instances at once, and do not overload your partition leader, to
reduce rebalancing <- Could we do better?
Yes!
A new consumer protocol! Yay!
5 The future
KIP-848
KIP-848: The Next Generation of the Consumer Rebalance Protocol
Should we redo everything right now?
New consumer group protocol
- Consumers are not responsible for keeping state
- Leader is not responsible for calculating assignment
- Simpler
- Not all problems are gone
New consumer group protocol
- Pay more attention to broker health. This might be another important
dimension to your metrics
- Life for consumers should become easier, and some metrics will become
obsolete
- Life for Kafka providers like Confluent, AWS, Aiven becomes harder, but
it is not your problem ;)
Conclusion
- Use static group membership when possible
- Include/try server-side metrics + new broker metrics
- Reactive scaling is not always good (better to predict lag than to act
when the consumer group is already lagging)
- Scale based on your business needs
- Scale X instances at once and do not overload your partition leader <-
Not a problem anymore
- You can try a new protocol version soon (3.7 preview)
6 Stateful consumers
Assignors
Assignors
➔ RangeAssignor
➔ RoundRobinAssignor
➔ CooperativeStickyAssignor
RangeAssignor
[Diagram: Topic 1 and Topic 2, each with Partition 1 and Partition 2, assigned to Consumers 1-3 of one consumer group by the RangeAssignor]
RoundRobinAssignor
[Diagram: the same two topics and three consumers, assigned by the RoundRobinAssignor]
CooperativeStickyAssignor
[Diagram: the same two topics and three consumers, assigned by the CooperativeStickyAssignor]
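To make the difference between the first two assignors concrete, here is a minimal sketch of range-style and round-robin-style assignment for the setup in the diagrams (two topics with two partitions each, three consumers). It is a simplified model rather than the Kafka implementation: it assumes every consumer subscribes to every topic, numbers partitions from 1 to match the slides, and does not model the stickiness of the CooperativeStickyAssignor.

```python
def range_assign(consumers, topics):
    """Per topic, hand out contiguous partition ranges to the sorted consumers."""
    members = sorted(consumers)
    plan = {c: [] for c in members}
    for topic in sorted(topics):
        per, extra = divmod(topics[topic], len(members))
        start = 1
        for i, c in enumerate(members):
            count = per + (1 if i < extra else 0)  # first `extra` members get one more
            plan[c] += [(topic, p) for p in range(start, start + count)]
            start += count
    return plan


def round_robin_assign(consumers, topics):
    """Lay out all partitions of all topics and deal them out in turn."""
    members = sorted(consumers)
    plan = {c: [] for c in members}
    all_partitions = [(t, p) for t in sorted(topics) for p in range(1, topics[t] + 1)]
    for i, tp in enumerate(all_partitions):
        plan[members[i % len(members)]].append(tp)
    return plan


consumers = ["Consumer 1", "Consumer 2", "Consumer 3"]
topics = {"Topic 1": 2, "Topic 2": 2}
print(range_assign(consumers, topics))
print(round_robin_assign(consumers, topics))
```

In this model the range-style plan leaves Consumer 3 idle, because each topic on its own has fewer partitions than there are consumers, while the round-robin plan gives every consumer at least one partition.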
Remember
● Scaling infinitely is not possible
● Use static group membership and CooperativeStickyAssignor
● Pay attention to broker and consumer health
● Predict lag rather than acting when the consumer group is already lagging
● Define business rules and control them with metrics
Olena Kutsenko
twitter.com/OlenaKutsenko
linkedin.com/in/olenakutsenko
Olena Babenko
linkedin.com/in/melhelen/
https://github.com/anelook/mastering-kafka-consumer-distribution
Find us at #108
Register for Aiven for Apache Kafka and get extra credits:

 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 

Recently uploaded (20)

Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 

Mastering Kafka Consumer Distribution: A Guide to Efficient Scaling and Resource Optimization

  • 1. olena@aiven.io @OlenaKutsenko aiven.io Olena Babenko: Mastering Kafka consumer distribution A guide to efficient scaling and resource optimization Olena Kutsenko Sr. Developer Advocate Aiven Olena Babenko Staff Software Engineer Aiven
  • 2. ➔ why scaling consumers is not always desirable ➔ why consumer lag isn’t the metric you want to rely on ➔ how not to scale stateful consumers ➔ what is the most anticipated change in the rebalancing protocol ➔ how to find the right balance between latency, durability and costs
  • 3. Definition 1 ● What is rebalancing? ● Why do we need it?
  • 8-11. [Diagram: a topic with Partitions 1-4 hosted on Broker 1 and Broker 2, read by Consumers 1-3, which form a consumer group; Consumer 4 then joins the group]
  • 12. Moving ownership from one consumer to another is called a rebalance. We need efficient rebalancing for: ● Scalability ● Elasticity ● Fault tolerance
  • 13. Side effects of rebalancing: ● Increased consumer lag and latency, reduced throughput ● Increased resource utilization ● Potential data duplication or data loss ● Increased complexity
  • 14. Rebalancing has a lot in common with cooking
  • 15. We want to scale efficiently and effectively: 1. Know when to scale (and when not) 2. Minimize unnecessary data movement 3. Avoid unnecessary rebalancing
  • 16. Rebalancing is teamwork
  • 17-19. [Diagram: Partitions 1-4 on Broker 1 and Broker 2, read by a consumer group of Consumers 1-4] The group coordinator is a broker. The group leader is a consumer.
  • 21-29. [Sequence diagram: Group coordinator; Consumer 1 (member.id=c1, partitions 1 & 2); Consumer 2 (member.id=c2, partition 3); Consumer 3 (new)] 1. Consumer 3 sends JoinGroupRequest {member.id=unknown} to the group coordinator. 2. Consumers 1 and 2 each receive HeartbeatResponse {REBALANCE_IN_PROGRESS}. 3. Consumers 1 and 2 rejoin with JoinGroupRequest {member.id=c1} and JoinGroupRequest {member.id=c2}. 4. The coordinator answers every member with a JoinGroupResponse; the one sent to the group leader carries {memberId, member list & subscriptions}. 5. Every member sends a SyncGroupRequest; the leader's carries {assignment plan}. 6. The coordinator answers every member with SyncGroupResponse {assignment plan}.
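The join and sync exchange on slides 21-29 can be sketched as a toy in-memory model. This is only an illustration of the message flow: `ToyCoordinator` and `leader_assign` are hypothetical names, the first member to join is assumed to be the leader, and the leader is synced first because the toy does not hold responses back the way a real broker does.

```python
class ToyCoordinator:
    """In-memory stand-in for the group coordinator (not real broker logic)."""

    def __init__(self):
        self.members = []  # member ids in join order; members[0] acts as leader
        self.plan = {}     # member id -> partitions, uploaded by the leader

    def join_group(self, member_id=None):
        # JoinGroupRequest: a member with an unknown id is given a new one.
        if member_id is None:
            member_id = f"c{len(self.members) + 1}"
        if member_id not in self.members:
            self.members.append(member_id)
        return member_id

    def join_response(self, member_id):
        # JoinGroupResponse: only the leader receives the full member list.
        is_leader = member_id == self.members[0]
        return {
            "member_id": member_id,
            "leader": self.members[0],
            "members": list(self.members) if is_leader else [],
        }

    def sync_group(self, member_id, plan=None):
        # SyncGroupRequest: the leader uploads the plan, followers send nothing.
        if plan is not None:
            self.plan = plan
        # SyncGroupResponse: each member gets only its own assignment.
        return self.plan.get(member_id, [])


def leader_assign(members, partitions):
    """Hypothetical leader-side assignment: spread partitions round-robin."""
    plan = {m: [] for m in members}
    for i, partition in enumerate(partitions):
        plan[members[i % len(members)]].append(partition)
    return plan


coordinator = ToyCoordinator()
for existing in ("c1", "c2"):
    coordinator.join_group(existing)      # Consumers 1 and 2 rejoin
new_id = coordinator.join_group()         # Consumer 3, member.id unknown

leader_view = coordinator.join_response("c1")
plan = leader_assign(leader_view["members"], [1, 2, 3])

assignments = {"c1": coordinator.sync_group("c1", plan)}
for member in ("c2", new_id):
    assignments[member] = coordinator.sync_group(member)
print(assignments)
```

Every member has to complete both round trips before anyone resumes consuming, which is why a single slow participant stalls the whole group.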
  • 30. You probably already see some bottlenecks…
  • 31-33. Drama points: 1. Group coordinator is too slow 2. Group leader is too slow 3. Some of the consumers are too slow
  • 34-38. Drama points, 3. Some of the consumers are too slow:
      # of consumers | Probability of success per instance | Overall probability
      6              | 99%                                 | 0.99^6 = 0.94 = 94%
      100            | 99%                                 | 0.99^100 = 0.366 = 37%
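The arithmetic in the table can be reproduced with a short sketch. It assumes, as the slide does implicitly, that each consumer succeeds independently with the same probability; the function name is illustrative.

```python
def overall_success_probability(per_instance: float, consumers: int) -> float:
    """Probability that every consumer gets through the rebalance,
    assuming consumers succeed or fail independently of each other."""
    return per_instance ** consumers


for n in (6, 100):
    p = overall_success_probability(0.99, n)
    print(f"{n} consumers: {p:.3f} ({p:.0%})")
```

The per-instance figure stays at 99% in both rows; only the group size changes, which is why large groups are so much more likely to hit a slow member.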
  • 39-42. Drama points (continued): 4. A new node is stuck in rebalancing 5. onPartitionsRevoked black hole. Consumers apply the new assignment plan: 1. Work out which partitions are newly assigned and which are now revoked 2. Start reading from the newly assigned partitions 3. If any existing partitions are revoked, trigger a new rebalance
  • 44. Scaling horizontally vs vertically. Vertical scaling pros: - Less rebalancing if static members are used (group.instance.id config) - More flexible when you run out of resources (CPU, RAM, disk, etc.). Vertical scaling cons: - Many partitions on one node is not always good either: one hot partition could hog all resources - Bigger machines are not always possible - State might be lost
  • 45. Horizontal scaling is time-consuming and risky. How can we make it more efficient?
  • 46. Built-in consumer metrics. records-lag-max: lag is a very common metric for spotting that too much is going on, especially if the lag is across ALL or the majority of partitions. If only one partition is lagging, that might be an error or a problem with Job Groups. records-consumed-rate: the average number of records consumed per second. join-rate: helps to monitor whether something is wrong with rebalancing.
  • 47. Built-in consumer metrics. Pros: - The simplest option to start with. Cons: - Collected on the consumer side, so they depend on the consumer's health and state
  • 48. Kafka cluster metrics
  • 49. Generic Kafka cluster metrics: - Also have lag info - Less biased - More info about producers and event production - Additional important info about group coordinator health
  • 51. KEDA (Apache Kafka scaler). lagThreshold: can be tuned to scale instances based on lag. activationLagThreshold: the activation (or deactivation) phase is the moment when the KEDA operator has to decide whether the workload should be scaled from/to zero. + many more + a lot more if you choose the Prometheus trigger (custom metrics)
  • 52. Knative: - Scales up and down faster, based on the number of events - You can scale Kafka Source using KEDA - Great at handling spikes - Reusability of resources - Keeps the same pod identity while replacing nodes (reduces the amount of rebalancing during failure) - More complicated
  • 53. Lag only grows after autoscaling
  • 54. Autoscaling on lag is not always optimal: - Lots of joins/rebalancing can make event consumption slower, and more nodes will be requested as a result - Too much pressure on the one leading node - The lag metric doesn’t answer the question of WHAT CAUSED THE LAG (it is not always a lack of resources). As a result: - Fast autoscaling might be problematic
  • 56. Lag == Money 💰 ?
  • 57. Lag == Money 💰 !! Time ⌛
  • 58. Time is a more universal unit: - Lag depends on message sizes, on batch.size, on linger.ms - Time is a more universal unit for many businesses: you probably know how much it costs to delay an order by 2 hours, or to have the website down for 5 minutes - AWS, Confluent, Aiven, etc. usually provide server-side time-related metrics such as Estimated Time Lag or Latency
  • 59. Simplest way to calculate time lag: - Take the latest offset from a consumer group - Read the timestamp of the committed/consumed message from the topic - Compare it with the current time. Pros: - Accurate. Cons: - Need to fetch the whole message (which might be big) - Need to do this quite often - Does not scale well to multiple producers, consumer groups, topics and partitions
  • 60. Serglo: - Builds an interpolation table to eliminate the disadvantages of the simple method - The latest committed/consumed message gets an approximated (predicted) timestamp - The predicted timestamp is compared with the current time to return the time lag
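The interpolation idea can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: it assumes the table holds (offset, timestamp) samples of the partition's latest offset with strictly increasing offsets, and the function names are hypothetical.

```python
from bisect import bisect_left


def predict_timestamp(table, offset):
    """Estimate when `offset` was produced by linear interpolation over a
    sorted list of (offset, timestamp_seconds) samples. Offsets outside the
    table are extrapolated from the nearest two samples."""
    if len(table) < 2:
        raise ValueError("need at least two (offset, timestamp) samples")
    offsets = [o for o, _ in table]
    # Pick the pair of neighbouring samples that brackets the offset.
    i = min(max(bisect_left(offsets, offset), 1), len(table) - 1)
    (o0, t0), (o1, t1) = table[i - 1], table[i]
    return t0 + (offset - o0) * (t1 - t0) / (o1 - o0)


def time_lag_seconds(table, committed_offset, now):
    """Time lag: current time minus the predicted produce time of the
    last committed offset (never negative)."""
    return max(0.0, now - predict_timestamp(table, committed_offset))


# Latest offset sampled every 10 seconds.
samples = [(100, 1000.0), (200, 1010.0), (300, 1020.0)]
print(time_lag_seconds(samples, committed_offset=250, now=1030.0))
```

Only offsets and sample times are needed, so no message has to be fetched, which addresses the main drawbacks listed for the simple method.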
  • 62. Aiven time lag predictor
  • 63. Aiven Lag predictor. Checkpoint 1: 09:00
  • 64. Aiven Lag predictor. Checkpoint 2: 09:05
  • 65. Aiven Lag predictor. Consumption speed: 100 - 70 = 30 records per 5 seconds, so 30 / 5 = 6 records per second. Left to consume: 180 - 100 = 80 records, so 80 / 6 = 13.333 seconds to catch up
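The worked example above translates directly into code. This sketch mirrors only the slide's arithmetic (it ignores records produced while catching up) and is not the Aiven implementation; the function and parameter names are illustrative.

```python
def seconds_to_catch_up(consumed_before, consumed_now, latest_offset, interval_seconds):
    """Estimate catch-up time from two checkpoints of the consumed offset."""
    remaining = max(0, latest_offset - consumed_now)
    if remaining == 0:
        return 0.0
    consumed = consumed_now - consumed_before
    if consumed <= 0:
        return float("inf")  # nothing was consumed, so it never catches up
    rate = consumed / interval_seconds  # records per second
    return remaining / rate


# Slide values: offset 70 -> 100 over 5 seconds, latest offset 180.
print(round(seconds_to_catch_up(70, 100, 180, 5), 3))
```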
  • 66. Aiven Lag predictor. kafka_lag_predictor_group_lag_predicted_seconds: estimates how much time you need to catch up at the current producing and consuming speed, OR estimates WHEN an event published right NOW will be consumed. More data points give more precise results.
  • 67. Compare: - Serglo works more at the individual message level (it deduces the timestamp of the one last message), while the Aiven lag predictor is more about overall speed. Both can be useful in the right context - Either option usually works well - They can give you slightly different results, and expectations might be different
  • 68. More metrics for better conclusions
  • 69-70. Aiven Lag predictor. Server-side metrics, defined at the same time: `kafka_lag_predictor_topic_produced_records_total`: the total count of records produced (per partition). `kafka_lag_predictor_group_consumed_records_total`: the total count of records consumed (per partition).
  • 72. We want to scale efficiently and effectively: 1. Know when to scale (and when not) 2. Minimize unnecessary data movement 3. Avoid unnecessary rebalancing - Scale multiple instances at once
  • 73. Scaling ratio. Aiven lag predictor (per partition): - Change over time of AVG(kafka_lag_predictor_topic_produced_records_total / kafka_lag_predictor_group_consumed_records_total). Client-side alternative (per topic): - Change over time of AVG(record-send-total / records-consumed-total) - OR per second: record-send-rate / records-consumed-rate
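One way to turn the ratio into a scaling decision is sketched below. It uses the rate-based variant (records produced over records consumed within the same window). The `suggested_consumers` policy is an assumption added for illustration, not something the slides prescribe; it caps the result at the partition count because extra consumers in a group would sit idle.

```python
import math


def scaling_ratio(produced_start, produced_end, consumed_start, consumed_end):
    """Records produced divided by records consumed over the same window.
    A value above 1 means consumers are falling behind."""
    produced = produced_end - produced_start
    consumed = consumed_end - consumed_start
    if consumed <= 0:
        return float("inf") if produced > 0 else 1.0
    return produced / consumed


def suggested_consumers(current, ratio, partitions):
    """Hypothetical policy: jump to the target size in one step so the
    group rebalances once instead of once per added instance."""
    if math.isinf(ratio):
        return partitions
    return max(1, min(partitions, math.ceil(current * ratio)))


ratio = scaling_ratio(1000, 1600, 900, 1200)  # 600 produced vs 300 consumed
print(ratio, suggested_consumers(3, ratio, partitions=12))
```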
  • 75. We want to scale efficiently and effectively: 1. Know when to scale (and when not): identify other issues that caused lag 2. Minimize unnecessary data movement 3. Avoid unnecessary rebalancing
  • 76. Rebalancing issues. Aiven lag predictor (per partition): - Max over time, for all consumers in the topic, of kafka_lag_predictor_group_consumed_records_total == 0. New server-side metrics (KIP-714): - consumer.coordinator.assigned.partitions != 0 - consumer.coordinator.rebalance.latency.max. Client-side alternative: - join-rate (consumer)
  • 77. Consumption issues. Aiven lag predictor (per partition): - Max over time, per partition and per topic, of kafka_lag_predictor_group_consumed_records_total == 0. New server-side metrics (KIP-714): - consumer.fetch.manager.fetch.latency.max - consumer.node.request.latency.max - consumer.connection.creation.total
  • 78. Production issues (scale down). Aiven lag predictor (per partition): - Max over time of kafka_lag_predictor_topic_produced_records_total == 0 - Hot partition: compare MAX(kafka_lag_predictor_topic_produced_records_total) with AVG(kafka_lag_predictor_topic_produced_records_total). New server-side metrics (KIP-714): - producer.record.queue.time.max - producer.node.request.latency.max
  • 79. So! When to Scale?
  • 80. What is important for your business? - Define relevant business rules to predict problems - Rules can be a combination of different metrics: - Time lag estimation - Producer health/speed - Consumer health/speed - Group coordinator metrics. Example: Burrow, which takes offsets and other metrics and transforms them into a status.
  • 81. Conclusion: - Use static groups when possible - Include/try server-side metrics + new broker metrics - Reactive scaling is not always good (it is better to predict lag than to act when the consumer group is already lagging) - Scale based on your business needs - Scale X instances at once and do not overload your partition leader, to reduce rebalancing <- Could we do better?
  • 83. A new consumer protocol! Yay!
  • 85. KIP-848: The Next Generation of the Consumer Rebalance Protocol
  • 86. Should we redo everything right now?
  • 87. New consumer group protocol: - Consumers are not responsible for keeping state - The leader is not responsible for calculating the assignment - Simpler - Not all problems are gone
  • 88. New consumer group protocol: - Pay more attention to broker health. This might be another important dimension for your metrics - The life of consumers should become easier, and some metrics will become obsolete - The life of Kafka providers like Confluent, AWS and Aiven becomes harder, but that is not your problem ;)
  • 89. Conclusion: - Use static groups when possible - Include/try server-side metrics + new broker metrics - Reactive scaling is not always good (it is better to predict lag than to act when the consumer group is already lagging) - Scale based on your business needs - Scale X instances at once and do not overload your partition leader <- Not a problem anymore - You can try the new protocol version soon (3.7 preview)
  • 94. Assignors: ➔ RangeAssignor ➔ RoundRobinAssignor ➔ CooperativeStickyAssignor
  • 96. RangeAssignor [Diagram: Topic 1 and Topic 2, each with Partition 1 and Partition 2, assigned to Consumers 1-3 in one consumer group]
  • 97-98. RoundRobinAssignor [Diagram: Topic 1 and Topic 2, each with Partition 1 and Partition 2, assigned to Consumers 1-3 in one consumer group]
  • 99-100. CooperativeStickyAssignor [Diagram: Topic 1 and Topic 2, each with Partition 1 and Partition 2, assigned to Consumers 1-3 in one consumer group]
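The difference between the first two assignors shows up clearly on the setup from the diagrams (two topics with two partitions each, three consumers). The sketch below is a simplified re-implementation for illustration, assuming every consumer subscribes to both topics; it is not the Kafka client code, partitions are numbered from 0 as in Kafka, and CooperativeStickyAssignor is not modelled because it depends on the previous assignment.

```python
def range_assign(consumers, topics):
    """Range strategy: each topic is split on its own into contiguous
    ranges, so the first consumers get the extras for every topic.
    `topics` maps topic name -> partition count."""
    ordered = sorted(consumers)
    assignment = {c: [] for c in ordered}
    for topic in sorted(topics):
        per_consumer, extra = divmod(topics[topic], len(ordered))
        start = 0
        for i, consumer in enumerate(ordered):
            count = per_consumer + (1 if i < extra else 0)
            assignment[consumer].extend((topic, p) for p in range(start, start + count))
            start += count
    return assignment


def round_robin_assign(consumers, topics):
    """Round-robin strategy: all partitions of all topics are laid out in
    one list and dealt to the consumers in turn."""
    ordered = sorted(consumers)
    assignment = {c: [] for c in ordered}
    all_partitions = [(t, p) for t in sorted(topics) for p in range(topics[t])]
    for i, partition in enumerate(all_partitions):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


group = ["consumer-1", "consumer-2", "consumer-3"]
topics = {"topic-1": 2, "topic-2": 2}
print(range_assign(group, topics))
print(round_robin_assign(group, topics))
```

With the range strategy the third consumer receives nothing, because each two-partition topic is divided separately; round-robin gives every consumer at least one partition.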
  • 101. Remember: ● Scaling infinitely is not possible ● Use static groups and CooperativeStickyAssignor ● Pay attention to broker and consumer health ● Predict lag; do not wait to act until the consumer group is already lagging ● Define business rules and control them with metrics
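As a minimal sketch of the "static groups and CooperativeStickyAssignor" advice, the consumer properties could look like the dictionary below. The property names are the ones used by librdkafka-based clients such as confluent-kafka; the Java client takes the assignor class name instead of `cooperative-sticky`. The function name, the example values and the idea of deriving the instance id from a stable pod name are assumptions for illustration.

```python
def static_consumer_config(bootstrap_servers, group_id, instance_id):
    """Build consumer properties for a static group member."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # A stable id per instance makes the member static: a restart within
        # the session timeout does not trigger a rebalance.
        "group.instance.id": instance_id,
        # Incremental (cooperative) rebalancing that keeps existing
        # assignments where possible.
        "partition.assignment.strategy": "cooperative-sticky",
        # Assumed value; tune it to how long a restart usually takes.
        "session.timeout.ms": 45000,
    }


config = static_consumer_config("localhost:9092", "orders", "orders-consumer-0")
print(config["group.instance.id"])
```

Each running instance needs its own unique `group.instance.id`, so a value tied to a stable identity, for example a StatefulSet pod name, is a natural choice.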
  • 102-104. Olena Kutsenko: twitter.com/OlenaKutsenko, linkedin.com/in/olenakutsenko. Olena Babenko: linkedin.com/in/melhelen/. https://github.com/anelook/mastering-kafka-consumer-distribution. Find us at #108.