Madrid office: C/ Francisco Silvela, 54 Duplicado 1ºD, 28028
Tel: 91 080 82 44
Barcelona office: C/ Madrazo 27-29 4ª, 08006
Tel: 933 68 52 46

3. KAFKA IN PRODUCTION: CONSIDERATIONS
- Configuration & rollout strategy
- Retention
- Replication
- Consumer lag
- Batching & compression
4. KAFKA IN PRODUCTION: CONFIGURATION & ROLLOUT
- Kafka configuration depends heavily on the use case and application needs.
- Having a set strategy for rolling out changes to the cluster without stopping the service is vital.
- There is no perfect configuration from the get-go; there are many parameters to fine-tune.
- A clear performance goal and agile ways to roll out changes will make your life a lot easier.
5. KAFKA IN PRODUCTION: RETENTION
- It's important to configure data retention in your Kafka cluster, tailored to the needs of the application.
- Retention can be configured by retention time and by retention volume.
- This space vs. time trade-off is really important to fine-tune.
- Have a red alert button! Whenever something goes wrong, relax the retention policies so you can fix the issue without losing data.
- Pro tip: Kafka supports time travel (timestamp-based offset lookup) from 0.10.1 onwards!
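The time and volume limits above can be set per topic. A minimal sketch with the stock `kafka-configs.sh` tool, assuming a hypothetical topic named `events` (exact flags vary by Kafka version; older releases use `--zookeeper` instead of `--bootstrap-server`):

```shell
# Keep data for 7 days OR until a partition reaches ~50 GiB, whichever hits first.
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name events \
  --add-config retention.ms=604800000,retention.bytes=53687091200

# "Red alert button": temporarily lift retention during an incident so nothing
# is deleted underneath you (-1 disables the limit); restore it afterwards.
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name events \
  --add-config retention.ms=-1,retention.bytes=-1
```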
6. KAFKA IN PRODUCTION: REPLICATION
- Have the right amount of replication for the application needs and data sensitivity.
- Too much replication leads to unnecessary cost and complexity.
- Too little replication won't let your SysAdmins sleep.
- Keep the replicated data in separate failure domains.
- Be careful with the hardware infrastructure: disk I/O grows steeply with replication.
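A sketch of creating a replicated topic with the stock `kafka-topics.sh` tool, assuming a hypothetical `events` topic; `min.insync.replicas=2` pairs with producer `acks=all` so a write succeeds only once two replicas have it (flags vary by Kafka version):

```shell
kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic events --partitions 12 --replication-factor 3 \
  --config min.insync.replicas=2
```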
7. KAFKA IN PRODUCTION: REPLICATION (diagram slide)
8. KAFKA IN PRODUCTION: CONSUMER LAG
- Consumer lag is one of the scariest problems when operating Kafka infrastructure.
- If it goes undetected and the retention policies kick in, you'll start losing data before it has been processed.
- Monitoring append lag vs. commit lag is important for an accurate diagnosis of the causes of the lag.
- Append lag is the most sensitive one to monitor.
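A minimal sketch of the two lag notions under one common offset-based definition (the real numbers would come from the consumer group's metadata and each partition's log end offset; the values below are hypothetical):

```python
def lags(log_end, consumed, committed):
    """Per-partition append lag and commit lag, offset-based.

    append lag : log end offset minus last consumed offset -- grows when
                 producers outpace the consumer's reads.
    commit lag : last consumed offset minus last committed offset -- grows
                 when the consumer reads but is slow to commit (e.g. slow
                 downstream processing before the commit).
    """
    return {p: {"append": log_end[p] - consumed[p],
                "commit": consumed[p] - committed[p]}
            for p in log_end}

# Hypothetical offsets for illustration.
log_end   = {0: 1500, 1: 2100}   # latest appended offset per partition
consumed  = {0: 1500, 1: 1400}   # consumer's read position
committed = {0: 1490, 1: 1400}   # last committed offset

lag = lags(log_end, consumed, committed)
# Partition 1 shows append lag (700): the consumer can't keep up with producers.
```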
9. KAFKA IN PRODUCTION: CONSUMER LAG (diagram slide)
10. KAFKA IN PRODUCTION: BATCHING
- Some use cases can benefit greatly from batching strategies.
- Batching can happen in the producer, the consumer, or both.
- Producer batching stresses the producer machine's resources (its send buffer) but lowers the total network and I/O requirements.
- Consumer batching lowers the resource usage of the consumer application at the cost of higher peak resource consumption on the cluster.
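Producer-side batching is mostly a matter of client configuration. A sketch with hypothetical values (`batch.size`, `linger.ms` and `compression.type` are standard producer settings; tune them against your own throughput and latency targets):

```properties
batch.size=65536        # max bytes buffered per partition before a send
linger.ms=20            # wait up to 20 ms to fill a batch before sending
compression.type=lz4    # compress whole batches; trades CPU for network/disk
```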
11. KAFKA IN PRODUCTION: BATCHING (diagram slide)
12. EVENT SOURCING & EVENT BASING
- A Kafka implementation varies greatly depending on whether one wants to keep the events as immutable state or delete them periodically.
- When building an application on event sourcing, or using Kafka as a data bus to connect microservices, event consistency is key.
- If, on the other hand, one is pipelining high-throughput data, scaling and ordering are the greater concern.
13. EVENT SOURCING
- Event versioning comes into play here. Since the events are persisted permanently in the queue, one needs to be able to read them at all times. A schema registry with Avro schemas is the best available tool to handle the versioning of events and of the entities they contain.
- Don't trust the queue: rebuild constantly. This not only ensures the consistency of one's schemas and events, but also enables useful DevOps deployment strategies, like zero-downtime database clusterization or application migrations.
- One needs to be really careful when scaling the cluster while relying on event sourcing, especially when dealing with topic partitions.
- Thankfully, this implementation generally puts the lowest load on the cluster compared to pipelining high traffic through it, so Kafka setup and resource management become easier.
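As an illustration of versioning, a hypothetical Avro schema for an `OrderCreated` event: adding the `currency` field with a default keeps the schema backward compatible, so events written before the field existed remain readable at all times:

```json
{
  "type": "record",
  "name": "OrderCreated",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount",   "type": "long"},
    {"name": "currency", "type": "string", "default": "EUR"}
  ]
}
```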
14. EVENT SOURCING
- Infrastructure isn't the only consideration, since event sourcing is an end-to-end effort.
- Clients and APIs need to be tailored for eventual consistency.
- On the client side, local state management through a store (especially when persisting changes) can prevent a large number of invalid events.
- On the API side, performing the required checks both before producing an event and right before persisting the data when consuming it can prevent undesired exception handling.
- Especially when scaling producers and partitions, it is really important to keep a bounded context at the entity level within the same pipeline. You don't want events produced on one partition that depend on entities being modified on another, leading to data corruption at the consumer level.
- Soft deletes help a lot in preserving data integrity in event-sourced microservice implementations: the service may appear faulty to the end user if a bug is found, but all data is recoverable without needing to completely rebuild a queue once the flaw is fixed.
15. EVENT BASING
- When using Kafka as a data-processing pipeline, loads can increase greatly. Such cases include real-time event tracking, monitoring, process pipelines, etc.
- In these cases, one often structures the queue around processing batches, making it easier to handle.
- Scaling partitions is key here: in most use cases there is a high load of events of the same type, and scaling the pipelining and consumption of those events is the focus of performance and optimization work.
- Data sensitivity is generally lower in these implementations, so the need for event versioning and producer-side guarantees isn't as critical as in event sourcing.
17. SCALING ZOOKEEPER
- One of the least pressing concerns in the early stages of a Kafka deployment.
- Only important for really big clusters or for multi-cluster support.
- Removes a single point of failure for the application.
- Allows deploying a Zookeeper instance on smaller machines.
18. SCALING KAFKA
Pros:
- Increases the maximum throughput of the queue.
- Enables data resiliency, with multiple copies of the same partition spread across brokers.
- Adds fault tolerance to the system; the degree of tolerance depends on the number of nodes.
Cons:
- Increases the complexity of the system.
- Increases operational deployment cost.
- Increases the system's monetary cost.
19. SCALING CONSUMERS
- It's really important to handle partition assignment when scaling consumers in your application (more on this later).
- There are two major strategies for scaling consumers: competing consumers and publish/subscribe.
- If handling consumer assignment manually, one needs to be really careful when mixing both strategies, to avoid data loss.
20. SCALING CONSUMERS: Competing consumers
- Consumers subscribed to the same topic within one consumer group are competing consumers.
- Each of them receives messages from one or more partitions of the topic.
- This allows scaling the number of consumers of a topic up to the number of partitions of that topic.
- Extra consumers remain idle until another one fails or more partitions are created for the topic.
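The partition-to-consumer mapping can be sketched as a simple round-robin assignment (Kafka's real assignors are configurable and more sophisticated; this only illustrates why consumers beyond the partition count sit idle):

```python
def assign_round_robin(partitions, consumers):
    """Spread partitions over consumers round-robin.

    With more consumers than partitions, the surplus consumers receive
    nothing: they are idle standbys until a partition frees up.
    """
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 4 partitions, 6 consumers in one group: c4 and c5 get no partitions.
a = assign_round_robin([0, 1, 2, 3], ["c0", "c1", "c2", "c3", "c4", "c5"])
```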
21. SCALING CONSUMERS: Publish/subscribe
- This pattern separates consumers by consumer group, and subscribes each consumer group to all the messages of a single topic.
- Within a single consumer group one finds the same competing pattern explained before, but all messages are delivered to every group.
- Especially useful for microservice orchestration and data sharing, since one can assign one consumer group per microservice and handle the events that service needs there.
22. SCALING CONSUMERS: Careful when going manual!
- There are two ways to connect a consumer to a partition of a given topic: the subscribe() and assign() methods (as per the Kafka API).
- subscribe() attaches the consumer group to a topic and lets Kafka handle consumer assignment and rebalancing by itself.
- assign()ing a consumer to specific partitions, on the other hand, makes assignment manual, increasing the risk of missing partitions and/or overlapping multiple consumers on the same partition. Be really careful when going manual on consumer assignment!
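When going manual, a pre-flight check over the intended assignment catches the two classic mistakes named above. A sketch with hypothetical consumer names:

```python
def validate_manual_assignment(topic_partitions, assignments):
    """Check a manual consumer->partitions map for orphaned partitions
    (nobody consumes them) and overlapping ones (consumed by more than
    one consumer in the same group)."""
    seen = {}
    for consumer, parts in assignments.items():
        for p in parts:
            seen.setdefault(p, []).append(consumer)
    missing = sorted(set(topic_partitions) - set(seen))
    overlapping = {p: cs for p, cs in seen.items() if len(cs) > 1}
    return missing, overlapping

# Partition 1 is doubly assigned; partitions 2 and 3 are orphaned.
missing, overlap = validate_manual_assignment(
    [0, 1, 2, 3],
    {"c0": [0, 1], "c1": [1]},
)
```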
23. SCALING PRODUCERS
- By default, a Kafka producer writes to any of the partitions of the topic it's producing to. Depending on the sending strategy, the producer keeps a buffer of events to be sent before the previous one has been validated.
- Producer acknowledgements (acks) provide a strategy for confirming that an event has been persisted to the queue. They can be set to 0 (NONE), 1 (LEADER) or -1 (ALL).
- If the chosen strategy is too restrictive (-1), or the cluster's brokers have trouble keeping up with the producer's throughput, the producer's buffer may grow, leading to unexpected crashes and/or data loss.
- On the other hand, if the strategy is too loose (0 or 1), a broker failure may imply data loss.
- Batching strategies help with restrictive acks and faster processing of events, since the producer can keep building the next batch while the previous one is being acknowledged by the cluster.
- Batch compression is also an option when the producers are overwhelming the cluster's network but the processing of events is still being handled properly.
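A durability-leaning producer configuration sketch with hypothetical values; `acks`, `retries`, `buffer.memory` and `max.block.ms` are standard producer settings, and bounding the buffer avoids the unbounded-memory failure mode described above:

```properties
acks=all                 # wait for all in-sync replicas (equivalent to -1)
retries=2147483647       # retry transient broker errors instead of dropping
buffer.memory=67108864   # 64 MiB send buffer; when full, send() blocks...
max.block.ms=60000       # ...up to 60 s, then raises instead of crashing
```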
24. PARTITIONING STRATEGIES
- If the application's load is high enough to warrant multiple instances, you'll need to partition your data.
- Whenever the application allows it, random data distribution is the most efficient way to scale partitions.
- When deciding on a partitioning strategy, it's important to consider whether you'll need aggregates, ordering guarantees, data sharding or batching.
25. RANDOM PARTITIONING
- Makes no distinction regarding which partition handles which kind of event.
- Makes consumer scaling easier, since any consumer can consume from any partition.
- Doesn't ensure any consumption ordering beyond a single partition, and therefore none per event type.
26. AGGREGATE PARTITIONING
- Each partition handles a certain type of event.
- Consumer scaling gets trickier within a single consumer group.
- Ensures order preservation for a single event type, but adds design complexity for events that may be interconnected.
- Also adds complexity to the consumers, though ensuring that every consumer can process any event makes it easier to handle.
27. TIME WINDOW PARTITIONING
- If the aggregate partitioning strategy isn't homogeneous (some aggregates carry more load than others), the partitions themselves will face different loads, making consumer scaling harder.
- You can then split the highest-load partitions by time windows, spreading that load across different partitions.
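One way to sketch this: derive the partition from the aggregate id plus a time bucket, so a single hot aggregate is spread across partitions over time while each window stays ordered. `window_partition` is a hypothetical helper, not part of any Kafka API:

```python
import hashlib

def window_partition(aggregate_id, event_ts, window_s, num_partitions):
    """Partition key combining the aggregate id with a time bucket.

    Events of the same aggregate within one window land on one partition
    (order preserved there); successive windows may hash elsewhere,
    spreading a hot aggregate's load.
    """
    bucket = int(event_ts // window_s)          # which time window this is
    key = f"{aggregate_id}:{bucket}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % num_partitions

# Two events of the same aggregate, 10 s apart, inside one 60 s window:
# they map to the same partition.
p = window_partition("order-1", 1000, 60, 8)
```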
28. TIME WINDOW PARTITIONING
- Once the chunking of the partitions is in place, one can consume those events on a time-window basis.
- Then produce into a new, sorted topic, partitioning each event by its aggregate.
29. PARTITIONING: BOTTLENECKS AND EFFICIENCY
- When choosing a partitioning strategy, it's important to take into account possible resource bottlenecks outside the Kafka cluster.
  Example: if a consumer of a topic depends on a heavily loaded database that has been sharded, it makes sense to set the topic's partitions to match the DB's shards. This allows scaling the consumers per partition and database shard.
- When dealing with multiple partitions and replication, storage considerations are really important. If a broker fails and replication is in place, the partition leader may change and replicas may move to another broker, creating high traffic and/or disk I/O.
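The shard-matching example can be sketched as a keyed partitioner. `shard_for` is a hypothetical helper, and it assumes the database picks its shard with the same stable hash over the entity id:

```python
import zlib

def shard_for(entity_id: str, num_shards: int) -> int:
    """Map an entity to the partition matching its database shard.

    Assumes the topic has exactly num_shards partitions and the DB
    shards by the same CRC32-modulo rule; then consumer i handles
    exactly the traffic that hits DB shard i.
    """
    return zlib.crc32(entity_id.encode()) % num_shards

# Deterministic: the same entity always lands on the same partition/shard.
p = shard_for("user-42", 4)
```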
30. PARTITION REBALANCING
- When a consumer joins or leaves a consumer group, Kafka by default rebalances the partitions for that consumer group.
- When rebalancing happens, all consumers drop their partitions and are reassigned new ones. If a consumer holds state associated with the data being consumed, you need to be very careful with the cluster's rebalancing strategies.
- Another option is to use the native Kafka API instead of a consumer group and manually assign consumers to partitions (avoiding automatic load balancing).
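A consumer configuration sketch for taming rebalances, with hypothetical values; the sticky assignor (available in clients from roughly 0.11 onwards) tries to preserve existing assignments across rebalances, which matters precisely when consumers hold state:

```properties
partition.assignment.strategy=org.apache.kafka.clients.consumer.StickyAssignor
session.timeout.ms=30000      # how long a silent consumer survives before eviction
heartbeat.interval.ms=10000   # heartbeat cadence, well under the session timeout
max.poll.interval.ms=300000   # slow processing between polls won't trigger a rebalance
```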
31. DATA RESILIENCY & FAULT TOLERANCE
- As stated before, data resiliency through replication is one of Kafka's biggest strengths.
- Although it adds load in both disk I/O and network, it protects against data loss.
- A good starting point for data replication is 3 replicas per partition. This allows the cluster to lose one broker without a critical alert, and two of them without losing any data!
- This way, if a single broker fails at night, a single notification suffices and you can fix the problem the next morning. If two of them fail, you can still fix the issue without service downtime and/or data loss.
- We'll talk about data spreading and levels of fault tolerance depending on the type of infrastructure when we evaluate different production environments.