Madrid office: C/ Francisco Silvela, 54 Duplicado 1ºD, 28028. Tel: 91 080 82 44
Barcelona office: C/ Madrazo 27-29 4ª, 08006. Tel: 933 68 52 46
KAFKA INFRASTRUCTURE: PRODUCTION
$intro --help

1. Considerations
2. Event sourcing vs. event basing
3. Microservice management
4. Scaling & clustering
5. Partitioning strategies
6. Data resiliency & fault tolerance
KAFKA IN PRODUCTION: CONSIDERATIONS

- Configuration & rollout strategy
- Retention
- Replication
- Consumer lag
- Batching & compression
KAFKA IN PRODUCTION: CONFIGURATION & ROLLOUT

- Kafka configuration depends heavily on the use case and the application's needs.
- Having an established strategy for rolling out changes to the cluster without stopping the service is vital.
- There is no perfect configuration from the get-go; there are many parameters to fine-tune.
- A clear performance goal and an agile way to roll out changes will make your life a lot easier.
KAFKA IN PRODUCTION: RETENTION

- It's important to configure data retention in your Kafka cluster, tailored to the needs of the application.
- Retention can be bounded by time and by volume; whichever limit is hit first wins.
- This space-vs-time trade-off is really important to fine-tune.
- Have a red-alert button! Whenever something goes wrong, suspend the retention policies so you can fix the issue without losing data.
- Pro tip: Kafka supports "time travel" (seeking consumers back to a timestamp) from 0.10.1 onwards!
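The space-vs-time trade-off above can be sanity-checked with a little arithmetic: given a partition's ingest rate, work out which limit (time or size) actually bounds how far back you can read. A minimal sketch, with illustrative values:

```python
def effective_retention_seconds(retention_ms, retention_bytes, bytes_per_sec):
    """Effective retention window per partition: whichever of the time
    limit or the size limit kicks in first bounds how far back you can read."""
    time_limit_s = retention_ms / 1000
    size_limit_s = retention_bytes / bytes_per_sec  # seconds until the size cap fills
    return min(time_limit_s, size_limit_s)

# 7 days of retention.ms vs 1 GiB of retention.bytes at 10 KiB/s of ingest:
# the size cap fills after ~29 hours, long before the 7-day limit
print(effective_retention_seconds(604_800_000, 1 << 30, 10 * 1024))
```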
KAFKA IN PRODUCTION: REPLICATION

- Choose the right amount of replication for the application's needs and the sensitivity of the data.
- Too much replication leads to unnecessary cost and complexity.
- Too little replication won't let your sysadmins sleep.
- Keep the replicated data in separate failure domains.
- Be careful with the hardware infrastructure: disk I/O multiplies with the replication factor.
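A quick way to see the cost side of that trade-off: every replica multiplies disk writes, and every follower adds inter-broker traffic. A back-of-the-envelope sketch (the numbers are illustrative):

```python
def replication_footprint(ingest_gb_per_day, replication_factor):
    """Cluster-wide disk writes per day, plus the extra inter-broker
    network traffic that follower replication generates."""
    disk_gb = ingest_gb_per_day * replication_factor
    replica_traffic_gb = ingest_gb_per_day * (replication_factor - 1)
    return disk_gb, replica_traffic_gb

# 100 GB/day at RF=3: 300 GB of disk writes and 200 GB of replication traffic
print(replication_footprint(100, 3))  # (300, 200)
```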
KAFKA IN PRODUCTION: CONSUMER LAG

- Consumer lag is one of the scariest problems when running Kafka infrastructure.
- If it goes undetected and the retention policies kick in, you'll start losing data before it has been processed.
- Monitoring append lag versus commit lag is important for an accurate diagnosis of the cause of the lag.
- Append lag is the most sensitive one to monitor.
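Lag itself is just the gap between the latest offset in each partition and the consumer group's committed offset. A minimal monitoring sketch over plain dicts (in practice you would pull both offset maps from the cluster):

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: how far the committed offset trails the log end.
    Both arguments map partition -> offset."""
    return {p: end - committed_offsets.get(p, 0)
            for p, end in log_end_offsets.items()}

ends = {0: 1500, 1: 980}
committed = {0: 1200, 1: 980}
print(consumer_lag(ends, committed))  # {0: 300, 1: 0} - partition 0 is falling behind
```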
KAFKA IN PRODUCTION: BATCHING

- Some use cases can benefit greatly from batching strategies.
- Batching can happen in the producer, the consumer, or both.
- Producer batching stresses the resources of the producing machine but lowers the total network and I/O requirements.
- Consumer batching lowers the resource usage of the consumer application, at the cost of ZooKeeper peak resource consumption.
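On the producer side, batching boils down to two knobs: flush when the batch is full (batch.size) or when it has waited long enough (linger.ms). A sketch of that flush rule; the linger value shown is illustrative (real producers default linger.ms to 0):

```python
def should_flush(batch_bytes, batch_age_ms, batch_size=16_384, linger_ms=5):
    """The producer sends a batch when it is full (batch.size) or has
    waited long enough (linger.ms), whichever happens first."""
    return batch_bytes >= batch_size or batch_age_ms >= linger_ms

print(should_flush(20_000, 1))  # True: the batch is over batch.size
print(should_flush(2_000, 1))   # False: small and young, keep accumulating
print(should_flush(2_000, 10))  # True: linger.ms expired, send what we have
```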
EVENT SOURCING & EVENT BASING

- Kafka implementations vary greatly depending on whether you want to keep events as immutable state or delete them periodically.
- When building an application on event sourcing, or using Kafka as a data bus connecting microservices, event consistency is key.
- If, on the other hand, you are pipelining data at high throughput, scaling and ordering are the greater concerns.
EVENT SOURCING

- Event versioning comes into play here. Since events are persisted permanently in the queue, you need to be able to read them at all times. A schema registry with Avro schemas is the best tool available for versioning events and the entities inside them.
- Don't trust the queue: rebuild constantly. This not only ensures the consistency of your schemas and events, but also enables useful DevOps deployment strategies, such as zero-downtime database clusterization or application migrations.
- You need to be really careful when scaling the cluster while relying on event sourcing, especially when dealing with topic partitions.
- Thankfully, this implementation generally puts the lowest load on the cluster compared to pipelining high traffic through it, so Kafka setup and resource management become easier.
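"Don't trust the queue, rebuild constantly" is easy to sketch: current state is just a fold over the ordered event stream. A toy example with hypothetical event shapes:

```python
def rebuild_state(events):
    """Fold an ordered event stream into current state: replaying the
    queue from offset 0 must always reproduce the same result."""
    state = {}
    for ev in events:
        if ev["type"] == "created":
            state[ev["id"]] = dict(ev["data"])
        elif ev["type"] == "updated":
            state[ev["id"]].update(ev["data"])
        elif ev["type"] == "deleted":
            state.pop(ev["id"], None)
    return state

events = [
    {"type": "created", "id": "u1", "data": {"name": "Ada"}},
    {"type": "updated", "id": "u1", "data": {"email": "ada@example.com"}},
]
print(rebuild_state(events))  # {'u1': {'name': 'Ada', 'email': 'ada@example.com'}}
```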
EVENT SOURCING

- Infrastructure isn't the only consideration, since event sourcing is an end-to-end effort.
- Clients and APIs need to be designed for eventual consistency.
- On the client side, managing state locally through a store (especially when persisting changes) can prevent a large number of invalid events.
- On the API side, performing the required checks both before producing an event and right before persisting the data when consuming it can prevent undesired exception handling.
- Especially when scaling producers and partitions, it is really important to keep a bounded context at the entity level within the same pipeline. You don't want events produced on one partition that depend on entities being modified on another, leading to data corruption at the consumer level.
- Soft deletes help a lot in preserving data integrity in event-sourced microservice implementations: the service may appear faulty to the end user if a bug is found, but all data is recoverable without a complete queue rebuild.
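The soft-delete idea from the last bullet, in miniature: a delete event flags the entity instead of removing it, so a buggy delete is reversible without rebuilding the queue (event shapes are hypothetical):

```python
def apply_event(state, event):
    """Apply one event; deletes only set a flag, so the data survives."""
    if event["type"] == "deleted":
        if event["id"] in state:
            state[event["id"]]["deleted"] = True  # hide it, don't drop it
    elif event["type"] == "restored":
        state.get(event["id"], {}).pop("deleted", None)
    return state

state = {"u1": {"name": "Ada"}}
apply_event(state, {"type": "deleted", "id": "u1"})
print(state)  # {'u1': {'name': 'Ada', 'deleted': True}} - still recoverable
```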
EVENT BASING

- When using Kafka as a data-processing pipeline, loads can increase greatly. Examples are real-time event tracking, monitoring, and processing pipelines.
- In these cases, the queue structure is often based on processing batches, which makes it easier to handle.
- Scaling partitions is key here: most use cases see a high load of events of the same type, and scaling the pipelining and consumption of those events is the focus of performance and optimization work.
- Data sensitivity is generally lower in these implementations, so event versioning and producer-side guarantees aren't as critical as in event sourcing.
SCALING & CLUSTERING
SCALING ZOOKEEPER

- One of the least pressing concerns in the early stages of a Kafka deployment.
- Only important for really big clusters or for multi-cluster support.
- Removes a single point of failure from the system.
- Allows ZooKeeper instances to be deployed on smaller machines.
SCALING KAFKA

Pros:
- Increases the maximum throughput of the queue.
- Enables data resiliency, with multiple copies of the same partition spread across brokers.
- Adds fault tolerance to the system; the degree of tolerance depends on the number of nodes.

Cons:
- Increases the complexity of the system.
- Increases operational deployment cost.
- Increases the monetary cost of the system.
SCALING CONSUMERS

- It's really important to handle partition assignment when scaling the consumers of your application (more on this later).
- There are two major strategies for scaling consumers: competing consumers and publish/subscribe.
- If you handle consumer assignment manually, you need to be really careful when mixing both strategies, to avoid data loss.
SCALING CONSUMERS: COMPETING CONSUMERS

- Consumers subscribed to the same topic within a consumer group are competing consumers.
- Each of them receives messages from one or more partitions of the topic.
- This allows scaling the number of consumers of a topic up to the number of partitions of that topic.
- Extra consumers remain idle until another consumer fails or more partitions of the topic are created.
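The assignment logic behind competing consumers can be simulated in a few lines: partitions are spread over the group's consumers, and any consumer beyond the partition count ends up idle. (This round-robin spread is a simplification of Kafka's real assignors.)

```python
def assign_partitions(partitions, consumers):
    """Round-robin partitions over the group's consumers; consumers
    beyond the partition count get nothing and sit idle."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

print(assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4"]))
# {'c1': [0], 'c2': [1], 'c3': [2], 'c4': []} - c4 is idle until a partition frees up
```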
SCALING CONSUMERS: PUBLISH/SUBSCRIBE

- This pattern separates consumers into consumer groups and subscribes each consumer group to all the messages of a topic.
- Within a single consumer group you get the same competing pattern explained before, but every message is delivered to every group.
- Especially useful for microservice orchestration and data sharing, since you can assign one consumer group per microservice and handle the events that service needs there.
SCALING CONSUMERS: CAREFUL WHEN GOING MANUAL!

- There are two ways to connect a consumer to the partitions of a topic in the Kafka API: the subscribe() and assign() methods.
- subscribe() attaches the consumer group to a topic and lets Kafka handle consumer assignment and rebalancing by itself.
- assign(), on the other hand, binds a consumer to specific partitions manually, increasing the risk of leaving partitions unconsumed and/or overlapping multiple consumers on the same partition. Be really careful when going manual with consumer assignment!
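The two failure modes the last bullet warns about (unconsumed partitions and overlapping consumers) are easy to check for before applying a manual assign() plan. A hypothetical validator:

```python
def validate_manual_assignment(all_partitions, plan):
    """Check a manual assignment plan: return partitions nobody consumes
    and partitions claimed by more than one consumer."""
    claimed = [p for parts in plan.values() for p in parts]
    missing = sorted(set(all_partitions) - set(claimed))
    overlapping = sorted({p for p in claimed if claimed.count(p) > 1})
    return missing, overlapping

# c2 double-claims partition 1, and nobody consumes partitions 2 or 3
print(validate_manual_assignment([0, 1, 2, 3], {"c1": [0, 1], "c2": [1]}))
# ([2, 3], [1])
```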
SCALING PRODUCERS

- By default, a Kafka producer writes to any of the partitions of the topic it produces to. Depending on the sending strategy, the producer keeps a buffer of events to be sent before the previous ones have been validated.
- Producer acknowledgements (acks) are the strategy for confirming that an event has been persisted to the queue. They can be set to 0 (none), 1 (leader) or -1 (all).
- If the chosen strategy is too strict (-1), or the cluster's brokers have trouble keeping up with the producer's throughput, the producer's buffer may grow, leading to unexpected crashes and/or data loss.
- On the other hand, if the strategy is too loose (0 or 1), a broker failure may imply data loss.
- Batching strategies help with strict acks and faster event processing, since the producer can keep building the next batch while the previous one is being acknowledged by the cluster.
- Batch compression is also an option when the producers are overwhelming the cluster's network but event processing is still keeping up.
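The buffer blow-up scenario in the third bullet is simple arithmetic: if acknowledgements come back slower than the application produces, the in-flight buffer grows linearly until it is exhausted. An illustrative simulation (rates and limits are made up):

```python
def buffer_depth(produce_rate, ack_rate, seconds, buffer_max):
    """Track the producer's in-flight buffer when brokers acknowledge
    slower than the application produces."""
    depth = 0
    for _ in range(seconds):
        depth += produce_rate - ack_rate
        if depth > buffer_max:
            return "buffer exhausted"  # the producer blocks or errors out
    return depth

print(buffer_depth(1000, 800, 10, 5000))  # 2000 messages buffered after 10 s
print(buffer_depth(1000, 800, 60, 5000))  # buffer exhausted well before 60 s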
PARTITIONING STRATEGIES

- If the application's load is high enough to warrant multiple instances, you'll need to partition your data.
- Whenever the application allows it, random data distribution is the most efficient way to scale partitions.
- When deciding on a partitioning strategy, consider whether you'll need aggregates, ordering guarantees, data sharding or batching.
RANDOM PARTITIONING

- Makes no distinction about which partition handles which kind of event.
- Makes consumer scaling easier, since any consumer can consume from any partition.
- Doesn't guarantee any ordering of consumption beyond a single partition, and therefore beyond a single event type.
AGGREGATE PARTITIONING

- Each partition handles a certain type of event.
- Consumer scaling gets trickier within a single consumer group.
- Preserves ordering within a single event type, but adds design complexity for events that are interconnected.
- Also adds complexity to the consumers, although ensuring that every consumer can process any event keeps it manageable.
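Keyed (aggregate) partitioning is just a stable hash of the aggregate key modulo the partition count: the same key always lands on the same partition, which is what preserves per-aggregate ordering. A sketch (Kafka's default partitioner uses murmur2; md5 stands in here as a stable hash):

```python
import hashlib

def partition_for(key, num_partitions):
    """Stable key -> partition mapping; identical keys always map to
    the same partition, so per-key ordering is preserved."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

p = partition_for("order-42", 6)
assert p == partition_for("order-42", 6)  # deterministic for the same key
print(p)
```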
TIME WINDOW PARTITIONING

- If the aggregate partitioning isn't homogeneous (some aggregates carry more load than others), the partitions themselves will face different loads, making consumer scaling harder.
- You can then split the highest-load partitions by time window, spreading the heaviest load across different partitions.
TIME WINDOW PARTITIONING

- Once the chunking of the partitions is in place, you can consume those events on a time-window basis.
- Then produce into a new, sorted topic, partitioning each event by its aggregate.
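The time-window trick adds the window number to the partitioning key, so a single hot aggregate is spread over partitions as time advances while ordering is kept within each window. A sketch (window size and hash choice are illustrative):

```python
import hashlib

def window_partition(key, event_ts_ms, num_partitions, window_ms=60_000):
    """Bucket events by (key, time window): the same hot key can map to
    different partitions in different windows, spreading its load."""
    window = event_ts_ms // window_ms
    digest = hashlib.md5(f"{key}:{window}".encode()).hexdigest()
    return int(digest, 16) % num_partitions

# Same key, same window -> same partition; a later window may move it elsewhere
assert window_partition("hot-agg", 1_000, 8) == window_partition("hot-agg", 59_000, 8)
print(window_partition("hot-agg", 1_000, 8), window_partition("hot-agg", 61_000, 8))
```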
PARTITIONING: BOTTLENECKS AND EFFICIENCY

- When choosing a partitioning strategy, take into account possible resource bottlenecks outside the Kafka cluster.
  Example: if a consumer of a topic depends on a high-load database that has been sharded, it makes sense to match the topic's partitions to the database shards. This allows scaling consumers per partition and per database shard.
- When dealing with multiple partitions and replication, storage considerations are really important. If a broker fails while replication is in place, the partition leader may change and replicas may move to another broker, creating heavy traffic and/or disk I/O.
PARTITION REBALANCING

- When a consumer joins or leaves a consumer group, Kafka by default rebalances the partitions for that consumer group.
- When rebalancing happens, all consumers drop their partitions and are reassigned new ones. If a consumer holds state tied to the data it consumes, you need to be very careful with the cluster's rebalancing strategy.
- Another option is to use the native Kafka API instead of a consumer group and manually assign consumers to partitions, avoiding automatic load balancing.
DATA RESILIENCY & FAULT TOLERANCE

- As stated before, data resiliency through replication is one of Kafka's biggest strengths.
- Although it adds load in both disk I/O and network, it protects against data loss.
- A good starting point is 3 replicas per partition. This lets the cluster lose one broker without a critical alert, and two of them without losing any data!
- That way, if a single broker fails at night, a single notification suffices and you can fix the problem the next morning. If two of them fail, you can still fix the issue without service downtime and/or data loss.
- We'll talk about data spreading and levels of fault tolerance per infrastructure type when we evaluate different production environments.
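The RF=3 rule of thumb above follows directly from two numbers: data survives up to RF-1 broker losses, and writes with acks=all stay available while live replicas stay at or above min.insync.replicas. A sketch (min.insync.replicas=2 is an assumed setting, not from the slides):

```python
def fault_tolerance(replication_factor, min_insync_replicas=2):
    """How many broker failures a partition survives: RF-1 without data
    loss, RF - min.insync.replicas without write downtime (acks=all)."""
    return {
        "failures_without_data_loss": replication_factor - 1,
        "failures_without_write_downtime": replication_factor - min_insync_replicas,
    }

# RF=3: lose one broker and keep writing; lose two and the data is still intact
print(fault_tolerance(3))
# {'failures_without_data_loss': 2, 'failures_without_write_downtime': 1}
```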

More Related Content

Similar to Kafka infrastructure production

SCM 2015 International SAP Conference on Supply Chain
SCM 2015 International SAP Conference on Supply ChainSCM 2015 International SAP Conference on Supply Chain
SCM 2015 International SAP Conference on Supply Chain
T.A. Cook
 
VEGA Pressure & Level Measurement - Paper Industry Applications
VEGA Pressure & Level Measurement - Paper Industry ApplicationsVEGA Pressure & Level Measurement - Paper Industry Applications
VEGA Pressure & Level Measurement - Paper Industry Applications
Thorne & Derrick UK
 

Similar to Kafka infrastructure production (20)

Case Study: Increasing Produban's Critical Systems Availability and Performance
Case Study: Increasing Produban's Critical Systems Availability and PerformanceCase Study: Increasing Produban's Critical Systems Availability and Performance
Case Study: Increasing Produban's Critical Systems Availability and Performance
 
Industry 4.0 for beginners
Industry 4.0 for beginnersIndustry 4.0 for beginners
Industry 4.0 for beginners
 
Nogesi whitepaper
Nogesi whitepaperNogesi whitepaper
Nogesi whitepaper
 
VEGA Process Measurement (Level, Limit Level & Pressure) - Oil & Gas Offshore
VEGA Process Measurement (Level, Limit Level & Pressure) - Oil & Gas OffshoreVEGA Process Measurement (Level, Limit Level & Pressure) - Oil & Gas Offshore
VEGA Process Measurement (Level, Limit Level & Pressure) - Oil & Gas Offshore
 
MELAG_proline_118_brochure.pdf
MELAG_proline_118_brochure.pdfMELAG_proline_118_brochure.pdf
MELAG_proline_118_brochure.pdf
 
Datacenter App July09 Bashar
Datacenter App   July09   BasharDatacenter App   July09   Bashar
Datacenter App July09 Bashar
 
SCM 2015 International SAP Conference on Supply Chain
SCM 2015 International SAP Conference on Supply ChainSCM 2015 International SAP Conference on Supply Chain
SCM 2015 International SAP Conference on Supply Chain
 
SMT - Process Control in Thermal Profiling
SMT - Process Control in Thermal ProfilingSMT - Process Control in Thermal Profiling
SMT - Process Control in Thermal Profiling
 
VEGA Pressure & Level Measurement - Paper Industry Applications
VEGA Pressure & Level Measurement - Paper Industry ApplicationsVEGA Pressure & Level Measurement - Paper Industry Applications
VEGA Pressure & Level Measurement - Paper Industry Applications
 
White Paper Mold-ID - Mold Management for Injection Molding with RFID
White Paper Mold-ID - Mold Management for Injection Molding with RFIDWhite Paper Mold-ID - Mold Management for Injection Molding with RFID
White Paper Mold-ID - Mold Management for Injection Molding with RFID
 
The digital lime plant
The digital lime plantThe digital lime plant
The digital lime plant
 
The digital lime plant
The digital lime plantThe digital lime plant
The digital lime plant
 
Oil & Gas Fields Get Smart
Oil & Gas Fields Get SmartOil & Gas Fields Get Smart
Oil & Gas Fields Get Smart
 
Network performance - skilled craft to hard science
Network performance - skilled craft to hard scienceNetwork performance - skilled craft to hard science
Network performance - skilled craft to hard science
 
Industry_4_0_EN-Brochure
Industry_4_0_EN-BrochureIndustry_4_0_EN-Brochure
Industry_4_0_EN-Brochure
 
Confluent Partner Tech Talk with QLIK
Confluent Partner Tech Talk with QLIKConfluent Partner Tech Talk with QLIK
Confluent Partner Tech Talk with QLIK
 
Manufacturing lighthouses
Manufacturing lighthousesManufacturing lighthouses
Manufacturing lighthouses
 
Virtual Human Brain Simulations with Abaqus in the Cloud
Virtual Human Brain Simulations with Abaqus in the CloudVirtual Human Brain Simulations with Abaqus in the Cloud
Virtual Human Brain Simulations with Abaqus in the Cloud
 
201408 digital oilfield (1)
201408 digital oilfield (1)201408 digital oilfield (1)
201408 digital oilfield (1)
 
Gravity White Paper - How to Close the 3rd Party Logistics Technology Gap
Gravity White Paper - How to Close the 3rd Party Logistics Technology GapGravity White Paper - How to Close the 3rd Party Logistics Technology Gap
Gravity White Paper - How to Close the 3rd Party Logistics Technology Gap
 

Recently uploaded

Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdfMastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
mbmh111980
 
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
Alluxio, Inc.
 

Recently uploaded (20)

AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
 
iGaming Platform & Lottery Solutions by Skilrock
iGaming Platform & Lottery Solutions by SkilrockiGaming Platform & Lottery Solutions by Skilrock
iGaming Platform & Lottery Solutions by Skilrock
 
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdfA Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
 
Designing for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web ServicesDesigning for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web Services
 
A Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationA Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data Migration
 
Workforce Efficiency with Employee Time Tracking Software.pdf
Workforce Efficiency with Employee Time Tracking Software.pdfWorkforce Efficiency with Employee Time Tracking Software.pdf
Workforce Efficiency with Employee Time Tracking Software.pdf
 
The Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion ProductionThe Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion Production
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
How to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabberHow to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabber
 
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdfMastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
 
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowAdvanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should Know
 
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
 
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
 
5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand
 
Studiovity film pre-production and screenwriting software
Studiovity film pre-production and screenwriting softwareStudiovity film pre-production and screenwriting software
Studiovity film pre-production and screenwriting software
 
Agnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in KrakówAgnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in Kraków
 
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with StrimziStrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi
 
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdf
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdfMicrosoft 365 Copilot; An AI tool changing the world of work _PDF.pdf
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdf
 
10 Essential Software Testing Tools You Need to Know About.pdf
10 Essential Software Testing Tools You Need to Know About.pdf10 Essential Software Testing Tools You Need to Know About.pdf
10 Essential Software Testing Tools You Need to Know About.pdf
 
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning Framework
 

Kafka infrastructure production

  • 1. Oficinas en Madrid: C/ Francisco Silvela, 54 Duplicado 1ºD 28028 Telf: 91 080 82 44 Oficinas en Barcelona: C/ Madrazo 27-29 4ª 08006 Telf: 933 68 52 46 1 KAFKA INFRASTRUCTURE: PRODUCTION
  • 2. Oficinas en Madrid: C/ Francisco Silvela, 54 Duplicado 1ºD 28028 Telf: 91 080 82 44 Oficinas en Barcelona: C/ Madrazo 27-29 4ª 08006 Telf: 933 68 52 46 1. Considerations 2. Event sourcing vs. event basing 3. Microservice management 4. Scaling & clustering 5. Partitioning strategies 6. Data resiliency & fault tolerance. 2 $intro --help
  • 3. Oficinas en Madrid: C/ Francisco Silvela, 54 Duplicado 1ºD 28028 Telf: 91 080 82 44 Oficinas en Barcelona: C/ Madrazo 27-29 4ª 08006 Telf: 933 68 52 46 - Configuration & rollout strategy - Retention - Replication - Consumer lag - Batching & compression 3 KAFKA IN PRODUCTION: CONSIDERATIONS
  • 4. Oficinas en Madrid: C/ Francisco Silvela, 54 Duplicado 1ºD 28028 Telf: 91 080 82 44 Oficinas en Barcelona: C/ Madrazo 27-29 4ª 08006 Telf: 933 68 52 46 - Kafka configuration is highly reliant on the use case and application needs. - Having a set strategy for rolling out the changes to the cluster without stopping the service is vital. - There is no perfect configuration from the get-go, there are many parameters to fine tune. - Having a clear performance goal and agile ways to roll out the changes will make your life a lot easier. 4 KAFKA IN PRODUCTION: CONFIGURATION & ROLLOUT
  • 5. Oficinas en Madrid: C/ Francisco Silvela, 54 Duplicado 1ºD 28028 Telf: 91 080 82 44 Oficinas en Barcelona: C/ Madrazo 27-29 4ª 08006 Telf: 933 68 52 46 - It’s important to configure data retention in your kafka cluster, tailored to the needs of the application. - Retention can be configured with a retention time and a retention volume. - This space vs time is really important to fine tune - Have a red alert button! Whenever something goes wrong invalidate the retention policies in order to fix the issue without losing data. - Pro tip: Kafka supports time travel from 0.10.1 onwards! 5 KAFKA IN PRODUCTION: RETENTION
  • 6. Oficinas en Madrid: C/ Francisco Silvela, 54 Duplicado 1ºD 28028 Telf: 91 080 82 44 Oficinas en Barcelona: C/ Madrazo 27-29 4ª 08006 Telf: 933 68 52 46 - Have the right amount of replication depending on the application needs and data sensitivity. - Too much replication will lead to unnecessary costs and complexity. - Too little replication won’t let your SysAdmins sleep. - Keep the replicated data in separate failure domains. - Be careful with the hardware infrastructure, disk I/Os get exponentially high with replication. 6 KAFKA IN PRODUCTION: REPLICATION
  • 7. Oficinas en Madrid: C/ Francisco Silvela, 54 Duplicado 1ºD 28028 Telf: 91 080 82 44 Oficinas en Barcelona: C/ Madrazo 27-29 4ª 08006 Telf: 933 68 52 46 7 KAFKA IN PRODUCTION: REPLICATION
  • 8. Oficinas en Madrid: C/ Francisco Silvela, 54 Duplicado 1ºD 28028 Telf: 91 080 82 44 Oficinas en Barcelona: C/ Madrazo 27-29 4ª 08006 Telf: 933 68 52 46 - Consumer lag is one of the scariest problems when dealing with kafka infrastructure. - If not detected and the retention policies kick in, you’ll start losing your data before having it processed. - Monitoring append lag vs commit lag is important to get an accurate diagnostic of the causes of the lag. - Append lag is the most sensitive to be monitoring. 8 KAFKA IN PRODUCTION: CONSUMER LAG
  • 9. Oficinas en Madrid: C/ Francisco Silvela, 54 Duplicado 1ºD 28028 Telf: 91 080 82 44 Oficinas en Barcelona: C/ Madrazo 27-29 4ª 08006 Telf: 933 68 52 46 9 KAFKA IN PRODUCTION: CONSUMER LAG
  • 10. Oficinas en Madrid: C/ Francisco Silvela, 54 Duplicado 1ºD 28028 Telf: 91 080 82 44 Oficinas en Barcelona: C/ Madrazo 27-29 4ª 08006 Telf: 933 68 52 46 - Some use cases can benefit greatly from batchin strategies. - The batching can happen in the producer, the consumer, or both of them. - Producer batching stresses the resources of the kafka machine but lowers the total network and I/O requirements. - Consumer batching lowers the resources on the consumer application at zookeeper peak resource consumption cost 10 KAFKA IN PRODUCTION: BATCHING
  • 11. Oficinas en Madrid: C/ Francisco Silvela, 54 Duplicado 1ºD 28028 Telf: 91 080 82 44 Oficinas en Barcelona: C/ Madrazo 27-29 4ª 08006 Telf: 933 68 52 46 11 KAFKA IN PRODUCTION: BATCHING
  • 12. Oficinas en Madrid: C/ Francisco Silvela, 54 Duplicado 1ºD 28028 Telf: 91 080 82 44 Oficinas en Barcelona: C/ Madrazo 27-29 4ª 08006 Telf: 933 68 52 46 - Kafka implementation varies greatly depending on whether one wants to keep the events as an immutable state or delete them periodically. - When building an application on event sourcing or using kafka as a data bus to connect microservices, event consistency is key. - If, on the other hand, one is dealing with pipelining of high throughput of data, scaling and order management is a greater concern. 12 EVENT SOURCING & EVENT BASING
- Event versioning comes into play here. Since the events are persisted permanently in the queue, one needs to be able to read them at all times. Avro with a schema registry is the best tool available to handle the versioning of events and the entities present in them.
- Don't trust the queue, rebuild constantly. This not only ensures the consistency of one's schemas and events, but also allows for cool DevOps application deployment strategies, like zero-downtime database clusterization or application migrations.
- One needs to be really careful when scaling the cluster while relying on event sourcing, especially when dealing with partitions of topics.
- Thankfully, this implementation generally puts the lowest load on the cluster compared to pipelining high traffic through it, so kafka setup and resource management become easier.
13
EVENT SOURCING
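To make the versioning point concrete, a sketch of Avro-style schema evolution: adding a field with a default keeps old events readable. The schemas are shown as plain dicts and the record/field names are illustrative; a real setup would register them in a schema registry.

```python
# Sketch: two versions of a hypothetical UserCreated event schema.
user_created_v1 = {
    "type": "record", "name": "UserCreated",
    "fields": [{"name": "user_id", "type": "string"}],
}
user_created_v2 = {
    "type": "record", "name": "UserCreated",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "country", "type": "string", "default": "unknown"},  # new field
    ],
}

# Every field added in v2 carries a default, so v2 readers can still decode
# v1 events persisted long ago in the queue.
v1_names = {f["name"] for f in user_created_v1["fields"]}
new_fields = [f for f in user_created_v2["fields"] if f["name"] not in v1_names]
assert all("default" in f for f in new_fields)
```

The same check is what a schema registry's compatibility mode enforces automatically on registration.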
- Infrastructure isn't the only point of consideration, since event sourcing is an end-to-end effort.
- Clients and APIs need to be tailored for eventual consistency.
- On the client side, local management of the state through a store (especially when persisting changes) can prevent a large number of fraudulent events.
- On the API side of things, making the required checks both before producing the event and right before persisting the data when consuming it may prevent undesired exception handling.
- Especially when scaling producers and partitions, it is really important to maintain the bounded context of an entity on the same pipeline. You don't want events being produced on one partition that depend on entities being modified on another one, leading to data corruption at the consumer level.
- Soft deletes help a lot in conserving data integrity when dealing with event-sourced microservice implementations: the service may appear faulty to the end user if a bug is found, but all data is recoverable without needing to completely rebuild the queue if a flaw is detected.
14
EVENT SOURCING
- When using kafka as a data processing pipeline, loads can increase greatly. Such cases include real-time event tracking, monitoring, process pipelines, etc.
- In these cases, one often bases the queue structure on processing batches, making it easier to handle.
- Scaling partitions is key here: in most use cases one will find a high load of events of the same type, and handling the scaling of the pipelining and consumption of said events is the focus when dealing with performance and optimization.
- Data sensitivity is generally lower in these implementations, therefore the need for event versioning and producing assurances isn't as critical compared to event sourcing.
15
EVENT BASING
16
SCALING & CLUSTERING
- One of the lesser concerns in the early stages of a kafka deployment.
- Only important for really big clusters or for multi-cluster support.
- Removes a single point of failure for the application.
- Allows for smaller machines to deploy a Zookeeper instance.
17
SCALING ZOOKEEPER
Pros:
- Increases the maximum throughput of the queue.
- Enables data resiliency, with multiple copies of the same partition spread across brokers.
- Adds fault tolerance to the system; depending on the number of nodes, said tolerance may vary.
Cons:
- Increases the complexity of the system.
- Increases operational deployment cost.
- Increases the monetary cost of the system.
18
SCALING KAFKA
- It's really important to handle partition assignment when scaling the consumers of your application (more on this later).
- There are two major strategies when scaling consumers: competing consumers and publish/subscribe.
- If handling consumer assignment manually, one needs to be really careful when mixing both strategies to avoid data loss.
19
SCALING CONSUMERS
- Consumers subscribed to the same topic in a consumer group are competing consumers.
- Each of them receives messages from one or more partitions of the topic.
- This allows scaling the number of consumers of a topic up to the number of partitions for said topic.
- Extra consumers will remain idle until another one fails or more partitions of the topic are created.
20
SCALING CONSUMERS: COMPETING CONSUMERS
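The partition-count ceiling above can be sketched with a simple round-robin assignment, the same shape the group coordinator produces. Consumer names are illustrative; with more consumers than partitions, the extras receive nothing.

```python
# Sketch: round-robin assignment of partitions to one consumer group.

def assign_partitions(partitions, consumers):
    """Spread partitions over consumers round-robin; returns {consumer: [partitions]}."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 4 partitions, 5 consumers: the fifth consumer sits idle.
result = assign_partitions([0, 1, 2, 3], ["c1", "c2", "c3", "c4", "c5"])
idle = [c for c, parts in result.items() if not parts]
print(idle)  # ['c5']
```

This is why partition count, not consumer count, is the real scaling knob for a single group.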
- This pattern separates consumers by consumer group, and subscribes each consumer group to all the messages of a single topic.
- Within a single consumer group one will find the same competing pattern explained before, but all the messages are delivered to all the groups.
- Especially useful for microservice orchestration and data sharing, since one can assign a single consumer group per microservice and handle the events needed by said service there.
21
SCALING CONSUMERS: PUBLISH/SUBSCRIBE
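The fan-out behaviour above can be sketched as: every group sees the full stream, and only inside a group do members compete. Group and event names are illustrative.

```python
# Sketch: publish/subscribe — each consumer group receives every message of
# the topic; within a group, delivery follows the competing-consumers pattern.

def fan_out(messages, groups):
    """Deliver the full message stream to every consumer group."""
    return {g: list(messages) for g in groups}

deliveries = fan_out(["order_created", "order_paid"], ["billing", "shipping"])
# Both hypothetical microservices receive the complete stream independently.
assert deliveries["billing"] == deliveries["shipping"] == ["order_created", "order_paid"]
```

One consumer group per microservice gives each service its own independent offset and pace through the topic.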
- There are two ways to connect a consumer to a partition of a certain topic: the subscribe() and assign() methods (as per the Kafka API).
- subscribe() attaches the consumer group to a topic and lets kafka handle consumer assignment and rebalancing by itself.
- On the other hand, assign()ing a consumer to specific partitions makes the process manual, increasing the risk of missing partitions and/or overlapping multiple consumers on the same partition. Be really careful when going manual on consumer assignment!
22
SCALING CONSUMERS: CAREFUL WHEN GOING MANUAL!
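The two failure modes of going manual — unread partitions and double-consumed partitions — are easy to sanity-check before calling assign(). A minimal sketch, assuming the assignment is held as a plain consumer-to-partitions mapping:

```python
# Sketch: validating a manual assignment — every partition should be covered
# exactly once, with no gaps and no overlaps.

def validate_assignment(assignment, all_partitions):
    assigned = [p for parts in assignment.values() for p in parts]
    missing = set(all_partitions) - set(assigned)
    overlapping = {p for p in assigned if assigned.count(p) > 1}
    return missing, overlapping

# Hypothetical broken mapping: partition 2 unread, partition 1 consumed twice.
missing, overlap = validate_assignment({"c1": [0, 1], "c2": [1]}, [0, 1, 2])
print(missing, overlap)  # {2} {1}
```

Running a check like this on every deployment catches exactly the data-loss and double-processing risks the slide warns about.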
- By default, a kafka producer will write to any of the partitions of the topic where it's producing. Depending on the sending strategy, the producer keeps a buffer of events to be sent before the last one has been validated.
- Producer acknowledgements (acks) provide a strategy for confirming that an event has been persisted to the queue. This can be set to 0 (NONE), 1 (LEADER) or -1 (ALL).
- If the strategy chosen is too restrictive (-1), or the cluster's brokers have trouble keeping up with the producer's throughput, one may see the producer's buffer grow in memory, leading to unexpected crashes and/or data loss.
- On the other hand, if the strategy is too loose (0 or 1), a broker failure may imply data loss.
- Batching strategies help with restrictive acks and faster processing of the events, since the producer can keep building the next batch while the previous one is being acknowledged by the cluster.
- Batch compression is also an option when the producers are overwhelming the network of the cluster but the processing of events is still being handled properly.
23
SCALING PRODUCERS
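The three acks levels and their trade-offs can be summarised in a small table; the values mirror the standard producer setting, while the summary strings are our own shorthand.

```python
# Sketch: the acks setting and the trade-off each level makes.
ACKS_TRADEOFFS = {
    "0":   "no confirmation — fastest, data lost silently on broker failure",
    "1":   "leader only — failure before replication may lose data",
    "all": "all in-sync replicas — safest, highest latency and buffer pressure",
}

# Restrictive choice for durability-critical pipelines; pair with batching so
# the producer builds the next batch while this one is being acknowledged.
producer_config = {"acks": "all"}
assert producer_config["acks"] in ACKS_TRADEOFFS
```

`-1` and `"all"` are equivalent spellings of the strictest level.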
- If the load of the application is high enough to warrant multiple instances, you'll need to partition your data.
- Whenever the application allows it, random data distribution is the most efficient way to scale partitions.
- It's important to consider whether you'll need to make aggregates, guarantee order, shard the data or batch it when deciding on the partitioning strategy.
24
PARTITIONING STRATEGIES
- Makes no differentiation on which partition handles which kind of event.
- Makes consumer scaling easier, since any consumer can consume from any partition.
- Doesn't ensure any kind of ordering of consumption outside a single partition, and therefore outside a single event type.
25
RANDOM PARTITIONING
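A random partitioner is one line; the sketch below makes the trade-off explicit — any event can land anywhere, so there is no cross-partition ordering by construction.

```python
# Sketch: random partitioning — pick any partition, ignore the event entirely.
import random

def random_partition(num_partitions: int) -> int:
    return random.randrange(num_partitions)

# Any result in [0, num_partitions) is valid; no key, no ordering guarantee.
p = random_partition(6)
assert 0 <= p < 6
```

Real producers achieve the same spread by sending keyless records and letting the client distribute them.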
- Each partition handles a certain type of event.
- Consumer scaling gets trickier within a single consumer group.
- Ensures order preservation for a single event type, but adds design complexity to events that may be interconnected.
- Also adds complexity to the consumers, though ensuring that every consumer can process any event makes it easier to handle.
26
AGGREGATE PARTITIONING
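Aggregate partitioning is typically done by hashing the aggregate's key, so every event of the same aggregate lands on the same partition and keeps its order. A minimal sketch using a stable hash (md5, since Python's built-in `hash()` is randomized per process):

```python
# Sketch: key-hash partitioning — same aggregate, same partition, same order.
import hashlib

def partition_for(aggregate_id: str, num_partitions: int) -> int:
    digest = hashlib.md5(aggregate_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every event of the hypothetical aggregate "user-42" maps to one partition:
assert partition_for("user-42", 6) == partition_for("user-42", 6)
assert 0 <= partition_for("user-42", 6) < 6
```

Note that changing the partition count reshuffles which aggregate maps where, which is one reason scaling partitions under event sourcing needs care.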
- If the aggregate partitioning strategy isn't homogeneous (some aggregates have more load than others), the partitions themselves will face different loads, making consumer scaling harder.
- You can then split the highest-load partitions with time windows, spreading the highest load across different partitions.
27
TIME WINDOW PARTITIONING
- Once the chunking of the partitions is in place, one can consume said events on a time-window basis.
- Then, produce into a new, sorted topic, partitioning each event by its aggregate.
28
TIME WINDOW PARTITIONING
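The windowing step above can be sketched by extending the aggregate key with a time bucket: events of the same aggregate within one window still share a partition (preserving order inside the window), while successive windows of a hot aggregate can land on different partitions. The window size and key format are assumptions for illustration.

```python
# Sketch: time-window sub-keys to spread a hot aggregate's load.
import hashlib

WINDOW_SECONDS = 60  # illustrative window size

def windowed_key(aggregate_id: str, epoch_seconds: int) -> str:
    window = epoch_seconds // WINDOW_SECONDS
    return f"{aggregate_id}:{window}"

def partition_for(key: str, num_partitions: int) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Two events 10 s apart share a window (same key, same partition, ordered);
# 90 s apart they fall into different windows and may be spread out.
assert windowed_key("user-42", 1000) == windowed_key("user-42", 1010)
assert windowed_key("user-42", 1000) != windowed_key("user-42", 1090)
```

A downstream consumer can then re-produce these events into a sorted topic keyed purely by aggregate, as the slide describes.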
- When choosing a partitioning strategy, it's important to take into consideration the possible resource bottlenecks outside of the kafka cluster. Example: if a consumer of a topic depends on a high-load database that has been sharded, it makes sense to set the topic partitions to match said shards of the db. This allows scaling the consumers per partition and database shard.
- When dealing with multiple partitions and replication, storage considerations are really important. If a broker fails and there is replication in place, the partition leader may change and the replicas may move to another broker, creating high traffic and/or disk I/O.
29
PARTITIONING: BOTTLENECKS AND EFFICIENCY
- When a consumer enters or leaves a consumer group, kafka by default rebalances the partitions for said consumer group.
- When rebalancing happens, all consumers drop their partitions and are reassigned new ones. If the consumer has state associated with the data being consumed, you need to be very careful with the rebalancing strategies of the cluster.
- Another option is to use the native Kafka API instead of a consumer group, and manually assign consumers to partitions (avoiding automatic load balancing).
30
PARTITION REBALANCING
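A minimal sketch of why rebalancing disturbs stateful consumers: the default (eager) strategy recomputes the whole assignment from scratch, so a single consumer joining can move partitions between members that were otherwise unaffected. Consumer names are illustrative.

```python
# Sketch: eager rebalancing — drop everything, reassign everything.

def rebalance(partitions, consumers):
    """Recompute the full assignment from scratch, as eager rebalancing does."""
    members = sorted(consumers)
    return {c: [p for i, p in enumerate(partitions) if i % len(members) == members.index(c)]
            for c in members}

before = rebalance([0, 1, 2, 3], ["c1", "c2"])        # c1: [0, 2], c2: [1, 3]
after  = rebalance([0, 1, 2, 3], ["c1", "c2", "c3"])  # c3 joins the group
# c1 kept partition 0 but swapped 2 for 3 — any per-partition state it held
# for partition 2 is now stale.
print(before["c1"], after["c1"])
```

This is the motivation for rebalance listeners (flushing state on revocation) or for going manual with assign() as noted above.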
- As stated before, data resiliency through replication is one of the biggest strengths of kafka.
- Although it adds load in both disk I/O and network, it ensures that no data loss will happen.
- A good starting point for data replication is 3 replicas per partition. This allows the cluster to lose one broker without a critical alert, and two of them without losing any data!
- This way, if a single broker fails at night, a single notification will suffice and you can fix the problem the next morning. Even if two of them fail, you can still fix the issue without service downtime and/or data loss.
- We'll talk about data spreading and levels of fault tolerance depending on the type of infrastructure when we evaluate different production environments.
31
DATA RESILIENCY & FAULT TOLERANCE
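The replication arithmetic behind the "3 replicas" recommendation above, as a minimal sketch: data survives as long as at least one replica of each partition remains, even in the worst case where every failed broker held a copy of the same partition.

```python
# Sketch: fault tolerance with a replication factor of 3.
REPLICATION_FACTOR = 3

def data_survives(failed_brokers: int) -> bool:
    # Worst case: every failed broker held a replica of the same partition.
    return REPLICATION_FACTOR - failed_brokers >= 1

assert data_survives(1)      # one broker down overnight: just a notification
assert data_survives(2)      # two down: one copy left, still no data loss
assert not data_survives(3)  # three down: that partition's data is gone
```

Higher replication factors buy more headroom at the cost of extra disk and network load on every write.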