Apache Kafka's Common Pitfalls & Intricacies
A Customer Support Perspective
Aurélie MARCUZZO & Christoph SCHUBERT • 19-03-2024
Christoph & Aurélie
Christoph: Sales Engineer, Conduktor (previously: Solutions Architect)
Aurélie: Customer Success Engineer, Conduktor (previously: Kafka Developer)
Agenda
01 Introduction
02 Use-case: KafBank’s Kafka journey
03 Conclusion
01 Introduction
Who is this talk for?
What is this about?
What problems does it address?
Survey
Who has Kafka in production?
Who has had it for longer than one year? Two years? Five years?
02 Use-case: KafBank’s Kafka journey
KafBank (a bank company)
Use-case: Gather data from multiple applications that manage:
● Payments
● Account Balance
● Personal Information
Plan: Build up a Kafka infrastructure as a central team, with DevOps & Developers
Kafka central team
Who is Jack?
● Jack is a junior Kafka developer on a new project with a small team
● He studied Data and Business Intelligence, so he's comfortable with data
● He has to build a workflow that moves sensitive data from a few applications into the company's Kafka cluster, so other teams can use this data
Who is Lea?
● Lea is the DevOps engineer of this new project team
● She has set up a Kafka cluster so that Jack can implement his workflow
● She is responsible for its upgrades, maintenance, and security
Deploy a Kafka cluster: choose a provider
On-Premise
✅Cheap
✅Control
❌Easier to make mistakes
❌Need people to maintain it
Cloud
✅Managed by a third party
❌$$$
Configure a Kafka cluster
❌Timed out waiting for a node assignment ❌
❌Connection to node -1 could not be established. Node may not be available. ❌
Advertised Listeners
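Errors like these often trace back to listener configuration. One common variant: the client reaches the bootstrap address, but the broker then advertises an address the client cannot reach. The sketch below is a toy model of that two-step connection, not Kafka code; all hostnames are hypothetical.

```python
# Toy model of Kafka's two-step connection: the client first contacts a
# bootstrap address, then reconnects to whatever address the broker advertises.
# All hostnames below are hypothetical.

def connect(bootstrap, advertised, reachable):
    """Return the address finally used for produce/consume traffic."""
    if bootstrap not in reachable:
        raise ConnectionError(f"cannot reach bootstrap server {bootstrap}")
    # The metadata response carries the advertised listener, not the bootstrap address.
    if advertised not in reachable:
        raise ConnectionError(f"broker advertised {advertised}, which the client cannot reach")
    return advertised

# Addresses the client can reach from outside the broker's network.
client_network = {"kafka.kafbank.example:9092"}

# Broker advertises its internal hostname: bootstrap works, real traffic fails.
try:
    connect("kafka.kafbank.example:9092", "broker-1.internal:9092", client_network)
    outcome = "connected"
except ConnectionError as err:
    outcome = f"failed: {err}"

# Broker advertises an externally reachable address: both steps succeed.
fixed = connect("kafka.kafbank.example:9092", "kafka.kafbank.example:9092", client_network)
```

In a real cluster the address handed back to clients comes from the broker's advertised.listeners setting, so that is usually the first thing to check.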
Topic Creation - Partition & Naming
📜Naming convention:
[environment].[scope].[application].[details]
dev.priv.payment.details
● From big to small scope
● Use names that won’t change (team names 👎)
Ideas:
[scope].[environment].[application].[details]
[environment].[application].[details]
Analogy: bookshelf = topic, book = partition
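A naming convention is easier to keep when it is checked automatically. The following is a minimal sketch of a validator for the [environment].[scope].[application].[details] pattern; the allowed environment and scope values are assumptions, since the slides only give "dev" and "priv" as examples.

```python
import re

# Assumed allowed values; the slides only show "dev" and "priv".
ENVIRONMENTS = {"dev", "staging", "prod"}
SCOPES = {"priv", "pub"}

# Four dot-separated lowercase segments: environment.scope.application.details
TOPIC_PATTERN = re.compile(r"^([a-z0-9-]+)\.([a-z0-9-]+)\.([a-z0-9-]+)\.([a-z0-9-]+)$")

def validate_topic_name(name):
    """Return a list of problems; an empty list means the name follows the convention."""
    match = TOPIC_PATTERN.match(name)
    if not match:
        return ["expected [environment].[scope].[application].[details]"]
    environment, scope, _application, _details = match.groups()
    problems = []
    if environment not in ENVIRONMENTS:
        problems.append(f"unknown environment: {environment}")
    if scope not in SCOPES:
        problems.append(f"unknown scope: {scope}")
    return problems
```

For example, validate_topic_name("dev.priv.payment.details") returns an empty list, while a two-segment name such as "payment.details" is rejected. A check like this could run wherever topics are requested or created.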
Topic Creation - Replication Factor
For a topic with 3 partitions, replication factor 3, and min.insync.replicas=2
If broker 2 dies:
● Leader election for partition 0
● Other partitions are served by their leader
If brokers 1 and 2 die:
● All the leaders are on broker 3
● Producers using acks=all are blocked: NotEnoughReplicasException
ISR & Acknowledgment
acks=0: No acknowledgment from the broker
acks=1: Leader acknowledged
acks=all: All in-sync replicas acknowledged
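The interplay between broker failures, min.insync.replicas, and acks can be sketched as a small decision function. This is a toy model rather than Kafka code, and it assumes every replica on a live broker is in sync.

```python
# Toy model: 3 brokers, replication factor 3, min.insync.replicas=2.
# Assumes every replica on a live broker is in sync.

def can_produce(replicas, alive_brokers, acks, min_insync_replicas):
    """Return True if a write to this partition would be accepted."""
    in_sync = [b for b in replicas if b in alive_brokers]
    if not in_sync:
        return False  # no replica left to act as leader
    if acks == "all":
        # The leader rejects the write (NotEnoughReplicasException)
        # when the ISR is smaller than min.insync.replicas.
        return len(in_sync) >= min_insync_replicas
    return True  # acks=0 or acks=1 only need a leader

replicas = [1, 2, 3]  # the partition is replicated on all three brokers

one_broker_down = can_produce(replicas, {1, 3}, "all", 2)   # broker 2 died
two_brokers_down = can_produce(replicas, {3}, "all", 2)     # brokers 1 and 2 died
two_down_acks_1 = can_produce(replicas, {3}, "1", 2)        # same outage, acks=1
```

With one broker down, acks=all writes still succeed; with two down they are rejected, while an acks=1 producer keeps writing to the single remaining replica with no redundancy.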
Best practices for topics
● Replication factor: 3
● Min insync replicas: (# replicas) - 1 → max 1 broker down
● Partitions: 3 for a small project; a multiple of 6 for big throughput (consumers balance)
First Producer: produce durable, avoid duplicates
A few weeks after…
After a couple of ☕
👍Good news for Jack: these are now the defaults for Java clients
Other client libraries might use different defaults and config names
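Assuming the settings in question are acks=all and enable.idempotence=true (the Java producer defaults since Kafka 3.0), a small sketch of a config check could look like this. The property names are the Java client's; the helper function is hypothetical.

```python
# Producer settings for durable, duplicate-free writes, using the Java client's
# property names. Other client libraries may spell or default these differently.
durable_producer_config = {
    "acks": "all",                 # wait for all in-sync replicas
    "enable.idempotence": "true",  # broker de-duplicates retried batches
}

def durability_warnings(config):
    """List settings that weaken durability or allow duplicates (hypothetical helper)."""
    warnings = []
    if config.get("acks") != "all":
        warnings.append("acks should be 'all'")
    if config.get("enable.idempotence") != "true":
        warnings.append("enable.idempotence should be 'true'")
    return warnings
```

The helper deliberately flags missing keys as well, since setting both values explicitly avoids depending on the defaults of whichever client library is in use.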
Key learnings
1. Client configs are important!!!
2. Kafka is a distributed system: Lea has to worry about replication factor, etc.
3. Moreover, as much responsibility as possible lies with the clients!
4. Really important for Dev and Ops to work together
A new day, a new challenge…
Create a consumer group
Keys
Keys???
Components of a Kafka message
The key determines the partition:
partition = hash(key) % #partitions
The hash is computed after serialization, on the key's bytes!
Custom partitioner possible – think twice whether you need it!
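Why "after serialization" matters can be shown with a short sketch. Kafka's default partitioner hashes the serialized key bytes with murmur2; the example below uses CRC32 as a stand-in, so the partition numbers will not match a real cluster, but the principle is the same.

```python
import struct
import zlib

def partition_for(key_bytes, num_partitions):
    """Map serialized key bytes to a partition (CRC32 stands in for Kafka's murmur2)."""
    return zlib.crc32(key_bytes) % num_partitions

# The same logical key, serialized two different ways.
as_string = "42".encode("utf-8")  # string serializer: b"42"
as_int = struct.pack(">i", 42)    # 4-byte big-endian integer: b"\x00\x00\x00*"

p_string = partition_for(as_string, 3)
p_int = partition_for(as_int, 3)
```

Because the two byte sequences differ, two producers that agree on the logical key but not on the serializer may send the "same" key to different partitions.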
Keys!!!
On a Friday morning …
Poison Pill Scenario
Kafka Upgrade
Upgrading without pain: clients
Client and broker versions can be freely mixed!
=> It is possible to upgrade clients and brokers in any order!
Upgrading without pain: rolling restart of brokers
Brokers can be updated without downtime:
1. Find the controller! It is updated last.
2. Set inter.broker.protocol.version to the current version
3. Pick a broker, perform a restart with the new Kafka version
4. Monitor for under-replicated partitions (should reach 0)
5. Restart the next broker, and continue. Update the controller last!
6. Once every broker runs the new version, set inter.broker.protocol.version to the new version (this takes another rolling restart)
7. Once all/most clients are updated, perform another rolling restart to set log.message.format.version to the ‘new’ version
Practice makes perfect!
We need proper monitoring in place!
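The ordering rule for a rolling restart (one broker at a time, controller last, check under-replicated partitions in between) can be sketched as follows. The restart and under_replicated callbacks are hypothetical and would wrap whatever deployment tooling and monitoring is in place; a real script would also wait and poll rather than check once.

```python
def rolling_restart_order(broker_ids, controller_id):
    """Return broker ids in restart order, with the controller last."""
    if controller_id not in broker_ids:
        raise ValueError("controller must be one of the brokers")
    others = sorted(b for b in broker_ids if b != controller_id)
    return others + [controller_id]

def rolling_restart(broker_ids, controller_id, restart, under_replicated):
    """Restart brokers one at a time; stop if partitions stay under-replicated.

    `restart` and `under_replicated` are hypothetical callbacks supplied by the
    operator (e.g. wrapping a deployment tool and a metrics query).
    """
    done = []
    for broker in rolling_restart_order(broker_ids, controller_id):
        restart(broker)
        if under_replicated() != 0:
            raise RuntimeError(f"under-replicated partitions after broker {broker}")
        done.append(broker)
    return done

# Dry run with stand-in callbacks: broker 1 is the controller.
restarted = []
result = rolling_restart([3, 1, 2], 1, restarted.append, lambda: 0)
```

Stopping at the first non-zero under-replicated count is the conservative choice: it prevents taking a second broker down while the cluster is still catching up.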
Client behavior while updating
Make sure acks, retries, and delivery timeout are configured correctly!
Broker load should stay below 1 - (1 / number of brokers), so that the remaining brokers can absorb the client workload while one broker is being updated.
When using ZooKeeper, certain operations might temporarily fail during an upgrade (e.g., topic creation)
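The headroom rule is simple arithmetic: if one of n brokers is down, the remaining n - 1 must carry the whole workload. A small sketch, assuming load is expressed as a fraction of capacity and spreads evenly across brokers:

```python
def max_safe_load(num_brokers):
    """Upper bound on average broker load (0..1) that leaves room for one broker to be down."""
    if num_brokers < 2:
        raise ValueError("need at least 2 brokers to restart one without downtime")
    return 1 - 1 / num_brokers

def load_during_restart(current_load, num_brokers):
    """Average load on the remaining brokers while one is down, assuming an even spread."""
    return current_load * num_brokers / (num_brokers - 1)
```

With 3 brokers the bound is about 0.67: a cluster running at 60% load would see its two remaining brokers at roughly 90% during a restart, while one running at 75% would be pushed past capacity.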
Retention
Broker level:
● log.retention.bytes
● log.retention.ms
● log.segment.bytes
● log.roll.ms
Topic level (👑 takes precedence over the broker level):
● retention.bytes
● retention.ms
● segment.bytes
● segment.ms
Retention - When does a segment turn inactive?
Retention - When is an inactive segment deleted?
Retention - Where’s the trap?
1 segment = 1 local file
Small segment.ms = Disk cleanup more often ✅
Small segment.ms = Many segments = Many local files
“Too many open files” - “Out of memory” ❌
Retention on bytes → risk of losing data during peaks
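A back-of-the-envelope estimate shows how quickly a small segment.ms multiplies the number of segments. The function below is a rough sketch with illustrative numbers; it assumes steady traffic and ignores the exact timing of the cleanup.

```python
def segments_per_partition(retention_ms, segment_ms, segment_bytes, bytes_per_ms):
    """Rough count of segments kept on disk for one partition.

    A segment rolls when it reaches segment.bytes or segment.ms, whichever comes
    first; closed segments stay until retention.ms has passed.
    """
    ms_to_fill = segment_bytes / bytes_per_ms
    roll_interval_ms = min(segment_ms, ms_to_fill)
    closed = retention_ms / roll_interval_ms
    return int(closed) + 1  # plus the active segment

HOUR = 60 * 60 * 1000
DAY = 24 * HOUR
GIB = 1024 ** 3

# 7 days retention, 1 GiB segments, about 1 KB/s of traffic (illustrative numbers).
weekly_roll = segments_per_partition(7 * DAY, 7 * DAY, GIB, 1)  # segment.ms = 7 days
hourly_roll = segments_per_partition(7 * DAY, HOUR, GIB, 1)     # segment.ms = 1 hour
```

With these numbers the weekly roll keeps about 2 segments per partition and the hourly roll about 169. Multiplied by the number of partitions on a broker, the second setting is the kind that runs into "Too many open files".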
Security Audit? 🫣
Restricting access using ACLs (Access Control Lists)
ACLs can be specified for:
● Topics
● Consumer groups
● Transactional IDs
Use prefixed ACLs whenever possible => naming conventions!
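The link between prefixed ACLs and naming conventions can be illustrated with a simplified matcher. This is a sketch of the matching idea only (allow rules, topics only), not the broker's authorizer; the principal and topic names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Acl:
    principal: str
    operation: str
    pattern_type: str  # "LITERAL" or "PREFIXED"
    name: str

def matches(acl, principal, operation, topic):
    """Return True if this ACL grants `operation` on `topic` to `principal`."""
    if acl.principal != principal or acl.operation != operation:
        return False
    if acl.pattern_type == "PREFIXED":
        return topic.startswith(acl.name)
    return topic == acl.name

# One prefixed ACL covers every topic of the (hypothetical) payment application in dev.
payment_read = Acl("User:Jack", "READ", "PREFIXED", "dev.priv.payment.")
```

Because topic names go from big to small scope, a single prefix such as "dev.priv.payment." covers current and future topics of that application without touching others.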
Kafka broker trap
If allow.everyone.if.no.acl.found = true, every resource without an ACL is open to everyone → ACLs become useless
Recommendations:
● Keep default (false)
● Define super users super.users=User:Bob;User:Alice
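The effect of the broker setting and of super users can be sketched as a simplified decision function. It is a toy model under stated assumptions: acls maps a resource name to the set of allowed principals, and operations and deny rules are left out.

```python
def is_authorized(principal, resource, acls, super_users, allow_everyone_if_no_acl_found=False):
    """Simplified decision: `acls` maps a resource name to the set of allowed principals."""
    if principal in super_users:
        return True  # super users bypass ACL checks
    if resource not in acls:
        # No ACL at all on this resource: the broker setting decides.
        return allow_everyone_if_no_acl_found
    return principal in acls[resource]

acls = {"dev.priv.payment.details": {"User:Jack"}}
super_users = {"User:Bob", "User:Alice"}

# "User:Mallory" is a hypothetical principal with no ACLs anywhere.
default_deny = is_authorized("User:Mallory", "dev.priv.account.balance", acls, super_users)
trap = is_authorized("User:Mallory", "dev.priv.account.balance", acls, super_users,
                     allow_everyone_if_no_acl_found=True)
```

With the default (false), a topic nobody has written an ACL for is closed to everyone except super users; with true, the same forgotten topic is readable by anyone.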
Adding partitions
Repartitioning time!
● We can only add partitions, not remove them!
● Plan carefully, better to slightly underprovision
● Law of sixes
Impact of adding partitions
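Since the partition is derived from hash(key) % #partitions, changing the partition count changes where existing keys land. A minimal sketch, with CRC32 standing in for Kafka's murmur2 and hypothetical keys:

```python
import zlib

def partition_for(key, num_partitions):
    """CRC32 stands in for Kafka's murmur2; the modulo step is the point here."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

keys = [f"account-{i}" for i in range(100)]  # hypothetical message keys

# Keys whose partition changes when the topic grows from 3 to 6 partitions.
moved = [k for k in keys if partition_for(k, 3) != partition_for(k, 6)]
```

Every key in moved has older records on one partition and newer records on another, so per-key ordering across the change is no longer guaranteed and consumers that keep state per partition may see a key they have never seen before.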
Compaction? Let’s take a coffee …
Compaction to the rescue!
03 Conclusion
External listeners
Replication factor - Partitions - In sync replicas
Naming convention
Acknowledgement
Offset reset
ACLs
What’s next for KafBank?
PII Data, how to deal with it?
We have about 20 projects now. How can Lea’s team give them more autonomy while enforcing standards?
Make developers aware of the most common issues and let them share their experience
What’s next for you?
Meet us at our booth #109
Learn using Kafkademy
Thank you for your time!
Any questions?
Aurélie MARCUZZO, Customer Success Engineer, aurelie@conduktor.io
Christoph SCHUBERT, Sales Engineer, cschubert@conduktor.io
