SlideShare a Scribd company logo
Tips and Tricks for Apache
Kafka™ Ops
June 2020
Jen Snipes, Senior Customer Success Architect
https://www.linkedin.com/in/jennifersnipes/
Kat Grigg, Senior Customer Success Architect @katgrigg
● What is Apache Kafka
● Kafka Operations
○ Configuring
○ Deploying
○ Maintaining
○ Monitoring
○ Debugging
Agenda
What is Apache Kafka?
Open Source
Real time Event Streaming Platform
Scalable, Distributed Log
Fault tolerant storage
Moving Data
Producers
Consumers
Connect
Streams
Log Structured Data Flow
Configuring
Deploying
Maintaining
Monitoring
Debugging
Kafka Operations
6
5 Categories of Ops
Configuring Kafka
7
Data Retention
● Disk space available?
● SLA on consuming?
● Keyed data? Cardinality?
Security
● Start with plaintext
● SSL and SASL? Choose one first
● Console clients are great for testing
Optimization
● Throughput
○ increase partitions, batch size, linger.ms
○ use acks 1
○ compression
○ parallelize consumers
● Latency
○ no compression
○ balance number of partitions
○ increase fetcher threads
○ producer linger.ms=0, consumer
fetch.min.bytes=1
● Durability
○ acks = all, RF = 3, minISR = 2
○ leverage retries
○ unclean leader election set to false
○ disable auto commit
● Availability
○ minISR 1
○ unclean leader election true
○ increase recovery threads
○ low consumer session.timeout
○ optimize number of partitions
Deploying
Kafka
11
Start Easy
● How do you run your database(s)?
● Do you know Docker well?
● Do you know Kubernetes well?
● If no, then don’t use them
● Cloud an option?
● JMX access (metrics reporter)
● Environment Variables
● Logging capacity
● Don’t forget ZooKeeper
● File descriptors (ulimit)
● Disk deployment style
One does not simply start Brokers
Maintaining
Kafka
14
Updates and Upgrades
Rolling upgrade Tips
● Gwen Shapira’s Kafka Summit Talk 2019
● Set protocols and message formats
● Be explicit
● Go slow, too fast and ISR sadness
● Upgrade often, point releases in prod
Rebalancing
● Lose/Add Broker
● Throttle early, adjust later
● Throttles don’t affect in sync
replicas
● Not sure? Batch the reassign
● See Monitoring
Updating Configs
Brokers
● Lots of dynamic config options!
● Sync config across the cluster
Topics
● Overrides are your friend
● Topic level unclean leader election
Monitoring
Kafka
What do you watch?
18
Kafka Thread Model
● Monitor all of these hops with JMX
● Transparent way to see activity
● Especially watch request queue size
● Don’t blindly bump thread count
Log Cleaner / Others
● Watch your log cleaner with JMX
● Bytes In/Out just makes sense
● Don’t need to dashboard everything
● Alert don’t automate (at least at first)
Broker Metrics
Top 5 Broker JMX metrics you should
be watching
● UnderReplicatedPartitions
● RequestHandlerAvgIdlePercent
● NetworkProcessorAvgIdlePercent
● RequestQueueSize
● TotalTimeMs
Clients / Consumers
● batch sizes
● failed-authentication-rate
● io-ratio
● io-wait-ratio
● io-wait-time-ns-avg
Debugging
Kafka
23
When there’s trouble...
Logs about the Log
● Debugging? Head to the logs!
● Defaults separates files
reasonably
● Splunk, ELK, etc.? filter by
appender
● Metadata files are important too
● controller.log -- written to by active controller
● state-change.log -- updates received from and sent to
controller
● server.log -- basic broker operations (segment rolls,
replica fetching, etc.)
● log-cleaner.log -- cleaner thread (related to compaction)
● kafka-authorizer.log -- secure connections only (set to
DEBUG if you want successful connections)
● kafka-request.log -- super verbose, will talk about more
● /var/lib/recovery-point-offset-checkpoint – last offset
flushed to disk
● /var/lib/cleaner-offset-checkpoint – offset up to which
the cleaner has cleaned (compacted topics only)
● /var/lib/replication-offset-checkpoint – last committed
offset
Request logging
● kafka-request.log
● All information about all requests
● Verbose, not great for live debugging
● Use when no other options for root cause
● Rogue clients, single request slowdowns,
etc.
● Kafka Controller Deep Dive
● Kafka Needs No Keeper
● ISR updates aren’t happening
● Replication stopped for no reason
● Broker refusing to become leader for extended
period
● rmr /controller
Controller Issues
The Offsets Topic
● __consumer_offsets → Stores offsets and group metadata
● bin/kafka-run-class kafka.tools.DumpLogSegments --offsets-decoder
● offsets.retention.minutes default 1 day (7 days on 2.0)
● Growing huge? Check the log cleaner health
● High Memory Usage on Brokers? Offsets are cached
● High CPU Usage on Brokers? Make sure this topic isn’t huge
● Please replication factor 3
Restarting? Think about it...
● Restarts in distributed systems are tricky
● Have a reason
● Understand startup time
● Clean shutdowns, limited scope
29
References
• https://www.confluent.io/resources/kafka-the-definitive-guide/
• https://www.confluent.io/kafka-summit-san-francisco-2019/please-upgrade-
apache-kafka-now
• https://conferences.oreilly.com/strata/strata-ny-
2018/public/schedule/detail/68957
• https://www.confluent.io/kafka-summit-san-francisco-2019/kafka-needs-no-
keeper
• https://events.confluent.io/meetups
30
Thank You!

More Related Content

What's hot

Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
SANG WON PARK
 
Admission Control in Impala
Admission Control in ImpalaAdmission Control in Impala
Admission Control in Impala
Cloudera, Inc.
 

What's hot (20)

Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
 
How Apache Kafka® Works
How Apache Kafka® WorksHow Apache Kafka® Works
How Apache Kafka® Works
 
An Introduction to Apache Kafka
An Introduction to Apache KafkaAn Introduction to Apache Kafka
An Introduction to Apache Kafka
 
Exactly-once Semantics in Apache Kafka
Exactly-once Semantics in Apache KafkaExactly-once Semantics in Apache Kafka
Exactly-once Semantics in Apache Kafka
 
kafka
kafkakafka
kafka
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Troubleshooting Kafka's socket server: from incident to resolution
Troubleshooting Kafka's socket server: from incident to resolutionTroubleshooting Kafka's socket server: from incident to resolution
Troubleshooting Kafka's socket server: from incident to resolution
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Everything You Always Wanted to Know About Kafka’s Rebalance Protocol but Wer...
Everything You Always Wanted to Know About Kafka’s Rebalance Protocol but Wer...Everything You Always Wanted to Know About Kafka’s Rebalance Protocol but Wer...
Everything You Always Wanted to Know About Kafka’s Rebalance Protocol but Wer...
 
Putting Kafka Into Overdrive
Putting Kafka Into OverdrivePutting Kafka Into Overdrive
Putting Kafka Into Overdrive
 
APACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka StreamsAPACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka Streams
 
Admission Control in Impala
Admission Control in ImpalaAdmission Control in Impala
Admission Control in Impala
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
Envoy and Kafka
Envoy and KafkaEnvoy and Kafka
Envoy and Kafka
 
Everything you ever needed to know about Kafka on Kubernetes but were afraid ...
Everything you ever needed to know about Kafka on Kubernetes but were afraid ...Everything you ever needed to know about Kafka on Kubernetes but were afraid ...
Everything you ever needed to know about Kafka on Kubernetes but were afraid ...
 
Deploying Kafka Streams Applications with Docker and Kubernetes
Deploying Kafka Streams Applications with Docker and KubernetesDeploying Kafka Streams Applications with Docker and Kubernetes
Deploying Kafka Streams Applications with Docker and Kubernetes
 

Similar to Tips & Tricks for Apache Kafka®

Similar to Tips & Tricks for Apache Kafka® (20)

Tips and Tricks for Operating Apache Kafka
Tips and Tricks for Operating Apache KafkaTips and Tricks for Operating Apache Kafka
Tips and Tricks for Operating Apache Kafka
 
kafka
kafkakafka
kafka
 
Introducing KRaft: Kafka Without Zookeeper With Colin McCabe | Current 2022
Introducing KRaft: Kafka Without Zookeeper With Colin McCabe | Current 2022Introducing KRaft: Kafka Without Zookeeper With Colin McCabe | Current 2022
Introducing KRaft: Kafka Without Zookeeper With Colin McCabe | Current 2022
 
High-level architecture of a complete MariaDB deployment
High-level architecture of a complete MariaDB deploymentHigh-level architecture of a complete MariaDB deployment
High-level architecture of a complete MariaDB deployment
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafka
 
Build real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache KafkaBuild real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache Kafka
 
LCE13: Test and Validation Summit: The future of testing at Linaro
LCE13: Test and Validation Summit: The future of testing at LinaroLCE13: Test and Validation Summit: The future of testing at Linaro
LCE13: Test and Validation Summit: The future of testing at Linaro
 
LCE13: Test and Validation Mini-Summit: Review Current Linaro Engineering Pro...
LCE13: Test and Validation Mini-Summit: Review Current Linaro Engineering Pro...LCE13: Test and Validation Mini-Summit: Review Current Linaro Engineering Pro...
LCE13: Test and Validation Mini-Summit: Review Current Linaro Engineering Pro...
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
 
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, TwitterTwitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
 
The Accidental DBA
The Accidental DBAThe Accidental DBA
The Accidental DBA
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Apache Kafka - Martin Podval
Apache Kafka - Martin PodvalApache Kafka - Martin Podval
Apache Kafka - Martin Podval
 
Scaling ELK Stack - DevOpsDays Singapore
Scaling ELK Stack - DevOpsDays SingaporeScaling ELK Stack - DevOpsDays Singapore
Scaling ELK Stack - DevOpsDays Singapore
 
MySQL X protocol - Talking to MySQL Directly over the Wire
MySQL X protocol - Talking to MySQL Directly over the WireMySQL X protocol - Talking to MySQL Directly over the Wire
MySQL X protocol - Talking to MySQL Directly over the Wire
 
A Functional Approach to Architecture - Kafka & Kafka Streams - Kevin Mas Rui...
A Functional Approach to Architecture - Kafka & Kafka Streams - Kevin Mas Rui...A Functional Approach to Architecture - Kafka & Kafka Streams - Kevin Mas Rui...
A Functional Approach to Architecture - Kafka & Kafka Streams - Kevin Mas Rui...
 
Apache Kafka : Monitoring vs Alerting
Apache Kafka : Monitoring vs AlertingApache Kafka : Monitoring vs Alerting
Apache Kafka : Monitoring vs Alerting
 

More from confluent

More from confluent (20)

Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
Evolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI EraEvolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI Era
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Santander Stream Processing with Apache Flink
Santander Stream Processing with Apache FlinkSantander Stream Processing with Apache Flink
Santander Stream Processing with Apache Flink
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insights
 
Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flink
 
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
 
AWS Immersion Day Mapfre - Confluent
AWS Immersion Day Mapfre   -   ConfluentAWS Immersion Day Mapfre   -   Confluent
AWS Immersion Day Mapfre - Confluent
 
Eventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkEventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalk
 
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent CloudQ&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
 
Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Dive
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Mesh
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservices
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernization
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time data
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesis
 

Recently uploaded

Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Peter Udo Diehl
 

Recently uploaded (20)

Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 

Tips & Tricks for Apache Kafka®

  • 1. Tips and Tricks for Apache Kafka™ Ops June 2020 Jen Snipes, Senior Customer Success Architect https://www.linkedin.com/in/jennifersnipes/ Kat Grigg, Senior Customer Success Architect @katgrigg
  • 2. ● What is Apache Kafka ● Kafka Operations ○ Configuring ○ Deploying ○ Maintaining ○ Monitoring ○ Debugging Agenda
  • 3. What is Apache Kafka? Open Source Real time Event Streaming Platform Scalable, Distributed Log Fault tolerant storage
  • 8. Data Retention ● Disk space available? ● SLA on consuming? ● Keyed data? Cardinality?
  • 9. Security ● Start with plaintext ● SSL and SASL? Choose one first ● Console clients are great for testing
  • 10. Optimization ● Throughput ○ increase partitions, batch size, linger.ms ○ use acks 1 ○ compression ○ parallelize consumers ● Latency ○ no compression ○ balance number of partitions ○ increase fetcher threads ○ producer linger.ms=0, consumer fetch.min.bytes=1 ● Durability ○ acks = all, RF = 3, minISR = 2 ○ leverage retries ○ unclean leader election set to false ○ disable auto commit ● Availability ○ minISR 1 ○ unclean leader election true ○ increase recovery threads ○ low consumer session.timeout ○ optimize number of partitions
  • 12. Start Easy ● How do you run your database(s)? ● Do you know Docker well? ● Do you know Kubernetes well? ● If no, then don’t use them ● Cloud an option?
  • 13. ● JMX access (metrics reporter) ● Environment Variables ● Logging capacity ● Don’t forget ZooKeeper ● File descriptors (ulimit) ● Disk deployment style One does not simply start Brokers
  • 15. Rolling upgrade Tips ● Gwen Shapira’s Kafka Summit Talk 2019 ● Set protocols and message formats ● Be explicit ● Go slow, too fast and ISR sadness ● Upgrade often, point releases in prod
  • 16. Rebalancing ● Lose/Add Broker ● Throttle early, adjust later ● Throttles don’t affect in sync replicas ● Not sure? Batch the reassign ● See Monitoring
  • 17. Updating Configs Brokers ● Lots of dynamic config options! ● Sync config across the cluster Topics ● Overrides are your friend ● Topic level unclean leader election
  • 19. Kafka Thread Model ● Monitor all of these hops with JMX ● Transparent way to see activity ● Especially watch request queue size ● Don’t blindly bump thread count
  • 20. Log Cleaner / Others ● Watch your log cleaner with JMX ● Bytes In/Out just makes sense ● Don’t need to dashboard everything ● Alert don’t automate (at least at first)
  • 21. Broker Metrics Top 5 Broker JMX metrics you should be watching ● UnderReplicatedPartitions ● RequestHandlerAvgIdlePercent ● NetworkProcessorAvgIdlePercent ● RequestQueueSize ● TotalTimeMs
  • 22. Clients / Consumers ● batch sizes ● failed-authentication-rate ● io-ratio ● io-wait-ratio ● io-wait-time-ns-avg
  • 24. Logs about the Log ● Debugging? Head to the logs! ● Defaults separates files reasonably ● Splunk, ELK, etc.? filter by appender ● Metadata files are important too ● controller.log -- written to by active controller ● state-change.log -- updates received from and sent to controller ● server.log -- basic broker operations (segment rolls, replica fetching, etc.) ● log-cleaner.log -- cleaner thread (related to compaction) ● kafka-authorizer.log -- secure connections only (set to DEBUG if you want successful connections) ● kafka-request.log -- super verbose, will talk about more ● /var/lib/recovery-point-offset-checkpoint – last offset flushed to disk ● /var/lib/cleaner-offset-checkpoint – offset up to which the cleaner has cleaned (compacted topics only) ● /var/lib/replication-offset-checkpoint – last committed offset
  • 25. Request logging ● kafka-request.log ● All information about all requests ● Verbose, not great for live debugging ● Use when no other options for root cause ● Rogue clients, single request slowdowns, etc.
  • 26. ● Kafka Controller Deep Dive ● Kafka Needs No Keeper ● ISR updates aren’t happening ● Replication stopped for no reason ● Broker refusing to become leader for extended period ● rmr /controller Controller Issues
  • 27. The Offsets Topic ● __consumer_offsets → Stores offsets and group metadata ● bin/kafka-run-class kafka.tools.DumpLogSegments --offsets-decoder ● offsets.retention.minutes default 1 day (7 days on 2.0) ● Growing huge? Check the log cleaner health ● High Memory Usage on Brokers? Offsets are cached ● High CPU Usage on Brokers? Make sure this topic isn’t huge ● Please replication factor 3
  • 28. Restarting? Think about it... ● Restarts in distributed systems are tricky ● Have a reason ● Understand startup time ● Clean shutdowns, limited scope
  • 29. 29 References • https://www.confluent.io/resources/kafka-the-definitive-guide/ • https://www.confluent.io/kafka-summit-san-francisco-2019/please-upgrade- apache-kafka-now • https://conferences.oreilly.com/strata/strata-ny- 2018/public/schedule/detail/68957 • https://www.confluent.io/kafka-summit-san-francisco-2019/kafka-needs-no- keeper • https://events.confluent.io/meetups