Understanding Apache Kafka P99 Latency at Scale

Apache Kafka is a highly popular distributed system used by many organizations to connect systems, build microservices, create a data mesh, and more. However, as with any distributed system, understanding its performance can be a challenge: there are many moving parts.

In this talk, we review the key moving parts (producers, consumers, replication, network, etc.), a strategy to measure and interpret performance results for consumers and producers, and general guidelines for making performance decisions in Apache Kafka.

Attendees will take home a proven method to measure, evaluate, and optimise the performance of an Apache Kafka-based infrastructure: a key skill for low-throughput users, and especially for the largest-scale deployments.

  1. Brought to you by: Understanding Apache Kafka Latency at Scale. Pere Urbon Bayes, Senior Solutions Architect, Professional Services
  2. Pere Urbon Bayes, Senior Solutions Architect, Professional Services at Confluent ■ Working “in computers” since the year 2000 ■ Interested in all things programming, performance and security ■ Lego enthusiast and handball fan ■ Works side by side with customers implementing the most critical data-in-motion projects
  3. Agenda for today. Today we are going to cover: ● How to model latency in Apache Kafka and the existing tradeoffs ● How to effectively measure Apache Kafka latency ● What you can do as a user to optimise your deployment effectively
  4. Tales of Apache Kafka Latency
  5. Tales of Apache Kafka latency. Measuring performance and latency in distributed systems is certainly not an easy task; there are way too many moving parts. The most important properties to consider in Apache Kafka are: ● Durability, Availability, Throughput and, of course, Latency. NOTE: It is not possible to achieve great values in all of them at once!
  6. The different latencies of Apache Kafka. Apache Kafka is a distributed system, and many “latencies” can be measured.
  7. Produce time. The time from when the application produces a record (KafkaProducer.send()) until a request containing the message is sent to an Apache Kafka broker. Important configuration variables: ● batch.size ● linger.ms ● compression.type ● max.in.flight.requests.per.connection
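As a rough illustration of these producer knobs, here is a minimal, latency-oriented configuration sketch in Java. The broker address and serializers are assumptions for the example, not values from the talk:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class LatencyTunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Latency-oriented settings: send as soon as possible, no artificial wait.
        props.put(ProducerConfig.LINGER_MS_CONFIG, "0");           // do not wait to fill a batch
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "16384");      // default 16 KB batch buffer
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");  // cheap compression can help both throughput and latency
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "5");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // producer.send(...) calls go here
        }
    }
}
```

A throughput-oriented setup would instead raise linger.ms and batch.size so that more records share each request, trading a little latency for larger batches.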
  8. Produce time. By default the producer is optimised for latency. Batching can improve throughput, but can introduce an artificial delay. A batch might have to wait longer if the producer has already reached max.in.flight.requests.per.connection. The use of compression might help with both throughput and latency.
  9. Publish time. The time from when the producer sends a batch of messages to when the corresponding messages get appended to the leader's log. This time includes: ● network and I/O processing ● queue time (request and response queues) With low load, most of the time is usually spent in network and I/O. As the brokers become more loaded, queue time usually dominates.
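The broker breaks these components out per request type in its request metrics. The sketch below is an assumption-heavy illustration: it supposes access to the broker JVM's platform MBean server (or a JMX connection to the broker) and uses the standard kafka.network RequestMetrics MBeans to read p99 produce-request times:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ProduceLatencyJmxProbe {
    public static void main(String[] args) throws Exception {
        // Assumes this runs inside (or is attached to) the broker JVM; for a remote broker,
        // a JMXConnector to the broker's JMX port would be needed instead.
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();

        // Broker-side request metrics: TotalTimeMs covers queue + local + remote + response time.
        ObjectName produceTotalTime = new ObjectName(
                "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce");
        ObjectName produceQueueTime = new ObjectName(
                "kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce");

        Object totalP99 = server.getAttribute(produceTotalTime, "99thPercentile");
        Object queueP99 = server.getAttribute(produceQueueTime, "99thPercentile");
        System.out.println("Produce p99 total time (ms): " + totalP99);
        System.out.println("Produce p99 queue time (ms): " + queueP99);
    }
}
```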
  10. Commit time. Kafka consumers can only read fully replicated messages. This time accounts for all the time necessary for a message to land in all in-sync replicas. Important configuration variables: ● replica.fetch.min.bytes ● replica.fetch.wait.max.ms
  11. Commit time. The time it takes a record to commit is equal to the time it takes the slowest in-sync follower to replicate it. The default configuration is optimised for latency. Commit times are usually impacted by replication factor and load.
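One way to observe commit time from the client side is to produce with acks=all and time how long the acknowledgement takes, since with acks=all the callback only fires once all in-sync replicas have the record. A minimal sketch, assuming a local broker and a hypothetical test-topic:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CommitTimeProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // acknowledge only after all in-sync replicas have the record

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            long start = System.nanoTime();
            producer.send(new ProducerRecord<>("test-topic", "key", "value"), (metadata, exception) -> {
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                if (exception == null) {
                    System.out.printf("Record acknowledged after %d ms (offset %d)%n",
                            elapsedMs, metadata.offset());
                }
            });
            producer.flush();
        }
    }
}
```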
  12. Fetch time. The time it takes for a record to be fetched from a partition; in Java, a successful call to the KafkaConsumer.poll() method. Important configuration variables: ● fetch.min.bytes ● fetch.max.wait.ms The default configuration is optimised for latency.
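A minimal consumer sketch showing these fetch settings, again with the broker address, group id and topic name as assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LatencyTunedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "latency-test");            // assumed group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        // Latency-oriented settings: return data as soon as a single byte is available.
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "1");      // default: do not wait for larger batches
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "500");  // upper bound on the broker-side wait

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test-topic")); // assumed topic name
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            System.out.println("Fetched " + records.count() + " records");
        }
    }
}
```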
  13. The distributed system fallacy
  14. The impact of the tradeoffs…
  15. Durability vs Latency
  16. Acknowledgements (acks). If the broker becomes slower at returning acknowledgements, it usually decreases produce throughput, as it increases the waiting time (max.in.flight.requests.per.connection). Using acks=all usually means increasing the number of producers. Configuring min.insync.replicas is important for availability; however, it is not relevant for latency, as replication happens for all in-sync replicas regardless, so it does not impact the commit time.
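To make the relationship concrete, the sketch below creates a durability-oriented topic (replication factor 3, min.insync.replicas=2) with the AdminClient; the topic name and partition count are assumptions for the example. Producers writing to it with acks=all then wait for all in-sync replicas, which is where the latency cost appears:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class DurableTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Durability-oriented topic: 3 replicas, at least 2 must be in sync for acks=all writes to succeed.
            NewTopic topic = new NewTopic("payments", 6, (short) 3) // assumed topic name and partition count
                    .configs(Map.of(TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
        // On the producer side, pair this with:
        //   props.put(ProducerConfig.ACKS_CONFIG, "all");
    }
}
```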
  17. Throughput vs Latency, the eternal question
  18. Improving batching without artificial delays. When applications produce messages that are not sent to the same partition, batching suffers, as the messages cannot be grouped together. So it is better to make applications aware of this when deciding which key to use. If this is not possible, since Apache Kafka 2.4 you can take advantage of the sticky partitioner (KIP-480). This partitioner will “stick” to one partition until a batch is full, making better use of batching.
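Since Kafka 2.4 the default partitioner already applies this sticky behaviour to records with a null key; to apply it regardless of the key, the UniformStickyPartitioner can be set explicitly. A small sketch, with the broker address and topic name as assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class StickyPartitionerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // Sticky partitioning (KIP-480): fill one batch for a single partition before moving on.
        props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG,
                "org.apache.kafka.clients.producer.UniformStickyPartitioner");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // With a null key, records stick to one partition until the current batch is full.
            producer.send(new ProducerRecord<>("events", null, "payload")); // assumed topic name
        }
    }
}
```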
  19. What about the number of clients? More clients generally mean more load on the brokers: even if the total throughput stays the same, there are more metadata requests and connections to handle. More clients also have an impact on tail latency, since they increase the number of produce and fetch requests sent to a Kafka broker at any given time.
  20. Moar partitions, please?
  21. Brought to you by: Thank you! pere@confluent.io @purbon
