SignalFx Kafka Consumer Optimization

SignalFx
Feb. 24, 2016

Editor's Notes

  1. A Kafka cluster has multiple brokers. Each broker is a process of its own with a unique id. The unit of serializability in Kafka is a partition: each partition has all of its messages ordered. I like to think of a topic as a group of partitions. A partition has a statically assigned leader, and from the point of view of regular clients, all read/write operations must go through the leader.
  2. So a client needs to know the mapping of topic-partitions to brokers. This mapping can change dynamically, so a client begins by sending a metadata request to learn it. A metadata request can be sent to any broker in the cluster.
  3. The broker then replies with a metadata response.
  4. So the client can now form a map of partitions to brokers.
  5. Next the client needs to build a table of partition -> next offset to consume. It can get it from the consumer group functionality or some other external source.
  6. Once this is built it can send fetch requests for actual data.
  7. As long as there is actual data to consume and no errors it gets back a fetch response.
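
Putting notes 1-7 together, here is a minimal sketch of the bootstrap-and-fetch flow. Every type and method name below is a hypothetical stand-in for real wire-protocol code, not the actual SignalFx client API:

    import java.nio.ByteBuffer;
    import java.util.Map;

    abstract class ConsumerFlowSketch {
        // Steps 1-4: a metadata request to any broker yields partition -> leader.
        abstract Map<Integer, String> sendMetadataRequest(String topic);

        // Steps 6-7: fetch from a partition's leader starting at an offset.
        abstract ByteBuffer sendFetchRequest(String leader, int partition, long offset);

        // Returns the next offset to consume after processing the response.
        abstract long processResponse(ByteBuffer response);

        // Step 5: nextOffsets is the partition -> next offset table,
        // managed outside of Kafka's consumer group functionality.
        void consume(String topic, long[] nextOffsets) {
            Map<Integer, String> leaders = sendMetadataRequest(topic);
            while (true) {
                for (Map.Entry<Integer, String> e : leaders.entrySet()) {
                    int p = e.getKey();
                    ByteBuffer resp = sendFetchRequest(e.getValue(), p, nextOffsets[p]);
                    nextOffsets[p] = processResponse(resp);
                }
            }
        }
    }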
  8. Data is transferred between memory and cache in blocks of fixed size, called cache lines (typically 64 bytes). If you need a single byte, 63 others are coming in for the ride and paying the full tax. So you might as well use these bytes. When the processor needs to read or write a location in main memory, it first checks for a corresponding entry in the cache. On a cache hit, the processor immediately reads or writes the data in the cache line. On a cache miss, the cache allocates a new entry and copies in data from main memory, then the request (read or write) is fulfilled from the contents of the cache.
  9. Data is transferred between memory and cache in blocks of fixed size, called cache lines (typically 64 bytes). If you need a single byte, 63 others are coming in for the ride and paying the full tax. So you might as well use these bytes.
  10. An application summing numbers in nodes of a linked list might take one cache miss per node.
  11. Spatial locality and prefetching, on the other hand, help a lot when summing an array. The compiler is also able to emit better vectorized code if your layout looks like this (see the sketch below).
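
To make the locality difference concrete, here is a small illustrative Java sketch of the two traversals:

    final class LocalitySketch {
        // Each list node is a separate heap object; when nodes are scattered
        // across memory, each visit can be a fresh cache miss.
        static final class Node {
            long value;
            Node next;
        }

        static long sumList(Node head) {
            long sum = 0;
            for (Node n = head; n != null; n = n.next) {
                sum += n.value; // pointer chase: poor spatial locality
            }
            return sum;
        }

        // Array values are contiguous: one 64-byte cache line carries eight
        // longs, the hardware prefetcher runs ahead, and the loop vectorizes.
        static long sumArray(long[] values) {
            long sum = 0;
            for (long v : values) {
                sum += v; // sequential access
            }
            return sum;
        }
    }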
  12. We really really care about cache usage and cache misses. We don’t care about memory as much. So efficiency for the client means more resources for the application which means a faster application.
  13. Almost all our optimizations are based on constraints that come from our use of the consumer, so many of them are not directly applicable to generic Kafka clients, which need to work well under various scenarios. We need no consumer group functionality: we manage partitions and offsets outside of Kafka, which makes our client super simple. A single topic: each of our applications mostly consumes one topic. We have a finite, small number of partitions, usually <= 1024. Partition reassignment is rare - I would imagine that this is true for most applications. Control of the entire pipeline means we can make assumptions that a generic client cannot (the end-to-end principle).
  14. Now the interesting part.
  15. Since we have a single topic, all partitions implicitly belong to that topic. So we don’t need a concept of topic-partition. We only have partitions. Since we don’t need topic-partition objects we can store all per partition data in arrays with the array index = partition number.
  16. It is important to acknowledge that this is a tradeoff. Like we said before, we really care about cache space and cache misses, and we are ready to trade extra memory to reduce our cache usage. Here is an example. Let’s imagine we have a java.util.HashMap of partition to offset. We’ve already shown that a single offset get or put can incur multiple cache misses. Now let’s imagine we have a single partition 0 with an offset 116 stored in this map. How much memory does this use? We’ll be generous and assume that headers are only 8 bytes and references are only 4 bytes, and that the entry array was preallocated for 2 entries. The array has an 8-byte header, a 4-byte length and two 4-byte references - that’s 20 bytes. The actual entry is itself 16 bytes, the boxed Long is 16 bytes and the boxed Integer is 12 bytes. So in spite of all the references and indirection, it only uses 64 bytes of memory.

  On the other hand, let’s assume that our sparse array has been preallocated for 1024 partitions. It has a 4-byte length, an 8-byte header and 1024 8-byte entries, a total of 8204 bytes, which is around 8 KB. This is a lot more than 64 bytes and kind of wasteful.

  Now let’s look at how much cache is used by each solution. Each cache line is 64 bytes, so even if you want a single byte, 63 unrelated bytes might come along for the ride. In the HashMap, we first need to fetch the right entry reference from the bucket array - that’s one cache line (the other entry, and possibly the length and header, come along for the ride). The actual entry is on another cache line. The contents of the boxed partition are at yet another random memory location, so a third cache line. Finally we fetch the boxed offset itself - a fourth cache line. So a simple get uses 4 cache lines, and hence 256 bytes of cache. Now look at the sparse offset array: we know exactly where to fetch from, so a single cache fetch gets the offset. It comes with potentially 7 other offsets, none of which might be useful, but it’s still a single cache line. We use only 64 bytes!

  This example is a bit counter-intuitive. It shows that a data structure using only 64 bytes of memory can actually use several times that in cache, while a data structure using 8 KB of memory might only touch a single cache line. This is a bit like virtual memory vs physical memory: you can use a lot of virtual memory but little physical memory and come out ahead. In our case physical memory is abundant (we have gigabytes of it), while cache is very limited - we only have around 32 KB of L1 cache, for example - so it is much more precious than physical memory. This also shows how we are ready to make tradeoffs: sparse arrays can take more memory but have a pretty well guaranteed worst-case cache usage and cache miss count.
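
A minimal Java sketch of the two layouts under discussion (class and method names are invented for illustration, not the actual SignalFx client code):

    import java.util.HashMap;
    import java.util.Map;

    final class OffsetTableSketch {
        // Boxed map: a get() can touch the bucket array, the Entry, the boxed
        // Integer key and the boxed Long value -- up to four cache lines.
        final Map<Integer, Long> offsetsMap = new HashMap<>();

        // Sparse array: ~8 KB of memory when preallocated for 1024 partitions,
        // but a lookup is a single indexed load, i.e. one cache line.
        final long[] offsets = new long[1024]; // index = partition number

        long nextOffset(int partition) { return offsets[partition]; }

        void advance(int partition, long newOffset) { offsets[partition] = newOffset; }
    }

The long[] version wastes memory on unassigned partitions, but every lookup is a single predictable load.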
  17. We talked about the main data structures. Our other data structures, especially for our state machine implementation, are all designed to be zero-allocation in the steady state and very cache-friendly. Even our hashed-wheel timer is made of primitive arrays with very few indirections (a sketch follows).
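
For flavor, here is a minimal hashed-wheel timer built only from primitive arrays, in the spirit described above. It assumes a power-of-two slot count and a fixed per-slot capacity with no overflow handling; it is an illustrative sketch, not the actual implementation:

    final class WheelTimer {
        private static final int SLOTS = 512;        // power of two
        private static final int SLOT_CAPACITY = 64; // fixed; a real impl must handle overflow
        private final long[][] deadlines = new long[SLOTS][SLOT_CAPACITY];
        private final int[] counts = new int[SLOTS]; // live timers per slot
        private final long tickNanos;
        private long currentTick;

        WheelTimer(long tickNanos) { this.tickNanos = tickNanos; }

        // Schedule: hash the expiry tick into a slot; no allocation.
        void schedule(long delayNanos, long nowNanos) {
            long tick = (nowNanos + delayNanos) / tickNanos;
            int slot = (int) (tick & (SLOTS - 1));
            deadlines[slot][counts[slot]++] = nowNanos + delayNanos;
        }

        // Advance one tick and fire everything due in this slot; timers for a
        // later wheel revolution are kept in place.
        void tick(long nowNanos) {
            int slot = (int) (currentTick++ & (SLOTS - 1));
            int n = counts[slot], kept = 0;
            for (int i = 0; i < n; i++) {
                if (deadlines[slot][i] <= nowNanos) {
                    onExpiry(deadlines[slot][i]);                 // due: fire
                } else {
                    deadlines[slot][kept++] = deadlines[slot][i]; // not yet due
                }
            }
            counts[slot] = kept;
        }

        void onExpiry(long deadline) { /* application callback */ }
    }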
  18. Since we service a single topic per client, we can stamp out the client id and topic id bits and never change them.
  19. Since this variable-sized portion rarely changes, we can afford to create an index to it. So we have an index from a partition to its position within the fetch request ByteBuffer.
  20. So let’s imagine that we sent this particular fetch request with partitions 0, 1 and 2. We have a response, and the offsets have been advanced as shown in the offsets table. Now to create the next fetch request, we just read the new offsets and use the index to write them directly into the old buffer.
  21. And we are ready to send this buffer. We avoided all the work required to create a buffer, write out the fixed size fields etc. It’s just writing a few integers to locations in memory.
  22. This is how the code looks. There is a bit of noise in the code because we are iterating a bit set representing the partition assignment. But otherwise the code is simple - fetch the position within our request buffer for this partition. Get the next offset to fetch. Write this offset at the right position.
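
The slide's code is not reproduced here, so the following is a hedged reconstruction of what that loop could look like (all names invented): iterate the assignment bit set, look up each partition's byte position in the reusable request buffer, and write the next offset there.

    import java.nio.ByteBuffer;
    import java.util.BitSet;

    final class FetchRequestPatcher {
        // offsetPositions[p] = byte position of partition p's offset field
        // inside the reusable request buffer (the index from note 19).
        static void patch(ByteBuffer request, BitSet assigned,
                          long[] nextOffsets, int[] offsetPositions) {
            for (int p = assigned.nextSetBit(0); p >= 0; p = assigned.nextSetBit(p + 1)) {
                // Absolute put: overwrite the old offset in place.
                request.putLong(offsetPositions[p], nextOffsets[p]);
            }
            // The buffer is now ready to send again; no allocation, and none
            // of the fixed-size fields were rewritten.
        }
    }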
  23. The metadata request is just frozen after consumer creation.
  24. For the offset request for example we can store a pointer to the num partitions part of the request. So when we need to send a new offset request we can directly seek there and write out the partition bits.
  25. We don’t use JSON, XML, Thrift, Protocol Buffers etc. for our messages. Our messages do not need to be deserialized before consumption; they can be consumed directly, just like Kafka’s internal messages can be. No POJO is created from a serialized message. Instead we wrap the buffer in a flyweight and consume the fields of our messages by doing reads from the underlying buffer. So we need no copies and no allocation for steady-state processing (see the sketch below).
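
A sketch of the flyweight idea, with an invented field layout (the real message format is not shown in the notes):

    import java.nio.ByteBuffer;

    final class MessageFlyweight {
        private ByteBuffer buffer;
        private int offset;

        // Point the flyweight at a message; no allocation, no copy.
        MessageFlyweight wrap(ByteBuffer buffer, int offset) {
            this.buffer = buffer;
            this.offset = offset;
            return this;
        }

        // Hypothetical fixed-offset fields of such a message format.
        long timestamp() { return buffer.getLong(offset); }
        long metricId()  { return buffer.getLong(offset + 8); }
        double value()   { return buffer.getDouble(offset + 16); }
    }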
  26. The interface, however, is low level. Any handler of a message is fed a buffer plus a position and length within that buffer that identify the message. We could alternatively set a position and limit on the buffer and hand it to the application. The poll call takes such a handler and feeds it messages (see the sketch below).
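
A sketch of such an interface; the names are illustrative, not the actual SignalFx API:

    import java.nio.ByteBuffer;

    // The consumer feeds the handler a buffer plus the position and length
    // of one message inside it; the handler must copy anything it keeps.
    interface MessageHandler {
        void onMessage(ByteBuffer buffer, int position, int length);
    }

    // Hypothetical usage: poll() calls the handler synchronously per message.
    //
    //   consumer.poll((buffer, position, length) -> {
    //       long timestamp = buffer.getLong(position); // read in place
    //   });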
  27. So let’s imagine that this is a response from Kafka. There are a bunch of fixed size bits on the top that we can skip. The real payload is a message set per partition.
  28. We begin by going to the first message, ensuring there are no errors and then just passing the pointer and length to the handler. It synchronously consumes it making copies if necessary and then returns back to the parsing code.
  29. We then consume the second message.
  30. And the third and so on.
  31. Benefits are huge: zero copy and zero allocation on the steady path. Since we are not creating a new ByteBuffer every time, DirectByteBuffers become viable, so we can elide the copy involved in reading from the socket into HeapByteBuffers. Sadly, the applicability of this optimization to a generic client is low. We are in control of our buffers and their lifetime, so it is easy for us to avoid a copy. It is perhaps possible to create a very low level API that is not the default. I’ve not had much luck pushing this agenda in the past :)
  32. The Kafka client allocates about 423 MB for 5000 * 300 = 1.5 million messages. That’s 86.56% of all allocations. A sizeable portion of that is in fetch response parsing; a lot of it is ByteBuffer slicing, which our client does not do at all. We talked about a possible but dangerous way to get rid of this entirely. 1.76% is in the selector. About 9.27% is in cluster init - I am not sure why that’s so much.
  33. We allocate 218 KB overall to process 5000 * 300 = 1.5 million messages. The consumer does no allocations of its own. There are allocations done by the Java NIO stack, but they don’t show up in the profile. Selectors allocate, and we plan to use an allocation-free Selector like the one the Netty project uses.
  34. CPU used was 6.6%. 91% of that was the 0.9 consumer, so about 6%. About 12% was spent on checksum math and 67% on handling fetch responses - we talked about a way to make this very fast. Some 6% in metadata - not sure why.
  35. CPU was 2.63%. The client uses about 50% of that, so 1.31%. 16.67% of that is spent in the select call, so the client code proper accounts for 33.33% of 2.63%, which is 0.88% CPU.
  36. Similar story at 10000 messages/second: a 4x-odd difference in CPU and a much larger one in allocations.