SignalFx engineer Rajiv Kurian's presentation on why we wrote our own Kafka consumer, the performance goals, and the performance gains achieved.
Download the slides to see animations showing hardware details. These slides were converted from Keynote to PowerPoint, so there may be some oddness with slide transitions!
5. • High resolution:
• Any mix of resolutions up to 1 sec
• Streaming analytics:
• Custom analytics pipelines at any scale that output in seconds
• Streaming dashboards update in seconds
• Multidimensional metrics:
• Dimensions allow arbitrary modeling, pivoting, filtering, and grouping of both raw and derived (from analytics) metrics interactively on streaming data
• E.g. 99th-percentile-of-latency-by-service-by-customer
SignalFx is built for monitoring modern infrastructure
6. • Designed to replace SimpleConsumer, not the 0.9 consumer
• Needed a non-blocking single threaded consumer
• Wanted it to be low overhead
• 100s of thousands of messages/second
• Sensitive to GC
• The Kafka 0.9 consumer wasn’t ready yet
Why write a new Kafka consumer
17. Cache Lines
• Data is transferred between memory and cache in blocks of fixed size, called cache lines (typically 64 bytes)
• The memory subsystem makes a few bets to help us:
• Temporal locality
• Spatial locality
• Prefetching
25. Optimization aims
• We are NOT aiming for more data/second
• Even a very inefficient implementation will be bottlenecked by the network
• We are aiming to make the client get out of the way
• The client is not the only thing running on the system
• Leave all resources for the actual application
26. Efficiency VS raw speed
• We value efficiency more than raw speed for the client
• Fewer cycles
• Less cache usage and fewer cache misses
• Less memory?
• Efficiency for the client == raw speed for the application
27. Efficiency from constraints
• No consumer group functionality needed
• A single topic
• Finite number of integer partitions
• Partition reassignment is rare and happens during startup and shutdown
• We are in control of the code that consumes the messages
29. Use arrays and open addressing hash maps
• Single topic. Less than 1024 partitions
• Instead of maps we can use arrays
• Or use primitive specialized open addressing hash maps
35. Low memory and cache friendly data structures
• Queues built from integer arrays; negative value -> partition lost
• Zero-allocation hashed-wheel timer to close stuck connections
• Open addressing hash maps
• BitSets coded on top of long arrays whenever a set of partitions is required
• Can be traversed in O(num set bits)
36. Applicability and benefit to Kafka consumer 0.9
• Benefits - medium
• Lots of hash map lookups
• Applicability - low
• Multiple topics - sparse arrays not a great match
• Open addressing hash maps - preserve most of the benefits
38. Eliminate redundant work
• A single topic. Finite number of partitions:
• Topic and client string immutable
• The metadata request buffer can be created just once and kept around forever
• Other requests can have their fixed part written out and only write the variable part on each request
• Offset request = fixed_part + per_partition_part
• Fetch request = fixed_part + per_partition_part
43. Code
private void setNewOffsetsForFetchRequest() {
    final ByteBuffer buffer = this.fetchRequestBuffer;
    // Iterate through the partitions assigned to this broker
    // and write the offset directly on the buffer.
    for (int i = 0; i < partitionAssignment.length; i++) {
        // This loop runs in O(partitions assigned).
        long bitSet = partitionAssignment[i];
        while (bitSet != 0) {
            final long t = bitSet & -bitSet;
            final int partitionId = i * 64 + Long.bitCount(t - 1);
            // The position in the buffer that points to the
            // beginning of the offset for this partition.
            final int bufferPositionForOffset = fetchRequestIndex[partitionId];
            final long offset = partitionToOffset[partitionId];
            // Write the offset directly.
            buffer.putLong(bufferPositionForOffset, offset);
            bitSet ^= t;
        }
    }
}
46. Applicability and benefit to Kafka consumer 0.9
• Benefits - high
• Reuse instead of allocating - temporal locality
• Streaming through 3 arrays - prefetching
• One fetch request per fetch response - common
• Metadata or offset requests - rare
• Applicability - high
• Internal detail so API doesn’t change
• Even for consumer groups, partition reassignment and partition migration events are rare
48. Stream responses to application
• Pass each message to the application when it is ready
• Consume messages synchronously without a copy or allocation
• No deserialization required
• Benefits add up when processing 100s of thousands of messages per second
49. Low level interface
public interface KafkaMessageHandler {
    void handleMessage(ByteBuffer buffer, int position, int length);
}

public interface KafkaConsumer {
    void poll(KafkaMessageHandler handler, long timeoutMs);
    . . .
}
54. Applicability and benefit to Kafka consumer 0.9
• Benefits - very high
• Reuse response buffer, no allocations - temporal locality
• Data is processed right after being read from the socket - temporal locality
• Streaming through a buffer - spatial locality + prefetching
• Combine with DirectByteBuffers for zero copy
• Applicability - low
• API too low level
• Integrity of internal buffers compromised by bugs in application
• Maybe a low level “with great power comes great responsibility” API
56. Caveats
• These are from running a very specific workload similar to our application
• There are many Pareto-optimal choices for a client. Ours is not better in any way - it’s just tuned for our workload
• It can and will prove bad for other workloads
57. Benchmark
• Single topic-partition
• Settings of fetch_max_wait, fetch_min_bytes, and max_bytes_per_partition were identical
• Only 5000 messages per second produced by a single producer
• Each message is 23 bytes
• Warm up -> profile for 5 mins
• 5000/sec * 5 mins = 1.5 million
• Profiler = Java Mission Control
A Kafka cluster has multiple brokers. Each broker is a process of its own with a unique id.
The unit of serializability in Kafka is a partition. Each partition has all its messages ordered.
I like to think of a topic as a group of partitions.
A partition has a statically assigned leader. From the POV of regular clients all read/write operations must go through the leader.
So a client needs to know the mapping of topic-partitions to brokers. This mapping can change dynamically.
A client begins by sending a metadata request to know this mapping. A metadata request can be sent to any broker in the cluster.
The broker then replies with a metadata response.
So the client can now form a map of partitions to brokers.
Next the client needs to build a table of partition -> next offset to consume. It can get it from the consumer group functionality or some other external source.
Once this is built it can send fetch requests for actual data.
As long as there is actual data to consume and no errors it gets back a fetch response.
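To make that flow concrete, here is a minimal sketch of a single-topic consume loop. The types and method names (BrokerSketch, metadata, fetch) are hypothetical stand-ins, not Kafka's real wire API and not our client's actual code.

import java.util.Map;

// Hypothetical broker handle, standing in for the real Kafka wire protocol.
interface BrokerSketch {
    // Metadata request: returns the partition -> leader mapping for a topic.
    Map<Integer, BrokerSketch> metadata(String topic);
    // Fetch request: reads messages for one partition starting at offset,
    // returns the next offset to consume.
    long fetch(String topic, int partition, long offset);
}

class BootstrapSketch {
    void consume(BrokerSketch anyBroker, String topic, Map<Integer, Long> nextOffset) {
        // 1. Send a metadata request to any broker to learn who leads each partition.
        Map<Integer, BrokerSketch> leader = anyBroker.metadata(topic);
        // 2. Fetch loop: all reads go through each partition's leader.
        while (true) {
            for (Map.Entry<Integer, Long> entry : nextOffset.entrySet()) {
                int partition = entry.getKey();
                long next = leader.get(partition).fetch(topic, partition, entry.getValue());
                // 3. Advance the offset we will ask for on the next fetch.
                entry.setValue(next);
            }
        }
    }
}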
Data is transferred between memory and cache in blocks of fixed size, called cache lines (typically 64 bytes). If you need a single byte, 63 others are coming in for the ride and paying the full tax. So you might as well use these bytes.
When the processor needs to read or write a location in main memory, it first checks for a corresponding entry in the cache. In the case of:
1. a cache hit, the processor immediately reads or writes the data in the cache line
2. a cache miss, the cache allocates a new entry and copies in data from main memory, then the request (read or write) is fulfilled from the contents of the cache
An application summing numbers in nodes of a linked list might take one cache miss per node.
Spatial locality and prefetching help a lot when summing an array on the other hand. The compiler is also able to write better vectorized code if your layout looks like this.
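As a rough illustration (not code from the client): summing boxed values reachable only through pointers can miss the cache on every node, while summing a primitive array streams through contiguous memory.

import java.util.List;

class LocalitySketch {
    // Each list node and each boxed Long can sit on its own cache line,
    // so every step may cost a cache miss.
    static long sumList(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    // Sequential reads over a primitive array get spatial locality and
    // hardware prefetching, and are easy for the compiler to vectorize.
    static long sumArray(long[] values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }
}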
We really really care about cache usage and cache misses.
We don’t care about memory as much.
So efficiency for the client means more resources for the application which means a faster application.
Almost all our optimizations are based on constraints that come from our use of the consumer. So, many of them are not directly applicable to generic Kafka clients which need to work well under various scenarios.
We need no consumer group functionality. We manage partitions and offsets outside of Kafka. This makes our client super simple.
A single topic. Our applications mostly consume a topic.
We have a finite small number of partitions. Usually <= 1024.
Partition reassignment is rare. I would imagine that this is true for most applications.
Control of the entire pipeline means we can make some assumptions that a generic client cannot. End to end principle.
Now the interesting part.
Since we have a single topic, all partitions implicitly belong to that topic. So we don’t need a concept of topic-partition. We only have partitions. Since we don’t need topic-partition objects we can store all per partition data in arrays with the array index = partition number.
It is important to acknowledge that this is a tradeoff. Like we said before we really care about cache space and cache misses. We are ready to trade off using extra memory to reduce our cache usage. Here is an example:
Let’s imagine that we have a java util hash map of partition to offset. We’ve already shown that we can have multiple cache misses to do an offset get or put. Now let’s imagine that we have a single partition 0 with an offset of 116, stored in this map. How much memory does this use?
We’ll be generous and assume that headers are only 8 bytes and references are only 4 bytes.
So let’s assume that the entry array was preallocated for 2 entries. There is an 8 byte header, a 4 byte length and two 4 byte references. That’s 20 bytes. Similarly the actual entry is itself 16 bytes, the boxed Long is 16 bytes and the boxed Integer is 12 bytes. So in spite of all the references and indirection it only uses about 64 bytes of memory.
On the other hand let’s assume that our sparse array has been preallocated for 1024 partitions. So it has a 4 byte length, an 8 byte header and 1024 8 byte entries, a total of 8204 bytes, which is around 8 KB. This is a lot more than 64 bytes and kind of wasteful.
Now let’s look at how much cache is used by each solution. Each cache line is 64 bytes. So even if you want a single byte 63 unrelated bytes might come along for the ride.
Now let’s look at the java hash map again. We first need to fetch the right entry - that’s one cache line. The other entry comes along for the ride and possibly the length and header. So that’s 64 bytes already.
Now the actual entry is on another cache line. That is another cache line used up.
Now we need to look at the contents of the boxed partition. That’s another random memory location so a new cache line.
Finally we fetch the offset itself and that’s another cache line.
So it’s 4 cache lines and hence 256 bytes of cache used up through a simple get request.
Now let’s look at the sparse offset array. We know where to fetch it from so with a single cache fetch we get the offset. It comes with potentially 7 other offsets none of which might be useful, but it’s still a single cache line. So we use only 64 bytes!
This example is a bit counter-intuitive. It goes to show that a data structure using only 64 bytes of memory can actually use many times that amount of cache, while a data structure using 8 KB of memory might only use a single cache line. This is a bit like virtual memory vs physical memory. You can use a lot of virtual memory but little physical memory and come out ahead. In our example physical memory is abundant (we have gigabytes of it). Cache memory is very limited. We only have around 32 KB of L1 cache, for example, so it’s much more precious than physical memory.
This also shows how we are ready to make trade-offs. Sparse arrays can take more memory but have a pretty much guaranteed worst-case cache usage and cache miss count.
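Here is a sketch of what that bookkeeping looks like side by side. The field and method names are illustrative, not the actual client code.

import java.util.HashMap;
import java.util.Map;

class OffsetBookkeepingSketch {
    static final int MAX_PARTITIONS = 1024;

    // Boxed map: a get() can touch several unrelated cache lines
    // (table slot, entry, boxed key, boxed value).
    final Map<Integer, Long> offsetByMap = new HashMap<>();

    // Sparse primitive array: ~8 KB of memory, but a lookup touches
    // exactly one cache line and the index is the partition id itself.
    final long[] offsetByArray = new long[MAX_PARTITIONS];

    long nextOffset(int partition) {
        return offsetByArray[partition]; // a single dependent load
    }

    void advance(int partition, long newOffset) {
        offsetByArray[partition] = newOffset;
    }
}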
We talked about the main data structures. Our other data structures especially for our state machine implementation are all designed to be zero allocation in the steady state and very cache friendly. Even our hashed-wheel timer is made of primitive arrays with very few indirections.
Since we service a single topic per client, we can stamp out the client id and topic id bits and never change them.
Since this variable sized portion rarely changes, we can afford to create an index to it. So we have an index from a partition to its position within the fetch request ByteBuffer.
So let’s imagine that we sent this particular fetch request with partitions 0, 1 and 2. We have a response and the offsets have been advanced as shown by the offsets table. Now to create the next fetch request, we just read the new offsets and use the index to write them directly on the old buffer.
And we are ready to send this buffer. We avoided all the work required to create a buffer, write out the fixed size fields etc. It’s just writing a few integers to locations in memory.
This is how the code looks. There is a bit of noise in the code because we are iterating a bit set representing the partition assignment. But otherwise the code is simple - fetch the position within our request buffer for this partition. Get the next offset to fetch. Write this offset at the right position.
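Stripped of the fetch-request details, the bit-set iteration itself looks roughly like this (a generic sketch, not the client's exact helper):

import java.util.function.IntConsumer;

class BitSetIterationSketch {
    // Visit the index of every set bit in a long[] bit set,
    // in O(number of set bits).
    static void forEachSetBit(long[] words, IntConsumer visitor) {
        for (int i = 0; i < words.length; i++) {
            long bits = words[i];
            while (bits != 0) {
                final long lowest = bits & -bits; // isolate the lowest set bit
                visitor.accept(i * 64 + Long.numberOfTrailingZeros(lowest));
                bits ^= lowest; // clear that bit and continue
            }
        }
    }
}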
The metadata request is just frozen after consumer creation.
For the offset request for example we can store a pointer to the num partitions part of the request. So when we need to send a new offset request we can directly seek there and write out the partition bits.
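A sketch of that idea, with illustrative names and sizes (the real request layout has more fields):

import java.nio.ByteBuffer;

class OffsetRequestSketch {
    final ByteBuffer offsetRequestBuffer = ByteBuffer.allocate(4096);
    int numPartitionsPosition; // recorded once, when the fixed part is written

    // Seek to the remembered position and rewrite only the per-partition part.
    void rewriteVariablePart(int[] partitions) {
        offsetRequestBuffer.position(numPartitionsPosition);
        offsetRequestBuffer.putInt(partitions.length);
        for (int partition : partitions) {
            offsetRequestBuffer.putInt(partition);
            // ... remaining per-partition fields would be written here ...
        }
    }
}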
We don’t use JSON, XML, Thrift, ProtocolBuffers etc for our messages. Our messages do not need to be deserialized before consumption. They can be consumed directly just like Kafka’s internal messages can be. There is no POJO created from a serialized message. Instead we can wrap the buffer in a flyweight and consume the fields of our messages by doing reads from the underlying buffer.
So we don’t need any copies or any allocation for steady state processing.
The interface however is low level. Any handler of a message is fed with a buffer and a position and length within that buffer that represents the message. We could also alternatively set a position and limit on the buffer and send it to the application.
The poll call takes such a handler and feeds it with messages.
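For example, a handler might read fields straight out of the shared buffer, flyweight style. The two-field message layout below is made up purely for illustration; it is not our actual wire format.

import java.nio.ByteBuffer;

class SummingHandler implements KafkaMessageHandler {
    long count;
    double sum;

    @Override
    public void handleMessage(ByteBuffer buffer, int position, int length) {
        // Read fields at absolute positions: no copy, no POJO, no allocation.
        final long timestampMs = buffer.getLong(position);            // hypothetical field
        final double value = buffer.getDouble(position + Long.BYTES); // hypothetical field
        count++;
        sum += value;
    }
}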
So let’s imagine that this is a response from Kafka. There are a bunch of fixed size bits on the top that we can skip. The real payload is a message set per partition.
We begin by going to the first message, ensuring there are no errors and then just passing the pointer and length to the handler. It synchronously consumes it making copies if necessary and then returns back to the parsing code.
We then consume the second message.
And the third and so on.
Benefits are huge. Zero copy and zero allocation in the steady path. Since we are not creating a new ByteBuffer every time - DirectByteBuffers become viable. So we elide the copy involved in reading from the socket into HeapByteBuffers.
Sadly the applicability of this optimization is low. We are in control of our buffers and their lifetime so it is easy for us to avoid a copy. It is perhaps possible to create a very low level api that is not the default. I’ve not had much luck pushing this agenda in the past :)
The Kafka client allocates about 423 MB for 5000 * 300 = 1.5 million messages
That’s 86.56% of all allocations.
A sizeable portion of that is in fetch response parsing. A lot of that is ByteBuffer slicing, which our client does not do at all.
We talked about a possible but dangerous way to get rid of this entirely.
1.76% is in the selector.
About 9.27% is in cluster init. I am not sure why that’s so much.
We allocate 218 KB overall to process 5000 * 300 = 1.5 million messages.
The consumer does no allocations of its own. There are allocations done by the java NIO stack but they don’t show up in the profile. Selectors allocate, and we plan to use an allocation-free Selector like the one the Netty project uses.
CPU used was 6.6%.
91% of that was the 0.9 consumer, so about 6%.
12% spent on checksum math.
67% on handling fetch responses - we talked about a way to make this very fast.
Some 6% in metadata - not sure why
CPU was 2.63%.
The client uses about 50% of that, so 1.31%.
Of the total, 16.67% is spent in the select call, so the client code proper accounts for 33.33% of 2.63%, which is about 0.88% CPU.
Similar story for 10000 messages/second
Roughly a 4x difference in CPU, and a much larger difference in allocations.