SignalFx engineer Rajiv Kurian's presentation on why we wrote our own Kafka consumer, the performance goals, and the performance gains achieved.
Download the slides to see animations showing hardware details. These slides were converted from Keynote to PowerPoint, so there may be some oddness with slide transitions!
5. • High resolution:
• Any mix of resolutions up to 1 sec
• Streaming analytics:
• Custom analytics pipelines at any scale that output in seconds
• Streaming dashboards update in seconds
• Multidimensional metrics:
• Dimensions allow arbitrary modeling, pivoting, filtering, and grouping of both raw and derived (from analytics) metrics interactively on streaming data
• E.g. 99th-percentile-of-latency-by-service-by-customer
SignalFx is built for monitoring modern infrastructure
6. • Designed to replace SimpleConsumer, not the 0.9 consumer
• Needed a non-blocking single threaded consumer
• Wanted it to be low overhead
• 100s of thousands of messages/second
• Sensitive to GC
• The Kafka 0.9 consumer wasn’t ready yet
Why write a new Kafka consumer
17. Cache Lines
• Data is transferred between memory and cache in blocks of fixed size, called cache lines (typically 64 bytes)
• The memory subsystem makes a few bets to help us:
• Temporal locality
• Spatial locality
• Prefetching
25. Optimization aims
• We are NOT aiming for more data/second
• Even a very inefficient implementation will be bottlenecked by the network
• We are aiming to make the client get out of the way
• The client is not the only thing running on the system
• Leave all resources for the actual application
26. Efficiency VS raw speed
• We value efficiency more than raw speed for the client
• Fewer cycles
• Less cache usage and fewer cache misses
• Less memory?
• Efficiency for the client == raw speed for the application
27. Efficiency from constraints
• No consumer group functionality needed
• A single topic
• Finite number of integer partitions
• Partition reassignment is rare and happens during startup and shutdown
• We are in control of the code that consumes the messages
29. Use arrays and open addressing hash maps
• Single topic. Less than 1024 partitions
• Instead of maps we can use arrays
• Or use primitive specialized open addressing hash maps
35. Low memory and cache friendly data structures
• Queues built from integer arrays; negative value -> partition lost
• Zero-allocation hashed-wheel timer to close stuck connections
• Open addressing hash maps
• BitSets coded on top of long arrays whenever a set of partitions is required
• Can be traversed in O(num set bits)
36. Applicability and benefit to Kafka consumer 0.9
• Benefits - medium
• Lots of hash map lookups
• Applicability - low
• Multiple topics - sparse arrays not a great match
• Open addressing hash maps - preserve most of the benefits
38. Eliminate redundant work
• A single topic. Finite number of partitions:
• Topic and client string immutable
• The metadata request buffer can be created just once and kept around forever
• Other requests can have their fixed part written out and only write the variable part on each request
• Offset request = fixed_part + per_partition_part
• Fetch request = fixed_part + per_partition_part
43. Code
private void setNewOffsetsForFetchRequest() {
    final ByteBuffer buffer = this.fetchRequestBuffer;
    // Iterate through the partitions assigned to this broker
    // and write the offset directly on the buffer.
    for (int i = 0; i < partitionAssignment.length; i++) {
        // This loop runs in O(partitions assigned).
        long bitSet = partitionAssignment[i];
        while (bitSet != 0) {
            final long t = bitSet & -bitSet;
            final int partitionId = i * 64 + Long.bitCount(t - 1);
            // The position in the buffer that points to the
            // beginning of the offset for this partition.
            final int bufferPositionForOffset = fetchRequestIndex[partitionId];
            final long offset = partitionToOffset[partitionId];
            // Write the offset directly.
            buffer.putLong(bufferPositionForOffset, offset);
            bitSet ^= t;
        }
    }
}
46. Applicability and benefit to Kafka consumer 0.9
• Benefits - high
• Reuse instead of allocating - temporal locality
• Streaming through 3 arrays - prefetching
• One fetch request per fetch response - common
• Metadata or offset requests - rare
• Applicability - high
• Internal detail so API doesn’t change
• Even for consumer groups, partition reassignment and partition migration events are rare
48. Stream responses to application
• Pass each message to the application when it is ready
• Consume messages synchronously without a copy or allocation
• No deserialization required
• Benefits add up when processing 100s of thousands of messages per second
49. Low level interface
public interface KafkaMessageHandler {
    void handleMessage(ByteBuffer buffer, int position, int length);
}

public interface KafkaConsumer {
    void poll(KafkaMessageHandler handler, long timeoutMs);
    . . .
}
54. Applicability and benefit to Kafka consumer 0.9
• Benefits - very high
• Reuse response buffer, no allocations - temporal locality
• Data is processed right after being read from the socket - temporal locality
• Streaming through a buffer - spatial locality + prefetching
• Combine with DirectByteBuffers for zero copy
• Applicability - low
• API too low level
• Integrity of internal buffers compromised by bugs in application
• Maybe a low level “with great power comes great responsibility” API
56. Caveats
• These are from running a very specific workload similar to our application
• There are many Pareto-optimal choices for a client. Ours is not better in any way - it’s just tuned for our workload
• It can and will prove bad for other workloads
57. Benchmark
• Single topic-partition
• Settings of fetch_max_wait, fetch_min_bytes, and max_bytes_per_partition were identical
• Only 5000 messages per second produced by a single producer
• Each message is 23 bytes
• Warm up -> profile for 5 mins
• 5000/sec * 5 mins = 1.5 million
• Profiler = Java Mission Control
A Kafka cluster has multiple brokers. Each broker is a process of its own with a unique id.
The unit of serializability in Kafka is a partition. Each partition has all its messages ordered.
I like to think of a topic as a group of partitions.
A partition has a statically assigned leader. From the POV of regular clients all read/write operations must go through the leader.
So a client needs to know the mapping of topic-partitions to brokers. This mapping can change dynamically.
A client begins by sending a metadata request to know this mapping. A metadata request can be sent to any broker in the cluster.
The broker then replies with a metadata response.
So the client can now form a map of partitions to brokers.
Next the client needs to build a table of partition -> next offset to consume. It can get it from the consumer group functionality or some other external source.
Once this is built it can send fetch requests for actual data.
As long as there is actual data to consume and no errors it gets back a fetch response.
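To make that flow concrete, here is a minimal sketch of a single-topic consume loop. The types and method names (BrokerSketch, metadata, fetch) are hypothetical stand-ins, not Kafka's real wire API and not our client's actual code.

import java.util.Map;

// Hypothetical broker handle, standing in for the real Kafka wire protocol.
interface BrokerSketch {
    // Metadata request: returns the partition -> leader mapping for a topic.
    Map<Integer, BrokerSketch> metadata(String topic);
    // Fetch request: reads messages for one partition starting at offset,
    // returns the next offset to consume.
    long fetch(String topic, int partition, long offset);
}

class BootstrapSketch {
    void consume(BrokerSketch anyBroker, String topic, Map<Integer, Long> nextOffset) {
        // 1. Send a metadata request to any broker to learn who leads each partition.
        Map<Integer, BrokerSketch> leader = anyBroker.metadata(topic);
        // 2. Fetch loop: all reads go through each partition's leader.
        while (true) {
            for (Map.Entry<Integer, Long> entry : nextOffset.entrySet()) {
                int partition = entry.getKey();
                long next = leader.get(partition).fetch(topic, partition, entry.getValue());
                // 3. Advance the offset we will ask for on the next fetch.
                entry.setValue(next);
            }
        }
    }
}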
Data is transferred between memory and cache in blocks of fixed size, called cache lines (typically 64 bytes). If you need a single byte, 63 others are coming in for the ride and paying the full tax. So you might as well use these bytes.
When the processor needs to read or write a location in main memory, it first checks for a corresponding entry in the cache. In the case of:
1. a cache hit, the processor immediately reads or writes the data in the cache line
2. a cache miss, the cache allocates a new entry and copies in data from main memory, then the request (read or write) is fulfilled from the contents of the cache
An application summing numbers in nodes of a linked list might take one cache miss per node.
Spatial locality and prefetching help a lot when summing an array on the other hand. The compiler is also able to write better vectorized code if your layout looks like this.
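As a rough illustration (not code from the client): summing boxed values reachable only through pointers can miss the cache on every node, while summing a primitive array streams through contiguous memory.

import java.util.List;

class LocalitySketch {
    // Each list node and each boxed Long can sit on its own cache line,
    // so every step may cost a cache miss.
    static long sumList(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    // Sequential reads over a primitive array get spatial locality and
    // hardware prefetching, and are easy for the compiler to vectorize.
    static long sumArray(long[] values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }
}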
We really really care about cache usage and cache misses.
We don’t care about memory as much.
So efficiency for the client means more resources for the application which means a faster application.
Almost all our optimizations are based on constraints that come from our use of the consumer. So, many of them are not directly applicable to generic Kafka clients which need to work well under various scenarios.
We need no consumer group functionality. We manage partitions and offsets outside of Kafka. This makes our client super simple.
A single topic. Our applications mostly consume a topic.
We have a finite small number of partitions. Usually <= 1024.
Partition reassignment is rare. I would imagine that this is true for most applications.
Control of the entire pipeline means we can make some assumptions that a generic client cannot. End to end principle.
Now the interesting part.
Since we have a single topic, all partitions implicitly belong to that topic. So we don’t need a concept of topic-partition. We only have partitions. Since we don’t need topic-partition objects we can store all per partition data in arrays with the array index = partition number.
It is important to acknowledge that this is a tradeoff. Like we said before we really care about cache space and cache misses. We are ready to trade off using extra memory to reduce our cache usage. Here is an example:
Let’s imagine that we have a java util hash map of partition to offset. We’ve already shown that we can have multiple cache misses to do an offset get or put. Now let’s imagine that we have a single partition 0 with an offset of 116, stored in this map. How much memory does this use?
We’ll be generous and assume that headers are only 8 bytes and references are only 4 bytes.
So let’s assume that the entry array was preallocated for 2 entries. There is an 8 byte header, a 4 byte length and two 4 byte references. That’s 20 bytes. Similarly the actual entry is itself 16 bytes, the boxed Long is 16 bytes and the boxed Integer is 12 bytes. So in spite of all the references and indirection it only uses about 64 bytes of memory.
On the other hand let’s assume that our sparse array has been preallocated for 1024 partitions. So it has a 4 byte length, an 8 byte header and 1024 8 byte entries, a total of 8204 bytes, which is around 8 KB. This is a lot more than 64 bytes and kind of wasteful.
Now let’s look at how much cache is used by each solution. Each cache line is 64 bytes. So even if you want a single byte 63 unrelated bytes might come along for the ride.
Now let’s look at the java hash map again. We first need to fetch the right entry - that’s one cache line. The other entry comes along for the ride and possibly the length and header. So that’s 64 bytes already.
Now the actual entry is on another cache line. That is another cache line used up.
Now we need to look at the contents of the boxed partition. That’s another random memory location so a new cache line.
Finally we fetch the offset itself and that’s another cache line.
So it’s 4 cache lines and hence 256 bytes of cache used up through a simple get request.
Now let’s look at the sparse offset array. We know where to fetch it from so with a single cache fetch we get the offset. It comes with potentially 7 other offsets none of which might be useful, but it’s still a single cache line. So we use only 64 bytes!
This example is a bit counter-intuitive. It goes to show that a data structure using only 64 bytes of memory can actually use many times that amount of cache, while a data structure using 8 KB of memory might only use a single cache line. This is a bit like virtual memory vs physical memory. You can use a lot of virtual memory but little physical memory and come out ahead. In our example physical memory is abundant (we have gigabytes of it). Cache memory is very limited. We only have around 32 KB of L1 cache, for example, so it’s much more precious than physical memory.
This also shows how we are ready to make trade-offs. Sparse arrays can take more memory but have a pretty much guaranteed worst-case cache usage and cache miss count.
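Here is a sketch of what that bookkeeping looks like side by side. The field and method names are illustrative, not the actual client code.

import java.util.HashMap;
import java.util.Map;

class OffsetBookkeepingSketch {
    static final int MAX_PARTITIONS = 1024;

    // Boxed map: a get() can touch several unrelated cache lines
    // (table slot, entry, boxed key, boxed value).
    final Map<Integer, Long> offsetByMap = new HashMap<>();

    // Sparse primitive array: ~8 KB of memory, but a lookup touches
    // exactly one cache line and the index is the partition id itself.
    final long[] offsetByArray = new long[MAX_PARTITIONS];

    long nextOffset(int partition) {
        return offsetByArray[partition]; // a single dependent load
    }

    void advance(int partition, long newOffset) {
        offsetByArray[partition] = newOffset;
    }
}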
We talked about the main data structures. Our other data structures especially for our state machine implementation are all designed to be zero allocation in the steady state and very cache friendly. Even our hashed-wheel timer is made of primitive arrays with very few indirections.
Since we service a single topic per client, we can stamp out the client id and topic id bits and never change them.
Since this variable sized portion rarely changes, we can afford to create an index to it. So we have an index from a partition to its position within the fetch request ByteBuffer.
So let’s imagine that we sent this particular fetch request with partitions 0, 1 and 2. We have a response and the offsets have been advanced as shown by the offsets table. Now to create the next fetch request, we just read the new offsets and use the index to write them directly on the old buffer.
And we are ready to send this buffer. We avoided all the work required to create a buffer, write out the fixed size fields etc. It’s just writing a few integers to locations in memory.
This is how the code looks. There is a bit of noise in the code because we are iterating a bit set representing the partition assignment. But otherwise the code is simple - fetch the position within our request buffer for this partition. Get the next offset to fetch. Write this offset at the right position.
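Stripped of the fetch-request details, the bit-set iteration itself looks roughly like this (a generic sketch, not the client's exact helper):

import java.util.function.IntConsumer;

class BitSetIterationSketch {
    // Visit the index of every set bit in a long[] bit set,
    // in O(number of set bits).
    static void forEachSetBit(long[] words, IntConsumer visitor) {
        for (int i = 0; i < words.length; i++) {
            long bits = words[i];
            while (bits != 0) {
                final long lowest = bits & -bits; // isolate the lowest set bit
                visitor.accept(i * 64 + Long.numberOfTrailingZeros(lowest));
                bits ^= lowest; // clear that bit and continue
            }
        }
    }
}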
The metadata request is just frozen after consumer creation.
For the offset request for example we can store a pointer to the num partitions part of the request. So when we need to send a new offset request we can directly seek there and write out the partition bits.
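A sketch of that idea, with illustrative names and sizes (the real request layout has more fields):

import java.nio.ByteBuffer;

class OffsetRequestSketch {
    final ByteBuffer offsetRequestBuffer = ByteBuffer.allocate(4096);
    int numPartitionsPosition; // recorded once, when the fixed part is written

    // Seek to the remembered position and rewrite only the per-partition part.
    void rewriteVariablePart(int[] partitions) {
        offsetRequestBuffer.position(numPartitionsPosition);
        offsetRequestBuffer.putInt(partitions.length);
        for (int partition : partitions) {
            offsetRequestBuffer.putInt(partition);
            // ... remaining per-partition fields would be written here ...
        }
    }
}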
We don’t use JSON, XML, Thrift, ProtocolBuffers etc for our messages. Our messages do not need to be deserialized before consumption. They can be consumed directly just like Kafka’s internal messages can be. There is no POJO created from a serialized message. Instead we can wrap the buffer in a flyweight and consume the fields of our messages by doing reads from the underlying buffer.
So we don’t need any copies or any allocation for steady state processing.
The interface however is low level. Any handler of a message is fed with a buffer and a position and length within that buffer that represents the message. We could also alternatively set a position and limit on the buffer and send it to the application.
The poll call takes such a handler and feeds it with messages.
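For example, a handler might read fields straight out of the shared buffer, flyweight style. The two-field message layout below is made up purely for illustration; it is not our actual wire format.

import java.nio.ByteBuffer;

class SummingHandler implements KafkaMessageHandler {
    long count;
    double sum;

    @Override
    public void handleMessage(ByteBuffer buffer, int position, int length) {
        // Read fields at absolute positions: no copy, no POJO, no allocation.
        final long timestampMs = buffer.getLong(position);            // hypothetical field
        final double value = buffer.getDouble(position + Long.BYTES); // hypothetical field
        count++;
        sum += value;
    }
}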
So let’s imagine that this is a response from Kafka. There are a bunch of fixed size bits on the top that we can skip. The real payload is a message set per partition.
We begin by going to the first message, ensuring there are no errors and then just passing the pointer and length to the handler. It synchronously consumes it making copies if necessary and then returns back to the parsing code.
We then consume the second message.
And the third and so on.
Benefits are huge. Zero copy and zero allocation in the steady path. Since we are not creating a new ByteBuffer every time - DirectByteBuffers become viable. So we elide the copy involved in reading from the socket into HeapByteBuffers.
Sadly the applicability of this optimization is low. We are in control of our buffers and their lifetime so it is easy for us to avoid a copy. It is perhaps possible to create a very low level api that is not the default. I’ve not had much luck pushing this agenda in the past :)
The Kafka client allocates about 423 MB for 5000 * 300 = 1.5 million messages
That’s 86.56% of all allocations.
A sizeable portion of that is in fetch response parsing. A lot of that is ByteBuffer slicing, which our client does not do at all.
We talked about a possible but dangerous way to get rid of this entirely.
1.76% is in the selector.
About 9.27% is in cluster init. I am not sure why that’s so much.
We allocate 218 KB overall to process 5000 * 300 = 1.5 million messages.
The consumer does no allocations of its own. There are allocations done by the java NIO stack but they don’t show up in the profile. Selectors allocate, and we plan to use an allocation-free Selector like the one the Netty project uses.
CPU used was 6.6%.
91% of that was the 0.9 consumer, so about 6%.
12% spent on checksum math.
67% on handling fetch responses - we talked about a way to make this very fast.
Some 6% in metadata - not sure why
CPU was 2.63%.
The client uses about 50% of that, so 1.31%.
Of the total, 16.67% is spent in the select call, so the client code proper accounts for 33.33% of 2.63%, which is about 0.88% CPU.
Similar story for 10000 messages/second
Roughly a 4x difference in CPU, and a much larger difference in allocations.