Fan-in Flames: Scaling Kafka to Millions of Producers With Ryanne Dolan | Current 2022
At supermassive scale, a perennial problem with Kafka is "high fan-in" -- a large number of producers sending records to a small number of brokers. Even a relatively modest amount of data can overwhelm a broker when there are hundreds of thousands of concurrent producer requests.
This talk discusses a few real-world applications where high fan-in becomes a problem, and presents a few strategies for dealing with it. These include: fronting Kafka with an ingestion layer; separating brokers into read-only and write-only subsets; implementing specialized partitioning strategies; and scaling across clusters with "smart clients".
15. Data pipelines can reduce fan-out
One big topic → high fan-out
[Diagram: one big topic with high fan-out]
16. Data pipelines can reduce fan-out
Pipelines → smaller topics → less fan-out
[Diagram: one big topic feeding a filtered topic and a projected topic]
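The filter/project step above can be sketched in plain Python, with dicts and lists standing in for Kafka records and topics (a real pipeline would consume from one topic and produce to another; the record fields here are illustrative):

```python
# Sketch of a pipeline deriving smaller topics from one big topic.
# Plain Python stands in for a Kafka consumer/producer pair.

def filter_topic(records, predicate):
    """Filtered topic: keep only records matching a predicate."""
    return [r for r in records if predicate(r)]

def project_topic(records, fields):
    """Projected topic: keep only a subset of each record's fields."""
    return [{k: r[k] for k in fields if k in r} for r in records]

big_topic = [
    {"app": "search", "level": "ERROR", "msg": "timeout", "host": "h1"},
    {"app": "feed", "level": "INFO", "msg": "ok", "host": "h2"},
]

errors_only = filter_topic(big_topic, lambda r: r["level"] == "ERROR")
slim = project_topic(big_topic, ["app", "msg"])
```

Consumers that only need the errors fetch from the small filtered topic; only the pipeline itself fans out from the big one.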
17. Data pipelines for aggregation
Small topics → aggregated topics → less fan-out
[Diagram: many small topics feeding aggregated topic 1 and aggregated topic 2]
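The aggregation direction can be sketched the same way: a pipeline consumes many small topics and writes one combined topic, so downstream consumers fetch from one place instead of many (again, plain Python stands in for the Kafka plumbing):

```python
import itertools

def aggregate_topics(*small_topics):
    """Aggregated topic: merge many small topics into one stream."""
    return list(itertools.chain.from_iterable(small_topics))

topic_a = [{"app": "a", "msg": "x"}]
topic_b = [{"app": "b", "msg": "y"}]
combined = aggregate_topics(topic_a, topic_b)  # one topic to consume from
```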
18. Use case: Logging infrastructure
Two requirements
[Diagram: all logs routed into application-specific logs and host-specific logs]
19. Use case: Logging infrastructure
Application, container, and host log events
Two reasonable approaches:

One big topic
• All applications send log events to one big topic, which is sent to the cloud
• Data pipeline routes records based on application ID, container ID, host ID
• Derived topics for each application, container, host
• Consumers can process a single application, container, or host

Many small topics
• Each application sends to its own topic
• Data pipeline aggregates across containers and hosts, and sends to cloud
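The routing step in the "one big topic" approach can be sketched as a function from a log event to its derived topics (the topic naming scheme and record fields here are illustrative, not from the talk):

```python
def derived_topic_names(record):
    """Route a log event to per-application, per-container, and per-host
    derived topics based on its IDs."""
    return [
        f"logs.app.{record['app_id']}",
        f"logs.container.{record['container_id']}",
        f"logs.host.{record['host_id']}",
    ]

event = {"app_id": "search", "container_id": "c42", "host_id": "h7", "msg": "GC pause"}
targets = derived_topic_names(event)
```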
28. The Application -- Mirrored
N+1 producers and M+1 consumers per workload
[Diagram: the workload mirrored across Broker 1 and Broker 2]
29. The Workload
A modest amount of data: a constant 10K RPS, 10K QPS, and 300K BPS
30. Latency metrics
As used here:

End-to-End Latency: delay between when a record is created (before send) and when it is ultimately processed by a consumer (after fetch).

Send Latency: delay between when a record is created (before send) and when it is written to disk on the last broker (before ACK).

Fetch Latency: delay between when a record is written to disk on the last broker (before ACK) and when it is ultimately processed by a consumer (after fetch).
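By these definitions, end-to-end latency is exactly send latency plus fetch latency; a minimal sketch, assuming you have the three timestamps for a record:

```python
def latencies(created_ts, acked_ts, processed_ts):
    """Compute the three latency metrics from record timestamps (seconds).
    By construction, end_to_end == send + fetch."""
    send = acked_ts - created_ts           # created -> written on last broker
    fetch = processed_ts - acked_ts        # written on last broker -> processed
    end_to_end = processed_ts - created_ts
    return send, fetch, end_to_end

send, fetch, e2e = latencies(0.000, 0.030, 0.075)
```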
46. Smart clients
1. Round-robin among a subset of partitions → only a fraction of clients can send to a given broker
2. Measure latency and avoid slow partitions
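Both ideas can be sketched as a custom client-side partitioner; this is an illustrative implementation, not the one from the talk (class name, subset size, and the slow-partition signal are assumptions):

```python
import hashlib
import random

class SubsetPartitioner:
    """Strategy 1: each client round-robins over a small, stable,
    client-specific subset of partitions, so only a fraction of clients
    ever send to any given broker."""

    def __init__(self, client_id, num_partitions, subset_size=4):
        # Seed from the client ID so the subset is stable across restarts.
        seed = int(hashlib.sha256(client_id.encode()).hexdigest(), 16)
        self.subset = random.Random(seed).sample(range(num_partitions), subset_size)
        self.i = 0
        self.slow = set()

    def mark_slow(self, partition):
        """Strategy 2: measured send latency too high -> avoid this partition."""
        self.slow.add(partition)

    def next_partition(self):
        # Round-robin within the subset, skipping partitions marked slow.
        for _ in range(len(self.subset)):
            p = self.subset[self.i % len(self.subset)]
            self.i += 1
            if p not in self.slow:
                return p
        return self.subset[self.i % len(self.subset)]  # all slow: fall back
```

In a real client this would plug into the producer's partitioner hook, with `mark_slow` driven by per-partition latency measurements.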
47. Dealing with Fan-in
Some easy strategies

Add more brokers
Fetch-from-follower can be used to split brokers into read vs write sets

Shard workload
Divide workload into non-overlapping groups

Combine Producers
Avoid having many producer clients within the same application
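Sharding into non-overlapping groups can be as simple as hashing a workload key to one cluster, so each cluster sees only a fraction of the producers (cluster names and the key scheme are illustrative):

```python
import hashlib

CLUSTERS = ["kafka-shard-0", "kafka-shard-1", "kafka-shard-2"]  # illustrative names

def shard_for(workload_key):
    """Map each workload to exactly one cluster: the groups are
    non-overlapping, and the mapping is stable across restarts."""
    h = int(hashlib.sha256(workload_key.encode()).hexdigest(), 16)
    return CLUSTERS[h % len(CLUSTERS)]
```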
48. Dealing with Fan-in
Some easy strategies

Mirroring
Separate producers from consumers

Batching
Slow down to get faster!

Smart partitioning
Avoid having multiple producers write to the same partition at the same time
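"Slow down to get faster" is the standard Kafka batching tradeoff: letting the producer wait (`linger.ms`) to fill larger batches (`batch.size`) multiplies the records per produce request and divides the request rate the broker must handle. A back-of-the-envelope sketch (the numbers are illustrative):

```python
def broker_request_rate(records_per_sec, records_per_batch):
    """Bigger batches -> fewer produce requests hitting the broker."""
    return records_per_sec / records_per_batch

unbatched = broker_request_rate(10_000, 1)    # one record per request
batched = broker_request_rate(10_000, 100)    # e.g. after raising linger.ms
# 10,000 requests/s drops to 100 requests/s, at the cost of added send latency
```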