Apache Kafka at LinkedIn - How LinkedIn Customizes Kafka to Work at the Trillion Scale
1. Apache Kafka at LinkedIn
How LinkedIn Customizes Kafka to Work at the Trillion Scale
Jon Lee, Staff Software Engineer, LinkedIn
Wesley Wu, Senior Software Engineer, LinkedIn
2. Agenda
1 Apache Kafka @ LinkedIn
2 Development Workflow
3 Patch Examples
4 Release Process
4. Apache Kafka
• Distributed stream processing platform
• Publish and subscribe to persistent messages
• High throughput and low latency
• Developed at LinkedIn
• Top-level Apache project
6. Kafka @ LinkedIn: Running at Scale
• 7 trillion messages per day
• 100+ clusters, 4K+ brokers
• 100K+ topics, 7M+ partitions
• Constant scalability and operability challenges
7. LinkedIn Kafka Release Branch
• Source of releases running in LinkedIn production
• Branched from an Apache Kafka release branch
• Contains hotfix patches and upstream cherry-picks
• Tailored to operations and scale at LinkedIn
8. Agenda
1 Apache Kafka @ LinkedIn
2 Development Workflow
3 Patch Examples
4 Release Process
9. Tracking Upstream Closely — “Upstream Everything”
Upstream First
• Commit to upstream first (file a KIP if necessary)
• Cherry-pick it onto the current LinkedIn release branch, or pick it up when a new branch containing the upstream patch is created
• Suitable for patches with low to medium urgency
LinkedIn First (a.k.a. hotfix approach)
• Commit to the LinkedIn branch first
• Double-commit to upstream (best effort)
• Suitable for patches with high urgency
10. Tale of Three Patches
Cherry-pick
• Cherry-picked from upstream
• Kept until a new LinkedIn release branch containing the original upstream patch is created
Double-committed Hotfix
• Hotfix eventually committed to upstream
• Kept until a new LinkedIn branch containing the corresponding upstream patch is created
LinkedIn-private Hotfix
• Hotfix not of interest to upstream (e.g., temporary debug patches), OR a double-commit was attempted but not accepted by upstream
• Kept in LinkedIn branches until no longer needed
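The cherry-pick flow above can be sketched with plain git in a throwaway repository (branch, file, and ticket names here are illustrative, not LinkedIn's real ones):

```shell
# Hedged sketch of the upstream-first cherry-pick flow, in a temp repo.
set -e
cd "$(mktemp -d)"
git init -q -b trunk repo && cd repo
git config user.email demo@example.com
git config user.name demo
git commit -q --allow-empty -m "base"
git branch 2.3-li                          # stand-in for the LinkedIn release branch

# An upstream fix lands on trunk...
echo "fix" > broker.txt
git add broker.txt
git commit -q -m "KAFKA-0000: upstream fix"    # ticket number is made up
fix_sha=$(git rev-parse HEAD)

# ...and is cherry-picked onto the LinkedIn branch.
git checkout -q 2.3-li
git cherry-pick "$fix_sha" >/dev/null
cat broker.txt
```

Once a new LinkedIn branch is cut from an Apache release that already contains the fix, the cherry-picked copy is no longer carried forward.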
11. Close Look at a LinkedIn Release Branch
[Diagram: a LinkedIn release branch forks from an Apache Kafka release branch, which is itself cut from the Apache Kafka trunk. The LinkedIn branch carries upstream patches from before the branching point, cherry-picks, double-committed hotfixes, and LinkedIn-private hotfixes.]
12. Development Workflow
[Flowchart summarizing the decision process for a new issue or feature: check whether it is already fixed upstream — if so, cherry-pick the patch when possible, or let it be picked up at the next rebase; if the patch already exists in the LI branch, nothing more is needed. Otherwise, decide whether to commit to upstream (filing a KIP or upstream ticket when required): committing upstream first follows the upstream patching path, while committing to the LinkedIn branch first follows the hotfix patching path. Patches rejected by upstream remain LinkedIn-private hotfixes.]
13. Agenda
1 Apache Kafka @ LinkedIn
2 Development Workflow
3 Patch Examples
4 Release Process
14. Scalability Support
• Challenges
  • 140+ brokers and 1M+ replicas on a single cluster
  • Controller failure leads to site unavailability
  • Slowness in bouncing a broker causes deployment delays
• Solutions
  • Reuse UpdateMetadataRequest objects to reduce controller memory footprint
  • Improve broker shutdown time by reducing lock contention
  • Avoid excessive logging
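The object-reuse idea in the first solution can be illustrated with a small sketch (this is not Kafka's actual controller code; the class and field names are made up): when identical metadata must be sent to many brokers, interning the payload means one shared object serves every request instead of one copy per broker.

```python
# Illustrative sketch of payload reuse to cut controller memory footprint.
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class UpdateMetadataPayload:
    # Hypothetical fields standing in for partition-state metadata.
    topic: str
    partition_states: Tuple[Tuple[int, int], ...]  # (partition, leader) pairs


class PayloadCache:
    """Return a shared instance for identical payloads instead of rebuilding."""

    def __init__(self):
        self._cache = {}

    def intern(self, payload: UpdateMetadataPayload) -> UpdateMetadataPayload:
        # setdefault stores the first copy and returns it for all later callers.
        return self._cache.setdefault(payload, payload)


cache = PayloadCache()
a = cache.intern(UpdateMetadataPayload("t", ((0, 1), (1, 2))))
b = cache.intern(UpdateMetadataPayload("t", ((0, 1), (1, 2))))
assert a is b  # one object can back the request to every broker
```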
15. Operability Support
• Challenges
  • Removing a broker for maintenance requires moving out all of its replicas
  • New replicas can get assigned to brokers that are about to be removed
• Solutions
  • Add the broker to a maintenance broker list; new replicas are not assigned to maintenance brokers
  • Integrate with Kafka Cruise Control to automate the broker removal process
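The maintenance-list idea above amounts to excluding certain brokers from replica placement. A minimal sketch, assuming a simple round-robin placement (function and parameter names are hypothetical, not LinkedIn's implementation):

```python
# Sketch: replica assignment that skips brokers on the maintenance list.
from itertools import cycle, islice


def assign_replicas(brokers, maintenance, replication_factor):
    """Round-robin replica placement over non-maintenance brokers."""
    eligible = [b for b in brokers if b not in maintenance]
    if len(eligible) < replication_factor:
        raise ValueError("not enough eligible brokers for the replication factor")
    return list(islice(cycle(eligible), replication_factor))


# Broker 2 is being drained for maintenance, so new replicas avoid it.
print(assign_replicas([0, 1, 2, 3], maintenance={2}, replication_factor=3))
# → [0, 1, 3]
```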
16. Features
• Observer for billing
  • Provide accounting information
• Enforce minimum replication factor
  • Minimize data-loss risk in case of broker failure
• New offset reset policy
  • Help consumers navigate to the closest offset
We are considering (WIP):
• CPU optimization (e.g., using the OpenSSL library)
• Separating the controller node from data broker nodes
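The closest-offset reset policy can be sketched as follows (a hedged illustration of the idea, not Kafka's actual API): when a consumer's stored offset falls outside the log's valid range, reset to whichever end is nearer, rather than always jumping to earliest or latest.

```python
# Sketch of a "closest offset" reset policy.
def reset_to_closest(requested: int, earliest: int, latest: int) -> int:
    if earliest <= requested <= latest:
        return requested  # still valid, no reset needed
    # Out of range: pick the nearer log boundary.
    if abs(requested - earliest) <= abs(requested - latest):
        return earliest
    return latest


print(reset_to_closest(50, earliest=100, latest=1000))    # → 100 (fell behind retention)
print(reset_to_closest(5000, earliest=100, latest=1000))  # → 1000 (ahead of the log)
print(reset_to_closest(500, earliest=100, latest=1000))   # → 500 (in range)
```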
17. Direct Contributions to Upstream
• KIP-219: Improve quota communication
• KIP-291: Separating controller connections and requests from the data plane
• KIP-354: Add a maximum log compaction lag
• KIP-380: Detect outdated control requests and bounced brokers using broker generation
18. Agenda
1 Apache Kafka @ LinkedIn
2 Development Workflow
3 Patch Examples
4 Release Process
19. Creating a New LinkedIn Release Branch
[Diagram: LinkedIn release branches are cut from Apache Kafka release points (e.g., 2.0.0 and 2.3.0) on the Apache Kafka trunk; cherry-picks and LinkedIn-private hotfixes are marked on the LinkedIn branches.]
20. Certifying a Release
[Diagram: two certification clusters, one running the baseline and one running the release candidate, each with brokers 0 through N receiving the same produce and consume traffic.]
• Identical setup with 30+ brokers
• Production traffic
• Automated compare run
• Detailed report
• Certification covers rebalance, deployment, rolling bounce, stability, and downgrade
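The automated compare run above can be pictured as diffing the same metrics from the baseline and release-candidate clusters (a minimal sketch; the metric names, threshold, and lower-is-better assumption are all made up for illustration):

```python
# Sketch: flag metrics where the release candidate regressed past a tolerance.
def compare_metrics(baseline: dict, candidate: dict, tolerance: float = 0.05):
    """Return metrics where the candidate regressed by more than `tolerance`."""
    regressions = {}
    for name, base in baseline.items():
        cand = candidate.get(name)
        if cand is None or base == 0:
            continue
        change = (cand - base) / base
        if change > tolerance:  # assumes lower is better (latency-style metrics)
            regressions[name] = round(change, 3)
    return regressions


baseline = {"produce_latency_ms": 10.0, "fetch_latency_ms": 8.0}
candidate = {"produce_latency_ms": 10.2, "fetch_latency_ms": 9.6}
print(compare_metrics(baseline, candidate))
# → {'fetch_latency_ms': 0.2}
```

A real compare run would feed such a report into the detailed certification report mentioned above.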
21. Please Check Out
• Source code available on GitHub: http://github.com/linkedin/kafka
• NOT a fork
• Branches are named as <Apache Kafka Release>-li (e.g., 2.0-li and 2.3-li)
• We are not accepting external contributions; please contribute directly to upstream