Self-hosting Kafka at Scale: Netflix's Journey & Challenges

Piyush Goyal, Staff Engineer, Data Platform
Nick Mahilani, Staff Engineer, Data Platform
Current 2024

Thank you for being here!
RAISE YOUR HAND
IF YOU USE KAFKA IN YOUR ORGANIZATION

KEEP YOUR HAND UP
IF YOU ARE SELF-HOSTING APACHE KAFKA
(NOT using a Kafka service provider)

WHAT CAN YOU EXPECT FROM THIS SESSION?
● How Netflix leverages Kafka to unlock various use cases
● Our long journey with Kafka
● How we operate Kafka today
● Challenges and learnings

Our Journey With Kafka
● Business Context
● Keystone Platform (2015)
● Evolution to Composable Architecture
● Kafka as a Service (2021)
● KaaS Features and Architecture
● KaaS Learnings

Netflix Scale (as of August 2024)
● Devices: >1,000,000,000
● Countries: >190
● Members: >278,000,000

Microservices Ecosystem
● Systems at our scale generate a lot of data
● This data needs to be transported to where it can be processed and analysed

Centralized Event Pipeline (2015)
The System should have the following characteristics:
● Easy to use
● Highly Available
● Scalable
● Near Real-Time

This gave rise to Netflix’s Keystone Platform in 2015

Keystone Platform (2015)
● Highly abstracted product
○ Data Movement to Sinks
○ Simple Real-time processing (Filter, Projection)
● Client Library, UI, Management plane, and Data Plane
● Used Apache Kafka and Apache Flink under the hood

Keystone Platform
[Diagram: Event Producers → Fronting Kafka → Router/Processor → Consumer Kafka → Stream Consumers, with Keystone Management alongside]
● Event producers publish events with the Keystone client library (Kafka-agnostic)

Two types of Kafka Clusters
FRONTING
● Multi-tenant clusters
● Used to publish data
● Abstracted from producers
● Controlled cluster access
● Critical for high availability
● Larger fleet
CONSUMER
● Multi-tenant clusters
● Used to consume data
● Coupled with consumers
● Smaller fleet

Resilience to cluster failure
Keystone Client topic lookup:
Stream            Cluster     Topic
playback_events   Cluster A   playback_events
ad_events         Cluster B   ad_events
[Diagram: Fronting clusters. Cluster A hosts topic playback_events; Cluster B hosts topic ad_events]
Resilience to cluster failure
Keystone Client topic lookup after Cluster A fails:
Stream            Cluster                 Topic
playback_events   Cluster A → Cluster B   playback_events
ad_events         Cluster B               ad_events
[Diagram: Fronting clusters. Cluster A ⚠ hosts topic playback_events; Cluster B hosts topics ad_events and playback_events]
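
The failover on this slide can be sketched as a lookup table that the client consults before publishing. This is a minimal illustration, not Keystone's actual implementation: `Route`, `TopicRouter`, and `fail_over` are hypothetical names, and how the real client learns about routing changes is not covered in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    cluster: str  # fronting cluster that currently hosts the stream
    topic: str    # Kafka topic name on that cluster


class TopicRouter:
    """Hypothetical stream -> (cluster, topic) lookup table."""

    def __init__(self, routes: dict[str, Route]) -> None:
        self._routes = dict(routes)

    def lookup(self, stream: str) -> Route:
        return self._routes[stream]

    def fail_over(self, failed_cluster: str, healthy_cluster: str) -> list[str]:
        """Repoint every stream on failed_cluster; return the moved streams."""
        moved = []
        for stream, route in self._routes.items():
            if route.cluster == failed_cluster:
                # Same topic name, different cluster: producers are unaffected.
                self._routes[stream] = Route(healthy_cluster, route.topic)
                moved.append(stream)
        return moved


router = TopicRouter({
    "playback_events": Route("Cluster A", "playback_events"),
    "ad_events": Route("Cluster B", "ad_events"),
})
moved = router.fail_over("Cluster A", "Cluster B")
```

Because producers only name the stream, repointing the table entry is enough to redirect `playback_events` without any producer-side change.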

👍 Things worked well..
● Highly abstracted and easy to use product
● Only takes a couple of minutes to create simple data pipelines
● Huge adoption: more than 6000 data pipelines
● >100M messages per second (>150 GB/s)
● Quick real-time transformations like filtering and projection

Not everything worked well 😑
● Highly inefficient for streaming-only consumers
○ Unnecessary hops
○ Higher latency
○ Extra cost
● Noisy neighbors in a multi-tenant environment
● No direct access to Kafka for producers
● Administration of Kafka was semi-automated

And we needed more..
● A highly abstracted product means limited functionality, done well
● Solved 80% of use cases; what about the rest?
● New Business Requirements demanded more functionality
○ Event Driven Architecture
○ Change Data Capture
○ Low latency use-cases
○ Custom Stream Processing
○ Direct Kafka integration for third-party tools

Architecture Evolution
[Diagram: Closed System → Composable System; building blocks shown: Pipeline Abstraction, Kafka as a Service, Stream Processing]

Whether to build or buy?
● We evaluated the tradeoffs for our situation (Year 2020-21)
○ Customizability
○ Long term costs
○ Available in-house expertise
○ Minimize Risks
After careful consideration, we decided to BUILD our own managed Kafka platform. YMMV!


Kafka as a Service (KaaS)
● Provisioning
● Observability
● Alerting & Auto Remediation
● Security & Access Control
● Client Library
● Schema Management

Provisioning Kafka Clusters
SHARED vs. DEDICATED

Kafka Cluster Configuration
● High Availability
○ Replication factor = 2
○ Min in-sync replicas = 1
○ Unclean leader election enabled
● Strong Consistency
○ Replication factor = 3
○ Min in-sync replicas = 2
○ Unclean leader election disabled
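
The two profiles map onto standard Apache Kafka settings. The sketch below is an illustration rather than Netflix's actual configuration: the `Profile` class and its field names are hypothetical, while `min.insync.replicas` and `unclean.leader.election.enable` are real Kafka topic-level configs. It also computes how many replicas can be unavailable while a producer using `acks=all` can still write.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    replication_factor: int
    min_insync_replicas: int
    unclean_leader_election: bool

    def topic_configs(self) -> dict[str, str]:
        """Topic-level Kafka configs (replication factor is set at topic creation)."""
        return {
            "min.insync.replicas": str(self.min_insync_replicas),
            "unclean.leader.election.enable": str(self.unclean_leader_election).lower(),
        }

    def tolerated_replica_losses(self) -> int:
        """Replicas that can be down while acks=all writes still succeed."""
        return self.replication_factor - self.min_insync_replicas


# Values taken from the slide; the constant names are hypothetical.
HIGH_AVAILABILITY = Profile(2, 1, True)
STRONG_CONSISTENCY = Profile(3, 2, False)
```

Both profiles keep accepting writes with one replica down. The difference is durability: with `min.insync.replicas` = 1 an acknowledged write may exist on a single replica, and unclean leader election allows an out-of-sync replica to take over, so availability is bought at the risk of data loss.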

KaaS Scale
190 million messages / second
150+ GB ingested / second
8+ PB persisted state
475+ dedicated Kafka Clusters
11,500 Kafka brokers
35,000 Kafka topics

1. Scaling a single Kafka Cluster

Scaling Up a Cluster before KaaS
Topic partition counts were tightly coupled with the number of brokers

Scaling Up a Cluster in KaaS
Using OSS Cruise Control: topic partition counts are independent of the number of brokers
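
To illustrate why rebalancing decouples partition counts from broker counts, here is a toy sketch that adds a broker and moves existing partitions onto it instead of changing the number of partitions. It is not Cruise Control's algorithm, which optimizes against multiple goals such as disk, network, and replica distribution; the `rebalance` function is hypothetical and treats each partition as a single replica.

```python
def rebalance(
    assignment: dict[int, list[str]], new_brokers: list[int]
) -> list[tuple[str, int, int]]:
    """Add empty brokers, then move partitions until counts differ by at most 1.

    Mutates `assignment` and returns the moves as (partition, from, to).
    """
    for broker in new_brokers:
        assignment.setdefault(broker, [])
    moves = []
    while True:
        busiest = max(assignment, key=lambda b: len(assignment[b]))
        idlest = min(assignment, key=lambda b: len(assignment[b]))
        if len(assignment[busiest]) - len(assignment[idlest]) <= 1:
            return moves
        partition = assignment[busiest].pop()
        assignment[idlest].append(partition)
        moves.append((partition, busiest, idlest))


# Eight partitions on two brokers; scale up to three brokers.
cluster = {
    0: ["t-0", "t-1", "t-2", "t-3"],
    1: ["t-4", "t-5", "t-6", "t-7"],
}
moves = rebalance(cluster, new_brokers=[2])
```

The topic still has eight partitions after the scale-up, even though eight is not a multiple of three brokers; only two partitions had to move.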

2. Making Cluster Upgrades Faster

Unit of Change: AWS EC2 instance

Kafka Fleet Upgrades
                   Upgrade Time (old strategy)   Desired Upgrade Time   Upgrade Frequency
Hardware Upgrade   3+ months                     < 1 month              annually
Software Upgrade   3+ months                     < 1 week               monthly

Software Upgrade Strategy #1
Leverage Amazon Elastic Block Store (EBS)
Source: https://aws.amazon.com/ebs/

Move Kafka state from local instance storage to EBS

EBS is awesome, but...
● EBS is expensive at large scale
○ Moved large-scale clusters back to AWS instance types with local disk
● Back to where we started → longer upgrade times ☹

How can we upgrade faster without EBS?

AWS Replace Root Volume
Source: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/replace-root.html

AWS Replace Root Volume to upgrade AMI

Kafka Fleet Upgrades with Replace Root Volume Strategy
                   Current Upgrade Time   Desired Upgrade Time   Upgrade Frequency
Hardware Upgrade   1+ month               < 1 month              annually
Software Upgrade   5 days                 < 1 week               monthly

Right Sizing a Kafka Cluster
Inputs: Num Consumers, Throughput, Replication Factor, Retention
● Which EC2 instance type?
● How many instances?
● How much disk?

Right Sizing a Kafka Cluster
Num Consumers, Throughput, Replication Factor, Retention → Kafka Capacity Model
Num Brokers   Instance Type   Cost
3             i3en.2xl        $
3             i4i.2xl         $$
6             r5.4xl + EBS    $$$
https://github.com/Netflix-Skunkworks/service-capacity-modeling/blob/main/service_capacity_modeling/models/org/netflix/kafka.py
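
As a rough illustration of how the four inputs drive sizing, here is a back-of-the-envelope sketch. It is not the linked Netflix capacity model: the function and parameter names are hypothetical, and it ignores headroom, compression, CPU, and page cache, all of which a real model would need to consider.

```python
import math


def estimate_cluster(
    write_mib_per_s: float,
    num_consumers: int,
    replication_factor: int,
    retention_hours: float,
    broker_disk_gib: float,
    broker_net_mib_per_s: float,
) -> dict[str, float]:
    """Rough sizing from the four inputs on the slide (no headroom applied)."""
    # Every written byte is stored once per replica for the retention window.
    disk_gib = write_mib_per_s * 3600 * retention_hours * replication_factor / 1024
    # Leaders serve each consumer plus (RF - 1) follower fetches.
    egress = write_mib_per_s * (num_consumers + replication_factor - 1)
    # Every replica receives the full write stream.
    ingress = write_mib_per_s * replication_factor
    brokers = max(
        math.ceil(disk_gib / broker_disk_gib),
        math.ceil(max(egress, ingress) / broker_net_mib_per_s),
        replication_factor,  # need at least one broker per replica
    )
    return {"disk_gib": disk_gib, "egress_mib_per_s": egress, "brokers": brokers}


# Hypothetical workload: 100 MiB/s in, 3 consumers, RF 3, 8 hours of retention.
result = estimate_cluster(100, 3, 3, 8, broker_disk_gib=5000, broker_net_mib_per_s=1000)
```

Running the same inputs against several candidate instance types (different disk and network limits) and comparing the resulting broker counts is one way to arrive at a cost-ranked table like the one on the slide.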


Key Takeaway
Composable architectures are easier to scale and evolve with the business
[Diagram: Closed System → Composable System; building blocks shown: Pipeline Abstraction, Kafka as a Service, Stream Processing]

Q & A

References
● S3 Flash Bootloader (precursor to AWS Replace Root Volume)
● Joey’s talk on “Capacity Plan optimally in the cloud”
● Kyle and JS’s talk on “Iterating faster on Stateful Services in the cloud”
