Self-hosting Kafka at Scale
Netflix’s Journey & Challenges
Piyush Goyal, Staff Engineer, Data Platform
Nick Mahilani, Staff Engineer, Data Platform
Current 2024
Thank you for being here!
RAISE YOUR HAND IF YOU USE KAFKA IN YOUR ORGANIZATION
KEEP YOUR HAND UP IF YOU ARE SELF-HOSTING APACHE KAFKA
(NOT using a Kafka service provider)
WHAT CAN YOU EXPECT FROM THIS SESSION?
● How Netflix leverages Kafka to unlock various use cases
● Our long journey with Kafka
● How we operate Kafka today
● Challenges and learnings
Our Journey With Kafka
● Business Context
● Keystone Platform (2015)
● Evolution to Composable Architecture
● Kafka as a Service (2021)
● KaaS Features and Architecture
● KaaS Learnings
Netflix Scale
Devices: >1,000,000,000
Countries: >190
Members: >278,000,000
* August 2024
Microservices Ecosystem
● Systems at our scale generate a lot of data
● This data needs to be transported to where it can be processed and analysed
Centralized Event Pipeline (2015)
The System should have the following characteristics:
● Easy to use
● Highly Available
● Scalable
● Near Real-Time
This gave rise to Netflix’s Keystone Platform in 2015
Keystone Platform (2015)
● Highly abstracted product
○ Data Movement to Sinks
○ Simple Real-time processing (Filter, Projection)
● Client Library, UI, Management plane, and Data Plane
● Used Apache Kafka and Apache Flink under the hood
Keystone - User Interface
Keystone Platform
[Diagram: Event Producers → Fronting Kafka → Router/Processor → Consumer Kafka → Stream Consumers, alongside Keystone Management]
Event Producers publish events with the Keystone client library (Kafka-agnostic)
Two types of Kafka Clusters
FRONTING
● Multi-tenant clusters
● Used to publish data
● Abstracted from producers
● Controlled Cluster access
● Critical for High availability
● Larger Fleet
CONSUMER
● Multi-tenant clusters
● Used to consume data
● Coupled with consumers
● Smaller Fleet
Resilience to cluster failure
Keystone Client topic lookup (normal operation):
Stream            Cluster     Topic
playback_events   Cluster A   playback_events
ad_events         Cluster B   ad_events
Fronting clusters: Cluster A hosts Topic: playback_events; Cluster B hosts Topic: ad_events

Keystone Client topic lookup (⚠ on cluster failure):
Stream            Cluster                Topic
playback_events   Cluster A, Cluster B   playback_events
ad_events         Cluster B              ad_events
Fronting clusters: Topic: playback_events now also exists on Cluster B, alongside Topic: ad_events
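The topic lookup and failover shown on this slide can be sketched roughly as below. This is only an illustration, not the Keystone client: `TopicRoute`, `ROUTES`, and `resolve_cluster` are hypothetical names, and the rule "pick the first healthy cluster in the list" is an assumption about how a client might use the lookup table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicRoute:
    """One row of the (hypothetical) topic lookup table."""
    topic: str
    clusters: tuple  # candidate fronting clusters, in preference order


# Lookup table mirroring the slide once playback_events is also mapped to Cluster B.
ROUTES = {
    "playback_events": TopicRoute("playback_events", ("Cluster A", "Cluster B")),
    "ad_events": TopicRoute("ad_events", ("Cluster B",)),
}


def resolve_cluster(stream, healthy, routes=ROUTES):
    """Return (cluster, topic) for the first healthy cluster serving `stream`."""
    route = routes[stream]
    for cluster in route.clusters:
        if cluster in healthy:
            return cluster, route.topic
    raise RuntimeError(f"no healthy cluster for stream {stream!r}")
```

Because producers only know the stream name, the mapping can change underneath them without any producer-side code change, which is the point of keeping the client library Kafka-agnostic.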
👍 Things worked well..
● Highly abstracted and easy to use product
● Only takes a couple of minutes to create simple data pipelines
● Huge adoption - more than 6000 data pipelines
● >100M messages per second (>150 GB/s)
● Quick real-time transformations like filtering and projection
Not everything worked well 😑
● Highly inefficient for streaming-only consumers
○ Unnecessary hops
○ Higher latency
○ Extra Cost
● Noisy neighbors in a multi-tenanted environment
● No direct access to Kafka for producers
● Administration of Kafka was semi-automated
And we needed more..
● Highly abstracted product means limited functionality done well
● Solved 80% of use cases; what about the rest?
● New Business Requirements demanded more functionality
○ Event Driven Architecture
○ Change Data Capture
○ Low latency use-cases
○ Custom Stream Processing
○ Direct Kafka integration for Third party tools
Architecture Evolution
Closed System: Pipeline Abstraction
Composable System: Pipeline Abstraction, Kafka as a Service, Stream Processing
Whether to build or buy?
● We evaluated the tradeoffs for our situation (Year 2020-21)
○ Customizability
○ Long term costs
○ Available in-house expertise
○ Minimize Risks
After careful consideration, we decided to BUILD our own managed Kafka Platform. YMMV!
Kafka as a Service (KaaS)
● Provisioning
● Observability
● Alerting & Auto Remediation
● Security & Access Control
● Client Library
● Schema Management
Provisioning Kafka Clusters
SHARED v/s DEDICATED
Kafka Cluster Configuration
● High-availability
○ Replication factor = 2
○ Min insync replicas = 1
○ Unclean leader election enabled
● Strong Consistency
○ Replication factor = 3
○ Min insync replicas = 2
○ Unclean leader election disabled
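As a rough illustration, the two profiles above map onto standard Kafka settings as follows. The profile names, the `PROFILES` dict, and the helper are hypothetical; only the setting names `min.insync.replicas` and `unclean.leader.election.enable` are real Kafka configs, and the replication factor is normally supplied at topic creation (or via the broker default `default.replication.factor`).

```python
# Hypothetical profile table; values are taken from the slide.
PROFILES = {
    "high_availability": {
        "replication.factor": 2,  # chosen at topic creation time
        "min.insync.replicas": 1,
        "unclean.leader.election.enable": True,
    },
    "strong_consistency": {
        "replication.factor": 3,  # chosen at topic creation time
        "min.insync.replicas": 2,
        "unclean.leader.election.enable": False,
    },
}


def tolerated_replica_losses(profile):
    """Replicas that can be offline while acks=all writes are still accepted."""
    return profile["replication.factor"] - profile["min.insync.replicas"]
```

Both profiles keep accepting `acks=all` writes with one replica down. They differ in what happens beyond that: with unclean leader election enabled, an out-of-sync replica may be elected leader, which keeps the partition available at the risk of losing records; with it disabled, the partition stays offline until an in-sync replica returns.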
Access Control
Audit Log
Admin Operations
KaaS Architecture
KaaS Scale
190 million messages / second
150+ GB ingested / second
8+ PB persisted state
475+ dedicated Kafka Clusters
11,500 Kafka brokers
35,000 Kafka topics
1. Scaling a single Kafka Cluster
Scaling Up a Cluster before KaaS
Topic partition counts were tightly coupled with number of brokers
Scaling Up a Cluster in KaaS
Using OSS Cruise Control: topic partition counts are independent of the number of brokers
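Decoupling partition counts from broker counts means that scaling up is a matter of moving existing partition replicas onto new brokers. Cruise Control does this with its own goal-based optimizer; the sketch below is not that algorithm. It is a naive greedy illustration with hypothetical names, and it balances only replica counts, ignoring disk, network, and leadership.

```python
def rebalance(assignment, new_brokers):
    """Greedily even out replica counts after adding brokers.

    `assignment` maps broker id -> set of partition names hosted there.
    Returns (moves, state), where moves is a list of
    (partition, source_broker, destination_broker) tuples.
    """
    state = {broker: set(parts) for broker, parts in assignment.items()}
    for broker in new_brokers:
        state.setdefault(broker, set())

    moves = []
    while True:
        src = max(state, key=lambda b: len(state[b]))
        dst = min(state, key=lambda b: len(state[b]))
        if len(state[src]) - len(state[dst]) <= 1:
            return moves, state
        # src holds more partitions than dst, so this difference is never empty;
        # it also avoids placing two replicas of one partition on the same broker.
        partition = sorted(state[src] - state[dst])[0]
        state[src].remove(partition)
        state[dst].add(partition)
        moves.append((partition, src, dst))
```

For example, two brokers with three partitions each plus one new empty broker end up with two partitions per broker after two moves, without changing the number of partitions on any topic.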
2. Making Cluster Upgrades Faster
Upgrade time v/s State Size
Kafka Broker Instance
Unit of Change: AWS EC2 instance
Kafka Fleet Upgrades
                   Upgrade Time (old strategy)   Desired Upgrade Time   Upgrade Frequency
Hardware Upgrade   3+ months                     < 1 month              annually
Software Upgrade   3+ months                     < 1 week               monthly
Software Upgrade Strategy #1
Leverage Amazon Elastic Block Store (EBS)
Source: https://aws.amazon.com/ebs/
Move Kafka state from local instance storage to EBS
EBS is awesome but ..
● EBS is expensive at large scale
○ Moved large scale clusters back to AWS instance types with local disk
● Back to where we started → longer upgrade times ☹
How can we upgrade faster without EBS?
Software Upgrade Strategy #2
AWS Replace Root Volume to upgrade AMI
Source: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/replace-root.html
Kafka Fleet Upgrades with Replace Root Volume Strategy
                   Current Upgrade Time   Desired Upgrade Time   Upgrade Frequency
Hardware Upgrade   1+ month               < 1 month              annually
Software Upgrade   5 days                 < 1 week               monthly
3. Cost Efficiency
Right Sizing a Kafka Cluster
Inputs: Num Consumers, Throughput, Replication Factor, Retention
● Which EC2 instance type?
● How many instances?
● How much disk?
Kafka Capacity Model
Num Brokers   Instance Type   Cost
3             i3en.2xl        $
3             i4i.2xl         $$
6             r5.4xl + EBS    $$$
https://github.com/Netflix-Skunkworks/service-capacity-modeling/blob/main/service_capacity_modeling/models/org/netflix/kafka.py
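To make the inputs and outputs concrete, here is a back-of-the-envelope sizing sketch. It is not the linked Netflix capacity model and does not use its API; `size_cluster`, its parameters, the `min_brokers=3` floor, and the per-broker limits passed in are all hypothetical, and it ignores CPU, headroom, and cost.

```python
import math


def size_cluster(write_mbps, retention_hours, replication_factor, num_consumers,
                 broker_disk_gb, broker_net_mbps, min_brokers=3):
    """Estimate a broker count from disk and network needs (rough sketch)."""
    # Retained data: every written byte is stored once per replica.
    disk_gb = write_mbps * 3600 * retention_hours * replication_factor / 1000
    # Outbound traffic: one copy per consumer plus follower replication.
    egress_mbps = write_mbps * (num_consumers + replication_factor - 1)

    brokers_for_disk = math.ceil(disk_gb / broker_disk_gb)
    brokers_for_net = math.ceil(egress_mbps / broker_net_mbps)
    return max(min_brokers, brokers_for_disk, brokers_for_net)
```

With 100 MB/s of writes, 24 hours of retention, replication factor 3, and two consumers, assuming 5,000 GB of disk and 1,000 MB/s of network per broker, retained state is about 25,920 GB, so disk is the binding constraint and the sketch returns 6 brokers.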
Key Takeaway
Composable architectures are easier to scale and evolve with the business
Closed System: Pipeline Abstraction
Composable System: Pipeline Abstraction, Kafka as a Service, Stream Processing
Q & A
References
● S3 Flash Bootloader (precursor to AWS Replace Root Volume)
● Joey’s talk on “Capacity Plan optimally in the cloud”
● Kyle and JS talk on “Iterating faster on Stateful Services in the cloud”
