Balance Kafka Cluster with Zero Data Movement
Yaodong Yang (Apple), Haochen Li (Apple)
Yaodong Yang, Apple Inc.
Haochen Li, Apple Inc.
May 2023
NOT A CONTRIBUTION
Kafka Cluster Load Balancing
• Benefits
• High Performance
• Cost Efficiency
• Determining Factors
• Kafka Partition Placement
• Kafka Partition Access Pattern
• Challenges
• Kafka Partitions are Heterogeneous
• Storage Retention Requirement
• Produce & Consume Traffic Pattern
Current Solution
• Continuously rebalance Kafka cluster based on Load Metrics
• collect the load metrics from Kafka
• generate the cluster load model
• compute the optimization proposal
• execute the proposal
• Overhead
• data movement between different brokers
• negative impact on producers and consumers
• long time to finish (hours or even days)
• infra cost
Data Ingestion Use Case
• Workload Pattern
• Data events are randomly assigned to partitions of the Kafka topic
• All partitions of one topic are consumed evenly
• Kafka producers and consumers don’t have strict requirements for Kafka Partition Count
• Kafka Partitions from the same topic
• Same data volumes produced, consumed and retained
Kafka Partition Replica Placement
• Partition Replica Placement Strategy
• Partition Count
• scale_number: Number of Leader Replicas per broker for a topic
• partition_count = scale_number * broker_count
• Partition Replica Placement
• For every Kafka Topic, the number of Replicas in each broker should be the same.
• For every Kafka Topic, the number of Leader Replicas in each broker should be the same.
• Same load on individual Kafka Brokers.
• Same hardware utilization on individual Kafka brokers
• CPU
• Storage Volume
• Network
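The placement rules above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the ring-walk placement, and the example broker IDs are assumptions chosen only to satisfy the two stated constraints (equal replicas and equal leader replicas per broker, with partition_count = scale_number * broker_count).

```python
from collections import Counter


def balanced_assignment(broker_ids, scale_number, replication_factor):
    """Return {partition: [leader, follower, ...]} for one topic.

    Hypothetical helper: replicas are laid out by walking the broker list
    as a ring, which is one simple way to meet the constraints on the slide.
    """
    n = len(broker_ids)
    if not 1 <= replication_factor <= n:
        raise ValueError("replication_factor must be between 1 and the broker count")
    partition_count = scale_number * n  # formula from the slide
    assignment = {}
    for p in range(partition_count):
        # The first broker in each list is treated as the preferred leader.
        assignment[p] = [broker_ids[(p + k) % n] for k in range(replication_factor)]
    return assignment


def per_broker_counts(assignment):
    """Count leader replicas and total replicas hosted by each broker."""
    leaders = Counter(replica_list[0] for replica_list in assignment.values())
    total = Counter(b for replica_list in assignment.values() for b in replica_list)
    return leaders, total


# Example values are illustrative: 3 brokers, scale_number 2, replication factor 2.
assignment = balanced_assignment([1, 2, 3], scale_number=2, replication_factor=2)
leaders, total = per_broker_counts(assignment)
print(assignment)
print("leaders per broker:", dict(leaders))
print("replicas per broker:", dict(total))
```

With this layout every broker leads scale_number partitions and hosts scale_number * replication_factor replicas of the topic, so topics with homogeneous partitions load every broker equally.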
Scenarios
• New Topic Creation
• Generate the Replica Assignment for the new topic
• Create the topic in the Kafka cluster with the above Replica Assignment
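As a rough illustration of the second step, a generated assignment can be rendered in the form accepted by the `--replica-assignment` option of `kafka-topics.sh` (partitions separated by commas, replicas within a partition by colons, leader first). The talk describes a Topic Operator rather than the CLI, so this is only a sketch; the helper name, topic name, and bootstrap address are hypothetical.

```python
def to_replica_assignment_arg(assignment):
    """Render {partition: [broker, ...]} as 'b:b,b:b,...' in partition order."""
    partitions = sorted(assignment)
    if partitions != list(range(len(partitions))):
        raise ValueError("partitions must be numbered 0..N-1 without gaps")
    return ",".join(
        ":".join(str(broker) for broker in assignment[p]) for p in partitions
    )


# Illustrative assignment for 3 brokers, scale_number 1, replication factor 2.
assignment = {0: [1, 2], 1: [2, 3], 2: [3, 1]}
arg = to_replica_assignment_arg(assignment)

# Topic name and bootstrap address are placeholders, not taken from the talk.
print(
    "kafka-topics.sh --bootstrap-server localhost:9092 "
    f"--create --topic events --replica-assignment {arg}"
)
```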
Scenarios
• Increase Partition Count: scale_number increase
• Generate the Replica Assignment for the new partitions
• Create partitions in the Kafka cluster with the above Replica Assignment
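A minimal sketch of the scale_number increase, assuming the existing partitions were laid out with a ring walk over the broker list (an assumption, not something the slides specify). Only the new partitions receive an assignment, so existing replicas stay where they are and no data moves. The function name and example values are hypothetical.

```python
from collections import Counter


def new_partitions_for_scale_up(broker_ids, old_scale, new_scale, replication_factor):
    """Return assignments for the partitions added when scale_number grows."""
    n = len(broker_ids)
    if new_scale <= old_scale:
        raise ValueError("new_scale must be larger than old_scale")
    if not 1 <= replication_factor <= n:
        raise ValueError("replication_factor must be between 1 and the broker count")
    first_new = old_scale * n  # existing partitions are 0 .. first_new - 1
    return {
        p: [broker_ids[(p + k) % n] for k in range(replication_factor)]
        for p in range(first_new, new_scale * n)
    }


# Illustrative values: 3 brokers, scale_number raised from 1 to 2.
added = new_partitions_for_scale_up([1, 2, 3], old_scale=1, new_scale=2,
                                    replication_factor=2)
print(added)
print("extra leaders per broker:",
      dict(Counter(replica_list[0] for replica_list in added.values())))
```

Because the number of added partitions is a multiple of the broker count, every broker gains the same number of leaders and replicas, and the topic stays balanced.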
Scenarios
• Add more brokers
• Generate the Replica Assignment for partitions in new brokers
• Create partitions in the Kafka cluster with the above Replica Assignment
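One way to read this scenario: scale_number stays fixed, so the topic grows by scale_number partitions per added broker, and the new partitions are placed on the new brokers. The sketch below assumes that all replicas of the new partitions live only on the new brokers, which requires at least as many new brokers as the replication factor; the slides do not state this detail, and the function name and example values are hypothetical.

```python
from collections import Counter


def partitions_for_new_brokers(existing_partition_count, new_broker_ids,
                               scale_number, replication_factor):
    """Return assignments for partitions created on newly added brokers."""
    m = len(new_broker_ids)
    if not 1 <= replication_factor <= m:
        raise ValueError("need at least replication_factor new brokers")
    added = {}
    for i in range(scale_number * m):
        # Ring walk restricted to the new brokers; first entry is the leader.
        added[existing_partition_count + i] = [
            new_broker_ids[(i + k) % m] for k in range(replication_factor)
        ]
    return added


# Illustrative values: a topic with 6 partitions on brokers 1-3 (scale_number 2),
# extended onto new brokers 4 and 5.
added = partitions_for_new_brokers(6, [4, 5], scale_number=2, replication_factor=2)
print(added)
print("leaders per new broker:",
      dict(Counter(replica_list[0] for replica_list in added.values())))
```

Each new broker ends up with scale_number leaders and scale_number * replication_factor replicas of the topic, matching the existing brokers, while existing partitions are left untouched.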
Scenarios
• Ingestion traffic volume and retention changes
• no impact on the load balance of Kafka Cluster
• Remove some brokers
• data movement is unavoidable
• avoid it with cluster migration if possible
• Cluster Migration & Merge
• rebalance the cluster:
• partition reassignment
• scale_number increase
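The claim on this slide that traffic volume and retention changes have no impact on load balance follows from the placement rules. A short derivation, using notation introduced here for illustration only (B brokers; for topic t, scale_number s_t, replication factor R_t, retained bytes per partition d_t, and produce rate per partition w_t):

```latex
\begin{aligned}
\text{replicas of topic } t \text{ on broker } b &= \frac{s_t \, B \, R_t}{B} = s_t R_t \\
\text{storage}(b) &= \sum_{t} s_t R_t \, d_t \\
\text{leader produce rate}(b) &= \sum_{t} s_t \, w_t
\end{aligned}
```

Neither sum depends on b, so when d_t or w_t changes for a topic, every broker's load changes by the same amount and the cluster stays balanced. This relies on the earlier assumption that partitions of the same topic carry the same data volume.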
Implementation
• Current
• Implemented as a Topic Operator
• Deployed in production
• Plan
• Open a KIP in Apache Kafka Project
• Contribute back to upstream
Take Away
• Partition Placement Strategy can greatly improve the Load Balance of Kafka Clusters
Thank you!
