2. Siphon Usage
• 3.9 million events per second ingress (average)
• 800 TB ingested per day
• 1,700 production Kafka brokers
• 10-second 99th-percentile latency
Key customer scenarios:
• Ads Monetization (Fast BI)
• O365 Customer Fabric NRT – Tenant & User insights
• BingNRT Operational Intelligence
• Presto (Fast SML) interactive analysis
• Delve Analytics
[Chart: Siphon Data Volume (Ingress and Egress) – throughput in GBps; series: volume published, volume subscribed, total volume.]
[Chart: Siphon Events per second (Ingress and Egress) – throughput in millions of events per second; series: EPS in, EPS out, total EPS.]
3. Siphon Architecture
[Diagram: Siphon architecture. In each datacenter (Asia DC, Europe DC, US DC), Kafka runs alongside ZooKeeper and a canary. Data enters through a collector and agent via services data push, services data pull (agent), and device proxy services; a consumer API (push/pull) serves streaming, batch, and audit-trail consumers. Components of Siphon are marked as either open source or Microsoft-internal.]
5. The Problem - Unbalanced Disk and Machine Usage
[Diagram: three machines (Machine 1, Machine 2, Machine 3), each with Disk 1 and Disk 2; some partitions have intense IO, a new disk is added to a machine, and a new machine is added to the cluster.]
6. [Diagram: partitions T1-P1, T2-P2, T2-P3, T1-P2, and T2-P1 crowded onto Disk 1 and Disk 2, while newly added Disk 3 sits empty.]
* Disk 3 is a newly added disk, and we could move some partitions there.
* If topic T1 is a topic with higher IO demands, prefer to put T1-P1 and T1-P2 on separate drives.
Closer look: Imbalance inside a Broker
7. Assuming:
1) All disks are similar
2) All partitions of a topic are equally IO demanding
Solution
Just before broker start:
{
Get the alphabetically sorted list of all local Kafka directories
Assign them to the list of disks in a round-robin fashion
}
Ensures:
1) Partitions of the same topic go to different drives
2) All disks get an equal share of directories
* This method doesn't cover all scenarios, but it does a decent job; two heavy-throughput partitions could still end up on the same disk. (A sketch of this placement logic follows this slide.)
Closer look: Imbalance inside a Broker
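Below is a minimal Java sketch of this round-robin placement, assuming the lists of local Kafka partition directories and data disks have already been discovered; the class and method names are illustrative, not the actual broker code.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class RoundRobinPlacement {
    // Assigns each Kafka partition directory to a disk, cycling through the
    // disks. Alphabetical sorting groups a topic's partition directories
    // together (e.g. topic1-0, topic1-1), so the round robin naturally
    // spreads partitions of the same topic across different drives.
    static Map<String, String> assign(List<String> kafkaDirs, List<String> disks) {
        List<String> sorted = new ArrayList<>(kafkaDirs);
        Collections.sort(sorted);
        Map<String, String> placement = new HashMap<>();
        for (int i = 0; i < sorted.size(); i++) {
            placement.put(sorted.get(i), disks.get(i % disks.size()));
        }
        return placement;
    }
}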
8. [Diagram: after rebalancing, Disk 1 holds T1-P1 and T2-P2, Disk 2 holds T2-P3 and T1-P2, and Disk 3 holds T2-P1.]
* Disk 3 is the newly added disk, and some partitions have been moved there.
* Since topic T1 may be a topic with higher IO demands, T1-P1 and T1-P2 are kept on separate drives.
Closer look: Imbalance inside a Broker (solved)
10. • The static approach usually rests on the following wrong assumptions:
Heterogeneous topics' partitions will show homogeneous throughput characteristics.
The number of partitions for a topic is under the control of the cluster manager (operations team).
All machines in the cluster have the same configuration.
Closer look: Imbalance across machines
11. [Diagram: partitions T1-P1, T1-P2, T2-P1, T2-P2, and T3-P1 spread across Broker-1, Broker-2, and Broker-3, with roughly equal partition counts per broker.]
* Equal partition counts across machines aren't enough to achieve fair load.
* Perhaps T3-P1 is very IO intensive.
* Perhaps Broker-1 has low-end hardware.
Closer look: Imbalance across machines
12. • The dynamic approach:
Since a statically defined approach didn't work, we need a dynamic one.
Next, we will discuss the dynamic approach, which I call the "Adoption Marketplace".
Closer look: Imbalance across machines
14. An Adoption Marketplace
[Diagram: Broker-1, Broker-2, and Broker-3 share a first-come-first-serve board of adoption ads; a sample ad reads "Item: Topic1-Partition1, Requires: 2 MBPS at peak".]
15. An Adoption Marketplace
• The (POC) logic/tool runs on each broker independently.
• The logic is completely distributed but needs coordination; ZooKeeper is leveraged here.
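A minimal sketch of how an ad might be posted through ZooKeeper, assuming a /adoption-ads parent znode and the JSON format shown on the next slide; the path layout and serialization are assumptions for illustration, not the POC's actual wire format.

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class AdBoard {
    private final ZooKeeper zk;

    public AdBoard(ZooKeeper zk) {
        this.zk = zk;
    }

    // Posts an ad as an ephemeral znode: if the advertising broker dies,
    // the ad disappears with its session, so stale ads never linger.
    public void post(String partitionName, String adJson) throws Exception {
        zk.create("/adoption-ads/" + partitionName,
                adJson.getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.EPHEMERAL);
    }
}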
16. An Adoption Marketplace
• Advertisement format:
{
  Version: 1.0,
  Item: TopicA-PartitionB,
  ResourcesRequired: [
    {ResourceName: "X", ResourceQty: X1},
    {ResourceName: "Y", ResourceQty: Y1},
    …
  ]
}
Example (advertise to give away Topic1's partition 1):
{
  Version: 1.0,
  Item: "Topic1-1",
  ResourcesRequired: [
    {ResourceName: "PeakMBPS", ResourceQty: 2}
  ]
}
Versioning allows future enhancements to take place seamlessly.
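For illustration only, the ad could be modeled as a small Java value type whose field names mirror the JSON keys above; the hand-rolled serialization is a sketch (a real implementation would likely use a JSON library).

import java.util.List;

public record AdoptionAd(String version, String item, List<Resource> resourcesRequired) {
    public record Resource(String resourceName, double resourceQty) {}

    // Serializes the ad into the format shown above.
    public String toJson() {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"Version\":\"").append(version)
          .append("\",\"Item\":\"").append(item)
          .append("\",\"ResourcesRequired\":[");
        for (int i = 0; i < resourcesRequired.size(); i++) {
            Resource r = resourcesRequired.get(i);
            if (i > 0) sb.append(',');
            sb.append("{\"ResourceName\":\"").append(r.resourceName())
              .append("\",\"ResourceQty\":").append(r.resourceQty()).append('}');
        }
        return sb.append("]}").toString();
    }
}

// Example: new AdoptionAd("1.0", "Topic1-1",
//         List.of(new AdoptionAd.Resource("PeakMBPS", 2))).toJson()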
17. An Adoption Marketplace
• The Interface
public interface IAdoptionLogic {
    // Individual developers need to implement/perfect this based on their needs
    Partition findLocalPartitionToGiveOut() throws Exception;

    // Individual developers need to implement/perfect this based on their needs
    Partition findRemotePartitionToTakeIn() throws Exception;

    // This method will be a ZooKeeper-backed implementation out of the box
    void advertisePartitionForAdoption(Partition partitionToGiveOut) throws Exception;

    // This method will be a ZooKeeper-backed implementation out of the box
    void adoptRemotePartition(Partition partitionToTakeIn) throws Exception;
}
18. An Adoption Marketplace
• LOOP START:
Find any local partition to be given away
Post the ad and wait for its adoption
Find any ads from others for adoption and check your eligibility
Lock an ad and adopt it if you are eligible
GOTO LOOP START
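A hedged sketch of this loop as a driver over the IAdoptionLogic interface from the previous slide (assumed to live in the same package as Partition); the polling interval, the null-means-nothing-found convention, and the error handling are assumptions, not part of the deck.

public class MarketplaceLoop implements Runnable {
    private final IAdoptionLogic logic;

    public MarketplaceLoop(IAdoptionLogic logic) {
        this.logic = logic;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                // Find any local partition to be given away and post the ad.
                Partition toGive = logic.findLocalPartitionToGiveOut();
                if (toGive != null) {
                    // (The POC then waits for the ad's adoption; this
                    // sketch simply moves on to the next step.)
                    logic.advertisePartitionForAdoption(toGive);
                }
                // Find any ads from others and adopt one if eligible.
                Partition toTake = logic.findRemotePartitionToTakeIn();
                if (toTake != null) {
                    logic.adoptRemotePartition(toTake);
                }
                Thread.sleep(60_000); // pause before the next round (assumed)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (Exception e) {
                // Log and keep looping (assumed policy).
            }
        }
    }
}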
19. Proof of Concept Experiment – The logic
The logic to put up / take in partitions for adoption was kept straightforward (see the sketch after this list):
1) findLocalPartitionToGiveOut()
If a broker sees that it has more partitions of a topic than Ceil((Partitions * Replication-factor) / total brokers in the cluster), it puts a partition of that topic up for adoption.
2) adoptRemotePartition()
If a broker has fewer partitions of a topic than Ceil((Partitions * Replication-factor) / total brokers in the cluster) and it sees an advertisement for such a partition, it tries to adopt it.
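The fair-share threshold both methods compare against fits in a few lines; a minimal sketch, assuming the partition, replica, and broker counts are read from cluster metadata elsewhere (the class and method names are illustrative).

public final class FairShare {
    // Ceil((partitions * replicationFactor) / totalBrokers)
    static int perBroker(int partitions, int replicationFactor, int totalBrokers) {
        return (int) Math.ceil((double) (partitions * replicationFactor) / totalBrokers);
    }

    // Give a partition away when holding more than the fair share...
    static boolean shouldGiveOut(int localCount, int fairShare) {
        return localCount > fairShare;
    }

    // ...and adopt an advertised partition when holding fewer.
    static boolean shouldTakeIn(int localCount, int fairShare) {
        return localCount < fairShare;
    }
}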
22. An Adoption Marketplace
Benefits:
1) The Kafka reassignment command effectively runs continuously throughout the cluster.
2) Older partitions can spread to new machines without any manual operations.
3) The advertisements can be monitored; if there are no adopters, it is time to add more machines.
4) The logic that determines which partitions to put up for, or take in for, adoption can be constantly improved, and will depend on one's environment and use case.
23. Proof of Concept Experiment – Code
GitHub:
https://github.com/Microsoft/Cluster-Partition-Rebalancer-For-Kafka