Cloud providers like AWS allow free data transfers within an Availability Zone (AZ) but bill users when data moves between AZs. When the data volume streamed through Kafka reaches big data scale (e.g., numeric data points or user activity tracking), the cost of cross-AZ traffic can add significantly to your monthly cloud spend. Since Kafka serves reads and writes only from leader partitions, for a topic with a replication factor of 3 a message sent through Kafka can cross AZs up to 4 times: once when a producer produces the message onto a broker in a different AZ, twice during Kafka replication, and once more during message consumption. With careful design, we can eliminate the first and last parts of the cross-AZ traffic. We can also use the message compression strategies provided by Kafka to reduce costs during replication. In this talk, we will discuss the architectural choices that ensure a Kafka message is produced and consumed within a single AZ, as well as an algorithm that lets consumers intelligently subscribe to partitions whose leaders are in the same AZ. We will also cover use cases in which cross-AZ message streaming is unavoidable due to design limitations. Talk outline: 1) A review of Kafka replication, 2) Cross-AZ traffic implications, 3) Architectural choices for AZ-aware message streaming, 4) Algorithms for AZ-aware producers and consumers, 5) Results, 6) Limitations, 7) Takeaways.
Achieving a 50% Reduction in Cross-AZ Network Costs from Kafka (Uday Sagar Shiramshetty, SignalFx)
1. Achieving a 50% Reduction in Cross-AZ Network Costs from Kafka
Uday Sagar Shiramshetty, SignalFx
2. About Me:
● 4+ years of experience in Monitoring and Distributed Systems
● Site Reliability Engineer
● Distributed Systems Engineer
● Currently building an in-memory database to search metadata on billions of time series
3. About SignalFx:
● Real-Time Cloud Monitoring Platform for Infrastructure, Microservices and Applications
● 20 Kafka clusters in production
● 400 billion messages per day on the largest cluster
● Officially part of Splunk as of today
4. SignalFx is the Only Real-Time Observability Platform
[Architecture diagram of the SignalFx patented streaming architecture, spanning collection (Smart Agent, auto-instrumentation, code instrumentation, telemetry adapter, cloud API integrations, custom metrics, distributed traces), pre-processing (Smart Gateway™: metricization, metadata extraction, SFx Quantizer™), streaming (metadata router, time series router, dynamic lag adjustment, roll-ups at 1sec/1min/5min/1hr), retention (MetaStore, MetricsStore, TraceStore), and analytics capabilities (high-res visualization, alerting, automation, root cause analysis, debugging, service mapping)]
5. Goal and Agenda
Goal: Reduce data transfer across availability zones to lower network costs.
Agenda:
• Motivation: AWS network pricing model
• Brief overview of AZ awareness and compression
• Deep dive on AZ awareness
• Benefits of Kafka Message Compression
• Charts showing our benefits
7.
• Kafka brokers are grouped into racks
• Racks provide fault tolerance; each rack may be in a different physical location or have its own power source
• In AWS, racks are generally mapped to Availability Zones (see the broker.rack sketch below)
• This talk uses the terms rack and Availability Zone interchangeably
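For reference, a broker declares its rack through the broker.rack setting in its server.properties; a minimal sketch, with the AZ name here as an assumed example value:

# server.properties (one entry per broker); in AWS this is usually the broker's AZ
broker.rack=us-east-1a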
17. We need our producers and consumers to talk to leader brokers in the same AZ.
18.
• The Kafka producer needs to be aware of the leader partitions in its AZ
• The Kafka client library can help us find the locations of leader partitions in each AZ
• Then, the producer has to produce messages onto those leader partitions
19. On producer initialization:

String myRack = System.getProperty("KAFKA_PRODUCER_RACK", "defaultRack");
TopicDescription topicDescription = getTopicDescription(topicName);
// Keep only the partitions whose leader broker is in this producer's rack (AZ)
List<Integer> desiredPartitions = topicDescription.partitions().stream()
    .filter(p -> myRack.equals(getLeaderRack(p)))
    .map(TopicPartitionInfo::partition)
    .collect(Collectors.toList());
…

To send a message:

int partition = getPartition(message, desiredPartitions);
ProducerRecord record = new ProducerRecord<>(topicName, partition, null, message);
producer.send(record);
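The getTopicDescription and getLeaderRack helpers above are not shown on the slide; a minimal sketch of one possible implementation using Kafka's AdminClient (the adminProps client configuration and error handling are our assumptions):

import java.util.Collections;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

// Describe the topic so we can see where each partition's leader lives.
TopicDescription getTopicDescription(String topicName) throws Exception {
  try (AdminClient admin = AdminClient.create(adminProps)) { // adminProps: assumed client config
    return admin.describeTopics(Collections.singletonList(topicName))
        .values().get(topicName).get();
  }
}

// Rack (AZ) of the broker currently leading this partition; null if the
// brokers were started without a broker.rack setting.
String getLeaderRack(TopicPartitionInfo p) {
  return p.leader() == null ? null : p.leader().rack();
}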
20.
There are a few things to consider before finalizing the desired partitions to produce messages to.
1. What if partitions are not spread across brokers properly?
2. What if we don't have any leader partitions in the same rack?
3. What happens to an already initialized desired-partitions state if the replica assignment is updated?
For 1 and 2, we default to spraying messages across all partitions. Then, we balance the partition assignment as an operational task.
For 3, we just need a callback hook to re-create the desired-partitions state after an update to the assignment; one possible approach is sketched below.
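The Kafka producer client does not expose a leadership-change callback directly, so one simple approximation is to rebuild the desired-partitions list on a timer; a sketch, assuming desiredPartitions is a volatile field rather than the local variable from the previous slide, reusing the helpers shown earlier, and with an arbitrary refresh interval:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(() -> {
  try {
    TopicDescription td = getTopicDescription(topicName);
    // Re-derive the AZ-local partitions so leader moves and reassignments are picked up.
    List<Integer> refreshed = td.partitions().stream()
        .filter(p -> myRack.equals(getLeaderRack(p)))
        .map(TopicPartitionInfo::partition)
        .collect(Collectors.toList());
    // If no leader lives in our rack, fall back to spraying across all partitions.
    desiredPartitions = refreshed.isEmpty()
        ? td.partitions().stream().map(TopicPartitionInfo::partition).collect(Collectors.toList())
        : refreshed;
  } catch (Exception e) {
    // Keep the previous list on transient metadata failures.
  }
}, 1, 1, TimeUnit.MINUTES);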
23. PartitionAssignor
The Kafka Java client library has an interface called PartitionAssignor that allows us to define our own custom partition assignment strategy.
public interface PartitionAssignor {
/**
* Return a serializable object representing the local
* member’s subscription.
*/
Subscription subscription(Set<String> topics);
/**
* Perform the group assignment given the member subscriptions
* and current cluster metadata.
*/
Map<String, Assignment> assign(Cluster metadata, Map<String, Subscription> subscriptions);
...
}
24. PartitionAssignor
Partitions can be assigned to consumers in the group based on different implementations of
PartitionAssignor. For example:
1. RangeAssignor
2. RoundRobinAssignor
3. StickyAssignor
4. RackAwareAssignor *
RangeAssignor is the default for a Kafka consumer.
RackAwareAssignor is our implementation to get an AZ-aware consumer assignment.
25. RangeAssignor
An introductory example:
Two consumers C0 and C1 and two topics t0 and t1 with 3 partitions each.
Topic t0: t0p0, t0p1, t0p2
Topic t1: t1p0, t1p1, t1p2
Assignment will be:
C0: [t0p0, t0p1, t1p0, t1p1]
C1: [t0p2, t1p2]
26. RackAwareAssignor
Two consumers C0 (R1) and C1 (R2), two topics t0 and t1 with 3 partitions each.
Topic t0: t0p0 (R1), t0p1 (R2), t0p2 (R1)
Topic t1: t1p0 (R1), t1p1 (R2), t1p2 (R2)
R1 and R2 are racks where the leader partitions are located.
Assignment will be:
C0: [t0p0, t0p2, t1p0]
C1: [t0p1, t1p1, t1p2]
27. RackAwareAssignor
RackAwareAssignor.class is supplied as the partition.assignment.strategy config value during consumer initialization.
partition.assignment.strategy: [RackAwareAssignor.class, RangeAssignor.class]
The most preferred PartitionAssignor supported by all consumer group members is used for assignment.
To update the strategy, include both the old and the new strategy in the config value, with the new strategy listed before the old. Then, once the consumers in the group have been re-initialized, the new strategy will be used.
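Wired into consumer initialization, the config looks roughly like this (the bootstrap address, group id, deserializers, and the assignor's package name are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "kafka:9092");   // placeholder
props.put("group.id", "az-aware-consumers");    // placeholder
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
// New strategy first, default RangeAssignor second, so a group with a mix of
// old and new consumers can still agree on a common assignor.
props.put("partition.assignment.strategy",
    "com.example.RackAwareAssignor,org.apache.kafka.clients.consumer.RangeAssignor");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);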
28. RackAwareAssignor
Step 1:
int maxNumOfPartitions = (int) Math.ceil(topicPartitionsCount * 1.0 / totalConsumersCount);
Step 2:
Assign partitions to consumers from the same rack, while ensuring that any consumer has only up to maxNumOfPartitions partitions assigned.
Step 3:
Assign the remaining partitions to consumers with fewer assignments (a simplified sketch of these steps follows).
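A simplified, self-contained sketch of those three steps, for illustration only: the method name, parameter shapes, and least-loaded tie-breaking are our own choices, and a real implementation still needs the PartitionAssignor Subscription/Assignment plumbing.

import java.util.*;

class RackAwareAssignmentSketch {

    // partitionLeaderRacks: partition id -> rack of its leader broker
    // consumerRacks:        consumer id  -> rack the consumer runs in
    static Map<String, List<Integer>> assignRackAware(
            Map<Integer, String> partitionLeaderRacks,
            Map<String, String> consumerRacks) {

        // Step 1: cap per-consumer load so one rack cannot absorb everything.
        int maxNumOfPartitions = (int) Math.ceil(
                partitionLeaderRacks.size() * 1.0 / consumerRacks.size());

        Map<String, List<Integer>> assignment = new HashMap<>();
        consumerRacks.keySet().forEach(c -> assignment.put(c, new ArrayList<>()));
        List<Integer> unassigned = new ArrayList<>();

        // Step 2: give each partition to a same-rack consumer that still has room,
        // preferring the one with the fewest partitions so far.
        for (Map.Entry<Integer, String> partition : partitionLeaderRacks.entrySet()) {
            Optional<String> target = consumerRacks.entrySet().stream()
                    .filter(c -> c.getValue().equals(partition.getValue()))
                    .map(Map.Entry::getKey)
                    .filter(c -> assignment.get(c).size() < maxNumOfPartitions)
                    .min(Comparator.comparingInt(c -> assignment.get(c).size()));
            if (target.isPresent()) {
                assignment.get(target.get()).add(partition.getKey());
            } else {
                unassigned.add(partition.getKey());
            }
        }

        // Step 3: spread the leftovers onto the least-loaded consumers, rack aside.
        for (Integer p : unassigned) {
            String leastLoaded = assignment.entrySet().stream()
                    .min(Comparator.comparingInt(c -> c.getValue().size()))
                    .get().getKey();
            assignment.get(leastLoaded).add(p);
        }
        return assignment;
    }
}

On the slide-26 example this reproduces the shown assignment: maxNumOfPartitions is ceil(6/2) = 3, so C0 takes the three R1-led partitions and C1 the three R2-led ones.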
32. Fixed Consumer Assignment
Related messages need to be processed on a single consumer instance, without group membership changes (even when a consumer goes down).
Each consumer instance owns a set of statically assigned tokens, and those have to be mapped to their partitions (see the sketch below).
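One way to get such a fixed mapping is to skip the consumer group protocol and assign partitions manually; a sketch, where ownedTokens and tokenToPartition(...) are hypothetical stand-ins for the static token ownership and the token-to-partition mapping:

import java.util.List;
import java.util.stream.Collectors;
import org.apache.kafka.common.TopicPartition;

// Map this instance's statically owned tokens to partitions and subscribe to
// exactly those partitions; no group membership, so no rebalancing on failures.
List<TopicPartition> owned = ownedTokens.stream()
    .map(token -> new TopicPartition(topicName, tokenToPartition(token)))
    .distinct()
    .collect(Collectors.toList());
consumer.assign(owned);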
33. Tokens - Partitions Assignment
Partition assignment should be based on large ranges of tokens assigned to racks, not in any random order.
For example: with 6 partitions, 6 tokens, and 3 brokers (broker ids 0, 1, 2, one in each rack), the assignment could be:
34. Tokens - Partitions Assignment
Problem: When broker 0 goes down, broker 1 gets all the load
40. Kafka Message Compression
• Kafka supports end-to-end compression
• Data is compressed by the Kafka producer client
• Data is written in compressed format on Kafka brokers, leading to savings on disk usage
• Data is decompressed by the Kafka consumer client
41. Kafka Message Compression
• Enabling compression is as simple as setting the compression.type config on the Kafka producer client (a sketch follows)
• Compression uses extra CPU and memory on the producer/consumer
• The snappy compression type worked best for us
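A minimal producer-side sketch (the bootstrap address and serializers are placeholders; compression.type and the snappy value are standard Kafka producer settings):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

Properties props = new Properties();
props.put("bootstrap.servers", "kafka:9092");   // placeholder
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// Batches are compressed on the producer, stored compressed on the brokers,
// and decompressed by the consumer (end-to-end compression).
props.put("compression.type", "snappy");
KafkaProducer<String, String> producer = new KafkaProducer<>(props);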
42. Benefits from AZ Awareness
45% decrease in Data Transfer - Regional Bytes
43. Benefits from AZ Awareness
Decrease in average end-to-end latency over the last day