Expertise on tap
• Worked with 500+ large-scale deployments
• Author of the Cassandra C++ driver
• Contributor to k8s, Cassandra, Lucene, Hadoop
• Designed some of the largest distributed systems in existence
• Ran strategic pre-sales and product marketing at DataStax
• Founder and CTO at SourceNinja
Matthew Stump
Co-Founder & CEO
Agenda
• Why another monitoring tool?
• What do we want as operators of systems like Kafka?
• How we used Kafka as the backbone of our product
• Our architecture, highlighting use of Kafka in Kubernetes
• How our ML models work and the types of models we use
• Example wins and use cases
• Tuning Kafka for large message sizes (> 10 MB)
• Identifying replication groups in rebalancing storms
What do we fear most?
How things work now
What do we want from our tools? To know…
• What changed?
• What’s caused it?
• How do I make it stop?
• How do I tune the system for my application?
• What should I know but don’t?
Why is this hard?
• Things often fail in coordination/cascades
• Disk, garbage collection, network, request latency, CPU
• One bad actor can take down entire distributed systems
• Existing systems look at 1 signal in a vacuum
• Bad/nonexistent documentation or tuning recommendations
• Information overload: too many knobs, too many metrics
• These systems are very complex; nobody knows how they really work
Our Approach
[Screenshot of a resolution example]
High Level Architecture
Dataflow and Conceptual Architecture
Types of ML Models
Unsupervised
• Clustering: has the workload changed?
• Clustering: stable vs. unstable
Supervised
• Binary LSTM: stable vs. unstable
• Binary LSTM: are you this bug?
• Multi-class LSTM: identify 1 of N bad behaviors
Reinforcement Learning
• Optimize a “score”: tune for latency, cost, throughput
• Meta-learn best resolution
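As an illustration of the unsupervised column, here is a minimal sketch of a “has the workload changed?” check: cluster historical windows of broker metrics, then flag recent windows that fall far outside those clusters. The metric set, window size, cluster count, and threshold are assumptions for illustration, not the actual Vorstella models.

# A minimal sketch, assuming metric windows are aggregated elsewhere; the
# metric choice, k, and the 3-sigma threshold are illustrative only.
import numpy as np
from sklearn.cluster import KMeans

def workload_changed(history: np.ndarray, recent: np.ndarray, k: int = 4) -> bool:
    # history, recent: shape (n_windows, n_metrics), e.g. request latency,
    # bytes in/out, and GC pause time per five-minute window.
    model = KMeans(n_clusters=k, n_init=10).fit(history)
    # Distance from each window to its nearest learned cluster centroid.
    baseline = model.transform(history).min(axis=1)
    recent_dist = model.transform(recent).min(axis=1)
    # Flag a change when recent windows sit far outside the historical clusters.
    return recent_dist.mean() > baseline.mean() + 3 * baseline.std()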
Why Kafka?
• All of our workloads are inherently asynchronous
• Central communications bus between all of our workers
• Wildly different hardware requirements: GPU, CPU, memory
• Wildly different latencies for different workers
• Monte Carlo simulation > 60 seconds
• Clustering ≈ 2-3 milliseconds
• Decouple releases, test new ideas in low-risk ways
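A minimal sketch of that worker pattern, assuming the confluent-kafka Python client; the topic names, group id, and analyze() function are made up for illustration. Each worker type, whether millisecond clustering or minute-long simulation, runs the same loop against its own topics, which is what keeps their very different latencies and hardware needs from interfering with one another.

# Hypothetical worker: consume metric snapshots, publish analysis results.
from confluent_kafka import Consumer, Producer

def analyze(payload: bytes) -> bytes:
    # Placeholder for the worker-specific step: fast clustering, LSTM
    # inference, or a minute-long Monte Carlo simulation.
    return payload

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "clustering-worker",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "kafka:9092"})
consumer.subscribe(["metrics.snapshots"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    producer.produce("analysis.results", value=analyze(msg.value()))
    producer.poll(0)  # serve delivery callbacks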
Where Kafka Broke
• Poor tooling outside of the JVM ecosystem
• Defaults work for many people, but there is little to no documentation for tuning and troubleshooting outside of a couple of common scenarios
• Cross-network communication is problematic (multiple k8s clusters)
• Single bad actor can take down large clusters
Examples
Example One: Kafka & large messages
*Jiangjie Qin, LinkedIn
• Most benchmarks stop at < 1 MB message sizes
• No official guidelines for tuning for large messages
• No official guidelines for tuning thread pools or memory
• Most people say don’t use large messages, and recommend building a hybrid system
Example One: Kafka & large messages
java.io.IOException: Connection to 1 was disconnected before the response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:96)
…
Presentation
Replicas falling behind (ISR), large messages, connection timeouts
Results
Can safely handle messages > 30 MB at scale with a small cluster
num.io.threads ≈ 2 * CPU cores
num.network.threads ≈ 2 * CPU cores
num.replica.fetchers ≈ replica count of the largest topic, plus 20-50%
replica.fetch.max.bytes ≈ 150% of max message size
Solution
Increase replica fetcher, I/O, and network thread counts, and increase the replica fetch size. Specific values vary; an illustrative set of broker overrides is sketched below.
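Applying those rules of thumb, a server.properties fragment might look like the following. The concrete numbers assume an 8-core broker, a 30 MB maximum message size, and a largest topic with replication factor 3; they are illustrative, not universal recommendations.

# Illustrative broker overrides (assumptions: 8 CPU cores, 30 MB max message,
# largest topic has replication factor 3).

# ≈ 2 * CPU cores
num.io.threads=16
num.network.threads=16

# ≈ replica count of the largest topic, plus 20-50%
num.replica.fetchers=4

# ≈ 150% of the 30 MB maximum message size
replica.fetch.max.bytes=47185920

# Not on the slide, but required so the broker accepts 30 MB messages at all
message.max.bytes=31457280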
Example Two: Rebalance storms
Presentation
Periodic leader election timeouts, consumer lag, node timeouts
• A single bad consumer can take down a large Kafka cluster
• Leader election timeouts; unresponsive Kafka nodes fail health checks
• Default rolling restart behavior in k8s is capable of triggering a cascading failure
• Large messages or slow consumers increase the likelihood of instability
Solution
• Avoid slow asynchronous libraries (e.g., aiokafka)
• Change the k8s StatefulSet policy to podManagementPolicy: "Parallel" (see the sketch below)
• Decrease batch sizes
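A minimal StatefulSet sketch showing where that setting lives; the names and image are illustrative, not the actual deployment. podManagementPolicy: "Parallel" tells the controller to start and terminate broker pods in parallel rather than strictly one at a time.

# Illustrative StatefulSet fragment; only podManagementPolicy is the point here.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka                        # illustrative name
spec:
  serviceName: kafka
  replicas: 3
  podManagementPolicy: "Parallel"    # launch/terminate pods in parallel, not ordered
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:5.3.1   # illustrative image/tag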
We’re expanding to many systems; here are a few:
Come talk to us, we want to help
Thank You!
Matthew Stump, CEO
www.vorstella.com
@mattstump
contact@vorstella.com

What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stump, Vorstella) Kafka Summit SF 2019
