Expertise on tap
• Worked with 500+ large-scale deployments
• Author of the Cassandra C++ driver
• Contributor to k8s, Cassandra, Lucene, Hadoop
• Designed some of the largest distributed systems in existence
• Ran strategic pre-sales and product marketing at DataStax
• Founder and CTO at SourceNinja
Matthew Stump
Co-Founder & CEO
Agenda
• Why another monitoring tool?
• What do we want as operators of systems like Kafka?
• How we used Kafka as the backbone of our product
• Our architecture, highlighting use of Kafka in Kubernetes
• How our ML models work and the types of models we use
• Example wins and use cases
• Tuning Kafka for large message sizes (> 10 MB)
• Identifying replication groups in rebalancing storms
What do we fear most?
How things work now
What do we want from our tools? To know…
• What changed?
• What’s caused it?
• How do I make it stop?
• How do I tune the system for my application?
• What should I know but don’t?
Why is this hard?
• Things often fail in coordination/cascades
• Disk, garbage collection, network, request latency, CPU
• One bad actor can take down entire distributed systems
• Existing systems look at 1 signal in a vacuum
• Bad/nonexistent documentation or tuning recommendations
• Information overload: too many knobs, too many metrics
• These systems are very complex; nobody knows how they really work
Our Approach
[Screenshot of a resolution example]
High Level Architecture
Dataflow and Conceptual Architecture
Types of ML Models
Unsupervised
• Clustering: has the workload changed?
• Clustering: stable vs. unstable
Supervised
• Binary LSTM: stable vs. unstable
• Binary LSTM: are you this bug?
• Multi-class LSTM: identify 1 of N bad behaviors
Reinforcement Learning
• Optimize a “score”: tune for latency, cost, throughput
• Meta-learn best resolution
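As an illustration of the unsupervised column, here is a minimal sketch of a “has the workload changed?” check: cluster historical windows of broker metrics, then flag recent windows that fall far outside those clusters. The metric set, window size, cluster count, and threshold are assumptions for illustration, not the actual Vorstella models.

# A minimal sketch, assuming metric windows are aggregated elsewhere; the
# metric choice, k, and the 3-sigma threshold are illustrative only.
import numpy as np
from sklearn.cluster import KMeans

def workload_changed(history: np.ndarray, recent: np.ndarray, k: int = 4) -> bool:
    # history, recent: shape (n_windows, n_metrics), e.g. request latency,
    # bytes in/out, and GC pause time per five-minute window.
    model = KMeans(n_clusters=k, n_init=10).fit(history)
    # Distance from each window to its nearest learned cluster centroid.
    baseline = model.transform(history).min(axis=1)
    recent_dist = model.transform(recent).min(axis=1)
    # Flag a change when recent windows sit far outside the historical clusters.
    return recent_dist.mean() > baseline.mean() + 3 * baseline.std()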
Why Kafka?
• All of our workloads are inherently asynchronous
• Central communications bus between all of our workers
• Wildly different hardware requirements: GPU, CPU, memory
• Wildly different latencies for different workers
• Monte Carlo simulation > 60 seconds
• Clustering ≈ 2-3 milliseconds
• Decouple releases, test new ideas in low-risk ways
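A minimal sketch of that worker pattern, assuming the confluent-kafka Python client; the topic names, group id, and analyze() function are made up for illustration. Each worker type, whether millisecond clustering or minute-long simulation, runs the same loop against its own topics, which is what keeps their very different latencies and hardware needs from interfering with one another.

# Hypothetical worker: consume metric snapshots, publish analysis results.
from confluent_kafka import Consumer, Producer

def analyze(payload: bytes) -> bytes:
    # Placeholder for the worker-specific step: fast clustering, LSTM
    # inference, or a minute-long Monte Carlo simulation.
    return payload

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "clustering-worker",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "kafka:9092"})
consumer.subscribe(["metrics.snapshots"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    producer.produce("analysis.results", value=analyze(msg.value()))
    producer.poll(0)  # serve delivery callbacks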
Where Kafka Broke
• Poor tooling outside of the JVM ecosystem
• Defaults work for many people, but there is little to no documentation for tuning and troubleshooting outside of a couple of common scenarios
• Cross-network communication is problematic (multiple k8s clusters)
• Single bad actor can take down large clusters
Examples
Example One: Kafka & large messages
*Jiangjie Qin, LinkedIn
• Most benchmarks stop at < 1 MB message sizes
• No official guidelines for tuning for large messages
• No official guidelines for tuning thread pools or memory
• Most people say don’t use large messages, and recommend building a hybrid system
Example One: Kafka & large messages
java.io.IOException: Connection to 1 was disconnected before the response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:96)
…
Presentation
Replicas falling behind (ISR), large messages, connection timeouts
Results
Can safely handle messages > 30 MB at scale with a small cluster
num.io.threads ≈ 2 * CPU cores
num.network.threads ≈ 2 * CPU cores
num.replica.fetchers ≈ replica count of the largest topic, plus 20-50%
replica.fetch.max.bytes ≈ 150% of max message size
Solution
Increase replica fetcher, I/O, and network thread counts, and increase the replica fetch size. Specific values vary; an illustrative set of broker overrides is sketched below.
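Applying those rules of thumb, a server.properties fragment might look like the following. The concrete numbers assume an 8-core broker, a 30 MB maximum message size, and a largest topic with replication factor 3; they are illustrative, not universal recommendations.

# Illustrative broker overrides (assumptions: 8 CPU cores, 30 MB max message,
# largest topic has replication factor 3).

# ≈ 2 * CPU cores
num.io.threads=16
num.network.threads=16

# ≈ replica count of the largest topic, plus 20-50%
num.replica.fetchers=4

# ≈ 150% of the 30 MB maximum message size
replica.fetch.max.bytes=47185920

# Not on the slide, but required so the broker accepts 30 MB messages at all
message.max.bytes=31457280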
Example Two: Rebalance storms
Presentation
Periodic leader election timeouts, consumer lag, node timeouts
• A single bad consumer can take down a large Kafka cluster
• Leader election timeouts; unresponsive Kafka nodes fail health checks
• Default rolling restart behavior in k8s is capable of triggering a cascading failure
• Large messages or slow consumers increase the likelihood of instability
Solution
• Avoid slow asynchronous libraries (e.g., aiokafka)
• Change the k8s StatefulSet policy to podManagementPolicy: "Parallel" (see the sketch below)
• Decrease batch sizes
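A minimal StatefulSet sketch showing where that setting lives; the names and image are illustrative, not the actual deployment. podManagementPolicy: "Parallel" tells the controller to start and terminate broker pods in parallel rather than strictly one at a time.

# Illustrative StatefulSet fragment; only podManagementPolicy is the point here.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka                        # illustrative name
spec:
  serviceName: kafka
  replicas: 3
  podManagementPolicy: "Parallel"    # launch/terminate pods in parallel, not ordered
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:5.3.1   # illustrative image/tag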
We’re expanding to many systems; here are a few:
Come talk to us, we want to help
Thank You!
Matthew Stump, CEO
www.vorstella.com
@mattstump
contact@vorstella.com

What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stump, Vorstella) Kafka Summit SF 2019
