1
Using Machine Learning to Understand Kafka Runtime Behavior
Shivnath Babu
Cofounder/CTO, Unravel Data
Adjunct Professor, Duke University
shivnath@unraveldata.com
Nate Snapp
Big Data Engineering
Adobe, Palo Alto Networks, Omniture
LinkedIn or nate.snapp@gmail.com
2
Meet the speakers
• Cofounder/CTO at Unravel
• Adjunct Professor of Computer
Science at Duke University
• Focusing on ease-of-use and
manageability of data apps & platforms
• Recipient of US National Science
Foundation CAREER Award, three
IBM Faculty Awards, HP Labs
Innovation Research Award
Shivnath Babu
Nate Snapp
• Senior SRE from Adobe, Palo Alto
Networks, and Omniture
• 12 years of experience in streaming
• First 6 years on proprietary streaming analytics for 9/10 of the Fortune
500: 20B events daily, 10K+ servers
• Last 2 years on Kafka
• Blogging on SRE, Hadoop, and data
streaming space at natesnapp.com
3
MODERN DATA APPLICATIONS: Machine Learning, Predictive Analytics, AI, IoT
ENVIRONMENTS: On-Premises, Hybrid, Cloud
PLATFORMS & TECHNOLOGIES: NoSQL, SQL, MPP, API
01 uncover: ADAPTIVE DATA COLLECTION
02 understand: DATA MODEL & CORRELATION (ANALYTICS ENGINE, AUTOMATION
ENGINE, TUNING ENGINE, INFERENCE ENGINE)
03 unravel: DASHBOARDS, AUTO-ACTIONS, SMART ALERTS, REPORTING,
RECOMMENDATIONS
4
• Clusters with 6-29 brokers
• Confluent Kafka 5.2.1, Apache Kafka 2.2.0-cp2
• 1700 topics across all clusters
• Largest topics top out at 20K+ messages/sec
• Smaller topics run at 300-500 messages/sec
• Large self-service components
• Ingress is a mix of separate Kafka clusters, the Java client API, and
load-balanced REST API frontends; some clusters use the Schema Registry
• Egress is a mix of custom endpoints and, in many cases, an HDFS sink
What Kafka setups?
5
Nature of streaming behavior
6
Practical challenge #1
Variance in flow
How do we decide if there is an anomaly?
7
Variance in flow (contd.)
(Chart: message count by partition number)
8
Practical challenge #2
Negative effects of really slow data
9
Practical challenge #3
Event Sourcing
(Diagram: events on a Purchase Topic and a Cancellation Topic over time)
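A hedged sketch of the event-sourcing hazard the diagram hints at, assuming the problem is ordering across topics: when the purchase topic lags, a cancellation can arrive before the purchase it refers to, so a consumer joining the two streams must buffer unmatched cancellations. Event shapes and IDs are invented for illustration.

```python
# Join purchase and cancellation streams, tolerating out-of-order arrival.
purchases = {}        # order_id -> purchase event, still active
pending_cancels = {}  # cancellations seen before their purchase

def on_event(topic, order_id, payload):
    if topic == "purchase":
        if order_id in pending_cancels:
            pending_cancels.pop(order_id)  # late match: cancel immediately
        else:
            purchases[order_id] = payload
    elif topic == "cancellation":
        if order_id in purchases:
            purchases.pop(order_id)
        else:
            pending_cancels[order_id] = payload  # arrived out of order

# The cancellation arrives first because the purchase topic is slow:
on_event("cancellation", "o1", {"reason": "user"})
on_event("purchase", "o1", {"amount": 30})
print(purchases, pending_cancels)  # -> {} {} : the join reconciled
```

Without the `pending_cancels` buffer, the early cancellation would be dropped and the late purchase would stay active forever.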
10
• Runtime schema changes
• “Flexible-Rigid Schema”
• Timeouts causing rebalance storms
• Leader affinity and poor assignment
• Poor partition assignment
Elements of surprise!
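One of the surprises above, rebalance storms triggered by timeouts, is usually tuned through consumer settings. A sketch of the relevant Kafka consumer properties; the values are illustrative starting points, not recommendations:

```properties
session.timeout.ms=30000      # broker marks the consumer dead after this
heartbeat.interval.ms=10000   # keep well under session.timeout.ms
max.poll.interval.ms=300000   # max allowed time between poll() calls
max.poll.records=500          # smaller batches keep poll() frequent
```

A consumer whose processing loop exceeds `max.poll.interval.ms`, or whose heartbeats miss `session.timeout.ms`, is evicted from the group and triggers a rebalance; if many consumers sit near these limits, rebalances cascade.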
11
Most enterprises now have mission-critical streaming apps:
• Anomaly Detection
• Predictive Maintenance
• Threat Monitoring
• Recommendation Engines
• Real-time customer sentiment analysis
12
Streaming data architecture must be reliable
STREAM STORE
Kafka HBase Spark Flink
REAL-TIME PROCESSOR
IoT Sensors
Database
Other Data
Dashboard
Result Store
Other Output
13
Many problems can cause unreliable performance
STREAM STORE
Kafka HBase Spark Flink
REAL-TIME PROCESSOR
IoT Sensors
Database
Other Data
Dashboard
Result Store
Other Output
Untimely results
14
Many problems can cause unreliable performance
STREAM STORE
Kafka HBase Spark Flink
REAL-TIME PROCESSOR
IoT Sensors
Database
Other Data
Dashboard
Result Store
Other Output
Poor partitioning + Inefficient Configuration + Resource contention = Untimely results
15
DevOps face many challenges today
STREAM STORE
Kafka HBase Spark Flink
REAL-TIME PROCESSOR
IoT Sensors
Database
Other Data
Dashboard
Result Store
Other Output
Poor partitioning + Inefficient Configuration + Resource contention
• No single tool
• No correlation across the stack
• No application view
• No insights
• No recommendations
• No automated actions
= Untimely results
16
How we can empower DevOps teams
IoT Sensors
Database
Other Data
Dashboard
Result Store
Other Output
Platform Metrics App Metrics
STREAM STORE
Kafka HBase Spark Flink
REAL-TIME PROCESSOR
App-Platform Interaction Metrics
Bring all performance data into
one complete & correlated view
17
How we can empower DevOps teams
IoT Sensors
Database
Other Data
Dashboard
Result Store
Other Output
Platform Metrics App Metrics
STREAM STORE
Kafka HBase Spark Flink
REAL-TIME PROCESSOR
App-Platform Interaction Metrics
Provide out-of-the-box
intelligence with
Machine Learning (ML)
18
How we can empower DevOps teams
IoT Sensors
Database
Other Data
Dashboard
Result Store
Other Output
Platform Metrics App Metrics
STREAM STORE
Kafka HBase Spark Flink
REAL-TIME PROCESSOR
App-Platform Interaction Metrics
Automate actions
smartly with
Artificial Intelligence (AI)
19
20
(Diagram: mapping each GOAL to an ALGORITHM)
21
Goals of Streaming App and Kafka DevOps Teams
Throughput SLA
Latency SLA
Data loss tolerance
Stability/resiliency
Resource usage/cost
Planning/growth
App-level Goals
Platform-level Goals
AI/ML Algorithms
Outlier Detection
Forecasting
Anomaly Detection
Correlation Analysis
Model Learning
22
Outlier Detection
23
1. Detecting load imbalance among Kafka brokers
2. Detecting load imbalance among Kafka partitions
Use Cases
24
Detecting load imbalance among Kafka brokers
Brokers kabo2 and
kabo3 have much
higher number of
incoming messages
than broker kabo1
25
Algorithms for Outlier Detection
Picture credit: http://historum.com/asian-history/128081-aryan-migration-theory-update-128.html
• Based on one feature vs. multiple features
• Is the distribution of data
assumed?
1. Z-score: How many standard
deviations is a data point from the
mean
26
• Based on one feature vs. multiple features
• Is the distribution of data
assumed?
1. Z-score: How many standard
deviations is a data point from the
mean
2. DBScan: Density-based clustering
3. Isolation forests
4. Deep learning (e.g., Autoencoders)
Algorithms for Outlier Detection
Picture credit: http://en.proft.me/2017/02/3/density-based-clustering-r/
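The z-score approach above can be sketched in a few lines: flag any broker whose incoming-message rate sits more than `threshold` standard deviations from the fleet mean. Broker names and rates below are invented for illustration, not taken from the talk's clusters.

```python
from statistics import mean, pstdev

def zscore_outliers(rates, threshold=2.0):
    """rates: dict of broker name -> messages/sec. Returns flagged brokers."""
    mu = mean(rates.values())
    sigma = pstdev(rates.values())
    if sigma == 0:
        return []  # all brokers identical, nothing to flag
    return [b for b, r in rates.items() if abs(r - mu) / sigma > threshold]

rates = {"kabo1": 1000, "kabo2": 1050, "kabo3": 980,
         "kabo4": 1020, "kabo5": 990, "kabo6": 5000}
print(zscore_outliers(rates))  # -> ['kabo6']: the hot broker stands out
```

Note the assumption baked in: z-scores work on a single feature and implicitly assume a roughly normal distribution, which is exactly the trade-off the bullets above call out; DBSCAN and isolation forests relax both.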
27
Forecasting
28
1. Predicting when SLAs are in danger of being missed
2. Predicting when system may run out of headroom or
capacity
Use Cases
29
A Real-life Application: RealTimeSentimentMonitor
(Diagram: TWEETS streaming into Partitions 1-N)
30
Predicting when SLAs are in danger of being missed
Latency SLA is
3 minutes
Latency SLA can be
missed by this time
Current time is here
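A minimal sketch of the prediction on this slide, assuming the simplest possible forecaster: fit a linear trend to recent latency samples and extrapolate to the time the 3-minute SLA line is crossed. The samples are invented; real systems would use ARIMA, Holt-Winters, or Prophet as discussed on the following slides.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# minutes since monitoring started -> end-to-end latency in seconds
times = [0, 1, 2, 3, 4, 5]
latency = [60, 72, 85, 96, 110, 121]

slope, intercept = fit_line(times, latency)
sla = 180  # 3-minute latency SLA, in seconds
breach_at = (sla - intercept) / slope
print(f"SLA projected to be missed ~{breach_at:.1f} minutes from start")
```

The value of the forecast is lead time: at minute 5 the latency is still only ~2 minutes, but the trend already predicts a breach several minutes out, which is when an alert is actually actionable.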
31
• Many standard time-series forecasting
techniques: ARIMA, Holt-Winters
• Deep-learning techniques (e.g., LSTM)
Algorithms for Forecasting
32
• Many standard time-series forecasting
techniques: ARIMA, Holt-Winters
• Deep-learning techniques (e.g., LSTM)
• Facebook’s Prophet Algorithm: Mixes stats
methods & judgment from domain experts
• Uses a Generalized Additive Model (GAM)
• Decomposed time-series model: trend,
seasonality, holidays, and error term
Algorithms for Forecasting
y(t) = trend(t) + periodic(t) + shock(t) + error
33
• Many standard time-series forecasting
techniques: ARIMA, Holt-Winters
• Deep-learning techniques (e.g., LSTM)
• Facebook’s Prophet Algorithm: Mixes stats
methods & judgment from domain experts
• Uses a Generalized Additive Model (GAM)
• Decomposed time-series model: trend,
seasonality, holidays, and error term
• Advantages:
• Fits faster than ARIMA
• Models various growth trends
• Can handle unevenly spaced data
• Defaults often produce accurate forecasts
Algorithms for Forecasting
34
Anomaly Detection
35
1. An unexpected change that needs your attention
2. Smart alerts:
• False negatives should be minimal
• False positives should be minimal
Use Cases
36
Detecting anomalies is tricky
Is this an unexpected lag
worth alerting on?
37
Algorithms for Anomaly Detection
Picture credit: https://blog.statsbot.co/time-series-anomaly-detection-algorithms-1cef5519aef2
• Deviation from forecasts
38
Algorithms for Anomaly Detection
Picture credit: https://blog.statsbot.co/time-series-anomaly-detection-algorithms-1cef5519aef2
• Deviation from forecasts
• ARIMA
• Regression trees
• Prophet
• STL: Seasonal and Trend
Decomposition using Loess
• Topic of intensive
research
• Deep learning (e.g., LSTM)
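The first approach listed, deviation from a forecast, can be sketched with the most naive forecaster available, a trailing moving average: flag any point that falls outside mean ± k standard deviations of its recent window. The lag samples are invented for illustration.

```python
from statistics import mean, pstdev

def anomalies(series, window=5, k=3.0):
    """Return indices of points deviating > k stddevs from the trailing window."""
    flagged = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu, sigma = mean(hist), pstdev(hist)
        if sigma and abs(series[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged

lag = [10, 12, 11, 13, 12, 11, 12, 95, 13, 12]  # consumer lag samples
print(anomalies(lag))  # -> [7]: only the spike is flagged
```

This is where the smart-alerts trade-off from the use-case slide lives: widening `k` or `window` suppresses false positives but delays or misses real anomalies, which is why the listed methods (STL, Prophet, LSTM) model trend and seasonality instead of a flat window.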
39
Correlation Analysis
40
• Fast root-causing of problems
• What lower-level cause led to the change in the
streaming application’s performance?
Use Cases
41
What caused the unexpected change in performance?
Anomaly
What caused it?
100s of time series from every level of the stack!
LATENCY is 421.07% WORSE THAN THE BASELINE
42
• Be aware of the many pitfalls
• E.g., trends can make arbitrary time
series look correlated!
• Pick robust time-series
similarity metrics
• E.g., Euclidean distance vs. Dynamic Time Warping
Algorithms for Correlation Analysis
Picture credit: https://izbicki.me/blog/converting-images-into-time-series-for-data-mining.html
(Figure: Euclidean Distance vs. Dynamic Time Warping alignments)
43
• Be aware of the many pitfalls
• E.g., trends can make arbitrary time
series look correlated!
• Pick robust time-series
similarity metrics
• E.g., Euclidean distance vs. Dynamic Time Warping
• Carefully incorporate domain
knowledge
• E.g., what caused latency SLA miss?
• Application-level problem?
• Resource allocation problem?
• Platform-level problem?
• Data-level problem?
Algorithms for Correlation Analysis
Picture credit: https://izbicki.me/blog/converting-images-into-time-series-for-data-mining.html
(Figure: Euclidean Distance vs. Dynamic Time Warping alignments)
44
Model Learning
45
1. Helps answer what-if and optimization questions
• What is the best number of partitions?
• What is the best setting of timeouts to avoid rebalance storms?
• What is the best partition rebalancing action to take?
• What will the impact of adding a new broker be?
2. Enables Auto Actions for resource/cost efficiency & SLA management
Use Cases
46
Automated tuning suggestions to meet SLA
Precise recommendation to meet SLA
47
• Performance = Func(Input Features)
• Have to find the best set of input features
• Supervised learning is often possible: Training data is available
or easy to generate
Algorithms for Learning Models
Picture credit: https://myslide.cn/slides/8328#
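A minimal sketch of "Performance = Func(Input Features)": learn a model from observed (partition count, latency) training pairs, then use it to answer a what-if question such as the best-number-of-partitions use case above. The training data is synthetic, and the single-feature reciprocal model is a deliberately tiny stand-in for real supervised learning over many features.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Training data: partition counts tried, and the latency observed at each.
partitions = [4, 8, 16, 32]
latency_s = [240, 130, 70, 40]  # synthetic observations

# Latency often falls roughly like 1/partitions, so fit latency against 1/x:
b, a = fit_line([1 / p for p in partitions], latency_s)

def predict(p):
    return a + b / p

print(round(predict(64)))  # what-if: projected latency at 64 partitions
```

This is also why the slide notes that supervised learning is often feasible here: unlike anomaly labels, training data like the table above is cheap to generate by simply running the workload at a few configurations.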
48
Summary: Meeting Kafka DevOps Goals with AI/ML
Throughput goal
Stability goal
Latency goal
Resource usage/cost goal
Data loss tolerance goal
App-level Goals
Platform-level Goals
Planning/growth goal
AI/ML Algorithms
Outlier Detection
Forecasting
Anomaly Detection
Correlation Analysis
Model Learning
49
AIOps: Rich opportunities to address
distributed application performance
management as AI/ML problems
Start your free trial: unraveldata.com/free-trial
Visit us at the Unravel booth
And yes, we are hiring!
shivnath@unraveldata.com

Using Machine Learning to Understand Kafka Runtime Behavior (Shivnath Babu, Unravel Data, and Nate Snapp, Adobe), Kafka Summit London 2019
