1
Using Machine Learning to Understand Kafka Runtime Behavior
Shivnath Babu
Cofounder/CTO, Unravel Data
Adjunct Professor, Duke University
shivnath@unraveldata.com
Nate Snapp
Big Data Engineering
Adobe, Palo Alto Networks, Omniture
LinkedIn or nate.snapp@gmail.com
2
Meet the speakers
• Cofounder/CTO at Unravel
• Adjunct Professor of Computer
Science at Duke University
• Focusing on ease-of-use and
manageability of data apps & platforms
• Recipient of US National Science
Foundation CAREER Award, three
IBM Faculty Awards, HP Labs
Innovation Research Award
Shivnath Babu
Nate Snapp
• Senior SRE from Adobe, Palo Alto
Networks, and Omniture
• 12 years of experience in streaming
• First 6 years on proprietary streaming analytics for 9/10 of the Fortune
500: 20B events daily, 10K+ servers
• Last 2 years on Kafka
• Blogging on SRE, Hadoop, and data
streaming space at natesnapp.com
3
MODERN DATA APPLICATIONS: Machine Learning, Predictive Analytics, AI, IoT
ENVIRONMENTS: On-Premises, Hybrid, Cloud
PLATFORMS & TECHNOLOGIES: NoSQL, SQL, MPP, API
01 uncover: ADAPTIVE DATA COLLECTION
02 understand: DATA MODEL & CORRELATION (ANALYTICS ENGINE, AUTOMATION
ENGINE, TUNING ENGINE, INFERENCE ENGINE)
03 unravel: DASHBOARDS, AUTO-ACTIONS, SMART ALERTS, REPORTING,
RECOMMENDATIONS
4
• Clusters with 6-29 brokers
• Confluent Kafka 5.2.1, Apache Kafka 2.2.0-cp2
• 1700 topics across all clusters
• Largest topics top out at 20K+ messages/sec
• Smaller topics run at 300-500 messages/sec
• Large self-service components
• Ingress is a mix of separate Kafka clusters, the Java client API, and
load-balanced REST API frontends; some clusters use the Schema Registry
• Egress is a mix of custom endpoints and, in many cases, an HDFS sink
What Kafka setups?
5
Nature of streaming behavior
6
Practical challenge #1
Variance in flow
How do we decide if there is an anomaly?
7
Variance in flow (contd.)
(Chart: message count by partition number)
8
Practical challenge #2
Negative effects of really slow data
9
Practical challenge #3
Event Sourcing
(Diagram: events on a Purchase Topic and a Cancellation Topic over time)
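A hedged sketch of the event-sourcing hazard the diagram hints at, assuming the problem is ordering across topics: when the purchase topic lags, a cancellation can arrive before the purchase it refers to, so a consumer joining the two streams must buffer unmatched cancellations. Event shapes and IDs are invented for illustration.

```python
# Join purchase and cancellation streams, tolerating out-of-order arrival.
purchases = {}        # order_id -> purchase event, still active
pending_cancels = {}  # cancellations seen before their purchase

def on_event(topic, order_id, payload):
    if topic == "purchase":
        if order_id in pending_cancels:
            pending_cancels.pop(order_id)  # late match: cancel immediately
        else:
            purchases[order_id] = payload
    elif topic == "cancellation":
        if order_id in purchases:
            purchases.pop(order_id)
        else:
            pending_cancels[order_id] = payload  # arrived out of order

# The cancellation arrives first because the purchase topic is slow:
on_event("cancellation", "o1", {"reason": "user"})
on_event("purchase", "o1", {"amount": 30})
print(purchases, pending_cancels)  # -> {} {} : the join reconciled
```

Without the `pending_cancels` buffer, the early cancellation would be dropped and the late purchase would stay active forever.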
10
• Runtime schema changes
• “Flexible-Rigid Schema”
• Timeouts causing rebalance storms
• Leader affinity and poor assignment
• Poor partition assignment
Elements of surprise!
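One of the surprises above, rebalance storms triggered by timeouts, is usually tuned through consumer settings. A sketch of the relevant Kafka consumer properties; the values are illustrative starting points, not recommendations:

```properties
session.timeout.ms=30000      # broker marks the consumer dead after this
heartbeat.interval.ms=10000   # keep well under session.timeout.ms
max.poll.interval.ms=300000   # max allowed time between poll() calls
max.poll.records=500          # smaller batches keep poll() frequent
```

A consumer whose processing loop exceeds `max.poll.interval.ms`, or whose heartbeats miss `session.timeout.ms`, is evicted from the group and triggers a rebalance; if many consumers sit near these limits, rebalances cascade.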
11
Most enterprises now have mission-critical streaming apps:
• Anomaly Detection
• Predictive Maintenance
• Threat Monitoring
• Recommendation Engines
• Real-time customer sentiment analysis
12
Streaming data architecture must be reliable
STREAM STORE
Kafka HBase Spark Flink
REAL-TIME PROCESSOR
IoT Sensors
Database
Other Data
Dashboard
Result Store
Other Output
13
Many problems can cause unreliable performance
STREAM STORE
Kafka HBase Spark Flink
REAL-TIME PROCESSOR
IoT Sensors
Database
Other Data
Dashboard
Result Store
Other Output
Untimely results
14
Many problems can cause unreliable performance
STREAM STORE
Kafka HBase Spark Flink
REAL-TIME PROCESSOR
IoT Sensors
Database
Other Data
Dashboard
Result Store
Other Output
Poor partitioning + Inefficient Configuration + Resource contention = Untimely results
15
DevOps face many challenges today
STREAM STORE
Kafka HBase Spark Flink
REAL-TIME PROCESSOR
IoT Sensors
Database
Other Data
Dashboard
Result Store
Other Output
Poor partitioning + Inefficient Configuration + Resource contention
• No single tool
• No correlation across the stack
• No application view
• No insights
• No recommendations
• No automated actions
= Untimely results
16
How we can empower DevOps teams
IoT Sensors
Database
Other Data
Dashboard
Result Store
Other Output
Platform Metrics App Metrics
STREAM STORE
Kafka HBase Spark Flink
REAL-TIME PROCESSOR
App-Platform Interaction Metrics
Bring all performance data into
one complete & correlated view
17
How we can empower DevOps teams
IoT Sensors
Database
Other Data
Dashboard
Result Store
Other Output
Platform Metrics App Metrics
STREAM STORE
Kafka HBase Spark Flink
REAL-TIME PROCESSOR
App-Platform Interaction Metrics
Provide out-of-the-box
intelligence with
Machine Learning (ML)
18
How we can empower DevOps teams
IoT Sensors
Database
Other Data
Dashboard
Result Store
Other Output
Platform Metrics App Metrics
STREAM STORE
Kafka HBase Spark Flink
REAL-TIME PROCESSOR
App-Platform Interaction Metrics
Automate actions
smartly with
Artificial Intelligence (AI)
19
20
(Diagram: mapping each GOAL to an ALGORITHM)
21
Goals of Streaming App and Kafka DevOps Teams
Throughput SLA
Latency SLA
Data loss tolerance
Stability/resiliency
Resource usage/cost
Planning/growth
App-level Goals
Platform-level Goals
AI/ML Algorithms
Outlier Detection
Forecasting
Anomaly Detection
Correlation Analysis
Model Learning
22
Outlier Detection
23
1. Detecting load imbalance among Kafka brokers
2. Detecting load imbalance among Kafka partitions
Use Cases
24
Detecting load imbalance among Kafka brokers
Brokers kabo2 and
kabo3 have much
higher number of
incoming messages
than broker kabo1
25
Algorithms for Outlier Detection
Picture credit: http://historum.com/asian-history/128081-aryan-migration-theory-update-128.html
• Based on one feature vs. multiple features
• Is the distribution of data
assumed?
1. Z-score: How many standard
deviations is a data point from the
mean
26
• Based on one feature vs. multiple features
• Is the distribution of data
assumed?
1. Z-score: How many standard
deviations is a data point from the
mean
2. DBScan: Density-based clustering
3. Isolation forests
4. Deep learning (e.g., Autoencoders)
Algorithms for Outlier Detection
Picture credit: http://en.proft.me/2017/02/3/density-based-clustering-r/
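The z-score approach above can be sketched in a few lines: flag any broker whose incoming-message rate sits more than `threshold` standard deviations from the fleet mean. Broker names and rates below are invented for illustration, not taken from the talk's clusters.

```python
from statistics import mean, pstdev

def zscore_outliers(rates, threshold=2.0):
    """rates: dict of broker name -> messages/sec. Returns flagged brokers."""
    mu = mean(rates.values())
    sigma = pstdev(rates.values())
    if sigma == 0:
        return []  # all brokers identical, nothing to flag
    return [b for b, r in rates.items() if abs(r - mu) / sigma > threshold]

rates = {"kabo1": 1000, "kabo2": 1050, "kabo3": 980,
         "kabo4": 1020, "kabo5": 990, "kabo6": 5000}
print(zscore_outliers(rates))  # -> ['kabo6']: the hot broker stands out
```

Note the assumption baked in: z-scores work on a single feature and implicitly assume a roughly normal distribution, which is exactly the trade-off the bullets above call out; DBSCAN and isolation forests relax both.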
27
Forecasting
28
1. Predicting when SLAs are in danger of being missed
2. Predicting when system may run out of headroom or
capacity
Use Cases
29
A Real-life Application: RealTimeSentimentMonitor
(Diagram: TWEETS streaming into Partitions 1-N)
30
Predicting when SLAs are in danger of being missed
Latency SLA is
3 minutes
Latency SLA can be
missed by this time
Current time is here
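A minimal sketch of the prediction on this slide, assuming the simplest possible forecaster: fit a linear trend to recent latency samples and extrapolate to the time the 3-minute SLA line is crossed. The samples are invented; real systems would use ARIMA, Holt-Winters, or Prophet as discussed on the following slides.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# minutes since monitoring started -> end-to-end latency in seconds
times = [0, 1, 2, 3, 4, 5]
latency = [60, 72, 85, 96, 110, 121]

slope, intercept = fit_line(times, latency)
sla = 180  # 3-minute latency SLA, in seconds
breach_at = (sla - intercept) / slope
print(f"SLA projected to be missed ~{breach_at:.1f} minutes from start")
```

The value of the forecast is lead time: at minute 5 the latency is still only ~2 minutes, but the trend already predicts a breach several minutes out, which is when an alert is actually actionable.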
31
• Many standard time-series forecasting
techniques: ARIMA, Holt-Winters
• Deep-learning techniques (e.g., LSTM)
Algorithms for Forecasting
32
• Many standard time-series forecasting
techniques: ARIMA, Holt-Winters
• Deep-learning techniques (e.g., LSTM)
• Facebook’s Prophet Algorithm: Mixes stats
methods & judgment from domain experts
• Uses a Generalized Additive Model (GAM)
• Decomposed time-series model: trend,
seasonality, holidays, and error term
Algorithms for Forecasting
y(t) = trend(t) + periodic(t) + shock(t) + error
33
• Many standard time-series forecasting
techniques: ARIMA, Holt-Winters
• Deep-learning techniques (e.g., LSTM)
• Facebook’s Prophet Algorithm: Mixes stats
methods & judgment from domain experts
• Uses a Generalized Additive Model (GAM)
• Decomposed time-series model: trend,
seasonality, holidays, and error term
• Advantages:
• Fits faster than ARIMA
• Models various growth trends
• Can handle unevenly spaced data
• Defaults often produce accurate forecasts
Algorithms for Forecasting
34
Anomaly Detection
35
1. An unexpected change that needs your attention
2. Smart alerts:
• False negatives should be minimal
• False positives should be minimal
Use Cases
36
Detecting anomalies is tricky
Is this an unexpected lag
worth alerting on?
37
Algorithms for Anomaly Detection
Picture credit: https://blog.statsbot.co/time-series-anomaly-detection-algorithms-1cef5519aef2
• Deviation from forecasts
38
Algorithms for Anomaly Detection
Picture credit: https://blog.statsbot.co/time-series-anomaly-detection-algorithms-1cef5519aef2
• Deviation from forecasts
• ARIMA
• Regression trees
• Prophet
• STL: Seasonal and Trend
Decomposition using Loess
• Topic of intensive
research
• Deep learning (e.g., LSTM)
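The first approach listed, deviation from a forecast, can be sketched with the most naive forecaster available, a trailing moving average: flag any point that falls outside mean ± k standard deviations of its recent window. The lag samples are invented for illustration.

```python
from statistics import mean, pstdev

def anomalies(series, window=5, k=3.0):
    """Return indices of points deviating > k stddevs from the trailing window."""
    flagged = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu, sigma = mean(hist), pstdev(hist)
        if sigma and abs(series[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged

lag = [10, 12, 11, 13, 12, 11, 12, 95, 13, 12]  # consumer lag samples
print(anomalies(lag))  # -> [7]: only the spike is flagged
```

This is where the smart-alerts trade-off from the use-case slide lives: widening `k` or `window` suppresses false positives but delays or misses real anomalies, which is why the listed methods (STL, Prophet, LSTM) model trend and seasonality instead of a flat window.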
39
Correlation Analysis
40
• Fast root-causing of problems
• What lower-level cause led to the change in the
streaming application’s performance?
Use Cases
41
What caused the unexpected change in performance?
Anomaly
What caused it?
100s of time series from every level of the stack!
LATENCY is 421.07% WORSE THAN THE BASELINE
42
• Be aware of the many pitfalls
• E.g., trends can make arbitrary time
series look correlated!
• Pick robust time-series
similarity metrics
• E.g., Euclidean distance vs. Dynamic Time Warping
Algorithms for Correlation Analysis
Picture credit: https://izbicki.me/blog/converting-images-into-time-series-for-data-mining.html
(Figure: Euclidean Distance vs. Dynamic Time Warping alignments)
43
• Be aware of the many pitfalls
• E.g., trends can make arbitrary time
series look correlated!
• Pick robust time-series
similarity metrics
• E.g., Euclidean distance vs. Dynamic Time Warping
• Carefully incorporate domain
knowledge
• E.g., what caused latency SLA miss?
• Application-level problem?
• Resource allocation problem?
• Platform-level problem?
• Data-level problem?
Algorithms for Correlation Analysis
Picture credit: https://izbicki.me/blog/converting-images-into-time-series-for-data-mining.html
(Figure: Euclidean Distance vs. Dynamic Time Warping alignments)
44
Model Learning
45
1. Helps answer what-if and optimization questions
• What is the best number of partitions?
• What is the best setting of timeouts to avoid rebalance storms?
• What is the best partition rebalancing action to take?
• What will the impact of adding a new broker be?
2. Enables Auto Actions for resource/cost efficiency & SLA management
Use Cases
46
Automated tuning suggestions to meet SLA
Precise recommendation to meet SLA
47
• Performance = Func(Input Features)
• Have to find the best set of input features
• Supervised learning is often possible: Training data is available
or easy to generate
Algorithms for Learning Models
Picture credit: https://myslide.cn/slides/8328#
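A minimal sketch of "Performance = Func(Input Features)": learn a model from observed (partition count, latency) training pairs, then use it to answer a what-if question such as the best-number-of-partitions use case above. The training data is synthetic, and the single-feature reciprocal model is a deliberately tiny stand-in for real supervised learning over many features.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Training data: partition counts tried, and the latency observed at each.
partitions = [4, 8, 16, 32]
latency_s = [240, 130, 70, 40]  # synthetic observations

# Latency often falls roughly like 1/partitions, so fit latency against 1/x:
b, a = fit_line([1 / p for p in partitions], latency_s)

def predict(p):
    return a + b / p

print(round(predict(64)))  # what-if: projected latency at 64 partitions
```

This is also why the slide notes that supervised learning is often feasible here: unlike anomaly labels, training data like the table above is cheap to generate by simply running the workload at a few configurations.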
48
Summary: Meeting Kafka DevOps Goals with AI/ML
Throughput goal
Stability goal
Latency goal
Resource usage/cost goal
Data loss tolerance goal
App-level Goals
Platform-level Goals
Planning/growth goal
AI/ML Algorithms
Outlier Detection
Forecasting
Anomaly Detection
Correlation Analysis
Model Learning
49
AIOps: Rich opportunities to address
distributed application performance
management as AI/ML problems
Start your free trial: unraveldata.com/free-trial
Visit us at the Unravel booth
And yes, we are hiring!
shivnath@unraveldata.com

Using Machine Learning to Understand Kafka Runtime Behavior (Shivnath Babu, Unravel Data, and Nate Snapp, Adobe), Kafka Summit London 2019
