Finding bad apples early: Minimizing performance impact

Arun Kejariwal
(@arun_kejariwal)

2
Real-time information consumption
> 500M Tweets per day
> 300 hrs of video uploaded to YouTube every minute
69% Y/Y Growth in Data Traffic
Mobile E-commerce
Mobile Trading

3
Monitoring
Availability, Performance, Reliability
Large number of micro services
Large number of clusters
Multiple data centers
Millions of time series
Data fidelity
Anomaly Detection

4
Anomaly Detection: Why bother?
Performance regression
Potentially user impacting
Sources
Software
Inefficiency
Hardware
Slow
Failure

5
Visual Analysis

6
Anomaly Detection
Algorithms
> 50 yrs of Research
No model fits all
Time series analysis
Univariate
Multivariate
Frequency domain analysis
Fourier transform
Wavelet transform
Clustering
K-Means

7
Flavors of Anomaly Detection
Overview
Univariate Time Series
# Clicks
Multi-variate Time Series
CPU, Memory
Set of Time Series
# Tasks Completed

9
Host level (contd.)
Contextual Application Topology Map
Hierarchical
Datacenter -> Applications -> Services -> Hosts
•  Automatically discover Developer / Architect’s view of the application - for the Operations team
-  Framework for system config and context
•  Real-time, streaming architecture
-  Keeps up with today’s elastic infrastructure
•  Scale to 1000s of hosts, 100s of (micro) services
•  Present evolution of system state over time
-  DVR-like replay of health, system changes, failures
Evolving Needs of Modern Operations

10
Large clusters
“Slow” nodes
Failed Nodes
Visual analysis?

11
Our Approach
Algorithmic: Simple, fast and effective
Automated
Scalable to millions of time series

12
The Approach
Bird’s Eye View
Input
Set of time series
Nodes of the same cluster
Same metric, e.g., CPU utilization
Output
Anomalous time series
Classification

13
The Approach (contd.)
Metrics of Interest
Requests
Tasks completed
Latency
p99
p999
CPU
Bytes out (Kafka)
rx_bytes
TCP retransmits
iowait, iowait_max
Examples

14
The Approach (contd.)
Robust against anomalies
Average
Median
Examples
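
The contrast between the average and the median can be sketched as follows. This is a minimal illustration, not the implementation behind the slides: the `rolling` helper, the window width of 3 points, and the sample values are all assumptions made for the example.

```python
from statistics import mean, median

def rolling(values, width, reducer):
    """Apply `reducer` over a trailing window of up to `width` points."""
    return [reducer(values[max(0, i - width + 1): i + 1])
            for i in range(len(values))]

# Hypothetical CPU utilization samples with a single spike at index 4.
cpu = [20, 21, 19, 20, 95, 20, 21, 20]

rolling_avg = rolling(cpu, 3, mean)    # the spike leaks into three windows
rolling_med = rolling(cpu, 3, median)  # the spike is ignored

print("average:", rolling_avg)
print("median: ", rolling_med)
```

A single anomalous point drags the rolling average up for as long as it stays inside the window, while the rolling median stays near the typical value.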

15
The Approach (contd.)
Robust against irregular patterns
Examples

16
The Approach (contd.)
Parameters
6 hr
1 day
Historical window width

17
Parameters
Large
High precision
Low recall
Small
Low precision
High recall
Classification threshold
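
The precision/recall trade-off can be made concrete with a small sketch. The per-node scores, the set of truly bad nodes, and both thresholds below are hypothetical values chosen for illustration; `flag` and `precision_recall` are helper names introduced here, not part of the original approach.

```python
def flag(scores, threshold):
    """Return the nodes whose score exceeds the threshold."""
    return {node for node, score in scores.items() if score > threshold}

def precision_recall(flagged, truly_bad):
    """Precision and recall of a flagged set against known bad nodes."""
    hits = len(flagged & truly_bad)
    precision = hits / len(flagged) if flagged else 1.0
    recall = hits / len(truly_bad) if truly_bad else 1.0
    return precision, recall

# Hypothetical anomaly scores per node, and hypothetical ground truth.
scores = {"node1": 1.0, "node2": 1.1, "node3": 1.6, "node4": 3.8}
truly_bad = {"node3", "node4"}

small = precision_recall(flag(scores, 1.05), truly_bad)  # flags node2, node3, node4
large = precision_recall(flag(scores, 3.0), truly_bad)   # flags node4 only

print("small threshold (precision, recall):", small)
print("large threshold (precision, recall):", large)
```

With the small threshold every bad node is caught but a healthy node is flagged too; with the large threshold nothing healthy is flagged but one bad node is missed.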

18
Compute rolling median time series
Compute pairwise distance matrix between time series
Distance measure
Euclidean
Minkowski
Compute vector of sum of pairwise distances
Classify the time series which satisfy the following as anomalous:
Sum(Pairwise Distances)/Median(Pairwise Distances) > threshold
Outline
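
The outline above can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the production implementation: the function names, the window width, the threshold value, and the sample data are hypothetical. In particular, the ratio is read here as each node's sum of pairwise distances divided by the median of those sums across the cluster, which is one plausible interpretation of the slide.

```python
from statistics import median

def rolling_median(series, width):
    """Trailing rolling median with a window of up to `width` points."""
    return [median(series[max(0, i - width + 1): i + 1])
            for i in range(len(series))]

def minkowski(a, b, p=2):
    """Minkowski distance between equal-length series; p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def find_anomalous(series_by_node, width=3, threshold=2.0, p=2):
    names = list(series_by_node)
    smoothed = {n: rolling_median(series_by_node[n], width) for n in names}
    # Row sums of the pairwise distance matrix: one total per node.
    totals = {n: sum(minkowski(smoothed[n], smoothed[m], p)
                     for m in names if m != n)
              for n in names}
    # Assumption: normalize by the median of the per-node totals.
    typical = median(totals.values())
    if typical == 0:
        return []  # all series are identical after smoothing
    return [n for n in names if totals[n] / typical > threshold]

# Hypothetical CPU utilization for five nodes of the same cluster.
cluster = {
    "node1": [50, 51, 49, 50, 52, 50],
    "node2": [49, 50, 50, 51, 50, 49],
    "node3": [51, 50, 52, 50, 49, 51],
    "node4": [50, 49, 51, 50, 50, 50],
    "node5": [90, 91, 89, 92, 90, 91],  # the bad apple
}

print(find_anomalous(cluster))
```

Note that with a single outlier among N nodes, the ratio under this reading is bounded by roughly N - 1, so the threshold has to be chosen with the cluster size in mind.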

19
Inconsistent data
Time series of unequal length
Data collection issue
Cold start issue
How to address?
Truncate at right
Omit inconsistent timestamps
Interpolate
Large number of anomalies
Non-uniform underlying trend across the time series
Edge cases
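
The three ways of handling inconsistent data listed above can be sketched as follows. The helper names and the data layout (a list of values, or a dict mapping timestamp to value) are assumptions made for illustration only.

```python
def truncate_right(series_by_node):
    """Cut every series to the shortest length (assumes a shared start time)."""
    shortest = min(len(s) for s in series_by_node.values())
    return {node: s[:shortest] for node, s in series_by_node.items()}

def common_timestamps(series_by_node):
    """Omit timestamps that are not present in every node's series."""
    shared = sorted(set.intersection(*(set(s) for s in series_by_node.values())))
    return {node: [s[t] for t in shared] for node, s in series_by_node.items()}

def interpolate(series, timestamps):
    """Linearly interpolate missing timestamps between known neighbors."""
    known = sorted(series)
    filled = []
    for t in timestamps:
        if t in series:
            filled.append(series[t])
            continue
        before = [k for k in known if k < t]
        after = [k for k in known if k > t]
        if not before or not after:
            filled.append(None)  # no neighbor on one side, e.g. cold start
            continue
        t0, t1 = before[-1], after[0]
        weight = (t - t0) / (t1 - t0)
        filled.append(series[t0] + weight * (series[t1] - series[t0]))
    return filled

# Hypothetical data: node2 is missing the sample at t=60.
raw = {"node1": {0: 10, 60: 12, 120: 14}, "node2": {0: 11, 120: 15}}

aligned = common_timestamps(raw)
filled = interpolate(raw["node2"], [0, 60, 120])
print(aligned)
print(filled)
```

Truncation and omission discard data but never invent any; interpolation keeps the series length at the cost of synthesizing points.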

20
Applications
In production
Hadoop clusters
# Tasks completed
Datastores
Slow writes
Micro-services
CPU utilization

21
Applications (contd.)
Hadoop
Big elephant in the room
Multiple clusters
Thousands of nodes
Slow nodes -> Long job completion times
Potential business impact
Finding laggards in a Hadoop cluster

22
Rank nodes in decreasing order of “slowness”
Long history window
Compute heavy ad-hoc jobs may skew analysis
Burst in new jobs
Report the slowest nodes via e-mail
Subject to an input sensitivity parameter
Specified by the SRE
Finding laggards in a Hadoop cluster
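
Ranking nodes by "slowness" might look like the sketch below. The slides do not define the sensitivity parameter, so its meaning here is an assumption: a node is reported when its median number of tasks completed falls below `sensitivity` times the cluster-wide median. The function name and the sample data are hypothetical as well.

```python
from statistics import median

def rank_laggards(tasks_by_node, sensitivity=0.9):
    """Return (node, median tasks) pairs for slow nodes, slowest first."""
    # Median over a long history window damps bursts from ad-hoc jobs.
    typical = {node: median(v) for node, v in tasks_by_node.items()}
    cluster_typical = median(typical.values())
    laggards = [(node, t) for node, t in typical.items()
                if t < sensitivity * cluster_typical]
    return sorted(laggards, key=lambda item: item[1])

# Hypothetical tasks completed per interval; node1 has a one-off burst.
history = {
    "node1": [100, 102, 98, 500, 101],
    "node2": [99, 100, 101, 100, 98],
    "node3": [60, 62, 58, 61, 59],
    "node4": [80, 81, 79, 78, 82],
    "node5": [101, 99, 100, 102, 100],
}

print(rank_laggards(history, sensitivity=0.9))
print(rank_laggards(history, sensitivity=0.7))
```

Lowering the sensitivity value shortens the report to only the most severe laggards, which is one way an SRE could tune the volume of e-mail.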

23
Similar workload across nodes
Which band should be classified as anomalous?
Load imbalance
Finding multiple bands

24
Disparate hardware
Load imbalance
Finding heterogeneity

25
Limitations
New nodes coming online
Nodes dying
Open problems

26
Next steps
Support for finer granularity data
Minutely
Sub-minutely

27
Next steps (contd.)
Multi-class classification
Presence of multiple bands

28
Next steps (contd.)
Cloud and Containers

29
For the curious
Similarity measures
Match count based sequence similarity
Normalized length of Longest Common Subsequence (LCS)
Clustering
k-Means, Phased k-means, k-Medoids
EM
Single-linkage clustering
One-class SVM
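
As one example of the similarity measures listed above, the normalized LCS length can be computed with the standard dynamic program. Two assumptions are made here: the time series have already been discretized into symbols (for instance L/M/H for low/medium/high), and the LCS length is normalized by the length of the longer sequence. Other normalizations are in use.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def normalized_lcs(a, b):
    """LCS length divided by the longer length (an assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))

# Hypothetical discretized series: L = low, M = medium, H = high.
print(normalized_lcs("LLMHL", "LLMHL"))
print(normalized_lcs("LLMH", "HHHH"))
```

A value near 1 means the two nodes follow nearly the same symbol sequence; a value near 0 means they share little structure.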

30
Feedback
@arun_kejariwal
