Arun Kejariwal
(@arun_kejariwal)
Finding Bad Apples Early: Minimizing Performance Impact
Real-time information consumption
> 500M Tweets per day
> 300 hrs of video uploaded to YouTube every minute
69% Y/Y Growth in Data Traffic
Mobile E-commerce
Mobile Trading
Monitoring
Availability, Performance, Reliability
Large number of micro services
Large number of clusters
Multiple data centers
Millions of time series
Data fidelity
Anomaly Detection
Anomaly Detection: Why bother?
Performance regression
Potentially user impacting
Sources
Software
Inefficiency
Hardware
Slow
Failure
Visual Analysis
Anomaly Detection
Algorithms
> 50 yrs of Research
No model fits all
Time series analysis
Univariate
Multivariate
Frequency domain analysis
Fourier transform
Wavelet transform
Clustering
K-Means
Flavors of Anomaly Detection
Overview
Univariate Time Series
# Clicks
Multivariate Time Series
CPU, Memory
Set of Time Series
# Tasks Completed
Host level
Host level (contd.)
Contextual Application Topology Map
Hierarchical
Datacenter → Applications → Services → Hosts
•  Automatically discover the Developer / Architect’s view of the application, for the Operations team
-  Framework for system config and context
•  Real-time, streaming architecture
-  Keeps up with today’s elastic infrastructure
•  Scale to 1000s of hosts, 100s of (micro) services
•  Present evolution of system state over time
-  DVR-like replay of health, system changes, failures
Evolving Needs of Modern Operations
Large clusters
“Slow” nodes
Failed Nodes
Visual analysis?
Our Approach
Algorithmic: Simple, fast and effective
Automated
Scalable to millions of time series
The Approach
Bird’s Eye View
Input
Set of time series
Nodes of the same cluster
Same metric, e.g., CPU utilization
Output
Anomalous time series
Classification
The Approach (contd.)
Metrics of Interest
Requests
Tasks completed
Latency
p99
p999
CPU
Bytes out (Kafka)
rx_bytes
TCP retransmits
iowait, iowait_max
Examples
The Approach (contd.)
Robust against anomalies
Average
Median
Examples
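A minimal sketch of why the median is the more robust summary. The sample values are hypothetical CPU utilization readings, not data from the talk; a single spike pulls the average up but leaves the median unchanged.

```python
from statistics import mean, median

# Hypothetical CPU utilization samples (%) for one node; the last value is a spike.
samples = [41, 43, 42, 44, 40, 42, 97]

avg = mean(samples)    # dragged upward by the single spike
med = median(samples)  # stays at the typical level

print("average:", round(avg, 1))
print("median: ", med)
```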
The Approach (contd.)
Robust against irregular patterns
Examples
The Approach (contd.)
Parameters
6 hr
1 day
Historical window width
The Approach (contd.)
Parameters
Large
High precision
Low recall
Small
Low precision
High recall
Classification threshold
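The trade-off can be illustrated with a small sketch. The per-node scores and the set of truly slow nodes below are hypothetical, chosen only to show how moving the threshold trades precision against recall.

```python
# Hypothetical anomaly scores per node (higher means more unlike the rest of the cluster).
scores = {"node-a": 1.0, "node-b": 1.1, "node-c": 1.6, "node-d": 2.4, "node-e": 3.9}

# Hypothetical ground truth, used only to evaluate the thresholds.
truly_slow = {"node-d", "node-e"}

def precision_recall(threshold):
    """Return (precision, recall) for nodes whose score exceeds the threshold."""
    flagged = {node for node, score in scores.items() if score > threshold}
    true_positives = len(flagged & truly_slow)
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / len(truly_slow)
    return precision, recall

for threshold in (1.05, 2.0, 3.0):
    print(threshold, precision_recall(threshold))
```

With these made-up numbers, the large threshold flags only node-e (no false alarms, but node-d is missed), while the small threshold catches both slow nodes at the cost of flagging two healthy ones.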
The Approach (contd.)
Compute rolling median time series
Compute pairwise distance matrix between time series
Distance measure
Euclidean
Minkowski
Compute vector of sum of pairwise distances
Classify the time series which satisfy the following as anomalous:
Sum(Pairwise Distances)/Median(Pairwise Distances) > threshold
Outline
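The outline above can be sketched as follows. This is one possible reading, not the production implementation: it assumes equal-length series, a trailing rolling median, Euclidean distance, and that the denominator of the rule is the median of the per-node distance sums. The function names, window, threshold, and sample data are all hypothetical.

```python
import math
from statistics import median

def rolling_median(series, window):
    """Trailing rolling median; the first points use a shorter window."""
    return [median(series[max(0, i - window + 1): i + 1]) for i in range(len(series))]

def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_anomalous(series_by_node, window=3, threshold=2.0):
    """Flag nodes whose summed distance to all other nodes is large
    relative to the median summed distance across the cluster."""
    names = list(series_by_node)
    smoothed = [rolling_median(series_by_node[n], window) for n in names]
    # One entry per node: sum of its distances to every node (self-distance is 0).
    sums = [sum(euclidean(s, t) for t in smoothed) for s in smoothed]
    baseline = median(sums)
    if baseline == 0:  # all series identical after smoothing
        return []
    return [n for n, s in zip(names, sums) if s / baseline > threshold]

# Hypothetical CPU utilization (%) for five nodes of the same cluster.
nodes = {
    "node-a": [50, 51, 49, 50, 52, 50],
    "node-b": [49, 50, 50, 51, 50, 49],
    "node-c": [51, 50, 52, 50, 49, 51],
    "node-d": [50, 49, 51, 50, 50, 50],
    "node-e": [90, 92, 91, 93, 90, 91],
}
print(find_anomalous(nodes))  # ['node-e']
```

The pairwise step is quadratic in the number of nodes per cluster, which is usually acceptable because the comparison is done within one cluster and one metric at a time.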
The Approach (contd.)
Inconsistent data
Time series of unequal length
Data collection issue
Cold start issue
How to address?
Truncate at right
Omit inconsistent timestamps
Interpolate
Large number of anomalies
Non-uniform underlying trend across the time series
Edge cases
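The three remedies for unequal-length series can be sketched as below. The helper names and data layout (lists for truncation, timestamp-to-value dicts for the other two) are assumptions made for illustration; the interpolation shown is plain linear interpolation with flat extension at the ends.

```python
def truncate_right(series_list):
    """Cut every series to the length of the shortest one."""
    n = min(len(s) for s in series_list)
    return [s[:n] for s in series_list]

def omit_inconsistent(series_by_node):
    """Keep only timestamps that are present in every node's series."""
    shared = sorted(set.intersection(*(set(s) for s in series_by_node.values())))
    return {node: [s[t] for t in shared] for node, s in series_by_node.items()}

def interpolate(series, timestamps):
    """Fill missing timestamps linearly; hold the nearest value at the edges.
    Assumes the series has at least one observation."""
    known = sorted(series)
    filled = []
    for t in timestamps:
        if t in series:
            filled.append(series[t])
            continue
        before = [k for k in known if k < t]
        after = [k for k in known if k > t]
        if before and after:
            t0, t1 = before[-1], after[0]
            frac = (t - t0) / (t1 - t0)
            filled.append(series[t0] + frac * (series[t1] - series[t0]))
        else:
            filled.append(series[before[-1]] if before else series[after[0]])
    return filled

print(truncate_right([[1, 2, 3], [4, 5]]))
print(omit_inconsistent({"a": {0: 1, 1: 2, 2: 3}, "b": {0: 4, 2: 6}}))
print(interpolate({0: 10, 2: 20}, [0, 1, 2, 3]))
```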
Applications
In production
Hadoop clusters
# Tasks completed
Datastores
Slow writes
Micro-services
CPU utilization
Applications (contd.)
Hadoop
Big elephant in the room
Multiple clusters
Thousands of nodes
Slow nodes -> Long job completion times
Potential business impact
Finding laggards in a Hadoop cluster
Applications (contd.)
Rank nodes in decreasing order of “slowness”
Long history window
Compute heavy ad-hoc jobs may skew analysis
Burst in new jobs
Report the slowest nodes via e-mail
Subject to an input sensitivity parameter
Specified by the SRE
Finding laggards in a Hadoop cluster
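A rough sketch of the ranking step, under stated assumptions: slowness is judged by the median number of tasks completed per interval over the history window, and the sensitivity parameter is taken to be a fraction of the cluster-wide median. Both choices, the function name, and the data are hypothetical.

```python
from statistics import median

def rank_laggards(tasks_by_node, sensitivity=0.8):
    """Return nodes ordered slowest first, keeping only those whose typical
    throughput falls below sensitivity * the cluster's typical throughput."""
    typical = {node: median(counts) for node, counts in tasks_by_node.items()}
    cluster_typical = median(typical.values())
    slowest_first = sorted(typical, key=typical.get)  # fewest tasks completed first
    return [n for n in slowest_first if typical[n] < sensitivity * cluster_typical]

# Hypothetical tasks completed per interval; n1 has a burst and n4 a dip,
# which the per-node median largely ignores.
tasks = {
    "n1": [100, 102, 98, 250],
    "n2": [99, 101, 100, 100],
    "n3": [60, 62, 61, 59],
    "n4": [80, 79, 10, 81],
    "n5": [101, 99, 100, 102],
}
print(rank_laggards(tasks))                   # ['n3', 'n4']
print(rank_laggards(tasks, sensitivity=0.7))  # ['n3']
```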
Applications (contd.)
Similar workload across nodes
Which band should be classified as anomalous?
Load imbalance
Finding multiple bands
Applications (contd.)
Disparate hardware
Load imbalance
Finding heterogeneity
Limitations
New nodes coming online
Nodes dying
Open problems
Next steps
Support for finer granularity data
Minutely
Sub-minutely
Next steps (contd.)
Multi-class classification
Presence of multiple bands
Next steps (contd.)
Cloud and Containers
For the curious
Similarity measures
Match count based sequence similarity
Normalized length of Longest Common Subsequence (LCS)
Clustering
k-Means, Phased k-means, k-Medoids
EM
Single-linkage clustering
One-class SVM
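As an illustration of one of the similarity measures listed, here is a sketch of a normalized LCS length. It assumes the time series have already been discretized into symbols, and it normalizes by the longer sequence's length; other normalizations exist, so treat that choice as an assumption.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, using row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def normalized_lcs(a, b):
    """LCS length divided by the longer sequence's length (1.0 means identical)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0

# Hypothetical discretized load levels for two nodes.
node_x = ["low", "low", "high", "low"]
node_y = ["low", "high", "high", "low"]
print(normalized_lcs(node_x, node_y))  # 0.75
```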
Feedback
@arun_kejariwal