The big data era is characterized by the ever-increasing velocity and volume of data. To store and analyze this growing data, the operational footprint of data stores and Hadoop has also grown over time. (According to a recent IDC report, spending on big data infrastructure is expected to reach $41.5 billion by 2018.) Clusters now comprise several thousand nodes, and their high performance is vital for delivering the best user experience and keeping teams productive.
The performance of such clusters is often limited by slow/bad nodes. Finding slow nodes in large clusters is akin to finding a needle in a haystack; hence, manual identification of slow/bad nodes is not practical. To this end, we developed a novel statistical technique to automatically detect slow/bad nodes in clusters comprising hundreds to thousands of nodes. We modeled the problem as a classification problem and employed a simple, yet very effective, distance measure to determine slow/bad nodes. The key highlights of the proposed technique are the following:
# Robustness against anomalies (for example, those caused by an ad-hoc heavyweight job on a Hadoop cluster)
# A parameterized classification threshold, since no one model fits the varying data characteristics of different services
The proposed technique works well with both hourly and daily data and is used in production by multiple services. It has not only eliminated manual investigation, but has also mitigated the impact of slow nodes, which previously went undetected for weeks or months.
We shall walk the audience through how the technique is used with real data.
2. 2
Real-time information consumption
> 500M Tweets per day
>300 hrs video uploaded to YouTube every minute
69% Y/Y Growth in Data Traffic
Mobile E-commerce
Mobile Trading
6. 6
Anomaly Detection
Algorithms
> 50 yrs of Research
No model fits all
Time series analysis
Univariate
Multivariate
Frequency domain analysis
Fourier transform
Wavelet transform
Clustering
K-Means
7. 7
Flavors of Anomaly Detection
Overview
Univariate Time Series
# Clicks
Multivariate Time Series
CPU, Memory
Set of Time Series
# Tasks Completed
9. 9
Host level (contd.)
Contextual Application Topology Map
Hierarchical
Datacenter -> Applications -> Services -> Hosts
• Automatically discover Developer / Architect’s view of the application - for the Operations team
- Framework for system config and context
• Real-time, streaming architecture
- Keeps up with today’s elastic infrastructure
• Scale to 1000s of hosts, 100s of (micro) services
• Present evolution of system state over time
- DVR-like replay of health, system changes, failures
Evolving Needs of Modern Operations
12. 12
The Approach
Bird’s Eye View
Input
Set of time series
Nodes of the same cluster
Same metric, e.g., CPU utilization
Output
Anomalous time series
Classification
13. 13
The Approach (contd.)
Metrics of Interest
Requests
Tasks completed
Latency
p99
p999
CPU
Bytes out (Kafka)
rx_bytes
TCP retransmits
iowait, iowait_max
Examples
18. 18
The Approach (contd.)
Compute rolling median time series
Compute pairwise distance matrix between time series
Distance measure
Euclidean
Minkowski
Compute vector of sum of pairwise distances
Classify the time series which satisfy the following as anomalous:
Sum(Pairwise Distances)/Median(Pairwise Distances) > threshold
Outline
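
The steps on this slide can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the production implementation: the function names, the trailing window, the defaults `window=3` and `threshold=2.0`, and the choice to divide each node's summed distance by the median of all nodes' summed distances are assumptions made for the example.

```python
from statistics import median


def rolling_median(series, window):
    """Trailing rolling median; the first points use a shorter window."""
    return [median(series[max(0, i - window + 1): i + 1])
            for i in range(len(series))]


def minkowski(a, b, p=2):
    """Minkowski distance of order p (p=2 gives the Euclidean distance)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def find_anomalous(series_by_node, window=3, threshold=2.0, p=2):
    """Return nodes whose summed distance to their peers is unusually large."""
    smoothed = {n: rolling_median(s, window) for n, s in series_by_node.items()}
    nodes = list(smoothed)
    # Row sums of the pairwise distance matrix: one value per node.
    dist_sum = {
        a: sum(minkowski(smoothed[a], smoothed[b], p) for b in nodes if b != a)
        for a in nodes
    }
    # Assumption: normalize by the median of the per-node sums.
    med = median(dist_sum.values())
    if med == 0:
        return []  # all series identical after smoothing
    return [n for n in nodes if dist_sum[n] / med > threshold]


# Hypothetical CPU utilization samples for five nodes of one cluster.
cpu = {
    "n1": [10, 11, 10, 12, 11, 10],
    "n2": [11, 10, 11, 11, 10, 11],
    "n3": [10, 10, 12, 11, 11, 10],
    "n4": [11, 12, 10, 10, 11, 11],
    "n5": [50, 52, 51, 53, 50, 52],
}
print(find_anomalous(cpu))
```

The rolling median is what gives the sketch some robustness to short spikes, and the ratio to the median makes the threshold independent of the metric's scale, so the same parameter can be tuned per service.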
19. 19
The Approach (contd.)
Inconsistent data
Time series of unequal length
Data collection issue
Cold start issue
How to address?
Truncate at right
Omit inconsistent timestamps
Interpolate
Large number of anomalies
Non-uniform underlying trend across the time series
Edge cases
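
The three remedies for series of unequal length listed on this slide might look like the following in plain Python. This is a hedged sketch: the helper names are hypothetical, series are assumed to be either plain lists (for truncation) or `{timestamp: value}` dicts (for the other two), and `interpolate` assumes each series has at least one known point.

```python
def truncate_right(series_by_node):
    """Cut every series to the length of the shortest one."""
    n = min(len(s) for s in series_by_node.values())
    return {node: s[:n] for node, s in series_by_node.items()}


def common_timestamps(points_by_node):
    """Omit inconsistent timestamps: keep only those every node reported."""
    shared = set.intersection(*(set(p) for p in points_by_node.values()))
    ordered = sorted(shared)
    return {node: [p[t] for t in ordered] for node, p in points_by_node.items()}


def interpolate(points, timestamps):
    """Fill missing timestamps linearly; edges take the nearest known value."""
    known = sorted(points)
    out = []
    for t in timestamps:
        if t in points:
            out.append(points[t])
            continue
        before = [k for k in known if k < t]
        after = [k for k in known if k > t]
        if before and after:
            t0, t1 = before[-1], after[0]
            frac = (t - t0) / (t1 - t0)
            out.append(points[t0] + frac * (points[t1] - points[t0]))
        else:
            # No neighbor on one side (e.g. cold start): reuse nearest value.
            out.append(points[before[-1]] if before else points[after[0]])
    return out
```

Truncation and timestamp omission discard data but never invent it; interpolation keeps every timestamp at the cost of synthetic points, which may matter when a node's gap is itself the symptom.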
21. 21
Applications (contd.)
Hadoop
Big elephant in the room
Multiple clusters
Thousands of nodes
Slow nodes -> Long job completion times
Potential business impact
Finding laggards in a Hadoop cluster
22. 22
Applications (contd.)
Rank nodes in decreasing order of “slowness”
Long history window
Compute heavy ad-hoc jobs may skew analysis
Burst in new jobs
Report the slowest nodes via e-mail
Subject to an input sensitivity parameter
Specified by the SRE
Finding laggards in a Hadoop cluster
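
The ranking and reporting step could be sketched as below. The slides do not say how the sensitivity parameter is applied, so treating it as a minimum "slowness" score is an assumption, as are the function names, the score values, the cluster name, and the recipient address. The message is only composed, not sent.

```python
from email.message import EmailMessage


def rank_slowest(scores, sensitivity):
    """Nodes whose score exceeds `sensitivity`, slowest first."""
    flagged = [(node, s) for node, s in scores.items() if s > sensitivity]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


def build_report(cluster, ranked, recipient):
    """Compose (but do not send) an e-mail listing the slowest nodes."""
    msg = EmailMessage()
    msg["Subject"] = f"[{cluster}] {len(ranked)} slow node(s) detected"
    msg["To"] = recipient
    lines = [f"{i}. {node}  score={score:.2f}"
             for i, (node, score) in enumerate(ranked, 1)]
    msg.set_content("\n".join(lines) or "No slow nodes detected.")
    return msg


# Hypothetical per-node slowness scores and SRE-chosen sensitivity.
scores = {"n1": 1.0, "n2": 3.5, "n3": 2.4, "n4": 0.9}
ranked = rank_slowest(scores, sensitivity=2.0)
report = build_report("hadoop-a", ranked, "sre@example.com")
print(report["Subject"])
print(report.get_content())
```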
29. 29
For the curious
Similarity measures
Match count based sequence similarity
Normalized length of Longest Common Subsequence (LCS)
Clustering
k-Means, Phased k-means, k-Medoids
EM
Single-linkage clustering
One-class SVM
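
For readers who want to try one of the alternative similarity measures, here is a small sketch of a normalized LCS for numeric series. The slide does not define the matching rule or the normalization, so both are assumptions here: two samples "match" when they differ by at most a hypothetical tolerance `eps`, and the LCS length is divided by the length of the shorter series.

```python
def lcs_similarity(a, b, eps=0.5):
    """Normalized LCS length between two numeric series, in [0, 1]."""
    if not a or not b:
        return 0.0
    # Standard LCS dynamic program, keeping only the previous row.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if abs(x - y) <= eps:
                curr.append(prev[j - 1] + 1)   # samples match within eps
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    # Assumption: normalize by the shorter series length.
    return prev[-1] / min(len(a), len(b))
```

A distance for the pairwise matrix could then be taken as `1 - lcs_similarity(a, b)`; unlike the Euclidean distance, this tolerates series of unequal length without truncation.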