The big data era is characterized by ever-increasing velocity and volume of data. Over the last two or three years, several talks at Velocity have explored how to analyze operations data at scale, focusing on anomaly detection, performance analysis, and capacity planning, to name a few topics. Knowledge sharing of the techniques for the aforementioned problems helps the community to build highly available, performant, and resilient systems.
A key aspect of operations data is that data may be missing—referred to as “holes”—in the time series. This may happen for a wide variety of reasons, including (but not limited to):
• Packets being dropped due to unresponsive downstream services
• A network hiccup
• Transient hardware or software failure
• An issue with the data collection service
“Holes” in a time series can skew downstream analysis, which in turn can materially impact decision making. Arun Kejariwal presents approaches for analyzing operations data in the presence of such holes. He highlights how missing data impacts common analyses such as anomaly detection and forecasting, discusses the implications of missing data for time series of different granularities (e.g., minutely and hourly), and explores a gamut of techniques that can be used to address the issue, such as approximating the missing data via interpolation, regression, or ensemble methods. Arun then walks you through how these techniques can be leveraged on real data.
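For example, a minimal sketch of the interpolation approach mentioned above, assuming pandas (illustrative only; the talk does not prescribe a specific library):

```python
import numpy as np
import pandas as pd

# A minutely series with a two-point "hole"
idx = pd.date_range("2016-01-01", periods=6, freq="min")
ts = pd.Series([10.0, 12.0, np.nan, np.nan, 18.0, 20.0], index=idx)

# Linear interpolation approximates the missing points
# from their observed neighbors (12 and 18 here).
filled = ts.interpolate()
print(filled.tolist())
```

Interpolation is only reasonable when the series is smooth at the gap's timescale; for longer outages or bursty metrics, the regression and ensemble approaches mentioned above are more appropriate.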
2. WHAT’S UP WITH THE TITLE?
— Metrics Arms Race —
• “… scaling it to about two million distinct time series …” (Netflix)
• “… highly accurate, real-time alerts on millions of system and business metrics …” (Uber)
• “As we have hundreds of systems exposing multiple data items, the write data rate might easily exceed tens of millions of data points each second.” (Facebook)
• “… we are taking tens of billions of measurements …” (Google)
• “The Observability stack collects 170 million individual data metrics (time series) …” (Twitter)
• “… serving over 50 million distinct time series.” (Spotify)
3. WHAT’S UP WITH THE TITLE?
— Metrics Arms Race —
• >95% of metrics data is NEVER read!!
• Legacy instrumentation
• Lack of understanding of how to use metrics
  Latency: Mean; 10p, 20p, …, 90p; 95p, 99p, 99.9p
• Retention: “Hard disk is cheap”
4. WHAT’S UP WITH THE TITLE?
— Inspiration —
“Rime of the Ancient Mariner”
By Samuel Taylor Coleridge
Water, water, every where,
Nor any drop to drink.
— Rooted in 1798! —
Image Source: https://ebooks.adelaide.edu.au/c/coleridge/samuel_taylor/rime/
5. DATA EXPLOSION
— Relation to Big/Fast Data —
Growing number of data sources
• Mobile (Smartphones, Tablets, Smart Watches)
• IoT
• Wearables
Data collection has become a commodity
• Millions of time series
DATA, DATA, EVERYWHERE
7. DATA, DATA, EVERYWHERE
— Time to Market, sinking ships – Can get “lonely” —
Image Source: https://ebooks.adelaide.edu.au/c/coleridge/samuel_taylor/rime/
8. DATA, DATA, EVERYWHERE
— Value-driving analytics, NOT pristine data sets, interesting patterns or killer algorithms —
1896 Olympics, Greece: Thomas Burke
Purpose and ability to act
Applying hard metrics and asking hard questions
“Advanced data analytics is a quintessential business matter.” [1]
[1] http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/making-data-analytics-work-for-you-instead-of-the-other-way-around, McKinsey, October 2016.
9. DATA, DATA, EVERYWHERE
— Purpose and Action —
BREAK IT DOWN
“The impact of ‘big data’ analytics is often manifested by thousands—or more—of incrementally small improvements. If an organization can atomize a single process into its smallest parts and implement advances where possible, the payoffs can be profound.” [1]
[1] http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/making-data-analytics-work-for-you-instead-of-the-other-way-around, McKinsey, October 2016.
10. ITERATE
“Victory often resulted from the way decisions are made; the side that reacts to situations more quickly and processes new information more accurately should prevail.”
11. ANOMALY DETECTION
— >100 Years of History —
Bessel and Baeyer 1838, Chauvenet 1863, Stone 1868, Wright 1884, Irwin 1925, Student 1927, Thompson 1935
“… present and exact rule for the rejection of observations, which shall be legitimately derived from the fundamental principles of Calculus of Probabilities.”
18. ANOMALY DETECTION
— Operations —
Increasing focus of monitoring solutions (RUM/Synthetic): DataDog, Catchpoint, Opsclarity, Ruxit
Netflix, Airbnb, Cloudera
EGADS: A Scalable, Configurable, and Novel Anomaly Detection System [1]
Introducing Practical and Robust Anomaly Detection in a Time Series [2]
Identifying Outages with Argos, Uber Engineering’s Real-Time Monitoring and Root-Cause Exploration Tool [3]
Robust Anomaly Detection System for Real User Monitoring Data [4]
[1] https://research.yahoo.com/news/announcing-open-source-egads-scalable-configurable-and-novel-anomaly-detection-system, 2015 (https://github.com/yahoo/egads)
[2] https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series, 2015 (https://github.com/twitter/AnomalyDetection)
[3] https://eng.uber.com/argos/, 2015
[4] http://bit.ly/luminol-velocity, 2016 (https://github.com/linkedin/luminol)
19. ANOMALY DETECTION
— As a Service on the Cloud —
Azure [1], AWS [2], IBM Cloud [3]
[1] https://azure.microsoft.com/en-us/documentation/articles/machine-learning-apps-anomaly-detection
[2] https://aws.amazon.com/blogs/iot/anomaly-detection-using-aws-iot-and-aws-lambda
[3] https://developer.ibm.com/recipes/tutorials/engage-machine-learning-for-detecting-anomalous-behaviors-of-things
22. ANOMALY DETECTION
— Distinct Approaches —
Time Domain vs. Frequency Domain
• Time domain: time series analysis; statistical tests, clustering
• Frequency domain: Fast Fourier Transform, DWT; fractals, filters (Kalman Filter)
Real-time implications: frequency-based techniques may incur additional overhead
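As one illustrative time-domain technique (our example, not from the talk; see also “Alternatives to the Median Absolute Deviation” in the readings), a simple detector based on the median absolute deviation, which is robust to the very anomalies it is hunting for:

```python
import numpy as np

def mad_anomalies(x, threshold=3.5):
    """Flag points whose modified z-score exceeds the threshold.

    The median and MAD replace the mean and standard deviation,
    so a single large spike cannot inflate the dispersion estimate
    and mask itself (as it would with a plain z-score).
    """
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:  # constant series: nothing to flag
        return np.zeros(len(x), dtype=bool)
    # 0.6745 makes the MAD consistent with the std dev for normal data
    modified_z = 0.6745 * (x - med) / mad
    return np.abs(modified_z) > threshold

series = [10, 11, 10, 12, 11, 10, 95, 11, 10]
print(mad_anomalies(series))  # only the spike (value 95) is flagged
```

A frequency-domain counterpart would instead transform the window (e.g., via FFT or a wavelet transform) and flag residual energy, at the extra cost noted above.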
23. ANOMALY DETECTION
— Distinct Approaches —
Unsupervised vs. Supervised
• Unsupervised: Clustering (the common case in Operations)
• Supervised: Decision Trees, SVM, Neural Networks (higher accuracy; bias-variance tradeoff)
24. ANOMALY DETECTION
— Too Many Alerts —
Complex architectures: hundreds of microservices
False positives/negatives vs. true positives:
• Loss of productivity (TTD, TTR)
• Impact on end-user experience
26. MISSING DATA
— Often Overlooked During Analysis —
Sources
• Data collection issues (network hiccups, bursty traffic, queue overflow, packet loss)
• System failures (bugs, hardware), with cascading effects
Makes analysis non-trivial
• Methods are in general not prepared to handle missing values
Loss of efficiency
• Fewer patterns extracted from data
• Conclusions statistically less strong
Inference bias
• Resulting from differences between missing and complete data
Larger standard error
• Reduced sample size
29. MISSING DATA
— Characterization of Missing Values —
• Missing Completely At Random (MCAR): missingness does not depend on either the observed or the missing data. The cause is random, uncorrelated with the variables of interest.
• Missing At Random (MAR): missingness depends on the observed data, but not on the missing data. The most common assumption. Factors: correlation between the cause of missingness and the variables of interest.
• Missing Not At Random (MNAR): missingness depends on the missing/unobserved (latent) values, i.e., on the variable of interest itself. The common case!
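To make the three mechanisms concrete, a small simulation sketch (our example; the variable names `latency` and `load` are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
latency = rng.normal(100, 10, size=1000)  # the variable of interest
load = rng.uniform(0, 1, size=1000)       # a second, fully observed variable

# MCAR: every point has the same 10% chance of being missing,
# independent of any value in the data.
mcar_mask = rng.random(1000) < 0.10

# MAR: missingness depends only on the *observed* load
# (e.g., the collector drops samples under high load).
mar_mask = rng.random(1000) < np.where(load > 0.8, 0.5, 0.02)

# MNAR: missingness depends on the unobserved latency itself
# (e.g., extreme latencies never get recorded at all).
mnar_mask = latency > np.quantile(latency, 0.95)

print(mcar_mask.mean(), mar_mask.mean(), mnar_mask.mean())
```

The practical consequence: under MCAR, dropping holes merely costs efficiency; under MAR, methods must condition on the observed covariates; under MNAR, the missingness mechanism itself must be modeled or the analysis is biased.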
30. MISSING DATA
— Taxonomy of Methods —
• Completely recorded units: discard variables with missing data; subset the data to ignore missing data
• Weighting procedures: differentially weigh the complete cases to adjust for bias; weights are a function of response probability
• Imputation based: assign a value to each missing one; leverage existing ML/data mining methods
• Model based: define a model for the observed data; inference based on the likelihood or posterior distribution under the model
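As an illustrative sketch of the first and third families in the taxonomy above (our example, assuming pandas; the column names are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cpu": [0.4, 0.5, np.nan, 0.7, 0.6],
    "rps": [120, 130, 125, np.nan, 140],
})

# Completely recorded units: keep only the rows with no holes.
complete_cases = df.dropna()

# Imputation based: fill each hole, here with the column median
# (a deliberately simple single-imputation choice).
imputed = df.fillna(df.median())

print(len(complete_cases))          # complete rows remaining
print(imputed.isna().sum().sum())   # holes remaining after imputation
```

Complete-case analysis shrinks the sample (here from 5 rows to 3) and is only unbiased under MCAR; simple imputation keeps the sample size but, as the next slides argue, understates uncertainty.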
32. MISSING VALUES
— Imputation Methods —
Resampling methods: jackknifing (Quenouille, 1949) and bootstrapping (Efron, 1979); based on large-sample theory
Multiple Imputation (MI): based on Bayesian theory; provides useful inferences for small samples
• Replace each missing value by a vector of D ≥ 2 imputed values
• Single imputation cannot reflect sampling variability under one model for missing data, or uncertainty about the correct model for missing data
• The D complete-data inferences are combined to form one inference that reflects uncertainty due to missing data under that model
• Throws light on the sensitivity of inference to the model for missing data
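A deliberately simplified sketch of the MI idea (our example): draw D imputations, compute the estimate on each completed data set, then pool. Real MI draws the imputations from a posterior predictive distribution and pools variances via Rubin's rules; here we just draw from the observed values to illustrate the mechanics:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(50, 5, 200)
x[rng.random(200) < 0.2] = np.nan  # punch ~20% MCAR holes

obs = x[~np.isnan(x)]
D = 5
estimates = []
for d in range(D):
    filled = x.copy()
    n_miss = int(np.isnan(filled).sum())
    # Each imputation is a fresh random draw, so the D completed
    # data sets differ, capturing imputation uncertainty.
    filled[np.isnan(filled)] = rng.choice(obs, size=n_miss)
    estimates.append(filled.mean())

pooled_mean = np.mean(estimates)         # combined point estimate
between_var = np.var(estimates, ddof=1)  # between-imputation variance
print(round(pooled_mean, 1))
```

The between-imputation variance is precisely what single imputation throws away: with D = 1 it cannot even be computed.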
33. MISSING VALUES
— Imputation Methods —
Multivariate Imputation
MICE: Multivariate Imputation by Chained Equations (aka Sequential Regression Multiple Imputation)
• Assumes the missing data is MAR
• Each variable with missing data is modeled conditional upon the other variables in the data
• Multivariate regression: linear, logistic, Poisson
• Use of auxiliary variables: not used in the analysis, but predictive of missingness; can improve imputations
Bayesian networks: learn a Bayesian network using complete data and use it to simultaneously impute all missing values via abductive inference; incremental imputation
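To make the chained-equations idea concrete, a minimal sketch using scikit-learn's IterativeImputer (our choice of library, not the talk's), which regresses each feature with holes on the remaining features in round-robin fashion, in the spirit of MICE:

```python
import numpy as np
# IterativeImputer is still experimental in scikit-learn and must be
# explicitly enabled before import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data where the second column is roughly 2x the first.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, np.nan],
    [4.0, 8.0],
    [np.nan, 10.0],
])

# Each column with holes is modeled conditional on the other column,
# iterating until the imputations stabilize.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Note that a single IterativeImputer pass yields one completed data set; to get proper multiple imputation one would run it D times with `sample_posterior=True` and different seeds, then pool as on the previous slide.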
34. MISSING VALUES
— Model-Based Methods —
Expectation-Maximization (EM): Hartley 1958; Orchard and Woodbury 1972; Dempster et al. 1977
• Converges reliably to a local maximum or a saddle point
• Slow to converge in the presence of a large number of missing values
• M step can be difficult (e.g., has no closed form)
Variants:
• Expectation/Conditional Maximization (ECM): two or more conditional (on parameters) maximization steps
• Alternating Expectation/Conditional Maximization (AECM): alternates between the complete-data and actual log-likelihood
• Parameter-Expanded EM (PX-EM): include parameters whose values are known during maximization
• Variational Bayes
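As a minimal sketch of the E/M alternation (our toy example, not from the talk): fitting a two-component Gaussian mixture, where the unobserved component labels play the role of the missing data in the Dempster et al. formulation:

```python
import numpy as np

rng = np.random.default_rng(2)
# Two well-separated clusters, centered at 0 and 5.
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 300)])

# Initial guesses for the mixture parameters.
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

for _ in range(50):
    # E step: posterior responsibility of each component for each point
    # (the shared 1/sqrt(2*pi) constant cancels in the normalization).
    dens = pi * np.exp(-0.5 * ((data[:, None] - mu) / sigma) ** 2) / sigma
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M step: closed-form weighted updates of the parameters.
    nk = resp.sum(axis=0)
    mu = (resp * data[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (data[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(data)

print(np.sort(mu).round(1))  # means converge near the true centers 0 and 5
```

Both caveats from the slide are visible here: the loop only finds a local optimum of the likelihood (a bad initialization can mislabel the clusters), and convergence slows as the fraction of unobserved information grows.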
35. READINGS
— Research Papers —
“If I have seen further, it is by standing on the shoulders of giants” (Isaac Newton)
• On Achieving Energy Efficiency and Reducing CO2 Footprint in Cloud Computing, 2015
• Fractal based Anomaly Detection Over Data Streams, 2013
• Techniques for Optimizing Cloud Footprint, 2011
• Alternatives to the Median Absolute Deviation, 1993
• Knee Point Detection on Bayesian Information Criterion, 2008
• Knee Point Search Using Cascading Top-k Search with Minimized Time Complexity, 2013
• Finding a ‘Kneedle’ in a Haystack: Detecting Knee Points in System Behavior, 2011
• NbClust R Package
36. READINGS
— Techniques —
• Saturated Correlates (“Spider”) Model
• Multiple Group SEM (Structural Equation Modeling)
• Latent Transition Analysis
• General Location Model
• Extra Dependent Variable (DV) Model
• Bayesian Principal Component Analysis
• Local Least Squares Imputation
• Full Information Maximum Likelihood (FIML)