The big data era is characterized by ever-increasing velocity and volume of data. Over the last two or three years, several talks at Velocity have explored how to analyze operations data at scale, focusing on anomaly detection, performance analysis, and capacity planning, to name a few topics. Knowledge sharing of the techniques for the aforementioned problems helps the community to build highly available, performant, and resilient systems.
A key aspect of operations data is that data may be missing—referred to as “holes”—in the time series. This may happen for a wide variety of reasons, including (but not limited to):
• Packets being dropped due to unresponsive downstream services
• A network hiccup
• Transient hardware or software failure
• An issue with the data collection service
“Holes” in a time series can skew downstream analysis, which in turn can materially impact decision making. Arun Kejariwal presents approaches for analyzing operations data in the presence of such holes. He highlights how missing data impacts common analyses such as anomaly detection and forecasting, discusses the implications of missing data for time series of different granularities (e.g., minutely and hourly), and explores a gamut of techniques that can be used to address the issue, such as approximating the missing data via interpolation, regression, or ensemble methods. Arun then walks you through how these techniques can be leveraged on real data.
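For example, a minimal sketch of the interpolation approach mentioned above, assuming pandas (illustrative only; the talk does not prescribe a specific library):

```python
import numpy as np
import pandas as pd

# A minutely series with a two-point "hole"
idx = pd.date_range("2016-01-01", periods=6, freq="min")
ts = pd.Series([10.0, 12.0, np.nan, np.nan, 18.0, 20.0], index=idx)

# Linear interpolation approximates the missing points
# from their observed neighbors (12 and 18 here).
filled = ts.interpolate()
print(filled.tolist())
```

Interpolation is only reasonable when the series is smooth at the gap's timescale; for longer outages or bursty metrics, the regression and ensemble approaches mentioned above are more appropriate.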
2. WHAT’S UP WITH THE TITLE?
— Metrics Arms Race —
• “… scaling it to about two million distinct time series …” (Netflix)
• “… highly accurate, real-time alerts on millions of system and business metrics …” (Uber)
• “As we have hundreds of systems exposing multiple data items, the write data rate might easily exceed tens of millions of data points each second.” (Facebook)
• “… we are taking tens of billions of measurements …” (Google)
• “The Observability stack collects 170 million individual data metrics (time series) …” (Twitter)
• “… serving over 50 million distinct time series.” (Spotify)
3. WHAT’S UP WITH THE TITLE?
— Metrics Arms Race —
• >95% of metrics data is NEVER read!!
• Legacy instrumentation
• Lack of understanding of how to use metrics
  Latency: Mean; 10p, 20p, …, 90p; 95p, 99p, 99.9p
• Retention: “Hard disk is cheap”
4. WHAT’S UP WITH THE TITLE?
— Inspiration —
“Rime of the Ancient Mariner”
By Samuel Taylor Coleridge
Water, water, every where,
Nor any drop to drink.
— Rooted in 1798! —
Image Source: https://ebooks.adelaide.edu.au/c/coleridge/samuel_taylor/rime/
5. DATA EXPLOSION
— Relation to Big/Fast Data —
Growing number of data sources
• Mobile (Smartphones, Tablets, Smart Watches)
• IoT
• Wearables
Data collection has become a commodity
• Millions of time series
DATA, DATA, EVERYWHERE
7. DATA, DATA, EVERYWHERE
— Time to Market, sinking ships – Can get “lonely” —
Image Source: https://ebooks.adelaide.edu.au/c/coleridge/samuel_taylor/rime/
8. DATA, DATA, EVERYWHERE
— Value-driving analytics, NOT pristine data sets, interesting patterns or killer algorithms —
1896 Olympics, Greece: Thomas Burke
Purpose and ability to act
Applying hard metrics and asking hard questions
“Advanced data analytics is a quintessential business matter.” [1]
[1] http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/making-data-analytics-work-for-you-instead-of-the-other-way-around, McKinsey, October 2016.
9. DATA, DATA, EVERYWHERE
— Purpose and Action —
BREAK IT DOWN
“The impact of ‘big data’ analytics is often manifested by thousands—or more—of incrementally small improvements. If an organization can atomize a single process into its smallest parts and implement advances where possible, the payoffs can be profound.” [1]
[1] http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/making-data-analytics-work-for-you-instead-of-the-other-way-around, McKinsey, October 2016.
10. ITERATE
“Victory often resulted from the way decisions are made; the side that reacts to situations more quickly and processes new information more accurately should prevail.”
11. ANOMALY DETECTION
— >100 Years of History —
Bessel and Baeyer 1838, Chauvenet 1863, Stone 1868, Wright 1884, Irwin 1925, Student 1927, Thompson 1935
“… present and exact rule for the rejection of observations, which shall be legitimately derived from the fundamental principles of Calculus of Probabilities.”
18. ANOMALY DETECTION
— Operations —
Increasing focus of monitoring solutions (RUM/Synthetic): DataDog, Catchpoint, Opsclarity, Ruxit
Netflix, Airbnb, Cloudera
EGADS: A Scalable, Configurable, and Novel Anomaly Detection System [1]
Introducing Practical and Robust Anomaly Detection in a Time Series [2]
Identifying Outages with Argos, Uber Engineering’s Real-Time Monitoring and Root-Cause Exploration Tool [3]
Robust Anomaly Detection System for Real User Monitoring Data [4]
[1] https://research.yahoo.com/news/announcing-open-source-egads-scalable-configurable-and-novel-anomaly-detection-system, 2015 (https://github.com/yahoo/egads)
[2] https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series, 2015 (https://github.com/twitter/AnomalyDetection)
[3] https://eng.uber.com/argos/, 2015
[4] http://bit.ly/luminol-velocity, 2016 (https://github.com/linkedin/luminol)
19. ANOMALY DETECTION
— As a Service on the Cloud —
Azure [1], AWS [2], IBM Cloud [3]
[1] https://azure.microsoft.com/en-us/documentation/articles/machine-learning-apps-anomaly-detection
[2] https://aws.amazon.com/blogs/iot/anomaly-detection-using-aws-iot-and-aws-lambda
[3] https://developer.ibm.com/recipes/tutorials/engage-machine-learning-for-detecting-anomalous-behaviors-of-things
22. ANOMALY DETECTION
— Distinct Approaches —
Time Domain vs. Frequency Domain
• Time domain: time series analysis; statistical tests, clustering
• Frequency domain: Fast Fourier Transform, DWT; fractals, filters (Kalman Filter)
Real-time implications: frequency-based techniques may incur additional overhead
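As one illustrative time-domain technique (our example, not from the talk; see also “Alternatives to the Median Absolute Deviation” in the readings), a simple detector based on the median absolute deviation, which is robust to the very anomalies it is hunting for:

```python
import numpy as np

def mad_anomalies(x, threshold=3.5):
    """Flag points whose modified z-score exceeds the threshold.

    The median and MAD replace the mean and standard deviation,
    so a single large spike cannot inflate the dispersion estimate
    and mask itself (as it would with a plain z-score).
    """
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:  # constant series: nothing to flag
        return np.zeros(len(x), dtype=bool)
    # 0.6745 makes the MAD consistent with the std dev for normal data
    modified_z = 0.6745 * (x - med) / mad
    return np.abs(modified_z) > threshold

series = [10, 11, 10, 12, 11, 10, 95, 11, 10]
print(mad_anomalies(series))  # only the spike (value 95) is flagged
```

A frequency-domain counterpart would instead transform the window (e.g., via FFT or a wavelet transform) and flag residual energy, at the extra cost noted above.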
23. ANOMALY DETECTION
— Distinct Approaches —
Unsupervised vs. Supervised
• Unsupervised: Clustering (the common case in Operations)
• Supervised: Decision Trees, SVM, Neural Networks (higher accuracy; bias-variance tradeoff)
24. ANOMALY DETECTION
— Too Many Alerts —
Complex architectures: hundreds of microservices
False positives/negatives vs. true positives:
• Loss of productivity (TTD, TTR)
• Impact on end-user experience
26. MISSING DATA
— Often Overlooked During Analysis —
Sources
• Data collection issues (network hiccups, bursty traffic, queue overflow, packet loss)
• System failures (bugs, hardware), with cascading effects
Makes analysis non-trivial
• Methods are in general not prepared to handle missing values
Loss of efficiency
• Fewer patterns extracted from data
• Conclusions statistically less strong
Inference bias
• Resulting from differences between missing and complete data
Larger standard error
• Reduced sample size
29. MISSING DATA
— Characterization of Missing Values —
• Missing Completely At Random (MCAR): missingness does not depend on either the observed or the missing data. The cause is random, uncorrelated with the variables of interest.
• Missing At Random (MAR): missingness depends on the observed data, but not on the missing data. The most common assumption. Factors: correlation between the cause of missingness and the variables of interest.
• Missing Not At Random (MNAR): missingness depends on the missing/unobserved (latent) values, i.e., on the variable of interest itself. The common case!
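To make the three mechanisms concrete, a small simulation sketch (our example; the variable names `latency` and `load` are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
latency = rng.normal(100, 10, size=1000)  # the variable of interest
load = rng.uniform(0, 1, size=1000)       # a second, fully observed variable

# MCAR: every point has the same 10% chance of being missing,
# independent of any value in the data.
mcar_mask = rng.random(1000) < 0.10

# MAR: missingness depends only on the *observed* load
# (e.g., the collector drops samples under high load).
mar_mask = rng.random(1000) < np.where(load > 0.8, 0.5, 0.02)

# MNAR: missingness depends on the unobserved latency itself
# (e.g., extreme latencies never get recorded at all).
mnar_mask = latency > np.quantile(latency, 0.95)

print(mcar_mask.mean(), mar_mask.mean(), mnar_mask.mean())
```

The practical consequence: under MCAR, dropping holes merely costs efficiency; under MAR, methods must condition on the observed covariates; under MNAR, the missingness mechanism itself must be modeled or the analysis is biased.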
30. MISSING DATA
— Taxonomy of Methods —
• Completely recorded units: discard variables with missing data; subset the data to ignore missing data
• Weighting procedures: differentially weigh the complete cases to adjust for bias; weights are a function of response probability
• Imputation based: assign a value to each missing one; leverage existing ML/data mining methods
• Model based: define a model for the observed data; inference based on the likelihood or posterior distribution under the model
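As an illustrative sketch of the first and third families in the taxonomy above (our example, assuming pandas; the column names are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cpu": [0.4, 0.5, np.nan, 0.7, 0.6],
    "rps": [120, 130, 125, np.nan, 140],
})

# Completely recorded units: keep only the rows with no holes.
complete_cases = df.dropna()

# Imputation based: fill each hole, here with the column median
# (a deliberately simple single-imputation choice).
imputed = df.fillna(df.median())

print(len(complete_cases))          # complete rows remaining
print(imputed.isna().sum().sum())   # holes remaining after imputation
```

Complete-case analysis shrinks the sample (here from 5 rows to 3) and is only unbiased under MCAR; simple imputation keeps the sample size but, as the next slides argue, understates uncertainty.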
32. MISSING VALUES
— Imputation Methods —
Resampling methods: jackknifing (Quenouille, 1949) and bootstrapping (Efron, 1979); based on large-sample theory
Multiple Imputation (MI): based on Bayesian theory; provides useful inferences for small samples
• Replace each missing value by a vector of D ≥ 2 imputed values
• Single imputation cannot reflect sampling variability under one model for missing data, or uncertainty about the correct model for missing data
• The D complete-data inferences are combined to form one inference that reflects uncertainty due to missing data under that model
• Throws light on the sensitivity of inference to the model for missing data
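A deliberately simplified sketch of the MI idea (our example): draw D imputations, compute the estimate on each completed data set, then pool. Real MI draws the imputations from a posterior predictive distribution and pools variances via Rubin's rules; here we just draw from the observed values to illustrate the mechanics:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(50, 5, 200)
x[rng.random(200) < 0.2] = np.nan  # punch ~20% MCAR holes

obs = x[~np.isnan(x)]
D = 5
estimates = []
for d in range(D):
    filled = x.copy()
    n_miss = int(np.isnan(filled).sum())
    # Each imputation is a fresh random draw, so the D completed
    # data sets differ, capturing imputation uncertainty.
    filled[np.isnan(filled)] = rng.choice(obs, size=n_miss)
    estimates.append(filled.mean())

pooled_mean = np.mean(estimates)         # combined point estimate
between_var = np.var(estimates, ddof=1)  # between-imputation variance
print(round(pooled_mean, 1))
```

The between-imputation variance is precisely what single imputation throws away: with D = 1 it cannot even be computed.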
33. MISSING VALUES
— Imputation Methods —
Multivariate Imputation
MICE: Multivariate Imputation by Chained Equations (aka Sequential Regression Multiple Imputation)
• Assumes the missing data is MAR
• Each variable with missing data is modeled conditional upon the other variables in the data
• Multivariate regression: linear, logistic, Poisson
• Use of auxiliary variables: not used in the analysis, but predictive of missingness; can improve imputations
Bayesian networks: learn a Bayesian network using complete data and use it to simultaneously impute all missing values via abductive inference; incremental imputation
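To make the chained-equations idea concrete, a minimal sketch using scikit-learn's IterativeImputer (our choice of library, not the talk's), which regresses each feature with holes on the remaining features in round-robin fashion, in the spirit of MICE:

```python
import numpy as np
# IterativeImputer is still experimental in scikit-learn and must be
# explicitly enabled before import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data where the second column is roughly 2x the first.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, np.nan],
    [4.0, 8.0],
    [np.nan, 10.0],
])

# Each column with holes is modeled conditional on the other column,
# iterating until the imputations stabilize.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Note that a single IterativeImputer pass yields one completed data set; to get proper multiple imputation one would run it D times with `sample_posterior=True` and different seeds, then pool as on the previous slide.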
34. MISSING VALUES
— Model-Based Methods —
Expectation-Maximization (EM): Hartley 1958; Orchard and Woodbury 1972; Dempster et al. 1977
• Converges reliably to a local maximum or a saddle point
• Slow to converge in the presence of a large number of missing values
• M step can be difficult (e.g., has no closed form)
Variants:
• Expectation/Conditional Maximization (ECM): two or more conditional (on parameters) maximization steps
• Alternating Expectation/Conditional Maximization (AECM): alternates between the complete-data and actual log-likelihood
• Parameter-Expanded EM (PX-EM): include parameters whose values are known during maximization
• Variational Bayes
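As a minimal sketch of the E/M alternation (our toy example, not from the talk): fitting a two-component Gaussian mixture, where the unobserved component labels play the role of the missing data in the Dempster et al. formulation:

```python
import numpy as np

rng = np.random.default_rng(2)
# Two well-separated clusters, centered at 0 and 5.
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 300)])

# Initial guesses for the mixture parameters.
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

for _ in range(50):
    # E step: posterior responsibility of each component for each point
    # (the shared 1/sqrt(2*pi) constant cancels in the normalization).
    dens = pi * np.exp(-0.5 * ((data[:, None] - mu) / sigma) ** 2) / sigma
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M step: closed-form weighted updates of the parameters.
    nk = resp.sum(axis=0)
    mu = (resp * data[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (data[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(data)

print(np.sort(mu).round(1))  # means converge near the true centers 0 and 5
```

Both caveats from the slide are visible here: the loop only finds a local optimum of the likelihood (a bad initialization can mislabel the clusters), and convergence slows as the fraction of unobserved information grows.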
35. READINGS
— Research Papers —
“If I have seen further, it is by standing on the shoulders of giants” (Isaac Newton)
• On Achieving Energy Efficiency and Reducing CO2 Footprint in Cloud Computing, 2015
• Fractal based Anomaly Detection Over Data Streams, 2013
• Techniques for Optimizing Cloud Footprint, 2011
• Alternatives to the Median Absolute Deviation, 1993
• Knee Point Detection on Bayesian Information Criterion, 2008
• Knee Point Search Using Cascading Top-k Search with Minimized Time Complexity, 2013
• Finding a ‘Kneedle’ in a Haystack: Detecting Knee Points in System Behavior, 2011
• NbClust R Package
36. READINGS
— Techniques —
• Saturated Correlates (“Spider”) Model
• Multiple Group SEM (Structural Equation Modeling)
• Latent Transition Analysis
• General Location Model
• Extra Dependent Variable (DV) Model
• Bayesian Principal Component Analysis
• Local Least Squares Imputation
• Full Information Maximum Likelihood (FIML)