Data Data Everywhere: Not An Insight to Take Action Upon


The big data era is characterized by ever-increasing velocity and volume of data. Over the last two or three years, several talks at Velocity have explored how to analyze operations data at scale, focusing on anomaly detection, performance analysis, and capacity planning, to name a few topics. Knowledge sharing of the techniques for the aforementioned problems helps the community to build highly available, performant, and resilient systems.

A key aspect of operations data is that data may be missing—referred to as “holes”—in the time series. This may happen for a wide variety of reasons, including (but not limited to):

# Packets being dropped due to unresponsive downstream services
# A network hiccup
# Transient hardware or software failure
# An issue with the data collection service

“Holes” in a time series can skew data analysis, which in turn can materially affect decision making. Arun Kejariwal presents approaches for analyzing operations data in the presence of “holes” in the time series. He highlights how missing data impacts common analyses such as anomaly detection and forecasting, discusses the implications of missing data for time series of different granularities, such as minutely and hourly, and explores a range of techniques for addressing the missing data issue (e.g., approximating the data using interpolation, regression, or ensemble methods). Arun then walks through how these techniques can be applied to real data.

Published in: Data & Analytics

  1. 1. DATA DATA EVERYWHERE: Not An Insight to Take Action Upon. ARUN KEJARIWAL
  2. 2. WHAT’S UP WITH THE TITLE? Metrics Arms Race
     • “… scaling it to about two million distinct time series …” (Netflix)
     • “… highly accurate, real-time alerts on millions of system and business metrics …” (Uber)
     • “As we have hundreds of systems exposing multiple data items, the write data rate might easily exceed tens of millions of data points each second.” (Facebook)
     • “… we are taking tens of billions of measurements …” (Google)
     • “The Observability stack collects 170 million individual data metrics (time series) …” (Twitter)
     • “… serving over 50 million distinct time series.” (Spotify)
  3. 3. WHAT’S UP WITH THE TITLE? Metrics Arms Race
     • >95% of metrics data is NEVER read!!
     • Legacy instrumentation
     • Lack of understanding of how to use metrics
     • Latency: 10p, 20p, …, 90p, 95p, 99p, 99.9p, Mean
     • Retention: “Hard disk is cheap”
  4. 4. WHAT’S UP WITH THE TITLE? INSPIRATION: Rooted in 1798! “Rime of the Ancient Mariner” by Samuel Taylor Coleridge: “Water, water, every where, Nor any drop to drink.”
  5. 5. DATA EXPLOSION: Relation to Big/Fast Data
     • Growing number of data sources: Mobile (Smartphones, Tablets, Smart Watches), IoT, Wearables
     • Data collection has become a commodity: Millions of time series
     • DATA, DATA, EVERYWHERE
  6. 6. DATA, DATA, EVERYWHERE: Non-trivial to Mine Actionable Insights
  7. 7. DATA, DATA, EVERYWHERE: Time to Market, sinking ships. Can get “lonely”
  8. 8. DATA, DATA, EVERYWHERE: Value-driving analytics, NOT pristine data sets, interesting patterns or killer algorithms
     • 1896 Olympics, Greece: Thomas Burke
     • Purpose and ability to act
     • Applying hard metrics and asking hard questions
     • “Advanced data analytics is a quintessential business matter.” [1]
     [1] McKinsey, October 2016.
  9. 9. DATA, DATA, EVERYWHERE: Purpose and Action
     • BREAK IT DOWN: The impact of “big data” analytics is often manifested by thousands, or more, of incrementally small improvements. If an organization can atomize a single process into its smallest parts and implement advances where possible, the payoffs can be profound. [1]
     • ITERATE: “Victory often resulted from the way decisions are made; the side that reacts to situations more quickly and processes new information more accurately should prevail.”
     [1] McKinsey, October 2016.
  11. 11. ANOMALY DETECTION: >100 years History. Bessel and Baeyer 1838, Chauvenet 1863, Stone 1868, Wright 1884, Irwin 1925, Student 1927, Thompson 1935. “… present an exact rule for the rejection of observations, which shall be legitimately derived from the fundamental principles of Calculus of Probabilities.”
  12. 12. ANOMALY DETECTION: Prior to 1950s
  13. 13. ANOMALY DETECTION: Prior to 1950s
  14. 14. ANOMALY DETECTION: Check your assumptions (NORMAL DIST, STATIONARITY, CONTEXT)
     ROBUST STATISTICS
     • Median: Wright 1599 (“Certaine Errors in Navigation”), Cournot 1843, Fechner 1874, Galton 1882, Edgeworth 1887
     • Median Absolute Deviation: median_i |X_i − median(X)|
     • Estimators S_n (median_i {median_j |X_i − X_j|}) and Q_n (approximately the first quartile of {|X_i − X_j|; i < j})
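
The robust scale estimators named on this slide can be sketched in a few lines. This is a minimal, naive O(n²) illustration, not the speaker's implementation. It assumes the definitions from "Alternatives to the Median Absolute Deviation" (listed under READINGS) without their consistency constants. The sample `latencies_ms`, the helper `robust_outliers`, and the 3.5 cutoff are hypothetical choices made for this example.

```python
from statistics import median

def mad(xs):
    """Median absolute deviation: median_i |x_i - median(x)|."""
    m = median(xs)
    return median(abs(x - m) for x in xs)

def sn(xs):
    """Naive S_n: median_i { median_j |x_i - x_j| }, no consistency constant."""
    return median(median(abs(xi - xj) for xj in xs) for xi in xs)

def qn(xs):
    """Naive Q_n: k-th smallest pairwise distance, k = C(h, 2), h = n // 2 + 1."""
    n = len(xs)
    h = n // 2 + 1
    k = h * (h - 1) // 2
    dists = sorted(abs(xs[i] - xs[j]) for i in range(n) for j in range(i + 1, n))
    return dists[k - 1]

def robust_outliers(xs, threshold=3.5):
    """Flag points whose modified z-score 0.6745 * |x - median| / MAD exceeds threshold."""
    m = median(xs)
    scale = mad(xs)
    if scale == 0:
        return []  # no spread around the median; this rule cannot rank points
    return [x for x in xs if 0.6745 * abs(x - m) / scale > threshold]

# Hypothetical latency samples (ms) with one obvious spike.
latencies_ms = [102, 98, 101, 99, 100, 97, 103, 100, 450]
print(mad(latencies_ms), sn(latencies_ms), qn(latencies_ms))
print(robust_outliers(latencies_ms))
```

Unlike the mean and standard deviation, all three scale estimates stay small in the presence of the spike, which is why the spike stands out so clearly against them.
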
  15. 15. ANOMALY DETECTION: Early work post 1950
  16. 16. ANOMALY DETECTION: >100 years History
  17. 17. ANOMALY DETECTION: >100 years History
  18. 18. ANOMALY DETECTION: Operations
     • Increasing focus of monitoring solutions (RUM/Synthetic): DataDog, Catchpoint, Opsclarity, Ruxit
     • Netflix, Airbnb, Cloudera
     • EGADS: A Scalable, Configurable, and Novel Anomaly Detection System [1]
     • Introducing Practical and Robust Anomaly Detection in Time Series [2]
     • Identifying Outages with Argos, Uber Engineering’s Real-Time Monitoring and Root-Cause Exploration Tool [3]
     • Robust Anomaly Detection System for Real User Monitoring Data [4]
     [1] 2015. [2] 2015. [3] 2015. [4] 2016.
  19. 19. ANOMALY DETECTION: As A Service On The Cloud. AZURE, AWS, IBM Cloud
  20. 20. ANOMALY DETECTION: Key Aspects. SEASONALITY (Geographic), TREND (Growth), RESIDUAL
  21. 21. ANOMALY DETECTION: Finding needle in a haystack
     • Finding outlier nodes: HADOOP CLUSTER, LOAD BALANCER, DISTRIBUTED DATABASE
     • Clustering Techniques: DBSCAN, OPTICS, K-MEANS/MEDOID
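
As an illustration of finding outlier nodes with a clustering technique, here is a minimal pure-Python DBSCAN sketch. It is not taken from the talk. The per-node metrics in `nodes` (normalized CPU utilization and normalized p99 latency) and the `eps` and `min_pts` values are hypothetical. Points that end up with the label -1 belong to no dense cluster and are treated as outlier nodes.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    UNSEEN, NOISE = None, -1
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # Brute-force region query; includes the point itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point, keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels

# Hypothetical (cpu, p99 latency) pairs, both scaled to [0, 1], one per node.
nodes = [(0.50, 0.20), (0.52, 0.22), (0.49, 0.21),
         (0.51, 0.19), (0.53, 0.20), (0.95, 0.90)]
labels = dbscan(nodes, eps=0.05, min_pts=3)
outlier_nodes = [i for i, label in enumerate(labels) if label == -1]
print(labels, outlier_nodes)
```

The brute-force neighbor search is quadratic in the number of nodes, which is fine for a cluster of a few hundred hosts but would need a spatial index at larger scale.
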
  22. 22. ANOMALY DETECTION: Distinct Approaches. Time Domain VS Frequency Domain
     • Time Domain: Time series analysis, Statistical tests, Clustering
     • Frequency Domain: Fast Fourier Transform, DWT, Fractals, Filters (Kalman Filter)
     • Real-time Implications: Frequency based techniques may incur additional overhead
  23. 23. ANOMALY DETECTION: Distinct Approaches
     • SUPERVISED: Higher accuracy, Bias-Variance Tradeoff. Decision Trees, SVM, Neural Networks
     • UNSUPERVISED: Clustering. The common case in Operations
  24. 24. ANOMALY DETECTION: Too Many Alerts
     • Complex Architectures: Hundreds of Microservices
     • Loss of productivity (TTD, TTR)
     • Impact on end-user experience
     • FALSE POSITIVE/NEGATIVE, TRUE POSITIVE
  26. 26. MISSING DATA: Often Overlooked During Analysis
     • SOURCES: Data Collection Issues (network hiccups, bursty traffic, queue overflow – packet loss); System Failures (bugs, hardware) – cascading effects
     • Makes analysis non-trivial: Methods are in general not prepared to handle missing data
     • Loss of efficiency: Fewer patterns extracted from data; Conclusions statistically less strong
     • Inference bias: Resulting from differences between missing and complete data
     • Larger standard error: Reduced sample size
  27. 27. MISSING DATA: A Large Amount of Prior Work
  28. 28. MISSING DATA: A Large Amount of Prior Work
  29. 29. MISSING DATA: Characterization of Missing Values
     • MISSING COMPLETELY AT RANDOM: Does not depend on either the observed or missing data
     • MISSING AT RANDOM: Depends on the observed data, but not on missing data
     • MISSING NOT AT RANDOM: Depends on missing/unobserved (latent) values
     • Cause: Random; Uncorrelated with variables of interest
     • Most Common Assumption
     • Factors: Correlation between cause of missingness and variables of interest; Missingness of variable of interest
     • Common Case!
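
The three mechanisms can be made concrete with a small simulation. This is a sketch under stated assumptions, not data from the talk: `load` stands for a fully observed covariate, `latency` for the variable of interest, and every drop probability and threshold is a made-up value. The function returns the mean of the complete series followed by the mean of the surviving points under each mechanism.

```python
import random
from statistics import mean

def simulate(n=10_000, seed=7):
    rng = random.Random(seed)
    load = [rng.uniform(0, 1) for _ in range(n)]                # always observed
    latency = [100 + 200 * l + rng.gauss(0, 10) for l in load]  # may go missing

    # MCAR: every point is dropped with the same probability.
    mcar = [y for y in latency if rng.random() > 0.3]
    # MAR: the drop probability depends only on the observed load.
    mar = [y for l, y in zip(load, latency) if rng.random() > 0.6 * l]
    # MNAR: the drop probability depends on the (unobserved) latency itself.
    mnar = [y for y in latency if rng.random() > (0.6 if y > 250 else 0.05)]
    return mean(latency), mean(mcar), mean(mar), mean(mnar)

full, mcar_mean, mar_mean, mnar_mean = simulate()
print(f"full={full:.1f} MCAR={mcar_mean:.1f} MAR={mar_mean:.1f} MNAR={mnar_mean:.1f}")
```

In this setup the MCAR mean stays close to the full-data mean, while the naive means under MAR and MNAR are pulled down because high-latency points are dropped more often. Under MAR the bias can in principle be corrected using the observed `load`; under MNAR it cannot be corrected from the observed data alone.
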
  30. 30. MISSING DATA: TAXONOMY of METHODS
     • Completely Recorded Units: Discard variables with missing data; Subset data to ignore missing data
     • Weighting Procedures: Differentially weigh the complete cases to adjust for bias; Weights are a function of response probability
     • Imputation Based: Assign a value to missing one; Leverage existing ML/data mining methods
     • Model Based: Define a model for the observed data; Inference based on the likelihood or posterior distribution under the model
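
A weighting procedure can be sketched as follows: each complete case is weighted by the inverse of its response probability, so that cases from under-represented groups count more. The host groups, latencies, and response probabilities below are hypothetical, and the probabilities are assumed known here, whereas in practice they would have to be estimated.

```python
def weighted_mean(values, response_probs):
    """Mean of the complete cases, each weighted by 1 / response probability."""
    weights = [1.0 / p for p in response_probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical fleet: 4 low-load hosts (latency 100) that always report and
# 4 high-load hosts (latency 300) that report only half the time, so just 2
# of them are observed. The true fleet-wide mean is 200.
observed_latency = [100, 100, 100, 100, 300, 300]
response_prob = [1.0, 1.0, 1.0, 1.0, 0.5, 0.5]

naive = sum(observed_latency) / len(observed_latency)
adjusted = weighted_mean(observed_latency, response_prob)
print(naive, adjusted)
```

The naive complete-case mean underestimates the fleet-wide latency, while the weighted mean recovers it in this toy setup.
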
  31. 31. MISSING VALUES: Imputation Methods
     • Transformations, Normalization, Prediction
     • Mean Value, Last value, Mean of x_{t-1} and x_{t+1}, Most Common Value
     • Regression, Similarity
     • K-Nearest Neighbor, K-Means/Medoid, Fuzzy K-Means/Medoid
     • SVM, SVD
     • Event Covering
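
The three simplest methods on this slide (mean value, last value, and the mean of x_{t-1} and x_{t+1}) can be sketched as below. This is an illustrative sketch, assuming holes are represented as `None` in a regularly sampled series. The function names and the sample `series` are hypothetical.

```python
from statistics import mean

def impute_mean(series):
    """Replace every hole with the mean of the observed values."""
    fill = mean(x for x in series if x is not None)
    return [fill if x is None else x for x in series]

def impute_last_value(series):
    """Carry the last observed value forward; leading holes stay None."""
    out, last = [], None
    for x in series:
        if x is not None:
            last = x
        out.append(last)
    return out

def impute_neighbor_mean(series):
    """Fill isolated holes with the mean of x[t-1] and x[t+1]; longer gaps stay None."""
    out = list(series)
    for t in range(1, len(series) - 1):
        prev, nxt = series[t - 1], series[t + 1]
        if series[t] is None and prev is not None and nxt is not None:
            out[t] = (prev + nxt) / 2
    return out

series = [10.0, None, 14.0, 15.0, None, None, 21.0]
print(impute_mean(series))
print(impute_last_value(series))
print(impute_neighbor_mean(series))
```

The neighbor-mean variant deliberately leaves the two-point gap unfilled, which makes the limits of the method visible: longer holes need interpolation over a wider window or one of the model-based approaches.
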
  32. 32. MISSING VALUES: Imputation Methods. MULTIPLE IMPUTATION
     • Replace each missing value by a vector of D ≥ 2 imputed values
     • Single imputation cannot reflect sampling variability under one model for missing data or uncertainty about the correct model for missing data
     • D complete-data inferences are combined to form one inference that reflects uncertainty due to missing data under that model
     • Throws light on sensitivity of inference to models for missing data
     • Resampling Methods: Jackknifing (Quenouille, 1949) and Bootstrapping (Efron, 1979). Based on large-sample theory
     • MI: Based on Bayesian theory and provides useful inferences for small samples
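
One standard way to combine the D complete-data inferences is the set of pooling formulas usually called Rubin's rules, which the slide does not name explicitly. The sketch below assumes each of the D imputed data sets has already produced a point estimate and its variance. The function name `pool` and the example numbers are hypothetical.

```python
from statistics import mean, variance

def pool(estimates, variances):
    """Pool D complete-data estimates and their variances.

    Returns (pooled estimate, total variance), where
    total = within + (1 + 1/D) * between.
    """
    d = len(estimates)
    q_bar = mean(estimates)         # pooled point estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor D - 1)
    total = within + (1 + 1 / d) * between
    return q_bar, total

# Hypothetical results from D = 3 imputed data sets.
q_bar, total = pool([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
print(q_bar, total)
```

The between-imputation term is what single imputation cannot provide: it grows when the imputed data sets disagree, widening the final interval accordingly.
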
  33. 33. MISSING VALUES: Imputation Methods. MULTIVARIATE IMPUTATION
     • MICE: Chained Equations (Sequential Regression Multiple Imputation). Assumes missing data is MAR. Each variable with missing data is modeled conditional upon the other variables in the data
     • Multivariate regression: Linear, Logistic, Poisson
     • Use of auxiliary variables: not used in the analysis (predictive of missingness) but can improve imputations
     • Learn a Bayesian Network using complete data and use it to simultaneously impute all missing values via abductive inference
     • Incremental imputation
  34. 34. MISSING VALUES: Methods. MODEL BASED
     • Expectation-Maximization (EM): Hartley 1958, Orchard and Woodbury 1972, Dempster et al. 1977. Converges reliably to a local maximum or a saddle point. Slow to converge in presence of large number of missing values. M step can be difficult (e.g., has no closed form)
     • Expectation/Conditional Maximization (ECM): two or more conditional (on parameters) maximization
     • Alternating Expectation/Conditional Maximization (AECM): complete-data/actual loglikelihood
     • Parameter-Expanded EM (PX-EM): include parameters whose values are known during maximization
     • Variational Bayes
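
A toy instance of EM with missing data: estimating the mean and variance of a univariate normal when some observations are missing completely at random. This is a deliberately simple sketch, not the speaker's example. The function name `em_normal`, the starting values, and the sample data are assumptions. In this special case the answer is also available in closed form (the observed-data mean and variance), which makes the iteration easy to check.

```python
def em_normal(observed, n_missing, iterations=200):
    """EM for the mean and variance of a normal sample with n_missing holes."""
    n = len(observed) + n_missing
    s1 = sum(observed)
    s2 = sum(x * x for x in observed)
    mu, var = 0.0, 1.0  # arbitrary starting point
    for _ in range(iterations):
        # E-step: expected sufficient statistics under the current parameters,
        # using E[x] = mu and E[x^2] = mu^2 + var for each missing value.
        e1 = s1 + n_missing * mu
        e2 = s2 + n_missing * (mu * mu + var)
        # M-step: maximum-likelihood update from the completed statistics.
        mu = e1 / n
        var = e2 / n - mu * mu
    return mu, var

# Hypothetical sample: 5 observed values and 5 holes.
mu, var = em_normal([4.0, 6.0, 5.0, 7.0, 3.0], n_missing=5)
print(mu, var)
```

In this toy model the error shrinks by a factor equal to the fraction of missing values at each iteration, which illustrates the slide's remark that EM is slow to converge when many values are missing.
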
  35. 35. READINGS: Research papers. “IF I HAVE SEEN FURTHER, IT IS BY STANDING ON THE SHOULDERS OF GIANTS” (ISAAC NEWTON)
     • On Achieving Energy Efficiency and Reducing CO2 Footprint in Cloud Computing, 2015.
     • Fractal based Anomaly Detection Over Data Streams, 2013.
     • Techniques for Optimizing Cloud Footprint, 2011.
     • Alternatives to the Median Absolute Deviation, 1993.
     • Knee Point Detection on Bayesian Information Criterion, 2008.
     • Knee Point Search Using Cascading Top-k Search with Minimized Time Complexity, 2013.
     • Finding a ‘Kneedle’ in a Haystack: Detecting Knee Points in System Behavior, 2011.
     • NbClust R Package
  36. 36. READINGS: Techniques
     • Saturated Correlates (“Spider”) Model
     • Multiple Group SEM (Structural Equation Modeling)
     • Latent Transition Analysis
     • General Location Model
     • Extra Dependent Variable (DV) Model
     • Bayesian Principal Component Analysis
     • Local Least Squares Imputation
     • Full Information Maximum Likelihood (FIML)
  37. 37. COFFEE BREAK: 50 minutes