Data Data Everywhere: Not An Insight to Take Action Upon

Arun Kejariwal, Statistical Learning Principal at Machine Zone, Inc.
DATA DATA EVERYWHERE
ARUN KEJARIWAL
— Not An Insight to Take Action Upon —
WHAT’S UP WITH THE TITLE?
­­­­­— Metrics Arms Race ­­­­­—
•  “… scaling it to about two million distinct time
series …” (Netflix)
•  “… highly accurate, real-time alerts on millions of
system and business metrics …” (Uber)
•  “As we have hundreds of systems exposing
multiple data items, the write data rate might
easily exceed tens of millions of data points each
second.” (Facebook)
•  “… we are taking tens of billions of
measurements …” (Google)
•  “The Observability stack collects 170 million
individual data metrics (time series) …” (Twitter)
•  “… serving over 50 million distinct time
series.” (Spotify)
WHAT’S UP WITH THE TITLE?
­­­­­— Metrics Arms Race ­­­­­—
•  >95% of metrics data is NEVER read!!
•  Legacy instrumentation
•  Lack of understanding of how to use metrics
Latency
10p, 20p, …, 90p
95p, 99p, 99.9p
Mean
•  Retention
“Hard disk is cheap”
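To illustrate why the choice between the mean and percentiles matters for latency, here is a minimal sketch on made-up data. The sample latencies and the nearest-rank `percentile` helper are assumptions for illustration, not taken from the talk.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in ms: 98 fast requests and 2 very slow ones.
latencies_ms = [10] * 98 + [2000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"mean={mean_ms:.1f} ms, p50={p50} ms, p99={p99} ms")
```

In this sketch the mean sits far from both the typical request and the slow tail, which is one reason a single summary metric per signal can mislead.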
WHAT’S UP WITH THE TITLE?
“Rime of the Ancient Mariner”
INSPIRATION
By Samuel Taylor Coleridge
Water, water, every where,
Nor a drop to drink.
­­­­­— Rooted in 1798! ­­­­­—
Image Source: https://ebooks.adelaide.edu.au/c/coleridge/samuel_taylor/rime/
Growing number of data sources
DATA EXPLOSION
•  Mobile (Smartphones, Tablets, Smart Watches)
­­­­­— Relation to Big/Fast Data ­­­­­—
Data collection has become a commodity
•  Millions of time series
•  IoT
•  Wearables
DATA, DATA, EVERYWHERE
DATA, DATA, EVERYWHERE
­­­­­— Non-trivial to Mine Actionable Insights ­­­­­—
DATA, DATA, EVERYWHERE
­­­­­— Time to Market, sinking ships – Can get “lonely” ­­­­­—
Image Source: https://ebooks.adelaide.edu.au/c/coleridge/samuel_taylor/rime/
1896 Olympics, Greece: Thomas Burke
Purpose and ability to act
Applying hard metrics and asking hard questions
“Advanced data analytics is a quintessential business matter.” [1]
[1] http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/making-data-analytics-work-for-you-instead-of-the-other-way-around, McKinsey, October, 2016.
DATA, DATA, EVERYWHERE
­­­­­— Value-driving analytics, NOT pristine data sets, interesting patterns or killer algorithms ­­­­­—
BREAK IT DOWN
The impact of “big data” analytics is
often manifested by thousands—or
more—of incrementally small
improvements. If an organization can
atomize a single process into its
smallest parts and implement advances
where possible, the payoffs can be
profound. [1]
DATA, DATA, EVERYWHERE
­­­­­— Purpose and Action ­­­­­—
[1] http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/making-data-analytics-work-for-you-instead-of-the-other-way-around, McKinsey, October, 2016.
ITERATE
“Victory often resulted from the way
decisions are made; the side that
reacts to situations more quickly and
processes new information more
accurately should prevail.”
DATA ANALYTICS
PERFORMANCE
KNEE POINT
DETECTION
AVAILABILITY
ROOT CAUSE
ANALYSIS
AUTOSCALING
EFFICIENCY
A/B
TESTING
INTRUSION
DETECTION
­­­­­— Not a Hype, But Not Trivial Either ­­­­­—
ANOMALY DETECTION
­­­­­— >100 years History ­­­­­—
Bessel and Baeyer 1838, Chauvenet 1863, Stone 1868, Wright 1884, Irwin 1925, Student 1927, Thompson 1935
“… present and exact rule for the rejection of observations,
which shall be legitimately derived from the fundamental
principles of Calculus of Probabilities.”
ANOMALY DETECTION
­­­­­— Prior to 1950s ­­­­­—
ANOMALY DETECTION
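Median - Wright 1599 (“Certaine Errors in Navigation”), Cournot 1843, Fechner 1874, Galton 1882, Edgeworth 1887
Median Absolute Deviation: $\mathrm{MAD} = \operatorname{median}_i\,|X_i - \operatorname{median}_j X_j|$
Estimators $S_n = \operatorname{median}_i\{\operatorname{median}_j\,|X_i - X_j|\}$ and $Q_n = \{|X_i - X_j|;\ i < j\}_{(k)}$, the $k$-th smallest pairwise distance with $k = \binom{h}{2}$, $h = \lfloor n/2 \rfloor + 1$ (each up to a consistency constant)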
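The robust scale estimators on this slide can be sketched in a few lines. This is a simplified illustration, not the talk's implementation: it uses ordinary medians, omits the consistency constants, and runs in O(n^2). Qn follows the definition in "Alternatives to the Median Absolute Deviation" (1993, listed in the readings), i.e. the k-th smallest pairwise absolute difference. The sample data and the cutoff of 3 in `mad_outliers` are assumptions.

```python
from itertools import combinations
from statistics import median

def mad(x):
    """Median absolute deviation: median_i |x_i - median(x)| (no consistency factor)."""
    m = median(x)
    return median(abs(v - m) for v in x)

def sn(x):
    """Naive Sn: median over i of the median over j of |x_i - x_j|."""
    return median(median(abs(a - b) for b in x) for a in x)

def qn(x):
    """Naive Qn: k-th smallest pairwise distance, k = C(h, 2), h = n // 2 + 1."""
    h = len(x) // 2 + 1
    k = h * (h - 1) // 2
    diffs = sorted(abs(a - b) for a, b in combinations(x, 2))
    return diffs[k - 1]

def mad_outliers(x, cutoff=3.0):
    """Flag points whose distance from the median exceeds cutoff * 1.4826 * MAD.
    1.4826 makes the MAD comparable to a standard deviation under normality;
    the cutoff itself is an assumed, tunable value."""
    m = median(x)
    scale = 1.4826 * mad(x)
    return [v for v in x if abs(v - m) > cutoff * scale]

# Hypothetical measurements with one gross error.
samples = [10, 11, 9, 10, 12, 10, 11, 100]
outliers = mad_outliers(samples)
print(mad(samples), sn(samples), qn(samples), outliers)
```

Unlike the mean and standard deviation, none of these estimates is dragged toward the single extreme value, which is why they suit outlier screening.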
ROBUST STATISTICS
NORMAL
DIST
STATIONARITY CONTEXT
­­­­­— Check your assumptions­­­­­—
ANOMALY DETECTION
­­­­­— Early work post 1950 ­­­­­—
ANOMALY DETECTION
­­­­­— >100 years History ­­­­­—
ANOMALY DETECTION
Increasing focus of monitoring solutions (RUM/Synthetic): DataDog, Catchpoint, Opsclarity, Ruxit
Netflix, Airbnb, Cloudera
­­­­­— Operations ­­­­­—
EGADS: A Scalable, Configurable, and Novel Anomaly Detection System [1]
Introducing Practical and Robust Anomaly Detection in Time Series [2]
Identifying Outages with Argos, Uber Engineering’s Real-Time Monitoring and Root-Cause Exploration Tool [3]
Robust Anomaly Detection System for Real User Monitoring Data [4]
[1] https://research.yahoo.com/news/announcing-open-source-egads-scalable-configurable-and-novel-anomaly-detection-system, 2015 (https://github.com/yahoo/egads)
[2] https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series, 2015 (https://github.com/twitter/AnomalyDetection)
[3] https://eng.uber.com/argos/, 2015
[4] http://bit.ly/luminol-velocity, 2016 (https://github.com/linkedin/luminol)
ANOMALY DETECTION
AZURE [1] AWS [2] IBM Cloud [3]
­­­­­— As A Service On The Cloud ­­­­­—
[1] https://azure.microsoft.com/en-us/documentation/articles/machine-learning-apps-anomaly-detection
[2] https://aws.amazon.com/blogs/iot/anomaly-detection-using-aws-iot-and-aws-lambda
[3] https://developer.ibm.com/recipes/tutorials/engage-machine-learning-for-detecting-anomalous-behaviors-of-things
ANOMALY DETECTION
SEASONALITY
Geographic
TREND
Growth
RESIDUAL
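One common way to separate the seasonality, trend and residual named on this slide is a classical additive decomposition. The sketch below assumes an additive model, an odd seasonal period, and a synthetic series with one injected spike; the function name `decompose` and all data are hypothetical.

```python
def decompose(series, period):
    """Additive decomposition: series = trend + seasonal + residual (odd period only)."""
    if period < 3 or period % 2 == 0:
        raise ValueError("this sketch supports odd periods >= 3")
    n, half = len(series), period // 2
    # Trend: centered moving average over one full period.
    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = sum(series[t - half:t + half + 1]) / period
    # Seasonal: average detrended value per phase, shifted to sum to zero.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    means = [sum(b) / len(b) for b in buckets]
    offset = sum(means) / period
    seasonal = [means[t % period] - offset for t in range(n)]
    # Residual: what is left once trend and seasonality are removed.
    residual = [None if trend[t] is None else series[t] - trend[t] - seasonal[t]
                for t in range(n)]
    return trend, seasonal, residual

# Synthetic example: linear growth plus a weekly pattern, with a spike at t = 20.
period = 7
pattern = [3, 1, 0, -1, -3, 0, 0]
series = [0.5 * t + pattern[t % period] for t in range(42)]
series[20] += 10
trend, seasonal, residual = decompose(series, period)
suspect = max((t for t in range(len(series)) if residual[t] is not None),
              key=lambda t: abs(residual[t]))
print("largest residual at t =", suspect)
```

Looking for anomalies in the residual rather than the raw series keeps growth and regular cycles from being flagged.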
­­­­­— Key Aspects ­­­­­—
ANOMALY DETECTION
HADOOP CLUSTER
LOAD BALANCER
DISTRIBUTED DATABASE
DBSCAN
Finding outlier nodes
­­­­­— Finding needle in a haystack ­­­­­—
Clustering Techniques
OPTICS
K-MEANS/MEDOID
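As an illustration of finding outlier nodes with density-based clustering, here is a minimal pure-Python DBSCAN. The node metrics, `eps` and `min_pts` values are made up; a production system would more likely use a library implementation with an index for neighbor queries.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN. Returns one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Brute-force region query; includes the point itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels

# Hypothetical per-node metrics: (normalized CPU utilization, normalized p99 latency).
nodes = [(0.50, 0.20), (0.52, 0.22), (0.49, 0.19),
         (0.51, 0.21), (0.48, 0.20), (0.95, 0.90)]
labels = dbscan(nodes, eps=0.1, min_pts=3)
outlier_nodes = [i for i, label in enumerate(labels) if label == -1]
print(labels, outlier_nodes)
```

Nodes labeled -1 belong to no dense group and are the candidates to inspect first.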
ANOMALY DETECTION
VS
Real-time Implications: Frequency based techniques may incur additional overhead
Time Domain Frequency Domain
­­­­­— Distinct Approaches ­­­­­—
Time series analysis
Statistical tests, Clustering
Fast Fourier Transform, DWT
Fractals, Filters (Kalman Filter)
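A small frequency-domain example: estimating the dominant seasonal period of a series from its discrete Fourier transform. This sketch uses a naive O(n^2) DFT for readability (an FFT gives the same spectrum faster); the function name and the synthetic series are assumptions.

```python
import cmath
import math

def dominant_period(series):
    """Return the period (in samples) of the strongest non-zero frequency."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]   # remove the zero-frequency component
    best_k, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k

# Synthetic series: a 24-sample cycle plus a weaker 8-sample cycle.
series = [math.sin(2 * math.pi * t / 24) + 0.3 * math.sin(2 * math.pi * t / 8)
          for t in range(96)]
print(dominant_period(series))
```

The extra pass over a window of samples is the kind of overhead the slide refers to when applying frequency-based techniques in real time.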
ANOMALY DETECTION
­­­­­— Distinct Approaches ­­­­­—
UNSUPERVISED SUPERVISED
Higher accuracy, Bias-Variance Tradeoff
Decision Trees, SVM, Neural Networks
Clustering
The common case in Operations
ANOMALY DETECTION
Complex Architectures: Hundreds of Microservices
Loss of productivity (TTD, TTR)
Impact on end-user experience
FALSE POSITIVE/NEGATIVE TRUE POSITIVE
­­­­­— Too Many Alerts ­­­­­—
ANOMALY DETECTION
ACTIONABLE
PRODUCTIVITY
SEVERITY
PRIORITIZATION
ROOT CAUSE ANALYSIS
CORRELATION
­­­­­— Properties ­­­­­—
SOURCES
Data Collection Issues (network hiccups, bursty traffic, queue overflow – packet loss)
System Failures (bugs, hardware) – cascading effects
MISSING DATA
­­­­­— Often Overlooked During Analysis ­­­­­—
Makes analysis non-trivial
Methods are in general not prepared to handle them
Loss of efficiency
Fewer patterns extracted from data
Conclusions statistically less strong
Inference bias
Resulting from differences between missing and complete data
Larger standard error
Reduced sample size
MISSING DATA
­­­­­— A Large Amount of Prior Work ­­­­­—
MISSING DATA
MISSING COMPLETELY
AT RANDOM
Does not depend on either the
observed or missing data
MISSING AT
RANDOM
Depends on the observed
data, but not on missing
data
MISSING NOT
AT RANDOM
Depends on missing/
unobserved (latent) values
­­­­­— Characterization of Missing Values ­­­­­—
Cause
Random
Uncorrelated with variables
of interest
Most Common Assumption Factors
Correlation between cause
of missingness and
Variables of interest
Missingness of variable
of interest
Common Case!
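The three mechanisms can be made concrete with a small simulation. Everything here is assumed for illustration: a fully observed covariate `x`, a variable of interest `y` correlated with it, and arbitrary drop probabilities. The sketch compares the mean of the surviving `y` values under each mechanism.

```python
import random

random.seed(0)
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]        # always observed
y = [xi + random.gauss(0, 0.5) for xi in x]       # variable of interest, true mean 0

def observed_mean(values, keep):
    """Mean of the values that survive, i.e. the complete-case mean."""
    kept = [v for v, k in zip(values, keep) if k]
    return sum(kept) / len(kept)

# MCAR: every y is dropped with the same probability.
keep_mcar = [random.random() > 0.5 for _ in range(n)]
# MAR: the drop probability depends only on the observed x.
keep_mar = [random.random() > (0.8 if xi > 0 else 0.1) for xi in x]
# MNAR: the drop probability depends on the (unobserved) y itself.
keep_mnar = [random.random() > (0.8 if yi > 0 else 0.1) for yi in y]

full_mean = sum(y) / n
mcar_mean = observed_mean(y, keep_mcar)
mar_mean = observed_mean(y, keep_mar)
mnar_mean = observed_mean(y, keep_mnar)
print(f"full={full_mean:.3f} mcar={mcar_mean:.3f} mar={mar_mean:.3f} mnar={mnar_mean:.3f}")
```

In this setup only MCAR leaves the complete-case mean close to the truth. Under MAR the bias can be corrected using `x` (for example by regression or weighting), whereas under MNAR the missingness itself has to be modelled.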
MISSING DATA
— TAXONOMY of METHODS —
Completely Recorded Units
•  Discard variables with missing data
•  Subset data to ignore missing data
Weighting Procedures
•  Differentially weigh the complete cases to adjust for bias
•  Weights are a function of response probability
Imputation Based
•  Assign a value to missing one
•  Leverage existing ML/data mining methods
Model Based
•  Define a model for the observed data
•  Inference based on the likelihood or posterior distribution under the model
MISSING VALUES
Transformations
Normalization
Prediction
­­­­­— Imputation Methods ­­­­­—
Mean Value
Last value
Mean of $x_{t-1}$ and $x_{t+1}$
Most Common Value
Regression
Similarity
K-Nearest Neighbor
K-Means/Medoid
Fuzzy K-Means/Medoid
SVM, SVD
Event Covering
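The three simplest methods on this slide (mean value, last value, and the mean of the two neighbors) can be sketched as follows. Missing points are represented by `None`; the function names and the sample series are hypothetical.

```python
def impute_mean(series):
    """Replace every gap with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in series]

def impute_last_value(series):
    """Carry the last observed value forward; leading gaps stay None."""
    out, last = [], None
    for v in series:
        if v is not None:
            last = v
        out.append(last)
    return out

def impute_neighbor_mean(series):
    """Fill an isolated gap at t with the mean of x[t-1] and x[t+1]; longer gaps stay None."""
    out = list(series)
    for t in range(1, len(series) - 1):
        if series[t] is None and series[t - 1] is not None and series[t + 1] is not None:
            out[t] = (series[t - 1] + series[t + 1]) / 2
    return out

# Hypothetical time series with one isolated gap and one two-point gap.
series = [1.0, None, 3.0, 4.0, None, None, 7.0]
print(impute_mean(series))
print(impute_last_value(series))
print(impute_neighbor_mean(series))
```

Each method makes a different implicit assumption: the mean ignores time ordering, last-value assumes the signal is locally flat, and the neighbor mean assumes it is locally linear.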
MISSING VALUES
Resampling Methods: Jackknifing (Quenouille, 1949) and Bootstrapping (Efron, 1979)
Based on large-sample theory
MI: Based on Bayesian theory and provides useful inferences for small samples
MULTIPLE
IMPUTATION
­­­­­— Imputation Methods ­­­­­—
Replace each missing value by a vector of D ≥ 2 imputed values
Single imputation cannot reflect sampling variability under one model for missing data or uncertainty about the correct model for missing data
D complete-data inferences are combined to form one inference that reflects uncertainty due to missing data under that model
Sheds light on the sensitivity of inference to the models for missing data
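The slide does not say how the D complete-data inferences are combined; the standard choice is Rubin's rules, sketched below. The pooled estimate is the average of the D estimates, and the total variance adds the average within-imputation variance to an inflated between-imputation variance. The function name and the numbers are illustrative assumptions.

```python
from statistics import mean, variance

def pool(estimates, variances):
    """Combine D complete-data estimates and their variances with Rubin's rules."""
    d = len(estimates)
    if d < 2 or len(variances) != d:
        raise ValueError("need D >= 2 estimates with matching variances")
    q_bar = mean(estimates)              # pooled point estimate
    within = mean(variances)             # average within-imputation variance
    between = variance(estimates)        # between-imputation variance (D - 1 denominator)
    total = within + (1 + 1 / d) * between
    return q_bar, total

# Hypothetical results from D = 3 imputed data sets.
q_bar, total = pool([10.0, 11.0, 12.0], [1.0, 1.0, 1.0])
print(q_bar, total)
```

The between-imputation term is what a single imputation cannot provide: it is zero when only one completed data set exists.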
MISSING VALUES
MICE: Chained Equations
(Sequential Regression Multiple Imputation)
Assumes missing data is MAR
Each variable with missing data is modeled
conditional upon the other variables in the data
Learn a Bayesian Network using complete data and
use it to simultaneously impute all missing values via
abductive inference
Incremental imputation
MULTIVARIATE
IMPUTATION
­­­­­— Imputation Methods ­­­­­—
Multivariate regression
Linear, Logistic, Poisson
Use of auxiliary variables – not used in the analysis (predictive of missingness) but can improve imputations
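A much-reduced sketch of the chained-equations idea for two numeric variables: each variable with gaps is regressed on the other, and its missing entries are redrawn from that regression in turn. This is a simplification under stated assumptions (linear models, no draw of the regression parameters themselves, no row with both values missing); names such as `chained_imputation` and the data are hypothetical.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Least-squares slope, intercept and residual standard deviation."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / max(1, len(xs) - 2)) ** 0.5
    return slope, intercept, sd

def chained_imputation(a, b, sweeps=10, rng=None):
    """Return one completed copy of (a, b); None marks a missing entry."""
    rng = rng or random.Random(0)
    cols = [list(a), list(b)]
    missing = [[v is None for v in col] for col in cols]
    # Start from mean-filled columns.
    for col, miss in zip(cols, missing):
        fill = mean(v for v in col if v is not None)
        for i, m in enumerate(miss):
            if m:
                col[i] = fill
    for _ in range(sweeps):
        for target in (0, 1):
            other = 1 - target
            rows = [i for i, m in enumerate(missing[target]) if not m]
            slope, intercept, sd = fit_line([cols[other][i] for i in rows],
                                            [cols[target][i] for i in rows])
            for i, m in enumerate(missing[target]):
                if m:  # redraw: prediction plus noise, so imputations vary
                    cols[target][i] = intercept + slope * cols[other][i] + rng.gauss(0, sd)
    return cols[0], cols[1]

# Hypothetical data, roughly b = 2 * a.
a = [1.0, 2.0, None, 4.0, 5.0, 6.0, None, 8.0]
b = [2.1, None, 6.2, 8.1, None, 12.3, 14.0, 15.8]
imputations = [chained_imputation(a, b, rng=random.Random(seed)) for seed in range(5)]
filled_a, filled_b = imputations[0]
print(filled_a)
print(filled_b)
```

Running the chain with several seeds yields the D completed data sets that multiple imputation then analyzes and pools.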
MISSING VALUES
Expectation-Maximization (EM)
Hartley 1958, Orchard and Woodbury 1972
Dempster et al. 1977
Converges reliably to a local maximum or a
saddle point
Slow to converge in presence of large number of
missing values
M step can be difficult (e.g., has no closed form)
MODEL
BASED
­­­­­— Methods ­­­­­—
Expectation/Conditional Maximization (ECM) – two or more conditional (on parameters) maximization steps
Alternating Expectation/Conditional Maximization (AECM) – complete-data/actual log-likelihood
Parameter-Expanded EM (PX-EM) – include parameters whose values are known during maximization
Variational Bayes
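To show the E and M steps concretely, here is EM for a two-component one-dimensional Gaussian mixture, where the unknown component label of each point plays the role of the missing data. This is a textbook instance chosen for brevity, not an example from the talk; the data, initial values and iteration count are assumptions.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(data, iterations=50):
    """Fit a two-component 1-D Gaussian mixture; also return the log-likelihood trace."""
    mu = [min(data), max(data)]
    sigma = [1.0, 1.0]
    weight = [0.5, 0.5]
    log_likelihoods = []
    for _ in range(iterations):
        # E step: posterior probability of each component for each point.
        resp, ll = [], 0.0
        for x in data:
            p = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in (0, 1)]
            total = p[0] + p[1]
            ll += math.log(total)
            resp.append((p[0] / total, p[1] / total))
        log_likelihoods.append(ll)
        # M step: closed-form weighted updates of the parameters.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            weight[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = max(math.sqrt(var), 1e-6)  # guard against a collapsing component
    return mu, sigma, weight, log_likelihoods

# Synthetic data: two well-separated groups.
rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(300)] + [rng.gauss(5, 1) for _ in range(300)]
mu, sigma, weight, log_likelihoods = em_two_gaussians(data)
print([round(m, 2) for m in mu], [round(w, 2) for w in weight])
```

The log-likelihood trace never decreases from one iteration to the next, which is the property behind the reliable convergence noted on the slide; the limit is still only a local optimum and depends on the starting values.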
READINGS
— Research papers —
On Achieving Energy Efficiency and Reducing CO2 Footprint in Cloud Computing, 2015.
Fractal based Anomaly Detection Over Data Streams, 2013.
Techniques for Optimizing Cloud Footprint, 2011.
Alternatives to the Median Absolute Deviation, 1993.
Knee Point Detection on Bayesian Information Criterion, 2008.
Knee Point Search Using Cascading Top-k Search with Minimized Time Complexity, 2013.
Finding a ‘Kneedle’ in a Haystack: Detecting Knee Points in System Behavior, 2011.
NbClust R Package
“IF I HAVE SEEN FURTHER, IT IS BY STANDING ON THE SHOULDERS OF GIANTS” (ISAAC NEWTON)
Saturated Correlates (“Spider”) Model
Multiple Group SEM (Structural Equation
Modeling)
READINGS
Latent Transition Analysis
General Location Model
­­­­­— Techniques ­­­­­—
Extra Dependent Variable (DV) Model
Bayesian Principal Component Analysis
Local Least Squares Imputation
Full Information Maximum Likelihood (FIML)
COFFEE BREAK
— 50 minutes —
Nat O6 views
Pydata Global 2023 - How can a learnt model unlearn something by SARADINDU SENGUPTA
Pydata Global 2023 - How can a learnt model unlearn somethingPydata Global 2023 - How can a learnt model unlearn something
Pydata Global 2023 - How can a learnt model unlearn something
AZConf 2023 - Considerations for LLMOps: Running LLMs in production by SARADINDU SENGUPTA
AZConf 2023 - Considerations for LLMOps: Running LLMs in productionAZConf 2023 - Considerations for LLMOps: Running LLMs in production
AZConf 2023 - Considerations for LLMOps: Running LLMs in production
6498-Butun_Beyinli_Cocuq-Daniel_J.Siegel-Tina_Payne_Bryson-2011-259s.pdf by 10urkyr34
6498-Butun_Beyinli_Cocuq-Daniel_J.Siegel-Tina_Payne_Bryson-2011-259s.pdf6498-Butun_Beyinli_Cocuq-Daniel_J.Siegel-Tina_Payne_Bryson-2011-259s.pdf
6498-Butun_Beyinli_Cocuq-Daniel_J.Siegel-Tina_Payne_Bryson-2011-259s.pdf
10urkyr347 views
OPPOTUS - Malaysians on Malaysia 3Q2023.pdf by Oppotus
OPPOTUS - Malaysians on Malaysia 3Q2023.pdfOPPOTUS - Malaysians on Malaysia 3Q2023.pdf
OPPOTUS - Malaysians on Malaysia 3Q2023.pdf
Oppotus34 views
Data about the sector workshop by info828217
Data about the sector workshopData about the sector workshop
Data about the sector workshop
info82821729 views
Product Research sample.pdf by AllenSingson
Product Research sample.pdfProduct Research sample.pdf
Product Research sample.pdf
AllenSingson35 views
CRM stick or twist.pptx by info828217
CRM stick or twist.pptxCRM stick or twist.pptx
CRM stick or twist.pptx
info82821711 views
GDG Cloud Community Day 2022 - Managing data quality in Machine Learning by SARADINDU SENGUPTA
GDG Cloud Community Day 2022 -  Managing data quality in Machine LearningGDG Cloud Community Day 2022 -  Managing data quality in Machine Learning
GDG Cloud Community Day 2022 - Managing data quality in Machine Learning
Dr. Ousmane Badiane-2023 ReSAKSS Conference by AKADEMIYA2063
Dr. Ousmane Badiane-2023 ReSAKSS ConferenceDr. Ousmane Badiane-2023 ReSAKSS Conference
Dr. Ousmane Badiane-2023 ReSAKSS Conference
AKADEMIYA20635 views

Data Data Everywhere: Not An Insight to Take Action Upon

  • 1. DATA DATA EVERYWHERE — Not An Insight to Take Action Upon — ARUN KEJARIWAL
  • 2. WHAT’S UP WITH THE TITLE? — Metrics Arms Race — • “… scaling it to about two million distinct time series …” (Netflix) • “… highly accurate, real-time alerts on millions of system and business metrics …” (Uber) • “As we have hundreds of systems exposing multiple data items, the write data rate might easily exceed tens of millions of data points each second.” (Facebook) • “… we are taking tens of billions of measurements …” (Google) • “The Observability stack collects 170 million individual data metrics (time series) …” (Twitter) • “… serving over 50 million distinct time series.” (Spotify)
  • 3. WHAT’S UP WITH THE TITLE? — Metrics Arms Race — • >95% of metrics data is NEVER read!! • Legacy instrumentation • Lack of understanding of how to use metrics (e.g., latency: 10p, 20p, …, 90p, 95p, 99p, 99.9p, mean) • Retention: “Hard disk is cheap”
  • 4. WHAT’S UP WITH THE TITLE? INSPIRATION: “Rime of the Ancient Mariner” by Samuel Taylor Coleridge. “Water, water, every where, Nor any drop to drink.” — Rooted in 1798! — Image Source: https://ebooks.adelaide.edu.au/c/coleridge/samuel_taylor/rime/
  • 5. DATA EXPLOSION — Relation to Big/Fast Data — Growing number of data sources: • Mobile (Smartphones, Tablets, Smart Watches) • IoT • Wearables. Data collection has become a commodity: • Millions of time series. DATA, DATA, EVERYWHERE
  • 6. DATA, DATA, EVERYWHERE — Non-trivial to Mine Actionable Insights —
  • 7. DATA, DATA, EVERYWHERE — Time to Market, sinking ships – Can get “lonely” — Image Source: https://ebooks.adelaide.edu.au/c/coleridge/samuel_taylor/rime/
  • 8. DATA, DATA, EVERYWHERE — Value-driving analytics, NOT pristine data sets, interesting patterns or killer algorithms — 1896 Olympics, Greece: Thomas Burke. Purpose and ability to act. Applying hard metrics and asking hard questions. “Advanced data analytics is a quintessential business matter.” [1] [1] http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/making-data-analytics-work-for-you-instead-of-the-other-way-around, McKinsey, October 2016.
  • 9. DATA, DATA, EVERYWHERE — Purpose and Action — BREAK IT DOWN: The impact of “big data” analytics is often manifested by thousands—or more—of incrementally small improvements. If an organization can atomize a single process into its smallest parts and implement advances where possible, the payoffs can be profound. [1] ITERATE: “Victory often resulted from the way decisions are made; the side that reacts to situations more quickly and processes new information more accurately should prevail.” [1] http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/making-data-analytics-work-for-you-instead-of-the-other-way-around, McKinsey, October 2016.
  • 10. DATA ANALYTICS — Not a Hype, But Not Trivial Either — PERFORMANCE, KNEE POINT DETECTION, AVAILABILITY, ROOT CAUSE ANALYSIS, AUTOSCALING, EFFICIENCY, A/B TESTING, INTRUSION DETECTION
  • 11. ANOMALY DETECTION — >100 years History — Bessel and Baeyer 1838, Chauvenet 1863, Stone 1868, Wright 1884, Irwin 1925, Student 1927, Thompson 1935. “… present an exact rule for the rejection of observations, which shall be legitimately derived from the fundamental principles of Calculus of Probabilities.”
  • 12. ANOMALY DETECTION — Prior to 1950s —
  • 13. ANOMALY DETECTION — Prior to 1950s —
  • 14. ANOMALY DETECTION — Check your assumptions — NORMAL DIST, STATIONARITY, CONTEXT. ROBUST STATISTICS: Median (Wright 1599, “Certaine Errors in Navigation”; Cournot 1843; Fechner 1874; Galton 1882; Edgeworth 1887). Median Absolute Deviation (median_i |X_i – median(X)|). Estimators Sn (median_i {median_j |X_i – X_j|}) and Qn (the k-th order statistic of {|X_i – X_j|; i < j}, roughly the first quartile of the pairwise distances)
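The robust estimators named on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: it uses plain medians and leaves out the consistency constants and low/high median conventions of the published Sn and Qn estimators, and the sample data, the function names, and the 3.0 cut-off are made up for the example.

```python
import statistics
from itertools import combinations
from math import comb

def mad(xs):
    """Median absolute deviation: median_i |x_i - median(x)| (unscaled)."""
    m = statistics.median(xs)
    return statistics.median(abs(x - m) for x in xs)

def sn(xs):
    """Simplified Sn: median over i of the median over j of |x_i - x_j| (naive O(n^2))."""
    return statistics.median(
        statistics.median(abs(a - b) for b in xs) for a in xs
    )

def qn(xs):
    """Simplified Qn: k-th smallest pairwise distance, k = C(h, 2), h = n // 2 + 1."""
    h = len(xs) // 2 + 1
    k = comb(h, 2)
    diffs = sorted(abs(a - b) for a, b in combinations(xs, 2))
    return diffs[k - 1]

def mad_outliers(xs, threshold=3.0):
    """Flag points whose robust z-score exceeds the (assumed) threshold."""
    m = statistics.median(xs)
    scale = 1.4826 * mad(xs)  # makes MAD comparable to a standard deviation under normality
    if scale == 0:
        return []
    return [x for x in xs if abs(x - m) / scale > threshold]

samples = [1, 2, 3, 4, 100]  # hypothetical measurements with one gross outlier
print(mad(samples), sn(samples), qn(samples), mad_outliers(samples))
```

All three scale estimates stay small despite the value 100, which is the point of using them in place of the standard deviation when the normality assumption is in doubt.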
  • 15. ANOMALY DETECTION — Early work post 1950 —
  • 16. ANOMALY DETECTION — >100 years History —
  • 17. ANOMALY DETECTION — >100 years History —
  • 18. ANOMALY DETECTION — Operations — Increasing focus of monitoring solutions (RUM/Synthetic): DataDog, Catchpoint, Opsclarity, Ruxit; Netflix, Airbnb, Cloudera. EGADS: A Scalable, Configurable, and Novel Anomaly Detection System [1]. Introducing Practical and Robust Anomaly Detection in Time Series [2]. Identifying Outages with Argos, Uber Engineering’s Real-Time Monitoring and Root-Cause Exploration Tool [3]. Robust Anomaly Detection System for Real User Monitoring Data [4]. [1] https://research.yahoo.com/news/announcing-open-source-egads-scalable-configurable-and-novel-anomaly-detection-system, 2015 (https://github.com/yahoo/egads) [2] https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series, 2015 (https://github.com/twitter/AnomalyDetection) [3] https://eng.uber.com/argos/, 2015 [4] http://bit.ly/luminol-velocity, 2016 (https://github.com/linkedin/luminol)
  • 19. ANOMALY DETECTION — As A Service On The Cloud — AZURE [1], AWS [2], IBM Cloud [3] [1] https://azure.microsoft.com/en-us/documentation/articles/machine-learning-apps-anomaly-detection [2] https://aws.amazon.com/blogs/iot/anomaly-detection-using-aws-iot-and-aws-lambda [3] https://developer.ibm.com/recipes/tutorials/engage-machine-learning-for-detecting-anomalous-behaviors-of-things
  • 21. ANOMALY DETECTION — Finding needle in a haystack — Finding outlier nodes: HADOOP CLUSTER, LOAD BALANCER, DISTRIBUTED DATABASE. Clustering Techniques: DBSCAN, OPTICS, K-MEANS/MEDOID
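As a rough illustration of finding outlier nodes with density-based clustering, the sketch below implements a small, naive DBSCAN in pure Python and applies it to per-node metrics. The node names, the (CPU %, p99 latency ms) values, and the `eps`/`min_pts` settings are hypothetical; a real deployment would normally scale the features and use a library implementation.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    NOISE = -1
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Naive O(n) region query; includes the point itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point: keep expanding
                queue.extend(nbrs)
    return labels

# Hypothetical per-node metrics: (CPU %, p99 latency in ms).
nodes = {
    "node-1": (40, 120), "node-2": (42, 118), "node-3": (41, 125),
    "node-4": (39, 121), "node-5": (43, 119), "node-6": (95, 480),
}
labels = dbscan(list(nodes.values()), eps=10, min_pts=3)
outliers = [name for name, label in zip(nodes, labels) if label == -1]
print(outliers)
```

Points that DBSCAN leaves unassigned (label -1) are the candidate outlier nodes; unlike k-means, no cluster count has to be chosen up front.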
  • 22. ANOMALY DETECTION — Distinct Approaches — Time Domain VS Frequency Domain. Techniques: time series analysis; statistical tests, clustering; Fast Fourier Transform, DWT; fractals, filters (Kalman Filter). Real-time implications: frequency-based techniques may incur additional overhead
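One of the techniques listed here, the Kalman filter, can be sketched for a single metric as follows. This is a minimal scalar version assuming a random-walk state model; the noise variances `q` and `r`, the threshold, and the sample series are illustrative values, not settings from the talk.

```python
from math import sqrt

def kalman_anomalies(series, q=0.01, r=0.25, threshold=3.0):
    """Flag indices whose innovation exceeds `threshold` predicted standard deviations."""
    x, p = series[0], 1.0  # initial state estimate and its variance (assumed)
    flagged = []
    for t, z in enumerate(series[1:], start=1):
        p_pred = p + q            # predict: random walk, so the state estimate carries over
        innovation = z - x        # difference between observation and prediction
        s = p_pred + r            # innovation variance
        if abs(innovation) > threshold * sqrt(s):
            flagged.append(t)
            p = p_pred            # skip the update so the anomaly does not drag the state
            continue
        k = p_pred / s            # Kalman gain
        x = x + k * innovation    # update state estimate
        p = (1 - k) * p_pred      # update estimate variance
    return flagged

series = [10.0, 10.2, 9.9, 10.1, 10.0, 25.0, 10.1, 9.8]  # hypothetical metric values
flagged = kalman_anomalies(series)
print(flagged)
```

Each step costs a handful of arithmetic operations per data point, which is why this kind of time-domain filter is convenient for real-time use compared with transforms computed over a window.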
  • 23. ANOMALY DETECTION — Distinct Approaches — UNSUPERVISED: Clustering; the common case in Operations. SUPERVISED: Decision Trees, SVM, Neural Networks; higher accuracy, Bias-Variance Tradeoff
  • 24. ANOMALY DETECTION — Too Many Alerts — Complex Architectures: Hundreds of Microservices. Loss of productivity (TTD, TTR). Impact on end-user experience. FALSE POSITIVE/NEGATIVE, TRUE POSITIVE
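The false positive, false negative, and true positive counts on this slide are commonly summarized as precision and recall. The sketch below shows the arithmetic on made-up alert and incident identifiers; the function and variable names are illustrative only.

```python
def alert_quality(fired, real):
    """Compare the set of fired alerts with the set of real incidents."""
    tp = len(fired & real)   # alerts that matched a real incident
    fp = len(fired - real)   # alerts with no incident behind them (noise)
    fn = len(real - fired)   # incidents that no alert caught
    precision = tp / (tp + fp) if fired else 0.0
    recall = tp / (tp + fn) if real else 0.0
    return tp, fp, fn, precision, recall

fired_alerts = {1, 2, 3, 4, 5, 6, 7, 8}   # hypothetical time slots with an alert
real_incidents = {2, 5, 9}                # hypothetical time slots with a real problem
tp, fp, fn, precision, recall = alert_quality(fired_alerts, real_incidents)
print(tp, fp, fn, precision, recall)
```

Here only a quarter of the alerts are actionable, which is the "too many alerts" situation: responders spend time on six false positives while one real incident is still missed.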
  • 25. ANOMALY DETECTION — Properties — ACTIONABLE, PRODUCTIVITY, SEVERITY, PRIORITIZATION, ROOT CAUSE ANALYSIS, CORRELATION
  • 26. MISSING DATA — Often Overlooked During Analysis — SOURCES: Data collection issues (network hiccups, bursty traffic, queue overflow – packet loss); System failures (bugs, hardware) – cascading effects. Makes analysis non-trivial: methods are in general not prepared to handle missing data. Loss of efficiency: fewer patterns extracted from data; conclusions statistically less strong. Inference bias: resulting from differences between missing and complete data. Larger standard error: reduced sample size
  • 27. MISSING DATA — A Large Amount of Prior Work —
  • 28. MISSING DATA — A Large Amount of Prior Work —
  • 29. MISSING DATA — Characterization of Missing Values — MISSING COMPLETELY AT RANDOM: does not depend on either the observed or missing data. MISSING AT RANDOM: depends on the observed data, but not on missing data. MISSING NOT AT RANDOM: depends on missing/unobserved (latent) values. Cause: random; uncorrelated with variables of interest. Most Common Assumption. Factors: correlation between cause of missingness and variables of interest; missingness of variable of interest. Common Case!
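The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: `x` stands for a fully observed covariate, `y` for the variable of interest, and the drop probabilities, sample size, and seed are arbitrary assumptions.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000
x = [rng.random() for _ in range(n)]             # always observed
y = [2 * xi + rng.gauss(0, 0.1) for xi in x]     # variable of interest, may go missing

def observed_mean(drop_prob):
    """Mean of y over the rows that survive the given missingness mechanism."""
    kept = [yi for xi, yi in zip(x, y) if rng.random() >= drop_prob(xi, yi)]
    return mean(kept)

true_mean = mean(y)
# MCAR: missingness ignores both x and y.
mcar_mean = observed_mean(lambda xi, yi: 0.3)
# MAR: missingness depends only on the observed covariate x.
mar_mean = observed_mean(lambda xi, yi: 0.6 if xi > 0.5 else 0.1)
# MNAR: missingness depends on the (unobserved) value of y itself.
mnar_mean = observed_mean(lambda xi, yi: 0.6 if yi > 1.0 else 0.1)

print(round(true_mean, 3), round(mcar_mean, 3), round(mar_mean, 3), round(mnar_mean, 3))
```

Under MCAR the complete-case mean stays close to the true mean. Under MAR and MNAR it is pulled down, but only in the MAR case does the observed `x` carry the information needed to correct for it.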
  • 30. MISSING DATA — TAXONOMY of METHODS — Procedures Based on Completely Recorded Units: discard variables with missing data; subset data to ignore missing data. Weighting Procedures: differentially weigh the complete cases to adjust for bias; weights are a function of response probability. Imputation Based: assign a value to missing one; leverage existing ML/data mining methods. Model Based: define a model for the observed data; inference based on the likelihood or posterior distribution under the model
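The weighting idea (complete cases weighted by a function of the response probability) can be illustrated with inverse-probability weights. The sketch assumes the response probabilities are known, which in practice they are not and must be estimated; the two groups and their values are made up.

```python
def naive_mean(cases):
    """Complete-case mean that ignores why values are missing."""
    return sum(value for value, _ in cases) / len(cases)

def weighted_mean(cases):
    """Weight each complete case by 1 / (its response probability)."""
    numerator = sum(value / prob for value, prob in cases)
    denominator = sum(1 / prob for _, prob in cases)
    return numerator / denominator

# Hypothetical population: 10 units with value 0.5 (response probability 0.9)
# and 10 units with value 1.5 (response probability 0.4); true mean is 1.0.
# The complete cases below are what survives: 9 of the first group, 4 of the second.
complete_cases = [(0.5, 0.9)] * 9 + [(1.5, 0.4)] * 4

print(naive_mean(complete_cases), weighted_mean(complete_cases))
```

The naive mean is biased toward the group that responds more often, while the weighted mean recovers the population value in this constructed example.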
  • 31. MISSING VALUES — Imputation Methods — Transformations, Normalization, Prediction. Mean Value, Last Value, Mean of x(t-1) and x(t+1), Most Common Value, Regression, Similarity, K-Nearest Neighbor, K-Means/Medoid, Fuzzy K-Means/Medoid, SVM, SVD, Event Covering
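The three simplest methods on this slide (mean value, last value, and the mean of the two neighbors) can be sketched as below. Missing points are represented as `None`, and the series is a made-up example; edge cases such as a gap at the start of the series are deliberately left unfilled.

```python
def impute_mean(series):
    """Replace every gap with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in series]

def impute_last_value(series):
    """Carry the last observed value forward (leading gaps stay None)."""
    out, last = [], None
    for v in series:
        if v is not None:
            last = v
        out.append(last)
    return out

def impute_neighbor_mean(series):
    """Fill a gap with the mean of x(t-1) and x(t+1) when both are observed."""
    out = list(series)
    for t in range(1, len(series) - 1):
        if series[t] is None and series[t - 1] is not None and series[t + 1] is not None:
            out[t] = (series[t - 1] + series[t + 1]) / 2
    return out

series = [10.0, None, 14.0, 16.0, None, 20.0]  # hypothetical metric with two gaps
print(impute_mean(series))
print(impute_last_value(series))
print(impute_neighbor_mean(series))
```

On a trending series like this one, the neighbor mean follows the trend, last-value imputation lags it, and mean imputation ignores it entirely.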
  • 32. MISSING VALUES — Imputation Methods — MULTIPLE IMPUTATION: Replace each missing value by a vector of D ≥ 2 imputed values. Single imputation cannot reflect sampling variability under one model for missing data or uncertainty about the correct model for missing data. D complete-data inferences are combined to form one inference that reflects uncertainty due to missing data under that model. Throws light on sensitivity of inference to models for missing data. Resampling Methods: Jackknifing (Quenouille, 1949) and Bootstrapping (Efron, 1979); based on large-sample theory. MI: based on Bayesian theory and provides useful inferences for small samples
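The step of combining D complete-data inferences into one is usually done with the pooling formulas known as Rubin's rules. The sketch below applies them to made-up estimates and variances from D = 3 imputed data sets; the function name and inputs are illustrative.

```python
from statistics import mean, variance

def pool(estimates, variances):
    """Combine D complete-data estimates and their variances into one inference."""
    d = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (divides by D - 1)
    total = within + (1 + 1 / d) * between
    return q_bar, total

# Hypothetical results of the same analysis run on three imputed data sets.
estimates = [10.0, 11.0, 12.0]
variances = [1.0, 1.0, 1.0]
pooled_estimate, total_variance = pool(estimates, variances)
print(pooled_estimate, total_variance)
```

The between-imputation term is what a single imputation cannot provide: it inflates the total variance to reflect the uncertainty caused by the missing values.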
  • 33. MISSING VALUES — Imputation Methods — MULTIVARIATE IMPUTATION. MICE: Chained Equations (Sequential Regression Multiple Imputation); assumes missing data is MAR; each variable with missing data is modeled conditional upon the other variables in the data. Multivariate regression: Linear, Logistic, Poisson. Use of auxiliary variables: not used in the analysis (predictive of missingness) but can improve imputations. Learn a Bayesian Network using complete data and use it to simultaneously impute all missing values via abductive inference. Incremental imputation
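The chained-equations idea, where each variable with gaps is modeled conditional on the others and the models are cycled, can be sketched for two variables as follows. This is a deliberately simplified, deterministic version: it fills gaps with the regression prediction rather than drawing from a predictive distribution, so it yields a single imputation, not the multiple imputations that MICE produces. The data and function names are made up.

```python
def fit_line(us, vs):
    """Least-squares fit v = a + b * u; returns (a, b)."""
    u_bar = sum(us) / len(us)
    v_bar = sum(vs) / len(vs)
    b = (sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
         / sum((u - u_bar) ** 2 for u in us))
    return v_bar - b * u_bar, b

def chained_impute(x, y, iterations=20):
    """Alternate between regressing x on y and y on x to refine the imputed cells."""
    miss_x = [i for i, v in enumerate(x) if v is None]
    miss_y = [i for i, v in enumerate(y) if v is None]
    obs_x = [v for v in x if v is not None]
    obs_y = [v for v in y if v is not None]
    # Start from mean imputation.
    x = [sum(obs_x) / len(obs_x) if v is None else v for v in x]
    y = [sum(obs_y) / len(obs_y) if v is None else v for v in y]
    for _ in range(iterations):
        # Model x given y on rows where x was observed, then re-impute missing x.
        rows = [i for i in range(len(x)) if i not in miss_x]
        a, b = fit_line([y[i] for i in rows], [x[i] for i in rows])
        for i in miss_x:
            x[i] = a + b * y[i]
        # Model y given x on rows where y was observed, then re-impute missing y.
        rows = [i for i in range(len(y)) if i not in miss_y]
        a, b = fit_line([x[i] for i in rows], [y[i] for i in rows])
        for i in miss_y:
            y[i] = a + b * x[i]
    return x, y

# Hypothetical data following y = 2x + 1, with one gap in each variable.
x_raw = [1.0, 2.0, None, 4.0, 5.0, 6.0]
y_raw = [3.0, 5.0, 7.0, 9.0, None, 13.0]
x_filled, y_filled = chained_impute(x_raw, y_raw)
print(round(x_filled[2], 4), round(y_filled[4], 4))
```

Because the example data are exactly linear, the cycle settles on the values the underlying relationship implies; with noisy data the imputations would instead converge to regression predictions.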
  • 34. MISSING VALUES — Methods — MODEL BASED. Expectation-Maximization (EM): Hartley 1958, Orchard and Woodbury 1972, Dempster et al. 1977. Converges reliably to a local maximum or a saddle point. Slow to converge in presence of large number of missing values. M step can be difficult (e.g., has no closed form). Expectation/Conditional Maximization (ECM) – two or more conditional (on parameters) maximization steps. Alternating Expectation/Conditional Maximization (AECM) – complete-data/actual loglikelihood. Parameter-Expanded EM (PX-EM) – include parameters whose values are known during maximization. Variational Bayes
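For reference, the two steps of EM in the missing-data setting can be written out as follows. The notation is an assumption introduced here, not taken from the slides: $Y_{\text{obs}}$ and $Y_{\text{mis}}$ are the observed and missing parts of the data, $\theta$ the model parameters, and $\theta^{(t)}$ the estimate at iteration $t$.

```latex
% E-step: expected complete-data log-likelihood given the observed data
% and the current parameter estimate
Q\bigl(\theta \mid \theta^{(t)}\bigr)
  = \mathbb{E}_{Y_{\text{mis}} \mid Y_{\text{obs}},\, \theta^{(t)}}
    \bigl[\, \log L(\theta;\, Y_{\text{obs}}, Y_{\text{mis}}) \,\bigr]

% M-step: maximize that expectation over the parameters
\theta^{(t+1)} = \arg\max_{\theta}\; Q\bigl(\theta \mid \theta^{(t)}\bigr)

% Each iteration does not decrease the observed-data log-likelihood
\log L\bigl(\theta^{(t+1)};\, Y_{\text{obs}}\bigr)
  \;\ge\; \log L\bigl(\theta^{(t)};\, Y_{\text{obs}}\bigr)
```

When the maximization has no closed form, ECM replaces the single M-step with a sequence of simpler conditional maximizations, each over a subset of the parameters with the rest held fixed.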
  • 35. READINGS — Research papers — “IF I HAVE SEEN FURTHER, IT IS BY STANDING ON THE SHOULDERS OF GIANTS” (ISAAC NEWTON). On Achieving Energy Efficiency and Reducing CO2 Footprint in Cloud Computing, 2015. Fractal based Anomaly Detection Over Data Streams, 2013. Techniques for Optimizing Cloud Footprint, 2011. Alternatives to the Median Absolute Deviation, 1993. Knee Point Detection on Bayesian Information Criterion, 2008. Knee Point Search Using Cascading Top-k Search with Minimized Time Complexity, 2013. Finding a ‘Kneedle’ in a Haystack: Detecting Knee Points in System Behavior, 2011. NbClust R Package
  • 36. READINGS — Techniques — Saturated Correlates (“Spider”) Model; Multiple Group SEM (Structural Equation Modeling); Latent Transition Analysis; General Location Model; Extra Dependent Variable (DV) Model; Bayesian Principal Component Analysis; Local Least Squares Imputation; Full Information Maximum Likelihood (FIML)
  • 37. COFFEE BREAK — 50 minutes —