Finding the right Machine Learning method for predictive modeling

Finding the right ML technique for Predictive Modeling
SK Reddy
skreddy99

Basics of anomaly detection
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152173
Novelty vs Anomaly

Credit Card transactions
Contextual anomaly
Fraud detection
across various
application domains

http://tylervigen.com/spurious-correlations
Strange correlations

SVM
KNN
Classical ML Classifiers
Naïve Bayes

Predictive analyses algos
https://d.dam.sap.com/a/xOXXb/50764_GB_46588_en.pdf

Predicting the real-time availability of 200 million grocery items -
Instacart
https://tech.instacart.com/predicting-real-time-availability-of-200-million-grocery-items-in-us-canada-stores-61f43a16eafe
The problem:
understanding “not founds”
Routes followed by shoppers in SF, Austin, Boston and Miami
https://tech.instacart.com/space-time-and-groceries-a315925acf3a

Instacart

Feature engineering: item level features, time-based features, and categorical features
https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2
Instacart

Leveraging Elastic Demand for Forecasting
https://tech.instacart.com/leveraging-elastic-demand-for-forecasting-6278b45f805f; https://arxiv.org/pdf/1809.03018.pdf

Goal: shifting the right amount of elastic demand to minimize the difference between the new
demand series
Input: historical demand series, and the amount of elastic demand
Output: shifted demand series

Demand Variance

A larger amount of elastic demand leads to
a smaller variance. However, the first 10%
of elastic demand produces the largest
variance reduction.

Predicting creditworthiness in retail banking with limited scoring data
https://www.sciencedirect.com/science/article/pii/S0950705116300156?via%3Dihub

Credit Card Fraud Detection
http://isyou.info/inpra/papers/inpra-v5n4-02.pdf
Characteristic of the mobile payment dataset
F-measure after classification

“Nowcasting” Recession
https://arxiv.org/pdf/1903.03202.pdf
SVM:
• Linear (classes separable with a linear
hyperplane)
• Non-linear
Features
1. Monthly log difference in nonfarm payrolls
2. Log difference in average monthly price of
the S&P 500
3. Production index from Manufacturing ISM
Report (info about the goods market)
4. 10-year Treasury yield minus the federal
funds rate
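The four features above are simple transformations of public monthly series. A minimal sketch of how they could be computed, assuming plain Python lists of monthly values (the function names and all numbers below are hypothetical, not taken from the paper):

```python
import math

def log_diff(series):
    """Return log(x_t) - log(x_{t-1}) for each pair of consecutive values."""
    return [math.log(curr) - math.log(prev) for prev, curr in zip(series, series[1:])]

def term_spread(ten_year_yield, fed_funds_rate):
    """10-year Treasury yield minus the federal funds rate, month by month."""
    return [long - short for long, short in zip(ten_year_yield, fed_funds_rate)]

# Hypothetical monthly values for illustration only, not real data.
payrolls = [150000.0, 150300.0, 150150.0]
sp500 = [2800.0, 2850.0, 2700.0]
ism_production = [54.0, 53.5, 51.0]
ten_year = [2.7, 2.6, 2.4]
fed_funds = [2.4, 2.4, 2.4]

# The first month is dropped because a log difference needs a previous value.
feature_rows = list(zip(
    log_diff(payrolls),
    log_diff(sp500),
    ism_production[1:],
    term_spread(ten_year, fed_funds)[1:],
))
```

Each row of `feature_rows` is one month's feature vector, ready to be passed to a classifier such as an SVM.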
SVM Dual Parameter and NBER Recessions

https://arxiv.org/pdf/1804.10796.pdf
Handling Uncertainty in Social Lending
Credit Risk Prediction
The results of combining the three classifiers through a Choquet fuzzy integral
approach compared to the performance of each base classifiers alone

Deep Autoencoders
Variational Autoencoder (VAE)
A standard Autoencoder
A variational Autoencoder

Latent representation of bank customers
Learning Latent Representations of Bank Customers

Predicting bankruptcy –
evaluating the performance of various methods
https://towardsdatascience.com/predicting-bankruptcy-f4611afe8d2c
Classifiers tried
1. Logistic Regression
2. Perceptron as a classifier
3. Deep Neural Network Classifiers (with different size and
depth)
4. Fisher Linear Discriminant Analysis
5. K Nearest Neighbor Classifier (with different values of k)
6. Naive Bayes Classifier
7. Decision Tree (with different bucket size thresholds)
8. Bagged Decision Trees
9. Random Forest (with different tree sizes)
10. Gradient Boosting
11. Support Vector Machines (with different kernels)
Random Forest
kNN
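Several classifiers in the list are tried "with different values" of a hyperparameter. As an illustration for kNN, here is a minimal sketch that compares a few values of k using leave-one-out accuracy. The toy data, the 0/1 labels, and the function names are assumptions for the example and do not come from the bankruptcy study.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbors = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return hits / len(X)

# Hypothetical two-feature data: label 0 and label 1 form two separate groups.
X = [(0.1, 0.2), (0.0, 0.4), (0.3, 0.1), (0.2, 0.3),
     (2.0, 2.1), (2.2, 1.9), (1.9, 2.3), (2.1, 2.0)]
y = [0, 0, 0, 0, 1, 1, 1, 1]

accuracy_by_k = {k: loo_accuracy(X, y, k) for k in (1, 3, 5)}
```

On real data the accuracies would differ across k, and the value with the best held-out score would be kept.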

Predicting bankruptcy –
evaluating the performance of various methods
Model comparison
https://towardsdatascience.com/predicting-bankruptcy-f4611afe8d2c

Production forecasting
Schematic diagram for multi-step-ahead predictions

https://arxiv.org/pdf/1809.00542v2.pdf
Predicting thermal power
consumption of the Mars Express Spacecraft
Distribution of different feature groups in the feature rankings produced by (a) G-RF, (b) L-RF and (c) XGB
ensembles, for the 33 power lines

Deep Learning based Anomaly detection

Anomaly detection approach within the context of an X-ray security screening problem
GANomaly: Semi-Supervised Anomaly Detection via
Adversarial Training; 2018

AAD: Adaptive Anomaly Detection through traffic surveillance videos
Pixel movement across the frame after ∆t
PASCAL Visual Object Classes

Video Anomaly Detection
Variational autoencoder

RADS: Real-time Anomaly Detection System for Cloud Data Centers

15 Million Battery Voltages and Current
Example: Anomaly Detection of Sensor Data Using
Distance-Based Failure Analysis
Sensor data from the same 15m Batteries
Can you find the anomaly?

Time series
http://www.unofficialgoogledatascience.com/2017/04/our-quest-for-robust-time-series.html
Time series ensemble
• Bass Diffusion Model
• Theta Model
• Logistic models
• Bayesian Structural Time Series
• STL (Seasonal-Trend Decomposition Procedure Based on Loess)
• Holt-Winters and other Exponential Smoothing models
• Seasonal and other ARIMA-based models
• Year-over-Year growth models,
• custom models, and
• more
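The idea of a time series ensemble is to run several forecasters and combine their outputs so that no single model's failure dominates. A minimal sketch, assuming three deliberately simple stand-in models (naive, drift, and simple exponential smoothing) and a per-step median as the combination rule. The models and the combination rule here are illustrative assumptions, not the ones used in the linked post.

```python
import statistics

def naive_forecast(series, horizon):
    """Repeat the last observed value."""
    return [series[-1]] * horizon

def drift_forecast(series, horizon):
    """Extend the straight line through the first and last observations."""
    slope = (series[-1] - series[0]) / (len(series) - 1)
    return [series[-1] + slope * h for h in range(1, horizon + 1)]

def ses_forecast(series, horizon, alpha=0.5):
    """Simple exponential smoothing: forecast the final smoothed level."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return [level] * horizon

def ensemble_forecast(series, horizon, models):
    """Combine the models' forecasts with a per-step median."""
    forecasts = [model(series, horizon) for model in models]
    return [statistics.median(step) for step in zip(*forecasts)]

history = [10.0, 12.0, 13.0, 15.0, 16.0]  # hypothetical demand values
combined = ensemble_forecast(
    history, 3, [naive_forecast, drift_forecast, ses_forecast]
)
```

With an odd number of models the median always returns one model's forecast for each step, which keeps a single outlying forecast from pulling the result.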

The demand data over the 2010-2015 timeframe
Combining Multiple Methods To Improve Time Series Prediction
Step 1
Step 2
Step 3
The estimated trend (Hodrick-Prescott Filter)
• Trend (the increase or decrease in the
series over a period of time),
• Seasonality (the fluctuation that occurs
within the series over each week, each
month, etc.)
• Residuals (the data points that fall outside
of the expected data range)
Multi-seasonality (Loess method)
Step 4
Step 5
(after Elastic Net Regression and Fourier
transformation)
https://labs.eleks.com/2016/10/combined-different-methods-create-advanced-time-series-prediction.html
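The split into trend, seasonality, and residuals can be sketched with a classical additive decomposition. Note that this example swaps in a centered moving average for the trend and per-phase averages for seasonality. It does not use the Hodrick-Prescott filter or the Loess method named on the slide. The function names and the toy series are assumptions.

```python
def moving_average_trend(series, period):
    """Centered moving average; assumes an odd `period`. Edges stay None."""
    half = period // 2
    trend = [None] * len(series)
    for t in range(half, len(series) - half):
        window = series[t - half: t + half + 1]
        trend[t] = sum(window) / period
    return trend

def decompose(series, period):
    """Additive decomposition: series = trend + seasonal + residual."""
    trend = moving_average_trend(series, period)
    detrended = [x - m if m is not None else None for x, m in zip(series, trend)]
    # Average the detrended values that share the same position in the cycle.
    seasonal_means = []
    for phase in range(period):
        values = [d for i, d in enumerate(detrended)
                  if d is not None and i % period == phase]
        seasonal_means.append(sum(values) / len(values))
    # Center the seasonal component so it sums to zero over one cycle.
    offset = sum(seasonal_means) / period
    seasonal = [seasonal_means[i % period] - offset for i in range(len(series))]
    residual = [x - m - s if m is not None else None
                for x, m, s in zip(series, trend, seasonal)]
    return trend, seasonal, residual

# Hypothetical series: linear trend plus a repeating pattern of period 3.
series = [float(t) + (1.0, 0.0, -1.0)[t % 3] for t in range(12)]
trend, seasonal, residual = decompose(series, period=3)
```

For this noise-free toy series the residuals are zero wherever the trend is defined. On real data, large residuals are the candidates for further modeling or anomaly inspection.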

Turkish Electricity data
http://www.unofficialgoogledatascience.com/2017/04/our-quest-for-robust-time-series.html
Ensemble

Modeling and Forecasting
Vehicle Fleet Maintenance
Vehicles
Maintenance
(a) 3-mode data tensor; (b) the same tensor as a stacked series of frontal slices, or
arrays; (c) an example single frontal slice of a vehicle data tensor used in this analysis
(each entry corresponds to the count of a specific job type for a vehicle at a fixed time)

PARAFAC 3-way plot of absolute-time analysis. High factor weights in
the top panel are for 2014 Terrastar Horton vehicles, an ambulance.
The bottom two panels show systems (Body, Cab/Sheet Metal,
Engine and Motor, and Preventive Maintenance Service) and time
frames where this maintenance most often occurs.
PARAFAC 3-way plot of vehicle lifetime analysis revealing a simple
pattern common to almost all vehicles, as demonstrated by the
consistent loading across the vehicle factor (top panel):
tires/tubes/valves/liners replacement during the second year of
lifetime, with few repairs to this system either before or after.

PARAFAC 3-way plot of vehicle lifetime analysis showing the
2012 Freightliner M2112V, a Department of Solid Waste
garbage truck. This plot reveals a strong pattern of increased
maintenance in years 2-4 after purchase, focusing on a variety
of technical systems: hydraulics, lighting, gauges and warning
devices, and cooling systems.
PARAFAC 3-way plot of absolute-time analysis. This plot demonstrates
strong and specific maintenance patterns for the 2015 Smeal SST
Pumper fire truck. It shows extensive and specific repair to the engine
systems with little other maintenance, from late 2015 through 2016.

• Local Outlier Factor (LOF)
• Connectivity-Based Outlier Factor
(COF)
• Influenced Outlierness (INFLO)
• Local Outlier Probability (LoOP)
• Local Correlation Integral (LOCI)
• Approximate Local Correlation
Integral (aLOCI)
• Cluster-Based Local Outlier Factor
(CBLOF/ uCBLOF)
• Local Density Cluster-based
Outlier Factor (LDCOF)
• Clustering-based Multivariate
Gaussian Outlier Score (CMGOS)
• Histogram-based Outlier Score
(HBOS)
• One-Class Support Vector
Machine
• Robust Principal Component
Analysis (rPCA)
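To make the first entry of the list concrete, here is a minimal sketch of the Local Outlier Factor (LOF) in plain Python. It assumes all points are distinct and breaks distance ties by index, which simplifies the original definition. The function name and the toy points are hypothetical.

```python
import math

def local_outlier_factor(points, k):
    """Return one LOF score per point; values well above 1 suggest outliers."""
    n = len(points)
    neighbors, k_dist = [], []
    for i in range(n):
        ranked = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbors.append([j for _, j in ranked[:k]])
        k_dist.append(ranked[k - 1][0])  # distance to the k-th nearest neighbor

    def reach_dist(i, j):
        # Reachability distance of point i with respect to neighbor j.
        return max(k_dist[j], math.dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [
        1.0 / (sum(reach_dist(i, j) for j in neighbors[i]) / k)
        for i in range(n)
    ]
    # LOF: average neighbor density relative to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical data: five clustered points and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (5.0, 5.0)]
scores = local_outlier_factor(points, k=2)
```

Points inside the cluster score close to 1, while the isolated point scores several times higher.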

Nearest-Neighbor based algorithms on the breast-cancer dataset
Clustering-based anomaly detection algorithms

Recommendations for technique selection

Deep Anomaly Detection techniques
*DAD = Deep Anomaly Detection

Autoencoder architectures for anomaly detection
Other DAD Anomaly detection models
• Transfer Learning based anomaly
detection
• Zero Shot learning based anomaly
detection
• Ensemble based anomaly detection
• Clustering based anomaly detection
• Deep Reinforcement Learning (DRL)
based anomaly detection
SDAE: Stacked Denoising Autoencoder, DAE : Denoising Autoencoders
GRU: Gated Recurrent Unit, CNN: Convolutional Neural Networks
LSTM: Long Short Term Memory, AE: Autoencoders
CAE: Convolutional Autoencoders

• Classification - predicts a failure in the next n steps.
• Logistic Regression
• Perceptron as a classifier
• Deep Neural Network Classifiers (with different size
and depth)
• Fisher Linear Discriminant Analysis
• K Nearest Neighbor Classifier (with different values
of k)
• Naive Bayes Classifier
• Decision Tree (with different bucket size
thresholds)
• Bagged Decision Trees
• Random Forest (with different tree sizes)
• Gradient Boosting
• Support Vector Machines (with different kernels)
• Regression - predicts how much time is left before the
next failure, called Remaining Useful Life (RUL).
ML Techniques for Predictive Maintenance
• Supervised Anomaly Detection: fully labeled
training and test data sets.
• Comments: Decision trees like C4.5
cannot deal well with unbalanced data,
whereas SVM or ANNs should perform
better.
• Semi-supervised Anomaly Detection: training
data only consists of normal data without any
anomalies.
• Comments: One-class SVMs and
autoencoders. Density modeling models
like Gaussian Mixture Models (many
variants exist), Kernel Density Estimation
• Unsupervised Anomaly Detection: no labels.
• Comments: distances or densities are
used to estimate what is normal
and what is an outlier.
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152173#pone.0152173.ref033
https://towardsdatascience.com/predicting-bankruptcy-f4611afe8d2c
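The semi-supervised setting can be illustrated with the simplest possible density model. The sketch below fits a single Gaussian to normal-only training data and flags test values that lie far from it. This is a stand-in for the Gaussian Mixture Models and Kernel Density Estimation mentioned above, not an implementation of them. The sensor readings, function names, and the threshold of 3 standard deviations are assumptions.

```python
import statistics

def fit_normal_profile(normal_values):
    """Learn mean and standard deviation from anomaly-free training data."""
    return statistics.fmean(normal_values), statistics.stdev(normal_values)

def is_anomaly(value, profile, z_threshold=3.0):
    """Flag a value whose distance from the mean exceeds the z-score threshold."""
    mean, std = profile
    return abs(value - mean) / std > z_threshold

# Hypothetical sensor readings recorded during normal operation only.
normal_training = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7]
profile = fit_normal_profile(normal_training)

# New readings to score; only the last one is far from the learned profile.
flags = [is_anomaly(v, profile) for v in [20.0, 20.4, 25.0]]
```

No anomalous examples are needed for training, which is the defining property of this setting.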

Deep learning methods for Intrusion Detection

Linear regression:
• Relies on normality, homoscedasticity, and other assumptions.
• Does not capture highly non-linear, chaotic patterns.
• Prone to over-fitting.
• Parameters difficult to interpret.
• Very unstable when independent variables are highly correlated.
• Fixes: variable reduction, apply a transformation to your variables,
use constrained regression (e.g. ridge or Lasso regression)
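The instability with highly correlated variables, and the effect of the ridge fix, can be seen on a tiny example. This is a minimal sketch for two features and no intercept, solving the ridge normal equations in closed form. The data and the function name are hypothetical.

```python
def ridge_two_features(x1, x2, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for two features, no intercept."""
    a = sum(v * v for v in x1) + lam
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + lam
    r1 = sum(u * t for u, t in zip(x1, y))
    r2 = sum(v * t for v, t in zip(x2, y))
    det = a * d - b * b
    # Closed-form inverse of the symmetric 2x2 matrix [[a, b], [b, d]].
    return ((d * r1 - b * r2) / det, (a * r2 - b * r1) / det)

# Hypothetical data: x2 is almost a copy of x1, and y is roughly 2 * x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.01, 1.98, 3.02, 3.99, 5.01]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

ols = ridge_two_features(x1, x2, y, lam=0.0)    # ordinary least squares
ridge = ridge_two_features(x1, x2, y, lam=1.0)  # with an L2 penalty
```

Without the penalty the two coefficients take large values of opposite sign. With `lam=1.0` they both settle near 1, which splits the effect evenly across the two nearly identical variables.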
Decision trees:
• Very large decision trees are very unstable and impossible to
interpret, and
• prone to over-fitting.
• Fix: combine multiple small decision trees together instead of using
a large decision tree.
Naive Bayes:
• Used e.g. in fraud and spam detection, and for scoring. Assumes
that variables are independent; if they are not, it will fail miserably.
In the context of fraud or spam detection, variables (sometimes called
rules) are highly correlated.
• Fix: group variables into independent clusters of variables (in each
cluster, variables are highly correlated), then apply naive Bayes to
the clusters. Alternatively, use data reduction techniques.
K-means clustering:
• Used for clustering, tends to produce circular clusters.
• Does not work well with data points that are not a mixture of
Gaussian distributions.
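For reference, the standard k-means procedure (Lloyd's algorithm) alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. Because assignment uses Euclidean distance to a single center, the resulting clusters tend to be round. A minimal sketch with hypothetical data and fixed starting centroids:

```python
import math

def k_means(points, centroids, iterations=10):
    """Run a fixed number of Lloyd iterations from the given starting centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda c: math.dist(p, centroids[c])
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
final_centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (5.0, 5.0)])
```

Real use would pick starting centroids randomly or with a seeding scheme and stop when assignments no longer change.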
Neural networks:
• Difficult to interpret, unstable, subject to over-fitting.
Maximum Likelihood estimation:
• Requires your data to fit with a prespecified probabilistic
distribution. Not data-driven. In many cases the pre-specified
Gaussian distribution is a terrible fit for your data.
Density estimation in high dimensions:
• Subject to what is referred to as the curse of dimensionality.
Fix: use (non parametric) kernel density estimators with
adaptive bandwidths.
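A kernel density estimator with adaptive bandwidths uses narrow kernels where samples are dense and wide kernels where they are sparse. The sketch below is one-dimensional for brevity and sets each sample's bandwidth to the distance to its k-th nearest neighbor. That bandwidth rule is one common choice and is an assumption here, as are the function name and the sample values.

```python
import math

def adaptive_kde(samples, k=2):
    """Return a density function built from one Gaussian kernel per sample.

    Each kernel's bandwidth is the distance from its sample to that
    sample's k-th nearest neighbor.
    """
    bandwidths = []
    for i, s in enumerate(samples):
        dists = sorted(abs(s - other) for j, other in enumerate(samples) if j != i)
        bandwidths.append(max(dists[k - 1], 1e-9))  # guard against zero width

    def density(x):
        total = 0.0
        for s, h in zip(samples, bandwidths):
            z = (x - s) / h
            total += math.exp(-0.5 * z * z) / (h * math.sqrt(2 * math.pi))
        return total / len(samples)

    return density

# Hypothetical samples: a tight group near 1.0 and a single distant value.
density = adaptive_kde([1.0, 1.1, 1.2, 1.3, 5.0], k=2)
```

The estimated density is high inside the tight group and low between the group and the distant value, so low-density points can be treated as outlier candidates.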
Linear discriminant analysis (LDA):
• Used for supervised clustering. Bad technique because it
assumes that clusters do not overlap and are well
separated by hyperplanes. In practice, they never are. Use
density estimation techniques instead.
Critique of predictive techniques

Critique of predictive techniques
https://www.analyticbridge.datasciencecentral.com/profiles/blogs/the-8-worst-predictive-modeling-techniques
Random Forests
Pros
• One of the most accurate learning algorithms available. For
many data sets, it produces a highly accurate classifier.
• Runs efficiently on large databases.
• Handles thousands of input variables without variable deletion.
• Gives estimates of what variables are important in the
classification.
• Generates an internal unbiased estimate of the generalization
error as the forest building progresses.
• Has an effective method for estimating missing data and
maintains accuracy when a large proportion of the data are
missing.
• Has methods for balancing error in class population unbalanced
data sets.
• Prototypes are computed that give information about the relation
between the variables and the classification.
• The capabilities of the above can be extended to unlabeled data,
leading to unsupervised clustering, data views and outlier
detection.
Random Forests
Cons
• Random forests have been observed to overfit for
some datasets with noisy classification/regression
tasks.
• Unlike decision trees, the classifications made by
random forests are difficult for humans to interpret.
• For data including categorical variables with different
number of levels, random forests are biased in favor
of those attributes with more levels. Therefore, the
variable importance scores from random forest are
not reliable for this type of data. Methods such as
partial permutations have been used to address this problem.
• If the data contain groups of correlated features of
similar relevance for the output, then smaller groups
are favored over larger groups.

• The right technique is dependent on the use case and data
• k-NN is the best choice in many global use cases (especially for high-dimensional problems of > 400
dimensions, where the best k value is < 5)
• LOF is preferred for many local use cases
• Don’t use local anomaly detection algorithms, such as LOF, COF, INFLO and LoOP on datasets
containing global anomalies (Note: Global anomaly detection algos perform OK for local
anomalies)
• Nearest-neighbor based algorithms perform better in most cases when compared to clustering
algorithms (but pick the right k value) (pick these if computing time is not an issue)
• Clustering-based algorithms have a lower computation time (hence use them for real-time or near
real-time use cases)
• uCBLOF is the better choice among clustering-based algorithms
• ARIMA performed better for a long time, until ANNs arrived
• When not sure, use an ANN
Recommendations
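The k-NN recommendation for global anomalies can be sketched as a score equal to the mean distance from each point to its k nearest neighbors. The function name and data below are hypothetical, and k=2 is chosen only to match the small sample.

```python
import math

def knn_anomaly_scores(points, k):
    """Score each point by its mean distance to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical data: a cluster of five points and one global anomaly.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (5.0, 5.0)]
scores = knn_anomaly_scores(points, k=2)
most_anomalous = scores.index(max(scores))
```

The brute-force distance computation is quadratic in the number of points, which is consistent with the note above that nearest-neighbor methods suit cases where computing time is not an issue.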

Thank you
SK Reddy
skreddy99