DEEP VS. DIVERSE
By Colleen M. Farrelly
SCOPE OF PROBLEM
•The No Free Lunch Theorem suggests that no individual machine learning model
will perform best across all types of data and datasets.
• Social science/behavioral datasets present a particular challenge, as the data often contain main
effects and interaction effects, which can be linear or nonlinear with respect to an outcome of
interest.
• In addition, social science datasets often contain outliers and group overlap among
classification outcomes, where someone may have all the risk factors for dropping out or drug
use but does not exhibit the predicted behavior.
•Several machine learning frameworks have useful theoretical properties, including
convergence theorems and universal approximation guarantees, that may make them
particularly adept at modeling social science outcomes.
• Superlearners and subsembles have been proven to improve ensemble performance to a level
at least as good as the best model in the ensemble.
• Neural networks with one hidden layer have universal approximation properties, which
guarantee that random mappings to a wide enough layer can approximate any continuous
function to within any desired error level.
• One caveat to this universal approximation is that the layer width needed to obtain these guarantees may be larger than is practical or
possible in a model.
• Deep learning attempts to rectify this limitation by adding layers to the neural network, where each layer
reduces model error beyond the previous layers’ capabilities.
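For reference, the single-hidden-layer guarantee mentioned above is usually stated roughly as follows. This is a paraphrase of the classical result (Hornik, Stinchcombe, & White, 1989), not a formula taken from the slides; the symbols $f$, $K$, $N$, $\beta_i$, $w_i$, $b_i$, and $\sigma$ are notation introduced here. For a continuous function $f$ on a compact set $K \subset \mathbb{R}^d$ and a sigmoidal activation $\sigma$:

```latex
\[
\forall\, \varepsilon > 0 \;\; \exists\, N,\ \{\beta_i, w_i, b_i\}_{i=1}^{N} :
\qquad
\sup_{x \in K} \Bigl|\, f(x) - \sum_{i=1}^{N} \beta_i \,\sigma\!\bigl(w_i^{\top} x + b_i\bigr) \Bigr| < \varepsilon
\]
```

The statement says nothing about how large $N$ must be, which is the practical caveat raised above: the required width can grow quickly as $\varepsilon$ shrinks.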
NEURAL NETWORK GENERAL
•A neural network is a model
based on processing
information the way the
human brain does via a series
of feature mappings.
Arrows denote mapping
functions, which take
one topological space
to another.
EXTREME LEARNING MACHINES
AND UNIVERSAL APPROXIMATION
•Extreme learning machines are a type of shallow, wide neural network.
•This formulation of neural networks reduces the framework to a penalized linear
algebra problem, rather than iterative training (much faster to solve).
•It is based on random mappings and has been shown to converge to the correct
classification/regression via the Universal Approximation Theorem (likely a
result of adequate coverage of the underlying data manifold).
•However, the width of the network required may be computationally
infeasible at the point of convergence with an arbitrary error level.
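The shallow, wide formulation described on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the study: the hidden width (`n_hidden`), the `tanh` activation, the ridge penalty (`ridge`), and the toy data are all choices made here for illustration.

```python
import numpy as np

def fit_elm(X, y, n_hidden=200, ridge=1e-2, seed=0):
    """Fit an extreme learning machine: random hidden layer, ridge-solved output layer."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input-to-hidden weights (never trained)
    b = rng.normal(size=n_hidden)                # random hidden biases (never trained)
    H = np.tanh(X @ W + b)                       # random feature map of the inputs
    # Penalized linear algebra problem: solve (H'H + ridge * I) beta = H'y in one step.
    A = H.T @ H + ridge * np.eye(n_hidden)
    beta = np.linalg.solve(A, H.T @ y)
    return W, b, beta

def predict_elm(model, X):
    """Apply the fixed random mapping, then the fitted output weights."""
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta

# Toy usage on synthetic data (assumed for illustration only).
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(400, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2
model = fit_elm(X, y)
train_mse = float(np.mean((predict_elm(model, X) - y) ** 2))
baseline_mse = float(np.var(y))  # error of always predicting the mean of y
```

Only the output weights are estimated, so fitting is a single linear solve rather than an iterative training loop. The cost of that solve grows with the hidden width, which is where the feasibility concern comes from.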
•Deep learning attempts to solve the wide
layer problem by adding depth (more layers) to
neural networks, which can in some cases be
more effective and computationally feasible
than extreme learning machines.
• This framework is like sifting data with multiple
sifters to distill finer and finer pieces of the data.
•These networks are computationally intensive
and require architecture design and tuning.
• Feed-forward networks are particularly popular, as
they can be easily built, tuned, and trained.
• Feed-forward networks also have relations to the
Universal Approximation Theorem, providing a
means to exploit these results without requiring
a single, very wide layer.
•This model is a weighted aggregation
of multiple types of models.
• This is analogous to a small town election.
• Different people have different views of the
politics and care about different issues.
•Different modeling methods capture
different pieces of the data variance
and vote accordingly.
• This leverages algorithm strengths while
minimizing weaknesses for each model (somewhat
like an extension of bagging to multiple
types of models).
• Diversity allows the full ensemble to better
explore the geometry underlying the data.
•This combines multiple models while
avoiding multiple testing issues.
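The weighted-aggregation idea on this slide can be sketched as a small stacked ensemble. This is an illustrative simplification, not the superlearner implementation used in the study: the three base learners, the squared-error loss, the 5-fold split, and the grid search over convex weights (in place of a proper constrained optimizer) are all assumptions made here.

```python
import itertools
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns a prediction function."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ coef

def fit_knn(X, y, k=5):
    """k-nearest-neighbor regression using squared Euclidean distance."""
    def predict(Xn):
        d = ((Xn[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        nearest = np.argsort(d, axis=1)[:, :k]
        return y[nearest].mean(axis=1)
    return predict

def fit_mean(X, y):
    """Baseline that always predicts the training mean."""
    m = y.mean()
    return lambda Xn: np.full(len(Xn), m)

LEARNERS = [fit_linear, fit_knn, fit_mean]  # hypothetical library of base models

def cv_predictions(X, y, n_folds=5, seed=0):
    """Out-of-fold predictions: one column per base learner."""
    rng = np.random.default_rng(seed)
    Z = np.zeros((len(X), len(LEARNERS)))
    for test_idx in np.array_split(rng.permutation(len(X)), n_folds):
        train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
        for j, fit in enumerate(LEARNERS):
            Z[test_idx, j] = fit(X[train_idx], y[train_idx])(X[test_idx])
    return Z

def convex_weights(Z, y, grid=20):
    """Grid-search nonnegative weights summing to 1 (three learners) that minimize CV error."""
    best_w, best_risk = None, np.inf
    for a, b in itertools.product(range(grid + 1), repeat=2):
        if a + b > grid:
            continue
        w = np.array([a, b, grid - a - b]) / grid
        risk = float(np.mean((Z @ w - y) ** 2))
        if risk < best_risk:
            best_w, best_risk = w, risk
    return best_w, best_risk

# Toy usage on synthetic data with one linear and one nonlinear effect.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = 2 * X[:, 0] + np.sin(2 * X[:, 1]) + rng.normal(scale=0.3, size=300)
Z = cv_predictions(X, y)
weights, ensemble_risk = convex_weights(Z, y)
base_risks = ((Z - y[:, None]) ** 2).mean(axis=0)
```

Because putting all the weight on a single learner is one of the candidate combinations, the selected ensemble's cross-validated error cannot exceed that of the best base learner. This mirrors, in a small way, the "at least as good as the best model in the ensemble" property.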
THEORY AND PRACTICE
•Superlearners are a type of ensemble of machine learning models,
typically using a set of classifiers or regression models, including linear
models, tree models, and ensemble models like boosting or bagging.
• Superlearners also have some theoretical guarantees about convergence and least
upper bounds on model error relative to the algorithms within the superlearner framework.
• They also have the ability to rank variables by importance and provide model fits for
•Deep architectures can be designed as feed-forward data processing
networks, in which functional nodes through which data passes add
information to the dataset regarding optimal partitioning and variable
• Recent attempts to create feed-forward deep networks employing random forest or
SVM functions at each mapping show promise as an alternative to the typical neural
network formulation of deep learning.
• It stands to reason that feed-forward deep networks based on other machine learning
algorithms or combinations of algorithms may enjoy some of these benefits of deep
learning.
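One way to read the feed-forward idea described here is that each layer's models append their predictions to the dataset as new columns, which the next layer then sees alongside the original predictors. The sketch below illustrates that reading under stated assumptions: the two node types (least squares and k-nearest neighbors), the layer layout, the use of out-of-fold predictions for the training rows, and the toy data are all choices made here, not details from the study.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns a prediction function."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ coef

def fit_knn(X, y, k=5):
    """k-nearest-neighbor regression using squared Euclidean distance."""
    def predict(Xn):
        d = ((Xn[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        return y[np.argsort(d, axis=1)[:, :k]].mean(axis=1)
    return predict

def oof_predictions(fit, X, y, n_folds=5, seed=0):
    """Out-of-fold predictions, so a node never scores rows it was trained on."""
    rng = np.random.default_rng(seed)
    out = np.zeros(len(X))
    for test_idx in np.array_split(rng.permutation(len(X)), n_folds):
        mask = np.ones(len(X), dtype=bool)
        mask[test_idx] = False
        out[test_idx] = fit(X[mask], y[mask])(X[test_idx])
    return out

def apply_layer(learners, X_train, y_train, X_test):
    """Append one prediction column per node to the training and test matrices."""
    train_cols, test_cols = [X_train], [X_test]
    for fit in learners:
        train_cols.append(oof_predictions(fit, X_train, y_train)[:, None])
        test_cols.append(fit(X_train, y_train)(X_test)[:, None])
    return np.hstack(train_cols), np.hstack(test_cols)

# Toy usage: two hidden layers of models feeding a final linear output node.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 4))
y = X[:, 0] + np.sin(2 * X[:, 1]) + rng.normal(scale=0.2, size=400)
X_tr, y_tr, X_te, y_te = X[:300], y[:300], X[300:], y[300:]

layers = [[fit_linear, fit_knn], [fit_knn]]  # hypothetical architecture
for learners in layers:
    X_tr, X_te = apply_layer(learners, X_tr, y_tr, X_te)

output_node = fit_linear(X_tr, y_tr)
test_mse = float(np.mean((output_node(X_te) - y_te) ** 2))
```

Each layer widens the data matrix by one column per node, so later nodes can combine earlier predictions with the raw predictors.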
•Algorithm frameworks tested:
1. Superlearner with random forest, random
ferns, KNN regression, MARS regression,
conditional inference trees, and boosted
regression.
2. Deep feed-forward machine learning model
(mixed deep model) with first hidden layer
of 2 random forest models, a conditional
inference tree model, and a random ferns
model; with second hidden layer of MARS
regression and conditional inference trees;
and a third hidden layer of boosted
regression.
3. Optimally tuned deep feed-forward neural
network model (13-5-3-1 configuration).
4. Deep feed-forward neural network model
with the same hidden layer structure as the
mixed deep model (Model 2).
5. KNN models, including k=5 regression
model, a deep k=5 model with 10-10-5
hidden layer configuration, and a
1. Outcome as yes/no for simplicity of
design (logistic regression problem)
2. 4 true predictors, 9 noise predictors
3. Predictor relationships
   1. Purely linear terms (ideal neural network set-up)
   2. Purely nonlinear terms (ideal machine learning set-up)
   3. Mix of linear and nonlinear terms (more likely
      in real-world data)
4. Gaussian noise level
   1. Low
   2. High (more likely in real-world data)
5. Addition of outliers (fraction ~5-10%)
to high noise conditions (mimic
6. Sample sizes of 500, 1000, 2500,
5000, 10000 to test convergence
properties for each condition and
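The design above can be sketched as a small data generator. The slides give the structure (binary outcome, 4 true and 9 noise predictors, three relationship types, Gaussian noise, optional outliers, five sample sizes) but not the functional forms, so every formula, coefficient, noise scale, and the outlier mechanism below is a hypothetical stand-in chosen for illustration.

```python
import numpy as np

def simulate(n, relationship="mixed", noise_sd=1.0, outlier_frac=0.0, seed=0):
    """Generate one simulated dataset: 13 predictors and a yes/no outcome."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 13))  # columns 0-3: true predictors; columns 4-12: noise predictors
    x1, x2, x3, x4 = X[:, 0], X[:, 1], X[:, 2], X[:, 3]
    if relationship == "linear":
        signal = x1 + x2 - x3 + 0.5 * x4                   # assumed linear form
    elif relationship == "nonlinear":
        signal = (x1 ** 2 - 1) + np.sin(2 * x2) + x3 * x4  # assumed nonlinear form
    elif relationship == "mixed":
        signal = x1 - x2 + np.sin(2 * x3) + (x4 ** 2 - 1)  # assumed mixed form
    else:
        raise ValueError(f"unknown relationship: {relationship}")
    latent = signal + rng.normal(scale=noise_sd, size=n)   # Gaussian noise on the latent scale
    y = (latent > 0).astype(int)                           # yes/no outcome
    n_out = int(round(outlier_frac * n))
    if n_out:
        # Assumed outlier mechanism: shift the true predictors of a random subset
        # after the outcome is set, so those rows no longer match their labels.
        idx = rng.choice(n, size=n_out, replace=False)
        X[idx, :4] += rng.normal(scale=5.0, size=(n_out, 4))
    return X, y

# One condition (mixed terms, high noise, 5% outliers) across the five sample sizes.
sample_sizes = (500, 1000, 2500, 5000, 10000)
datasets = {n: simulate(n, "mixed", noise_sd=2.0, outlier_frac=0.05, seed=n)
            for n in sample_sizes}
```

Looping the same call over relationship type, noise level, and outlier fraction would give the full grid of conditions. Each algorithm could then be fit at every sample size to trace how its accuracy changes as the sample grows.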
•Deep neural networks show strong performance
(linear relationship models show universal
approximation convergence at low sample sizes with
•Superlearners seem to perform better than deep
models for machine learning ensembles.
•Deep architectures enhance the performance of KNN
models, particularly at low sample sizes, but
superlearners win out.
•Superlearners dominate performance accuracy at
smaller sample sizes, and machine learning deep
models are competitive at these sample sizes.
•Tuned deep neural networks catch up to this
performance at large sample sizes, particularly with
noise and no outliers.
•Superlearner architectures show performance gains in
KNN regression models across all conditions.
•Superlearners retain their competitive advantage up until
very large sample sizes, suggesting that deep neural
networks struggle with a mix of linear and nonlinear terms
in a classification/regression model.
•Machine-learning-based deep architectures are
competitive at small sample sizes compared to deep
neural networks when no outliers are present.
•KNN superlearners retain a large advantage, particularly at
low noise with few outliers.
PREDICTING BAR PASSAGE
•Data includes 188 Concord Law
students for whom BAR data exists.
•22 predictors, including admissions
factors and law school grades,
•Mixed deep model, superlearner
model, and tuned deep neural
network model were compared to
assess performance on real-world
data exhibiting linear and nonlinear
relationships with noise and group
overlap.
•70% of data was used to train, with
30% held out as a test set to assess
model performance.
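A 70/30 hold-out split of the kind described above can be sketched as follows. The slides do not say how rows were assigned, so a simple random shuffle is assumed here, and the integer student IDs are placeholders.

```python
import random

def train_test_split(rows, train_frac=0.7, seed=0):
    """Shuffle the rows and cut them into a training part and a held-out test part."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

students = list(range(188))  # placeholder IDs, one per student in the dataset
train, test = train_test_split(students)
# len(train) is 132 and len(test) is 56
```

With 188 students the held-out set has only 56 rows, so each test case moves accuracy by roughly 1.8 percentage points. That is worth keeping in mind when comparing models on data this small.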
[Table: test-set accuracy for the deep machine learning model, the superlearner model (100.0%), and the tuned deep neural network model]
•Deep neural networks struggle with
the small sample size; using
machine learning map functions
dramatically improves accuracy.
• Sample size requirements for
convergence are a noted limitation of
neural networks in general.
• Previous results suggest performance
depends on choice of hidden layer
activation functions (maps).
•Superlearner yields perfect
prediction, with individual
PREDICTING RETENTION BY
•Data includes 27,666 students in 2016
and retention/graduation status at the
end of each term.
demographic, and advising factors—
•Mixed deep model, superlearner
model, and tuned deep neural network
model were compared to assess
performance on real-world data
exhibiting linear and nonlinear
relationships with noise and group
overlap.
•70% of data was used to train, with 30%
held out as a test set to assess
model performance.
[Table: test-set accuracy for the deep machine learning model, the superlearner model (74.1%), and the tuned deep neural network model]
•Deep neural networks and deep
machine learning models seem to
provide a good processing sequence
to improve model fits iteratively.
• Examining the deep machine learning
model, we see that later layers do weight
prior models as fairly important
predictors, and we see evidence that
these previous layer predictions combine
with other factors in the dataset in these
• This suggests that a deep approach can
•Data involved 905,612 leads from
2016 and various admission
• Because of low enrollment counts
(~24000), stratified sampling was
used to enrich the training set for all
• Training set contained ~20% of
observations, with ~10% of those
being enrolled students.
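The enrichment step described above can be sketched as sampling the rare class at a higher rate than it occurs naturally. This reads "~10% of those" as "about 10% of training rows are enrolled students", which is an interpretation. The toy counts below (10,000 leads, 265 enrolled) are invented to roughly mirror the reported base rate of about 24,000 in 905,612.

```python
import random

def enriched_training_sample(labels, train_frac=0.20, positive_share=0.10, seed=0):
    """Pick training indices: train_frac of all rows, positive_share of them positive."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_train = round(train_frac * len(labels))
    n_pos = min(round(positive_share * n_train), len(pos))  # cannot exceed available positives
    n_neg = min(n_train - n_pos, len(neg))
    return rng.sample(pos, n_pos) + rng.sample(neg, n_neg)

# Toy data: 265 enrolled out of 10,000 leads (about 2.65%).
labels = [1] * 265 + [0] * 9735
train_idx = enriched_training_sample(labels)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
overall_rate = sum(labels) / len(labels)
```

The training set ends up with a much higher share of enrolled students than the full population. Predicted probabilities from models fit on it therefore reflect the enriched rate rather than the population rate.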
•Superlearner/deep models give
very similar model fit specs
(accuracy, AUC, FNR, FPR), and
some individual models (MARS,
random forest, boosted
regression, conditional trees) gave
very good model fit, as well.
•This suggests convergence of
most models tested, including
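The fit statistics named above (accuracy, AUC, FNR, FPR) can be computed from labels and scores as in this small sketch. The labels, scores, and 0.5 cutoff are made-up values for illustration. The AUC is computed by direct pairwise comparison, which is simple but slow on large data.

```python
def confusion_rates(y_true, y_pred):
    """Accuracy, false negative rate, and false positive rate for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "fnr": fn / (fn + tp),  # share of true positives that were missed
        "fpr": fp / (fp + tn),  # share of true negatives that were flagged
    }

def auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 3 positives and 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
y_pred = [int(s >= 0.5) for s in scores]
rates = confusion_rates(y_true, y_pred)  # accuracy 0.75, FNR 1/3, FPR 0.2
area = auc(y_true, scores)               # 0.8
```

With a rare outcome such as enrollment, accuracy alone can look strong even when most positives are missed. That is one reason to read FNR and FPR alongside it.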
•Runtime analysis shows the advantage of
some models over others, with conditional
trees/MARS models showing low runtimes.
•Deep NN have an advantage over deep ML
models and superlearners, mostly as a result
of the random forest runtimes.
•A tree/MARS superlearner gave similar
performance in a shorter amount of time than
the deep NN (~2 minutes).
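A runtime comparison of this kind can be set up with a small timing helper. This is a generic sketch, not the procedure used for the reported runtimes: the helper name `best_runtime`, the best-of-three rule, and the stand-in workload are all assumptions.

```python
import time

def best_runtime(fit, *args, repeats=3):
    """Run fit(*args) several times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fit(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in workload; in practice `fit` would be a model-fitting function.
elapsed = best_runtime(sorted, list(range(100_000, 0, -1)))
```

Taking the best of a few repeats reduces the effect of background load on the machine. It does not capture how runtime scales with the number of rows or predictors.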
[Table: AUC, FNR, FPR, and runtime by model]
•Deep architectures can provide gain above individual models, particularly at
lower sample sizes, suggesting deep feed-forward approaches are
efficacious at improving predictive capabilities.
• This suggests that deep architectures can improve individual models that work well on a
• However, there is evidence that the topology of mappings between layers using these more
complex machine learning functions detracts from the predictive capabilities and universal
•Deep architectures with a variety of algorithms in each layer provide gains
above individual models and achieve good performance at low sample sizes
under real-world conditions.
•However, superlearners provide more robust models with no architecture
design or tuning needed; with group overlap and/or a combination of linear
and nonlinear relationships, they are the best models to use, even at sample
sizes where deep architectures begin to converge.
• Superlearners yield interpretable models and, hence, insight into important relationships
between predictors and an outcome.
•Computationally expensive in traditional algorithms and rooted in topological maps. Cannot handle lots of variables compared to number of observations. Cannot handle non-independent data.
• Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359-366.
•Random mappings to reduce MLP to linear system of equations.
• Huang, G. B., Wang, D. H., & Lan, Y. (2011). Extreme learning machines: a survey. International Journal of Machine Learning and Cybernetics, 2(2), 107-122.
•Computationally expensive neural network extension. Still suffers from singularities which hinder performance.
• Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
•Bagging of different base models (same bootstrap or different bootstrap).
• van der Laan, M. J., Polley, E. C., & Hubbard, A. E. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1).