- 1. DEEP VS. DIVERSE ARCHITECTURES By Colleen M. Farrelly
- 2. SCOPE OF PROBLEM •The No Free Lunch Theorem suggests that no individual machine learning model will perform best across all types of data and datasets. • Social science/behavioral datasets present a particular challenge, as the data often contain main effects and interaction effects, which can be linear or nonlinear with respect to an outcome of interest. • In addition, social science datasets often contain outliers and group overlap among classification outcomes, where someone may have all the risk factors for dropping out or drug use but not exhibit the predicted behavior. •Several machine learning frameworks have nice theoretical properties, including convergence theorems and universal approximation guarantees, and may be particularly adept at modeling social science outcomes. • Superlearners and subsembles have been proven to improve ensemble performance to a level at least as good as the best model in the ensemble. • Neural networks with one hidden layer have universal approximation properties, which guarantee that random mappings to a wide enough layer will come arbitrarily close to a desired error level for any given function. • One caveat to this universal approximation is that the layer width needed to obtain these guarantees may be larger than is practical or possible in a model. • Deep learning attempts to rectify this limitation by adding layers to the neural network, where each layer reduces model error beyond the previous layers' capabilities.
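For reference, the one-hidden-layer universal approximation property mentioned above is usually stated along the following lines. This is the standard textbook form (in the spirit of Hornik, Stinchcombe, & White, 1989), not a formula taken from the slides; the symbols $f$, $K$, $\sigma$, $N$, $\alpha_i$, $w_i$, and $b_i$ are chosen here for illustration.

```latex
% Assumptions: K \subset \mathbb{R}^d is compact, f : K \to \mathbb{R} is continuous,
% and \sigma is a sigmoidal (squashing) activation function.
\forall\, \varepsilon > 0 \;\; \exists\, N \in \mathbb{N},\;
\alpha_i, b_i \in \mathbb{R},\; w_i \in \mathbb{R}^d \;\; \text{such that} \quad
\sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} \alpha_i \, \sigma\!\left(w_i^{\top} x + b_i\right) \right| < \varepsilon
```

The statement places no bound on the hidden-layer width $N$, which is the practical caveat raised in the last two bullets.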
- 3. NEURAL NETWORK GENERAL OVERVIEW •A neural network is a model that processes complex, nonlinear information through a series of feature mappings, loosely following the way the human brain does. •In the diagram, arrows denote mapping functions, which take one topological space to another. (Image credits: colah.github.io, www.alz.org)
- 4. EXTREME LEARNING MACHINES AND UNIVERSAL APPROXIMATION •These are a type of shallow, wide neural network. •This formulation reduces the neural network framework to a penalized linear algebra problem rather than iterative training, which makes it much faster to solve. •It is based on random mappings and has been shown to converge to the correct classification/regression via the Universal Approximation Theorem (likely a result of adequate coverage of the underlying data manifold). •However, the network width required to converge to an arbitrary error level may be computationally infeasible.
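A minimal sketch of the idea on the slide above, assuming NumPy: the hidden-layer weights are drawn at random and never trained, so fitting reduces to one ridge-penalized linear solve for the output weights. The function names (`elm_fit`, `elm_predict`), the tanh activation, the layer width, and the penalty value are illustrative assumptions, not details taken from the slides.

```python
import numpy as np

def elm_fit(X, y, n_hidden=200, ridge=1e-2, seed=0):
    """Fit a single-hidden-layer network with random, untrained hidden weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))   # random input-to-hidden weights
    b = rng.normal(size=n_hidden)                 # random hidden biases
    H = np.tanh(X @ W + b)                        # random feature mapping
    # Penalized least squares for the output weights: (H'H + ridge*I) beta = H'y
    beta = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ y)
    return W, b, beta

def elm_predict(X, model):
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta

# Toy regression target (assumed for illustration): y = sin(x) on [-3, 3]
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0])
model = elm_fit(X, y)
train_mse = float(np.mean((elm_predict(X, model) - y) ** 2))
```

Increasing `n_hidden` generally tightens the fit, but the linear solve grows with the cube of the width, which is one way to see the feasibility caveat in the last bullet.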
- 5. DEEP LEARNING •Deep learning attempts to solve the wide layer problem by adding depth layers in neural networks, which can be more effective and computationally feasible than extreme learning machines for some problems. • This framework is like sifting data with multiple sifters to distill finer and finer pieces of the data. •These are computationally intensive and require architecture design and tuning for each problem. • Feed-forward networks are particularly popular, as they can be easily built, tuned, and trained. • Feed-forward networks also have relations to the Universal Approximation Theorem, providing a means to exploit these results without requiring
- 6. SUPERLEARNERS •This model is a weighted aggregation of multiple types of models. • This is analogous to a small town election. • Different people have different views of the politics and care about different issues. •Different modeling methods capture different pieces of the data variance and vote accordingly. • This leverages algorithm strengths while minimizing weaknesses for each model (kind of like an extension of bagging to multi-algorithm ensembles). • Diversity allows the full ensemble to better explore the geometry underlying the data. •This combines multiple models while avoiding multiple testing issues.
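The weighted vote described above can be sketched as follows, assuming NumPy only. Each base learner produces out-of-fold predictions, and non-negative weights are then fit to those predictions and normalized to sum to one. The three toy learners (linear, k-nearest-neighbor, and mean) are simple stand-ins chosen for illustration rather than the learners used in the study, and the projected-gradient solver is one assumed way to obtain non-negative least-squares weights.

```python
import numpy as np

def fit_linear(X, y):
    Xb = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ coef

def fit_knn(X, y, k=5):
    def predict(Xn):
        d = ((Xn[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        return y[np.argsort(d, axis=1)[:, :k]].mean(axis=1)
    return predict

def fit_mean(X, y):
    m = y.mean()
    return lambda Xn: np.full(len(Xn), m)

def nnls_weights(Z, y, n_iter=2000):
    """Non-negative least squares by projected gradient, then normalized to sum to 1."""
    A, c = Z.T @ Z, Z.T @ y
    step = 1.0 / np.linalg.eigvalsh(A).max()
    w = np.full(Z.shape[1], 1.0 / Z.shape[1])
    for _ in range(n_iter):
        w = np.maximum(0.0, w - step * (A @ w - c))
    return w / w.sum()

def super_learner(X, y, learners, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    Z = np.zeros((len(X), len(learners)))          # out-of-fold predictions
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
        for j, fit in enumerate(learners):
            Z[test_idx, j] = fit(X[train_idx], y[train_idx])(X[test_idx])
    w = nnls_weights(Z, y)
    models = [fit(X, y) for fit in learners]       # refit each learner on all data
    predict = lambda Xn: np.column_stack([m(Xn) for m in models]) @ w
    return w, predict, Z

# Toy data (assumed): one linear and one nonlinear effect plus a noise column
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(300, 3))
y = 3 + 2 * X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(scale=0.1, size=300)
w, predict, Z = super_learner(X, y, [fit_linear, fit_knn, fit_mean])
cv_mse = ((Z - y[:, None]) ** 2).mean(axis=0)      # per-learner cross-validated error
ens_cv_mse = float(((Z @ w - y) ** 2).mean())      # weighted-ensemble error
```

Comparing `ens_cv_mse` with the entries of `cv_mse` shows how the weighted combination fares against each individual learner on the same folds.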
- 7. THEORY AND PRACTICE •Superlearners are a type of ensemble of machine learning models, typically using a set of classifiers or regression models, including linear models, tree models, and ensemble models like boosting or bagging. • Superlearners also have some theoretical guarantees about convergence and least upper bounds on model error relative to algorithms within the superlearner framework. • They also have the ability to rank variables by importance and provide model fits for each component. •Deep architectures can be designed as feed-forward data processing networks, in which functional nodes through which data passes add information to the dataset regarding optimal partitioning and variable pairing. • Recent attempts to create feed-forward deep networks employing random forest or SVM functions at each mapping show promise as an alternative to the typical neural network formulation of deep learning. • It stands to reason that feed-forward deep networks based on other machine learning algorithms or combinations of algorithms may enjoy some of these benefits of deep
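One way to read the feed-forward idea above is sketched below, assuming NumPy only: every layer holds a few fitted models, and their predictions are appended to the dataset as extra columns before the next layer is trained. The slides do not give the author's implementation, so the layer layout, the helper names (`fit_deep`, `predict_deep`), and the simple linear and k-nearest-neighbor nodes are all assumptions for illustration.

```python
import numpy as np

def fit_linear(X, y):
    Xb = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ coef

def make_knn(k):
    def fit(X, y):
        def predict(Xn):
            d = ((Xn[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
            return y[np.argsort(d, axis=1)[:, :k]].mean(axis=1)
        return predict
    return fit

def fit_deep(X, y, layers):
    """Fit each layer on the current features, then append its predictions as columns."""
    fitted, feats = [], X
    for layer in layers:
        models = [fit(feats, y) for fit in layer]
        fitted.append(models)
        preds = np.column_stack([m(feats) for m in models])
        feats = np.column_stack([feats, preds])    # later layers see data plus predictions
    return fitted

def predict_deep(X, fitted):
    feats = X
    for models in fitted:
        preds = np.column_stack([m(feats) for m in models])
        feats = np.column_stack([feats, preds])
    return preds[:, 0]                             # final layer is assumed to hold one model

# Toy data (assumed) and a hypothetical 3-2-1 layout of nodes
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 4))
y = X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(scale=0.1, size=300)
layers = [[make_knn(5), make_knn(15), fit_linear], [fit_linear, make_knn(5)], [fit_linear]]
fitted = fit_deep(X, y, layers)
pred = predict_deep(X, fitted)
train_mse = float(np.mean((pred - y) ** 2))
```

This sketch passes in-sample predictions forward, which is optimistic; using out-of-fold predictions between layers would be a more careful variant.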
- 8. EXPERIMENTAL SET-UP •Algorithm frameworks tested: 1. Superlearner with random forest, random ferns, KNN regression, MARS regression, conditional inference trees, and boosted regression. 2. Deep feed-forward machine learning model (mixed deep model) with first hidden layer of 2 random forest models, a conditional inference tree model, and a random ferns model; with second hidden layer of MARS regression and conditional inference trees; and a third hidden layer of boosted regression. 3. Optimally tuned deep feed-forward neural network model (13-5-3-1 configuration). 4. Deep feed-forward neural network model with the same hidden layer structure as the mixed deep model (Model 2). 5. KNN models, including k=5 regression model, a deep k=5 model with 10-10-5 hidden layer configuration, and a •Simulation design: 1. Outcome as yes/no for simplicity of design (logistic regression problem). 2. 4 true predictors, 9 noise predictors. 3. Predictor relationships: (a) purely linear terms (ideal neural network set-up); (b) purely nonlinear terms (ideal machine learning set-up); (c) mix of linear and nonlinear terms (more likely in real-world data). 4. Gaussian noise level: (a) low; (b) high (more likely in real-world data). 5. Addition of outliers (fraction ~5-10%) to high noise conditions (mimic group overlap). 6. Sample sizes of 500, 1000, 2500, 5000, 10000 to test convergence properties for each condition and algorithm.
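The simulation design above could be generated along these lines, assuming NumPy. Only the counts (4 true and 9 noise predictors), the three relationship types, the noise and outlier conditions, and the sample sizes come from the slide. The specific functional forms, the noise standard deviations, and the choice to model outliers as flipped labels are hypothetical stand-ins.

```python
import numpy as np

def simulate(n, kind="mixed", noise_sd=0.25, outlier_frac=0.0, seed=0):
    """Binary outcome driven by 4 true predictors; the other 9 columns are pure noise."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 13))
    x1, x2, x3, x4 = X[:, 0], X[:, 1], X[:, 2], X[:, 3]
    if kind == "linear":
        eta = x1 - x2 + 0.5 * x3 + 0.5 * x4                  # assumed coefficients
    elif kind == "nonlinear":
        eta = x1 ** 2 - 1.0 + np.sin(2 * x2) + x3 * x4       # assumed nonlinear terms
    elif kind == "mixed":
        eta = x1 - x2 + np.sin(2 * x3) + x3 * x4             # assumed mix of both
    else:
        raise ValueError(f"unknown kind: {kind}")
    eta = eta + rng.normal(scale=noise_sd, size=n)           # Gaussian noise on the predictor
    p = 1.0 / (1.0 + np.exp(-eta))                           # logistic link
    y = (rng.uniform(size=n) < p).astype(int)
    # Assumption: outliers are mimicked by flipping the label for a random fraction of cases
    flip = rng.choice(n, size=int(round(outlier_frac * n)), replace=False)
    y[flip] = 1 - y[flip]
    return X, y

sample_sizes = [500, 1000, 2500, 5000, 10000]
noise_settings = {                                           # (noise_sd, outlier_frac), hypothetical values
    "low": (0.25, 0.0),
    "high": (1.0, 0.0),
    "high+outliers": (1.0, 0.075),
}
grid = [(kind, label, n)
        for kind in ("linear", "nonlinear", "mixed")
        for label in noise_settings
        for n in sample_sizes]

X, y = simulate(500, kind="mixed", noise_sd=1.0, outlier_frac=0.075, seed=1)
```

Each entry of `grid` identifies one condition to which every algorithm framework would be fit and scored.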
- 9. LINEAR RESULTS •Deep neural networks show strong performance (linear relationship models show universal approximation convergence at low sample sizes with low noise). •Superlearners seem to perform better than deep models for machine learning ensembles. •Deep architectures enhance the performance of KNN models, particularly at low sample sizes, but superlearners win out.
- 10. NONLINEAR RESULTS •Superlearners dominate performance accuracy at smaller sample sizes, and machine learning deep models are competitive at these sample sizes. •Tuned deep neural networks catch up to this performance at large sample sizes, particularly with noise and no outliers. •Superlearner architectures show performance gains in KNN regression models across all conditions.
- 11. MIXED RESULTS •Superlearners retain their competitive advantage up until very large sample sizes, suggesting that deep neural networks struggle with a mix of linear and nonlinear terms in a classification/regression model. •Machine-learning-based deep architectures are competitive at small sample sizes compared to deep neural networks when no outliers are present. •KNN superlearners retain a large advantage, particularly at low noise with few outliers.
- 12. PREDICTING BAR PASSAGE •Data includes 188 Concord Law students for whom BAR data exists. •22 predictors, including admissions factors and law school grades, used. •Mixed deep model, superlearner model, and tuned deep neural network model were compared to assess performance on real-world data exhibiting linear and nonlinear relationships with noise and group overlap. •70% of data was used to train, with 30% held out as a test set to assess accuracy.

  | Algorithm | Accuracy |
  | --- | --- |
  | Deep Machine Learning Network | 84.2% |
  | Superlearner Model | 100.0% |
  | Tuned Deep Neural Network | 68.4% |

  •Deep neural networks struggle with the small sample size; using machine learning map functions dramatically improves accuracy. • Sample size requirements for convergence are a noted limitation of neural networks in general. • Previous results suggest performance depends on choice of hidden layer activation functions (maps). •Superlearner yields perfect prediction, with individual
- 13. PREDICTING RETENTION BY ADVISING •Data includes 27,666 students in 2016 and retention/graduation status at the end of each term. •10 predictors (academic, demographic, and advising factors) were used. •Mixed deep model, superlearner model, and tuned deep neural network model were compared to assess performance on real-world data exhibiting linear and nonlinear relationships with noise and group overlap. •70% of data was used to train, with 30% held out as a test set to assess accuracy.

  | Algorithm | Accuracy |
  | --- | --- |
  | Deep Machine Learning Network | 73.2% |
  | Superlearner Model | 74.1% |
  | Tuned Deep Neural Network | 74.4% |

  •Deep neural networks and deep machine learning models seem to provide a good processing sequence to improve model fits iteratively. • Examining the deep machine learning model, we see that later layers do weight prior models as fairly important predictors, and we see evidence that these previous layer predictions combine with other factors in the dataset in these later layers. • This suggests that a deep approach can
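The 70/30 hold-out evaluation used for the real-world comparisons can be sketched as below, assuming NumPy. The data are synthetic, and the one-variable threshold rule is a hypothetical stand-in for a fitted model; neither comes from the study.

```python
import numpy as np

def holdout_split(n, train_frac=0.7, seed=0):
    """Return shuffled, disjoint train and test index arrays."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    cut = int(round(train_frac * n))
    return idx[:cut], idx[cut:]

def accuracy(y_true, y_pred):
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Synthetic stand-in data: 10 predictors, outcome driven by the first one plus noise
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

train, test = holdout_split(len(y))
threshold = X[train, 0].mean()                 # "model" is learned from training rows only
y_hat = (X[test, 0] > threshold).astype(int)   # and scored on held-out rows only
test_acc = accuracy(y[test], y_hat)
```

Anything estimated from the data, including tuning choices, belongs on the training side of the split so that `test_acc` stays an honest estimate.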
- 14. PREDICTING ADMISSIONS •Data involved 905,612 leads from 2016 and various admission factors. • Because of low enrollment counts (~24,000), stratified sampling was used to enrich the training set for all models. • Training set contained ~20% of observations, with ~10% of those being enrolled students. •Superlearner/deep models give very similar model fit specs (accuracy, AUC, FNR, FPR), and some individual models (MARS, random forest, boosted regression, conditional trees) gave very good model fit, as well. •This suggests convergence of most models tested, including •Runtime analysis shows the advantage of some models over others, with conditional trees/MARS models showing low runtimes. •Deep NNs have an advantage over deep ML models and superlearners, mostly as a result of the random forest runtimes. •A tree/MARS superlearner gave similar performance in a shorter amount of time than the deep NN (~2 minutes).

  | Algorithm | Accuracy | AUC | FNR | FPR | Time (Minutes) |
  | --- | --- | --- | --- | --- | --- |
  | Deep Machine Learning Network | 98.0% | 0.95 | 0.08 | 0.02 | 22 |
  | Superlearner Model | 98.2% | 0.96 | 0.08 | 0.01 | 15 |
  | Fast Superlearner Model | 98.0% | 0.95 | 0.08 | 0.02 | 2 |
  | Tuned Deep Neural Network | 98.0% | 0.95 | 0.08 | 0.02 | 8 |
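A rough sketch, assuming NumPy, of the two mechanics this slide relies on: drawing a training set enriched for the rare positive class, and computing the reported fit metrics (accuracy, FNR, FPR, AUC). The function names and the toy inputs are illustrative assumptions. The default fractions mirror the approximate figures on the slide, and the AUC is computed as the share of positive/negative pairs ranked correctly, with ties counted as half.

```python
import numpy as np

def enriched_train_indices(y, train_frac=0.2, pos_frac=0.1, seed=0):
    """Sample a training set of fixed size in which pos_frac of the rows are positives."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n_train = int(round(train_frac * len(y)))
    n_pos = int(round(pos_frac * n_train))
    pos = rng.choice(np.flatnonzero(y == 1), size=n_pos, replace=False)
    neg = rng.choice(np.flatnonzero(y == 0), size=n_train - n_pos, replace=False)
    return np.concatenate([pos, neg])

def confusion_rates(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "fnr": fn / (fn + tp),      # missed positives among actual positives
        "fpr": fp / (fp + tn),      # false alarms among actual negatives
    }

def auc(y_true, scores):
    """Pairwise (Mann-Whitney) AUC; fine for small demos, quadratic in memory."""
    y_true, scores = np.asarray(y_true), np.asarray(scores, dtype=float)
    diff = scores[y_true == 1][:, None] - scores[y_true == 0][None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Toy labels with a 3% base rate (assumed), enriched to 10% positives in training
labels = np.zeros(1000, dtype=int)
labels[:30] = 1
idx = enriched_train_indices(labels)

# Tiny worked example for the metrics
y_demo = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
rates = confusion_rates(y_demo, (scores >= 0.5).astype(int))
demo_auc = auc(y_demo, scores)
```

Because enrichment changes the class balance seen in training, metrics are best computed on data that keeps the original base rate.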
- 15. CONCLUSIONS •Deep architectures can provide gain above individual models, particularly at lower sample sizes, suggesting deep feed-forward approaches are efficacious at improving predictive capabilities. • This suggests that deep architectures can improve individual models that work well on a particular problem. • However, there is evidence that the topology of mappings between layers using these more complex machine learning functions detracts from the predictive capabilities and universal approximation property. •Deep architectures with a variety of algorithms in each layer provide gains above individual models and achieve good performance at low sample sizes under real-world conditions. •However, superlearners provide more robust models with no architecture design or tuning needed; with group overlap and/or a combination of linear and nonlinear relationships, they are the best models to use, even at sample sizes where deep architecture begins to converge. • Superlearners yield interpretable models and, hence, insight into important relationships between predictors and an outcome.
- 16. SELECTED REFERENCES Theory and practice
- 17. • Aliper, A., Plis, S., Artemov, A., Ulloa, A., Mamoshina, P., & Zhavoronkov, A. (2016). Deep learning applications for predicting pharmacological properties of drugs and drug repurposing using transcriptomic data. Molecular pharmaceutics, 13(7), 2524-2530. • Altman, N. S. (1992). An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3), 175-185. • Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32. • Dekker, G., Pechenizkiy, M., & Vleeshouwers, J. (2009, July). Predicting students drop out: A case study. In Educational Data Mining 2009. • Devroye, L. (1978). The uniform convergence of nearest neighbor regression function estimators and their application in optimization. IEEE Transactions on Information Theory, 24(2), 142-151. • Friedman, J. H. (1991). Multivariate adaptive regression splines. The annals of statistics, 1-67. • Friedman, J. H., & Meulman, J. J. (2003). Multiple additive regression trees with application in epidemiology. Statistics in medicine, 22(9), 1365-1381. • Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural networks, 2(5), 359-366. • Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical statistics, 15(3), 651-674. • Huang, G. B., Chen, L., & Siew, C. K. (2006). Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Trans. Neural Networks, 17(4), 879-892. • Huang, G. B., Wang, D. H., & Lan, Y. (2011). Extreme learning machines: a survey. International Journal of Machine Learning and Cybernetics, 2(2), 107-122. • Huberty, C. J., & Lowman, L. L. (2000). Group overlap as a basis for effect size. Educational and Psychological Measurement, 60(4), 543-563. • Kang, B., & Choo, H. (2016). A deep-learning-based emergency alert system. ICT Express, 2(2), 67-70.
• Lian, H. (2011). Convergence of functional k-nearest neighbor regression estimate with functional responses. Electronic Journal of Statistics, 5, 31-40. • Osborne, J. W., & Overbay, A. (2004). The power of outliers (and why researchers should always check for them). Practical assessment, research & evaluation, 9(6), 1-12. • Ozuysal, M., Calonder, M., Lepetit, V., & Fua, P. (2010). Fast keypoint recognition using random ferns. IEEE transactions on pattern analysis and machine intelligence, 32(3), 448-461. • Pirracchio, R., Petersen, M. L., Carone, M., Rigon, M. R., Chevret, S., & van der Laan, M. J. (2015). Mortality prediction in intensive care units with the Super ICU Learner Algorithm (SICULA): a population-based study. The Lancet Respiratory Medicine, 3(1), 42-52. • Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural networks, 61, 85-117.

- Computationally expensive in traditional algorithms and rooted in topological maps. Cannot handle lots of variables compared to number of observations. Cannot handle non-independent data. Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural networks, 2(5), 359-366.
- Random mappings to reduce MLP to linear system of equations. Huang, G. B., Wang, D. H., & Lan, Y. (2011). Extreme learning machines: a survey. International Journal of Machine Learning and Cybernetics, 2(2), 107-122.
- Computationally expensive neural network extension. Still suffers from singularities which hinder performance. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).
- Bagging of different base models (same bootstrap or different bootstrap). van der Laan, M. J., Polley, E. C., & Hubbard, A. E. (2007). Super learner. Statistical applications in genetics and molecular biology, 6(1).