Nonlinear methods are widely used to produce higher performance compared with linear methods; however, nonlinear methods are generally more expensive in model size, training time, and scoring phase. With proper feature engineering techniques like polynomial expansion, the linear methods can be as competitive as those nonlinear methods. In the process of mapping the data to higher dimensional space, the linear methods will be subject to overfitting and instability of coefficients which can be addressed by penalization methods including Lasso and Elastic-Net. Finally, we'll show how to train linear models with Elastic-Net regularization using MLlib.
Several learning algorithms such as kernel methods, decision tress, and random forests are nonlinear approaches which are widely used to have better performance compared with linear methods. However, with feature engineering techniques like polynomial expansion by mapping the data into a higher dimensional space, the performance of linear methods can be as competitive as those nonlinear methods. As a result, linear methods remain to be very useful given that the training time of linear methods is significantly faster than the nonlinear ones, and the model is just a simple small vector which makes the prediction step very efficient and easy. However, by mapping the data into higher dimensional space, those linear methods are subject to overfitting and instability of coefficients, and those issues can be successfully addressed by penalization methods including Lasso and Elastic-Net. Lasso method with L1 penalty tends to result in many coefficients shrunk exactly to zero and a few other coefficients with comparatively little shrinkage. L2 penalty trends to result in all small but non-zero coefficients. Combining L1 and L2 penalties are called Elastic-Net method which tends to give a result in between. In the first part of the talk, we'll give an overview of linear methods including commonly used formulations and optimization techniques such as L-BFGS and OWLQN. In the second part of talk, we will talk about how to train linear models with Elastic-Net using our recent contribution to Spark MLlib. We'll also talk about how linear models are practically applied with big dataset, and how polynomial expansion can be used to dramatically increase the performance.
DB Tsai is an Apache Spark committer and a Senior Research Engineer at Netflix. He is recently working with Apache Spark community to add several new algorithms including Linear Regression and Binary Logistic Regression with ElasticNet (L1/L2) regularization, Multinomial Logistic Regression, and LBFGS optimizer. Prior to joining Netflix, DB was a Lead Machine Learning Engineer at Alpine Data Labs, where he developed innovative large-scale distributed linear algorithms, and then contributed back to open source Apache Spark project.