
Linear regression

Talk at ACM SIGKDD meetup on advanced Machine Learning


  1. 1. Linear Regression ACM SIGKDD ADVANCED ML SERIES ASHISH SRIVASTAVA (ANSRIVAS@GMAIL.COM)
  2. 2. Outline Linear Regression ◦ Different perspectives ◦ Issues with linear regression Addressing the issues through regularization Adding sparsity to the model/Feature selection Scikit options
  3. 3. Regression Modeling a quantity as a simple function of features ◦ The predicted quantity should be well approximated as continuous ◦ Prices, lifespan, physical measurements ◦ As opposed to classification, where we seek to predict discrete classes Python example for today: Boston house prices ◦ The model is a linear function of the features ◦ House_price = a*age + b*House_size + ... ◦ Create nonlinear features to capture non-linearities ◦ House_size2 = House_size*House_size ◦ House_price = a*age + b*House_size + c*House_size2 + ...
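The following is a minimal sketch of the house-price idea above, not the exact code shown in the talk. The Boston dataset has been removed from recent scikit-learn releases, so the California housing data is used as a stand-in, and the squared feature is added by hand from column 0 (median income).

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression

# California housing used as a stand-in for the Boston data (downloads on first use)
X, y = fetch_california_housing(return_X_y=True)

# Hand-crafted nonlinear feature: the square of an existing column,
# mirroring House_size2 = House_size * House_size on the slide.
X_aug = np.hstack([X, X[:, [0]] ** 2])

model = LinearRegression().fit(X_aug, y)
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
```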
  4. 4. Case of two features Image from http://www.pieceofshijiabian.com/dataandstats/stats-216-lecture-notes/week3/ $\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} \approx \begin{bmatrix} 1 & x_{11} & x_{21} \\ 1 & x_{12} & x_{22} \\ 1 & x_{13} & x_{23} \\ \vdots & \vdots & \vdots \\ 1 & x_{1n} & x_{2n} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}$ (figure: the fitted plane, the intercept $\beta_0$, and the residuals)
  5. 5. Linear Regression • Model a quantity as a linear function of some known features • $y$ is the quantity to be modeled • $X$ holds the sample points, with each row being one data point ◦ Columns are feature vectors • Goal: estimate the model coefficients $\beta$ in $y \approx X\beta$
  6. 6. Least squares: Optimization perspective Define the objective function using the 2-norm of the residuals ◦ $\text{residuals} = y - X\beta$ ◦ Minimize: $f_{obj} = \|y - X\beta\|_2^2 = (y - X\beta)^T(y - X\beta) = \beta^T X^T X \beta - 2 y^T X \beta + y^T y$ ◦ $\frac{\partial f_{obj}}{\partial \beta} = 2 X^T X \beta - 2 X^T y = 0$ ◦ Normal equation: $\hat\beta = (X^T X)^{-1} X^T y$ ◦ $X$ is assumed to be thin and full rank so that $X^T X$ is invertible
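A small NumPy sketch of the normal equation on synthetic data (the sizes, coefficients, and noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])  # prepend an intercept column
beta_true = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# beta_hat = (X^T X)^{-1} X^T y; solving the linear system is preferred over forming the inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to beta_true
```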
  7. 7. Geometrical perspective We are trying to approximate $y$ as a linear combination of the column vectors of $X$ Let's make the residual orthogonal to the column space of $X$: $X^T(y - X\beta) = 0$, which gives the same normal equation :) $\hat\beta = (X^T X)^{-1} X^T y = A y$, where $A$ defines a left inverse of the rectangular matrix $X$ Image from http://www.wikiwand.com/en/Ordinary_least_squares
  8. 8. Python Example
  9. 9. Python example
  10. 10. What is Scikit doing? http://www.mathworks.com/company/newsletters/articles/professor-svd.html Singular Value Decomposition (SVD) ◦ $X = U\Sigma V^T$ defines a general pseudo-inverse $V\Sigma^{\dagger}U^T$ ◦ Known as the Moore-Penrose inverse ◦ For a thin matrix it is the left inverse ◦ For a fat matrix it is the right inverse ◦ Provides a minimum-norm solution of an underdetermined set of equations In general $X^T X$ may not be full rank We get the minimum-norm solution among the set of least-squares solutions, i.e. among all solutions having the smallest residual norm (figure: least-squares solution set and the least-norm point)
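A hedged sketch of the SVD-based pseudo-inverse on a deliberately rank-deficient design; the matrix and the tolerance are chosen only for illustration, with np.linalg.pinv used as a cross-check:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X = np.hstack([X, X[:, [0]] + X[:, [1]]])     # add a linearly dependent column: X^T X is singular
y = rng.normal(size=50)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
tol = 1e-10
s_inv = np.where(s > tol * s.max(), 1.0 / s, 0.0)  # invert only the non-negligible singular values
beta_svd = Vt.T @ (s_inv * (U.T @ y))              # V Sigma^dagger U^T y

beta_pinv = np.linalg.pinv(X, rcond=tol) @ y       # same thing via the Moore-Penrose inverse
print(np.allclose(beta_svd, beta_pinv))            # True
```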
  11. 11. Stats perspective Maximum Likelihood Estimator (MLE) ◦ Normally distributed error: $y = X\beta_{true} + \varepsilon$ with $y - X\beta = \varepsilon \sim N(0, \sigma^2 I)$ ◦ Consider the exponent in the Gaussian pdf: $(2\pi)^{-k/2}|\Sigma_\varepsilon|^{-1/2} e^{-\frac{1}{2}(\varepsilon-\mu_\varepsilon)^T \Sigma_\varepsilon^{-1}(\varepsilon-\mu_\varepsilon)} = (2\pi)^{-k/2}|\Sigma_\varepsilon|^{-1/2} e^{-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta)}$ ◦ Maximizing the likelihood therefore reduces to L2-norm minimization of the residual in $y \approx X\beta$
  12. 12. Let's look at the distribution of our estimated model coefficients: $\hat\beta = (X^T X)^{-1} X^T y = (X^T X)^{-1} X^T (X\beta_{true} + \varepsilon) = \beta_{true} + (X^T X)^{-1} X^T \varepsilon$, so $E[\hat\beta] = \beta_{true}$ Yay! Unbiased estimator ◦ We can show it is the best linear unbiased estimator (BLUE) $Cov(\hat\beta) = E[(\hat\beta - \beta_{true})(\hat\beta - \beta_{true})^T] = (X^T X)^{-1} X^T E[\varepsilon\varepsilon^T] X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}$ Even if $X^T X$ is only close to being non-invertible we are in trouble Problem I: Unstable results
  13. 13. Estimate parameter variance Bootstrapping
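The bootstrap idea can be sketched as follows (illustrative data and resample count; this is not the exact code from the talk):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 3))
y = X @ np.array([1.5, -2.0, 0.0]) + rng.normal(scale=0.5, size=n)

coefs = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)          # resample rows with replacement
    coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_)
coefs = np.array(coefs)

print("bootstrap std of each coefficient:", coefs.std(axis=0))
```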
  14. 14. Problem II: Overfitting Model describes the training data very well ◦ Actually "too" well ◦ The model is adapting to any noise in the training data The model is very bad at predicting at other points, which defeats the purpose of predictive modeling How do we know that we have overfit? What can we do to avoid overfitting? Image from http://blog.rocapal.org/?p=423
  15. 15. Outline Linear Regression ◦ Different perspectives ◦ Issues with linear regression Addressing the issues through regularization ◦ Ridge regression ◦ Python example: Bootstrapping to demonstrate reduction in variance ◦ Optimizing the predictive capacity of the model through cross validation Adding sparsity to the model/Feature selection Scikit options
  16. 16. Ridge Regression / Tikhonov regularization A biased linear estimator used to get lower variance ◦ Least squares was BLUE, so we can't hope to get better variance while staying unbiased Equivalent to the Gaussian MLE with a Gaussian prior on the model coefficients Instead of minimizing $\|y - X\beta\|_2^2$, minimize $\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2$, which gives the modified normal equations $(X^T X + \lambda I)\hat\beta = X^T y$
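A short sketch relating the ridge normal equations to scikit-learn's Ridge estimator (fit_intercept=False so both solve exactly the same problem; the data and λ are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
lam = 1.0

# Closed form: (X^T X + lambda I) beta = X^T y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta_closed, beta_sklearn, atol=1e-6))  # True, up to solver tolerance
```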
  17. 17. Python example: Creating test cases make_regression in sklearn.datasets ◦ Several parameters to control the "type" of dataset we want ◦ Parameters: ◦ Size: n_samples and n_features ◦ Type: n_informative, effective_rank, tail_strength, noise We want to test ridge regression on datasets with a low effective rank ◦ Highly correlated (or linearly dependent) features
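One possible way to exercise this, assuming make_regression from sklearn.datasets (parameter values are illustrative): build one low-effective-rank design, then refit plain least squares and ridge under repeated noise draws and compare how much the coefficients move.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# One fixed low-effective-rank design with known ground truth and noiseless targets
X, y_clean, w_true = make_regression(n_samples=100, n_features=10, n_informative=5,
                                     effective_rank=2, tail_strength=0.1,
                                     noise=0.0, coef=True, random_state=0)

rng = np.random.default_rng(0)
ols_coefs, ridge_coefs = [], []
for _ in range(200):
    y = y_clean + rng.normal(scale=5.0, size=y_clean.shape)   # fresh noise each repetition
    ols_coefs.append(LinearRegression().fit(X, y).coef_)
    ridge_coefs.append(Ridge(alpha=1.0).fit(X, y).coef_)

print("mean coefficient std, OLS  :", np.array(ols_coefs).std(axis=0).mean())
print("mean coefficient std, ridge:", np.array(ridge_coefs).std(axis=0).mean())  # typically much smaller
```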
  18. 18. Python: Comparing ridge with basic regression
  19. 19. Comparison of variances Linear regression Ridge regression
  20. 20. Scikit: Ridge solvers The regularized problem is inherently much better conditioned than the LinearRegression() case Several choices for the solver are provided by Scikit ◦ SVD ◦ Used by the unregularized linear regression ◦ Cholesky factorization ◦ Conjugate gradients (CGLS) ◦ Iterative method, so we can target the quality of fit ◦ LSQR ◦ Similar to CG but more stable and may need fewer iterations to converge ◦ Stochastic Average Gradient (SAG), fairly new ◦ Use for big data sets ◦ Improvement over standard stochastic gradient ◦ Linear convergence rate, same as gradient descent
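A sketch of selecting the solver explicitly (the data and alpha are illustrative); all of the listed solvers return essentially the same coefficients, they differ in cost and scalability:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=2000, n_features=50, noise=1.0, random_state=0)

for solver in ["svd", "cholesky", "sparse_cg", "lsqr", "sag"]:
    # max_iter and random_state only matter for the iterative/stochastic solvers
    model = Ridge(alpha=1.0, solver=solver, max_iter=10000, random_state=0).fit(X, y)
    print(f"{solver:10s} first coefficient: {model.coef_[0]:.4f}")
```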
  21. 21. How to choose $\lambda$: Cross validation Choosing a smaller $\lambda$ or adding more features will always result in lower error on the training dataset ◦ Overfitting ◦ How do we identify a model that will work as a good predictor? Break up the dataset ◦ Training and validation set Train the model over a subset of the data and test its predictive capability ◦ Test predictions on an independent set of data ◦ Compare various models and choose the model with the best prediction error
  22. 22. Cross validation: Training vs Test Error Image from http://i.stack.imgur.com/S0tRm.png
  23. 23. Leave one out cross validation (LOOCV) Leave one out CV ◦ Leave one data point out as the validation point and train on the remaining dataset ◦ Evaluate the model on the left-out data point ◦ Repeat the modeling and validation for all choices of the left-out data point ◦ Generalizes to leave-p-out (matrix picture: one row of $y \approx X\beta$ is held out at a time)
  24. 24. K-Fold cross validation 2-fold CV ◦ Divide the data set into two parts ◦ Use each part once as training and once as validation dataset ◦ Generalizes to k-fold CV ◦ May want to shuffle the data before partitioning Generally 3/5/10-fold cross validation is preferred ◦ Leave-p-out requires many fits over very similar sets of data ◦ It is also computationally expensive compared to k-fold CV (matrix picture: the rows of $y \approx X\beta$ are split into folds)
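A sketch of both schemes with scikit-learn's splitters (the model, data, and scoring choice are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)
model = Ridge(alpha=1.0)

loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
kfold_scores = cross_val_score(model, X, y,
                               cv=KFold(n_splits=5, shuffle=True, random_state=0),
                               scoring="neg_mean_squared_error")
print("LOOCV  MSE:", -loo_scores.mean())
print("5-fold MSE:", -kfold_scores.mean())
```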
  25. 25. RidgeCV: Scikit's Cross-validated Ridge Model
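A minimal sketch of RidgeCV with an assumed alphas grid; by default it picks the regularization strength using an efficient leave-one-out scheme:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=200, n_features=20, effective_rank=5,
                       noise=5.0, random_state=0)

model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
print("chosen alpha:", model.alpha_)
```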
  26. 26. Outline Linear Regression ◦ Different perspectives ◦ Issues with linear regression Addressing the issues through regularization ◦ Ridge regression ◦ Python example: Bootstrapping to demonstrate reduction in variance ◦ Optimizing the predictive capacity of the model through cross validation Adding sparsity to the model/Feature selection ◦ LASSO ◦ Basis Pursuit Methods: Matching Pursuit and Least Angle Regression Scikit options
  27. 27. LASSO The penalty term on the coefficient sizes is now the l1 norm: instead of minimizing just $\|y - X\beta\|_2^2$, minimize $\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ Equivalent to the Gaussian MLE with a Laplacian prior distribution on the parameters Can result in many feature coefficients being exactly zero, i.e. a sparse solution ◦ Can be used to select a subset of features: feature selection
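A small sketch of the sparsity effect (the alpha value and the dataset are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

model = Lasso(alpha=1.0).fit(X, y)
print("non-zero coefficients:", np.sum(model.coef_ != 0), "out of", X.shape[1])
```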
  28. 28. How does this induce sparsity? (figure: the penalty function and the corresponding prior)
  29. 29. Scikit LASSO: Coordinate descent Minimize along the coordinate axes iteratively ◦ In general it does not work for non-differentiable functions (the next slide shows why LASSO is an exception)
  30. 30. LASSO objective The non-differentiable part of the objective is separable: $h(x_1, x_2, \ldots, x_n) = f_1(x_1) + f_2(x_2) + \ldots + f_n(x_n)$ is a separable function, and coordinate descent still works when the non-smooth part is separable Scikit has an option called "selection" to choose the coordinate either cyclically or at random (an illustrative implementation is sketched below)
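An illustrative NumPy implementation of cyclic coordinate descent with soft-thresholding for the objective (1/2)‖y − Xβ‖² + λ‖β‖₁; this is a teaching sketch, not scikit-learn's actual Cython implementation, and the data and λ are made up:

```python
import numpy as np

def soft_threshold(z, t):
    # Shrink z toward zero by t; the closed-form solution of the 1-D lasso subproblem
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)            # x_j^T x_j for each column
    resid = y - X @ beta                     # residual kept up to date incrementally
    for _ in range(n_iter):
        for j in range(p):                   # cyclic selection; "random" would shuffle j
            rho = X[:, j] @ resid + col_sq[j] * beta[j]   # x_j^T (y - X beta_{-j})
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new_bj)         # update the residual
            beta[j] = new_bj
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + 0.1 * rng.normal(size=100)
print(lasso_cd(X, y, lam=5.0))               # most entries end up exactly zero
```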
  31. 31. Matching Pursuit (MP) Select the feature most correlated with the residual (figure: two features f1 and f2)
  32. 32. Orthogonal Matching Pursuit (OMP) Keep the residual orthogonal to the set of selected features (O)MP methods are greedy ◦ Correlated features are ignored and will not be considered again (figure: two features f1 and f2)
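A sketch of scikit-learn's OMP estimator (the dataset and the number of non-zeros are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import OrthogonalMatchingPursuit

X, y = make_regression(n_samples=200, n_features=30, n_informative=4,
                       noise=1.0, random_state=0)

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=4).fit(X, y)
print("selected features:", np.flatnonzero(omp.coef_))
```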
  33. 33. LARS (Least Angle Regression) Move along the most correlated feature until another feature becomes equally correlated (figure: two features f1 and f2)
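A sketch using lars_path to see which features become active along the path (dataset parameters are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

# alphas: correlation thresholds along the path; active: indices of the active features
alphas, active, coefs = lars_path(X, y, method="lar")
print("active features along the LARS path:", active)
```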
  34. 34. Outline Linear Regression ◦ Different perspectives ◦ Issues with linear regression Addressing the issues through regularization ◦ Ridge regression ◦ Python example: Bootstrapping to demonstrate reduction in variance ◦ Optimizing the predictive capacity of the model through cross validation Adding sparsity to the model/Feature selection ◦ LASSO ◦ Basis Pursuit Methods: Matching Pursuit and Least Angle Regression Scikit options
  35. 35. Options Normalize (default False) ◦ Scale the feature vectors to have unit norm ◦ Your choice Fit intercept (default True) ◦ False implies that X and y are already centered ◦ Basic linear regression will do this implicitly if X is not sparse and compute the intercept separately ◦ Centering can kill sparsity ◦ Center the data matrix in regularized regressions unless you really want a penalty on the bias ◦ Issues with sparsity were still being worked out in scikit at the time of the talk (temporary bug fix for ridge in 0.17 using the sag solver)
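A sketch of these options. Note that the normalize flag has since been removed from recent scikit-learn releases; an explicit StandardScaler in a pipeline is the usual replacement, so that is what is shown here (the data and alpha are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)

# fit_intercept=True (default): X and y are centered internally, so the bias is not penalized
ridge = Ridge(alpha=1.0, fit_intercept=True)

# Explicit scaling instead of the old normalize=True flag
model = make_pipeline(StandardScaler(), ridge).fit(X, y)
print("intercept:", model.named_steps["ridge"].intercept_)
```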
  36. 36. RidgeCV options CV - control to choose the type of cross validation ◦ Default: LOOCV ◦ An integer value n sets n-fold CV ◦ You can provide your own data splits as well (matrix picture: the full system $y \approx X\beta$)
  37. 37. RidgeCV options CV - control to choose the type of cross validation ◦ Default: LOOCV ◦ An integer value n sets n-fold CV ◦ You can provide your own data splits as well (matrix picture: hold one row out and fit $\beta_{new}$ on the rest)
  38. 38. RidgeCV options CV - control to choose the type of cross validation ◦ Default: LOOCV ◦ An integer value n sets n-fold CV ◦ You can provide your own data splits as well (matrix picture: predict the held-out entry as $\beta_{new}^T x_2$ and compare it with the true $y_2$)
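A sketch of the three ways to set cv (the alphas grid and data are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)
alphas = np.logspace(-3, 3, 13)

loo = RidgeCV(alphas=alphas).fit(X, y)                  # default: efficient leave-one-out
kfold = RidgeCV(alphas=alphas, cv=5).fit(X, y)          # integer: 5-fold CV
custom = RidgeCV(alphas=alphas,
                 cv=KFold(n_splits=5, shuffle=True, random_state=0)).fit(X, y)  # your own splits
print(loo.alpha_, kfold.alpha_, custom.alpha_)
```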
  39. 39. Lasso(CV)/Lars(CV) options Positive โ—ฆ Force coefficients to be positive Other controls for iterations โ—ฆ Number of iterations (Lasso) / Number of non-zeros (Lars) โ—ฆ Tolerance to stop iterations (Lasso)
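A sketch of these controls with illustrative values:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lars, LassoCV

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=1.0, random_state=0)

# Lasso: positivity constraint, iteration cap, and stopping tolerance
lasso = LassoCV(positive=True, max_iter=5000, tol=1e-4, cv=5).fit(X, y)
# Lars: cap on the number of non-zero coefficients
lars = Lars(n_nonzero_coefs=5).fit(X, y)

print("LassoCV alpha:", lasso.alpha_, "| Lars non-zeros:", (lars.coef_ != 0).sum())
```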
  40. 40. Summary Linear Models ◦ Linear regression ◦ Ridge: L2 penalty ◦ Lasso: L1 penalty, results in sparsity ◦ LARS: select a sparse set of features iteratively Use Cross Validation (CV) to choose your models, leveraging scikit ◦ RidgeCV, LarsCV, LassoCV Not discussed, explore scikit ◦ Combining Ridge and Lasso: Elastic Nets ◦ Random Sample Consensus (RANSAC) ◦ Fitting linear models where the data has several outliers ◦ LassoLars, lars_path
  41. 41. References All code examples are taken from "Scikit-Learn Cookbook" by Trent Hauck, with some slight modifications. LSQR: C. C. Paige and M. A. Saunders, "LSQR: An Algorithm for Sparse Linear Equations and Sparse Least Squares," 1982. Ridge SAG: Mark Schmidt, Nicolas Le Roux, Francis Bach, "Minimizing Finite Sums with the Stochastic Average Gradient," 2013. RidgeCV LOOCV: Rifkin, Lippert, "Notes on Regularized Least Squares," MIT Technical Report, 2007. BP Methods (1): Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani, "Least Angle Regression," 2004. BP Methods (2): Hameed, "Comparative Analysis of Orthogonal Matching Pursuit and Least Angle Regression," MSU MS Thesis, 2012.
  42. 42. BACKUP
  43. 43. Python example
  44. 44. Stochastic Gradient Descent When we have an immense number of samples or features, SGD can come in handy Randomly select a sample point and use that to evaluate a gradient direction in which to move the parameters ◦ Repeat the procedure until a "tolerance" is achieved Normalizing the data is important
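A sketch with SGDRegressor, scaling the features first as the slide recommends (the data sizes and tolerance are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=10_000, n_features=50, noise=5.0, random_state=0)

# Scaling matters a lot for SGD; the default loss is squared error
model = make_pipeline(StandardScaler(),
                      SGDRegressor(penalty="l2", tol=1e-4, random_state=0))
model.fit(X, y)
print("R^2 on training data:", model.score(X, y))
```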
  45. 45. Recursive least squares Suppose a scenario in which we sequentially obtain a sample point and measurement and we would like to continually update our least squares estimate ◦ "Incremental" least squares estimate ◦ Rank-one update of the matrix $X^T X$ Utilize the matrix inversion lemma A similar idea is used in RidgeCV LOOCV
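An illustrative recursive-least-squares sketch using the Sherman-Morrison form of the matrix inversion lemma (the initial batch size, noise level, and stream length are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
beta_true = rng.normal(size=p)

# Initialize from a small batch so that X^T X is invertible
X0 = rng.normal(size=(20, p))
y0 = X0 @ beta_true + 0.1 * rng.normal(size=20)
P = np.linalg.inv(X0.T @ X0)        # running estimate of (X^T X)^{-1}
beta = P @ X0.T @ y0

# Stream new samples one at a time and update the estimate
for _ in range(500):
    x = rng.normal(size=p)
    y = x @ beta_true + 0.1 * rng.normal()
    Px = P @ x
    P -= np.outer(Px, Px) / (1.0 + x @ Px)   # Sherman-Morrison rank-one update of (X^T X)^{-1}
    beta += P @ x * (y - x @ beta)           # gain times prediction error

print(beta, beta_true)                       # the running estimate tracks the true coefficients
```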
