Regression CS294 Practical Machine Learning Romain Thibaux 09/18/06
Outline
- Ordinary Least Squares regression
  - Derivation from minimizing the sum of squares
  - Probabilistic interpretation
  - Online version (LMS)
- Overfitting and Regularization
- Numerical stability
- L1 Regression
- Kernel Regression, Spline Regression
- Multivariate Adaptive Regression Splines (MARS)
Classification (reminder): X → Y
X: anything: continuous (ℝ, ℝᵈ, …), discrete ({0,1}, {1,…,k}, …), structured (tree, string, …), …
Y: discrete: {0,1} (binary), {1,…,k} (multi-class), tree etc. (structured)
Methods: Perceptron, Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, Kernel trick
Regression: X → Y
X: anything: continuous (ℝ, ℝᵈ, …), discrete ({0,1}, {1,…,k}, …), structured (tree, string, …), …
Y: continuous (ℝ, ℝᵈ)
Examples
- Voltage → Temperature
- Processes, memory → Power consumption
- Protein structure → Energy [next week]
- Robot arm controls → Torque at effector
- Location, industry, past losses → Premium
Linear regression [figures: temperature data] [start Matlab demo lecture2.m]
Given examples (x_i, y_i), i = 1..n, predict y for a new point x.
Linear regression [figures: fitted line/plane, with predictions marked at new points]
Ordinary Least Squares (OLS) [figure: the vertical gap between each observation y_i and its prediction is the error or "residual"]
Sum squared error: Σ_i (y_i − ŷ_i)², where ŷ_i is the prediction at x_i.
Minimize the sum squared error: min_θ Σ_i (y_i − θᵀx_i)². Setting the gradient to zero gives a linear equation in θ: Σ_i x_i (y_i − θᵀx_i) = 0, i.e., the linear system (Σ_i x_i x_iᵀ) θ = Σ_i x_i y_i.
Alternative derivation: stack the examples into the n×d matrix X (one row per example) and the n×1 vector y; the objective becomes ‖Xθ − y‖², whose minimizer satisfies the normal equations XᵀX θ = Xᵀy. Solve the system (it's better not to invert the matrix).
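A minimal Matlab sketch of this solve (not taken from lecture2.m; X and y are the design matrix and targets defined above):

% X: n-by-d design matrix, y: n-by-1 targets
theta = (X' * X) \ (X' * y);   % solve the normal equations; avoids inv()
% Even better numerically: let Matlab factorize X itself (QR-based least squares).
theta = X \ y;
yhat  = X * theta;             % fitted values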
LMS Algorithm (Least Mean Squares): after seeing example (x_t, y_t), update θ ← θ + α (y_t − θᵀx_t) x_t, where α is a step size. Online algorithm: one example at a time, no need to store the whole data set.
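A minimal Matlab sketch of the update (the step size alpha is illustrative, not a value from the lecture):

% One pass of LMS (Widrow-Hoff): update after each example.
[n, d] = size(X);
alpha  = 0.01;                      % step size (illustrative)
theta  = zeros(d, 1);
for t = 1:n
    x = X(t, :)';                   % t-th example as a column vector
    theta = theta + alpha * (y(t) - theta' * x) * x;
end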
Beyond lines and planes: replace x by features φ(x); everything is the same with X built from the φ(x_i), and the model is still linear in θ. [figure: polynomial fit]
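For example, a polynomial feature map in Matlab (a sketch; the degree and variable names are mine):

% phi(x) = (1, x, x^2, ..., x^p): nonlinear in x, still linear in theta.
p   = 3;                           % illustrative degree
Phi = x(:) .^ (0:p);               % n-by-(p+1) Vandermonde matrix (implicit expansion, R2016b+)
theta = Phi \ y(:);                % exactly the same least-squares solve as before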
Geometric interpretation: the prediction Xθ is the point in the span of the columns of X closest to y, i.e., the orthogonal projection of y onto that span. [Matlab demo] [figure: projection in 3D]
Ordinary Least Squares [summary]
Given examples (x_i, y_i), i = 1..n.
Let X be the n×d matrix with rows φ(x_i)ᵀ and y the n×1 vector of targets; for example φ(x) = (1, x, x², …).
Minimize ‖Xθ − y‖² by solving XᵀX θ = Xᵀy.
Predict ŷ = θᵀφ(x) for a new point x.
Probabilistic interpretation [figure: Gaussian noise around the regression line]
Likelihood: assume y_i = θᵀx_i + ε_i with ε_i ~ N(0, σ²) i.i.d.; maximizing the likelihood of the data is equivalent to minimizing the sum squared error.
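Spelled out (the standard derivation; notation mine): under the Gaussian noise model the log-likelihood is

\log p(y \mid X, \theta)
  = \sum_{i=1}^{n} \log \mathcal{N}\!\left(y_i \mid \theta^\top x_i, \sigma^2\right)
  = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y_i - \theta^\top x_i\right)^2 + \text{const},

so maximizing the likelihood over θ is exactly minimizing the sum squared error.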
Assumptions vs. Reality [figure: Intel sensor network data, voltage vs. temperature]
Overfitting [figure: a degree 15 polynomial passes near the training points but oscillates far outside them] [Matlab demo]
Ridge Regression (Regularization) [figure: effect of regularization (degree 19)]
Minimize ‖Xθ − y‖² + λ‖θ‖² with λ "small", by solving (XᵀX + λI) θ = Xᵀy.
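A minimal Matlab sketch (lambda is illustrative; in practice it is tuned, e.g. by cross-validation):

% Ridge regression: shrink theta by penalizing its squared norm.
lambda = 1e-3;                                  % illustrative
d = size(X, 2);
theta = (X' * X + lambda * eye(d)) \ (X' * y);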
Probabilistic interpretation
Likelihood: y_i ~ N(θᵀx_i, σ²). Prior: θ ~ N(0, τ²I). Posterior ∝ likelihood × prior; its mode is the ridge regression solution.
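Spelled out (the standard MAP derivation; notation mine):

\log p(\theta \mid X, y)
  = \log p(y \mid X, \theta) + \log p(\theta) + \text{const}
  = -\frac{1}{2\sigma^2}\,\lVert X\theta - y\rVert^2 - \frac{1}{2\tau^2}\,\lVert\theta\rVert^2 + \text{const},

so the posterior mode is the ridge solution with λ = σ²/τ².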
Numerical Accuracy: governed by the condition number of XᵀX (the ratio of its largest to smallest eigenvalue). We want covariates as perpendicular as possible, and roughly the same scale. Remedies: regularization, preconditioning.
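A quick Matlab check (a sketch; the column scaling shown is one simple preconditioner, not the lecture's):

kappa = cond(X);        % ratio of largest to smallest singular value of X
cond(X' * X)            % equals kappa^2: the normal equations square the conditioning
% Rescaling columns to comparable size reduces the condition number:
D = diag(1 ./ sqrt(sum(X .^ 2)));   % per-column scaling
cond(X * D)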
Errors in Variables (Total Least Squares) [figure: residuals measured perpendicular to the fitted line, since x is noisy too]
Sensitivity to outliers: the squared loss gives high weight to outliers, and its influence function grows linearly with the residual (unbounded). [figure: temperature at noon; a single outlier drags the fit]
L1 Regression: minimize Σ_i |y_i − θᵀx_i|. This can be solved as a linear program, and the influence function is bounded (robust to outliers).
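One way to pose this LP in Matlab (a sketch assuming the Optimization Toolbox's linprog; the encoding and names are mine): introduce t_i ≥ |y_i − θᵀx_i| and minimize Σ_i t_i.

% Variables z = [theta; t].
[n, d] = size(X);
f = [zeros(d, 1); ones(n, 1)];     % objective: sum of t
A = [ X, -eye(n);                  %  X*theta - t <= y
     -X, -eye(n)];                 % -X*theta - t <= -y
b = [y; -y];
z = linprog(f, A, b);
theta = z(1:d);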
Kernel Regression [figure: kernel regression fit (σ = 1)]
ŷ(x) = Σ_i k(x, x_i) y_i / Σ_i k(x, x_i): a locally weighted average of the training targets.
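A minimal Matlab sketch of this kind of smoother (the Nadaraya-Watson form with a Gaussian kernel; that the slide used exactly this estimator is my assumption):

% x, y: training data; xq: query points; sigma = 1 as in the figure.
sigma = 1;
K    = exp(-(xq(:) - x(:)') .^ 2 / (2 * sigma^2));   % m-by-n kernel weights
yhat = (K * y(:)) ./ sum(K, 2);                      % locally weighted average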
Spline Regression: a separate regression on each interval. [figure: piecewise fit]
Spline Regression: with equality constraints at the interval boundaries, so the pieces join continuously. [figure]
Spline Regression: with an L1 cost. [figure]
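One way to get a continuous piecewise-linear fit without explicit equality constraints (a sketch, not the lecture's formulation): hinge features max(0, x − κ) are continuous by construction, so plain least squares on them yields a linear spline.

% Continuous piecewise-linear spline via hinge basis functions.
knots = [5300 5500 5700];          % illustrative knot locations
Phi   = [ones(numel(x), 1), x(:), max(0, x(:) - knots)];
theta = Phi \ y(:);
yhat  = Phi * theta;

These hinge functions are also the building blocks of MARS (below).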
Heteroscedasticity [figure: #requests per minute vs. time (days); the spread of the noise varies across the data]
MARS Multivariate Adaptive Regression Splines … on the board…
Further topics
- Generalized Linear Models
- Gaussian process regression
- Local linear regression
- Feature Selection [next class]

Editor's Notes

  • #9-#11 Figure 1: scatter(1:20,10+(1:20)+2*randn(1,20),'k','filled'); a=axis; a(3)=0; axis(a);
  • #14 Link with perceptron
  • #26 A linear program is fast, but much slower than solving a linear system
  • #27 Link with kNN