Linear Regression
Machine Learning Seminar Series ’11

Nikita Zhiltsov
11 March 2011
Motivating example
Prices of houses in Portland

    Living area (ft²)    #bedrooms    Price ($1000s)
    2104                 3            400
    1600                 3            330
    2400                 3            369
    1416                 2            232
    3000                 4            540
    ...                  ...          ...
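
To make the later formulas easy to try out, the table above can be held in two NumPy arrays. This is only an illustrative sketch; the names X_raw and t are not from the slides:

    import numpy as np

    # Living area (ft^2) and number of bedrooms for the five listed houses
    X_raw = np.array([[2104, 3],
                      [1600, 3],
                      [2400, 3],
                      [1416, 2],
                      [3000, 4]], dtype=float)

    # Target: price in $1000s
    t = np.array([400, 330, 369, 232, 540], dtype=float)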
Motivating example
Plot

    [Scatter plot of house prices against living area]

    How can we predict the prices of other houses as a
    function of the size of their living areas?
Terminology and notation
      x ∈ X – input variables (“features”)
      t ∈ T – a target variable
      {x_n}, n = 1, . . . , N – given N observations of input variables
      (x_n, t_n) – a training example
      (x_1, t_1), . . . , (x_N, t_N) – a training set

  Goal
  Find a function y(x) : X → T (a “hypothesis”) to
  predict the value of t for a new value of x
Terminology and notation




     When the target variable t is continuous
     ⇒ a regression problem
     In the case of discrete values
     ⇒ a classification problem
Terminology and notation
Loss function
    L(t, y(x)) – loss function or cost function
    In the case of regression problems the expected loss is
    given by:

        E[L] = ∫_ℝ ∫_X L(t, y(x)) p(x, t) dx dt

    Example
    Squared loss:

        L(t, y(x)) = (1/2) (y(x) − t)²
Linear basis function models
Linear regression

        y(x, w) = w_0 + w_1 x_1 + · · · + w_D x_D,

    where x = (x_1, . . . , x_D)

    In our example,

        y(x, w) = w_0 + w_1 x_1 + w_2 x_2,

    where x_1 is the living area and x_2 is the number of bedrooms.
Linear basis function models
Basis functions

    Generally

        y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x),

    where φ_j(x) are known as basis functions.
    Typically, φ_0(x) = 1, so that w_0 acts as a bias.
    In the simplest case, we use linear basis functions:
    φ_d(x) = x_d.
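
To make the basis-function idea concrete, here is a hedged sketch of building a design matrix Φ for scalar inputs. The helper design_matrix is invented for illustration; it covers the polynomial and Gaussian bases shown on the next slides, with a constant φ_0(x) = 1 column acting as the bias:

    import numpy as np

    def design_matrix(x, kind="polynomial", M=4, mu=None, s=1.0):
        """Return the N x M matrix Phi with Phi[n, j] = phi_j(x_n) for scalar inputs x."""
        x = np.asarray(x, dtype=float)
        if kind == "polynomial":
            # phi_j(x) = x**j, j = 0..M-1 (the j = 0 column is the bias)
            return np.vander(x, M, increasing=True)
        if kind == "gaussian":
            # phi_0(x) = 1, phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)) for centres mu_j
            mu = np.linspace(x.min(), x.max(), M - 1) if mu is None else np.asarray(mu)
            Phi = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * s ** 2))
            return np.column_stack([np.ones_like(x), Phi])
        raise ValueError("unknown basis")

For the running example, design_matrix(X_raw[:, 0], kind="polynomial", M=2) would give the [1, living area] matrix of the straight-line fit.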
Linear basis function models
Polynomial basis functions

    Polynomial basis functions:

        φ_j(x) = x^j.

    These are global; a small change in x affects all basis
    functions.

    [Plot of polynomial basis functions on x ∈ [−1, 1]]
Linear basis function models
Gaussian basis functions

    Gaussian basis functions:

        φ_j(x) = exp( −(x − µ_j)² / (2s²) )

    These are local; a small change in x only affects
    nearby basis functions. µ_j and s control location and
    scale (width).

    [Plot of Gaussian basis functions on x ∈ [−1, 1]]
Linear basis function models
Sigmoidal basis functions

    Sigmoidal basis functions:

        φ_j(x) = σ( (x − µ_j) / s ),

    where

        σ(a) = 1 / (1 + exp(−a)).

    Also these are local; a small change in x only affects
    nearby basis functions. µ_j and s control location and
    scale (slope).

    [Plot of sigmoidal basis functions on x ∈ [−1, 1]]
Probabilistic interpretation

  Assume observations from a deterministic function with added
  Gaussian noise:

      t = y(x, w) + ε,   where p(ε|β) = N(ε | 0, β⁻¹),

  which is the same as saying

      p(t|x, w, β) = N(t | y(x, w), β⁻¹).
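
A small hedged sketch of this noise model in NumPy (the weights, precision and sample size are arbitrary choices, not values from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    beta = 25.0                                  # noise precision; noise variance is 1/beta
    w_true = np.array([0.3, 1.7])                # assumed weights for this sketch only

    x = rng.uniform(-1.0, 1.0, size=50)
    Phi = np.column_stack([np.ones_like(x), x])  # phi_0(x) = 1, phi_1(x) = x
    t = Phi @ w_true + rng.normal(0.0, 1.0 / np.sqrt(beta), size=x.shape)  # t = y(x, w) + eps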
Probabilistic interpretation
Optimal prediction for a squared loss

    Expected loss:

        E[L] = ∫∫ (y(x) − t)² p(x, t) dx dt,

    which is minimized by the conditional mean

        y(x) = E_t[t|x].

    In our case of a Gaussian conditional distribution, it is

        E[t|x] = ∫ t p(t|x) dt = y(x, w).
Probabilistic interpretation
Optimal prediction for a squared loss

    [Figure: the regression function y(x) passes through the mean of the
    conditional distribution p(t|x); at a particular input x_0 the optimal
    prediction is y(x_0), the mean of p(t|x_0)]
Maximum likelihood and least squares

  Given observed inputs X = {x_1, . . . , x_N } and targets
  t = [t_1, . . . , t_N]^T, we obtain the likelihood function

      p(t|X, w, β) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β⁻¹).

  Taking the logarithm, we get

      ln p(t|w, β) = Σ_{n=1}^{N} ln N(t_n | w^T φ(x_n), β⁻¹)
                   = (N/2) ln β − (N/2) ln(2π) − β E_D(w),

  where

      E_D(w) = (1/2) Σ_{n=1}^{N} (t_n − w^T φ(x_n))²

  is the sum-of-squares error.
Maximum likelihood and least squares

  Computing the gradient and setting it to zero yields

      ∇_w ln p(t|w, β) = β Σ_{n=1}^{N} (t_n − w^T φ(x_n)) φ(x_n)^T = 0.

  Solving for w, we get

      w_ML = (Φ^T Φ)⁻¹ Φ^T t,

  where (Φ^T Φ)⁻¹ Φ^T is the Moore-Penrose pseudo-inverse of the
  N × M design matrix Φ, with Φ_{nj} = φ_j(x_n):

      ⎡ φ_0(x_1)  φ_1(x_1)  · · ·  φ_{M−1}(x_1) ⎤
      ⎢ φ_0(x_2)  φ_1(x_2)  · · ·  φ_{M−1}(x_2) ⎥
      ⎢    ⋮          ⋮       ⋱        ⋮        ⎥
      ⎣ φ_0(x_N)  φ_1(x_N)  · · ·  φ_{M−1}(x_N) ⎦
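
In code, the closed-form solution is a single least-squares solve. A minimal sketch, assuming a design matrix Phi and target vector t as in the earlier snippets; np.linalg.lstsq (or np.linalg.pinv) is preferable to forming (Φ^T Φ)⁻¹ explicitly:

    import numpy as np

    def fit_ml(Phi, t):
        # w_ML = (Phi^T Phi)^{-1} Phi^T t, computed via a stable least-squares solve
        w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
        return w_ml

np.linalg.pinv(Phi) @ t gives the same answer via the Moore-Penrose pseudo-inverse named above.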
Maximum likelihood and least squares
Bias parameter

    Rewritten error function:

        E_D(w) = (1/2) Σ_{n=1}^{N} (t_n − w_0 − Σ_{j=1}^{M−1} w_j φ_j(x_n))²

    Setting the derivative w.r.t. w_0 equal to zero, we obtain

        w_0 = t̄ − Σ_{j=1}^{M−1} w_j φ̄_j,

    where

        t̄ = (1/N) Σ_{n=1}^{N} t_n,    φ̄_j = (1/N) Σ_{n=1}^{N} φ_j(x_n).
Maximum likelihood and least squares

      ln p(t|w_ML, β) = (N/2) ln β − (N/2) ln(2π) − β E_D(w_ML)

  Maximizing the log likelihood function w.r.t. the
  noise precision parameter β, we obtain

      1/β_ML = (1/N) Σ_{n=1}^{N} (t_n − w_ML^T φ(x_n))²
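
Continuing the sketch, the noise precision estimate is the reciprocal of the mean squared residual (names invented here):

    import numpy as np

    def fit_noise_precision(Phi, t, w_ml):
        # 1/beta_ML = average squared residual of the ML fit
        residuals = t - Phi @ w_ml
        return 1.0 / np.mean(residuals ** 2)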
Geometry of least squares

  Consider

      y = Φ w_ML = [ϕ_1, . . . , ϕ_M] w_ML,    y ∈ S ⊆ T,  t ∈ T.

  S is spanned by the columns ϕ_1, . . . , ϕ_M.
  w_ML minimizes the distance between t and its orthogonal
  projection onto S, namely y.

  [Figure: the target vector t in T and its orthogonal projection y
  onto the subspace S spanned by ϕ_1 and ϕ_2]
Batch learning
Batch gradient descent

    Consider the gradient descent algorithm, which
    starts with some initial w^(0):

        w^(τ+1) = w^(τ) − η ∇E_D
                = w^(τ) + η Σ_{n=1}^{N} (t_n − (w^(τ))^T φ(x_n)) φ(x_n).

    This is known as the least-mean-squares (LMS)
    algorithm.
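
A hedged sketch of the batch update (learning rate, iteration count and names are my own choices; with raw square-footage features the inputs usually need rescaling for the iteration to converge):

    import numpy as np

    def batch_gradient_descent(Phi, t, eta=1e-2, n_iters=1000, w_init=None):
        # w^(tau+1) = w^(tau) + eta * sum_n (t_n - w^(tau)^T phi(x_n)) phi(x_n)
        w = np.zeros(Phi.shape[1]) if w_init is None else np.asarray(w_init, dtype=float)
        for _ in range(n_iters):
            residuals = t - Phi @ w              # length-N vector of prediction errors
            w = w + eta * (Phi.T @ residuals)    # gradient step on the sum-of-squares error
        return w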
Batch gradient descent
Example calculation

    In the case of ordinary least squares with the living area as the only
    feature, we start from w_0^(0) = 48, w_1^(0) = 30 ...
Batch gradient descent
Results of the example calculation

    ... and obtain the result w_0 = 71.27, w_1 = 0.1345.
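
As a rough sanity check on those numbers (prices in $1000s, living area in ft²), the fitted line predicts about 71.27 + 0.1345 × 2104 ≈ 354, i.e. roughly $354k, for the first house in the motivating table, whose listed price is $400k.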
Sequential learning

  Data items considered one at a time (a.k.a. online
  learning); use stochastic (sequential) gradient
  descent:

      w^(τ+1) = w^(τ) − η ∇E_n
              = w^(τ) + η (t_n − (w^(τ))^T φ(x_n)) φ(x_n).

  Issue: how to choose η?
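
The corresponding stochastic update, in the same hypothetical style (one pass over shuffled examples; how to choose or decay η is left open, as the slide notes):

    import numpy as np

    def sgd_epoch(Phi, t, w, eta=1e-3, rng=None):
        # One sequential pass: update w after every single example
        rng = np.random.default_rng() if rng is None else rng
        w = np.asarray(w, dtype=float)
        for n in rng.permutation(len(t)):
            error = t[n] - Phi[n] @ w            # t_n - w^T phi(x_n)
            w = w + eta * error * Phi[n]
        return w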
Underfitting and overfitting

    [Figure illustrating underfitting and overfitting]
Regularization
Outlier

    [Figure: regression fit in the presence of an outlier]
Regularized least squares

  Consider the error function:

      E_D(w) + λ E_W(w)
      (data term + regularization term)

  λ is called the regularization coefficient.
  With the sum-of-squares error function and a quadratic
  regularizer, we get

      (1/2) Σ_{n=1}^{N} (t_n − w^T φ(x_n))² + (λ/2) w^T w,

  which is minimized by

      w = (λI + Φ^T Φ)⁻¹ Φ^T t.
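
A hedged sketch of the regularized closed form (lam stands for the regularization coefficient λ; the function name is invented here):

    import numpy as np

    def fit_ridge(Phi, t, lam=1.0):
        # w = (lambda * I + Phi^T Phi)^{-1} Phi^T t
        M = Phi.shape[1]
        return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

Unlike the unregularized solution, λI + Φ^T Φ is invertible for any λ > 0, which is one practical benefit of the quadratic regularizer.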
Regularized least squares

  With a more general regularizer, we have

      (1/2) Σ_{n=1}^{N} (t_n − w^T φ(x_n))² + (λ/2) Σ_{j=1}^{M} |w_j|^q

  [Figure: contours of the regularization term for q = 0.5, q = 1 (lasso),
  q = 2 (quadratic) and q = 4]
Regularized least squares

  Lasso tends to generate sparser solutions than a
  quadratic regularizer.

  [Figure: contours of the unregularized error function together with the
  lasso (left) and quadratic (right) constraint regions in (w_1, w_2) space]
Multiple outputs

  Analogously to the single output case, we have:

      p(t|x, W, β) = N(t | y(x, W), β⁻¹ I)
                   = N(t | W^T φ(x), β⁻¹ I).

  Given observed inputs X = {x_1, . . . , x_N } and targets
  T = [t_1, . . . , t_N]^T, we obtain the log likelihood function

      ln p(T|X, W, β) = Σ_{n=1}^{N} ln N(t_n | W^T φ(x_n), β⁻¹ I)
                      = (NK/2) ln(β/(2π)) − (β/2) Σ_{n=1}^{N} ||t_n − W^T φ(x_n)||²,

  where K is the dimensionality of each target t_n.
Multiple outputs

  Maximizing w.r.t. W, we obtain

      W_ML = (Φ^T Φ)⁻¹ Φ^T T.

  If we consider a single target variable t_k, we see that

      w_k = (Φ^T Φ)⁻¹ Φ^T t_k,

  where t_k = [t_{1k}, . . . , t_{Nk}]^T, which is identical to
  the single output case.
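
Because every output shares the same pseudo-inverse, the multi-output fit is the same one-liner applied column-wise; a hedged sketch (T is the N × K target matrix, names invented here):

    import numpy as np

    def fit_ml_multi(Phi, T):
        # W_ML = (Phi^T Phi)^{-1} Phi^T T; each column of W_ML solves its own
        # single-output least-squares problem with the shared design matrix Phi
        W_ml, *_ = np.linalg.lstsq(Phi, T, rcond=None)
        return W_ml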
Resources

  • Stanford Engineering Everywhere CS229 – Machine Learning
    http://videolectures.net/stanfordcs229f07_machine_learning/
  • Bishop C.M. Pattern Recognition and Machine Learning. Springer, 2006.
    http://research.microsoft.com/en-us/um/people/cmbishop/prml/
