1 - Linear Regression


The first talk of the Machine Learning Seminar organized by the Computational Linguistics Laboratory at Kazan Federal University. See http://cll.niimm.ksu.ru/cms/lang/en_US/main/seminars/mlseminar

  1. Linear Regression. Machine Learning Seminar Series ’11. Nikita Zhiltsov, 11 March 2011.
  2. Motivating example: prices of houses in Portland (a NumPy setup of this table appears after the slide list).

        Living area (ft²)   #bedrooms   Price ($1000s)
        2104                3           400
        1600                3           330
        2400                3           369
        1416                2           232
        3000                4           540
        ...
  3. Motivating example (plot of the data). How can we predict the prices of other houses as a function of the size of their living areas?
  4. Terminology and notation. $x \in X$ – input variables (“features”); $t \in T$ – a target variable; $\{x_n\}$, $n = 1, \dots, N$ – $N$ given observations of the input variables; $(x_n, t_n)$ – a training example; $(x_1, t_1), \dots, (x_N, t_N)$ – a training set. Goal: find a function $y(x): X \to T$ (a “hypothesis”) to predict the value of $t$ for a new value of $x$.
  5. Terminology and notation. When the target variable $t$ is continuous ⇒ a regression problem; in the case of discrete values ⇒ a classification problem.
  6. Terminology and notation: loss function. $L(t, y(x))$ – loss function or cost function. In the case of regression problems the expected loss is given by

        $\mathbb{E}[L] = \int_{\mathbb{R}} \int_{X} L(t, y(x))\, p(x, t)\, dx\, dt.$

     Example: squared loss, $L(t, y(x)) = \frac{1}{2}(y(x) - t)^2$.
  7. Linear basis function models: linear regression.

        $y(x, w) = w_0 + w_1 x_1 + \dots + w_D x_D$, where $x = (x_1, \dots, x_D)$.

     In our example, $y(x, w) = w_0 + w_1 x_1 + w_2 x_2$, with $x_1$ the living area and $x_2$ the number of bedrooms.
  8. Linear basis function models: basis functions. Generally

        $y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x),$

     where the $\phi_j(x)$ are known as basis functions. Typically $\phi_0(x) = 1$, so that $w_0$ acts as a bias. In the simplest case, we use linear basis functions: $\phi_d(x) = x_d$. (A sketch of the resulting design matrix appears after the slide list.)
  9. Linear basis function models: polynomial basis functions. $\phi_j(x) = x^j$. These are global; a small change in $x$ affects all basis functions. (Figure: polynomial basis functions on $[-1, 1]$.)
  10. Linear basis function models: Gaussian basis functions.

        $\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2 s^2}\right)$

     These are local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (width). (Figure: Gaussian basis functions on $[-1, 1]$.)
  11. Linear basis function models: sigmoidal basis functions.

        $\phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$.

     These too are local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (slope). (Figure: sigmoidal basis functions on $[-1, 1]$. All three basis families are sketched in code after the slide list.)
  12. Probabilistic interpretation. Assume the observations come from a deterministic function with added Gaussian noise:

        $t = y(x, w) + \epsilon$, where $p(\epsilon \mid \beta) = \mathcal{N}(\epsilon \mid 0, \beta^{-1}),$

     which is the same as saying $p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1})$.
  13. Probabilistic interpretation: optimal prediction for a squared loss. The expected loss

        $\mathbb{E}[L] = \iint (y(x) - t)^2\, p(x, t)\, dx\, dt$

     is minimized by the conditional mean $y(x) = \mathbb{E}_t[t \mid x]$. In our case of a Gaussian conditional distribution, $\mathbb{E}[t \mid x] = \int t\, p(t \mid x)\, dt = y(x, w)$.
  14. Probabilistic interpretation: optimal prediction for a squared loss. (Figure: the regression function $y(x)$ passes through the conditional mean of $p(t \mid x)$; at a point $x_0$, the prediction $y(x_0)$ is the mean of $p(t \mid x_0)$.)
  15. Maximum likelihood and least squares. Given observed inputs $X = \{x_1, \dots, x_N\}$ and targets $t = [t_1, \dots, t_N]^T$, we obtain the likelihood function

        $p(t \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1}).$

     Taking the logarithm, we get

        $\ln p(t \mid w, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1}) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(w),$

     where $E_D(w) = \frac{1}{2} \sum_{n=1}^{N} (t_n - w^T \phi(x_n))^2$ is the sum-of-squares error.
  16. Maximum likelihood and least squares. Computing the gradient and setting it to zero yields

        $\nabla_w \ln p(t \mid w, \beta) = \beta \sum_{n=1}^{N} (t_n - w^T \phi(x_n))\, \phi(x_n)^T = 0.$

     Solving for $w$, we get the Moore-Penrose pseudo-inverse solution

        $w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t,$

     where $\Phi$ is the $N \times M$ design matrix with entries $\Phi_{nj} = \phi_j(x_n)$:

        $\Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}.$

     (A NumPy sketch of this closed-form fit, including the noise precision from slide 18, follows the slide list.)
  17. Maximum likelihood and least squares: bias parameter. Rewriting the error function as

        $E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \Big(t_n - w_0 - \sum_{j=1}^{M-1} w_j \phi_j(x_n)\Big)^2$

     and setting the derivative w.r.t. $w_0$ equal to zero, we obtain

        $w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j$, where $\bar{t} = \frac{1}{N} \sum_{n=1}^{N} t_n$ and $\bar{\phi}_j = \frac{1}{N} \sum_{n=1}^{N} \phi_j(x_n).$
  18. Maximum likelihood and least squares. Since

        $\ln p(t \mid w_{ML}, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(w_{ML}),$

     maximizing the log likelihood function w.r.t. the noise precision parameter $\beta$ gives

        $\frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} (t_n - w_{ML}^T \phi(x_n))^2.$
  19. Geometry of least squares. Consider $y = \Phi w_{ML} = [\varphi_1, \dots, \varphi_M]\, w_{ML}$, with $y \in S \subseteq T$ and $t \in T$, where the subspace $S$ is spanned by the columns $\varphi_1, \dots, \varphi_M$ of $\Phi$. Then $w_{ML}$ minimizes the distance between $t$ and $S$; in other words, $y$ is the orthogonal projection of $t$ onto $S$. (Figure: $t$, its projection $y$, and the spanning vectors $\varphi_1, \varphi_2$. A small numerical check of this orthogonality appears after the slide list.)
  20. Batch learning: batch gradient descent. Consider the gradient descent algorithm, which starts with some initial $w^{(0)}$ and iterates

        $w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_D = w^{(\tau)} + \eta \sum_{n=1}^{N} (t_n - w^{(\tau)T} \phi(x_n))\, \phi(x_n).$

     This is known as the least-mean-squares (LMS) algorithm. (A worked sketch on the housing data follows the slide list.)
  21. Batch gradient descent: example calculation. In the case of ordinary least squares with only the living-area feature, we start from $w_0^{(0)} = 48$, $w_1^{(0)} = 30$ ...
  22. Batch gradient descent: results of the example calculation. ... and obtain the result $w_0 = 71.27$, $w_1 = 0.1345$.
  23. Sequential learning. Data items are considered one at a time (a.k.a. online learning); use stochastic (sequential) gradient descent:

        $w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n = w^{(\tau)} + \eta\, (t_n - w^{(\tau)T} \phi(x_n))\, \phi(x_n).$

     Issue: how to choose $\eta$? (A minimal sketch follows the slide list.)
  24. Underfitting and overfitting. (Figure only.)
  25. Regularization: outlier. (Figure only.)
  26. Regularized least squares. Consider the error function $E_D(w) + \lambda E_W(w)$ (data term + regularization term); $\lambda$ is called the regularization coefficient. With the sum-of-squares error function and a quadratic regularizer, we get

        $\frac{1}{2} \sum_{n=1}^{N} (t_n - w^T \phi(x_n))^2 + \frac{\lambda}{2} w^T w,$

     which is minimized by $w = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t$. (A sketch of this closed form appears after the slide list.)
  27. Regularized least squares. With a more general regularizer, we have

        $\frac{1}{2} \sum_{n=1}^{N} (t_n - w^T \phi(x_n))^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q.$

     (Figure: contours of the regularizer for $q = 0.5$, $q = 1$ (lasso), $q = 2$ (quadratic), and $q = 4$.)
  28. Regularized least squares. The lasso tends to generate sparser solutions than a quadratic regularizer. (Figure: the lasso and quadratic constraint regions in the $(w_1, w_2)$ plane.)
  29. Multiple outputs. Analogously to the single-output case we have

        $p(t \mid x, W, \beta) = \mathcal{N}(t \mid y(W, x), \beta^{-1} I) = \mathcal{N}(t \mid W^T \phi(x), \beta^{-1} I).$

     Given observed inputs $X = \{x_1, \dots, x_N\}$ and targets $T = [t_1, \dots, t_N]^T$, we obtain the log likelihood function

        $\ln p(T \mid X, W, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n \mid W^T \phi(x_n), \beta^{-1} I) = \frac{NK}{2} \ln \frac{\beta}{2\pi} - \frac{\beta}{2} \sum_{n=1}^{N} \| t_n - W^T \phi(x_n) \|^2.$
  30. Multiple outputs. Maximizing w.r.t. $W$, we obtain $W_{ML} = (\Phi^T \Phi)^{-1} \Phi^T T$. If we consider a single target variable $t_k$, we see that $w_k = (\Phi^T \Phi)^{-1} \Phi^T t_k$, where $t_k = [t_{1k}, \dots, t_{Nk}]^T$, which is identical to the single-output case. (See the sketch after the slide list.)
  31. Resources:
      • Stanford Engineering Everywhere CS229 – Machine Learning. http://videolectures.net/stanfordcs229f07_machine_learning/
      • Bishop, C.M. Pattern Recognition and Machine Learning. Springer, 2006. http://research.microsoft.com/en-us/um/people/cmbishop/prml/
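
A minimal NumPy setup of the housing table from slide 2, together with the design matrix of slide 8 (identity basis functions plus a constant column so that w_0 acts as the bias). The variable names are illustrative, not from the slides.

```python
import numpy as np

# Portland housing data from slide 2: living area (ft^2), #bedrooms, price ($1000s).
X = np.array([[2104, 3],
              [1600, 3],
              [2400, 3],
              [1416, 2],
              [3000, 4]], dtype=float)
t = np.array([400, 330, 369, 232, 540], dtype=float)

# Design matrix with phi_0(x) = 1 and phi_d(x) = x_d (slide 8).
Phi = np.hstack([np.ones((X.shape[0], 1)), X])   # shape (N, 3): columns 1, x_1, x_2
print(Phi)
```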
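
A small sketch of the three basis families on slides 9-11 (polynomial, Gaussian, sigmoidal). The input grid, the centres `mus`, and the scale `s` are arbitrary illustrative choices.

```python
import numpy as np

def polynomial_basis(x, degree):
    # phi_j(x) = x^j for j = 0..degree; global: a change in x affects every column.
    return np.vstack([x**j for j in range(degree + 1)]).T

def gaussian_basis(x, mus, s):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)); local, centred at mu_j with width s.
    return np.exp(-(x[:, None] - mus[None, :])**2 / (2 * s**2))

def sigmoidal_basis(x, mus, s):
    # phi_j(x) = sigma((x - mu_j) / s) with the logistic sigmoid.
    a = (x[:, None] - mus[None, :]) / s
    return 1.0 / (1.0 + np.exp(-a))

x = np.linspace(-1, 1, 5)
mus = np.linspace(-1, 1, 3)
print(polynomial_basis(x, 3).shape)        # (5, 4)
print(gaussian_basis(x, mus, 0.5).shape)   # (5, 3)
print(sigmoidal_basis(x, mus, 0.5).shape)  # (5, 3)
```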
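
A sketch of the maximum-likelihood fit on slides 15-18 under the Gaussian noise model of slide 12, run on synthetic data. The "true" weights `w_true`, the noise precision `beta_true`, and the degree-2 polynomial basis are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data following slide 12: t = y(x, w) + eps, eps ~ N(0, beta^{-1}).
N, beta_true = 200, 25.0
x = rng.uniform(-1, 1, size=N)
w_true = np.array([0.5, -1.0, 2.0])                  # illustrative, not from the slides
Phi = np.vstack([np.ones(N), x, x**2]).T             # polynomial basis up to degree 2
t = Phi @ w_true + rng.normal(0, 1 / np.sqrt(beta_true), size=N)

# Maximum-likelihood weights via the Moore-Penrose pseudo-inverse (slide 16):
# w_ML = (Phi^T Phi)^{-1} Phi^T t.
w_ml = np.linalg.pinv(Phi) @ t

# Noise precision (slide 18): 1/beta_ML is the mean squared residual.
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals**2)

# Bias consistency check (slide 17): w_0 = t_bar - sum_j w_j * phi_bar_j.
w0_check = t.mean() - w_ml[1:] @ Phi[:, 1:].mean(axis=0)

print(w_ml, beta_ml, w0_check)
```

In practice `np.linalg.lstsq(Phi, t, rcond=None)` is the more numerically robust way to obtain the same least-squares solution.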
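
A quick numerical check of the least-squares geometry on slide 19: the residual t − Φ w_ML is orthogonal to every column of Φ, so Φ w_ML is the orthogonal projection of t onto the span of the basis columns. The random data is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

N, M = 30, 4
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)

w_ml = np.linalg.pinv(Phi) @ t
residual = t - Phi @ w_ml
# Residual is orthogonal to span(phi_1, ..., phi_M), up to floating-point error.
print(np.allclose(Phi.T @ residual, 0))   # True
```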
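
A sketch of batch gradient descent (slide 20) on the five houses from slide 2. The living area is rescaled to thousands of square feet so that a fixed step size works; because of that rescaling, and because slides 21-22 were computed on the full dataset, the weights here will not reproduce w_0 = 71.27, w_1 = 0.1345. The point is the update rule itself, which is checked against the closed form.

```python
import numpy as np

area = np.array([2.104, 1.600, 2.400, 1.416, 3.000])    # living area in 1000 ft^2
t = np.array([400., 330., 369., 232., 540.])            # price in $1000s

Phi = np.vstack([np.ones_like(area), area]).T            # phi(x) = (1, x)
w = np.zeros(2)                                          # illustrative starting point
eta = 0.05                                               # illustrative step size

for _ in range(20000):
    # w <- w + eta * sum_n (t_n - w^T phi(x_n)) phi(x_n)   (slide 20)
    w = w + eta * Phi.T @ (t - Phi @ w)

print("gradient descent:", w)
print("closed form:     ", np.linalg.pinv(Phi) @ t)
```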
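
A sketch of sequential learning with stochastic gradient descent (slide 23) on synthetic one-dimensional data. The dataset, the fixed eta = 0.05, and the number of epochs are illustrative choices; the slide's closing question of how to choose eta is exactly the practical difficulty and is left open here.

```python
import numpy as np

rng = np.random.default_rng(1)

N = 500
x = rng.uniform(-1, 1, size=N)
t = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=N)   # illustrative linear target

w = np.zeros(2)
eta = 0.05
for epoch in range(20):
    for n in rng.permutation(N):                 # visit examples in random order
        phi_n = np.array([1.0, x[n]])
        # One example at a time: w <- w + eta (t_n - w^T phi(x_n)) phi(x_n).
        w += eta * (t[n] - w @ phi_n) * phi_n

print(w)   # should end up close to (1, 2)
```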
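
A sketch of regularized least squares (slide 26) using the closed form w = (λI + ΦᵀΦ)⁻¹ Φᵀ t. The noisy sine data and the polynomial basis are standard illustrative choices, not from the slides; the comparison just shows how the quadratic penalty shrinks the weights.

```python
import numpy as np

rng = np.random.default_rng(2)

N, M = 10, 8
x = rng.uniform(0, 1, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=N)
Phi = np.vstack([x**j for j in range(M)]).T          # polynomial basis, shape (N, M)

def ridge_fit(Phi, t, lam):
    # Closed-form minimizer of the sum-of-squares error plus (lam/2) w^T w.
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

w_unreg = ridge_fit(Phi, t, 0.0)     # ordinary least squares: weights can blow up
w_reg = ridge_fit(Phi, t, 1e-3)      # quadratic penalty shrinks them towards zero

print(np.round(w_unreg, 2))
print(np.round(w_reg, 2))
```

The lasso of slides 27-28 (q = 1) has no such closed form; it is typically solved with iterative methods such as coordinate descent.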
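
A sketch of the multiple-output solution on slides 29-30, checking that W_ML = (ΦᵀΦ)⁻¹ Φᵀ T fitted on an N × K target matrix agrees column by column with the single-output fit. The random data and the sizes N, M, K are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

N, M, K = 50, 3, 2
Phi = np.hstack([np.ones((N, 1)), rng.normal(size=(N, M - 1))])
W_true = rng.normal(size=(M, K))                     # illustrative ground truth
T = Phi @ W_true + 0.05 * rng.normal(size=(N, K))    # N x K target matrix

W_ml = np.linalg.pinv(Phi) @ T                       # shape (M, K)

# Column k of W_ml equals the single-output fit against column k of T (slide 30).
w_k = np.linalg.pinv(Phi) @ T[:, 0]
print(np.allclose(W_ml[:, 0], w_k))                  # True
```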
