
Introduction


Statistical Machine Learning lecture slides from the Australian National University


  1. Introduction to Statistical Machine Learning. © 2011 Christfried Webers, Statistical Machine Learning Group, NICTA and College of Engineering and Computer Science, The Australian National University, Canberra. February – June 2011. (Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning".)
  2. Part II: Polynomial Curve Fitting; Probability Theory; Probability Densities; Expectations and Covariances.
  3. Polynomial Curve Fitting: some artificial data, created from the function sin(2πx) plus random noise, for x = 0, …, 1. [Figure: the data points (x, t) scattered around one period of sin(2πx), with t between −1 and 1.]
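Data like this can be generated in a few lines; the noise scale 0.3 and the random seed are assumptions, since the slide says only "random noise":

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
x = np.linspace(0.0, 1.0, N)              # N inputs evenly spaced in [0, 1]
noise = rng.normal(scale=0.3, size=N)     # Gaussian noise (scale 0.3 is an assumed value)
t = np.sin(2 * np.pi * x) + noise         # noisy targets around sin(2*pi*x)
```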
  4. Polynomial Curve Fitting – Input Specification: N = 10, with inputs $\mathbf{x} \equiv (x_1, \dots, x_N)^T$ and targets $\mathbf{t} \equiv (t_1, \dots, t_N)^T$.
  5. Polynomial Curve Fitting – Input Specification: N = 10, $\mathbf{x} \equiv (x_1, \dots, x_N)^T$, $\mathbf{t} \equiv (t_1, \dots, t_N)^T$, where $x_i \in \mathbb{R}$ and $t_i \in \mathbb{R}$ for $i = 1, \dots, N$.
  6. Polynomial Curve Fitting – Model Specification: with $M$ the order of the polynomial,
     $$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{m=0}^{M} w_m x^m,$$
     a nonlinear function of $x$, but a linear function of the unknown model parameters $\mathbf{w}$. How can we find good parameters $\mathbf{w} = (w_0, \dots, w_M)^T$?
  7. Learning is Improving Performance: [Figure: target $t_n$ versus model prediction $y(x_n, \mathbf{w})$ at an input $x_n$.] Performance measure: the error between the targets and the model's predictions on the training data,
     $$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big( y(x_n, \mathbf{w}) - t_n \big)^2.$$
     $E(\mathbf{w})$ is quadratic in $\mathbf{w}$ and has a unique minimum.
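Because $y(x, \mathbf{w})$ is linear in $\mathbf{w}$, minimising $E(\mathbf{w})$ is a linear least-squares problem. A minimal sketch in NumPy (the function names `fit_polynomial` and `predict` are illustrative, not from the course):

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Return the w that minimises E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    Phi = np.vander(x, M + 1, increasing=True)   # design matrix: columns 1, x, ..., x^M
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # least-squares solution
    return w

def predict(x, w):
    """Evaluate y(x, w) = sum_m w_m x^m."""
    return np.vander(x, len(w), increasing=True) @ w
```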
  8. Model Comparison or Model Selection: $y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m$. For $M = 0$: $y(x, \mathbf{w}) = w_0$. [Figure: the constant fit to the data.]
  9. Model Comparison or Model Selection: for $M = 1$: $y(x, \mathbf{w}) = w_0 + w_1 x$. [Figure: the straight-line fit.]
  10. Model Comparison or Model Selection: for $M = 3$: $y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + w_3 x^3$. [Figure: the cubic fit.]
  11. Model Comparison or Model Selection: for $M = 9$: $y(x, \mathbf{w}) = w_0 + w_1 x + \dots + w_8 x^8 + w_9 x^9$; this model overfits. [Figure: the ninth-order fit passes through every data point but oscillates wildly.]
  12. Testing the Model: train the model to obtain $\mathbf{w}^\star$, then get 100 new data points and evaluate the root-mean-square (RMS) error
      $$E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^\star) / N}.$$
      [Figure: training and test $E_{\mathrm{RMS}}$ plotted against $M$ for $M = 0, \dots, 9$.]
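The RMS comparison on this slide can be reproduced with a small helper; a sketch assuming the same polynomial model as above (not code from the course):

```python
import numpy as np

def rms_error(w, x, t):
    """E_RMS = sqrt(2 * E(w) / N), comparable across datasets of different size N."""
    y = np.vander(x, len(w), increasing=True) @ w  # model predictions y(x_n, w)
    E = 0.5 * np.sum((y - t) ** 2)                 # sum-of-squares error E(w)
    return np.sqrt(2.0 * E / len(x))
```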
  13. Testing the Model:

                 M = 0     M = 1      M = 3            M = 9
      w0          0.19      0.82       0.31             0.35
      w1                   -1.27       7.99           232.37
      w2                             -25.43         -5321.83
      w3                              17.37         48568.31
      w4                                          -231639.30
      w5                                           640042.26
      w6                                         -1061800.52
      w7                                          1042400.18
      w8                                          -557682.99
      w9                                           125201.43

      Table: Coefficients w for polynomials of various order.
  14. More Data: N = 15. [Figure: fit to N = 15 data points.]
  15. More Data: N = 100. Heuristic: use no fewer than 5 to 10 times as many data points as parameters. But the number of parameters is not necessarily the most appropriate measure of model complexity! Later: the Bayesian approach. [Figure: fit to N = 100 data points.]
  16. Regularisation: How can we constrain the growth of the coefficients $\mathbf{w}$? Add a regularisation term to the error function:
      $$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big( y(x_n, \mathbf{w}) - t_n \big)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2,$$
      where the squared norm of the parameter vector $\mathbf{w}$ is $\|\mathbf{w}\|^2 \equiv \mathbf{w}^T \mathbf{w} = w_0^2 + w_1^2 + \dots + w_M^2$.
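The regularised error is still quadratic in $\mathbf{w}$, so its minimiser has the closed form $(\Phi^T\Phi + \lambda I)\mathbf{w} = \Phi^T\mathbf{t}$, as in ridge regression; a sketch (the function name is illustrative):

```python
import numpy as np

def fit_regularised(x, t, M, lam):
    """Minimise 1/2 sum_n (y(x_n, w) - t_n)^2 + lam/2 * ||w||^2 in closed form."""
    Phi = np.vander(x, M + 1, increasing=True)  # design matrix: columns 1, x, ..., x^M
    A = Phi.T @ Phi + lam * np.eye(M + 1)       # normal equations plus ridge term
    return np.linalg.solve(A, Phi.T @ t)

x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x)
w_small = fit_regularised(x, t, 9, 1e-10)       # weak regularisation
w_large = fit_regularised(x, t, 9, 1.0)         # strong regularisation shrinks w
```

Increasing λ monotonically shrinks the norm of the solution, which is exactly the effect the next slides illustrate with ln λ = −18 versus ln λ = 0.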
  17. Regularisation: M = 9, ln λ = −18. [Figure: the regularised fit follows the data without wild oscillations.]
  18. Regularisation: M = 9, ln λ = 0. [Figure: the heavily regularised fit is too smooth.]
  19. Regularisation: M = 9. [Figure: training and test $E_{\mathrm{RMS}}$ plotted against ln λ, from −35 to −20.]
  20. Probability Theory: [Figure: histogram of a joint distribution p(X, Y) over the two rows Y = 1 and Y = 2 and the values of X.]
  21. Probability Theory:

      Y \ X    a   b   c   d   e   f   g   h   i   sum
        2      0   0   0   1   4   5   8   6   2    26
        1      3   6   8   8   5   3   1   0   0    34
      sum      3   6   8   9   9   8   9   6   2    60

      [Figure: histogram of the joint distribution p(X, Y).]
  22. Sum Rule: from the table, $p(X = d, Y = 1) = 8/60$, and
      $$p(X = d) = p(X = d, Y = 2) + p(X = d, Y = 1) = 1/60 + 8/60 = 9/60.$$
      In general, $p(X = d) = \sum_Y p(X = d, Y)$, and so
      $$p(X) = \sum_Y p(X, Y).$$
  23. Sum Rule:
      $$p(X) = \sum_Y p(X, Y), \qquad p(Y) = \sum_X p(X, Y).$$
      [Figure: the marginal histograms p(X) and p(Y).]
  24. Product Rule: conditional probability, from the table: $p(X = d \mid Y = 1) = 8/34$. Calculate $p(Y = 1)$: $p(Y = 1) = \sum_X p(X, Y = 1) = 34/60$. Then
      $$p(X = d, Y = 1) = p(X = d \mid Y = 1)\, p(Y = 1).$$
      In general,
      $$p(X, Y) = p(X \mid Y)\, p(Y).$$
  25. Product Rule: $p(X, Y) = p(X \mid Y)\, p(Y)$. [Figure: the conditional histogram p(X | Y = 1).]
  26. Sum Rule and Product Rule:
      Sum rule: $p(X) = \sum_Y p(X, Y)$.
      Product rule: $p(X, Y) = p(X \mid Y)\, p(Y)$.
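Both rules can be checked numerically against the count table from the earlier slides (rows Y = 1 and Y = 2, columns a to i):

```python
import numpy as np

# Joint counts n(X, Y) from the slides' table: 60 observations in total
counts = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
                   [0, 0, 0, 1, 4, 5, 8, 6, 2]])  # Y = 2
joint = counts / counts.sum()                     # p(X, Y)

p_X = joint.sum(axis=0)                           # sum rule: p(X) = sum_Y p(X, Y)
p_Y = joint.sum(axis=1)                           # p(Y) = sum_X p(X, Y)
p_X_given_Y = joint / p_Y[:, None]                # conditionals p(X | Y)

# product rule: p(X, Y) = p(X | Y) p(Y), recovered exactly
assert np.allclose(p_X_given_Y * p_Y[:, None], joint)
```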
  27. Why not use Fractions? Why not use pairs of numbers $(s, t)$ such that $p(X, Y) = s/t$ (e.g. $s = 8$, $t = 60$)? For the same reason we do not use pairs of numbers $(a, c)$ instead of $\sin(\alpha) = a/c$. [Figure: a right triangle with hypotenuse c, side a opposite the angle α, and side b adjacent to it.]
  28. Bayes' Theorem: use the product rule in both orders,
      $$p(X, Y) = p(X \mid Y)\, p(Y) = p(Y \mid X)\, p(X),$$
      to obtain Bayes' theorem:
      $$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)},$$
      with denominator $p(X) = \sum_Y p(X, Y)$ (sum rule) $= \sum_Y p(X \mid Y)\, p(Y)$ (product rule).
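Bayes' theorem can be verified on the same count table by reversing the conditioning; for example $p(Y = 1 \mid X = d) = (8/34)(34/60)/(9/60) = 8/9$:

```python
import numpy as np

counts = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
                   [0, 0, 0, 1, 4, 5, 8, 6, 2]])  # Y = 2
joint = counts / counts.sum()
p_X = joint.sum(axis=0)                           # denominator, via the sum rule
p_Y = joint.sum(axis=1)
p_X_given_Y = joint / p_Y[:, None]

# Bayes' theorem: p(Y | X) = p(X | Y) p(Y) / p(X)
p_Y_given_X = p_X_given_Y * p_Y[:, None] / p_X[None, :]
```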
  29. Probability Densities: for a real-valued variable $x \in \mathbb{R}$, the probability of $x$ falling in the interval $(x, x + \delta x)$ is given by $p(x)\,\delta x$ for infinitesimally small $\delta x$, and
      $$p(x \in (a, b)) = \int_a^b p(x)\, dx.$$
      [Figure: a density p(x), an interval of width δx, and the cumulative distribution P(x).]
  30. Constraints on p(x): nonnegativity, $p(x) \ge 0$, and normalisation,
      $$\int_{-\infty}^{\infty} p(x)\, dx = 1.$$
  31. Cumulative distribution function P(x):
      $$P(x) = \int_{-\infty}^{x} p(z)\, dz, \qquad \text{or equivalently} \qquad \frac{d}{dx} P(x) = p(x).$$
  32. Multivariate Probability Density: for a vector $\mathbf{x} \equiv (x_1, \dots, x_D)^T$: nonnegativity, $p(\mathbf{x}) \ge 0$, and normalisation,
      $$\int_{-\infty}^{\infty} p(\mathbf{x})\, d\mathbf{x} = 1,$$
      which means
      $$\int_{-\infty}^{\infty} \dots \int_{-\infty}^{\infty} p(\mathbf{x})\, dx_1 \dots dx_D = 1.$$
  33. Sum and Product Rule for Probability Densities:
      Sum rule: $p(x) = \int_{-\infty}^{\infty} p(x, y)\, dy$.
      Product rule: $p(x, y) = p(y \mid x)\, p(x)$.
  34. Expectations: the weighted average of a function $f(x)$ under the probability distribution $p(x)$:
      $$\mathbb{E}[f] = \sum_x p(x)\, f(x) \quad \text{(discrete distribution)}, \qquad \mathbb{E}[f] = \int p(x)\, f(x)\, dx \quad \text{(probability density)}.$$
  35. How to approximate E[f]: given a finite number $N$ of points $x_n$ drawn from the probability distribution $p(x)$, approximate the expectation by a finite sum:
      $$\mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n).$$
      How do we draw points from a probability distribution $p(x)$? A lecture on "Sampling" is coming.
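A quick numerical check of this approximation; taking p(x) to be a standard normal and f(x) = x², for which E[f] = 1 exactly, is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
x = rng.normal(size=N)        # N samples drawn from p(x) = N(0, 1)
estimate = np.mean(x ** 2)    # (1/N) sum_n f(x_n) with f(x) = x^2; true value is 1
```

The Monte Carlo error shrinks like $1/\sqrt{N}$, so with 100,000 samples the estimate lands close to 1.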
  36. Expectation of a function of several variables: for an arbitrary function $f(x, y)$,
      $$\mathbb{E}_x[f(x, y)] = \sum_x p(x)\, f(x, y) \quad \text{(discrete)}, \qquad \mathbb{E}_x[f(x, y)] = \int p(x)\, f(x, y)\, dx \quad \text{(density)}.$$
      Note that $\mathbb{E}_x[f(x, y)]$ is a function of $y$.
  37. Conditional Expectation: for an arbitrary function $f(x)$,
      $$\mathbb{E}_x[f \mid y] = \sum_x p(x \mid y)\, f(x) \quad \text{(discrete)}, \qquad \mathbb{E}_x[f \mid y] = \int p(x \mid y)\, f(x)\, dx \quad \text{(density)}.$$
      Note that $\mathbb{E}_x[f \mid y]$ is a function of $y$. Other notation used in the literature: $\mathbb{E}_{x|y}[f]$.
      What is $\mathbb{E}[\mathbb{E}[f(x) \mid y]]$? Can we simplify it? It must mean $\mathbb{E}_y[\mathbb{E}_x[f(x) \mid y]]$, because the inner conditional expectation is a function of $y$ alone, so the outer expectation can only be over $y$. Then
      $$\mathbb{E}_y[\mathbb{E}_x[f(x) \mid y]] = \sum_y p(y)\, \mathbb{E}_x[f \mid y] = \sum_y p(y) \sum_x p(x \mid y)\, f(x) = \sum_{x,y} f(x)\, p(x, y) = \sum_x f(x)\, p(x) = \mathbb{E}_x[f(x)].$$
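The identity $\mathbb{E}_y[\mathbb{E}_x[f(x) \mid y]] = \mathbb{E}_x[f(x)]$ derived above can be confirmed numerically on the count table from the earlier slides, with an arbitrary choice of f:

```python
import numpy as np

counts = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
                   [0, 0, 0, 1, 4, 5, 8, 6, 2]])  # Y = 2
joint = counts / counts.sum()                     # p(x, y), with x indexed 0..8
p_y = joint.sum(axis=1)
p_x = joint.sum(axis=0)
f = np.arange(9.0) ** 2                           # an arbitrary function f(x)

E_f_given_y = (joint / p_y[:, None]) @ f          # E_x[f | y], one value per y
lhs = p_y @ E_f_given_y                           # E_y[ E_x[f | y] ]
rhs = p_x @ f                                     # E_x[f]
```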
  38. Variance: for an arbitrary function $f(x)$,
      $$\operatorname{var}[f] = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big] = \mathbb{E}\big[f(x)^2\big] - \mathbb{E}[f(x)]^2.$$
      Special case $f(x) = x$:
      $$\operatorname{var}[x] = \mathbb{E}\big[(x - \mathbb{E}[x])^2\big] = \mathbb{E}\big[x^2\big] - \mathbb{E}[x]^2.$$
  39. Covariance: for two random variables $x \in \mathbb{R}$ and $y \in \mathbb{R}$,
      $$\operatorname{cov}[x, y] = \mathbb{E}_{x,y}\big[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\big] = \mathbb{E}_{x,y}[x y] - \mathbb{E}[x]\, \mathbb{E}[y].$$
      With $\mathbb{E}[x] = a$ and $\mathbb{E}[y] = b$:
      $$\begin{aligned}
      \operatorname{cov}[x, y] &= \mathbb{E}_{x,y}[(x - a)(y - b)] \\
      &= \mathbb{E}_{x,y}[x y] - \mathbb{E}_{x,y}[x b] - \mathbb{E}_{x,y}[a y] + \mathbb{E}_{x,y}[a b] \\
      &= \mathbb{E}_{x,y}[x y] - b\, \underbrace{\mathbb{E}_{x,y}[x]}_{= \mathbb{E}_x[x]} - a\, \underbrace{\mathbb{E}_{x,y}[y]}_{= \mathbb{E}_y[y]} + a b\, \underbrace{\mathbb{E}_{x,y}[1]}_{= 1} \\
      &= \mathbb{E}_{x,y}[x y] - a b - a b + a b = \mathbb{E}_{x,y}[x y] - a b = \mathbb{E}_{x,y}[x y] - \mathbb{E}[x]\, \mathbb{E}[y].
      \end{aligned}$$
      The covariance expresses how strongly $x$ and $y$ vary together. If $x$ and $y$ are independent, their covariance vanishes.
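Both expressions for the covariance, and the vanishing covariance of independent variables, can be checked on samples; the relation y = 2x + noise is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)                     # y varies together with x
z = rng.normal(size=n)                               # independent of x

cov_def = np.mean((x - x.mean()) * (y - y.mean()))   # E[(x - E[x])(y - E[y])]
cov_alt = np.mean(x * y) - x.mean() * y.mean()       # E[xy] - E[x] E[y]
cov_indep = np.mean(x * z) - x.mean() * z.mean()     # near zero for independent x, z
```

The two sample estimates agree to floating-point precision, since the identity holds termwise; the independent pair's covariance is zero up to sampling noise.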
  40. Covariance for Vector-Valued Variables: for two random variables $\mathbf{x} \in \mathbb{R}^D$ and $\mathbf{y} \in \mathbb{R}^D$,
      $$\operatorname{cov}[\mathbf{x}, \mathbf{y}] = \mathbb{E}_{\mathbf{x},\mathbf{y}}\big[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{y}^T - \mathbb{E}[\mathbf{y}^T])\big] = \mathbb{E}_{\mathbf{x},\mathbf{y}}\big[\mathbf{x}\, \mathbf{y}^T\big] - \mathbb{E}[\mathbf{x}]\, \mathbb{E}[\mathbf{y}^T].$$
