Introduction
Statistical Machine Learning, The Australian National University (ANU)
1. Introduction to Statistical Machine Learning
   Christfried Webers
   Statistical Machine Learning Group, NICTA, and
   College of Engineering and Computer Science, The Australian National University
   Canberra, February – June 2011
   © 2011 Christfried Webers, NICTA, The Australian National University
   (Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
2. Part II
   Polynomial Curve Fitting
   Probability Theory
   Probability Densities
   Expectations and Covariances
3. Polynomial Curve Fitting
   Some artificial data created from the function sin(2πx) plus random noise, for x ∈ [0, 1].
   [Figure: target t plotted against x for the noisy training points]
4. Polynomial Curve Fitting – Input Specification
   N = 10
   x ≡ (x_1, ..., x_N)^T
   t ≡ (t_1, ..., t_N)^T
5. Polynomial Curve Fitting – Input Specification
   N = 10
   x ≡ (x_1, ..., x_N)^T,  x_i ∈ ℝ,  i = 1, ..., N
   t ≡ (t_1, ..., t_N)^T,  t_i ∈ ℝ,  i = 1, ..., N
6. Polynomial Curve Fitting – Model Specification
   M : order of the polynomial
   y(x, w) = w_0 + w_1 x + w_2 x² + ... + w_M x^M = Σ_{m=0}^{M} w_m x^m
   A nonlinear function of x, but a linear function of the unknown model parameters w.
   How can we find good parameters w = (w_0, ..., w_M)^T ?
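The polynomial model above is simple enough to state directly in code. A minimal sketch (the function name `polynomial` is an illustrative choice, not from the slides):

```python
def polynomial(x, w):
    """Evaluate y(x, w) = sum_m w_m x^m for coefficients w = (w_0, ..., w_M)."""
    return sum(w_m * x**m for m, w_m in enumerate(w))

# Nonlinear in x, but linear in w: the output is a weighted sum of fixed powers of x.
w = [1.0, 2.0, 3.0]           # M = 2: y = 1 + 2x + 3x^2
print(polynomial(2.0, w))     # 1 + 4 + 12 = 17.0
```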
7. Learning is Improving Performance
   [Figure: target t_n and model prediction y(x_n, w) at a training point x_n]
   Performance measure: the error between the targets and the predictions of the model on the training data,
   E(w) = (1/2) Σ_{n=1}^{N} (y(x_n, w) − t_n)²
   E(w) has a unique minimum in the argument w.
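Because E(w) is quadratic in w, its unique minimiser can be found by linear least squares. A minimal NumPy sketch (the helper name `fit_polynomial` and the use of `np.vander`/`np.linalg.lstsq` are implementation choices, not from the slides):

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Minimise E(w) = 1/2 sum_n (y(x_n, w) - t_n)^2 for a degree-M polynomial.
    Returns w = (w_0, ..., w_M), the unique minimiser."""
    # Design matrix Phi with Phi[n, m] = x_n^m
    Phi = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

# Sanity check: data generated exactly by a line is recovered exactly.
x = np.array([0.0, 0.5, 1.0])
t = 1.0 + 2.0 * x
w = fit_polynomial(x, t, M=1)
print(np.round(w, 6))  # approximately [1, 2]
```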
8. Model Comparison or Model Selection
   y(x, w) = Σ_{m=0}^{M} w_m x^m
   M = 0:  y(x, w) = w_0
   [Figure: constant fit (M = 0) to the training data]
9. Model Comparison or Model Selection
   y(x, w) = Σ_{m=0}^{M} w_m x^m
   M = 1:  y(x, w) = w_0 + w_1 x
   [Figure: linear fit (M = 1) to the training data]
10. Model Comparison or Model Selection
   y(x, w) = Σ_{m=0}^{M} w_m x^m
   M = 3:  y(x, w) = w_0 + w_1 x + w_2 x² + w_3 x³
   [Figure: cubic fit (M = 3) to the training data]
11. Model Comparison or Model Selection
   y(x, w) = Σ_{m=0}^{M} w_m x^m
   M = 9:  y(x, w) = w_0 + w_1 x + ... + w_8 x⁸ + w_9 x⁹
   Overfitting: the curve matches the training points but oscillates wildly between them.
   [Figure: degree-9 fit (M = 9) to the training data]
12. Testing the Model
   Train the model to obtain w*.
   Get 100 new data points.
   Root-mean-square (RMS) error:  E_RMS = √(2 E(w*) / N)
   [Figure: training and test E_RMS against the model order M = 0, ..., 9]
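The RMS error is a direct transcription of the formula above; dividing by N and taking the square root puts the error on the same scale as the targets t, so training and test sets of different sizes are comparable. A minimal sketch (the helper name `rms_error` is an illustrative choice):

```python
import numpy as np

def rms_error(y_pred, t):
    """E_RMS = sqrt(2 E(w*) / N), with E(w*) = 1/2 sum_n (y_n - t_n)^2."""
    N = len(t)
    E = 0.5 * np.sum((y_pred - t) ** 2)
    return np.sqrt(2.0 * E / N)

# One residual of 1 and one of 0 over N = 2 points: E = 1/2, E_RMS = sqrt(1/2).
print(rms_error(np.array([1.0, 2.0]), np.array([0.0, 2.0])))  # ~0.707
```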
13. Testing the Model

         M = 0    M = 1     M = 3          M = 9
   w_0    0.19     0.82      0.31           0.35
   w_1            -1.27      7.99         232.37
   w_2                     -25.43       -5321.83
   w_3                      17.37       48568.31
   w_4                                -231639.30
   w_5                                 640042.26
   w_6                               -1061800.52
   w_7                                1042400.18
   w_8                                -557682.99
   w_9                                 125201.43

   Table: Coefficients w for polynomials of various order.
14. More Data
   N = 15
   [Figure: degree-9 fit to N = 15 data points]
15. More Data
   N = 100
   Heuristic: have no fewer than 5 to 10 times as many data points as parameters.
   But the number of parameters is not necessarily the most appropriate measure of model complexity!
   Later: the Bayesian approach.
   [Figure: degree-9 fit to N = 100 data points]
16. Regularisation
   How do we constrain the growth of the coefficients w?
   Add a regularisation term to the error function:
   E(w) = (1/2) Σ_{n=1}^{N} (y(x_n, w) − t_n)² + (λ/2) ‖w‖²
   where the squared norm of the parameter vector w is
   ‖w‖² ≡ w^T w = w_0² + w_1² + ... + w_M²
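The regularised error is still quadratic in w, so the minimiser again has a closed form, w = (Φ^T Φ + λI)^{-1} Φ^T t. A minimal sketch, assuming (as the slide's formula does) that all coefficients including w_0 are penalised; the helper name is an illustrative choice:

```python
import numpy as np

def fit_polynomial_regularised(x, t, M, lam):
    """Minimise E(w) = 1/2 sum_n (y(x_n, w) - t_n)^2 + (lam/2) ||w||^2.
    Closed form via the regularised normal equations."""
    Phi = np.vander(x, M + 1, increasing=True)   # Phi[n, m] = x_n^m
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

# Larger lambda shrinks the coefficients towards zero, taming the M = 9 fit.
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x)
w_small = fit_polynomial_regularised(x, t, M=9, lam=1e-10)
w_big = fit_polynomial_regularised(x, t, M=9, lam=1.0)
print(np.linalg.norm(w_big) < np.linalg.norm(w_small))  # True
```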
17. Regularisation
   M = 9
   [Figure: degree-9 fit with ln λ = −18]
18. Regularisation
   M = 9
   [Figure: degree-9 fit with ln λ = 0]
19. Regularisation
   M = 9
   [Figure: training and test E_RMS against ln λ, for ln λ from −35 to −20]
20. Probability Theory
   [Figure: histogram of the joint distribution p(X, Y) over the values of X, for Y = 1 and Y = 2]
21. Probability Theory

   Y \ X    a   b   c   d   e   f   g   h   i   sum
   2        0   0   0   1   4   5   8   6   2    26
   1        3   6   8   8   5   3   1   0   0    34
   sum      3   6   8   9   9   8   9   6   2    60

   [Figure: the same counts shown as histograms of p(X, Y) for Y = 1 and Y = 2]
22. Sum Rule
   (Counts as in the table above; 60 observations in total.)
   p(X = d, Y = 1) = 8/60
   p(X = d) = p(X = d, Y = 2) + p(X = d, Y = 1) = 1/60 + 8/60 = 9/60
   In general:
   p(X = d) = Σ_Y p(X = d, Y)
   p(X) = Σ_Y p(X, Y)
23. Sum Rule
   p(X) = Σ_Y p(X, Y)
   p(Y) = Σ_X p(X, Y)
   [Figure: the marginal distributions p(X) and p(Y) obtained from the table]
24. Product Rule
   Conditional probability:  p(X = d | Y = 1) = 8/34
   Calculate p(Y = 1):  p(Y = 1) = Σ_X p(X, Y = 1) = 34/60
   Then  p(X = d, Y = 1) = p(X = d | Y = 1) p(Y = 1) = (8/34)(34/60) = 8/60
   In general:  p(X, Y) = p(X | Y) p(Y)
25. Product Rule
   p(X, Y) = p(X | Y) p(Y)
   [Figure: the conditional distribution p(X | Y = 1) over X]
26. Sum Rule and Product Rule
   Sum Rule:      p(X) = Σ_Y p(X, Y)
   Product Rule:  p(X, Y) = p(X | Y) p(Y)
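Both rules can be checked directly on the count table from the earlier slides. A minimal sketch (the array layout, with Y = 1 as the first row, is a choice made here, not from the slides):

```python
import numpy as np

# Joint counts from the table: rows are Y in {1, 2}, columns are X in a..i.
counts = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
])
p_xy = counts / counts.sum()       # joint p(X, Y); total count is 60

# Sum rule: marginals by summing out the other variable.
p_x = p_xy.sum(axis=0)             # p(X) = sum_Y p(X, Y)
p_y = p_xy.sum(axis=1)             # p(Y) = sum_X p(X, Y)

# Product rule: p(X, Y) = p(X | Y) p(Y).
p_x_given_y1 = p_xy[0] / p_y[0]    # p(X | Y = 1)

d = 3                              # column index of X = d
print(p_xy[0, d])                  # p(X = d, Y = 1) = 8/60
print(p_x_given_y1[d] * p_y[0])    # same value via the product rule
```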
27. Why not use fractions?
   Why not use pairs of numbers (s, t) such that p(X, Y) = s/t (e.g. s = 8, t = 60)?
   For the same reason we do not use pairs of numbers (a, c) instead of sin(α) = a/c:
   the ratio is the quantity the rules operate on, not any particular pair representing it.
   [Figure: right triangle with hypotenuse c, opposite side a, adjacent side b, and angle α]
28. Bayes' Theorem
   Use the product rule in both orders:
   p(X, Y) = p(X | Y) p(Y) = p(Y | X) p(X)
   Bayes' Theorem:
   p(Y | X) = p(X | Y) p(Y) / p(X)
   with the denominator given by
   p(X) = Σ_Y p(X, Y)              (sum rule)
        = Σ_Y p(X | Y) p(Y)        (product rule)
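Bayes' theorem can likewise be exercised on the count table: starting from p(X | Y) and p(Y), recover p(Y | X = d). A minimal sketch (row layout and the index `d` are choices made here):

```python
import numpy as np

# Count table: rows are Y in {1, 2}, columns are X in a..i.
counts = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
])
p_y = counts.sum(axis=1) / counts.sum()               # p(Y) = (34/60, 26/60)
p_x_given_y = counts / counts.sum(axis=1, keepdims=True)

d = 3                                                 # column of X = d
# Denominator via sum + product rule: p(X = d) = sum_Y p(X = d | Y) p(Y).
p_x_d = np.sum(p_x_given_y[:, d] * p_y)
p_y_given_x_d = p_x_given_y[:, d] * p_y / p_x_d
print(p_y_given_x_d)   # (8/9, 1/9): column d holds 8 counts for Y = 1, 1 for Y = 2
```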
29. Probability Densities
   Real-valued variable x ∈ ℝ.
   The probability of x falling in the interval (x, x + δx) is given by p(x) δx for infinitesimally small δx.
   p(x ∈ (a, b)) = ∫_a^b p(x) dx
   [Figure: density p(x), probability mass p(x) δx, and cumulative distribution P(x)]
30. Constraints on p(x)
   Nonnegative:    p(x) ≥ 0
   Normalisation:  ∫_{−∞}^{∞} p(x) dx = 1
31. Cumulative distribution function P(x)
   P(x) = ∫_{−∞}^{x} p(z) dz
   or equivalently
   (d/dx) P(x) = p(x)
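The normalisation constraint and the relation dP/dx = p(x) can be verified numerically for a concrete density. An illustrative sketch, using a standard Gaussian as the example density (the choice of density and the trapezoid-rule integration are assumptions made here, not from the slides):

```python
import numpy as np

# Example density: standard Gaussian p(x) on a fine grid covering [-8, 8].
x = np.linspace(-8.0, 8.0, 100_001)
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

dx = np.diff(x)
# Normalisation: trapezoid-rule integral of p over the grid, ~1.
total = np.sum((p[1:] + p[:-1]) / 2 * dx)
# Cumulative distribution P(x) as a running trapezoid-rule integral.
P = np.concatenate([[0.0], np.cumsum((p[1:] + p[:-1]) / 2 * dx)])
# Numerical derivative of P should recover p.
dP = np.gradient(P, x)
print(round(total, 6))   # 1.0
```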
32. Multivariate Probability Density
   Vector x ≡ (x_1, ..., x_D)^T
   Nonnegative:    p(x) ≥ 0
   Normalisation:  ∫ p(x) dx = 1
   This means
   ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} p(x) dx_1 ... dx_D = 1
33. Sum and Product Rule for Probability Densities
   Sum Rule:      p(x) = ∫_{−∞}^{∞} p(x, y) dy
   Product Rule:  p(x, y) = p(y | x) p(x)
34. Expectations
   The weighted average of a function f(x) under the probability distribution p(x):
   E[f] = Σ_x p(x) f(x)          (discrete distribution p(x))
   E[f] = ∫ p(x) f(x) dx         (probability density p(x))
35. How to approximate E[f]
   Given a finite number N of points x_n drawn from the probability distribution p(x),
   approximate the expectation by a finite sum:
   E[f] ≈ (1/N) Σ_{n=1}^{N} f(x_n)
   How do we draw points from a probability distribution p(x)? A lecture on "Sampling" is coming.
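The finite-sum approximation above can be demonstrated with a distribution that is easy to sample from. An illustrative sketch, assuming a uniform p(x) on [0, 1] and f(x) = x², for which E[f] = 1/3 exactly (these choices are examples, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# E[f] ~ (1/N) sum_n f(x_n) for x_n drawn from p(x).
N = 100_000
x = rng.uniform(0.0, 1.0, size=N)   # samples from p(x) = uniform on [0, 1]
estimate = np.mean(x ** 2)          # sample average of f(x) = x^2
print(estimate)                     # close to the exact value 1/3 for large N
```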
36. Expectation of a function of several variables
   For an arbitrary function f(x, y):
   E_x[f(x, y)] = Σ_x p(x) f(x, y)          (discrete distribution p(x))
   E_x[f(x, y)] = ∫ p(x) f(x, y) dx         (probability density p(x))
   Note that E_x[f(x, y)] is a function of y.
37. Conditional Expectation
   For an arbitrary function f(x):
   E_x[f | y] = Σ_x p(x | y) f(x)           (discrete distribution p(x))
   E_x[f | y] = ∫ p(x | y) f(x) dx          (probability density p(x))
   Note that E_x[f | y] is a function of y.
   Other notation used in the literature: E_{x|y}[f].
   What is E[E[f(x) | y]]? Can we simplify it?
   This must mean E_y[E_x[f(x) | y]]. (Why? Because E_x[f | y] is a function of y, so the outer expectation is over y.)
   E_y[E_x[f(x) | y]] = Σ_y p(y) E_x[f | y]
                      = Σ_y p(y) Σ_x p(x | y) f(x)
                      = Σ_{x,y} f(x) p(x, y)
                      = Σ_x f(x) p(x)
                      = E_x[f(x)]
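The identity E_y[E_x[f(x) | y]] = E_x[f(x)] derived above can be verified exactly on the discrete count table from the earlier slides. A minimal sketch (the particular f, mapping a..i to 0..8, is an arbitrary choice made here):

```python
import numpy as np

# Count table: rows are Y in {1, 2}, columns are X in a..i.
counts = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
])
p_xy = counts / counts.sum()
p_y = p_xy.sum(axis=1)
f = np.arange(9, dtype=float)      # arbitrary f(x): map a..i to 0..8

# Inner expectation: E_x[f | y] = sum_x p(x | y) f(x), one value per y.
e_f_given_y = (p_xy / p_y[:, None] * f).sum(axis=1)
lhs = np.sum(p_y * e_f_given_y)            # outer expectation over y
rhs = np.sum(p_xy.sum(axis=0) * f)         # direct E_x[f] from the marginal
print(np.isclose(lhs, rhs))                # True
```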
38. Variance
   For an arbitrary function f(x):
   var[f] = E[(f(x) − E[f(x)])²] = E[f(x)²] − E[f(x)]²
   Special case f(x) = x:
   var[x] = E[(x − E[x])²] = E[x²] − E[x]²
39. Covariance
   Two random variables x ∈ ℝ and y ∈ ℝ:
   cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])] = E_{x,y}[x y] − E[x] E[y]
   With E[x] = a and E[y] = b:
   cov[x, y] = E_{x,y}[(x − a)(y − b)]
             = E_{x,y}[x y] − E_{x,y}[x b] − E_{x,y}[a y] + E_{x,y}[a b]
             = E_{x,y}[x y] − b E_x[x] − a E_y[y] + a b E_{x,y}[1]
             = E_{x,y}[x y] − a b − a b + a b
             = E_{x,y}[x y] − a b
             = E_{x,y}[x y] − E[x] E[y]
   The covariance expresses how strongly x and y vary together. If x and y are independent, their covariance vanishes.
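Both properties above, covariance vanishing under independence and growing when the variables move together, can be checked on samples. An illustrative sketch using the E[xy] − E[x]E[y] form (the sample sizes and the y = 2x example are choices made here):

```python
import numpy as np

rng = np.random.default_rng(1)

def covariance(x, y):
    """cov[x, y] = E[x y] - E[x] E[y], estimated from paired samples."""
    return np.mean(x * y) - np.mean(x) * np.mean(y)

# Independent standard-normal samples: covariance should be close to zero.
x = rng.normal(size=50_000)
y = rng.normal(size=50_000)
print(abs(covariance(x, y)) < 0.05)   # True: independence implies cov ~ 0

# y = 2x varies together with x: cov[x, 2x] = 2 var[x], roughly 2 here.
print(covariance(x, 2 * x) > 1.5)     # True
```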
40. Covariance for Vector Valued Variables
   Two random variables x ∈ ℝ^D and y ∈ ℝ^D:
   cov[x, y] = E_{x,y}[(x − E[x])(y^T − E[y^T])]
             = E_{x,y}[x y^T] − E[x] E[y^T]