# Probability

Probability for Machine Learning

### Probability

1. **Introduction to Statistical Machine Learning.** Christfried Webers, Statistical Machine Learning Group, NICTA and College of Engineering and Computer Science, The Australian National University, Canberra, February - June 2011. © 2011 Christfried Webers, NICTA, The Australian National University. (Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning".)
2. **Part IV: Probability and Uncertainty.** Topics: Boxes with Apples and Oranges; Bayes' Theorem; Bayes' Probabilities; Probability Distributions; Gaussian Distribution over a Vector; Decision Theory; Model Selection - Key Ideas.
3. **Simple Experiment.** Step 1: choose a box, with p(B = r) = 4/10 for the red box and p(B = b) = 6/10 for the blue box. Step 2: choose any item from the selected box, each item with equal probability. Question: given that we have chosen an orange, what is the probability that the box we chose was the blue one?
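The question can also be answered empirically before doing the algebra; a minimal Monte Carlo sketch (the `simulate` helper, seed, and trial count are our own, not from the slides):

```python
import random

def simulate(trials=100_000, seed=0):
    """Estimate p(B = b | F = o) by repeating the two-step experiment."""
    rng = random.Random(seed)
    blue_given_orange = 0
    orange = 0
    for _ in range(trials):
        # Step 1: pick a box with p(blue) = 6/10, p(red) = 4/10.
        box = "b" if rng.random() < 0.6 else "r"
        # Step 2: pick a fruit; p(orange | blue) = 1/4, p(orange | red) = 3/4.
        p_orange = 0.25 if box == "b" else 0.75
        if rng.random() < p_orange:
            orange += 1
            if box == "b":
                blue_given_orange += 1
    return blue_given_orange / orange

print(simulate())  # close to 1/3, the exact answer derived on slide 5
```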
4. **What do we know?** The conditional probabilities of drawing a fruit F from a box B are p(F = o | B = b) = 1/4, p(F = a | B = b) = 3/4, p(F = o | B = r) = 3/4, p(F = a | B = r) = 1/4. Given that we have chosen an orange, what is the probability that the box we chose was the blue one, p(B = b | F = o)?
5. **Calculating the Posterior p(B = b | F = o).** By Bayes' theorem,
   p(B = b | F = o) = p(F = o | B = b) p(B = b) / p(F = o).
   The sum rule gives the denominator:
   p(F = o) = p(F = o, B = b) + p(F = o, B = r)
            = p(F = o | B = b) p(B = b) + p(F = o | B = r) p(B = r)
            = 1/4 × 6/10 + 3/4 × 4/10 = 9/20.
   Therefore p(B = b | F = o) = 1/4 × 6/10 × 20/9 = 1/3.
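The same calculation in code, using exact rational arithmetic so the fractions from the slide come out verbatim (a sketch of ours, not part of the slides):

```python
from fractions import Fraction

# Priors and likelihoods from the slides.
p_box = {"r": Fraction(4, 10), "b": Fraction(6, 10)}
p_orange_given_box = {"r": Fraction(3, 4), "b": Fraction(1, 4)}

# Sum rule: p(F = o) = sum over B of p(F = o | B) p(B).
p_orange = sum(p_orange_given_box[b] * p_box[b] for b in p_box)

# Bayes' theorem: p(B = b | F = o).
posterior_blue = p_orange_given_box["b"] * p_box["b"] / p_orange

print(p_orange)        # 9/20
print(posterior_blue)  # 1/3
```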
6. **Bayes' Theorem - revisited.** Before choosing an item from a box, the most complete information we have is the prior p(B). In our example p(B = b) = 6/10, so choosing a box from the prior alone, we would opt for the blue box. Once we observe some data (e.g. we choose an orange), we can calculate the posterior probability p(B = b | F = o) via Bayes' theorem. After observing an orange, the posterior is p(B = b | F = o) = 1/3 and therefore p(B = r | F = o) = 2/3: having observed an orange, it is now more likely that the orange came from the red box.
7. **Bayes' Rule.** posterior = likelihood × prior / normalisation:
   p(Y | X) = p(X | Y) p(Y) / p(X) = p(X | Y) p(Y) / ∑_Y p(X | Y) p(Y).
8. **Bayes' Probabilities.** Beyond the classical or frequentist interpretation of probabilities, the Bayesian view uses probabilities to represent uncertainty. Example: will the Arctic ice cap have disappeared by the end of the century? Fresh evidence can change one's opinion on ice loss. The goal is to quantify uncertainty and revise it in the light of new evidence, using the Bayesian interpretation of probability.
9. **Andrey Kolmogorov - Axiomatic Probability Theory (1933).** Let (Ω, F, P) be a measure space with P(Ω) = 1. Then (Ω, F, P) is a probability space, with sample space Ω, event space F and probability measure P.
   Axiom 1: P(E) ≥ 0 for all E ∈ F.
   Axiom 2: P(Ω) = 1.
   Axiom 3: P(E1 ∪ E2 ∪ …) = ∑_i P(E_i) for any countable sequence of pairwise disjoint events E1, E2, ….
10. **Richard Threlkeld Cox (1946, 1961).** Assume numerical values are used to represent degrees of belief, and define a set of axioms encoding common-sense properties of such beliefs. The result is a set of rules for manipulating degrees of belief which are equivalent to the sum and product rules of probability. Many other authors have proposed different sets of axioms and properties, with the same result: the numerical quantities all behave according to the rules of probability. We call these quantities Bayesian probabilities.
11. **Curve Fitting - revisited.** The uncertainty about the parameter w is captured in the prior probability p(w). Given observed data D = {t1, …, tN}, we calculate the uncertainty in w after the data D have been observed:
    p(w | D) = p(D | w) p(w) / p(D).
    Viewed as a function of w, p(D | w) is the likelihood function: it expresses how probable the data are for different values of w. It is not a probability distribution over w.
12. **Likelihood Function - Frequentist versus Bayesian.** Both views use the likelihood function p(D | w).
    Frequentist approach: w is considered a fixed parameter, whose value is defined by some 'estimator'; error bars on the estimated w are obtained from the distribution of possible data sets D.
    Bayesian approach: there is only one single data set D, and the uncertainty in the parameters comes from a probability distribution over w.
13. **Frequentist Estimator - Maximum Likelihood.** Choose the w for which the likelihood p(D | w) is maximal, i.e. the w for which the probability of the observed data is maximal. In machine learning, the error function is the negative log of the likelihood function; since log is a monotonic function, maximising the likelihood is equivalent to minimising the error. Example: a fair-looking coin is tossed three times and always lands on heads. The maximum likelihood estimate of the probability of landing heads is 1.
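The coin example can be reproduced with a small grid search over µ (the `bernoulli_likelihood` helper and the grid are our own illustration, not the slides' notation):

```python
from math import prod

def bernoulli_likelihood(mu, data):
    """p(D | mu) for independent coin tosses, encoding heads as 1."""
    return prod(mu if x == 1 else 1 - mu for x in data)

data = [1, 1, 1]  # three tosses, all heads

# Scan a grid of mu values; the likelihood mu^3 is maximised at mu = 1.
mu_ml = max((i / 100 for i in range(101)),
            key=lambda mu: bernoulli_likelihood(mu, data))
print(mu_ml)  # 1.0
```

This illustrates the overfitting danger the slide hints at: maximum likelihood assigns zero probability to ever seeing tails.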
14. **Bayesian Approach.** Including prior knowledge is easy (via the prior over w). BUT: a badly chosen prior can lead to bad results; the choice of prior is subjective, and is sometimes motivated by a convenient mathematical form. One needs to sum/integrate over the whole parameter space, which is made tractable by advances in sampling (Markov chain Monte Carlo methods) and in approximation schemes (variational Bayes, expectation propagation).
15. **The Gaussian Distribution.** For x ∈ R, the Gaussian distribution with mean µ and variance σ² is
    N(x | µ, σ²) = (2πσ²)^(-1/2) exp{-(x - µ)² / (2σ²)}.
    (Figure: the bell-shaped density N(x | µ, σ²), centred at µ, with width 2σ marked.)
16. **The Gaussian Distribution.** Properties: N(x | µ, σ²) > 0 and ∫ N(x | µ, σ²) dx = 1, integrating over (-∞, ∞). The expectation of x is
    E[x] = ∫ N(x | µ, σ²) x dx = µ,
    the expectation of x² is
    E[x²] = ∫ N(x | µ, σ²) x² dx = µ² + σ²,
    and the variance of x is
    var[x] = E[x²] - E[x]² = σ².
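These moment identities can be verified numerically; a sketch using midpoint-rule integration (the particular µ, σ² and grid parameters are our own choices):

```python
from math import exp, pi, sqrt

def gauss(x, mu, s2):
    """Gaussian density N(x | mu, sigma^2)."""
    return exp(-(x - mu) ** 2 / (2 * s2)) / sqrt(2 * pi * s2)

mu, s2 = 1.5, 0.8
# Midpoint rule over a range wide enough that the tails are negligible.
a, b, n = mu - 12.0, mu + 12.0, 200_000
h = (b - a) / n
xs = [a + (i + 0.5) * h for i in range(n)]
dens = [gauss(x, mu, s2) for x in xs]

Z  = h * sum(dens)                                 # normalisation, ≈ 1
m1 = h * sum(x * d for x, d in zip(xs, dens))      # E[x], ≈ mu
m2 = h * sum(x * x * d for x, d in zip(xs, dens))  # E[x^2], ≈ mu^2 + sigma^2
print(Z, m1, m2 - m1 * m1)  # ≈ 1, 1.5, 0.8
```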
17. **The Mode of a Probability Distribution.** The mode of a distribution is the value that occurs most frequently; for a probability density function, it is the value x at which the probability density attains its maximum. The Gaussian distribution has one mode (it is unimodal), namely µ. If a probability distribution has multiple local maxima, it is called multimodal (example: a mixture of three Gaussians).
18. **The Bernoulli Distribution.** Two possible outcomes x ∈ {0, 1} (e.g. a coin, which may be damaged): p(x = 1 | µ) = µ for 0 ≤ µ ≤ 1, and p(x = 0 | µ) = 1 - µ. The Bernoulli distribution is
    Bern(x | µ) = µ^x (1 - µ)^(1-x),
    with expectation E[x] = µ and variance var[x] = µ(1 - µ).
19. **The Binomial Distribution.** Flip a coin N times. What is the distribution of observing heads exactly m times? This is a distribution over m ∈ {0, …, N}:
    Bin(m | N, µ) = (N choose m) µ^m (1 - µ)^(N-m).
    (Figure: histogram of Bin(m | N, µ) for N = 10, µ = 0.25.)
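A small sketch of the binomial pmf for the slide's N = 10, µ = 0.25 case (the helper name is ours); note that N = 1 recovers the Bernoulli distribution of the previous slide:

```python
from math import comb

def binom_pmf(m, N, mu):
    """Bin(m | N, mu); N = 1 gives the Bernoulli distribution."""
    return comb(N, m) * mu ** m * (1 - mu) ** (N - m)

N, mu = 10, 0.25
pmf = [binom_pmf(m, N, mu) for m in range(N + 1)]

print(sum(pmf))                                  # ≈ 1.0 (normalised)
print(sum(m * p for m, p in enumerate(pmf)))     # ≈ 2.5 (mean N * mu)
print(max(range(N + 1), key=lambda m: pmf[m]))   # 2, the most probable m
```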
20. **The Beta Distribution.**
    Beta(µ | a, b) = Γ(a + b) / (Γ(a) Γ(b)) µ^(a-1) (1 - µ)^(b-1).
    (Figure: plots of the beta density over µ ∈ [0, 1] for (a, b) = (0.1, 0.1), (1, 1), (2, 3) and (8, 4).)
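The normalisation of the beta density can be checked numerically (our own sketch; we restrict to (a, b) pairs with a, b ≥ 1 from the figure so the density stays bounded at the endpoints):

```python
from math import gamma

def beta_pdf(mu, a, b):
    """Beta(mu | a, b) density on (0, 1)."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * mu ** (a - 1) * (1 - mu) ** (b - 1)

# Midpoint-rule check that the density integrates to 1.
n = 100_000
norms = {}
for a, b in [(1, 1), (2, 3), (8, 4)]:
    norms[(a, b)] = sum(beta_pdf((i + 0.5) / n, a, b) for i in range(n)) / n
print(norms)  # every value ≈ 1.0
```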
21. **The Gaussian Distribution over a Vector x.** For x ∈ R^D, the Gaussian distribution with mean µ ∈ R^D and covariance matrix Σ ∈ R^(D×D) is
    N(x | µ, Σ) = (2π)^(-D/2) |Σ|^(-1/2) exp{-(1/2) (x - µ)^T Σ^(-1) (x - µ)},
    where |Σ| is the determinant of Σ.
    (Figure: elliptical contours of the density in the (x1, x2) plane, with principal axes along u1, u2 of lengths λ1^(1/2), λ2^(1/2), centred at µ.)
22. **The Gaussian Distribution over a Vector x.** We can find a linear transformation to a new coordinate system in which x becomes y = U^T (x - µ), where U is the eigenvector matrix of the covariance matrix Σ with eigenvalue matrix E = diag(λ1, …, λD):
    Σ U = U E.
    U can be made an orthogonal matrix, so the columns u_i of U are unit vectors orthogonal to each other: u_i^T u_j = 1 if i = j, and 0 if i ≠ j. Now we can write Σ and its inverse (prove that Σ Σ^(-1) = I) as
    Σ = U E U^T = ∑_{i=1}^{D} λ_i u_i u_i^T,
    Σ^(-1) = U E^(-1) U^T = ∑_{i=1}^{D} (1/λ_i) u_i u_i^T.
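The spectral form Σ = ∑ λ_i u_i u_i^T can be verified on a small example; a 2×2 sketch with our own numbers, using the closed-form eigendecomposition of a symmetric 2×2 matrix:

```python
from math import atan2, cos, sin, sqrt

# A 2x2 symmetric covariance matrix (our example, not from the slides).
S = [[2.0, 0.8],
     [0.8, 1.0]]
a, c, d = S[0][0], S[0][1], S[1][1]

# Eigenvalues of [[a, c], [c, d]] from the quadratic formula.
disc = sqrt(((a - d) / 2) ** 2 + c ** 2)
lam = [(a + d) / 2 + disc, (a + d) / 2 - disc]

# Orthonormal eigenvectors from the rotation angle theta; the columns of u
# are u1 = (cos t, sin t) and u2 = (-sin t, cos t).
theta = 0.5 * atan2(2 * c, a - d)
u = [[cos(theta), -sin(theta)],
     [sin(theta),  cos(theta)]]

# Reconstruct Sigma = sum_i lambda_i u_i u_i^T.
recon = [[sum(lam[k] * u[i][k] * u[j][k] for k in range(2)) for j in range(2)]
         for i in range(2)]
print(recon)  # ≈ S
```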
23. **The Gaussian Distribution over a Vector x.** Using the linear transformation y = U^T (x - µ) and Σ^(-1) = U E^(-1) U^T, the exponent (x - µ)^T Σ^(-1) (x - µ) transforms into
    y^T E^(-1) y = ∑_{j=1}^{D} y_j² / λ_j.
    Exponentiating the sum (and taking care of the normalisation factors) yields a product of scalar Gaussian distributions along the orthogonal directions u_j:
    p(y) = ∏_{j=1}^{D} (2πλ_j)^(-1/2) exp{-y_j² / (2λ_j)}.
    (Figure: the elliptical contours in (x1, x2) with principal axes u1, u2, as on the previous slide.)
24. **Partitioned Gaussians.** Given a joint Gaussian distribution N(x | µ, Σ) with mean vector µ, covariance matrix Σ, and precision matrix Λ ≡ Σ^(-1), assume the variables can be partitioned into two sets:
    x = (x_a, x_b), µ = (µ_a, µ_b), Σ = [[Σ_aa, Σ_ab], [Σ_ba, Σ_bb]], Λ = [[Λ_aa, Λ_ab], [Λ_ba, Λ_bb]].
    The precision blocks are
    Λ_aa = (Σ_aa - Σ_ab Σ_bb^(-1) Σ_ba)^(-1),
    Λ_ab = -(Σ_aa - Σ_ab Σ_bb^(-1) Σ_ba)^(-1) Σ_ab Σ_bb^(-1).
    Conditional distribution:
    p(x_a | x_b) = N(x_a | µ_{a|b}, Λ_aa^(-1)), with µ_{a|b} = µ_a - Λ_aa^(-1) Λ_ab (x_b - µ_b).
    Marginal distribution:
    p(x_a) = N(x_a | µ_a, Σ_aa).
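The conditional-distribution formulas can be sanity-checked with scalar blocks (D = 2, every block 1×1); the covariance numbers below are our own illustration:

```python
# Scalar covariance blocks and means (our own example values).
Saa, Sab, Sba, Sbb = 1.0, 0.5, 0.5, 2.0
mua, mub = 0.0, 1.0
xb = 0.7  # observed value of x_b

# Precision blocks via the Schur complement of Sigma_bb.
schur = Saa - Sab * (1 / Sbb) * Sba
Laa = 1 / schur
Lab = -(1 / schur) * Sab * (1 / Sbb)

# Conditional p(x_a | x_b) = N(x_a | mu_cond, 1 / Laa).
mu_cond  = mua - (1 / Laa) * Lab * (xb - mub)
var_cond = 1 / Laa

# Equivalent textbook forms: mu_a + Sab Sbb^-1 (x_b - mu_b), and the Schur
# complement itself as the conditional variance.
print(mu_cond, mua + Sab / Sbb * (xb - mub))  # both ≈ -0.075
print(var_cond, schur)                        # both ≈ 0.875
```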
25. **Partitioned Gaussians.** (Figure: contours of a Gaussian distribution over two variables x_a and x_b (left), and the marginal distribution p(x_a) together with the conditional distribution p(x_a | x_b) for x_b = 0.7 (right).)
26. **Nonlinear Change of Variables in Distributions.** Given some density p_x(x), consider a nonlinear change of variables x = g(y). What is the new probability distribution p_y(y) in terms of the variable y?
    p_y(y) = p_x(x) |dx/dy| = p_x(g(y)) |g′(y)|.
    For vector-valued x and y,
    p_y(y) = p_x(x) |J|, where J_ij = ∂x_i/∂y_j.
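A numerical check of the change-of-variables formula (our own example: x uniform on (0, 1) and g the logistic function, so p_y comes out as the standard logistic density):

```python
from math import exp

def p_x(x):
    """Density of x ~ Uniform(0, 1)."""
    return 1.0 if 0.0 < x < 1.0 else 0.0

def g(y):
    """The change of variables x = g(y): the logistic function."""
    return 1.0 / (1.0 + exp(-y))

def g_prime(y):
    s = g(y)
    return s * (1.0 - s)

def p_y(y):
    """p_y(y) = p_x(g(y)) |g'(y)|."""
    return p_x(g(y)) * abs(g_prime(y))

# Midpoint-rule check that the transformed density still integrates to 1.
a, b, n = -30.0, 30.0, 100_000
h = (b - a) / n
Z = h * sum(p_y(a + (i + 0.5) * h) for i in range(n))
print(Z)  # ≈ 1.0
```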
27. **Decision Theory - Key Ideas.** For two classes C1 and C2 with joint distribution p(x, C_k), Bayes' theorem gives
    p(C_k | x) = p(x | C_k) p(C_k) / p(x).
    Example: cancer treatment (k = 2). The data x is an X-ray image; C1 means the patient has cancer (C2: the patient has no cancer). p(C1) is the prior probability of a person having cancer; p(C1 | x) is the posterior probability of a person having cancer after having seen the X-ray data.
28. **Decision Theory - Key Ideas.** We need a rule which assigns each value of the input x to one of the available classes: the input space is partitioned into decision regions R_k, leading to decision boundaries or decision surfaces. The probability of a mistake is
    p(mistake) = p(x ∈ R1, C2) + p(x ∈ R2, C1) = ∫_{R1} p(x, C2) dx + ∫_{R2} p(x, C1) dx.
    (Figure: two class densities over x with the regions R1 and R2 marked.)
29. **Decision Theory - Key Ideas.** The goal is to minimise the probability of a mistake,
    p(mistake) = p(x ∈ R1, C2) + p(x ∈ R2, C1) = ∫_{R1} p(x, C2) dx + ∫_{R2} p(x, C1) dx.
    (Figure: the joint densities p(x, C1) and p(x, C2) with a decision boundary x0 separating the regions R1 and R2.)
30. **Decision Theory - Key Ideas.** With multiple classes, instead of minimising the probability of mistakes, maximise the probability of correct classification:
    p(correct) = ∑_{k=1}^{K} p(x ∈ R_k, C_k) = ∑_{k=1}^{K} ∫_{R_k} p(x, C_k) dx.
31. **Minimising the Expected Loss.** Not all mistakes are equally costly. Weight each misclassification of x to the wrong class C_j, instead of assigning it to the correct class C_k, by a factor L_kj. The expected loss is then
    E[L] = ∑_k ∑_j ∫_{R_j} L_kj p(x, C_k) dx.
    Goal: minimise the expected loss E[L].
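For a single input x, minimising the expected loss reduces to picking the decision j that minimises ∑_k L_kj p(C_k | x); a minimal sketch with an invented loss matrix (the numbers are ours, chosen so that one kind of mistake is far more costly):

```python
# Loss matrix L[k][j]: cost of deciding class j when the true class is k.
L = [[0, 1],      # true class 0: wrongly deciding class 1 costs 1
     [100, 0]]    # true class 1: wrongly deciding class 0 costs 100

def best_decision(posterior):
    """Return (argmin_j sum_k L[k][j] p(C_k | x), the per-decision risks)."""
    K = len(posterior)
    risks = [sum(L[k][j] * posterior[k] for k in range(K)) for j in range(K)]
    return min(range(K), key=risks.__getitem__), risks

# Even a small probability of the costly class flips the decision towards it.
decision, risks = best_decision([0.95, 0.05])
print(decision, risks)  # decision 1, risks ≈ [5.0, 0.95]
```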
32. **The Reject Region.** Avoid making automated decisions on difficult cases. Difficult cases: the posterior probabilities p(C_k | x) are all very small, or the joint distributions p(x, C_k) have comparable values. (Figure: posteriors p(C1 | x) and p(C2 | x); inputs x where the larger posterior falls below a threshold θ form the reject region.)
33. **S-fold Cross-Validation.** Given a set of N data items and targets, the goal is to find the best model (type of model, number of parameters such as the order p of the polynomial, or the regularisation constant λ) while avoiding overfitting. Solution: train a machine learning algorithm with some of the data, and evaluate it with the rest. If we have plenty of data:
    1. Train a range of models, or a model with a range of parameters.
    2. Compare the performance on an independent data set (the validation set) and choose the one with the best predictive performance.
    3. Overfitting to the validation set can still occur; therefore, use a third, test set for the final evaluation. (Keep the test set in a safe and never give it to the developers ;-)
34. **S-fold Cross-Validation.** For few data there is a dilemma: few training data or few test data. The solution is cross-validation: use a portion (S − 1)/S of the available data for training, but use all the data to assess the performance. For very scarce data one may use S = N, which is also called the leave-one-out technique.
35. **S-fold Cross-Validation.** Partition the data into S groups. Use S − 1 groups to train a set of models that are then evaluated on the remaining group. Repeat for all S choices of the held-out group, and average the performance scores from the S runs. (Figure: the four train/validation splits for S = 4.)
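The partitioning scheme above can be sketched as follows (the helper `s_fold_splits` is our own; it assigns indices to folds round-robin rather than contiguously as in the slide's figure):

```python
def s_fold_splits(n_items, S):
    """Partition indices 0..n_items-1 into S folds; yield (train, held_out) pairs."""
    folds = [list(range(i, n_items, S)) for i in range(S)]
    for i in range(S):
        held_out = folds[i]
        train = [idx for j in range(S) if j != i for idx in folds[j]]
        yield train, held_out

# Example for S = 4 on 8 items; S = n_items would give leave-one-out.
for train, held_out in s_fold_splits(8, 4):
    print(held_out, train)
```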