1. The document discusses maximum likelihood estimation and Bayesian parameter estimation for machine learning problems involving parametric densities like the Gaussian.
2. Maximum likelihood estimation finds the parameter values that maximize the probability of obtaining the observed training data. For Gaussian distributions with unknown mean and variance, MLE returns the sample mean and variance.
3. Bayesian parameter estimation treats the parameters as random variables and uses prior distributions and observed data to obtain posterior distributions over the parameters. This allows incorporation of prior knowledge with the training data.
1. Machine Learning
Maximum Likelihood Estimation
and Bayesian Parameter Estimation
(Parametric Learning)
Phong VO
vdphong@fit.hcmus.edu.vn
September 11, 2010
– Typeset by FoilTEX –
2. Introduction
• From the previous lecture, designing a classifier assumes knowledge of p(x|ωi) and P(ωi) for each class; e.g., for Gaussian densities we need to know µi, Σi for i = 1, . . . , c.
• Unfortunately, this information is not available directly.
• Given training samples with the true class label for each sample, we have a learning problem.
• If the form of the densities is known, i.e. the number of parameters and general knowledge about the problem, a parameter estimation problem results.
3. Example 1. Assume that p(x|ωi) is a normal density with mean µi and covariance matrix Σi, although we do not know the exact values of these quantities. This knowledge simplifies the problem from one of estimating an unknown function p(x|ωi) to one of estimating the parameters µi and Σi.
4. Approaches to Parameter Estimation
• In maximum likelihood estimation, we assume the parameters are fixed but unknown. The MLE approach seeks the "best" parameter estimate, where "best" means the set of parameters that maximizes the probability of obtaining the training set.
• Bayesian estimation models the parameters to be estimated as random variables with some (assumed known) a priori distribution. The training samples are "observations" which allow conversion of the a priori information into an a posteriori density. The Bayesian approach uses the training set to update the training-set-conditioned density function of the unknown parameters.
5. Maximum Likelihood Estimation
• MLE nearly always has good convergence properties as the number of training samples increases.
• It is often simpler than alternative methods, such as Bayesian techniques.
6. Formulation
• Assume D = D1 ∪ · · · ∪ Dc, with the samples in Dj having been drawn independently according to the probability law p(x|ωj).
• Assume that p(x|ωj) has a known parametric form, determined uniquely by the value of a parameter vector θj, e.g. p(x|ωj) ∼ N(µj, Σj) where θj = {µj, Σj}.
• The dependence of p(x|ωj) on θj is expressed as p(x|ωj, θj).
• Our problem: use the training samples to obtain good estimates for the unknown parameter vectors θ1, . . . , θc.
7. • Assume further that samples in Di give no information about θj if i ≠ j; in other words, the parameters are functionally independent.
• The problem of classification is thus turned into c problems of parameter estimation: use a set D of training samples drawn independently from the probability density p(x|θ) to estimate the unknown parameter vector θ.
• Suppose that D = {x1, . . . , xn}. Since the samples were drawn independently, we have

    p(D|θ) = ∏_{k=1}^{n} p(x_k|θ)

p(D|θ) is called the likelihood of θ with respect to the set of samples.
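The factorization above is easy to compute directly. Below is a minimal sketch (not from the slides; the sample values and function names are made up) of the likelihood for a univariate Gaussian, computed as a sum of logs rather than a raw product:

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Univariate normal density N(mu, sigma2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def log_likelihood(samples, mu, sigma2):
    """l(theta) = sum_k ln p(x_k|theta); the product p(D|theta) becomes a sum of logs."""
    return sum(math.log(gaussian_pdf(x, mu, sigma2)) for x in samples)

D = [1.2, 0.8, 1.1, 0.9, 1.0]
# Parameters near the data should be more likely than parameters far from it:
assert log_likelihood(D, 1.0, 0.1) > log_likelihood(D, 3.0, 0.1)
```

Working with the log avoids numerical underflow: for large n, the raw product of densities can fall below the smallest representable float.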
8. • The maximum likelihood estimate of θ is the value θ̂ that maximizes p(D|θ).
9. • Taking the logarithm (purely for analytical convenience), we define l(θ) as the log-likelihood function

    l(θ) = ln p(D|θ)

• Since the logarithm is monotonically increasing, the θ̂ that maximizes the log-likelihood also maximizes the likelihood,

    θ̂ = arg max_θ l(θ) = arg max_θ ∑_{k=1}^{n} ln p(x_k|θ)

• θ̂ can be found by taking derivatives of the log-likelihood function.
10.
    ∇_θ l = ∑_{k=1}^{n} ∇_θ ln p(x_k|θ)

where

    ∇_θ ≡ (∂/∂θ_1, . . . , ∂/∂θ_p)^t,

and then solve the equation

    ∇_θ l = 0.
11. • A solution could be a global maximum, a local maximum or minimum. We have to check each of them individually.
• NOTE: A related class of estimators, maximum a posteriori or MAP estimators, find the value of θ that maximizes p(D|θ)p(θ). Thus an ML estimator is a MAP estimator for a uniform or "flat" prior.
12. MLE: The Gaussian Case for Unknown µ
• In this case, only the mean is unknown. Under this condition, we consider a sample point x_k and find

    ln p(x_k|µ) = −(1/2) ln[(2π)^d |Σ|] − (1/2)(x_k − µ)^t Σ^{-1} (x_k − µ)

and

    ∇_µ ln p(x_k|µ) = Σ^{-1}(x_k − µ).

• The maximum likelihood estimate for µ must satisfy
13.
    ∑_{k=1}^{n} Σ^{-1}(x_k − µ̂) = 0.

• Solving the above equation, we obtain

    µ̂ = (1/n) ∑_{k=1}^{n} x_k

• Interpretation: the maximum likelihood estimate for the unknown population mean is just the arithmetic average of the training samples, i.e. the sample mean. Thinking of the n samples as a cloud of points, the sample mean is the centroid of the cloud.
14. MLE: The Gaussian Case for Unknown µ and Σ
• Consider the univariate normal case, θ = (θ1, θ2) = (µ, σ²). The log-likelihood of a single point is

    ln p(x_k|θ) = −(1/2) ln(2πθ_2) − (1/(2θ_2))(x_k − θ_1)²

and its derivative is

    ∇_θ l = ∇_θ ln p(x_k|θ) = ( (1/θ_2)(x_k − θ_1),  −1/(2θ_2) + (x_k − θ_1)²/(2θ_2²) )^t.
15. • Setting ∇_θ l = 0, we obtain

    ∑_{k=1}^{n} (1/θ̂_2)(x_k − θ̂_1) = 0

and

    −∑_{k=1}^{n} 1/θ̂_2 + ∑_{k=1}^{n} (x_k − θ̂_1)²/θ̂_2² = 0.

• Substituting µ̂ = θ̂_1 and σ̂² = θ̂_2, we obtain

    µ̂ = (1/n) ∑_{k=1}^{n} x_k
16. and

    σ̂² = (1/n) ∑_{k=1}^{n} (x_k − µ̂)².

Exercise 1. Estimate µ and Σ for the case of a multivariate Gaussian.
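These closed-form estimates can be sanity-checked numerically. The following sketch (not part of the original slides; the data are synthetic) confirms that the sample mean and the biased 1/n sample variance score a higher log-likelihood than nearby parameter values:

```python
import math
import random

def log_likelihood(samples, mu, sigma2):
    """Gaussian log-likelihood l(theta) for theta = (mu, sigma2)."""
    n = len(samples)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in samples) / (2 * sigma2))

random.seed(0)
D = [random.gauss(2.0, 1.5) for _ in range(1000)]

# MLE: sample mean and biased (1/n) sample variance
mu_hat = sum(D) / len(D)
sigma2_hat = sum((x - mu_hat) ** 2 for x in D) / len(D)

best = log_likelihood(D, mu_hat, sigma2_hat)
# Perturbing either parameter away from the MLE lowers the log-likelihood:
for dmu in (-0.3, 0.3):
    assert log_likelihood(D, mu_hat + dmu, sigma2_hat) < best
for ds in (-0.5, 0.5):
    assert log_likelihood(D, mu_hat, sigma2_hat + ds) < best
```

Note that the MLE of the variance divides by n, not n − 1; it is biased but still the likelihood maximizer.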
17. Bayesian Parameter Estimation
• Bayes' formula allows us to compute the posterior probabilities P(ωi|x) from the prior probabilities P(ωi) and the class-conditional densities p(x|ωi).
• How can we obtain these quantities?
  – Prior probabilities: from knowledge of the functional forms of the unknown densities and of ranges for the values of the unknown parameters.
  – Class-conditional densities: from the training samples.
18. • Given the training samples D, Bayes' formula then becomes

    P(ωi|x, D) = p(x|ωi, D) P(ωi|D) / ∑_{j=1}^{c} p(x|ωj, D) P(ωj|D)

• Assume that the a priori probabilities are known, P(ωi) = P(ωi|D), and that the samples in Di have no influence on p(x|ωj, D) if i ≠ j; then

    P(ωi|x, D) = p(x|ωi, Di) P(ωi) / ∑_{j=1}^{c} p(x|ωj, Dj) P(ωj)

• We have c separate problems of the following form: use a set D of samples drawn independently according to the fixed but unknown probability distribution p(x) to determine p(x|D). Our supervised learning problem is turned into an unsupervised density estimation problem.
19. The Parameter Distribution
• Although the desired probability density p(x) is unknown, we assume
that it has a known parametric form.
• The unknown factor is the value of a parameter vector θ. As long as θ
is known, the function p(x|θ) is known.
• Information that we have about θ prior to observing the samples is
assumed to be contained in a known prior density p(θ).
• Observation of the samples converts this to a posterior density p(θ|D),
which is expected to be sharply peaked about the true value of θ.
20. • Our basic goal is to compute p(x|D), which is as close as we can come to obtaining the unknown p(x). Integrating the joint density p(x, θ|D), we have

    p(x|D) = ∫ p(x, θ|D) dθ           (1)
           = ∫ p(x|θ, D) p(θ|D) dθ    (2)
           = ∫ p(x|θ) p(θ|D) dθ       (3)
21. BPE: Gaussian Case
• Calculate p(θ|D) and p(x|D) for the case where p(x|µ) ∼ N(µ, Σ).
• Consider the univariate case where µ is unknown:

    p(x|µ) ∼ N(µ, σ²)

• We assume that the prior density p(µ) has a known distribution:

    p(µ) ∼ N(µ0, σ0²)

Interpretation: µ0 represents our best a priori guess for µ, and σ0² measures our uncertainty about this guess.
22. • Once µ is "guessed", it determines the density for x. Letting D = {x1, . . . , xn}, Bayes' formula gives us

    p(µ|D) = p(D|µ) p(µ) / ∫ p(D|µ) p(µ) dµ ∝ ∏_{k=1}^{n} p(x_k|µ) p(µ)

where it is easy to see how the training samples influence the estimate of the true µ.
• Since p(x_k|µ) ∼ N(µ, σ²) and p(µ) ∼ N(µ0, σ0²), we have
23.
    p(µ|D) ∝ ∏_{k=1}^{n} [ (1/(σ√(2π))) exp(−(1/2)((x_k − µ)/σ)²) ] · (1/(σ0√(2π))) exp(−(1/2)((µ − µ0)/σ0)²)   (4)

           ∝ exp[ −(1/2)( ∑_{k=1}^{n} ((µ − x_k)/σ)² + ((µ − µ0)/σ0)² ) ]   (5)

           ∝ exp[ −(1/2)( (n/σ² + 1/σ0²) µ² − 2( (1/σ²) ∑_{k=1}^{n} x_k + µ0/σ0² ) µ ) ]   (6)

• p(µ|D) is again a normal density; it is said to be a reproducing density, and p(µ) is a conjugate prior.
24. • If we write p(µ|D) ∼ N(µn, σn²), then

    p(µ|D) = (1/(σn√(2π))) exp[ −(1/2)((µ − µn)/σn)² ]

• Equating coefficients shows us

    µn = ( nσ0²/(nσ0² + σ²) ) x̄n + ( σ²/(nσ0² + σ²) ) µ0

where x̄n = (1/n) ∑_{k=1}^{n} x_k, and

    σn² = σ0² σ² / (nσ0² + σ²)
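The update equations for µn and σn² translate directly into code. The following sketch (observations and prior values are made up for illustration) shows the posterior being pulled from the prior guess toward the sample mean, with shrinking uncertainty:

```python
def posterior_params(samples, mu0, sigma0_sq, sigma_sq):
    """Posterior N(mu_n, sigma_n^2) over the mean of N(mu, sigma_sq),
    given the conjugate prior N(mu0, sigma0_sq)."""
    n = len(samples)
    xbar = sum(samples) / n
    denom = n * sigma0_sq + sigma_sq
    mu_n = (n * sigma0_sq / denom) * xbar + (sigma_sq / denom) * mu0
    sigma_n_sq = sigma0_sq * sigma_sq / denom
    return mu_n, sigma_n_sq

D = [2.1, 1.9, 2.3, 2.2, 1.8]   # observations, known sigma^2 = 1
mu_n, sigma_n_sq = posterior_params(D, mu0=0.0, sigma0_sq=1.0, sigma_sq=1.0)

xbar = sum(D) / len(D)
# mu_n lies between the prior guess mu0 = 0 and the sample mean,
# and the posterior variance is smaller than the prior variance:
assert 0.0 < mu_n < xbar
assert sigma_n_sq < 1.0
```

With n = 5 and equal prior and data variances, the posterior mean weights the sample mean by 5/6 and the prior by 1/6, so the data already dominate after a handful of observations.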
25. Interpretation: these equations show how the prior information is
combined with the empirical information in the samples to obtain the a
posteriori density p(µ|D).
26. Interpretation
• µn represents our best guess for µ after observing n samples.
• σn² measures our uncertainty about this guess; since σn² ≈ σ²/n for large n,

    lim_{n→∞} σn² = 0,

each additional observation decreases our uncertainty about the true value of µ.
• As n increases, p(µ|D) becomes more and more sharply peaked, approaching a Dirac delta function.
• This behavior is known as Bayesian learning.
29. • µn is a positive combination of x̄n and µ0, always lying between them:

    lim_{n→∞} µn = x̄n
    µn = x̄n if σ = 0
    µn = µ0 if σ0 = 0
    µn ≈ x̄n if σ0 ≫ σ

• The dogmatism:

    (prior knowledge)/(empirical data) ∼ σ²/σ0²

• If the dogmatism is not infinite, then after enough samples are taken the exact values assumed for µ0 and σ0² will be unimportant, and µn will converge to the sample mean.
30. Computing the class-conditional density
• Having obtained the a posteriori density for the mean, p(µ|D), we now compute the "class-conditional" density p(x|D):

    p(x|D) = ∫ p(x|µ) p(µ|D) dµ   (7)
           = ∫ (1/(σ√(2π))) exp[−(1/2)((x − µ)/σ)²] (1/(σn√(2π))) exp[−(1/2)((µ − µn)/σn)²] dµ   (8)
           = (1/(2πσσn)) exp[ −(1/2)(x − µn)²/(σ² + σn²) ] f(σ, σn),   (9)
31. where

    f(σ, σn) = ∫ exp[ −(1/2) ((σ² + σn²)/(σ² σn²)) ( µ − (σn² x + σ² µn)/(σ² + σn²) )² ] dµ

• Hence p(x|D) is normally distributed with mean µn and variance σ² + σn²:

    p(x|D) ∼ N(µn, σ² + σn²)

• The density p(x|D) is the desired class-conditional density p(x|ωj, Dj).

Exercise 2. Use Bayesian estimation to calculate the a posteriori density p(θ|D) and the desired probability density p(x|D) for the multivariate case where p(x|µ) ∼ N(µ, Σ).
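The result p(x|D) ∼ N(µn, σ² + σn²) can be checked by simulation: drawing µ from the posterior and then x from p(x|µ) samples exactly from the integral defining p(x|D). A sketch with made-up posterior values:

```python
import random
import statistics

random.seed(1)
mu_n, sigma_n_sq = 1.7, 0.25   # hypothetical posterior p(mu|D) ~ N(1.7, 0.25)
sigma_sq = 1.0                 # known variance of p(x|mu)

# Ancestral sampling from p(x|D) = integral of p(x|mu) p(mu|D) dmu:
xs = []
for _ in range(200_000):
    mu = random.gauss(mu_n, sigma_n_sq ** 0.5)    # mu ~ p(mu|D)
    xs.append(random.gauss(mu, sigma_sq ** 0.5))  # x  ~ p(x|mu)

# Empirical moments should match N(mu_n, sigma_sq + sigma_n_sq):
assert abs(statistics.mean(xs) - mu_n) < 0.02
assert abs(statistics.variance(xs) - (sigma_sq + sigma_n_sq)) < 0.05
```

The extra σn² in the predictive variance is the price of not knowing µ exactly; it vanishes as n grows and the posterior concentrates.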
32. BPE: General Theory
The basic assumptions for the applicability of Bayesian estimation are
summarized as follows:
1. The form of the density p(x|θ) is assumed to be known, but the value
of the parameter vector θ is not known exactly.
2. Our initial knowledge about θ is assumed to be contained in a known a
priori density p(θ).
3. The rest of our knowledge about θ is contained in a set D of n samples
x1, . . . , xn drawn independently according to the unknown probability
density p(x).
33. The basic problem is to compute the posterior density p(θ|D), since then

    p(x|D) = ∫ p(x|θ) p(θ|D) dθ.

By Bayes' formula we have

    p(θ|D) = p(D|θ) p(θ) / ∫ p(D|θ) p(θ) dθ

and by the independence assumption

    p(D|θ) = ∏_{k=1}^{n} p(x_k|θ).
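When the prior is not conjugate, the normalizing integral rarely has a closed form, but the general recipe p(θ|D) ∝ p(D|θ)p(θ) can be approximated on a grid. A minimal sketch (not from the slides; the data, prior, and grid are made up), using a scalar θ that plays the role of an unknown Gaussian mean:

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Univariate normal density N(mu, sigma2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

D = [1.9, 2.2, 2.0, 2.4, 1.8]   # samples with known sigma^2 = 1; theta is the mean
step = 0.01
thetas = [i * step for i in range(-500, 1001)]   # grid of candidate means

prior = [gaussian_pdf(t, 0.0, 4.0) for t in thetas]                    # p(theta)
lik = [math.prod(gaussian_pdf(x, t, 1.0) for x in D) for t in thetas]  # p(D|theta)

unnorm = [p * l for p, l in zip(prior, lik)]
Z = sum(unnorm) * step                  # approximates the normalizing integral
posterior = [u / Z for u in unnorm]     # p(theta|D) on the grid

post_mean = sum(t * p for t, p in zip(thetas, posterior)) * step
# post_mean lies between the prior mean (0) and the sample mean (2.06).
```

Here the prior happens to be Gaussian, so the grid result can be compared against the conjugate formula from the earlier slides; for a genuinely non-conjugate prior only the numerical route remains.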
34. Frequentist Perspective
• Probability refers to limiting relative frequencies. Probabilities are objective properties of the real world.
• Parameters are fixed, unknown constants. Because they are not fluctuating, no useful probability statements can be made about parameters.
• Statistical procedures should be designed to have well-defined long-run frequency properties.
35. Bayesian Perspective
• Probability describes degrees of belief, not limiting frequency.
• We can make probability statements about parameters, even though they are fixed constants.
• We make inferences about a parameter θ by producing a probability distribution for θ.
36. Frequentists vs. Bayesians
• Bayesian inference is a controversial approach because it inherently embraces a subjective notion of probability.