Maximum Likelihood Estimation
02-680 Essential Mathematics and Statistics for Scientists
DeGroot Ch 7.5
Wasserman Ch 9
Overview
• Our recipe for learning
– Step 0: form a scientific hypothesis
– Step 1: collect data (samples × features)
– Step 2: pick the appropriate probability distribution, with parameter θ
– Step 3: estimate the parameter θ
Coin Flip Example
• Our recipe for learning
– Step 0: I got a new coin. What is the probability of heads?
– Step 1: collect data (samples × one feature)
– Step 2: pick the appropriate probability distribution, with parameter θ
– Step 3: estimate the parameter θ
Weight Example
• Our recipe for learning
– Step 0: What is the distribution of weights of people in Pittsburgh?
– Step 1: collect data (samples × one feature)
– Step 2: pick the appropriate probability distribution, with parameter θ
– Step 3: estimate the parameter θ
Before estimation in Step 3, we need to set things up
Statistical Inference, Learning: a Formal Set Up
• Given a sample X1,...,Xn ∼ F, how do we learn our probability model?
– Statistical model F: a set of distributions
– Parametric model: F that can be parameterized by a finite number of parameters
F = { p(x | θ) : θ ∈ Θ },
where θ = (θ1, ..., θk) ∈ Θ is the vector of unknown parameters.
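A minimal sketch of this idea in Python, assuming scipy is available: a parametric model is a map from a parameter value θ to one concrete distribution in the set F.

```python
# A parametric model F = { p(x | theta) : theta in Theta }, viewed as a
# map from a parameter value to one concrete distribution in the set F.
from scipy import stats

def bernoulli_model(p):
    """Member of F = { Bernoulli(p) : p in [0, 1] }."""
    return stats.bernoulli(p)

def normal_model(mu, sigma2):
    """Member of F = { N(mu, sigma^2) : mu real, sigma^2 > 0 }."""
    return stats.norm(loc=mu, scale=sigma2 ** 0.5)

coin = bernoulli_model(0.7)   # picking theta picks one member of F
print(coin.pmf(1))            # P(X = 1 | p = 0.7) -> 0.7
```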
Parametric Models: Coin Flip
Ex) Probability of heads p of a coin
Bernoulli distribution
Parameter: p ∈ [0, 1]
Parametric model: F = { Bernoulli(p) : p ∈ [0, 1] }
Parametric Models: Gene Expression
Ex) Learn the probability distribution of the expression levels of gene A
Normal distribution
Parameters: µ ∈ ℝ, σ² > 0
Parametric model: F = { N(µ, σ²) : µ ∈ ℝ, σ² > 0 }
Which Parametric Model?
• Discrete or continuous?
• Univariate or multivariate features?
• Prior knowledge
ex) Count data in intervals: Poisson distribution
ex) real-valued, bell-shaped data: normal distribution
• Evidence from exploratory analysis of data
ex) relationship between the mean and variance: Poisson vs. negative binomial distribution (see the sketch below)
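For the last point, a minimal sketch with made-up count data: under a Poisson model the variance equals the mean, so a clearly larger sample variance (overdispersion) is evidence for a negative binomial model instead.

```python
import numpy as np

# Hypothetical count data, e.g. events per interval.
counts = np.array([3, 0, 2, 5, 1, 9, 0, 4, 12, 2])

mean, var = counts.mean(), counts.var()   # numpy's var divides by n
print(f"mean = {mean:.2f}, variance = {var:.2f}")

# A Poisson variable has variance equal to its mean; a much larger
# sample variance (overdispersion) favors a negative binomial model.
if var > 1.5 * mean:   # 1.5 is an arbitrary illustrative threshold
    print("overdispersed: consider the negative binomial")
else:
    print("variance close to mean: Poisson may be adequate")
```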
Overview
• Our recipe for learning
– Step 0: form a scientific hypothesis
– Step 1: collect data (samples × features)
– Step 2: pick the appropriate probability distribution, with parameter θ
– Step 3: estimate the parameter θ
Maximum Likelihood Estimation
• The most commonly used method for parametric estimation
• MLE for probability distributions we have seen so far
• MLE for complex state-of-the-art probability models
Parametric Models Meet Data
• Ingredient 1: a choice of parametric model F for a random variable X ∼ F
• Ingredient 2: Collect data D for n samples. Model each sample as a random variable,
X1,...,Xn
• Recipe: Maximum likelihood estimation (MLE)
– Pick a member in the set F that maximizes the “likelihood of data D”
Illustration for MLE
Formally, Parametric Estimation, Point Estimation
• Select f ∈ F that best describes data
• Select parameter 𝜃 that best describes data
• Inference: estimate 𝜃 given data
• Maximum likelihood estimate: the parameter 𝜃 estimated from data using MLE
MLE as an Optimization Problem
• How can we select parameter 𝜃 that best describes data?
• Maximum likelihood estimation as an optimization
– Likelihood: a score function for scoring how well a candidate parameter θ describes the data
– Maximize the likelihood: find the parameter 𝜃 that maximizes the score function
“Likelihood of Data”
• Assume X1,...,Xn are i.i.d. random variables, representing samples from a distribution
P(X|𝜃) in F. Then, the “likelihood of data” is defined as the probability of data D
$P(D \mid \theta) = P(X_1, \ldots, X_n \mid \theta) = \prod_{i=1}^{n} P(X_i \mid \theta)$ (by independence)
“Likelihood of Data”
• Assume X1,...,Xn are i.i.d. random variables, representing samples from a distribution
P(X|𝜃) in F.
• The likelihood function, also called the likelihood of data, is given as
𝐿"(𝜃) = ∏#$%
"
P(Xi |𝜃)
• A function of the parameter θ given the data X1,...,Xn
“Log Likelihood of Data”
• The log likelihood function is
𝑙"(𝜃) = log 𝐿"(𝜃)
=
• Why log likelihood instead of likelihood? The log turns the product into a sum, which is easier to differentiate and numerically far more stable.
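A minimal sketch of the numerical reason: multiplying many probabilities underflows double precision, while summing their logs does not.

```python
import numpy as np

# n i.i.d. observations, each with probability 0.5 under the model.
n = 2000
probs = np.full(n, 0.5)

likelihood = np.prod(probs)              # 0.5**2000 underflows to 0.0
log_likelihood = np.sum(np.log(probs))   # fine: 2000 * log(0.5)

print(likelihood)       # 0.0 -- the product underflowed
print(log_likelihood)   # about -1386.29
```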
MLE as an Optimization Problem
• How can we select parameter 𝜃 that best describes data?
• Maximum likelihood estimation as an optimization
– Likelihood: a score function for scoring how well a candidate parameter θ describes the data
– Maximize the likelihood: find the parameter 𝜽 that maximizes the score function
Maximum Likelihood Estimation
• Maximum likelihood estimator
$\hat{\theta}_n = \arg\max_{\theta} L_n(\theta) = \arg\max_{\theta} l_n(\theta)$
log is a monotonically increasing (and concave) function, so maximizing $l_n(\theta)$ yields the same maximizer as maximizing $L_n(\theta)$
MLE as an Optimization Problem
• How can we select parameter 𝜃 that best describes data?
• Maximum likelihood estimation as an optimization
– Likelihood: a score function for scoring how well a candidate parameter θ describes the data
– Maximize the likelihood: find the parameter 𝜽 that maximizes the score function
Illustration for MLE
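A minimal sketch of such an illustration, using the coin-flip data from the next slide coded as 1 = H, 0 = T: plot the likelihood over a grid of candidate values of p and mark where it peaks.

```python
import numpy as np
import matplotlib.pyplot as plt

# Coin-flip data H,H,T,T,H,T,T,H,T,H coded as 1/0 (5 heads out of 10).
x = np.array([1, 1, 0, 0, 1, 0, 0, 1, 0, 1])

p_grid = np.linspace(0.001, 0.999, 500)   # candidate parameter values
L = np.array([np.prod(p ** x * (1 - p) ** (1 - x)) for p in p_grid])

p_hat = p_grid[np.argmax(L)]              # grid approximation to the MLE
plt.plot(p_grid, L)
plt.axvline(p_hat, linestyle="--")
plt.xlabel("p")
plt.ylabel("L_n(p)")
plt.title(f"Likelihood of the data; maximum near p = {p_hat:.2f}")
plt.show()
```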
MLE, Examples
• Let’s perform MLE
– Bernoulli distribution
• You have a coin with an unknown probability of heads, p
• You flip the coin 10 times and get H, H, T, T, H, T, T, H, T, H
• What is your estimate of the probability of heads p?
– Normal distribution
• You have a gene, gene A, whose mean expression level and variance are unknown
• You collect expression measurements of gene A for 7 individuals
• How would you estimate the mean and variance?
MLE: Bernoulli
• Let X1,...,Xn ∼ Bernoulli(p). Find the maximum likelihood estimate of p.
• Step 1: Write down the log likelihood of data (coding H as Xi = 1, T as Xi = 0)
$L_n(p) = \prod_{i=1}^{n} p^{X_i} (1-p)^{1-X_i}$
$l_n(p) = \sum_{i=1}^{n} \left[ X_i \log p + (1 - X_i) \log(1 - p) \right]$
MLE: Bernoulli Distribution
• Step 2: Maximize the log likelihood
$\hat{p} = \arg\max_{p} l_n(p)$: setting $\frac{\partial l_n(p)}{\partial p} = \frac{\sum_{i} X_i}{p} - \frac{n - \sum_{i} X_i}{1-p} = 0$ gives $\hat{p} = \frac{1}{n} \sum_{i=1}^{n} X_i$, the sample proportion of heads
MLE: Bernoulli Distribution with Observed Data
• Let X ∼ Bernoulli(p).
• Flipped the coin 10 times and got
H, H, T, T, T, H, H, T, T, H
• Step 1: Write down the log likelihood of data
$L(p; D) = p^5 (1-p)^5$ (5 heads and 5 tails)
$l(p; D) = 5 \log p + 5 \log(1-p)$
MLE: Bernoulli Distribution
• Step 2: Maximize the log likelihood
$\hat{p} = \arg\max_{p} l(p; D) = \frac{5}{10} = 0.5$
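A minimal sketch of this computation, coding H as 1 and T as 0:

```python
import numpy as np

flips = "HHTTTHHTTH"                       # the observed data D
x = np.array([1 if c == "H" else 0 for c in flips])

p_hat = x.mean()                           # MLE: proportion of heads
print(p_hat)                               # 0.5

# Sanity check: grid search over the log likelihood l(p; D).
p_grid = np.linspace(0.01, 0.99, 99)
ll = x.sum() * np.log(p_grid) + (len(x) - x.sum()) * np.log(1 - p_grid)
print(p_grid[np.argmax(ll)])               # also ~0.5
```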
MLE: Normal Distribution
• Let X1,...,Xn ∼ N(µ, σ²). Find the maximum likelihood estimates of µ and σ².
• Step 1: Write down the log likelihood of data
$L_n(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(X_i - \mu)^2}{2\sigma^2} \right)$
$l_n(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (X_i - \mu)^2$
MLE: Normal Distribution
• Step 2: Maximize the log likelihood
$\arg\max_{\mu, \sigma^2} l_n(\mu, \sigma^2)$
MLE for µ: setting $\frac{\partial l_n}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \mu) = 0$ gives $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i$, the sample mean
MLE: Normal Distribution
• Step 2: Maximize the log likelihood
$\arg\max_{\mu, \sigma^2} l_n(\mu, \sigma^2)$
MLE for σ²: setting $\frac{\partial l_n}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (X_i - \mu)^2 = 0$ gives $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{\mu})^2$ (note the division by n, not n − 1)
MLE: Normal Distribution with Observations
• Let X ∼ N(µ, σ²). Find the maximum likelihood estimates of µ and σ².
• Data with 10 samples
10, 12, 9, 14, 8, 11, 7, 6, 10.5, 12.5
• Step 1: Write down the log likelihood of data
𝐿"(µ, σ2) =
𝑙"(µ, σ2) =
MLE: Poisson Distribution
• Let X1,...,Xn ∼ Poisson(λ). Find the maximum likelihood estimate of λ.
• Step 1: Write down the log likelihood of data
$L_n(\lambda) = \prod_{i=1}^{n} \frac{\lambda^{X_i} e^{-\lambda}}{X_i!}$
$l_n(\lambda) = \left( \sum_{i=1}^{n} X_i \right) \log \lambda - n\lambda - \sum_{i=1}^{n} \log(X_i!)$
MLE: Poisson Distribution
• Step 2: Maximize the log likelihood
$\arg\max_{\lambda} l_n(\lambda)$: setting $\frac{\partial l_n}{\partial \lambda} = \frac{\sum_i X_i}{\lambda} - n = 0$ gives $\hat{\lambda} = \frac{1}{n} \sum_{i=1}^{n} X_i$, the sample mean
Markov Model
• Joint distribution of all binary random variables X1, . . . , XT
$P(X_1, \ldots, X_T) = P(X_1) \prod_{k=2}^{T} P(X_k \mid X_{k-1})$
• A Markov model is defined by
– P(X1): the initial distribution
– P(Xk | Xk−1): the transition probabilities, identical for k = 2, …, T
MLE: Markov Chain
• Let X1,...,XT follow a Markov chain. Find a maximum likelihood estimate of the Markov chain parameters θ.
• Assume three sequences of observations D
1, 0, 0, 1, 1
1, 0, 1, 1, 0
0, 1, 0, 0, 1
MLE: Markov Chain, Intuition
• Assume three sequences of observations D
1, 0, 0, 1, 1
1, 0, 1, 1, 0
0, 1, 0, 0, 1
• What is your estimate of P(X1)? The first observations across the three sequences are 1, 1, 0, so intuitively P̂(X1 = 1) = 2/3.
MLE: Markov Chain, Intuition
• Assume three sequences of observations D
1, 0, 0, 1, 1
1, 0, 1, 1, 0
0, 1, 0, 0, 1
• What is your estimate of P(Xk | Xk−1 = 0)? Across the sequences, transitions out of 0 are 0→0 twice and 0→1 four times, so intuitively P̂(Xk = 1 | Xk−1 = 0) = 4/6 = 2/3.
MLE: Markov Chain, Intuition
• Assume three sequences of observations D
1, 0, 0, 1, 1
1, 0, 1, 1, 0
0, 1, 0, 0, 1
• What is your estimate of P(Xk | Xk−1 = 1)? Transitions out of 1 are 1→0 four times and 1→1 twice, so intuitively P̂(Xk = 1 | Xk−1 = 1) = 2/6 = 1/3.
MLE: Markov Chain
• Let X1,...,XT follow a Markov chain. Find a maximum likelihood estimate of the Markov chain parameters θ.
• Assume three sequences of observations D
$X_1^{(1)}, \ldots, X_T^{(1)}$
$X_1^{(2)}, \ldots, X_T^{(2)}$
$X_1^{(3)}, \ldots, X_T^{(3)}$
MLE: Markov Chain
• Step 1: Write down the log likelihood of data
𝐿"(𝜃) =
𝑙"(𝜃) =
MLE: Markov Chain
• Step 2: Maximize the log likelihood (initial probabilities)
$\arg\max_{\theta} l(\theta)$: the maximizing initial distribution is the fraction of sequences starting in each state, e.g. $\hat{P}(X_1 = 1) = 2/3$ for the data above
MLE: Markov Chain
• Step 2: Maximize the log likelihood (transition probabilities)
$\arg\max_{\theta} l(\theta)$: the maximizing transition probabilities are normalized transition counts, $\hat{P}(X_k = j \mid X_{k-1} = i) = \dfrac{\#\{i \to j \text{ transitions}\}}{\#\{\text{transitions out of } i\}}$ (see the sketch below)
MLE in Practice
• Follow the recipe and work out MLE on paper, given probability model
• Write a program to compute a maximum likelihood estimate of parameters from
data
– Load data into memory
– For the normal distribution, compute μ̂ = sample mean and σ̂² = mean squared deviation from μ̂ (dividing by n, not n − 1)
– For the Bernoulli distribution, compute p̂ = proportion of successes
• Simple but important learning principle!
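A minimal sketch of such a program; the input file names are hypothetical placeholders:

```python
import numpy as np

# Hypothetical input files (substitute your own data):
#   expression.txt -- one real-valued measurement per line (normal model)
#   flips.txt      -- one 0/1 outcome per line (Bernoulli model)

expr = np.loadtxt("expression.txt")          # load data into memory
mu_hat = expr.mean()                         # normal MLE: sample mean
sigma2_hat = np.mean((expr - mu_hat) ** 2)   # normal MLE: 1/n variance

flips = np.loadtxt("flips.txt")
p_hat = flips.mean()                         # Bernoulli MLE: success proportion

print(f"mu = {mu_hat:.3f}, sigma^2 = {sigma2_hat:.3f}, p = {p_hat:.3f}")
```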
Maximum Likelihood Estimation as an Optimization Problem
• To find an MLE of the parameter 𝜃, we solve the following optimization problem
$\hat{\theta}_n = \arg\max_{\theta} l_n(\theta)$
• In our examples for the Bernoulli and univariate/multivariate normal distributions
– $l_n(\theta)$ is concave: a single global maximum
– a closed-form solution exists
Maximum Likelihood Estimation as an Optimization Problem
• For some models, MLE is easy
• In general, for more complex probability models, performing MLE is not always easy
– $l_n(\theta)$ can be non-concave, with multiple local maxima
– a closed-form solution may not exist, so we need to rely on iterative optimization methods (see the sketch below)
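A minimal sketch of the iterative route, assuming scipy: minimize the negative log likelihood of a gamma model, whose shape parameter has no closed-form MLE, with a generic optimizer.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

# Simulated data from Gamma(shape = 2.5, scale = 1.0); the shape
# parameter of the gamma distribution has no closed-form MLE.
x = gamma.rvs(a=2.5, scale=1.0, size=500, random_state=0)

def neg_log_likelihood(params):
    a, scale = params
    if a <= 0 or scale <= 0:               # stay inside the parameter space
        return np.inf
    return -np.sum(gamma.logpdf(x, a=a, scale=scale))

# Iterative optimization from a rough starting point.
result = minimize(neg_log_likelihood, x0=[1.0, 1.0], method="Nelder-Mead")
print(result.x)   # (shape, scale) estimates, near (2.5, 1.0)
```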
Other Probability Models and MLE: Neural Networks
• Deep neural nets for modeling P(Y|X)
• Optimization criterion for learning the model: MLE!
• No closed-form solution for the parameter estimates:
– Rely on iterative optimization methods (see the sketch below)
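A minimal sketch of the connection, using logistic regression (a one-layer model of P(Y|X), not the slides' deep net) trained by gradient ascent on the log likelihood, which is exactly MLE carried out by an iterative optimizer:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # features
w_true = np.array([1.5, -2.0])
p_true = 1.0 / (1.0 + np.exp(-X @ w_true))
y = (rng.random(200) < p_true).astype(float)  # labels drawn from P(Y = 1 | X)

w = np.zeros(2)
lr = 0.5
for _ in range(1000):                         # iterative optimization
    p = 1.0 / (1.0 + np.exp(-X @ w))          # model's P(Y = 1 | X)
    grad = X.T @ (y - p) / len(y)             # gradient of the log likelihood
    w += lr * grad                            # gradient ascent step

print(w)   # iterates toward the MLE; close to w_true with enough data
```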
Summary
• Maximum likelihood estimation is the most commonly used technique for learning parametric probability models from data
• MLE, recipe
– Write down the log likelihood
– Differentiate the log likelihood with respect to the parameters
– Set the above to zero and solve for the parameters
• MLE for
– Bernoulli distribution
– Univariate/multivariate normal distributions
– Poisson distribution
– Markov chains
