2. Overview
• Our recipe for learning
– Step 0: Form a scientific hypothesis
– Step 1: Collect data (samples × features)
– Step 2: Pick the appropriate probability distribution, with parameter 𝜃
– Step 3: Estimate the parameter 𝜃
3. Coin Flip Example
• Our recipe for learning
– Step 0: I got a new coin. What is the probability of heads?
– Step 1: Collect data (samples: coin flips; one feature: the outcome)
– Step 2: Pick the appropriate probability distribution, with parameter 𝜃
– Step 3: Estimate the parameter 𝜃
4. Weight Example
• Our recipe for learning
– Step 0: What is the distribution of weights of people in Pittsburgh?
– Step 1: Collect data (samples: people; one feature: weight)
– Step 2: Pick the appropriate probability distribution, with parameter 𝜃
– Step 3: Estimate the parameter 𝜃
6. Statistical Inference, Learning: a Formal Set Up
• Given a sample X1,...,Xn ∼ F, how do we learn our probability model?
– Statistical model F: a set of distributions
– Parametric model: F that can be parameterized by a finite number of parameters
F = { p(x | 𝜃) : 𝜃 ∈ Θ },
where 𝜃 = (𝜃1, ..., 𝜃k) is the unknown parameter vector and Θ is the parameter space.
7. Parametric Models: Coin Flip
Ex) Probability of heads p of a coin
Bernoulli distribution
Parameter: p ∈ [0, 1]
Parametric model: F = { p^x (1 − p)^(1−x) : p ∈ [0, 1] }, for x ∈ {0, 1}
8. Parametric Models: Gene Expression
Ex) Learn the probability distribution of the expression levels of gene A
Normal distribution
Parameters: µ ∈ ℝ and σ² > 0
Parametric model: F = { N(x; µ, σ²) : µ ∈ ℝ, σ² > 0 }
9. Which Parametric Model?
• Discrete or continuous?
• Univariate or multivariate features?
• Prior knowledge
ex) Count data in intervals: Poisson distribution
ex) real-valued, bell-shaped data: normal distribution
• Evidence from exploratory analysis of data
ex) relationship between sample mean and variance: Poisson vs. negative binomial distribution (see the sketch below)
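A minimal sketch of that last check in Python, with made-up counts (the array values and the 1.5× cutoff are illustrative assumptions, not a standard rule):

    import numpy as np

    # For Poisson data, the sample variance should be close to the sample mean;
    # variance well above the mean (overdispersion) suggests a negative binomial.
    counts = np.array([3, 0, 2, 5, 1, 4, 2, 3])  # hypothetical interval counts

    mean = counts.mean()
    var = counts.var(ddof=1)

    print(f"sample mean = {mean:.2f}, sample variance = {var:.2f}")
    if var > 1.5 * mean:  # rough illustrative threshold
        print("overdispersed: consider a negative binomial model")
    else:
        print("variance close to mean: Poisson may be adequate")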
10. Overview
• Our recipe for learning
– Step 0: Form a scientific hypothesis
– Step 1: Collect data (samples × features)
– Step 2: Pick the appropriate probability distribution, with parameter 𝜃
– Step 3: Estimate the parameter 𝜃
11. Maximum Likelihood Estimation
• The most commonly used method for parametric estimation
• MLE for probability distributions we have seen so far
• MLE for complex state-of-the-art probability models
12. Parametric Models Meet Data
• Ingredient 1: our choice of parametric model X ∼ F, for a random variable X
• Ingredient 2: collected data D of n samples, each modeled as a random variable: X1,...,Xn
• Recipe: Maximum likelihood estimation (MLE)
– Pick a member in the set F that maximizes the “likelihood of data D”
14. Formally: Parametric Estimation (Point Estimation)
• Select f ∈ F that best describes data
• Select parameter 𝜃 that best describes data
• Inference: estimate 𝜃 given data
• Maximum likelihood estimate: the parameter 𝜃 estimated from data using MLE
15. MLE as an Optimization Problem
• How can we select parameter 𝜃 that best describes data?
• Maximum likelihood estimation as an optimization
– Likelihood: Score function, for scoring how well a candidate parameter 𝜃 describes data
– Maximize the likelihood: find the parameter 𝜃 that maximizes the score function
16. “Likelihood of Data”
• Assume X1,...,Xn are i.i.d. random variables, representing samples from a distribution
P(X|𝜃) in F. Then, the “likelihood of data” is defined as the probability of data D
P(D | 𝜃) = P(X1, ..., Xn | 𝜃) = ∏ᵢ₌₁ⁿ P(Xi | 𝜃)
17. “Likelihood of Data”
• Assume X1,...,Xn are i.i.d. random variables, representing samples from a distribution
P(X|𝜃) in F.
• The likelihood function, also called the likelihood of data, is given as
𝐿"(𝜃) = ∏#$%
"
P(Xi |𝜃)
• Function of parameter 𝜃 given data X1,...,Xn
18. “Log Likelihood of Data”
• The log likelihood function is
𝑙"(𝜃) = log 𝐿"(𝜃)
=
• Why log likelihood instead of likelihood?
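A two-line illustration of the underflow point (the 2,000 per-sample likelihood values are randomly generated for the demo):

    import numpy as np

    # The product of many probabilities underflows to 0.0 in floating point,
    # while the sum of their logs stays finite and stable.
    rng = np.random.default_rng(0)
    probs = rng.uniform(0.01, 0.5, size=2000)  # hypothetical per-sample likelihoods

    print(np.prod(probs))         # 0.0 -- underflow
    print(np.sum(np.log(probs)))  # a perfectly finite log likelihood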
19. MLE as an Optimization Problem
• How can we select parameter 𝜃 that best describes data?
• Maximum likelihood estimation as an optimization
– Likelihood: Score function, for scoring how well a candidate parameter 𝜃 describes data
– Maximize the likelihood: find the parameter 𝜽 that maximizes the score function
20. Maximum Likelihood Estimation
• Maximum likelihood estimator
𝜃̂n = argmax_𝜃 𝐿n(𝜃) = argmax_𝜃 𝑙n(𝜃)
log is a monotonically increasing concave function
21. MLE as an Optimization Problem
• How can we select parameter 𝜃 that best describes data?
• Maximum likelihood estimation as an optimization
– Likelihood: Score function, for scoring how well a candidate parameter 𝜃 describes data
– Maximize the likelihood: find the parameter 𝜽 that maximizes the score function
23. MLE, Examples
• Let’s perform MLE
– Bernoulli distribution
• You have a coin with unknown probability of heads, p
• You flip the coin 10 times and get H, H, T, T, H, T, T, H, T, H
• What is your estimate of the probability of heads, p?
– Normal distribution
• You have a gene, gene A, whose expression mean and variance are unknown
• You collect expression measurements of gene A for 7 individuals
• How would you estimate the mean and variance?
24. MLE: Bernoulli
• Let X1,...,Xn ∼ Bernoulli(p). Find the maximum likelihood estimate of p.
• Step 1: Write down the log likelihood of data
𝐿n(p) = ∏ᵢ₌₁ⁿ p^Xi (1 − p)^(1−Xi)
𝑙n(p) = ∑ᵢ₌₁ⁿ [ Xi log p + (1 − Xi) log(1 − p) ]
26. MLE: Bernoulli Distribution with Observed Data
• Let X1,...,X10 ∼ Bernoulli(p).
• You flip the coin 10 times and get
H, H, T, T, T, H, H, T, T, H
• Step 1: Write down the log likelihood of data
𝐿(p; D) = p⁵ (1 − p)⁵ (5 heads, 5 tails)
𝑙(p; D) = 5 log p + 5 log(1 − p)
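Setting the derivative to zero recovers the intuitive answer:

    \frac{d\,l(p; D)}{dp} = \frac{5}{p} - \frac{5}{1 - p} = 0
    \quad\Longrightarrow\quad
    \hat{p} = \frac{5}{10} = 0.5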
28. MLE: Normal Distribution
• Let X1,...,Xn ∼ N(µ, σ²). Find the maximum likelihood estimates of µ and σ².
• Step 1: Write down the log likelihood of data
𝐿n(µ, σ²) = ∏ᵢ₌₁ⁿ (1/√(2πσ²)) exp( −(Xi − µ)² / (2σ²) )
𝑙n(µ, σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) ∑ᵢ₌₁ⁿ (Xi − µ)²
31. MLE: Normal Distribution with Observations
• Let X1,...,X10 ∼ N(µ, σ²). Find the maximum likelihood estimates of µ and σ².
• Data with 10 samples
10, 12, 9, 14, 8, 11, 7, 6, 10.5, 12.5
• Step 1: Write down the log likelihood of data
𝐿"(µ, σ2) =
𝑙"(µ, σ2) =
32. MLE: Poisson Distribution
• Let X1,...,Xn ∼ Poisson(𝜆). Find the maximum likelihood estimate of 𝜆.
• Step 1: Write down the log likelihood of data
𝐿n(𝜆) = ∏ᵢ₌₁ⁿ e^(−𝜆) 𝜆^Xi / Xi!
𝑙n(𝜆) = −n𝜆 + (∑ᵢ₌₁ⁿ Xi) log 𝜆 − ∑ᵢ₌₁ⁿ log(Xi!)
34. Markov Model
• Joint distribution of binary random variables X1, ..., XT: P(X1, ..., XT)
• A Markov model is defined by
– P(X1): initial distribution
– P(Xk | Xk−1): transition probabilities, identical for k = 2, …, T
• The joint then factorizes as P(X1, ..., XT) = P(X1) ∏ₖ₌₂ᵀ P(Xk | Xk−1)
35. MLE: Markov Chain
• Let X1,...,XT follow a Markov chain. Find a maximum likelihood estimate of the Markov chain parameters 𝜃.
• Assume three sequences of observations D
1, 0, 0, 1, 1
1, 0, 1, 1, 0
0, 1, 0, 0, 1
36. MLE: Markov Chain, Intuition
• Assume three sequences of observations D
1, 0, 0, 1, 1
1, 0, 1, 1, 0
0, 1, 0, 0, 1
• What is your estimate of P(X1)?
37. MLE: Markov Chain, Intuition
• Assume three sequences of observations D
1, 0, 0, 1, 1
1, 0, 1, 1, 0
0, 1, 0, 0, 1
• What is your estimate of P(Xk| Xk-1=0)?
38. MLE: Markov Chain, Intuition
• Assume three sequences of observations D
1, 0, 0, 1, 1
1, 0, 1, 1, 0
0, 1, 0, 0, 1
• What is your estimate of P(Xk| Xk-1=1)?
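Working the three intuition questions out by counting (the MLE derived on the next slides formalizes exactly this counting):
– P̂(X1 = 1) = 2/3: the three sequences start with 1, 1, 0.
– P̂(Xk = 1 | Xk−1 = 0) = 4/6 = 2/3: of the 6 transitions leaving state 0 across the three sequences, 4 land in state 1.
– P̂(Xk = 1 | Xk−1 = 1) = 2/6 = 1/3: of the 6 transitions leaving state 1, only 2 land in state 1.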
39. MLE: Markov Chain
• Let X1,...,XT follow a Markov chain. Find a maximum likelihood estimate of the Markov chain parameters 𝜃.
• Assume three sequences of observations D
X1⁽¹⁾, ..., XT⁽¹⁾
X1⁽²⁾, ..., XT⁽²⁾
X1⁽³⁾, ..., XT⁽³⁾
40. MLE: Markov Chain
• Step 1: Write down the log likelihood of data
𝐿"(𝜃) =
𝑙"(𝜃) =
43. MLE in Practice
• Follow the recipe and work out the MLE on paper, given a probability model
• Write a program to compute the maximum likelihood estimate of the parameters from data (see the sketch below)
– Load the data into memory
– For a normal distribution, compute µ̂ = sample mean, σ̂² = sample variance (dividing by n)
– For a Bernoulli distribution, compute p̂ = proportion of successes
• Simple but important learning principle!
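A minimal sketch of such a program, assuming one number per line in the files expression.txt and flips.txt (both file names are hypothetical):

    import numpy as np

    # Normal model: the MLE is the sample mean and the (1/n) sample variance.
    x = np.loadtxt("expression.txt")  # hypothetical file of real-valued measurements
    mu_hat = x.mean()
    sigma2_hat = x.var(ddof=0)        # MLE divides by n, not n - 1

    # Bernoulli model: the MLE is the proportion of successes.
    flips = np.loadtxt("flips.txt")   # hypothetical file of 0/1 outcomes
    p_hat = flips.mean()

    print(f"normal:    mu = {mu_hat:.3f}, sigma^2 = {sigma2_hat:.3f}")
    print(f"bernoulli: p = {p_hat:.3f}")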
44. Maximum Likelihood Estimation as an Optimization Problem
• To find an MLE of the parameter 𝜃, we solve the following optimization problem
𝜃̂n = argmax_𝜃 𝑙n(𝜃)
• In our examples for the Bernoulli and univariate/multivariate normal distributions
– 𝑙n(𝜃) has a single global maximum (no spurious local optima)
– a closed-form solution exists
45. Maximum Likelihood Estimation as an Optimization Problem
• For some models, MLE is easy
• For more complex probability models, however, performing MLE is not easy
– 𝑙n(𝜃) is non-concave, with multiple local maxima
– a closed-form solution does not exist; we must rely on iterative optimization methods
46. Other Probability Models and MLE: Neural Networks
• Deep neural nets for modeling P(Y|X)
• Optimization criterion for learning the model: MLE!
• No closed-form solution for the parameter estimates:
– rely on iterative optimization methods (see the sketch below)
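To make the iterative idea concrete, here is a toy stand-in for the deep nets on this slide: logistic regression, about the simplest parametric model of P(Y|X), fit by gradient ascent on the log likelihood (the data is synthetic, and the learning rate and step count are arbitrary choices for the demo):

    import numpy as np

    # Synthetic binary classification data.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    noise = rng.normal(scale=1.0, size=200)
    y = (X @ np.array([1.5, -2.0]) + 0.5 + noise > 0).astype(float)

    # Gradient ascent on the log likelihood
    #   l(w, b) = sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ],
    # where p_i = sigmoid(w . x_i + b) models P(Y = 1 | X = x_i).
    w, b, lr = np.zeros(2), 0.0, 0.5
    for _ in range(1000):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w += lr * X.T @ (y - p) / len(y)  # gradient of l w.r.t. w, averaged
        b += lr * (y - p).mean()          # gradient of l w.r.t. b, averaged

    print("estimated weights:", w, "bias:", b)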
47. Summary
• Maximum likelihood estimation is the most commonly used technique for learning a parametric probability model from data
• The MLE recipe
– Write down the log likelihood
– Differentiate the log likelihood with respect to the parameters
– Set the above to zero and solve for the parameters
• MLE for
– Bernoulli distribution
– Univariate/multivariate normal distributions
– Poisson distribution
– Markov chains