The document discusses maximum likelihood estimation (MLE). MLE finds the parameter values of a statistical model that make the observed data most probable. For a Gaussian distribution, the MLE of the mean µ is the average of the observed values, and the MLE of the standard deviation σ is the square root of the average squared deviation from that mean; for a multinomial distribution, the MLE of each cell probability is its observed proportion. The Lagrange multiplier method is introduced to find maxima or minima of functions subject to constraints.
1. Maximum Likelihood Estimation
INFO-2301: Quantitative Reasoning 2
Michael Paul and Jordan Boyd-Graber
MARCH 7, 2017
2. Why MLE?
• Before: Distribution + Parameter → x
• Now: x + Distribution → Parameter
• (Much more realistic)
• But: says nothing about how well the distribution fits the data
3. Likelihood
• Likelihood is p(x;θ)
• We want the estimate of θ that best explains the data we have seen
• I.e., Maximum Likelihood Estimate (MLE)
4. Likelihood
• The likelihood function refers to the PMF (discrete) or PDF (continuous).
• For discrete distributions, the likelihood of x is P(X = x).
• For continuous distributions, the likelihood of x is the density f(x).
• We will often refer to likelihood rather than probability/mass/density so that the term applies to either scenario.
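To make the two cases concrete, here is a minimal sketch (assuming SciPy is available; the slides include no code and the example numbers are hypothetical) that evaluates the likelihood of one observation under a discrete and a continuous distribution:

```python
from scipy import stats

# Discrete case: the likelihood of x is the probability mass P(X = x).
# Example: x = 3 successes out of n = 10 trials with parameter 0.5.
discrete_likelihood = stats.binom.pmf(3, n=10, p=0.5)

# Continuous case: the likelihood of x is the density f(x), which is
# not itself a probability (it can even exceed 1).
continuous_likelihood = stats.norm.pdf(1.2, loc=0.0, scale=1.0)

print(discrete_likelihood, continuous_likelihood)
```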
6. Optimizing Unconstrained Functions
Suppose we wanted to optimize
f(x) = −x² − 2x + 2    (1)
∂f/∂x = −2x − 2    (2)
[Figure: plot of f(x) = −x² − 2x + 2 for x ∈ [−5, 5]]
7. Optimizing Unconstrained Functions
∂f/∂x = 0    (3)
−2x − 2 = 0    (4)
x = −1    (5)
(Should also check that the second derivative is negative: here ∂²f/∂x² = −2 < 0, so x = −1 is indeed a maximum.)
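As a numerical sanity check, a short sketch (assuming SciPy; illustrative only) recovers the same critical point:

```python
from scipy.optimize import minimize_scalar

# f(x) = -x^2 - 2x + 2; SciPy minimizes, so negate f to find its maximum.
f = lambda x: -x**2 - 2*x + 2
result = minimize_scalar(lambda x: -f(x))
print(result.x)  # ≈ -1.0, matching the analytic solution
```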
8. Optimizing Constrained Functions
Theorem: Lagrange Multiplier Method
Given functions f(x1, ..., xn) and g(x1, ..., xn), the critical points of f restricted to the set g = 0 are solutions to the equations:
∂f/∂xi (x1, ..., xn) = λ ∂g/∂xi (x1, ..., xn)    ∀i
g(x1, ..., xn) = 0
This is n + 1 equations in the n + 1 variables x1, ..., xn, λ.
12. Lagrange Example
Maximize f(x, y) = √(xy) subject to the constraint 20x + 10y = 200.
• Compute derivatives:
∂f/∂x = (1/2)√(y/x)    ∂g/∂x = 20
∂f/∂y = (1/2)√(x/y)    ∂g/∂y = 10
• Create the new system of equations:
(1/2)√(y/x) = 20λ
(1/2)√(x/y) = 10λ
20x + 10y = 200
13. Lagrange Example
• Dividing the first equation by the second cancels λ and gives us
y/x = 2    (6)
• which means y = 2x; plugging this into the constraint equation gives:
20x + 10(2x) = 200
x = 5 ⇒ y = 10
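The same answer falls out numerically; here is a small sketch (assuming scipy.optimize, whereas the slides stay analytic) that solves the constrained problem directly:

```python
import numpy as np
from scipy.optimize import minimize

# Maximize f(x, y) = sqrt(x * y) subject to 20x + 10y = 200.
# SciPy minimizes, so we minimize -f.
objective = lambda v: -np.sqrt(v[0] * v[1])
constraint = {"type": "eq", "fun": lambda v: 20 * v[0] + 10 * v[1] - 200}

result = minimize(objective, x0=[1.0, 1.0], constraints=[constraint])
print(result.x)  # ≈ [5., 10.], matching the analytic solution
```

With an equality constraint supplied, SciPy falls back to a sequential quadratic programming method, which handles this small problem easily.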
14. Maximum Likelihood Estimation
INFO-2301: Quantitative Reasoning 2
Michael Paul and Jordan Boyd-Graber
MARCH 7, 2017
16. Continuous Distribution: Gaussian
• Recall the density function
f(x) = 1/√(2πσ²) · exp(−(x − µ)²/(2σ²))    (1)
• Taking the log makes the math easier and doesn't change the answer (log is monotonic)
• If we observe x1, ..., xN, then the log likelihood is
ℓ(µ, σ) ≡ −N log σ − (N/2) log(2π) − (1/(2σ²)) ∑i (xi − µ)²    (2)
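Equation (2) translates directly into code; here is a minimal sketch (assuming NumPy; hypothetical data) that evaluates the Gaussian log likelihood for a sample:

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma):
    """Log likelihood of samples x under N(mu, sigma^2), per equation (2)."""
    n = len(x)
    return (-n * np.log(sigma)
            - n / 2 * np.log(2 * np.pi)
            - np.sum((x - mu) ** 2) / (2 * sigma ** 2))

x = np.array([1.1, 0.3, -0.4, 2.0])
print(gaussian_log_likelihood(x, mu=x.mean(), sigma=x.std()))
```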
24. MLE of Gaussian µ
ℓ(µ, σ) = −N log σ − (N/2) log(2π) − (1/(2σ²)) ∑i (xi − µ)²    (3)
∂ℓ/∂µ = 0 + (1/σ²) ∑i (xi − µ)    (4)
Solve for µ by setting the derivative to zero:
0 = (1/σ²) ∑i (xi − µ)    (5)
0 = ∑i xi − Nµ    (6)
µ = (∑i xi) / N    (7)
Consistent with what we said before: the MLE of µ is the sample mean.
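A quick numerical check (assuming NumPy; simulated data) shows the sample mean scoring higher than nearby values of µ:

```python
import numpy as np

def log_likelihood(x, mu, sigma):
    # Gaussian log likelihood, equation (3).
    n = len(x)
    return (-n * np.log(sigma) - n / 2 * np.log(2 * np.pi)
            - np.sum((x - mu) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1000)

mu_hat = x.sum() / len(x)  # equation (7): the sample mean
for mu in (mu_hat - 0.5, mu_hat, mu_hat + 0.5):
    print(mu, log_likelihood(x, mu, sigma=2.0))  # mu_hat scores highest
```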
30. MLE of Gaussian σ
ℓ(µ, σ) = −N log σ − (N/2) log(2π) − (1/(2σ²)) ∑i (xi − µ)²    (8)
∂ℓ/∂σ = −N/σ + 0 + (1/σ³) ∑i (xi − µ)²    (9)
Solve for σ by setting the derivative to zero:
0 = −N/σ + (1/σ³) ∑i (xi − µ)²    (10)
N/σ = (1/σ³) ∑i (xi − µ)²    (11)
σ² = (∑i (xi − µ)²) / N    (12)
Consistent with what we said before: the MLE of σ² is the average squared deviation from the mean.
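Again a quick check (assuming NumPy; simulated data): equation (12) coincides with NumPy's default (ddof=0) variance:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=2.0, size=100_000)

mu = x.mean()
sigma2_mle = np.sum((x - mu) ** 2) / len(x)  # equation (12)
print(sigma2_mle, x.var())  # x.var() divides by N too, so these agree; both ≈ 4
```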
31. Maximum Likelihood Estimation
INFO-2301: Quantitative Reasoning 2
Michael Paul and Jordan Boyd-Graber
MARCH 7, 2017
39. Discrete Distribution: Multinomial
• Recall the mass function (N is the total number of observations, xi is the count in each cell, θi the probability of each cell)
p(x | θ) = (N! / ∏i xi!) ∏i θi^xi    (2)
• Taking the log makes the math easier and doesn't change the answer (log is monotonic)
• If we observe counts x1, ..., xK, then the log likelihood is
ℓ(θ) ≡ log(N!) − ∑i log(xi!) + ∑i xi log θi    (3)
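A small sketch (assuming NumPy and SciPy; toy counts) verifies equation (3) against SciPy's built-in multinomial log PMF:

```python
import numpy as np
from scipy import stats
from math import lgamma

counts = np.array([3, 5, 2])         # x_i: counts per cell
theta = np.array([0.3, 0.5, 0.2])    # theta_i: cell probabilities
N = counts.sum()

# Equation (3): log N! - sum_i log x_i! + sum_i x_i log theta_i
# (lgamma(k + 1) equals log k!)
ll = (lgamma(N + 1) - sum(lgamma(x + 1) for x in counts)
      + np.sum(counts * np.log(theta)))

print(ll, stats.multinomial.logpmf(counts, n=N, p=theta))  # should match
```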
43. MLE of Multinomial θ
ℓ(θ) = log(N!) − ∑i log(xi!) + ∑i xi log θi + λ(1 − ∑i θi)    (4)
Where did the λ term come from? The constraint that θ must be a distribution, i.e., ∑i θi = 1.
• ∂ℓ/∂θi = xi/θi − λ
• ∂ℓ/∂λ = 1 − ∑i θi
52. MLE of Multinomial θ
• We have a system of equations:
θ1 = x1/λ    (6)
...    (7)
θK = xK/λ    (8)
∑i θi = 1    (9)
• So let's substitute the first K equations into the last:
∑i xi/λ = 1    (10)
• So λ = ∑i xi = N, and θi = xi/N
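A minimal check (assuming NumPy; hypothetical counts) that the observed proportions beat any other candidate distribution:

```python
import numpy as np

counts = np.array([3, 5, 2])
theta_mle = counts / counts.sum()   # theta_i = x_i / N
print(theta_mle)                    # [0.3 0.5 0.2]

def ll(theta):
    # sum_i x_i log theta_i (the only term that depends on theta)
    return np.sum(counts * np.log(theta))

# Any other valid distribution scores lower:
other = np.array([0.4, 0.4, 0.2])
print(ll(theta_mle), ll(other))  # the first value is larger
```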
53. Maximum Likelihood Estimation
INFO-2301: Quantitative Reasoning 2
Michael Paul and Jordan Boyd-Graber
MARCH 7, 2017
55. Big Pictures
• Ran through several common examples
• For existing distributions you can (and should) look up the MLE
• For new models, you can't (foreshadowing of later in the class)
◦ Classification models
◦ Unsupervised models (Expectation-Maximization)
• Not always so easy
56. Classification
• Classification can be viewed as p(y | x, θ)
• Have x, y; need θ
• Discovering θ is also a problem of MLE
57. Clustering
• Clustering can be viewed as p(x | z, θ)
• Have x; need z, θ
• z is guessed at iteratively (Expectation)
• θ is estimated to maximize likelihood (Maximization)
58. Not always so easy: Bias
• An estimator is biased if E[θ̂] ≠ θ
• We won't prove it, but the MLE estimate of the variance is biased
• The bias comes from estimating µ from the same data, which "shrinks" the variance estimate; the correction divides by N − 1 instead of N:
σ̂² = (1/(N − 1)) ∑i (xi − µ)²    (1)
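A short simulation (assuming NumPy) makes the bias visible: averaging the divide-by-N estimator over many small samples undershoots the true variance, while dividing by N − 1 does not:

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0
n = 5

# Average each estimator over many small samples.
mle_estimates, corrected_estimates = [], []
for _ in range(100_000):
    x = rng.normal(0.0, np.sqrt(true_var), size=n)
    mle_estimates.append(x.var(ddof=0))        # MLE: divide by N
    corrected_estimates.append(x.var(ddof=1))  # corrected: divide by N - 1

print(np.mean(mle_estimates))        # ≈ 3.2 = (n-1)/n * 4, biased low
print(np.mean(corrected_estimates))  # ≈ 4.0, unbiased
```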
59. Not always so easy: Intractable Likelihoods
• Not always possible to “solve for” optimal estimator
• Use gradient optimization (we’ll see this for logistic regression)
• Use other approximations (e.g., Monte Carlo sampling)
• Whole subfield of statistics / information science
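As a taste of gradient optimization, here is a minimal sketch (assuming NumPy; illustrative only, not the logistic regression case the slides foreshadow) of gradient ascent on the Gaussian log likelihood in µ:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=1.0, size=1000)

# Gradient ascent on l(mu): d l / d mu = (1 / sigma^2) * sum_i (x_i - mu)
mu, sigma, lr = 0.0, 1.0, 1e-4
for _ in range(200):
    grad = np.sum(x - mu) / sigma ** 2
    mu += lr * grad

print(mu, x.mean())  # gradient ascent converges to the closed-form MLE
```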