Hessian Matrices in Statistics
Ferris Jumah, David Schlueter, Matt Vance
MTH 327
Final Project
December 7, 2011
Topic Introduction
Today we are going to talk about . . .
Introduce the Hessian matrix
Brief description of relevant statistics
Maximum Likelihood Estimation (MLE)
Fisher Information and Applications
Statistics: Some things to recall
Now, let's talk a bit about inferential statistics
Parameters
Random Variables
Definition: A random variable X is a function X : Ω → R
Each r.v. follows a distribution that has an associated probability function f(x|θ)
E.g. the normal density

f(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \qquad (2)

What is a Random Sample? X1, . . . , Xn i.i.d.
Outputs of these r.v.s are our sample data
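As a quick numerical sanity check (an illustrative sketch, not part of the original slides), the density in equation (2) can be coded directly and compared against Python's built-in `statistics.NormalDist`:

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu, sigma2):
    """Density f(x | mu, sigma^2) from equation (2)."""
    sigma = sqrt(sigma2)
    return (1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu) ** 2 / (2 * sigma2))

# Compare against the standard library at a few points (sigma^2 = 4, so sigma = 2).
for x in (-1.0, 0.0, 2.5):
    assert abs(normal_pdf(x, 1.0, 4.0) - NormalDist(1.0, 2.0).pdf(x)) < 1e-12
```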
Stats cont.
Estimators (θ̂) of Population Parameters
Definition: An estimator is a rule or formula for computing an estimate of a parameter θ from sample data
Many estimators exist, but which is the best?
Maximum Likelihood Estimation (MLE)
Key Concept: Maximum Likelihood Estimation
GOAL: to determine the best estimate of a parameter θ from a sample
Likelihood Function
We obtain a data vector x = (x1, . . . , xn)
Since the random sample is i.i.d., we express the probability of our observed data given θ as

f(x_1, x_2, \ldots, x_n \mid \theta) = f(x_1 \mid \theta) \cdot f(x_2 \mid \theta) \cdots f(x_n \mid \theta) \qquad (3)

f_n(\mathbf{x} \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) \qquad (4)

Implication of maximizing the likelihood function: the MLE θ̂ is the value of θ under which the observed data are most probable
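To make the idea concrete (a sketch with made-up data, not from the slides): for an i.i.d. normal sample with known σ² = 1, maximizing the log of equation (4) over µ recovers the sample mean.

```python
from math import log, pi
from statistics import mean

def log_likelihood(mu, data, sigma2=1.0):
    """log f_n(x | mu) = sum_i log f(x_i | mu), the log of equation (4) for N(mu, sigma2)."""
    return sum(-0.5 * log(2 * pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)
               for x in data)

data = [2.1, 1.9, 2.4, 1.6, 2.0]
# Crude grid search over candidate values of mu in [0, 4].
mu_hat = max((mu / 100 for mu in range(0, 401)),
             key=lambda m: log_likelihood(m, data))
assert abs(mu_hat - mean(data)) < 0.01  # the MLE of mu is the sample mean
```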
Example of MLE
Example: Gaussian (Normal) Linear Regression
Recall Least Squares Regression
Wish to determine the weight vector w
Likelihood function given by

P(\mathbf{y} \mid \mathbf{x}, \mathbf{w}) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} \exp\!\left(-\frac{\sum_i (y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2}\right) \qquad (5)

Maximizing this likelihood is equivalent to minimizing

S = \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 = (\mathbf{y} - A\mathbf{w})^T (\mathbf{y} - A\mathbf{w}) \qquad (6)

where A is the design matrix of our data.
Example of MLE cont.
Following the standard optimization procedure, we compute the gradient of S:

\nabla S = 2(A^T A \mathbf{w} - A^T \mathbf{y}) \qquad (7)

Notice this is a linear combination of the weights and the columns of A^T A
Setting the gradient to zero, our resulting critical point is

\hat{\mathbf{w}} = (A^T A)^{-1} A^T \mathbf{y}, \qquad (8)

which we recognize to be the normal equations!
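A minimal numerical illustration of equation (8) (a sketch with a tiny made-up dataset): fitting a line with an intercept column, where y lies exactly on y = 1 + 2x.

```python
def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(M, N):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*N)] for row in M]

def solve2(M, b):
    """Solve a 2x2 system M v = b by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(b[0] * M[1][1] - M[0][1] * b[1]) / det,
            (M[0][0] * b[1] - b[0] * M[1][0]) / det]

# Design matrix A with an intercept column; data generated by y = 1 + 2x.
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

AtA = matmul(transpose(A), A)
Aty = [sum(a * yi for a, yi in zip(col, y)) for col in transpose(A)]
w_hat = solve2(AtA, Aty)  # equation (8): (A^T A)^{-1} A^T y
assert abs(w_hat[0] - 1.0) < 1e-9 and abs(w_hat[1] - 2.0) < 1e-9
```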
Computing the Hessian Matrix
We compute the Hessian in order to show that this critical point is a minimum. Differentiating \nabla S once more, the (j, k) entry of the Hessian is \partial^2 S / \partial w_j \partial w_k = 2(A^T A)_{jk}, so up to the constant factor

H = A^T A \qquad (9)

which is positive semi-definite, since \mathbf{w}^T A^T A \mathbf{w} = \|A\mathbf{w}\|^2 \ge 0 for every \mathbf{w}. Therefore S is minimized at \hat{\mathbf{w}}, and our estimate for w maximizes the likelihood function.
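The positive semi-definiteness of H = AᵀA in equation (9) can be spot-checked numerically (an illustrative sketch with an arbitrary matrix): the quadratic form wᵀAᵀAw equals ‖Aw‖², which is never negative.

```python
import random

random.seed(0)

def quad_form(A, w):
    """Compute w^T (A^T A) w, evaluated as ||A w||^2."""
    Aw = [sum(a * wi for a, wi in zip(row, w)) for row in A]
    return sum(v * v for v in Aw)

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
for _ in range(1000):
    w = [random.uniform(-10, 10), random.uniform(-10, 10)]
    assert quad_form(A, w) >= 0.0
```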
MLE cont.
Advantages and Disadvantages
Larger samples, as n → ∞, give better estimates: θ̂_n → θ (consistency)
Other Advantages
Disadvantages: uniqueness, existence, reliance upon distribution fit
This raises the question: how much information about a parameter can be gathered from sample data?
Fisher Information
Key Concept: Fisher Information
We determine the amount of information about a parameter contained in a sample using the Fisher information, defined by

I(\theta) = -E\!\left[\frac{\partial^2 \ln f(x \mid \theta)}{\partial \theta^2}\right]. \qquad (10)

Intuitive appeal: more data provides more information about the population parameter
Fisher information example
Example: Finding the Fisher information for the normal distribution N(µ, σ²)
The log-likelihood function is

\ln f(x \mid \theta) = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2} \qquad (11)

where the parameter vector is θ = (µ, σ²).
The gradient of the log-likelihood is

\left(\frac{\partial \ln f(x \mid \theta)}{\partial \mu}, \frac{\partial \ln f(x \mid \theta)}{\partial \sigma^2}\right) = \left(\frac{x-\mu}{\sigma^2}, \frac{(x-\mu)^2}{2\sigma^4} - \frac{1}{2\sigma^2}\right) \qquad (12)
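The gradient in equation (12) can be verified against central finite differences of the log-density in equation (11) (a numerical sketch; the test point and step size are arbitrary choices):

```python
from math import log, pi

def log_f(x, mu, s2):
    """Equation (11): log-density of one observation, with s2 = sigma^2."""
    return -0.5 * log(2 * pi * s2) - (x - mu) ** 2 / (2 * s2)

def analytic_grad(x, mu, s2):
    """Equation (12): (d/dmu, d/dsigma^2) of the log-density."""
    return ((x - mu) / s2,
            (x - mu) ** 2 / (2 * s2 ** 2) - 1 / (2 * s2))

x, mu, s2, h = 1.3, 0.5, 2.0, 1e-6
num_dmu = (log_f(x, mu + h, s2) - log_f(x, mu - h, s2)) / (2 * h)
num_ds2 = (log_f(x, mu, s2 + h) - log_f(x, mu, s2 - h)) / (2 * h)
g = analytic_grad(x, mu, s2)
assert abs(num_dmu - g[0]) < 1e-6 and abs(num_ds2 - g[1]) < 1e-6
```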
Fisher information example continued
We now compute the Hessian matrix that will lead us to our Fisher information matrix:

\frac{\partial^2 \ln f(x \mid \theta)}{\partial \theta^2} =
\begin{pmatrix}
\dfrac{\partial^2 \ln f(x \mid \theta)}{\partial \mu^2} & \dfrac{\partial^2 \ln f(x \mid \theta)}{\partial \mu \, \partial \sigma^2} \\[1ex]
\dfrac{\partial^2 \ln f(x \mid \theta)}{\partial \mu \, \partial \sigma^2} & \dfrac{\partial^2 \ln f(x \mid \theta)}{\partial (\sigma^2)^2}
\end{pmatrix}
=
\begin{pmatrix}
-\dfrac{1}{\sigma^2} & -\dfrac{x-\mu}{\sigma^4} \\[1ex]
-\dfrac{x-\mu}{\sigma^4} & \dfrac{1}{2\sigma^4} - \dfrac{(x-\mu)^2}{\sigma^6}
\end{pmatrix} \qquad (13)

We now compute our Fisher information matrix. Since E[x - µ] = 0 and E[(x - µ)²] = σ², we see that

I(\theta) = -E\!\left[\frac{\partial^2 \ln f(x \mid \theta)}{\partial \theta^2}\right] \qquad (14)
=
\begin{pmatrix}
\dfrac{1}{\sigma^2} & 0 \\[1ex]
0 & \dfrac{1}{2\sigma^4}
\end{pmatrix} \qquad (15)
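A Monte Carlo sanity check of equation (15) (an illustrative sketch; the parameter values, sample size, and tolerance are arbitrary choices): averaging the negated diagonal entries of the Hessian in equation (13) over simulated draws should approximate 1/σ² and 1/(2σ⁴).

```python
import random
from statistics import fmean

random.seed(42)
mu, sigma = 0.0, 2.0
s2 = sigma ** 2
draws = [random.gauss(mu, sigma) for _ in range(200_000)]

# Negated diagonal entries of equation (13), averaged over the draws.
I_mu = fmean(1 / s2 for _ in draws)  # constant, so exactly 1/sigma^2
I_s2 = fmean((x - mu) ** 2 / s2 ** 3 - 1 / (2 * s2 ** 2) for x in draws)

assert abs(I_mu - 1 / s2) < 1e-12            # 1/sigma^2 = 0.25
assert abs(I_s2 - 1 / (2 * s2 ** 2)) < 0.005  # 1/(2 sigma^4) = 0.03125
```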
Applications of Fisher information
Fisher information is used in the calculation of . . .
Lower bound of Var(θ̂) (the Cramér-Rao bound), given by

\mathrm{Var}(\hat{\theta}) \ge \frac{1}{I(\theta)} \qquad (16)

for an unbiased estimator θ̂
Wald Test: comparing a proposed value θ₀ of θ against the MLE
Test statistic given by

W = \frac{\hat{\theta} - \theta_0}{\mathrm{s.e.}(\hat{\theta})} \qquad (17)

where

\mathrm{s.e.}(\hat{\theta}) = \sqrt{\frac{1}{I(\hat{\theta})}} \qquad (18)
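Putting equations (16)-(18) together (a sketch with made-up numbers; 1.96 is the usual 5% two-sided normal critical value): for the mean of N(µ, σ²) with σ² known, the per-observation information is 1/σ², so with n observations s.e.(µ̂) = σ/√n.

```python
from math import sqrt
from statistics import mean

def wald_statistic(data, theta0, sigma2):
    """W = (theta_hat - theta0) / s.e.(theta_hat), equations (17)-(18),
    for the mean of N(mu, sigma2) with sigma2 known."""
    n = len(data)
    theta_hat = mean(data)    # the MLE of mu
    fisher_info = n / sigma2  # n times the per-observation information 1/sigma^2
    se = sqrt(1 / fisher_info)  # equation (18)
    return (theta_hat - theta0) / se

data = [2.9, 3.1, 3.4, 2.8, 3.2, 3.0, 3.3, 2.7]
W = wald_statistic(data, theta0=3.0, sigma2=0.25)
# |W| > 1.96 would reject H0: mu = 3.0 at the 5% level; here it does not.
assert abs(W) < 1.96
```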