numerical-MLE.pdf 1 Stat 102BSanchez .docx

Stat 102B/Sanchez

Handout 18
Introduction. Solving maximum likelihood estimation problems with numerical optimization methods.

There are several numerical methods for solving maximum likelihood estimation problems. They find the maximum likelihood estimates and, at the same time, produce the Hessian and the other ingredients needed for confidence intervals. These numerical methods were created to solve general optimization problems in calculus, but statisticians learned to apply them to find the maxima of likelihood functions and to solve other statistical problems, as we already discussed in this class. Numerical methods are not needed when you can find a closed-form mathematical solution, but we will use them here in some of these closed-form cases to convince you that numerical methods work.
Topic 1. The mle function in R.

Example 1. The MLE of the parameter of an exponential distribution using R's mle function

We know that the MLE of the parameter of an exponential distribution $f(y) = \theta e^{-\theta y}$, $y > 0$, based on a random sample of size n is $\hat{\theta} = 1/\bar{y}$. But we will show here how a numerical method gives us that result. Suppose $Y_1, Y_2$ are iid with density $f(y) = \theta e^{-\theta y}$, $y > 0$.
Q1. Write the likelihood function formula.

Mathematically, the maximum likelihood estimator for θ based on this random sample is $\hat{\theta}_{mle} = \frac{2}{y_1 + y_2}$. This solution is unique and maximizes the log likelihood, which is $2\log\theta - \theta(y_1 + y_2)$.

Q2. Find the first and second derivatives of the log likelihood function with respect to θ. Write the formula for the maximum likelihood estimate of θ and determine whether it is a maximum, a minimum or a saddle point.
Although we have the exact analytical solution, let us see how the problem can be solved numerically using the mle function (package stats4). The mle function takes as its argument the function that evaluates the negative log likelihood. The negative log likelihood is minimized by a call to optim, an optimization routine. Suppose the random sample consists of the following observations: 0.04304550 and 0.50263474. With these data, the following program applies the mle function.

Using Program 1 in the attached R script, we find that the maximum likelihood estimate is 3.66515 and the standard error of the estimate is 2.591652. The program also gives −2 log likelihood, which is equal to −1.1954.
Q3. Write the formula for the maximum likelihood estimator and for the asymptotic standard error of the maximum likelihood estimator.

Notice that we give the initial value for theta in the mlogL function, as the default value of its argument. Alternatively, the initial value for the optimizer could be supplied in the call to mle; two examples are
mle(mlogL,start=list(theta=1))
mle(mlogL,start=list(theta=mean(y)))

# the observed sample
y = c(0.04304550, 0.50263474)
# the function containing the formula for the negative log likelihood;
# the default value theta=1 is used as the starting value by mle
mlogL = function(theta = 1) {
  return(-(length(y) * log(theta) - theta * sum(y)))
}
# Finding numerically the MLE and Fisher's information
library(stats4)
fit = mle(mlogL)
summary(fit)
In this example, the maximum likelihood estimate is $\hat{\theta} = 1/\bar{y} = 3.66515$. The maximum log likelihood is $\log L(\hat{\theta}) = 2\log(1/\bar{y}) - (1/\bar{y})(y_1 + y_2) \approx 0.5977$, or $-2\log L(\hat{\theta}) = -1.195477$. The same result was obtained by mle.

$\hat{\theta}$ satisfies $\frac{d\log L(\theta)}{d\theta} = 0$. Then $\hat{\theta}$ may be a relative maximum, a relative minimum, or an inflection point of $\log L(\theta)$. If $\frac{d^2\log L(\theta)}{d\theta^2} < 0$ at $\hat{\theta}$, then $\hat{\theta}$ is a local maximum of $\log L(\theta)$.
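As a quick numerical check of the second-derivative reasoning above, the sketch below evaluates the closed-form estimate for the two observations of Example 1 and compares it with a one-dimensional search. The names loglik, theta_hat, d2, se and num are illustrative and are not part of the handout's Program 1; optimize() and its search interval are used here only as a simple stand-in for the optimizer that mle calls.

```r
# Observed sample from Example 1
y <- c(0.04304550, 0.50263474)
n <- length(y)

# Log likelihood of the exponential model: n*log(theta) - theta*sum(y)
loglik <- function(theta) n * log(theta) - theta * sum(y)

# Closed-form MLE and the second derivative -n/theta^2 evaluated at it
theta_hat <- n / sum(y)
d2 <- -n / theta_hat^2

# Asymptotic standard error: square root of the inverse observed information
se <- sqrt(-1 / d2)

# One-dimensional numerical search (interval chosen only for illustration)
num <- optimize(loglik, interval = c(0.01, 50), maximum = TRUE)

c(closed_form = theta_hat, numerical = num$maximum, se = se)
```

A negative d2 confirms that the stationary point is a maximum, and the two estimates should agree up to the tolerance of the search.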
Topic 2. The optim function in R.

Example 2. Estimating the parameters of a gamma distribution using the optim function in R.

Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a Gamma distribution. We seek the maximum of a two-parameter likelihood function. We will use optim, the general-purpose optimization function in R. It implements Nelder-Mead, quasi-Newton, and conjugate-gradient algorithms, and also methods for box-constrained optimization and simulated annealing. The default method is Nelder-Mead.

The likelihood function is
$L = \Gamma(\alpha)^{-n}\, \lambda^{n\alpha}\, e^{-\lambda \sum_{i=1}^{n} x_i} \left(\prod_{i=1}^{n} x_i\right)^{\alpha - 1}$
Q4. Write the formula for the log likelihood.

$\log L = -n\log\Gamma(\alpha) - \lambda\sum_{i=1}^{n} x_i + n\alpha\log\lambda + (\alpha - 1)\sum_{i=1}^{n}\log(x_i)$
We write the log-likelihood function in R, together with the following program, to illustrate. As optim performs minimization by default, the return value of the function is $-\log L(\theta)$. Initial values for the estimates must be chosen carefully. For this problem, the method of moments estimators could be given as initial values of the parameters, but for simplicity we take α = 1 and λ = 1. If x is the random sample of size n, the optim call is the one shown in the program below. The return object includes an error code $convergence, which is 0 for success and otherwise indicates a problem.
The MLE is computed for an artificial random sample from a gamma distribution with alpha=5 and lambda=2. The MLE estimates are: $\hat{\alpha} = 4.966438$; $\hat{\lambda} = 1.619640$.

Notice how the estimates change after the removal of the extra –lambda term in the program.
## Function containing the formula for the negative log likelihood
## of the Gamma distribution.
## Typo fixed: an extra "- lambda" term was removed from loglik;
## this changes the results below.
LL = function(theta, sx, slogx, n) {
  alpha = theta[1]
  lambda = theta[2]
  loglik = -n*log(gamma(alpha)) - lambda*sx +
    n*alpha*log(lambda) + (alpha-1)*slogx
  -loglik   # optim minimizes, so return the negative log likelihood
}
# Generate artificial gamma data
n = 20
x = rgamma(n, 5, 2)
# Apply optim with initial values alpha=1, lambda=1
out = optim(c(1,1), LL, sx=sum(x), slogx=sum(log(x)), n=n, hessian=T)
out
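The text mentions that method of moments estimators could serve as starting values. A minimal sketch of that idea follows, assuming the gamma is parameterized by shape alpha and rate lambda, so that the mean is alpha/lambda and the variance is alpha/lambda^2. The seed, the sample size of 200 and the names negLL and start are illustrative choices, not taken from the handout.

```r
set.seed(102)                      # hypothetical seed, for reproducibility only
x <- rgamma(200, shape = 5, rate = 2)
n <- length(x)

# Negative log likelihood of the gamma model, theta = c(alpha, lambda)
negLL <- function(theta, sx, slogx, n) {
  alpha <- theta[1]
  lambda <- theta[2]
  -(-n * lgamma(alpha) - lambda * sx + n * alpha * log(lambda) +
      (alpha - 1) * slogx)
}

# Method of moments starting values: alpha = mean^2/var, lambda = mean/var
start <- c(mean(x)^2 / var(x), mean(x) / var(x))

out <- optim(start, negLL, sx = sum(x), slogx = sum(log(x)), n = n,
             hessian = TRUE)

out$convergence                    # 0 indicates successful convergence
se <- sqrt(diag(solve(out$hessian)))
rbind(estimate = out$par, se = se)
```

Starting close to the solution keeps the default Nelder-Mead search away from negative parameter values, where the log likelihood is undefined.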

The Hessian matrix is
$H = \begin{pmatrix} 4.459452 & -12.34843 \\ -12.348426 & 37.86504 \end{pmatrix}$

Notice how the Hessian changes after the removal of –lambda in the program.
Q5. Write the standard errors of the maximum likelihood estimators and a 95% confidence interval for each.
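By inverting the Hessian in R, for example with solve(out$hessian), you obtain the inverse of the observed information matrix (the Hessian of the negative log likelihood is the observed information matrix). That inverse estimates the covariance matrix of the estimators and is
$H^{-1} = \begin{pmatrix} 2.3125502 & 0.7541614 \\ 0.7541614 & 0.2723543 \end{pmatrix}$
The standard errors are the square roots of its diagonal elements and can be obtained in R using the command sqrt(diag(solve(out$hessian))).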
This gives $se(\hat{\alpha}) = 1.520707$ and $se(\hat{\lambda}) = 0.5218758$. With the standard errors, we can construct confidence intervals. Remember that, asymptotically, the maximum likelihood estimators are normally distributed, unbiased and efficient. So the 95% confidence interval, for example, is given for each parameter by
95% CI for α: 4.966438 ± 1.96(1.520707)
95% CI for λ: 1.619640 ± 1.96(0.5218758)
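The intervals above can be reproduced from the reported Hessian. The sketch below enters the Hessian and the estimates by hand (values copied from this handout; the object names H, est, V, se and ci are illustrative) and uses qnorm(0.975) in place of the rounded 1.96.

```r
# Hessian of the negative log likelihood, as reported in the handout
H <- matrix(c(4.459452, -12.348426,
              -12.348426, 37.86504), nrow = 2, byrow = TRUE)
est <- c(alpha = 4.966438, lambda = 1.619640)

V <- solve(H)                 # estimated covariance matrix of the MLEs
se <- sqrt(diag(V))           # standard errors
crit <- qnorm(0.975)          # two-sided 95% normal critical value
ci <- cbind(lower = est - crit * se, upper = est + crit * se)
ci
```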
Topic 3. Finding maximum likelihood estimators with the nlm function.
The R function nlm minimizes arbitrary functions written in R with a Newton-type algorithm. So to maximize the likelihood, we hand nlm the negative of the log likelihood (for any function f, minimizing −f maximizes f). The function my.mle.model, listed below together with the data, defines the negative log likelihood of a sample believed to have come from a Poisson distribution with a given parameter value.
Q6. Write the log likelihood function of n observations coming from a Poisson distribution.
Then we apply nlm to the function, with initial value for the parameter equal to 1, using the data.
my.mle.estimation=nlm(my.mle.model, 1, data, hessian=T)
Notice that you need the word data inside the parentheses. I also added hessian.
theta.hat=my.mle.estimation$estimate
gives the estimate for lambda, $\hat{\lambda} = 3.3749$.
theta.se=sqrt(diag(solve(my.mle.estimation$hessian)))
gives the estimate of the standard error, $se(\hat{\lambda})$.
In its default mode of operation nlm uses derivatives calculated by finite differences. It will work better and faster if we supply the derivatives.
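One way to supply the derivative is to attach it to the function value as an attribute named "gradient", which nlm then uses instead of finite differences. The sketch below does this for the Poisson example; the function name negll_grad and the object names fit and se are illustrative, and the derivative n − sum(x)/lambda follows from differentiating the negative Poisson log likelihood.

```r
data <- c(2, 0, 3, 4, 6, 2, 1, 2, 1, 5, 3, 7, 9, 2, 4, 3)

# Negative Poisson log likelihood with its analytic derivative attached
negll_grad <- function(lambda, data) {
  val <- -sum(dpois(data, lambda, log = TRUE))
  # d/dlambda of [n*lambda - sum(x)*log(lambda) + const] = n - sum(x)/lambda
  attr(val, "gradient") <- length(data) - sum(data) / lambda
  val
}

fit <- nlm(negll_grad, p = 1, data = data, hessian = TRUE)
fit$estimate                            # should be close to mean(data)
se <- sqrt(diag(solve(fit$hessian)))    # asymptotic standard error
se
```

For the Poisson model the MLE is the sample mean, so the numerical estimate can be checked directly against mean(data).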

data=c(2,0,3,4,6,2,1,2,1,5,3,7,9,2,4,3)
#### Specify the negative log likelihood function (nlm minimizes)
my.mle.model = function(parameters, data) {
  sum(-dpois(data, parameters, log=TRUE))
}
Let's simulate some Poisson random numbers with known lambda and then apply the function to see how good the MLE is.
r.poisson = rpois(20,4)
The following two commands provide the Hessian matrix and the inverse Fisher information.
fish = my.mle.estimation$hessian
fisher.inf = solve(my.mle.estimation$hessian)
The inverse Fisher information gives the asymptotic variance matrix of the MLE. From it, we can construct asymptotic confidence intervals.

Asymptotic confidence intervals
conf.level=0.95
crit = qnorm((1+conf.level)/2)
inv.fish = solve(fish)
theta.hat[1] + c(-1,1)*crit*sqrt(inv.fish[1,1])
# second parameter, when the model has one
theta.hat[2] + c(-1,1)*crit*sqrt(inv.fish[2,2])
As you can see, there are several ways of writing the code to obtain what you want. Just be clear that the standard errors are the square roots of the diagonal elements of the inverse of the Hessian. These are, of course, not simultaneous confidence intervals. To get simultaneous coverage we would have to replace the critical value calculation by a Bonferroni correction.

crit = qnorm(1-(1-conf.level)/2/length(theta.hat))
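To see the effect of the correction, the sketch below compares the single-interval critical value with the Bonferroni one for a model with two parameters. The value k = 2 is an illustrative assumption (for example alpha and lambda in Example 2), and the names crit_single and crit_bonf are not from the handout.

```r
conf.level <- 0.95
k <- 2   # number of parameters covered simultaneously (assumed for illustration)

crit_single <- qnorm((1 + conf.level) / 2)          # one interval at a time
crit_bonf <- qnorm(1 - (1 - conf.level) / 2 / k)    # Bonferroni-corrected

c(single = crit_single, bonferroni = crit_bonf)
```

The corrected critical value is larger, so each simultaneous interval is wider than the corresponding individual interval.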
(Note: Program 1 in the code provided with this handout contains another example of Poisson fitting with nlm.)
Example 3. (Uses Program 2 in the R code that goes with this Handout 18.) The data file gamma-arrivals contains a set of gamma-ray data consisting of the times between arrivals (interarrival times) of 3,935 photons (units are seconds).

(a) Make a histogram of the interarrival times and find summary statistics. Does it appear that a gamma distribution would be a plausible model?
Yes, it appears plausible that a gamma is a good model. An exponential model might look good too, but an exponential is a special case of a gamma distribution.
(b) Fit the parameters by the method of moments and by maximum likelihood. How do the estimates compare?
$\hat{\lambda}_{MM} = 0.01266402$, $\hat{\alpha}_{MM} = 1.012436$
$\hat{\lambda}_{MLE} = 0.01283669$, $\hat{\alpha}_{MLE} = 1.02662$
As we can see, the estimates obtained by MLE and those obtained by MM are not very different.
(c) Plot the two fitted gamma densities on top of the histogram. Do the fits look reasonable?
The fit looks reasonable in both cases because the fitted model (red) does not seem to over- or underestimate the probability in any particular region of the histogram.

Appendix: R code
(d) Give the standard errors of the MLE estimates, $se(\hat{\alpha})$ and $se(\hat{\lambda})$.
(e) Write 95% confidence intervals for the parameters and interpret them.

handout19.pdf
Stat 102B -Computation and Optimization in Statistics
Handout 19
NAME (Last, First):————————— UCLA ID:—————
Date: ——–

J. Sanchez
UCLA Department of Statistics
Topic 1. Machine Learning basic principle from Probability.
Bayes theorem is sometimes used in classification of items
where a system has already learnt the probabilities.
Suppose there are two classes, y = 1 and y = 2 into which we
can classify w, a new value of the item. By Bayes theorem,
we can write
$P(y = 1 \mid w) = \frac{P(y = 1 \cap w)}{P(w)} = \frac{P(y = 1)P(w \mid y = 1)}{P(w)}$
$P(y = 2 \mid w) = \frac{P(y = 2 \cap w)}{P(w)} = \frac{P(y = 2)P(w \mid y = 2)}{P(w)}$
Dividing,
$\frac{P(y = 1 \mid w)}{P(y = 2 \mid w)} = \frac{P(y = 1)P(w \mid y = 1)}{P(y = 2)P(w \mid y = 2)}$
Our decision is to classify a new example into class 1 if
$\frac{P(y = 1 \mid w)}{P(y = 2 \mid w)} > 1$
or equivalently if
$\frac{P(y = 1)P(w \mid y = 1)}{P(y = 2)P(w \mid y = 2)} > 1$
which means that w goes into class 1 if
P(y = 1)P(w | y = 1) > P(y = 2)P(w | y = 2)
and
w goes into class 2 if
P(y = 1)P(w | y = 1) < P(y = 2)P(w | y = 2).
When
P(y = 1)P(w | y = 1) = P(y = 2)P(w | y = 2),
the result is inconclusive.
The conditional probabilities P(w | y = 1) and P(w | y = 2) are assumed to be already learnt, as are the prior probabilities P(y = 1) and P(y = 2). If these can be accurately estimated, the classifications will have a high probability of being correct. For example, an e-mail spam filter has learned from past e-mails what proportion are spam (y=1) and what proportion are not (y=2). It has also been tracking what proportion of those spam e-mails contain the sentence "click here" (event w), and thus knows P(w | y = 1).
Similarly, it has been tracking what percentage of e-mails that are not spam contain the same sentence, and thus knows P(w | y = 2).
In fact, many commercial spam filters are based on this basic training on past e-mails and on Bayes theorem. With that information, answer the following question:
Suppose the prior probabilities of being in either of the two
classes are P(y = 1) = 0.4, and P(y = 2) = 0.6. Also the
conditional probabilities for the new example w are P(w | y = 1)
= 0.5 and P(w | y = 2) = 0.3. Into what class should you
classify the new example? Show the work.
Solution
1.
P(y = 1)P(w | y = 1) = 0.4(0.5) = 0.2
P(y = 2)P(w | y = 2) = 0.6(0.3) = 0.18
and since
P(y = 1)P(w | y = 1) > P(y = 2)P(w | y = 2) ,
the new example goes into class 1.
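The same comparison can be checked in a few lines of R. This is only a sketch of the arithmetic in the solution; the vector names prior, lik, joint, posterior and decision are illustrative.

```r
prior <- c(class1 = 0.4, class2 = 0.6)   # P(y = 1), P(y = 2)
lik <- c(class1 = 0.5, class2 = 0.3)     # P(w | y = 1), P(w | y = 2)

joint <- prior * lik                     # P(y = k) P(w | y = k)
posterior <- joint / sum(joint)          # P(y = k | w), dividing by P(w)
decision <- names(which.max(joint))      # class with the larger product

joint
posterior
decision
```

Dividing by P(w) does not change which class wins, which is why the decision rule only needs the products of prior and conditional probability.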
March 10, 2014 1

Example 1. The implementation of EM clustering uses the above reasoning. The steps followed can be summarized in the following example. We have a small data set with n = 5 observations, each observation being a vector of several variables. It is believed that the observations can be grouped in 2 clusters. Suppose we have designed a method that tells us the following about the latent variable z representing the group number.
• Observation 1 has probability 0.1 of being in group 1 and probability 0.9 of being in group 2. Then we allocate observation 1 to group 2 (z=2).
• Observation 2 has probability 0.8 of being in group 1 and 0.2 of being in group 2. Then observation 2 is allocated to group 1 (z=1).
• Observation 3 has probability 0.5 of being in either group, so it could go either way.
• Observation 4 has probability 0.3 of being in group 1, and probability 0.7 of being in group 2, so it is assigned to group 2.
• Observation 5 has probability 0.4 of being in group 1 and probability 0.6 of being in group 2.
Because we have that fifty-fifty situation, we cannot say clearly how many go to group 1 or group 2. So a possibility is to add the probabilities of each group, and take that as an estimate of the number of observations in each group.
$n_1 = 0.1 + 0.8 + 0.5 + 0.3 + 0.4 = 2.1$, $n_2 = 0.9 + 0.2 + 0.5 + 0.7 + 0.6 = 2.9$.
This can be considered an estimate of the expected number of observations in each group (Expectation step). The resulting estimated proportion of observations in each group can be denoted by
$\alpha_1 = \frac{2.1}{5}$, $\alpha_2 = \frac{2.9}{5}$
Then given that, we estimate the means of each of the groups as follows:
$\mu_1 = \frac{1}{2.1}(0.1x_{1,1} + 0.8x_{2,1} + 0.5x_{3,1} + 0.3x_{4,1} + 0.4x_{5,1})$
$\mu_2 = \frac{1}{2.9}(0.9x_{1,2} + 0.2x_{2,2} + 0.5x_{3,2} + 0.7x_{4,2} + 0.6x_{5,2})$
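The soft counts and weighted means above can be computed directly. The sketch below assumes that each observation is a row vector and uses, for illustration, the five observations listed in Problem 1 at the end of this handout; the names w1, w2, X, mu1 and mu2 are illustrative.

```r
# Membership probabilities for group 1 and group 2 (from the bullets above)
w1 <- c(0.1, 0.8, 0.5, 0.3, 0.4)
w2 <- 1 - w1

# Observations as rows (taken from Problem 1 of this handout)
X <- rbind(c(1, 2), c(3, 1), c(10, 11), c(12, 14), c(2, 4))

n1 <- sum(w1)                  # expected number of observations in group 1
n2 <- sum(w2)                  # expected number of observations in group 2
alpha <- c(n1, n2) / nrow(X)   # estimated group proportions

# Weighted means: each row of X is weighted by its membership probability
mu1 <- colSums(w1 * X) / n1
mu2 <- colSums(w2 * X) / n2

list(n = c(n1, n2), alpha = alpha, mu1 = mu1, mu2 = mu2)
```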
Topic 2. Model-Based clustering (MBC) with mixtures
Model-based clustering also envisages a dataset as made of several latent (that is, missing, unobserved) strata or subpopulations. Depending on the setting, the inferential goal may be to reconstitute the groups by estimating the missing component, z (an operation called classification or clustering), to provide estimators for the parameters of the different groups, or even to estimate the number K of groups (Markov switching models).
Since objects within a class differ from one another, it is reasonable to assume the existence of a probability distribution of characteristics for a population belonging to this class.
(a) Elements from a different class will have a different probability distribution $f_k(x_i \mid \theta_k)$, k = 1, ..., K.
(b) The combined population taken from all classes will have a probability distribution which is a mixture of distributions
$f(x_i \mid \theta) = \sum_{k=1}^{K} p_k f_k(x_i \mid \theta_k)$, $\sum_{k=1}^{K} p_k = 1$
where $p_k \ge 0$, K is the number of latent clusters (unknown), and i = 1, ..., n is the observation number. The parameters $\theta_k$, $p_k$ are unknown.
We distinguish the weights $p_k$ from the other parameters, θ. The weights are associated with the missing data structure of the model (i.e., the allocation of the observations to a given unknown cluster), while the others are related to the observations within a cluster.
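To make the mixture formula concrete, here is a minimal sketch of a mixture density in R, assuming univariate normal components purely for illustration (the handout does not fix the component family at this point). The function name dmix and its arguments are hypothetical.

```r
# Density of a K-component normal mixture: sum_k p_k * f_k(x | mean_k, sd_k)
dmix <- function(x, p, mean, sd) {
  stopifnot(all(p >= 0), abs(sum(p) - 1) < 1e-8)   # weights must sum to 1
  comp <- vapply(seq_along(p),
                 function(k) p[k] * dnorm(x, mean[k], sd[k]),
                 numeric(length(x)))
  # one row per value of x, one column per component
  rowSums(matrix(comp, nrow = length(x)))
}

# Two-component example with weights 0.3 and 0.7
dmix(c(0, 2.5, 5), p = c(0.3, 0.7), mean = c(0, 5), sd = c(1, 2))

# A mixture of densities is itself a density: it integrates to 1
integrate(dmix, -Inf, Inf, p = c(0.3, 0.7), mean = c(0, 5), sd = c(1, 2))$value
```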
The maximum likelihood estimates of the parameters are those values of $\{p_j, \theta_j\}$ that maximize the likelihood of the sample
$L = \prod_{i=1}^{n} \sum_{j=1}^{K} p_j f_j(x_i \mid \theta_j)$
subject to the constraint $\sum_{k=1}^{K} p_k = 1$, which can be imposed using a Lagrange multiplier. An analytical solution of this problem is not possible. Thus, finding the clusters with maximum likelihood estimation of mixtures involves using the EM algorithm or Bayesian Markov chain Monte Carlo methods (Stat 102C will cover the latter in more detail; here we will just introduce MCMC on Wednesday).
Topic 3. EM clustering
EM clustering is the probabilistic version of k-means.
EM clustering consists of thinking about the mixture problem as a missing data problem, i.e., a problem where
$z_i \sim \text{Multinomial}(p_1, p_2, \ldots, p_K)$, i = 1, ..., n
and then defining a complete data likelihood
$L = \prod_{i=1}^{n} \prod_{j=1}^{K} \left[p_j f_j(x_i \mid \theta_j)\right]^{z_{ij}}$
where $z_{ij} = 1$ if $z_i = j$ and 0 otherwise. Integrating out z we are back to the previous likelihood. We could use this last likelihood to maximize as usual, given the values of $z_i$, i = 1, ..., n.
The method then goes as follows:
• E-step: z(t) = E p(z|θ,x,p)[l(θ | y, z)]
• M-step: Maximize l(θ, p | y, z(t)), which gives θ(t) and p(t).
The procedure has these properties:
• The procedure usually converges to a local maximum and, given its simplicity, it is widely used. The log likelihood never decreases from one iteration to the next, i.e., $l(\mu^{(t+1)}, \alpha^{(t+1)}) \ge l(\mu^{(t)}, \alpha^{(t)})$.
• $\{\mu_k^{(t)}, \alpha_k^{(t)}\}_{k=1}^{K} \Rightarrow$ MLE of µ and α as $t \to \infty$.
• It determines the final cluster assignments by assigning each row of the data matrix to cluster $k^* = \arg\max_{1 \le k \le K} w_{ik}^{(N)}$, with N being the last iteration.
Topic 4. Model assumptions for implementation
• $p(z_i = k) = \alpha_k$, for k = 1, ..., K, with $\sum_{k=1}^{K} \alpha_k = 1$.
• $[x_i \mid z_i = k] \sim N(\mu_k, \sigma^2 I)$, $\sigma^2$ given.
Unknown parameters are $(\alpha_1, \ldots, \alpha_K, \mu_1, \ldots, \mu_K)$.
Observed data $X = (x_{ij})_{n \times p}$; missing data $Z = (z_1, z_2, \ldots, z_n)$, the class labels.
Topic 5. Preliminaries

• (1) $P(z_i = k \mid x_i) = \frac{P(x_i \mid z_i = k)P(z_i = k)}{\sum_{j=1}^{K} P(x_i \mid z_i = j)P(z_i = j)} = \frac{\alpha_k f_k(x_i)}{\sum_{j=1}^{K} \alpha_j f_j(x_i)} = w_{ik}$
where $f_k(x_i)$ is the density of the multivariate normal distribution $N(\mu_k, \sigma^2 I)$.
• (2) If Z is given, then the MLE of $\mu_1, \ldots, \mu_K$ and $\alpha_1, \ldots, \alpha_K$ is given by
$\hat{\alpha}_k = \frac{n_k}{n}$, where $n_k$ = number of observations for which $z_i = k$,
$\hat{\mu}_k = \frac{1}{n_k} \sum_{i: z_i = k} x_i$,
for k = 1, 2, ..., K.
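Topic 6. EM clustering algorithm
Choose initial values $\mu_1^{(1)}, \ldots, \mu_K^{(1)}$ and $\alpha_1^{(1)}, \ldots, \alpha_K^{(1)}$, perhaps by visual inspection of the data or based on prior estimates or information from others. Then, for t = 1, 2, ..., N:
• E-step: Given $\{\mu_k^{(t)}, \alpha_k^{(t)}\}_{k=1}^{K}$, compute for each $x_i$,
$w_{ik}^{(t)} = \frac{\alpha_k^{(t)} f_k^{(t)}(x_i)}{\sum_{j=1}^{K} \alpha_j^{(t)} f_j^{(t)}(x_i)} = \frac{\alpha_k^{(t)} \exp\left(-\frac{1}{2\sigma^2} \| x_i - \mu_k^{(t)} \|^2\right)}{\sum_{j=1}^{K} \alpha_j^{(t)} \exp\left(-\frac{1}{2\sigma^2} \| x_i - \mu_j^{(t)} \|^2\right)}$, for k = 1, ..., K.
• M-step: Given $\{w_{ik}^{(t)} : i = 1, \ldots, n;\ k = 1, \ldots, K\}$, estimate
$\alpha_k^{(t+1)} = \frac{n_k^{(t+1)}}{n}$, $n_k^{(t+1)} = \sum_{i=1}^{n} w_{ik}^{(t)}$
$\mu_k^{(t+1)} = \frac{1}{n_k^{(t+1)}} \sum_{i=1}^{n} w_{ik}^{(t)} x_i$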
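The E-step and M-step formulas above can be sketched as a single R function. This is an illustrative implementation under the Topic 4 assumptions (spherical normal components with known sigma^2); the function name em_step, the made-up data and the starting values are assumptions for the example, not part of the handout's code.

```r
# One EM iteration for a K-component mixture of N(mu_k, sigma2 * I)
# X: n x p data matrix, mu: K x p matrix of centers, alpha: K weights
em_step <- function(X, mu, alpha, sigma2 = 1) {
  K <- nrow(mu)
  # squared distances ||x_i - mu_k||^2, stored as an n x K matrix
  d2 <- sapply(seq_len(K), function(k) rowSums(sweep(X, 2, mu[k, ])^2))
  # E-step: w_ik proportional to alpha_k * exp(-d2_ik / (2 * sigma2))
  num <- sweep(exp(-d2 / (2 * sigma2)), 2, alpha, FUN = "*")
  w <- num / rowSums(num)
  # M-step: soft counts, proportions and weighted means
  nk <- colSums(w)
  list(w = w, alpha = nk / nrow(X), mu = t(w) %*% X / nk)
}

# Made-up data: two well-separated groups of three points each
X <- rbind(c(0, 0), c(0, 1), c(1, 0), c(10, 10), c(10, 11), c(11, 10))
fit <- list(mu = rbind(c(1, 1), c(9, 9)), alpha = c(0.5, 0.5))
for (t in 1:5) fit <- em_step(X, fit$mu, fit$alpha, sigma2 = 1)

max.col(fit$w)   # hard cluster labels, k* = argmax_k w_ik
fit$alpha
fit$mu
```

For data with very distant points the exponentials can underflow to zero; a more careful version would work with log weights, which this sketch omits for brevity.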
Topic 7. Some background on the EM algorithm
The seminal paper of Dempster, Laird and Rubin (1977) on the EM algorithm stimulated interest in the use of finite mixture distributions to model heterogeneous data. This is because the fitting of mixture models by maximum likelihood is a classic example of a problem that is simplified considerably by the EM algorithm's conceptual unification of maximum likelihood (ML) estimation from data that can be viewed as being incomplete.
With the considerable attention being given to the analysis of large data sets, as in typical data mining applications, recent work on speeding up the implementation of the EM algorithm is widely discussed, including (a) the use of the sparse/incremental EM and of multiresolution kd-trees and (b) the scaling of the EM algorithm to massively large databases where there is a limited memory buffer.
The EM algorithm is a non-Monte Carlo algorithm used to locate the mode or modes of the likelihood function or the posterior distribution. It does not require the input of a stream of pseudo-random numbers. With EM, one augments the observed data with latent data such that one complicated maximization is replaced by an iterative series of simple maximizations.
Problem 1. Suppose that in the first example seen in this
lecture, the observations are: (1,2), (3,1), (10,11),(12,14), (2,4).

Compute the next E-step and M-step. Provide the values of the
parameters.
HWK-7R-script-start.R.txt
##################################################################
# Stat 102B/Sanchez UCLA ID
# Date
# Homework 7, Program 1.
#
# MLE estimation of parameters of the log normal distribution
# fitted to the radon data
##
# This program fits a log normal model to the radon
# data. It uses the function nlm in R, which is set to
# minimize the negative of the log likelihood (that is equivalent
# to maximizing the log likelihood).
##################################################################
# Read the data from its web site
data=read.table("http://www.stat.berkeley.edu/users/statlabs/data/radon.data", header=T)
attach(data)
head(data) # to see the names of the variables in the data set.
y=data$radon # more convenient to call it y
n=length(y) # number of observations in the radon data set
############################################################################
## View the distribution of the data and guess a model. Play with several
## models to see how they fit. Since the problem asks to fit a
## log normal distribution, we fit several ad hoc log normal models.
############################################################################
hist(y, prob=T,ylim=c(0,0.3))
# define discrete values of x over specified range
x = seq(0, max(y), by=0.1)
# plot log normal densities for several parameter values to get an idea
points(x,dlnorm(x, meanlog=0, sdlog=1,log=FALSE), col="red", type="o",
       pch=21, bg="red")
points(x,dlnorm(x, meanlog=1, sdlog=0.5, log=FALSE), col="green",
       type="o", pch=22, bg="red")
# The next two calls were cut off in the source; type, pch and bg are
# assumed from the pattern above and from the legend below.
points(x,dlnorm(x, meanlog=3, sdlog=1, log=FALSE), col="purple",
       type="o", pch=24, bg="red")
points(x,dlnorm(x, meanlog=1.5, sdlog=0.8, log=FALSE), col="brown",
       type="o", pch=25, bg="red")
# create legend
legend(25,0.25, legend=c("meanlog=0,sdlog=1","meanlog=1, sdlog=0.5","meanlog=3, sdlog=1","meanlog=1.5,sdlog=0.8"),
       cex=0.75, pch=c(21,22,24,25), col="red", pt.bg="red")
## This graph must be put in the homework document that you turn in in lecture.
#######################################
# Write the negative log likelihood. Use formula in radon article
# posted in the homework web site
## to find the likelihood and log likelihood.
## To minimize - log likelihood we will ignore terms not depending on
## parameters. Use the programs learned in Handout 18.
########################################
# negative log-likelihood: p=c(sigma2, gamma), y=radon, n=total #observations
##### Write your program here. and finish.
homework7.pdf

Stat 102B - Computation and Optimization in Statistics
Homework 7
J. Sanchez
Instructions
(1) Homework must be stapled. Writing in two columns per
page not allowed.
(2) No late homework accepted under any circumstances.
(3) THERE IS ONE R script file to upload. It must be uploaded
before the deadline. The hard copy part with answers
must be turned in in lecture the due time or before the deadline.
(4) Hardcopy with answers must be handed in person to prof.
Sanchez at the beginning of lecture. Homework turned
in at the end of lecture will get points deducted. No email,
mailboxes, fax or other way of turning it in will be
allowed. If you need to turn in your homework early, please
contact prof. Sanchez and make arrangements with
her.

(5) Write your Last name, first name, ID, Hwk number, date and
your section on the upper right corner of the hardcopy
homework. Your script file must conform to sample script file
and also have your name inside and as a file name.
(6) To get full credit, you must show work even when not asked
and pay attention to the instructions and follow
them. Points will be deducted for not following instructions
given in each problem. You are also responsible for
uploading your R script early. No hard copies of R scripts will
be accepted. Excuses about individual technical
difficulties will not be accepted. Plan to do it early to get help
from us if needed.
(7) Must answer problems in the order given. There should be
no R code whatsoever in your hardcopy with answers
turned in in lecture. Must use notation used in lecture.
(8) It is ok to work with other students for homework but each
student must turn in their own writing of the problems.
Evidence to the contrary will result in 0 points for all parties
involved.
(9) Hardcopy part of homework can be hand written ONLY. If

hand written, your writing must be neat and easy
to read. You may not use double columns to write your answers.
Open an R script file, put your name, ID, date and homework
number as heading. Then add to it all the programs
requested in the following problems, in the order requested, and
well labelled and separated, as usual. If still in doubt,
look at past R script answer keys for format.
Problem 1. The article “Minnesota Radon Levels“ (posted with
this homework) contains data on radon levels in
Minnesota houses in Minnesota counties. On page 69 a log
normal model is suggested as a possible candidate for the
mechanism generating the radon data. Your job is to use
contents of lecture handout 18 (the updated version posted
on Friday in CCLE and complementary code) to do the
following:
(a) Fit numerically a log normal model to the radon data making
use of R nlm routine seen in Handout 18. For that,
you will need to write a program that you will put in the script
file submitted to CCLE. Program examples can be
seen in Handout 18. Report here, handwritten, the formula for
the log likelihood function.

(b) What are the maximum likelihood parameter estimates and
their standard errors? What is the Hessian matrix?
How do you use the Hessian matrix to compute the standard
errors?
(c) Write by hand here confidence interval for each parameter.
Interpret the intervals.
(d) In addition to that, you will report here the histogram of the
radon data and the fitted estimated model on top of it.
Comment on the fit. Is it good, bad?
Note: a small program showing how to read the radon data is
posted next to this homework.

Problem 2. This problem uses the example seen in lecture on
3/10.
We have a small data set with n = 5 observations, each
observation being a vector of several variables. It is
believed that the observations can be grouped in 2 clusters.
Suppose we have designed a method that tells us the
following about latent variable z representing the group number.
• Observation 1 has probability 0.1 of being in group 1 and probability 0.9 of being in group 2. Then we allocate observation 1 to group 2 (z=2).
• Observation 2 has probability 0.8 of being in group 1 and 0.2 of being in group 2. Then observation 2 is allocated to group 1 (z=1).
• Observation 3 has probability 0.5 of being in either group, so it could go either way.
• Observation 4 has probability 0.3 of being in group 1 and probability 0.7 of being in group 2, so it is assigned to group 2.
• Observation 5 has probability 0.4 of being in group 1 and probability 0.6 of being in group 2.
Because we have that fifty-fifty situation, we cannot say clearly how many go to group 1 or group 2. So a possibility is to add the probabilities of each group, and take that as an estimate of the number of observations in each group.
$n_1 = 0.1 + 0.8 + 0.5 + 0.3 + 0.4 = 2.1$, $n_2 = 0.9 + 0.2 + 0.5 + 0.7 + 0.6 = 2.9$.
This can be considered an estimate of the expected number of observations in each group (Expectation step). The resulting estimated proportion of observations in each group can be denoted by
$\alpha_1 = \frac{2.1}{5}$, $\alpha_2 = \frac{2.9}{5}$
Then given that, we estimate the means of each of the groups as follows:
$\mu_1 = \frac{1}{2.1}(0.1x_{1,1} + 0.8x_{2,1} + 0.5x_{3,1} + 0.3x_{4,1} + 0.4x_{5,1})$
$\mu_2 = \frac{1}{2.9}(0.9x_{1,2} + 0.2x_{2,2} + 0.5x_{3,2} + 0.7x_{4,2} + 0.6x_{5,2})$
(Notice the correction to the mistake made writing the last $\mu_2$ on the blackboard; please correct it in your notes.) Suppose this was iteration 0. Do the E-step for the next iteration, 1, and compute the $W_{ik}$ matrix. Then do the M-step. Repeat the E and M steps 2 more times (t = 1, 2, 3). Assume $\sigma^2 = 1$.
TO DO:
Assume that the data matrix is
$X = \begin{pmatrix} 1 & 2 \\ 3 & 1 \\ 10 & 11 \\ 12 & 14 \\ 2 & 4 \end{pmatrix}$
Write by hand a table with the values of the following formulas
for the algorithm at t=1,2,3. No R code needed for
this problem. Then determine which clusters your observations
end up at and what are the final MLE estimates of mus
and alphas.
Recall that the formulas for the steps are given in Topic 6, page
3 and 4 of handout 19.
March 10, 2014 2
handout17.pdf

Handout 17
Date: ——–
J. Sanchez
Topic 1. Review of k-means clustering. An algorithm to implement it.
Some time ago in this class, we saw k-means clustering. We repeat the exercise now but using the fancier notation introduced here (see attached exercise sheet). We will also be referring to the R program attached. Given an n × p data matrix X containing data for individuals from K groups, we wanted to group them into clusters.
Question 1. What other methods did the job of grouping
individuals into clusters?
Question 2. What were those other methods based on?
Consider an unknown vector $Z = (z_1, \ldots, z_n)$ where each $z_i \in \{1, \ldots, K\}$ is the cluster label of row $X_i$. The cluster centers, the vectors $\mu_k$, k = 1, ..., K, are also unknown. For example, if K = 2, Figure 1 shows a hypothetical data set with two mean vectors $\mu_1, \mu_2$ and two groups $z_1, z_2$.
Figure 1: Both z and µ vectors are unknown
As we saw earlier, to allocate a row of the matrix to a cluster, we choose the cluster that minimizes the total squared distance (TSD) of the row from the vector of means for that cluster, i.e.,
$TSD_k = \sum_{i: z_i = k} \| x_i - \mu_k \|^2$.
For all data points, $TSD = \sum_{k=1}^{K} TSD_k$.
We want to find Z and $\{\mu_k\}$ such that TSD is minimized. The problem is
$\min\ TSD(Z, \mu) = \sum_{k=1}^{K} \sum_{i: z_i = k} \| x_i - \mu_k \|^2$, where $\mu = \{\mu_k, k = 1, \ldots, K\}$.
The k-means algorithm that we used was then the following:
Choose initial centers $\mu_1, \ldots, \mu_K$. Iterate the following two steps (a) and (b) until Z does not change.
(a) Given cluster centers $\{\mu_k, k = 1, \ldots, K\}$, assign each $X_i$ to the closest cluster center
$z_i = \arg\min_{1 \le k \le K} \| x_i - \mu_k \|^2$, i = 1, ..., n
February 27, 2014 1
(b) Given $Z = \{z_1, \ldots, z_n\}$, update the centers by
$\mu_k = \frac{1}{n_k} \sum_{i: z_i = k} x_i$, $n_k = \#\{i : z_i = k\}$
for k = 1, ..., K.
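The objective that these two steps decrease can be evaluated with a short helper. The sketch below is only an illustration on made-up data; the function name tsd and the objects X, centers, z_good and z_bad are hypothetical and not part of the attached program.

```r
# Total squared distance of each row of X to the center of its assigned cluster
# X: n x p data matrix, z: cluster labels in 1..K, centers: K x p matrix
tsd <- function(X, z, centers) {
  sum((X - centers[z, , drop = FALSE])^2)
}

# Made-up data with two obvious groups
X <- rbind(c(0, 0), c(2, 0), c(10, 10), c(10, 12))
centers <- rbind(c(1, 0), c(10, 11))

z_good <- c(1, 1, 2, 2)   # each point assigned to its closest center
z_bad <- c(1, 2, 1, 2)    # two points assigned to the far center

tsd(X, z_good, centers)
tsd(X, z_bad, centers)
```

Tracking this value across iterations is one way to confirm that the algorithm is behaving as an iterative descent method.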
We can see this process in Figure 2.
Figure 2: Minimizing $TSD(Z, \mu) = \sum_{k=1}^{K} \sum_{i: z_i = k} \| x_i - \mu_k \|^2$
The k-means algorithm is an iterative descent algorithm. The
steps are sketched in Figure 3
Figure 3: Iterative descent algorithm,
See the R program in the next page. Use the part of the program
indicated there to do the homework.
Topic 2. Expectation Maximization (EM) Clustering
EM clustering is the probabilistic version of k-means. We will
see that after we have studied numerical MLE.

####################################################
##### Stat 102B/Sanchez
#####
##### ID
##### Date:
##### For lecture on k-means algorithm
##### with e-m type optimization.
#####################################################
########
#######################################
# Programs for the k-means algorithm on
# made-up data. For your iris data, you
# just need the code from the indicated line on.

#######################################
### I generate an artificial X matrix to show you
### what to do. Your iris data will be your
### X matrix. So you do not need the first lines
X=matrix(0,ncol=2,nrow=200) # space to put made up data
# mean and covariance parameters of group 1
mu1 <- c(15, 15)
Sigma1 <- matrix(c(20, -.8, -.8, 15), nrow = 2, ncol = 2)
mu2 <- c(30,30)
Sigma2 <- matrix(c(40,0.6,0.6,60),ncol=2)
rmvn.eigen <-
function(n, mu, Sigma) {

# generate n random vectors from MVN(mu, Sigma)
# dimension is inferred from mu and Sigma
d <- length(mu)
ev <- eigen(Sigma, symmetric = TRUE)
lambda <- ev$values
V <- ev$vectors
R <- V %*% diag(sqrt(lambda)) %*% t(V)
Z <- matrix(rnorm(n*d), nrow = n, ncol = d)
X <- Z %*% R + matrix(mu, n, d, byrow = TRUE)
X
}
# generate the sample
X[1:100,] <- rmvn.eigen(100, mu1, Sigma1)

X[101:200,]<-rmvn.eigen(100,mu2,Sigma2)
# plot to get an idea of initial values
############ Give an X matrix with your IRIS data ########
###### your program for hwk would start here #################
plot(X)
########### choose initial values of mu #####
#### Note, you must choose your initial values for the iris data ##
### The ones below are for the artificial data. ######
c1=c(12,12) #initial center for cluster 1
c2=c(35,35) #initial center for cluster 2
pastIndicator=200:1 #initial value for z
indicator=1:200 # past indicator will be compared with new indicator
### note: we initialize this way to get the algorithm started
###### We must iterate until z does not change, i.e., until pastIndicator=indicator
while(sum(pastIndicator!=indicator)!=0)
{
pastIndicator=indicator;

#distance to current cluster centers
dc1=colSums((t(X)-c1)^2)
dc2=colSums((t(X)-c2)^2)
dMat=matrix(c(dc1,dc2),,2)
#decide which cluster each point belongs to
indicator = max.col(-dMat)
# update the cluster centers
c1=colMeans(X[indicator==1,])
c2=colMeans(X[indicator==2,])
# Make plot
}
c1 ## to see the mu's to which I converge for group 1
c2 # to see the mu's to which I converge for group 2
######## If you want to see which group each observation goes to
########## you type
indicator
pastIndicator
#### Both should be the same. If you look at how I generated the data,
#### notice that I have used two multivariate normals with very different
#### means and var-cov matrices.... You should have the first 100 observations in one
#### group and the next 100 in another group, or very close. This may or may not
### be the case for your
Bryce, a bank official, is married and files a joint return. During
2013 he engages in the following activities and transactions:
a. Being an avid fisherman, Bryce develops an expertise in
tying flies. At times during the year, he is asked to conduct fly-
tying demonstrations, for which he is paid a small fee. He also
periodically sells flies that he makes. Income generated from
these activities during the year is $2,500. The expenses for the
year associated with Bryce’s fly-tying activity include $125
personal property taxes on a small trailer that he uses
exclusively for this purpose, $2,900 in supplies, $270 in repairs
on the trailer, and $200 in gasoline for traveling to the
demonstrations.
b. Bryce sells a small building lot to his brother for $40,000.
Bryce purchased the lot four years ago for $47,000, hoping to
make a profit.
c. Bryce enters into the following stock transactions: (None of
the stock qualifies as small business stock.)
Date Transaction
March 22 Purchases 100 shares of Silver Corporation common
stock for $2,800.

April 5 Sells 200 shares of Gold Corporation common stock for
$8,000. The stock was originally purchased two years ago for
$5,000.
April 15 Sells 200 shares of Silver Corporation common stock
for $5,400. The stock was originally purchased three years ago
for $9,400.
May 20 Sells 100 shares of United Corporation common stock
for $12,000.
The stock was originally purchased five years ago for $10,000.
d. Bryce’s salary for the year is $115,000. In addition to the
items above, he also incurs $5,000 in other miscellaneous
deductible itemized expenses.
Answer the following questions regarding Bryce’s activities for
the year.
1. Compute Bryce’s taxable income for the year.
2. What is Bryce’s basis in the Silver stock he continues to
own?
