numerical-MLE.pdf
Stat 102B/Sanchez
Handout 18
Introduction. Solving maximum likelihood estimation problems with numerical optimization methods.
There are several numerical methods for solving maximum likelihood estimation problems. They find the maximum likelihood estimates and, at the same time, return the Hessian and all the ingredients needed for confidence intervals. These numerical methods were created to solve general optimization problems in calculus, but statisticians learned to apply them to find the maxima of likelihood functions and to solve other statistical problems, as we already discussed in this class. Numerical methods are not needed when you can find a closed-form mathematical solution, but we will use them here in some of these closed-form cases to convince you that numerical methods work.
Topic 1. The mle function in R.
Example 1. The MLE of the parameter of an exponential distribution using R's mle function
We know that the MLE of the parameter of an exponential distribution $f(y) = \theta e^{-\theta y}$, $y > 0$, based on a random sample of size $n$ is $\hat{\theta} = 1/\bar{y}$. But we will show here how a numerical method gives us that result.
Suppose $Y_1, Y_2$ are iid with density $f(y) = \theta e^{-\theta y}$, $y > 0$.
Q1. Write the likelihood function formula.
Mathematically, the maximum likelihood estimator for $\theta$ based on this random sample is $\hat{\theta}_{mle} = \frac{2}{y_1 + y_2}$. This solution is unique and maximizes the log likelihood, which is $2\log\theta - \theta(y_1 + y_2)$.
Q2. Find the first and second derivatives of the log likelihood function with respect to theta. Write the formula for the maximum likelihood estimate of theta and determine whether it is a max, a min or a saddle point.
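As a check on Q2, the derivatives for the two-observation case can be worked out directly from the log likelihood stated earlier. This is only a sketch of the calculus, using the symbols already defined in this example.

```latex
\ell(\theta) = 2\log\theta - \theta (y_1 + y_2), \qquad \theta > 0
\ell'(\theta) = \frac{2}{\theta} - (y_1 + y_2)
\ell''(\theta) = -\frac{2}{\theta^{2}} < 0
```

Setting $\ell'(\theta) = 0$ gives $\hat{\theta} = 2/(y_1 + y_2)$, and since the second derivative is negative for every $\theta > 0$, this critical point is a maximum.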
Although we have the exact analytical solution, let us see how the problem can be solved numerically using the mle function from the stats4 package. The mle function takes as its argument the function that evaluates the negative log likelihood. The negative log likelihood is minimized by a call to optim.
# the observed sample
y = c(0.04304550, 0.50263474)
# the function containing the formula for the negative log likelihood
mlogL = function(theta=1){
  return(-(length(y)*log(theta) - theta*sum(y)))
}
# Finding numerically the MLE and Fisher's information
library(stats4)
fit = mle(mlogL)
summary(fit)
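To see that a numerical answer matches the closed form, the sketch below minimizes the same negative log likelihood with base R's optimize (a one-dimensional optimizer, used here instead of mle purely for illustration) and compares the result with 2/(y1 + y2). The search interval and the tolerance are assumptions, not part of the handout.

```r
# the observed sample from the handout
y <- c(0.04304550, 0.50263474)

# negative log likelihood of the exponential model
mlogL <- function(theta) -(length(y) * log(theta) - theta * sum(y))

# numerical minimizer over an assumed search interval (1e-6, 100)
num <- optimize(mlogL, interval = c(1e-6, 100), tol = 1e-10)
theta_num <- num$minimum

# closed-form MLE: n / sum(y)
theta_closed <- length(y) / sum(y)

# the observed information is n / theta^2, so the standard error is theta / sqrt(n)
se_closed <- theta_closed / sqrt(length(y))

print(c(numerical = theta_num, closed_form = theta_closed, se = se_closed))
```

The two estimates should agree to several decimal places; the standard error line uses the second derivative of the negative log likelihood evaluated at the estimate.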
In this example, the maximum likelihood estimate is about 3.665, which agrees with the closed form 2/(y1 + y2).
Notice how the estimates change after removal of the extra -lambda term in the program.
## Writing the function containing the formula for the negative log likelihood
## of the Gamma distribution.
LL = function(theta, sx, slogx, n) {
  alpha = theta[1]
  lambda = theta[2]
  # Typo fixed: an extra -lambda term was removed from this expression;
  # its removal affects the results below
  loglik = -n*log(gamma(alpha)) - lambda*sx +
    n*alpha*log(lambda) +
    (alpha-1)*slogx
  -loglik
}
# Generate artificial gamma data
x = rgamma(20, 5, 2)
# Apply optim with initial values alpha=1, lambda=1
optim(c(1,1), LL, sx=sum(x), slogx=sum(log(x)), n=20, hessian=TRUE)
95% CI for $\alpha$: $4.966438 \pm 1.96(1.520707)$
95% CI for $\lambda$: $1.619640 \pm 1.96(0.5218758)$
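Intervals of this form come from the Hessian returned by optim: the inverse Hessian of the negative log likelihood estimates the covariance matrix of the estimates, and the square roots of its diagonal are the standard errors. The sketch below is one way to do this for simulated gamma data. The seed, the use of lgamma, and the guard against non-positive parameters are assumptions added for illustration, so the numbers will not match those reported above.

```r
set.seed(102)  # assumed seed, only so the sketch is reproducible

# negative log likelihood of Gamma(alpha, lambda) from sufficient statistics
negLL <- function(theta, sx, slogx, n) {
  alpha <- theta[1]
  lambda <- theta[2]
  if (alpha <= 0 || lambda <= 0) return(Inf)  # outside the parameter space
  loglik <- -n * lgamma(alpha) - lambda * sx +
    n * alpha * log(lambda) + (alpha - 1) * slogx
  -loglik
}

# artificial gamma data, as in the handout
x <- rgamma(20, shape = 5, rate = 2)

fit <- optim(c(1, 1), negLL, sx = sum(x), slogx = sum(log(x)),
             n = length(x), hessian = TRUE)

# standard errors: square roots of the diagonal of the inverse Hessian
se <- sqrt(diag(solve(fit$hessian)))

# approximate 95% confidence intervals: estimate +/- 1.96 * standard error
crit <- qnorm(0.975)
ci <- cbind(lower = fit$par - crit * se, upper = fit$par + crit * se)
rownames(ci) <- c("alpha", "lambda")
print(ci)
```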
Topic 3. Finding maximum likelihood estimators with the nlm function.
The R function nlm minimizes arbitrary ...
data = c(2,0,3,4,6,2,1,2,1,5,3,7,9,2,4,3)
#### Specify the negative log likelihood function (nlm minimizes it)
my.mle.model = function(parameters, data) {
  sum(-dpois(data, parameters, log=TRUE)) }
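A minimal sketch of how a negative log likelihood like this can be passed to nlm follows. The function name negLL_pois, the variable name counts, and the starting value p = 2 are assumptions made for illustration. For the Poisson model the closed-form MLE is the sample mean, which gives a simple check on the numerical result.

```r
# observed counts from the handout
counts <- c(2, 0, 3, 4, 6, 2, 1, 2, 1, 5, 3, 7, 9, 2, 4, 3)

# negative log likelihood of a Poisson(lambda) sample
negLL_pois <- function(lambda, data) {
  -sum(dpois(data, lambda, log = TRUE))
}

# minimize with nlm; extra arguments (data) are passed on to the function
fit <- suppressWarnings(
  nlm(negLL_pois, p = 2, data = counts, hessian = TRUE)
)

lambda_hat <- fit$estimate
# standard error from the inverse of the (1 x 1) Hessian
se_hat <- sqrt(1 / fit$hessian[1, 1])

print(c(nlm = lambda_hat, sample_mean = mean(counts), se = se_hat))
```

The analytic standard error at the MLE is sqrt(lambda_hat / n), so the value obtained from the numerical Hessian should be close to that.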
Let's simulate some Poisson random numbers with known lambda and then apply the function to see ...
crit = qnorm((1+conf.level)/2)
inv.fish = solve(fish)
theta.hat[1] + c(-1,1)*crit*sqrt(inv.fish[1,1])
theta.hat[2] + c(-1,1)*crit*sqrt(inv.fish[2,2])
As you can see, there are several ways of writing the code to obtain what you want. Just keep in mind that the square root of the diagonal ...
J. Sanchez
UCLA Department of Statistics
Topic 1. Machine Learning basic principle from Probability.
Bayes theorem is sometimes used in classification of items
where a system has already learnt the probabilities.
Suppose there are two classes, y = 1 and y = 2 into which we
can classify w, a new value of the item. By Bayes theorem,
we can write
$$P(y = 1 \mid w) = \frac{P(y = 1 \cap w)}{P(w)} = \frac{P(y = 1)P(w \mid y = 1)}{P(w)}$$
$$P(y = 2 \mid w) = \frac{P(y = 2 \cap w)}{P(w)} = \frac{P(y = 2)P(w \mid y = 2)}{P(w)}$$
Dividing,
$$\frac{P(y = 1 \mid w)}{P(y = 2 \mid w)} = \frac{P(y = 1)P(w \mid y = 1)}{P(y = 2)P(w \mid y = 2)}$$
Our decision is to classify a new example into class 1 if
$$\frac{P(y = 1 \mid w)}{P(y = 2 \mid w)} > 1$$
or equivalently if
$$\frac{P(y = 1)P(w \mid y = 1)}{P(y = 2)P(w \mid y = 2)} > 1$$
which means that w goes into class 1 if
$$P(y = 1)P(w \mid y = 1) > P(y = 2)P(w \mid y = 2)$$
and w goes into class 2 if
$$P(y = 1)P(w \mid y = 1) < P(y = 2)P(w \mid y = 2).$$
When
$$P(y = 1)P(w \mid y = 1) = P(y = 2)P(w \mid y = 2),$$
the result is inconclusive.
The conditional probabilities P(w | y = 1) and P(w | y = 2) are
assumed to be already learnt, as are the prior probabilities
P(y = 1) and P(y = 2). If these can be accurately estimated, the
classifications will have a high probability of being correct. For
example, an e-mail spam filter has learned from past e-mails
what proportion are spam (y=1) and what proportion are not (y=2).
It has also been tracking what proportion of those spam e-mails
contain the sentence "click here" (event w), and thus knows
P(w | y = 1).
Similarly, it has been tracking what percentage of e-mails that
are not spam contain the same sentence, and thus knows
P(w | y = 2).
In fact, many commercial spam filters are based on this basic
training on past e-mails and Bayes theorem. With that
information, answer the following question:
Suppose the prior probabilities of being in either of the two
classes are P(y = 1) = 0.4 and P(y = 2) = 0.6. Also, the
conditional probabilities for the new example w are P(w | y = 1)
= 0.5 and P(w | y = 2) = 0.3. Into what class should you
classify the new example? Show the work.
Solution
1.
P(y = 1)P(w | y = 1) = 0.4(0.5) = 0.2
P(y = 2)P(w | y = 2) = 0.6(0.3) = 0.18
and since
P(y = 1)P(w | y = 1) > P(y = 2)P(w | y = 2) ,
the new example goes into class 1.
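The decision rule can be sketched in a few lines of R using the numbers from this question. The variable names are assumptions made for illustration; the posterior probabilities are included only to show that normalizing by P(w) does not change the decision.

```r
# prior probabilities P(y = 1), P(y = 2) and likelihoods P(w | y = 1), P(w | y = 2)
prior <- c(0.4, 0.6)
lik   <- c(0.5, 0.3)

# unnormalized posteriors P(y = k) P(w | y = k)
joint <- prior * lik

# normalized posteriors P(y = k | w); dividing by P(w) = sum(joint)
posterior <- joint / sum(joint)

# classify into the class with the larger product; NA if inconclusive
decision <- if (joint[1] > joint[2]) 1 else if (joint[1] < joint[2]) 2 else NA

print(joint)
print(posterior)
print(decision)
```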
March 10, 2014 1
Stat 102B -Computation and Optimization in Statistics
Handout 19
NAME (Last, First):————————— UCLA ID:—————
Date: ——–
J. Sanchez
UCLA Department of Statistics
Example 1. The implementation of EM clustering uses the above
reasoning. The steps followed can be summarized in the
following example. We have a small data set with n = 5
observations, each observation being a vector of several
variables. It is believed that the observations can be grouped
in 2 clusters. Suppose we have designed a method that tells us
the following about the latent variable z representing the group
number.
• Observation 1 has probability 0.1 of being in group 1 and
probability 0.9 of being in group 2. Then we allocate
observation 1 to group 2 (z=2).
• Observation 2 has probability 0.8 of being in group 1 and 0.2
of being in group 2. Then observation 2 is allocated to
group 1 (z=1).
• Observation 3 has probability 0.5 of being in either group, so
it could go either way.
• Observation 4 has probability 0.3 of being in group 1, and
probability 0.7 of being in group 2, so it is assigned to group
2.
• Observation 5 has probability 0.4 of being in group 1, and
probability 0.6 of being in group 2.
Because we have that fifty-fifty situation, we cannot say clearly
how many go to group 1 or group 2. So a possibility is to add
the probabilities of each group, and take that as an estimate of
the number of observations in each group.
$$n_1 = 0.1 + 0.8 + 0.5 + 0.3 + 0.4 = 2.1, \qquad n_2 = 0.9 + 0.2 + 0.5 + 0.7 + 0.6 = 2.9.$$
This can be considered an estimate of the expected number of
observations in each group (Expectation step). The resulting
estimated proportion of observations in each group can be
denoted by
$$\alpha_1 = \frac{2.1}{5}, \qquad \alpha_2 = \frac{2.9}{5}.$$
Then, given that, we estimate the means of each of the groups as
follows:
$$\mu_1 = \frac{1}{2.1}\left(0.1x_{1,1} + 0.8x_{2,1} + 0.5x_{3,1} + 0.3x_{4,1} + 0.4x_{5,1}\right)$$
$$\mu_2 = \frac{1}{2.9}\left(0.9x_{1,2} + 0.2x_{2,2} + 0.5x_{3,2} + 0.7x_{4,2} + 0.6x_{5,2}\right)$$
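The expected group sizes and proportions in this example are just column sums of the membership probabilities. The sketch below reproduces them; the matrix name W and the helper weighted_means are hypothetical, and the helper assumes that each observation is a row of a data matrix X and that each group mean is the probability-weighted average of those rows.

```r
# membership probabilities from the example: rows = observations, columns = groups
W <- cbind(group1 = c(0.1, 0.8, 0.5, 0.3, 0.4),
           group2 = c(0.9, 0.2, 0.5, 0.7, 0.6))

n  <- nrow(W)
nk <- colSums(W)   # expected number of observations in each group
alpha <- nk / n    # estimated proportion in each group

# hypothetical helper: weighted mean vector for each group,
# one row per group, given an n x p data matrix X
weighted_means <- function(W, X) crossprod(W, X) / colSums(W)

print(nk)
print(alpha)
```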
Topic 2. Model-Based clustering (MBC) with mixtures
Model-based clustering also envisages a dataset as made of
several latent (that is, missing, unobserved) strata or
subpopulations. Depending on the setting, the inferential goal may be
either to reconstitute the groups by estimating the missing
component, z, an operation called classification or clustering, to
provide estimators for the parameters of the different groups,
or even to estimate the number K of groups (Markov switching
models).
Since objects within a class differ from one another, it is
reasonable to assume the existence of a probability distribution
of characteristics for a population belonging to this class.
(a) Elements from a different class will have a different
probability distribution $f_k(x_i \mid \theta_k)$, $k = 1, ..., K$.
(b) The combined population taken from all classes will have a
probability distribution which is a mixture of distributions
$$f(x_i \mid \theta) = \sum_{k=1}^{K} p_k f_k(x_i \mid \theta_k), \qquad \sum_{k=1}^{K} p_k = 1,$$
where $p_k \geq 0$, $K$ is the number of latent clusters (unknown), and
$i = 1, ..., n$ is the observation number. The parameters $\theta_k$, $p_k$
are unknown.
We distinguish the weights $p_k$ from the other parameters, $\theta$. The
weights are associated with the missing data structure of
the model (i.e., the allocation of the observations to a given
unknown cluster), while the others are related to the
observations within a cluster.
The maximum likelihood estimates of the parameters are those
values of $\{p_j, \theta_j\}$ that maximize the likelihood of the sample
$$L = \prod_{i=1}^{n} \sum_{j=1}^{K} p_j f_j(x_i \mid \theta_j)$$
subject to the constraint $\sum_{k=1}^{K} p_k = 1$, which can be handled
using a Lagrange multiplier. An analytical solution
of this problem is not possible. Thus,
finding the clusters with maximum likelihood estimation of
mixtures involves using the EM algorithm or Bayesian Markov
chain Monte Carlo methods (Stat 102C will cover the latter in more
detail; here we will just introduce MCMC on Wednesday).
Topic 3. EM clustering
EM clustering is the probabilistic version of k-means.
EM clustering consists of thinking about the mixture problem as
a missing data problem, i.e., a problem where
$$z_i \sim \text{Multinomial}(p_1, p_2, ..., p_K), \quad i = 1, ..., n$$
and then defining a complete data likelihood
$$L = \prod_{i=1}^{n} \prod_{j=1}^{K} \left[ p_j f_j(x_i \mid \theta_j) \right]^{z_{ij}}$$
where $z_{ij} = 1$ if $z_i = j$ and 0 otherwise.
Integrating out z we are back to the previous likelihood. We
could use this last likelihood to maximize as usual, given the
values of $z_i$, $i = 1, ..., n$.
The method then goes as follows:
• E-step: $z^{(t)} = E_{p(z \mid \theta, x, p)}[\, l(\theta \mid y, z) \,]$
• M-step: Maximize $l(\theta, p \mid y, z^{(t)})$, which gives $\theta^{(t)}$ and $p^{(t)}$.
The procedure has these properties:
• The procedure usually converges to a local maximum, and
given its simplicity, it is widely used; i.e., $l(\mu^{(t+1)}, \alpha^{(t+1)}) \geq l(\mu^{(t)}, \alpha^{(t)})$.
• $\{\mu_k^{(t)}, \alpha_k^{(t)}\}_{k=1}^{K} \Rightarrow$ MLE of $\mu$ and $\alpha$ as $t \to \infty$.
• Determines the final cluster assignments by assigning each
row of the data matrix to cluster $k^* = \arg\max_{1 \leq k \leq K} w_{ik}^{N}$,
with N being the last iteration.
Topic 4. Model assumptions for implementation
• $p(z_i = k) = \alpha_k$, for $k = 1, ..., K$, with $\sum_{k=1}^{K} \alpha_k = 1$.
• $[x_i \mid z_i = k] \sim N(\mu_k, \sigma^2 I)$, $\sigma^2$ given.
Unknown parameters are $(\alpha_1, ..., \alpha_K, \mu_1, ..., \mu_K)$.
Observed data $X = (x_{ij})_{n \times p}$, missing data $Z = (z_1, z_2, ..., z_n)$,
the class labels.
Topic 5. Preliminaries
• (1)
$$P(z_i = k \mid x_i) = \frac{P(x_i \mid z_i = k)P(z_i = k)}{\sum_{k'=1}^{K} P(x_i \mid z_i = k')P(z_i = k')} = \frac{\alpha_k f_k(x_i)}{\sum_{k'=1}^{K} \alpha_{k'} f_{k'}(x_i)} = w_{ik}$$
where $f_k(x_i)$ is the multivariate normal density $N(\mu_k, \sigma^2 I)$.
• (2) If Z is given, then the MLE of $\mu_1, ..., \mu_K$ and $\alpha_1, ..., \alpha_K$ is
given by
$$\hat{\alpha}_k = \frac{n_k}{n},$$
where $n_k$ = number of observations for which $z_i = k$, and
$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i: z_i = k} x_i,$$
for $k = 1, 2, ..., K$.
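Topic 6. EM clustering algorithm
Choose $\mu_1^{(1)}, ..., \mu_K^{(1)}$ and $\alpha_1^{(1)}, ..., \alpha_K^{(1)}$, perhaps by visual
inspection of the data or based on prior estimates or information from
others. Then, for t = 1, 2, ..., N:
• E-step: Given $\{\mu_k^{(t)}, \alpha_k^{(t)}\}_{k=1}^{K}$, compute for each $x_i$,
$$w_{ik}^{(t)} = \frac{\alpha_k^{(t)} f_k^{(t)}(x_i)}{\sum_{k'=1}^{K} \alpha_{k'}^{(t)} f_{k'}^{(t)}(x_i)} = \frac{\alpha_k^{(t)} \exp\left(-\frac{1}{2\sigma^2} \| x_i - \mu_k^{(t)} \|^2\right)}{\sum_{k'=1}^{K} \alpha_{k'}^{(t)} \exp\left(-\frac{1}{2\sigma^2} \| x_i - \mu_{k'}^{(t)} \|^2\right)}, \quad \text{for } k = 1, ..., K.$$
• M-step: Given $\{w_{ik}^{(t)} : i = 1, ..., n;\ k = 1, ..., K\}$, estimate
$$\alpha_k^{(t+1)} = \frac{n_k^{(t+1)}}{n}, \qquad n_k^{(t+1)} = \sum_{i=1}^{n} w_{ik}^{(t)}$$
$$\mu_k^{(t+1)} = \frac{1}{n_k^{(t+1)}} \sum_{i=1}^{n} w_{ik}^{(t)} x_i$$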
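The two steps can be sketched as a single R function under the model assumptions of Topic 4 (spherical normal components with known sigma squared). The function name em_step, the initial centers, and the equal initial proportions are hypothetical choices made for illustration; the data matrix uses the five observations listed in Problem 1 of this handout. The weights are computed on the log scale and shifted by the row maximum, which is an implementation choice to avoid numerical underflow and does not change the result.

```r
# one EM iteration for a mixture of N(mu_k, sigma2 * I) components
# X: n x p data matrix, mu: K x p matrix of centers, alpha: length-K proportions
em_step <- function(X, mu, alpha, sigma2 = 1) {
  n <- nrow(X)
  K <- nrow(mu)
  # E-step: log of alpha_k * exp(-||x_i - mu_k||^2 / (2 sigma2)) for every i, k
  logw <- sapply(seq_len(K), function(k) {
    d2 <- rowSums(sweep(X, 2, mu[k, ])^2)
    log(alpha[k]) - d2 / (2 * sigma2)
  })
  logw <- logw - apply(logw, 1, max)  # stabilize before exponentiating
  W <- exp(logw)
  W <- W / rowSums(W)                 # each row now sums to 1
  # M-step: expected counts, proportions and weighted means
  nk <- colSums(W)
  list(W = W, alpha = nk / n, mu = crossprod(W, X) / nk)
}

# data from Problem 1; starting values below are assumed, not from the handout
X <- rbind(c(1, 2), c(3, 1), c(10, 11), c(12, 14), c(2, 4))
mu0 <- rbind(c(2, 2), c(11, 12))
alpha0 <- c(0.5, 0.5)

step1 <- em_step(X, mu0, alpha0, sigma2 = 1)
print(round(step1$W, 4))
print(step1$alpha)
print(step1$mu)

# final assignment rule: the cluster with the largest weight in each row
print(max.col(step1$W))
```

Calling em_step repeatedly, feeding the returned mu and alpha back in, gives the iterations t = 1, 2, ..., N.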
Topic 7. Some background on the EM algorithm
Dempster, Laird and Rubin's (1977) seminal paper on the EM
algorithm stimulated interest in the use of finite mixture
distributions to model heterogeneous data. This is because the
fitting of mixture models by maximum likelihood is a classic
example of a problem that is simplified considerably by the
EM's conceptual unification of maximum likelihood (ML)
estimation from data that can be viewed as being incomplete.
With the considerable attention being given to the analysis of
large data sets, as in typical data mining applications, recent
work on speeding up the implementation of the EM algorithm is
widely discussed, including (a) the use of the sparse/incremental
EM and of multiresolution kd-trees and (b) the scaling of the
EM algorithm to massively large databases where there is a
limited memory buffer.
The EM algorithm is a non-Monte Carlo algorithm used to locate
the mode or modes of the likelihood function or the
posterior distribution. It does not require the input of a stream
of pseudo-random numbers. With EM, one augments the observed
data with latent data such that one complicated maximization is
replaced by an iterative series of simple maximizations.
Problem 1. Suppose that in the first example seen in this
lecture, the observations are: (1,2), (3,1), (10,11), (12,14), (2,4).
Compute the next E-step and M-step. Provide the values of the
parameters.
HWK-7R-script-start.R.txt
##################################################################
# Stat 102B/Sanchez UCLA ID
# Date
# Homework 7, Program 1.
#
# MLE estimation of parameters of the log normal distribution
# fitted to the radon data
##
# This program fits a log normal model to the radon
# data. It uses the function nlm in R, which is set to
# minimize the negative of the log likelihood (that is equivalent
# to maximizing the log likelihood).
##################################################################
# Read the data from its web site
data = read.table("http://www.stat.berkeley.edu/users/statlabs/data/radon.data", header=T)
attach(data)
head(data) # to see the names of the variables in the data set.
y = data$radon # more convenient to call it y
n = length(y) # number of observations in the radon data set
############################################################################
## View the distribution of the data and guess a model. Play with several
## models to see how they fit. Since the problem asks to fit a
## log normal distribution, we fit several ad hoc log normal models.
############################################################################
hist(y, prob=T, ylim=c(0,0.3))
# define discrete values of x over specified range
x = seq(0, max(y), by=0.1)
# plot log normal densities for several parameter values to get an idea
points(x, dlnorm(x, meanlog=0, sdlog=1, log=FALSE),
       col="red", type="o", pch=21, bg="red")
points(x, dlnorm(x, meanlog=1, sdlog=0.5, log=FALSE),
       col="green", type="o", pch=22, bg="red")
points(x, dlnorm(x, meanlog=3, sdlog=1, log=FALSE),
       col="purple", type="o", pch=24, bg="red")
points(x, dlnorm(x, meanlog=1.5, sdlog=0.8, log=FALSE),
       col="brown", type="o", pch=25, bg="red")
# create legend (colors match the curves drawn above)
legend(25, 0.25, legend=c("meanlog=0, sdlog=1", "meanlog=1, sdlog=0.5",
       "meanlog=3, sdlog=1", "meanlog=1.5, sdlog=0.8"),
       cex=0.75, pch=c(21,22,24,25),
       col=c("red","green","purple","brown"), pt.bg="red")
## This graph must be put in the homework document that you turn in in lecture.
#######################################
# Write the negative log likelihood. Use the formula in the radon article
# posted in the homework web site
## to find the likelihood and log likelihood.
## To minimize - log likelihood we will ignore terms not depending on
## parameters. Use the programs learned in Handout 18.
########################################
# negative log-likelihood: p=c(sigma2, gamma), y=radon, n=total #observations
##### Write your program here. and finish.
homework7.pdf
Stat 102B - Computation and Optimization in Statistics
Homework 7
J. Sanchez
UCLA Department of Statistics
Instructions
(1) Homework must be stapled. Writing in two columns per
page not allowed.
(2) No late homework accepted under any circumstances.
(3) THERE IS ONE R script file to upload. It must be uploaded
before the deadline. The hard copy part with answers
must be turned in in lecture the due time or before the deadline.
(4) Hardcopy with answers must be handed in person to prof.
Sanchez at the beginning of lecture. Homework turned
in at the end of lecture will get points deducted. No email,
mailboxes, fax or other way of turning it in will be
allowed. If you need to turn in your homework early, please
contact prof. Sanchez and make arrangements with
her.
(5) Write your Last name, first name, ID, Hwk number, date and
your section on the upper right corner of the hardcopy
homework. Your script file must conform to sample script file
and also have your name inside and as a file name.
(6) To get full credit, you must show work even when not asked
and pay attention to the instructions and follow
them. Points will be deducted for not following instructions
given in each problem. You are also responsible for
uploading your R script early. No hard copies of R scripts will
be accepted. Excuses about individual technical
difficulties will not be accepted. Plan to do it early to get help
from us if needed.
(7) Must answer problems in the order given. There should be
no R code whatsoever in your hardcopy with answers
turned in in lecture. Must use notation used in lecture.
(8) It is ok to work with other students for homework but each
student must turn in their own writing of the problems.
Evidence to the contrary will result in 0 points for all parties
involved.
(9) Hardcopy part of homework can be hand written ONLY. If
hand written, your writing must be neat and easy
to read. You may not use double columns to write your answers.
Open an R script file, put your name, ID, date and homework
number as heading. Then add to it all the programs
requested in the following problems, in the order requested, and
well labelled and separated, as usual. If still in doubt,
look at past R script answer keys for format.
Problem 1. The article “Minnesota Radon Levels“ (posted with
this homework) contains data on radon levels in
Minnesota houses in Minnesota counties. On page 69 a log
normal model is suggested as a possible candidate for the
mechanism generating the radon data. Your job is to use
contents of lecture handout 18 (the updated version posted
on Friday in CCLE and complementary code) to do the
following:
(a) Fit numerically a log normal model to the radon data making
use of R nlm routine seen in Handout 18. For that,
you will need to write a program that you will put in the script
file submitted to CCLE. Program examples can be
seen in Handout 18. Report here, handwritten, the formula for
the log likelihood function.
(b) What are the maximum likelihood parameter estimates and
their standard errors? What is the Hessian matrix?
How do you use the Hessian matrix to compute the standard
errors?
(c) Write by hand here confidence interval for each parameter.
Interpret the intervals.
(d) In addition to that, you will report here the histogram of the
radon data and the fitted estimated model on top of it.
Comment on the fit. Is it good, bad?
Note: a small program showing how to read the radon data is
posted next to this homework.
Problem 2. This problem uses the example seen in lecture on
3/10.
We have a small data set with n = 5 observations, each
observation being a vector of several variables. It is
believed that the observations can be grouped in 2 clusters.
Suppose we have designed a method that tells us the
following about latent variable z representing the group number.
• Observation 1 has probability 0.1 of being in group 1 and
probability 0.9 of being in group 2. Then we allocate
observation 1 to group 2 (z=2).
• Observation 2 has probability 0.8 of being in group 1 and 0.2
of being in group 2. Then observation 2 is allocated
to group 1 (z=1).
• Observation 3 has probability 0.5 of being in either group, so
it could go either way.
• Observation 4 has probability 0.3 of being in group 1, and
probability 0.7 of being in group 2, so it is assigned
to group 2.
• Observation 5 has probability 0.4 of being in group 1, and
probability 0.6 of being in group 2.
Because we have that fifty-fifty situation, we cannot say clearly
how many go to group 1 or group 2. So a possibility
is to add the probabilities of each group, and take that as an
estimate of the number of observations in each group.
$$n_1 = 0.1 + 0.8 + 0.5 + 0.3 + 0.4 = 2.1, \qquad n_2 = 0.9 + 0.2 + 0.5 + 0.7 + 0.6 = 2.9.$$
This can be considered an estimate of the expected number of
observations in each group (Expectation step). The
resulting estimated proportion of observations in each group can
be denoted by
$$\alpha_1 = \frac{2.1}{5}, \qquad \alpha_2 = \frac{2.9}{5}.$$
Then, given that, we estimate the means of each of the groups as
follows:
$$\mu_1 = \frac{1}{2.1}\left(0.1x_{1,1} + 0.8x_{2,1} + 0.5x_{3,1} + 0.3x_{4,1} + 0.4x_{5,1}\right)$$
$$\mu_2 = \frac{1}{2.9}\left(0.9x_{1,2} + 0.2x_{2,2} + 0.5x_{3,2} + 0.7x_{4,2} + 0.6x_{5,2}\right)$$
(Notice the correction to the mistake made writing the last µ2 on the
blackboard; please correct it in the notes.) Suppose this
was iteration 0. Do the E-step for the next iteration, 1, and compute
the Wik matrix. Then do the M-step. Repeat the E and M steps 2 more
times (t = 1, 2, 3). Assume $\sigma^2 = 1$.
TO DO:
Assume that the data matrix is
$$X = \begin{pmatrix} 1 & 2 \\ 3 & 1 \\ 10 & 11 \\ 12 & 14 \\ 2 & 4 \end{pmatrix}$$
Write by hand a table with the values of the following formulas
for the algorithm at t=1,2,3. No R code needed for
this problem. Then determine which clusters your observations
end up at and what are the final MLE estimates of mus
and alphas.
Recall that the formulas for the steps are given in Topic 6, page
3 and 4 of handout 19.
March 10, 2014 2
handout17.pdf
Stat 102B -Computation and Optimization in Statistics
Handout 17
NAME (Last, First):————————— UCLA ID:—————
Date: ——–
J. Sanchez
UCLA Department of Statistics
Topic 1. Review of k-means clustering. An algorithm to implement it.
Some time ago in this class, we saw k-means clustering. We
repeat the exercise now but using the fancier notation
introduced here (see attached exercise sheet). We will also be
referring to the R program attached. Given an n × p data matrix X
containing data for individuals from K groups, we wanted to group them
into clusters.
Question 1. What other methods did the job of grouping
individuals into clusters?
Question 2. What were those other methods based on?
Consider an unknown vector $Z = (z_1, ..., z_n)$ where each $z_i \in \{1, ..., K\}$
is the cluster label of row $X_i$. The cluster centers, the
vectors $\mu_k$, $k = 1, ..., K$, are also unknown. For example, if K = 2,
Figure 1 shows a hypothetical data set with two mean
vectors $\mu_1, \mu_2$ and two groups $z_1, z_2$.
Figure 1: Both z and µ vectors are unknown
As we saw earlier, to allocate a row of the matrix to a cluster,
we choose the cluster that minimizes the total squared distance
(TSD) of the row from the vector of means for that cluster, i.e.,
$$TSD_k = \sum_{i: z_i = k} \| x_i - \mu_k \|^2.$$
For all data points, $TSD = \sum_{k=1}^{K} TSD_k$.
We want to find Z and $\{\mu_k\}$ such that TSD is minimized. The
problem is
$$\min\ TSD(Z, \mu) = \sum_{k=1}^{K} \sum_{i: z_i = k} \| x_i - \mu_k \|^2 \qquad (\mu = \{\mu_k,\ k = 1, ..., K\}).$$
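The objective can be written as a short R function, which is handy for checking that it does not increase from one iteration to the next. The function name tsd and the tiny data set are hypothetical, made up only to illustrate the formula.

```r
# total squared distance of each row of X from its assigned cluster center
# X: n x p data matrix, z: length-n vector of labels in 1..K, mu: K x p centers
tsd <- function(X, z, mu) {
  sum((X - mu[z, , drop = FALSE])^2)
}

# made-up illustration: three points, two clusters
X_demo  <- rbind(c(0, 0), c(2, 0), c(10, 10))
z_demo  <- c(1, 1, 2)
mu_demo <- rbind(c(1, 0), c(10, 10))

print(tsd(X_demo, z_demo, mu_demo))
```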
The k-means algorithm that we used was then the following:
Choose initial centers $\mu_1, ..., \mu_K$. Iterate the following two
steps (a) and (b) until Z does not change.
(a) Given cluster centers $\{\mu_k,\ k = 1, ..., K\}$, assign each $X_i$ to
the closest cluster center
$$z_i = \arg\min_{1 \leq k \leq K} \| x_i - \mu_k \|^2, \quad i = 1, ..., n$$
February 27, 2014 1
(b) Given $Z = \{z_1, ..., z_n\}$, update the centers by
$$\mu_k = \frac{1}{n_k} \sum_{i: z_i = k} x_i, \qquad n_k = \#\{i : z_i = k\}$$
for $k = 1, ..., K$.
We can see this process in Figure 2.
Figure 2: Minimizing $TSD(Z, \mu) = \sum_{k=1}^{K} \sum_{i: z_i = k} \| x_i - \mu_k \|^2$
The k-means algorithm is an iterative descent algorithm. The
steps are sketched in Figure 3.
Figure 3: Iterative descent algorithm
See the R program on the next page. Use the part of the program
indicated there to do the homework.
Topic 2. Expectation Maximization (EM) Clustering
EM clustering is the probabilistic version of k-means. We will
see that after we have studied numerical MLE.
####################################################
##### Stat 102B/Sanchez
#####
##### ID
##### Date:
##### For lecture on k-means algorithm
##### with e-m type optimization.
####################################################
#######################################
# Programs for k-means algorithm of
# made up data. For your Iris data, you
# just need from the indicated line on.
#######################################
### I generate an artificial X matrix to show you
### what to do. Your iris data will be your
### X matrix. So you do not need the first lines
X = matrix(0, ncol=2, nrow=200) # space to put made up data
# mean and covariance parameters of group 1
mu1 <- c(15, 15)
Sigma1 <- matrix(c(20, -.8, -.8, 15), nrow = 2, ncol = 2)
# mean and covariance parameters of group 2
mu2 <- c(30, 30)
Sigma2 <- matrix(c(40, 0.6, 0.6, 60), ncol = 2)
rmvn.eigen <-
function(n, mu, Sigma) {
  # generate n random vectors from MVN(mu, Sigma)
  # dimension is inferred from mu and Sigma
  d <- length(mu)
  ev <- eigen(Sigma, symmetric = TRUE)
  lambda <- ev$values
  V <- ev$vectors
  R <- V %*% diag(sqrt(lambda)) %*% t(V)
  Z <- matrix(rnorm(n*d), nrow = n, ncol = d)
  X <- Z %*% R + matrix(mu, n, d, byrow = TRUE)
  X
}
# generate the sample
X[1:100,] <- rmvn.eigen(100, mu1, Sigma1)
X[101:200,] <- rmvn.eigen(100, mu2, Sigma2)
# plot to get an idea of initial values
############ Give an X matrix with your IRIS data ########
###### your program for hwk would start here #################
plot(X)
########### choose initial values of mu #####
#### Note, you must choose your initial values for the iris data ##
### The ones below are for the artificial data. ######
c1 = c(12,12) # initial center for cluster 1
c2 = c(35,35) # initial center for cluster 2
pastIndicator = 200:1 # initial value for z
indicator = 1:200 # past indicator will be compared with new indicator
### note: we initialize this way to get the algorithm started
###### We must iterate until z does not change, i.e., until pastIndicator=indicator
while(sum(pastIndicator != indicator) != 0)
{
  pastIndicator = indicator;
  # distance to current cluster centers
  dc1 = colSums((t(X)-c1)^2)
  dc2 = colSums((t(X)-c2)^2)
  dMat = matrix(c(dc1,dc2),,2)
  # decide which cluster each point belongs to
  indicator = max.col(-dMat)
  # update the cluster centers
  c1 = colMeans(X[indicator==1,])
  c2 = colMeans(X[indicator==2,])
  # Make plot
}
c1 ## to see the mu's to which I converge for group 1
c2 # to see the mu's to which I converge for group 2
######## If you want to see which group each observation goes to
########## you type
indicator
pastIndicator
#### Both should be the same. If you look at how I generated the data
#### notice that I have used two multivariate normals with very different
#### means and var-cov matrices.... You should have the first 100 observations in one
#### group and the next 100 in another group, or very close. This may or may not
### be the case for your
Bryce, a bank official, is married and files a joint return. During
2013 he engages in the following activities and transactions:
a. Being an avid fisherman, Bryce develops an expertise in
tying flies. At times during the year, he is asked to conduct fly-
tying demonstrations, for which he is paid a small fee. He also
periodically sells flies that he makes. Income generated from
these activities during the year is $2,500. The expenses for the
year associated with Bryce’s fly-tying activity include $125
personal property taxes on a small trailer that he uses
exclusively for this purpose, $2,900 in supplies, $270 in repairs
on the trailer, and $200 in gasoline for traveling to the
demonstrations.
b. Bryce sells a small building lot to his brother for $40,000.
Bryce purchased the lot four years ago for $47,000, hoping to
make a profit.
c. Bryce enters into the following stock transactions: (None of
the stock qualifies as small business stock.)
Date Transaction
March 22 Purchases 100 shares of Silver Corporation common
stock for $2,800.
April 5 Sells 200 shares of Gold Corporation common stock for
$8,000. The stock was originally purchased two years ago for
$5,000.
April 15 Sells 200 shares of Silver Corporation common stock
for $5,400. The stock was originally purchased three years ago
for $9,400.
May 20 Sells 100 shares of United Corporation common stock
for $12,000.
The stock was originally purchased five years ago for $10,000.
d. Bryce’s salary for the year is $115,000. In addition to the
items above, he also incurs $5,000 in other miscellaneous
deductible itemized expenses.
Answer the following questions regarding Bryce’s activities for
the year.
1. Compute Bryce’s taxable income for the year.
2. What is Bryce’s basis in the Silver stock he continues to
own?