The document discusses likelihood functions and inference. It begins by defining the likelihood function as the function that gives the probability (or density) of observing a sample given a parameter value: the likelihood varies with the parameter, while the density function varies with the data. Maximum likelihood estimation chooses the parameter value that maximises the likelihood function. The score function is the gradient of the log-likelihood and has expectation zero at the true parameter value. The Fisher information matrix measures the curvature of the likelihood surface, quantifies the precision of parameter estimates, and relates to the concentration of the likelihood around the true parameter value as the sample size increases.
Statistics (1): Estimation

1. Chapter 3: Likelihood function and inference
4. Likelihood function and inference
- The likelihood
- Information and curvature
- Sufficiency and ancillarity
- Maximum likelihood estimation
- Non-regular models
- EM algorithm
2. The likelihood
Given a (usually parametric) family of distributions
$$F \in \{F_\theta,\ \theta \in \Theta\}$$
with densities $f_\theta$ [w.r.t. a fixed measure $\nu$], the density of the iid sample $x_1, \dots, x_n$ is
$$\prod_{i=1}^n f_\theta(x_i)$$
Note: in the special case where $\nu$ is a counting measure,
$$\prod_{i=1}^n f_\theta(x_i)$$
is the probability of observing the sample $x_1, \dots, x_n$ among all possible realisations of $X_1, \dots, X_n$
5. Definition (likelihood function)
The likelihood function associated with a sample $x_1, \dots, x_n$ is the function
$$L : \Theta \to \mathbb{R}_+\,, \qquad \theta \mapsto \prod_{i=1}^n f_\theta(x_i)$$
same formula as the density, but a different space of variation
6. Example: density function versus likelihood function
Take the case of a Poisson density [against the counting measure]
$$f(x; \lambda) = \frac{\lambda^x}{x!}\, e^{-\lambda}\, \mathbb{I}_{\mathbb{N}}(x)$$
which varies in $\mathbb{N}$ as a function of $x$, versus
$$L(\lambda; x) = \frac{\lambda^x}{x!}\, e^{-\lambda}$$
which varies in $\mathbb{R}_+$ as a function of $\lambda$ [here with $x = 3$]
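A minimal Python sketch (not part of the original slides) contrasting the two readings of the same formula: $\lambda$ fixed at 3 while $x$ ranges over $\mathbb{N}$ for the density, and $x$ fixed at 3 while $\lambda$ ranges over $\mathbb{R}_+$ for the likelihood.

```python
import math

def poisson(x, lam):
    """Poisson formula lam^x e^{-lam} / x!, read either as f(x; lam) or L(lam; x)."""
    return lam**x * math.exp(-lam) / math.factorial(x)

# density: lam = 3 fixed, x varies over the integers
density = [poisson(x, 3.0) for x in range(10)]

# likelihood: x = 3 fixed, lam varies over the positive reals
likelihood = [poisson(3, 0.5 * k) for k in range(1, 20)]

print(density)     # sums to ~1 over all of N
print(likelihood)  # does not sum/integrate to 1; peaks at lam = 3
```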
8. Example: density function versus likelihood function
Take the case of a Normal $N(0, \tau)$ density [against the Lebesgue measure]
$$f(x; \tau) = \frac{1}{\sqrt{2\pi\tau}}\, e^{-x^2/2\tau}\, \mathbb{I}_{\mathbb{R}}(x)$$
which varies in $\mathbb{R}$ as a function of $x$, versus
$$L(\tau; x) = \frac{1}{\sqrt{2\pi\tau}}\, e^{-x^2/2\tau}$$
which varies in $\mathbb{R}_+$ as a function of $\tau$ [here with $x = 2$]
10. Example: density function versus likelihood function
Take the case of a Normal $N(0, 1/\tau)$ density [against the Lebesgue measure]
$$f(x; \tau) = \frac{\sqrt{\tau}}{\sqrt{2\pi}}\, e^{-x^2\tau/2}\, \mathbb{I}_{\mathbb{R}}(x)$$
which varies in $\mathbb{R}$ as a function of $x$, versus
$$L(\tau; x) = \frac{\sqrt{\tau}}{\sqrt{2\pi}}\, e^{-x^2\tau/2}$$
which varies in $\mathbb{R}_+$ as a function of $\tau$ [here with $x = 1/2$]
12. Example: Hardy-Weinberg equilibrium
Population genetics: genotypes of biallelic genes AA, Aa, and aa;
sample frequencies $n_{AA}$, $n_{Aa}$ and $n_{aa}$;
multinomial model $\mathcal{M}(n;\, p_{AA}, p_{Aa}, p_{aa})$
related to the population proportion of A alleles, $p_A$:
$$p_{AA} = p_A^2 \qquad p_{Aa} = 2 p_A (1 - p_A) \qquad p_{aa} = (1 - p_A)^2$$
likelihood
$$L(p_A \mid n_{AA}, n_{Aa}, n_{aa}) \propto p_A^{2 n_{AA}}\, [2 p_A (1 - p_A)]^{n_{Aa}}\, (1 - p_A)^{2 n_{aa}}$$
[Boos & Stefanski, 2013]
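For illustration (not in the slides): maximising this likelihood in closed form gives $\hat p_A = (2 n_{AA} + n_{Aa})/2n$, which the short numerical check below confirms on made-up genotype counts.

```python
import math

def hw_loglik(p, nAA, nAa, naa):
    """Hardy-Weinberg log-likelihood, up to an additive constant."""
    return (2 * nAA + nAa) * math.log(p) + (nAa + 2 * naa) * math.log(1.0 - p)

nAA, nAa, naa = 50, 30, 20          # hypothetical sample frequencies
n = nAA + nAa + naa

p_hat = (2 * nAA + nAa) / (2 * n)   # closed-form maximiser
print(p_hat)                        # 0.65

# p_hat beats neighbouring values of p_A
for p in (p_hat - 0.01, p_hat, p_hat + 0.01):
    print(round(p, 3), round(hw_loglik(p, nAA, nAa, naa), 4))
```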
13. Mixed distributions and their likelihood
Special case when a random variable X may take specific values $a_1, \dots, a_k$ and a continuum of values in $A$
Example: rainfall at a given spot on a given day may be zero with positive probability $p_0$ [it did not rain!], or an arbitrary number between 0 and 100 [capacity of the measurement container], or 100 with positive probability $p_{100}$ [container full]
Example: Tobit model, where $y \sim N(X^T\beta, \sigma^2)$ is only observed through $z = \max(y, 0)$
Density of X against the composition of two measures, counting and Lebesgue:
$$f_X(a) = \begin{cases} P_\theta(X = a) & \text{if } a \in \{a_1, \dots, a_k\} \\ f(a \mid \theta) & \text{otherwise} \end{cases}$$
Results in the likelihood
$$L(\theta \mid x_1, \dots, x_n) = \prod_{j=1}^k P_\theta(X = a_j)^{n_j} \prod_{x_i \notin \{a_1, \dots, a_k\}} f(x_i \mid \theta)$$
where $n_j$ is the number of observations equal to $a_j$
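A sketch of such a likelihood for the rainfall example (my illustration: the continuous part is an assumed exponential density truncated to (0, 100); the slides do not specify one), mixing counting-measure atoms at 0 and 100 with a Lebesgue density in between.

```python
import math

def rainfall_loglik(data, p0, p100, lam):
    """Log-likelihood for a mixed distribution: atoms at 0 and 100,
    and an Exp(lam) density truncated to (0, 100) for the continuous part."""
    p_cont = 1.0 - p0 - p100
    trunc = 1.0 - math.exp(-100.0 * lam)   # normalising constant on (0, 100)
    ll = 0.0
    for x in data:
        if x == 0.0:
            ll += math.log(p0)             # counting-measure part: it did not rain
        elif x == 100.0:
            ll += math.log(p100)           # counting-measure part: container full
        else:
            ll += math.log(p_cont * lam * math.exp(-lam * x) / trunc)  # Lebesgue part
    return ll

print(rainfall_loglik([0.0, 12.5, 100.0, 3.2], p0=0.3, p100=0.1, lam=0.05))
```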
20. Enters Fisher, Ronald Fisher!
Fisher's intuition in the 20's:
- the likelihood function contains the relevant information about the parameter
- the higher the likelihood, the more likely the parameter
- the curvature of the likelihood determines the precision of the estimation
21. Concentration of likelihood mode around true parameter
[Figures: likelihood functions for $x_1, \dots, x_n \sim \mathcal{P}(3)$ as $n$ increases, $n = 38, \dots, 240$]
[Figures: likelihood functions for $x_1, \dots, x_n \sim N(0, 1)$ as $n$ increases and as the sample varies]
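A short simulation (my addition) reproducing the idea behind these figures in the Poisson case: the log-likelihood peaks ever more sharply around $\lambda = 3$ as $n$ grows.

```python
import math
import random

random.seed(42)

def sample_poisson(lam, n):
    """n Poisson(lam) draws via Knuth's multiplication method."""
    out = []
    for _ in range(n):
        l, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= random.random()
            if p <= l:
                break
            k += 1
        out.append(k)
    return out

def loglik(lam, xs):
    """Poisson log-likelihood, dropping x!-terms free of lam."""
    return sum(x * math.log(lam) - lam for x in xs)

xs = sample_poisson(3.0, 240)                  # one long sample, reused
grid = [2.0 + 0.1 * i for i in range(21)]      # lambda in [2, 4]
for n in (40, 120, 240):
    ll = [loglik(lam, xs[:n]) for lam in grid]
    # the argmax tightens around 3 and the peak sharpens as n grows
    print(n, grid[ll.index(max(ll))], round(max(ll) - min(ll), 1))
```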
26. Why concentration takes place
Consider $x_1, \dots, x_n \overset{\text{iid}}{\sim} F$. Then
$$\log \prod_{i=1}^n f(x_i \mid \theta) = \sum_{i=1}^n \log f(x_i \mid \theta)$$
and by the LLN
$$\frac{1}{n} \sum_{i=1}^n \log f(x_i \mid \theta) \longrightarrow \int_{\mathcal{X}} \log f(x \mid \theta)\, \text{d}F(x)$$
Since $\int_{\mathcal{X}} \log f(x)\, \text{d}F(x)$ does not depend on $\theta$, maximising this limit in $\theta$ amounts to minimising the divergence below.
Lemma: maximising the likelihood is asymptotically equivalent to minimising the Kullback-Leibler divergence
$$\int_{\mathcal{X}} \log \frac{f(x)}{f(x \mid \theta)}\, \text{d}F(x)$$
Hence the MLE targets the member of the family closest to the true distribution
32. The score function
The score function is defined by
$$\nabla \log L(\theta \mid x) = \frac{\big( \partial/\partial\theta_1\, L(\theta \mid x), \dots, \partial/\partial\theta_p\, L(\theta \mid x) \big)}{L(\theta \mid x)}$$
the gradient (slope) of the likelihood function at the point $\theta$
Lemma: when $X \sim F_\theta$,
$$E_\theta[\nabla \log L(\theta \mid X)] = 0$$
Reason:
$$\int_{\mathcal{X}} \nabla \log L(\theta \mid x)\, \text{d}F_\theta(x) = \int_{\mathcal{X}} \nabla L(\theta \mid x)\, \text{d}x = \nabla \int_{\mathcal{X}} \text{d}F_\theta(x) = \nabla\, 1 = 0$$
Connected with the concentration theorem: the gradient is null on average at the true value of the parameter
Warning: not defined when the likelihood is not differentiable in $\theta$
38. Fisher's information matrix
Another notion attributed to Fisher [more likely due to Edgeworth]
Information: covariance matrix of the score vector
$$I(\theta) = E_\theta\big[ \nabla \log f(X \mid \theta)\, \{\nabla \log f(X \mid \theta)\}^T \big]$$
Often called Fisher information
Measures the curvature of the likelihood surface, which translates as information brought by the data
Sometimes denoted $I_X$ to stress the dependence on the distribution of X
39. Fisher's information matrix
Second derivative of the log-likelihood as well:
Lemma: if $L(\theta \mid x)$ is twice differentiable [as a function of $\theta$]
$$I(\theta) = -E_\theta\big[ \nabla^T \nabla \log f(X \mid \theta) \big]$$
Hence
$$I_{ij}(\theta) = -E_\theta\left[ \frac{\partial^2}{\partial\theta_i\, \partial\theta_j} \log f(X \mid \theta) \right]$$
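As a quick illustration (mine, not the deck's), a Monte Carlo check of both lemmas in the Poisson model, where the score is $x/\lambda - 1$, the second derivative of $\log f$ is $-x/\lambda^2$, and $I(\lambda) = 1/\lambda$.

```python
import math
import random

random.seed(0)
lam = 3.0   # true parameter; I(lam) = 1/lam = 0.333...

def sample_poisson(lam):
    """One Poisson(lam) draw via Knuth's multiplication method."""
    l, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= l:
            return k
        k += 1

n = 200_000
xs = [sample_poisson(lam) for _ in range(n)]
scores = [x / lam - 1.0 for x in xs]

print(sum(scores) / n)                  # ~ 0: E[score] = 0
print(sum(s * s for s in scores) / n)   # ~ 1/3: covariance of the score
print(sum(x / lam**2 for x in xs) / n)  # ~ 1/3: -E[d^2 log f / d lam^2]
```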
44. Properties
Additive features translating as accumulation of information:
- if X and Y are independent, $I_X(\theta) + I_Y(\theta) = I_{(X,Y)}(\theta)$
- $I_{X_1, \dots, X_n}(\theta) = n\, I_{X_1}(\theta)$
- if $X = T(Y)$ and $Y = S(X)$, $I_X(\theta) = I_Y(\theta)$
- if $X = T(Y)$, $I_X(\theta) \le I_Y(\theta)$
If $\eta = \Psi(\theta)$ is a bijective transform, change of parameterisation:
$$I(\theta) = \left( \frac{\partial \Psi}{\partial \theta} \right)^{\!T} I(\eta)\, \frac{\partial \Psi}{\partial \theta}$$
"In information geometry, this is seen as a change of coordinates on a Riemannian manifold, and the intrinsic properties of curvature are unchanged under different parametrization. In general, the Fisher information matrix provides a Riemannian metric (more precisely, the Fisher-Rao metric)." [Wikipedia]
50. Approximations
Back to the Kullback-Leibler divergence
$$D(\theta_0, \theta) = \int_{\mathcal{X}} f(x \mid \theta_0)\, \log \frac{f(x \mid \theta_0)}{f(x \mid \theta)}\, \text{d}x$$
Using a second degree Taylor expansion
$$\log f(x \mid \theta) = \log f(x \mid \theta_0) + (\theta - \theta_0)^T \nabla \log f(x \mid \theta_0) + \frac{1}{2}\, (\theta - \theta_0)^T \nabla\nabla^T \log f(x \mid \theta_0)\, (\theta - \theta_0) + o(\|\theta - \theta_0\|^2)$$
gives the approximation of the divergence:
$$D(\theta_0, \theta) \approx \frac{1}{2}\, (\theta - \theta_0)^T I(\theta_0)\, (\theta - \theta_0)$$
[Exercise: show this is exact in the normal case]
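A worked version of the exercise in the scalar case $N(\theta, \sigma^2)$ with $\sigma^2$ known (my derivation, using $E_{\theta_0}[(x - \theta)^2] = \sigma^2 + (\theta_0 - \theta)^2$):
$$D(\theta_0, \theta) = \int \phi_\sigma(x - \theta_0)\, \frac{(x - \theta)^2 - (x - \theta_0)^2}{2\sigma^2}\, \text{d}x = \frac{(\theta - \theta_0)^2}{2\sigma^2} = \frac{1}{2}\, (\theta - \theta_0)\, I(\theta_0)\, (\theta - \theta_0)$$
since $I(\theta_0) = 1/\sigma^2$ for the normal mean: the quadratic approximation is exact here.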
51. Approximations
Central limit law of the score vector:
$$\frac{1}{\sqrt{n}}\, \nabla \log L(\theta \mid X_1, \dots, X_n) \approx N_p(0, I_{X_1}(\theta))$$
Notation: $I_1(\theta)$ stands for $I_{X_1}(\theta)$ and indicates the information associated with a single observation
52. Sufficiency
What if a transform of the sample, $S(X_1, \dots, X_n)$, contains all the information, i.e.
$$I_{(X_1, \dots, X_n)}(\theta) = I_{S(X_1, \dots, X_n)}(\theta)$$
uniformly in $\theta$? In this case $S(\cdot)$ is called a sufficient statistic [because it is sufficient to know the value of $S(x_1, \dots, x_n)$ to get complete information]
A statistic is an arbitrary transform of the data $X_1, \dots, X_n$
54. Definition (sufficient statistic)
If $(X_1, \dots, X_n) \sim f(x_1, \dots, x_n \mid \theta)$ and if $T = S(X_1, \dots, X_n)$ is such that the distribution of $(X_1, \dots, X_n)$ conditional on T does not depend on $\theta$, then $S(\cdot)$ is a sufficient statistic
Factorisation theorem: $S(\cdot)$ is a sufficient statistic if and only if
$$f(x_1, \dots, x_n \mid \theta) = g(S(x_1, \dots, x_n) \mid \theta)\, h(x_1, \dots, x_n)$$
[another notion due to Fisher]
55. Illustrations
Uniform $U(0, \theta)$ distribution:
$$L(\theta \mid x_1, \dots, x_n) = \theta^{-n} \prod_{i=1}^n \mathbb{I}_{(0,\theta)}(x_i) = \theta^{-n}\, \mathbb{I}_{\theta > \max_i x_i}$$
Hence
$$S(X_1, \dots, X_n) = \max_i X_i = X_{(n)}$$
is sufficient
56. Illustrations
Bernoulli $\mathcal{B}(p)$ distribution:
$$L(p \mid x_1, \dots, x_n) = \prod_{i=1}^n p^{x_i} (1 - p)^{1 - x_i} = \{p/(1-p)\}^{\sum_i x_i}\, (1 - p)^n$$
Hence
$$S(X_1, \dots, X_n) = \overline{X}_n$$
is sufficient
58. Sufficiency and exponential families
Both previous examples belong to exponential families
$$f(x \mid \theta) = h(x) \exp\big\{ T(\theta)^T S(x) - \tau(\theta) \big\}$$
Generic property of exponential families:
$$f(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^n h(x_i)\, \exp\Big\{ T(\theta)^T \sum_{i=1}^n S(x_i) - n\,\tau(\theta) \Big\}$$
Lemma: for an exponential family with summary statistic $S(\cdot)$, the statistic
$$S_n = \sum_{i=1}^n S(X_i)$$
is sufficient
59. Sufficiency as a rare feature
Nice property reducing the data to a low-dimensional transform, but...
How frequent is it within the collection of probability distributions?
Very rare, as essentially restricted to exponential families [Pitman-Koopman-Darmois theorem], with the exception of parameter-dependent families like $U(0, \theta)$
60. Pitman-Koopman-Darmois characterisation
If $X_1, \dots, X_n$ are iid random variables from a density $f(\cdot \mid \theta)$ whose support does not depend on $\theta$, and verifying the property that there exists an integer $n_0$ such that, for $n \ge n_0$, there is a sufficient statistic $S(X_1, \dots, X_n)$ with fixed [in n] dimension, then $f(\cdot \mid \theta)$ belongs to an exponential family
[Factorisation theorem]
Note: Darmois published this result in 1935 [in French] and Koopman and Pitman in 1936 [in English], but Darmois is generally omitted from the theorem... Fisher proved it for one-dimensional sufficient statistics in 1934
62. Minimal sufficiency
Multiplicity of sufficient statistics, e.g., $S'(x) = (S(x), U(x))$ remains sufficient when $S(\cdot)$ is sufficient
Search for a most concentrated summary:
Minimal sufficiency: a sufficient statistic $S(\cdot)$ is minimal sufficient if it is a function of any other sufficient statistic
Lemma: for a minimal exponential family representation
$$f(x \mid \theta) = h(x) \exp\big\{ T(\theta)^T S(x) - \tau(\theta) \big\}$$
$S(X_1) + \dots + S(X_n)$ is minimal sufficient
63. Ancillarity
Opposite of sufficiency:
Ancillarity: when $X_1, \dots, X_n$ are iid random variables from a density $f(\cdot \mid \theta)$, a statistic $A(\cdot)$ is ancillary if $A(X_1, \dots, X_n)$ has a distribution that does not depend on $\theta$
Useless?! Not necessarily, as conditioning upon $A(X_1, \dots, X_n)$ leads to more precision and efficiency:
use of $F(x_1, \dots, x_n \mid A(x_1, \dots, x_n))$ instead of $F(x_1, \dots, x_n)$
Notion of maximal ancillary statistic
65. Point estimation, estimators and estimates
When given a parametric family $f(\cdot \mid \theta)$ and a sample supposedly drawn from this family,
$$(X_1, \dots, X_n) \overset{\text{iid}}{\sim} f(x \mid \theta)$$
an estimator of $\theta$ is a statistic $T(X_1, \dots, X_n)$ or $\hat\theta_n$ providing a [reasonable] substitute for the unknown value $\theta$;
an estimate of $\theta$ is the value of the estimator for a given [realised] sample, $T(x_1, \dots, x_n)$
Example: for a Normal $N(\mu, \sigma^2)$ sample $X_1, \dots, X_n$,
$$T(X_1, \dots, X_n) = \hat\mu_n = \frac{1}{n} \sum_{i=1}^n X_i$$
is an estimator of $\mu$, and $\hat\mu_n = 2.014$ is an estimate
66. Maximum likelihood principle
Given the concentration property of the likelihood function, a reasonable choice of estimator is the mode:
MLE: a maximum likelihood estimator (MLE) $\hat\theta_n$ satisfies
$$L(\hat\theta_n \mid X_1, \dots, X_n) \ge L(\theta \mid X_1, \dots, X_n) \quad \text{for all } \theta \in \Theta$$
Under regularity of $L(\cdot \mid X_1, \dots, X_n)$, the MLE is also a solution of the likelihood equations
$$\nabla \log L(\hat\theta_n \mid X_1, \dots, X_n) = 0$$
Warning: $\hat\theta_n$ is not the most likely value of $\theta$, but it makes the observation $(x_1, \dots, x_n)$ most likely...
68. Maximum likelihood invariance
Principle independent of parameterisation:
If $\xi = h(\theta)$ is a one-to-one transform of $\theta$, then
$$\hat\xi_n^{\text{MLE}} = h(\hat\theta_n^{\text{MLE}})$$
[estimator of transform = transform of estimator]
By extension, if $\xi = h(\theta)$ is any transform of $\theta$, then $\hat\xi_n^{\text{MLE}} = h(\hat\theta_n^{\text{MLE}})$
69. Unicity of maximum likelihood estimate
Depending on the regularity of $L(\cdot \mid x_1, \dots, x_n)$, there may be
1. an a.s. unique MLE $\hat\theta_n^{\text{MLE}}$ [e.g., $x_1, \dots, x_n \sim N(\theta, 1)$]
2. several, or an infinity of, MLEs [or of solutions to the likelihood equations] [e.g., $x_1, \dots, x_n \sim N(\theta_1 + \theta_2, 1)$, and mixtures of normals]
3. no MLE at all [e.g., $x_1, \dots, x_n \sim N(\mu_i, \sigma^2)$]
73. Unicity of maximum likelihood estimate
Consequence of standard differential calculus results on $\ell(\theta) = \log L(\theta \mid x_1, \dots, x_n)$:
Lemma: if $\Theta$ is connected and open, if $\ell(\theta)$ is twice-differentiable with
$$\lim_{\theta \to \partial\Theta} \ell(\theta) = -\infty$$
and if $H(\theta) = \nabla\nabla^T \ell(\theta)$ is negative definite at all solutions of the likelihood equations, then $\ell(\theta)$ has a unique global maximum
Limited appeal, because this excludes local maxima
75. Unicity of MLE for exponential families
Lemma: if $f(\cdot \mid \theta)$ is a minimal exponential family
$$f(x \mid \theta) = h(x) \exp\big\{ T(\theta)^T S(x) - \tau(\theta) \big\}$$
with $T(\theta)$ one-to-one and twice differentiable over $\Theta$, if $\Theta$ is open, and if there is at least one solution to the likelihood equations, then it is the unique MLE
The likelihood equation is equivalent to $S(x) = E_\theta[S(X)]$
78. Illustrations
Uniform $U(0, \theta)$ likelihood
$$L(\theta \mid x_1, \dots, x_n) = \theta^{-n}\, \mathbb{I}_{\theta \ge \max_i x_i}$$
not differentiable at $X_{(n)}$, but
$$\hat\theta_n^{\text{MLE}} = X_{(n)}$$
79. Illustrations
Bernoulli $\mathcal{B}(p)$ likelihood
$$L(p \mid x_1, \dots, x_n) = \{p/(1-p)\}^{\sum_i x_i}\, (1 - p)^n$$
differentiable over $(0, 1)$, and
$$\hat p_n^{\text{MLE}} = \overline{X}_n$$
81. The fundamental theorem of Statistics
Fundamental theorem: under appropriate conditions, if $(X_1, \dots, X_n) \overset{\text{iid}}{\sim} f(x \mid \theta)$ and if $\hat\theta_n$ is a solution of $\nabla \log f(X_1, \dots, X_n \mid \theta) = 0$, then
$$\sqrt{n}\, \{\hat\theta_n - \theta\} \overset{\mathcal{L}}{\longrightarrow} N_p(0, I(\theta)^{-1})$$
Equivalent of the CLT for estimation purposes
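A simulation sketch (not in the slides) of this asymptotic normality in the Bernoulli model, where $I(p)^{-1} = p(1 - p)$.

```python
import random
import statistics

random.seed(1)

# Bernoulli(p): I(p) = 1/{p(1-p)}, so sqrt(n)(p_hat - p) ~ N(0, p(1-p))
p, n, reps = 0.3, 400, 5000
draws = []
for _ in range(reps):
    xs = [1 if random.random() < p else 0 for _ in range(n)]
    p_hat = sum(xs) / n                 # the MLE is the sample mean
    draws.append(n**0.5 * (p_hat - p))

print(statistics.mean(draws))           # ~ 0
print(statistics.variance(draws))       # ~ p(1-p) = 0.21 = I(p)^{-1}
```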
83. Regularity conditions
- $\theta$ is identifiable
- the support of $f(\cdot \mid \theta)$ is constant in $\theta$
- $\ell(\theta)$ is thrice differentiable
- [the killer] there exists $g(x)$ integrable against $f(\cdot \mid \theta)$ in a neighbourhood of the true parameter such that
$$\left| \frac{\partial^3}{\partial\theta_i\, \partial\theta_j\, \partial\theta_k} \log f(x \mid \theta) \right| \le g(x)$$
- the identity
$$I(\theta) = E_\theta\big[ \nabla \log f(X \mid \theta)\, \{\nabla \log f(X \mid \theta)\}^T \big] = -E_\theta\big[ \nabla^T \nabla \log f(X \mid \theta) \big]$$
stands [mostly superfluous]
- $\hat\theta_n$ converges in probability to $\theta$ [similarly superfluous]
[Boos & Stefanski, 2014, p. 286; Lehmann & Casella, 1998]
92. Inefficient MLEs
Example of the MLE of $\eta = \|\theta\|^2$ when $x \sim N_p(\theta, I_p)$:
$$\hat\eta^{\text{MLE}} = \|x\|^2$$
Then $E_\theta[\|x\|^2] = \eta + p$ diverges away from $\eta$ with $p$
Note: consistent and efficient behaviour when considering the MLE of $\eta$ based on
$$Z = \|X\|^2 \sim \chi^2_p(\eta)$$
[Robert, 2001]
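A tiny simulation (my illustration) of the divergence $E_\theta[\|x\|^2] = \eta + p$:

```python
import random

random.seed(2)

p = 50
theta = [1.0] * p                       # eta = ||theta||^2 = 50
eta = sum(t * t for t in theta)

reps = 2000
total = 0.0
for _ in range(reps):
    x = [random.gauss(t, 1.0) for t in theta]
    total += sum(xi * xi for xi in x)   # the MLE ||x||^2 of eta

print(eta, total / reps)                # ~ 50 versus ~ 100 = eta + p
```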
93. Inconsistent MLEs
Take $X_1, \dots, X_n \overset{\text{iid}}{\sim} f_\theta(x)$ with
$$f_\theta(x) = (1 - \theta)\, \frac{1}{\delta(\theta)}\, f_0\!\left( \frac{x - \theta}{\delta(\theta)} \right) + \theta\, f_1(x)$$
for $\theta \in [0, 1]$, where
$$f_1(x) = \tfrac{1}{2}\, \mathbb{I}_{[-1,1]}(x) \qquad f_0(x) = (1 - |x|)\, \mathbb{I}_{[-1,1]}(x)$$
and
$$\delta(\theta) = (1 - \theta) \exp\{-(1 - \theta)^{-4} + 1\}$$
Then for any $\theta$,
$$\hat\theta_n^{\text{MLE}} \overset{\text{a.s.}}{\longrightarrow} 1$$
[Ferguson, 1982; John Wellner's slides, ca. 2005]
94. Inconsistent MLEs (2)
Consider $X_{ij}$, $i = 1, \dots, n$, $j = 1, 2$, with $X_{ij} \sim N(\mu_i, \sigma^2)$. Then
$$\hat\mu_i^{\text{MLE}} = \frac{X_{i1} + X_{i2}}{2} \qquad \hat\sigma^2_{\text{MLE}} = \frac{1}{4n} \sum_{i=1}^n (X_{i1} - X_{i2})^2$$
Therefore
$$\hat\sigma^2_{\text{MLE}} \overset{\text{a.s.}}{\longrightarrow} \sigma^2/2$$
[Neyman & Scott, 1948]
Note: working solely with $X_{i1} - X_{i2} \sim N(0, 2\sigma^2)$ produces a consistent MLE
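A quick simulation (my addition) of the Neyman-Scott inconsistency, with $\sigma = 1$ so that $\hat\sigma^2_{\text{MLE}}$ settles near $1/2$ rather than $1$.

```python
import random

random.seed(3)

n, sigma = 50_000, 1.0
ss = 0.0
for _ in range(n):
    mu_i = random.uniform(-5.0, 5.0)    # arbitrary nuisance mean for pair i
    x1 = random.gauss(mu_i, sigma)
    x2 = random.gauss(mu_i, sigma)
    ss += (x1 - x2) ** 2

print(ss / (4 * n))   # ~ 0.5 = sigma^2 / 2: the MLE of sigma^2 is inconsistent
```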
96. Likelihood optimisation
Practical optimisation of the likelihood function
$$\theta^\star = \arg\max_\theta\, L(\theta \mid x) = \arg\max_\theta \prod_{i=1}^n g(x_i \mid \theta)$$
assuming $X = (X_1, \dots, X_n) \overset{\text{iid}}{\sim} g(x \mid \theta)$
- analytical resolution feasible for exponential families:
$$\nabla T(\theta)^T \sum_{i=1}^n S(x_i) = n\, \nabla \tau(\theta)$$
- use of standard numerical techniques like Newton-Raphson:
$$\theta^{(t+1)} = \theta^{(t)} + I_{\text{obs}}(X, \theta^{(t)})^{-1}\, \nabla \ell(\theta^{(t)})$$
with $\ell(\cdot)$ the log-likelihood and $I_{\text{obs}}$ the observed information matrix
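Below is a sketch (not from the deck) of this Newton-Raphson recursion for a Cauchy location model, a classic case without closed-form MLE; the score and observed information are computed exactly, and the assumed starting point is the sample median.

```python
import math
import random

random.seed(4)

def score_and_info(theta, xs):
    """Score l'(theta) and observed information -l''(theta) for iid Cauchy(theta) data."""
    score, info = 0.0, 0.0
    for x in xs:
        u = x - theta
        score += 2.0 * u / (1.0 + u * u)
        info += 2.0 * (1.0 - u * u) / (1.0 + u * u) ** 2
    return score, info

# simulate Cauchy(2) data through the inverse cdf
xs = [2.0 + math.tan(math.pi * (random.random() - 0.5)) for _ in range(500)]

theta = sorted(xs)[len(xs) // 2]        # median start keeps Newton-Raphson stable
for _ in range(50):
    score, info = score_and_info(theta, xs)
    step = score / info                 # theta_{t+1} = theta_t + I_obs^{-1} grad l
    theta += step
    if abs(step) < 1e-10:
        break

print(theta)   # close to the true location 2.0
```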
97. EM algorithm
Cases where g is too complex for the above to work
Special case when g is a marginal,
$$g(x \mid \theta) = \int_{\mathcal{Z}} f(x, z \mid \theta)\, \text{d}z$$
with Z called the latent or missing variable
98. Illustrations
- censored data: $X = \min(X^\star, a)$, with $X^\star \sim N(\theta, 1)$
- mixture model: $X \sim .3\, N(\mu_0, 1) + .7\, N(\mu_1, 1)$
- disequilibrium model: $X = \min(X^\star, Y^\star)$, with $X^\star \sim f_1(x \mid \theta)$ and $Y^\star \sim f_2(x \mid \theta)$
99. Completion
EM algorithm based on completing the data x with z, such that
$$(X, Z) \sim f(x, z \mid \theta)$$
Z is the missing data vector and the pair (X, Z) the complete data vector
Conditional density of Z given x:
$$k(z \mid \theta, x) = \frac{f(x, z \mid \theta)}{g(x \mid \theta)}$$
100. Likelihood decomposition
Likelihood associated with the complete data $(x, z)$:
$$L^c(\theta \mid x, z) = f(x, z \mid \theta)$$
and likelihood for the observed data, $L(\theta \mid x)$, such that
$$\log L(\theta \mid x) = E[\log L^c(\theta \mid x, Z) \mid \theta_0, x] - E[\log k(Z \mid \theta, x) \mid \theta_0, x] \qquad (1)$$
for any $\theta_0$, with the integration operated against the conditional distribution of Z given the observables (and parameters), $k(z \mid \theta_0, x)$
101. [A tale of] two $\theta$'s
There are two $\theta$'s in (1): $\theta_0$ is a fixed (and arbitrary) value driving the integration, while $\theta$ remains free (and variable)
Maximising the observed likelihood $L(\theta \mid x)$ is equivalent to maximising the r.h.s. of (1),
$$E[\log L^c(\theta \mid x, Z) \mid \theta_0, x] - E[\log k(Z \mid \theta, x) \mid \theta_0, x]$$
103. Intuition for EM
Instead of maximising the r.h.s. of (1) in $\theta$, maximise only
$$E[\log L^c(\theta \mid x, Z) \mid \theta_0, x]$$
Maximisation of the complete log-likelihood itself is impossible since z is unknown; it is therefore replaced by maximisation of the expected complete log-likelihood, with the expectation depending on the term $\theta_0$
104. Expectation-Maximisation
The expectation of the complete log-likelihood is denoted
$$Q(\theta \mid \theta_0, x) = E[\log L^c(\theta \mid x, Z) \mid \theta_0, x]$$
to stress its dependence on $\theta_0$ and on the sample x
Principle: EM derives a sequence of estimators $\hat\theta^{(j)}$, $j = 1, 2, \dots$, through the iteration of Expectation and Maximisation steps:
$$Q(\hat\theta^{(j)} \mid \hat\theta^{(j-1)}, x) = \max_\theta\, Q(\theta \mid \hat\theta^{(j-1)}, x)$$
105. EM Algorithm
Iterate (in m):
1. (step E) compute
$$Q(\theta \mid \hat\theta^{(m)}, x) = E[\log L^c(\theta \mid x, Z) \mid \hat\theta^{(m)}, x]$$
2. (step M) maximise $Q(\theta \mid \hat\theta^{(m)}, x)$ in $\theta$ and set
$$\hat\theta^{(m+1)} = \arg\max_\theta\, Q(\theta \mid \hat\theta^{(m)}, x)$$
until a fixed point [of Q] is found
[Dempster, Laird & Rubin, 1977]
116. Censored data (3)
Differentiating in $\theta$,
$$n\, \hat\theta^{(j+1)} = m\, \bar{x} + (n - m)\, E[Z \mid \hat\theta^{(j)}]$$
with
$$E[Z \mid \hat\theta^{(j)}] = \int_a^\infty z\, k(z \mid \hat\theta^{(j)}, x)\, \text{d}z = \hat\theta^{(j)} + \frac{\varphi(a - \hat\theta^{(j)})}{1 - \Phi(a - \hat\theta^{(j)})}$$
Hence, the EM sequence is provided by
$$\hat\theta^{(j+1)} = \frac{m}{n}\, \bar{x} + \frac{n - m}{n} \left[ \hat\theta^{(j)} + \frac{\varphi(a - \hat\theta^{(j)})}{1 - \Phi(a - \hat\theta^{(j)})} \right]$$
which converges to the likelihood maximum $\hat\theta$
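A direct implementation (mine) of this EM recursion, with `phi` and `Phi` the standard normal pdf and cdf, and hypothetical inputs for the sample summaries.

```python
import math

def phi(u):
    """Standard normal pdf."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def Phi(u):
    """Standard normal cdf, via math.erf."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def em_censored_normal(xbar, m, n, a, theta, iters=200):
    """EM recursion from the slide for X = min(X*, a), X* ~ N(theta, 1):
    m uncensored observations with mean xbar, n - m observations censored at a."""
    for _ in range(iters):
        tail = phi(a - theta) / (1.0 - Phi(a - theta))   # E[Z | theta] - theta
        theta = (m / n) * xbar + ((n - m) / n) * (theta + tail)
    return theta

# hypothetical data: 70 uncensored values averaging 1.8, 30 censored at a = 2.5
print(em_censored_normal(xbar=1.8, m=70, n=100, a=2.5, theta=1.0))
```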
117. Mixtures
Mixture of two normal distributions with unknown means,
$$.3\, N(\mu_0, 1) + .7\, N(\mu_1, 1)$$
sample $X_1, \dots, X_n$ and parameter $\theta = (\mu_0, \mu_1)$
Missing data: $Z_i \in \{0, 1\}$, the indicator of the component associated with $X_i$:
$$X_i \mid z_i \sim N(\mu_{z_i}, 1) \qquad Z_i \sim \mathcal{B}(.7)$$
Complete likelihood:
$$\log L^c(\theta \mid x, z) \propto -\frac{1}{2} \sum_{i=1}^n z_i (x_i - \mu_1)^2 - \frac{1}{2} \sum_{i=1}^n (1 - z_i)(x_i - \mu_0)^2 = -\frac{1}{2}\, n_1 (\hat\mu_1 - \mu_1)^2 - \frac{1}{2}\, (n - n_1)(\hat\mu_0 - \mu_0)^2$$
with
$$n_1 = \sum_{i=1}^n z_i \qquad n_1 \hat\mu_1 = \sum_{i=1}^n z_i x_i \qquad (n - n_1)\, \hat\mu_0 = \sum_{i=1}^n (1 - z_i) x_i$$
118. Mixtures (2)
At the j-th EM iteration,
$$Q(\theta \mid \hat\theta^{(j)}, x) = -\frac{1}{2}\, E\Big[ n_1 (\hat\mu_1 - \mu_1)^2 + (n - n_1)(\hat\mu_0 - \mu_0)^2 \,\Big|\, \hat\theta^{(j)}, x \Big]$$
Differentiating in $\theta$,
$$\hat\theta^{(j+1)} = \begin{pmatrix} \displaystyle \sum_{i=1}^n E[Z_i \mid \hat\theta^{(j)}, x_i]\, x_i \Big/ \sum_{i=1}^n E[Z_i \mid \hat\theta^{(j)}, x_i] \\[2ex] \displaystyle \sum_{i=1}^n E[(1 - Z_i) \mid \hat\theta^{(j)}, x_i]\, x_i \Big/ \sum_{i=1}^n E[(1 - Z_i) \mid \hat\theta^{(j)}, x_i] \end{pmatrix}$$
Conclusion: step (E) in EM replaces the missing data $Z_i$ with their conditional expectation, given x (an expectation that depends on $\hat\theta^{(m)}$)
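A sketch (not from the deck) of the full EM loop for this two-component mixture: the E step computes the posterior weights $E[Z_i \mid \hat\theta^{(j)}, x_i]$ and the M step applies the weighted-mean update above; the simulated data and starting values are made up.

```python
import math
import random

random.seed(5)

def npdf(x, mu):
    """N(mu, 1) density."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

# simulate from .3 N(0, 1) + .7 N(2.5, 1): true (mu0, mu1) = (0, 2.5)
xs = [random.gauss(2.5, 1.0) if random.random() < 0.7 else random.gauss(0.0, 1.0)
      for _ in range(2000)]

mu0, mu1 = -1.0, 1.0   # starting values
for _ in range(200):
    # E step: posterior probability that x_i comes from the .7 component
    w = [0.7 * npdf(x, mu1) / (0.7 * npdf(x, mu1) + 0.3 * npdf(x, mu0)) for x in xs]
    # M step: weighted means, as in the update above
    s = sum(w)
    mu1 = sum(wi * x for wi, x in zip(w, xs)) / s
    mu0 = sum((1.0 - wi) * x for wi, x in zip(w, xs)) / (len(xs) - s)

print(mu0, mu1)   # close to (0, 2.5) from this starting point
```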
128. Mixtures (3)
[Figure: EM iterations for several starting values, in the $(\mu_1, \mu_2)$ plane, both axes spanning $(-1, 3)$]
129. Properties
The EM algorithm is such that
- it converges to a local maximum or a saddle-point
- it depends on the initial condition $\theta^{(0)}$
- it requires several initial values when the likelihood is multimodal