Deriving likelihood functions is often perceived as a daunting task. These slides show how the likelihood function is derived in the general case and demonstrate it for different models.
Part of the Eawag Summer School on System Analysis.
1. Formulation of model likelihood functions
The most useful representation of stochastic models
June 12, 2017
Andreas Scheidegger
Eawag: Swiss Federal Institute of Aquatic Science and Technology
2. Statistical models are stories about how the data
came to be – Dave Harris
Andreas Scheidegger Motivation 1
4. What is a likelihood function?
Definition
The likelihood function p(y1, . . . , yn|θ), or L(θ), is the joint
probability (density) of the observations {y1, . . . , yn} given a
stochastic model with parameter values θ.
Informal
If we simulate data with the stochastic model while setting the
parameters equal to θ, what is the probability (density) that we
obtain exactly our measurements {y1, . . . , yn}?
5. What do we need likelihood functions for?
Many parameter calibration and prediction techniques require that
the model is described by its likelihood function:
Frequentist statistics:
Maximum likelihood estimator (MLE), LR-tests, . . .
Bayesian statistics:
Parameter inference, uncertainty propagation,
predictions, model comparison, . . .
→ topic of this course
Note: The actual value of the likelihood function per se is usually not of
interest.
6. How to formulate likelihood functions?
Often, models are not described by the likelihood function.
A common description may rather look like this:
Yi = M(xi, θ) + εi,    εi ∼ N(0, σ²)
While this is a complete description of the stochastic model¹, it is
not directly useful for inference → we must translate such a
description into p(y|θ, x).
¹ M(xi, θ) is a deterministic function. The complete model, however,
is stochastic because we add a random error term εi.
7. Derivation of a likelihood function
1. Decompose the joint probability density:
p(y|θ) = p(y1, . . . , yn|θ) = p(y1|θ) p(y2|θ, y1) p(y3|θ, y1, y2) · · · p(yn|θ, y1, . . . , yn−1)
2. Formulate the conditional probabilities:
p(yi |θ, y1, . . . , yi−1)
If the observations are independent:
p(yi |θ, y1, . . . , yi−1) = p(yi |θ).
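As a sanity check of step 1, the decomposition can be verified numerically. The following sketch (all numbers illustrative) compares the explicit joint density of a bivariate normal with standard margins and correlation rho against the chain-rule product p(y1) p(y2|y1):

```r
## bivariate standard normal with correlation rho (illustrative values)
rho <- 0.6
y1 <- 0.3; y2 <- -1.1

## joint density written out explicitly
p.joint <- exp(-(y1^2 - 2*rho*y1*y2 + y2^2)/(2*(1 - rho^2))) /
  (2*pi*sqrt(1 - rho^2))

## the same value via the decomposition p(y1) p(y2|y1),
## using Y2|Y1=y1 ~ N(rho*y1, 1 - rho^2)
p.chain <- dnorm(y1) * dnorm(y2, mean=rho*y1, sd=sqrt(1 - rho^2))

all.equal(p.joint, p.chain)   # TRUE
```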
8. Some (informal) advice
• First formulate the likelihood in general terms, without specific
distributional assumptions.
• Think of p(x) (informally!) as Prob(X = x) and change sums to
integrals.
• In practice, a function that is proportional to the likelihood
function is sufficient.
• The logarithmic scale is preferred for computation.
• Don’t worry about identifiability of the parameters at this stage.
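The last two points can be illustrated with a small sketch (illustrative numbers): multiplying many individual densities quickly underflows to zero in floating point, while the log likelihood remains a harmless finite number.

```r
set.seed(1)
y <- rnorm(1000)          # 1000 standard normal "observations"

prod(dnorm(y))            # underflows to 0
sum(dnorm(y, log=TRUE))   # finite log likelihood (about -1400)
```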
9. Example 1: sex ratio
discrete data
Observed data y
The gender of n newborns.
Model description
We assume the probability of a
girl is θ and of a boy 1 − θ.
11. Example 1: sex ratio
discrete data
Probability for a single observation:
Prob(yi|θ) = θ if yi = girl,   1 − θ if yi = boy
Independence is a reasonable assumption:
Prob(y1, . . . , yn|θ) = ∏_{i=1}^{n} Prob(yi|θ) = θ^#girls (1 − θ)^#boys
12. Example 1: sex ratio
discrete data
R implementation as function:
logL <- function(theta, n.girls, n.boys) {
  LL <- n.girls*log(theta) + n.boys*log(1-theta)
  return(LL)
}
Call:
logL(theta=0.4, n.girls=10, n.boys=5)
> -11.717035
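A possible usage sketch: the MLE of θ can be obtained by maximizing logL numerically with optimize() (the logL definition is repeated so the snippet is self-contained; for this model the analytical MLE is #girls/n = 10/15).

```r
logL <- function(theta, n.girls, n.boys) {
  LL <- n.girls*log(theta) + n.boys*log(1-theta)
  return(LL)
}

## maximize the log likelihood over theta in (0, 1)
mle <- optimize(logL, interval=c(0, 1), maximum=TRUE,
                n.girls=10, n.boys=5)
mle$maximum   # close to 10/15
```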
14. Example 2: rating curve
continuous data
Observed data y, x
n pairs of water level xi and
run-off yi .
Model description
Water level x and run-off y are
related as
y = RC(x, θ) = θ1 (x − θ2)^θ3
Figure: Rating curve of Sluzew
Creek. Sikorska et al. (2013)
17. Example 2: rating curve
continuous data
A deterministic model?
→ We must make assumptions
about the error distribution. E.g.,
Yi = RC(xi, θ) + εi,    εi ∼ N(0, σ²)
or equivalently
Yi ∼ N(RC(xi, θ), σ²)
The RC model describes only the
expected value of an observation
for a given xi .
So the pdf for a single
observation is the density of a
normal distribution²
p(yi|xi, θ, σ) = (1/σ) φ((yi − RC(xi, θ))/σ)
Finally, assuming independent
observations
p(y1, . . . , yn|x1, . . . , xn, θ, σ) = ∏_{i=1}^{n} p(yi|xi, θ, σ)
² φ(x) = (1/√(2π)) exp{−x²/2}
20. Example 2: rating curve
continuous data
Figure: Rating curve RC(X, θ) with water level X and run-off Y.
Example of a non-linear regression.
21. Example 2: rating curve
continuous data
## deterministic rating curve model
RC <- function(x, theta) {
  y <- theta[1]*(x-theta[2])^theta[3]
  return(y)
}

## log likelihood with normally distributed errors
## sigma is included as theta[4]=sigma.
logL <- function(theta, y.data, x.data) {
  mean.y <- RC(x.data, theta[1:3])   # mean value for y
  LL <- sum(dnorm(y.data, mean=mean.y,
                  sd=theta[4], log=TRUE))
  return(LL)
}
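A possible usage sketch: fit the rating curve to synthetic data by maximizing the log likelihood with optim(). RC and logL are repeated so the snippet is self-contained; the x values, true parameters, and starting values are made up for illustration.

```r
## rating curve model and log likelihood (as above)
RC <- function(x, theta) theta[1]*(x-theta[2])^theta[3]
logL <- function(theta, y.data, x.data) {
  mean.y <- RC(x.data, theta[1:3])
  sum(dnorm(y.data, mean=mean.y, sd=theta[4], log=TRUE))
}

## synthetic data with made-up true parameters
set.seed(1)
theta.true <- c(2, 0.5, 1.5, 0.1)      # theta1..theta3 and sigma
x.data <- seq(1, 3, length.out=50)
y.data <- RC(x.data, theta.true[1:3]) + rnorm(50, sd=theta.true[4])

## optim() minimizes, so pass the negative log likelihood;
## invalid parameter values are penalized instead of returning NaN
neg.logL <- function(theta) {
  ll <- logL(theta, y.data, x.data)
  if (!is.finite(ll)) return(1e10)
  -ll
}
fit <- optim(par=c(1, 0.3, 1, 0.2), fn=neg.logL)
fit$par   # compare with theta.true
```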
22. Example 3: limit of quantification
censored data
Observed data y
lab 1 lab 2 lab 3 . . .
concentration y1 y2 n.d. . . .
Limit of quantification: LOQ
standard deviation of measurements: σ
Model description
“A model? I just want to calculate the
concentration.”
23. Example 3: limit of quantification
censored data
Figure: Left censored data.
Model description
The measurements are normally
distributed around the true mean
θ with standard deviation σ.
28. Example 3: limit of quantification
censored data
Likelihood for a single measured observation:
p(yi|θ, σ) = (1/σ) φ((yi − θ)/σ)
Likelihood for a single “not detected” observation:
Prob(n.d.|θ, σ) = Prob(yi < LOQ|θ, σ) = ∫_{−∞}^{LOQ} p(y|θ, σ) dy = Φ((LOQ − θ)/σ)
p(y1, . . . , yn|θ, σ) = Prob(yi < LOQ|θ, σ)^#censored · ∏_{¬censored} p(yi|θ, σ)
29. Example 3: limit of quantification
censored data
## data, left censored observations coded as "nd"
y <- c(y1=0.35, y2=0.45, y3="nd", y4="nd", y5=0.4)

## log likelihood
logL <- function(theta, y, sigma, LOQ) {
  ## number of censored observations
  n.censored <- sum(y=="nd")
  ## convert non-censored observations into type 'numeric'
  y.not.cen <- as.numeric(y[y!="nd"])
  ## log likelihood of the non-censored observations
  LL.not.cen <- sum(dnorm(y.not.cen, mean=theta, sd=sigma, log=TRUE))
  ## log likelihood of the left censored observations
  LL.left.cen <- n.censored * pnorm(LOQ, mean=theta, sd=sigma, log.p=TRUE)
  return(LL.not.cen + LL.left.cen)
}
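A possible usage sketch (the logL definition is repeated in compact form so the snippet is self-contained): estimate θ with optimize(), holding σ and LOQ fixed at assumed values. The censored observations pull the estimate below the mean of the measured values.

```r
## censored-data log likelihood (as above, compact form)
logL <- function(theta, y, sigma, LOQ) {
  n.censored <- sum(y=="nd")
  y.not.cen <- as.numeric(y[y!="nd"])
  sum(dnorm(y.not.cen, mean=theta, sd=sigma, log=TRUE)) +
    n.censored * pnorm(LOQ, mean=theta, sd=sigma, log.p=TRUE)
}

y <- c(y1=0.35, y2=0.45, y3="nd", y4="nd", y5=0.4)

## sigma and LOQ are assumed known (illustrative values)
est <- optimize(logL, interval=c(0, 1), maximum=TRUE,
                y=y, sigma=0.1, LOQ=0.3)
est$maximum   # MLE of theta, below mean(c(0.35, 0.45, 0.4))
```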
30. Example 4: auto-regressive model
auto-correlated data
Observed data y
equally spaced time series data y1, . . . , yn.
Model description
Classical AR(1) model:
yt+1 = θ yt + εt+1,    εt+1 ∼ N(0, σ²)
y0 = k
Figure: Annual water level of Lake Huron in feet.
Brockwell and Davis (1991)
32. Example 4: auto-regressive model
auto-correlated data
Each observation depends only on the immediately preceding
observation. Hence:
p(y1, . . . , yn|θ, σ, y0) = ∏_{i=1}^{n} p(yi|yi−1, θ, σ)
The conditional probabilities are all normal:
p(yt|y0, . . . , yt−1, θ, σ) = p(yt|yt−1, θ, σ) = (1/σ) φ((yt − θ yt−1)/σ)
LL <- dnorm(y[1], mean=theta*k, sd=sigma, log=TRUE) +
      sum(dnorm(y[2:n], mean=theta*y[1:(n-1)],
                sd=sigma, log=TRUE))
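A possible self-contained sketch wrapping the snippet above into a function and evaluating it on a simulated series. The initial value k, theta, sigma, and the series length are made up for illustration; note that p(y1|y0 = k) has mean θk.

```r
## AR(1) log likelihood with fixed initial value y0 = k
logL.ar1 <- function(theta, sigma, y, k) {
  n <- length(y)
  dnorm(y[1], mean=theta*k, sd=sigma, log=TRUE) +
    sum(dnorm(y[2:n], mean=theta*y[1:(n-1)],
              sd=sigma, log=TRUE))
}

## simulate a short AR(1) series starting from y0 = k
set.seed(1)
k <- 5; theta <- 0.8; sigma <- 0.5
n <- 30
y <- numeric(n)
y.prev <- k
for (t in 1:n) {
  y[t] <- theta*y.prev + rnorm(1, sd=sigma)
  y.prev <- y[t]
}

logL.ar1(0.8, sigma, y, k)   # log likelihood at the true theta
```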
33. Normality and “iid.”
Reality is normally not normally distributed
Typical statistical assumptions, such as
• normality
• independence
are often chosen from a computational viewpoint.
However, other distributional assumptions can be
incorporated easily in most cases.
34. Rating curve modified
Let’s assume we observe more extreme values than are compatible
with a normal distribution → try a t-distribution.
## log likelihood with t-distributed errors
## theta[4]=scale, theta[5]=degrees of freedom
logL <- function(theta, y.data, x.data) {
  mean.y <- RC(x.data, theta[1:3])           # mean value for y
  residuals <- (y.data - mean.y)/theta[4]    # scaling
  ## the -log(theta[4]) term is the Jacobian of the scaling;
  ## it matters here because the scale theta[4] is estimated
  LL <- sum(dt(residuals, df=theta[5], log=TRUE) - log(theta[4]))
  return(LL)
}
35. Summary
1. Decompose the joint probability density:
p(y|θ) = p(y1, . . . , yn|θ) = p(y1|θ) p(y2|θ, y1) p(y3|θ, y1, y2) · · · p(yn|θ, y1, . . . , yn−1)
2. Make assumptions to formulate the conditional probabilities:
p(yi|θ, y1, . . . , yi−1)
3. Make inference, check assumptions, revise if necessary.