Programming Assignment # 6: Using the Componentwise Metropolis-Hastings Algorithm to Sample from the Joint Posterior Distribution for the Parameters of a Binomial Regression Model
Markov Chain Monte Carlo (MCMC) methods provide a way to sample from a distribution (e.g., the joint posterior distribution for the parameters of a Bayesian model). These methods are useful when analytic solutions for parameter estimation do not exist. If the Markov chain is long, the sampled random variables are (approximately) identically distributed, but they are not independent, because in a Markov chain each random variable depends on the previous one. However, because the Ergodic Theorem applies to MCMC methods, the chains converge (with probability one) to the stationary distribution, which for our purposes is the Bayesian joint posterior distribution.
MCMC methods are frequently implemented using a Gibbs sampler. This, however, requires knowledge of the parameters' full conditional distributions, which are frequently not available. In this case, another MCMC method, called the Metropolis-Hastings algorithm, can be used. The Metropolis-Hastings algorithm is a type of acceptance/rejection method. It requires a candidate-generating distribution, also called a proposal distribution. Ideally, the proposal distribution should be similar to the posterior distribution, but any distribution with the same support as the posterior is possible. If f(y|x) denotes the proposal density, and p(x) the posterior density, the acceptance probability α(y|x) is calculated as

\[
\alpha(y \mid x) = \min\left(1, \frac{f(x \mid y)\, p(y)}{f(y \mid x)\, p(x)}\right).
\]
One then generates U ~ U[0, 1] and selects the proposal y if U ≤ α(y|x); otherwise one rejects the proposal and remains in state x. Thus, the proposal is accepted with probability α(y|x) and rejected with probability 1 − α(y|x). If the proposed step is sampled as Y|X ~ U(X − a, X + a), the proposal is symmetric, and the proposal densities in the numerator and denominator cancel. Moreover, it is enough to use the kernel of the posterior density, because α uses the ratio of two values of p(·).
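In code, a single update then reduces to comparing a log-uniform draw with the log of the posterior-kernel ratio. The following is a minimal sketch in R for a one-dimensional target; the function name mh_step and the argument log_kernel (the log of the unnormalized posterior density) are illustrative assumptions, not part of the assignment's program:

# One Metropolis-Hastings update with a symmetric uniform proposal.
# 'log_kernel' is a user-supplied function returning the log of the
# unnormalized target density; 'a' is the step size ("radius").
mh_step <- function(x, a, log_kernel) {
    y <- runif(1, x - a, x + a)  # proposal: Y|X ~ U(X - a, X + a)
    # symmetric proposal densities cancel, so only the kernel ratio remains
    if (log(runif(1)) <= log_kernel(y) - log_kernel(x)) y else x
}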
The Metropolis-Hastings algorithm generalizes to multidimensional distributions. In the multidimensional case, there are two types of algorithms: the "regular" algorithm and the "componentwise" algorithm. Whereas the "regular" algorithm computes a full proposal vector at each step, the "componentwise" algorithm, which is implemented in the example discussed in this assignment, updates one component at a time, so that the proposals for all the components are evaluated, i.e., accepted or rejected, in turn.
The assignment considers the binomial regression model y_i ~ Bi(m_i, θ_i), with y denoting the vector of the numbers of successes and m the vector of the numbers of trials; θ is given by

\[
\theta_i = \frac{1}{1 + e^{-(\alpha + \beta x_i)}},
\]

with x denoting a vector containing the integers from 1 to 10. The m-vector is (5, 4, 8, 6, 3, 10, 7, 12, 5, 4), and the y-vector is (0, 0, 0, 1, 3, 5, 4, 9, 3, 3).
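For concreteness, the data and the mean function translate into R as follows; the helper name theta_fun is an illustrative assumption:

x <- 1:10                               # covariate: the integers 1 to 10
m <- c(5, 4, 8, 6, 3, 10, 7, 12, 5, 4)  # numbers of trials
y <- c(0, 0, 0, 1, 3, 5, 4, 9, 3, 3)    # numbers of successes
# success probabilities under the inverse-logit (logistic) link
theta_fun <- function(alpha, beta) 1 / (1 + exp(-(alpha + beta * x)))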
A function called "bireg" was written to simulate sampling from the joint posterior distribution for the parameters α and β. The function returns the posterior means of these parameters. The function takes the vectors y, x, and m as input arguments, as well as the starting values of the parameters (defaults: α = 1, β = 1), the step sizes of the proposals (defaults: step size of the proposal for α = 1, for β = 0.2), the number of steps of the chain (default = 10,000), and the length of the "burn-in" period (default = 5,000), i.e., the number of initial values of the chain to be discarded. Burning in improves the approximation for the parameters of the posterior distribution because the initial observations, i.e., the observations before the chain has converged to the stationary distribution, are discarded from the estimation of probabilities and expectations derived from the posterior.
For practical reasons, a flat, uniform (that is, completely noninformative) prior was chosen. The choice of this prior also reflects the lack of prior knowledge about previously observed or expected values for α and β. With this prior, the posterior distribution is proportional to the likelihood, and in this writeup, as well as in the program, the two expressions are used interchangeably. Omitting the binomial coefficient, which cancels out when taking the ratio of the posterior distributions for the calculation of the acceptance probability, the log posterior density for the binomial regression model is

\[
\log p(\alpha, \beta \mid y) = \sum_{i=1}^{n} \left[ y_i \log(\theta_i) + (m_i - y_i) \log(1 - \theta_i) \right].
\]
This log posterior distribution was used to implement the componentwise Metropolis-Hastings algorithm, which generates a proposal for each of the two parameters (α and β) in turn, and then decides, based on the acceptance probability, whether to accept or reject the proposal. (The decision about acceptance or rejection is made by comparing the logarithm of a standard uniform random variable with the log of the posterior ratio, similar to the procedure described above.) This mechanism ensures that the probability of acceptance depends on the ratio of the kernels of the posterior densities. The choice of the step size ("radius") of the proposal for a component is critical for rapid convergence to the stationary distribution. The step size is denoted a, and the default values for aα and aβ were both tuned to produce acceptance probabilities between 0.3 and 0.6 (~0.46 for α and ~0.36 for β).
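The realized acceptance rate can be estimated from a recorded chain by counting how often a component actually moves; since the proposal is continuous, a repeated value almost surely indicates a rejection. A minimal sketch, assuming mchain is the two-column matrix of samples recorded by the program:

# empirical acceptance rate of one component of the chain
acc_rate <- function(chain) mean(diff(chain) != 0)
# e.g., acc_rate(mchain[, 1]) for alpha, acc_rate(mchain[, 2]) for beta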
The program records the Markov chains for α and β. An index/counter that runs during the execution of the program informs the user about the step the program is currently executing. The goal is to sample from the log likelihood until all samples, after a burn-in period, are derived from the stationary distribution. In order to check whether stationarity had been achieved, the program was run with different starting values (-3, -1, 0, 1, and 3 for both α and β) and with different numbers of steps (10,000 and 100,000); the burn-in period was always half the length of the whole chain. All these configurations produced posterior means of about -3.8 (α) and 0.59 (β).
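Such a stationarity check can be scripted directly; a minimal sketch, assuming the bireg function listed at the end of this writeup and the data vectors x, m, and y are already defined:

# rerun the sampler from several starting values and compare posterior means
for (s in c(-3, -1, 0, 1, 3)) {
    est <- bireg(y, x, m, start = c(s, s), nrep = 10000, burnin = 5000)
    cat("start =", s, ": alpha =", est$alpha, " beta =", est$beta, "\n")
}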
The following graphics were obtained with starting values of 1 for both α and β, chains of length 100,000, and burn-in periods of 50,000 steps. A scatterplot of α and β is shown in Figure 1. It shows the values of α and β after the burn-in period.

[Figure 1: Scatterplot of the sampled values of α (mchain[(burnin + 1):nrep, alpha], horizontal axis) and β (mchain[(burnin + 1):nrep, beta], vertical axis) after the burn-in period.]

Diagnostic plots (time series plots of α and β, histograms
of α and β, and autocorrelation functions of α and β) are shown in Figure 2. The time series plots show convergence to the stationary distribution, and the histograms do not have any unusual shapes. However, the autocorrelation functions show considerable positive autocorrelation up to a lag of about 100. (For greater lags, between about 160 and 230, the autocorrelation becomes slightly negative.) This high autocorrelation is potentially problematic: if the correlation between adjacent observations is high, a larger sample size is required to obtain reasonable numerical accuracy, in addition to the requirement of a much longer burn-in. However, the various configurations of starting values and chain lengths that produced nearly identical posterior means, as well as the time series and histograms, strongly suggest that the samples for the posterior means were derived from the stationary distribution.
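The diagnostic plots of Figure 2 can be reproduced with base R graphics; the following is a minimal sketch, assuming mchain, burnin, and nrep are available from a run of the program (it is not necessarily the exact code used to produce the figure):

keep <- (burnin + 1):nrep             # indices of the post-burn-in samples
par(mfrow = c(3, 2))
plot(mchain[keep, 1], type = "l", ylab = "alpha")  # time series of alpha
plot(mchain[keep, 2], type = "l", ylab = "beta")   # time series of beta
hist(mchain[keep, 1], xlab = "alpha", main = "alpha")
hist(mchain[keep, 2], xlab = "beta", main = "beta")
acf(mchain[keep, 1], main = "alpha")  # autocorrelation function of alpha
acf(mchain[keep, 2], main = "beta")   # autocorrelation function of beta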
The good fit between the parameter θ of the binomial regression function, which depends on the parameters α and β, and the probabilities of success based on the data (y/m) is shown in Figure 3. Only the fifth success probability, which equals 1 (y5/m5 = 3/3) and is an exception to the trend of generally gradually increasing success probabilities, is located far away from the monotonically increasing regression line.
[Figure 2: Diagnostic plots of the results of a run of the Metropolis-Hastings algorithm for the binomial regression model of the assignment: time series plots, histograms, and autocorrelation functions of α and β.]
[Figure 3: Probabilities of success (yi/mi; crosses) and parameters θi (dashed line) of the binomial regression model, plotted against x.]
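The comparison of Figure 3 can be recreated from the posterior means reported above; a minimal sketch (the values -3.8 and 0.59 are the rounded posterior means, not the exact estimates behind the figure):

alpha_hat <- -3.8                      # rounded posterior mean of alpha
beta_hat  <- 0.59                      # rounded posterior mean of beta
plot(x, y / m, pch = 4, ylim = c(0, 1), ylab = "y/m, theta")  # crosses
curve(1 / (1 + exp(-(alpha_hat + beta_hat * x))),
      from = 1, to = 10, add = TRUE, lty = 2)  # dashed regression line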
The code for the “bireg” function is shown below:
# Function that uses the componentwise Metropolis-Hastings algorithm to
# sample from the joint likelihood for alpha and beta of a binomial
# regression model and returns the (posterior) means of these parameters.
#
# Required arguments:
#   y - vector of numbers of successes
#   x - vector of covariate values (here: the integers 1 to 10)
#   m - vector of numbers of trials
#
# Optional arguments:
#   start  - vector (length 2) of parameter (alpha, beta) starting values
#   avec   - vector (length 2) determining the step sizes of the proposals
#            for alpha and beta
#   nrep   - chain length
#   burnin - number of initial observations to discard
#
# Returns the (posterior) means of the parameters alpha and beta.
#
bireg <- function(y, x, m, start = c(1, 1), avec = c(1, 0.2),
                  nrep = 10000, burnin = 5000) {
    if (mode(y) != "numeric" || mode(x) != "numeric" || mode(m) != "numeric"
        || mode(start) != "numeric" || mode(nrep) != "numeric"
        || mode(avec) != "numeric" || mode(burnin) != "numeric")
        stop("All arguments must be numeric!")

    # logarithm of the (unnormalized) joint posterior distribution
    logpost <- function(alpha, beta, x, m, y) {
        theta <- 1 / (1 + exp(-(alpha + beta * x)))
        return(sum(y * log(theta) + (m - y) * log(1 - theta)))
    }

    # one sweep of the componentwise Metropolis-Hastings algorithm:
    # each component is proposed and accepted/rejected in turn
    MHstep <- function(pars, avec, x, m, y) {
        res <- pars
        for (i in 1:2) {
            prop <- res
            # proposal for component i: uniform on (res[i] - a, res[i] + a)
            prop[i] <- res[i] + 2 * avec[i] * (runif(1) - 0.5)
            # accept with probability min(1, ratio of posterior kernels)
            if (log(runif(1)) < logpost(prop[1], prop[2], x, m, y)
                - logpost(res[1], res[2], x, m, y))
                res[i] <- prop[i]
        }
        return(res)  # return the (possibly updated) parameters
    }

    mchain <- matrix(NA, nrow = nrep, ncol = 2)
    # initialization of the Markov chain matrix
    mchain[1, ] <- start  # parameter starting values
    for (i in 2:nrep) {
        print(i)  # index of current step
        mchain[i, ] <- MHstep(mchain[i - 1, ], avec, x, m, y)
    }
    # return the means of the parameters alpha and beta after burn-in
    return(list(alpha = mean(mchain[(burnin + 1):nrep, 1]),
                beta = mean(mchain[(burnin + 1):nrep, 2])))
}
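A typical call, reproducing the configuration used for the figures (starting values of 1 for both parameters, a chain of length 100,000, and a burn-in of 50,000 steps), might look as follows, using the data vectors defined earlier:

res <- bireg(y, x, m, start = c(1, 1), nrep = 100000, burnin = 50000)
res$alpha  # posterior mean of alpha, approximately -3.8
res$beta   # posterior mean of beta, approximately 0.59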