Pattern Recognition 33 (2000) 1919–1925

                MRF parameter estimation by MCMC method
                                          Lei Wang, Jun Liu*, Stan Z. Li
           School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore
                       Received 13 January 1999; received in revised form 28 July 1999; accepted 28 July 1999


   Markov random field (MRF) modeling is a popular pattern analysis method, and MRF parameter estimation plays an
important role in MRF modeling. In this paper, a method based on Markov chain Monte Carlo (MCMC) is proposed to
estimate MRF parameters. The pseudo-likelihood is used to represent the likelihood function, and it gives a good
estimation result. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: MRF; MCMC; Least-squares fit; Parameter estimation; Pseudo-likelihood

1. Introduction

   The objective of mathematical modeling in pattern analysis is to extract the intrinsic characteristics of a pattern in a few parameters so as to represent the pattern effectively. Markov random field (MRF) modeling is a very popular pattern analysis method, and it plays an important role in pattern recognition and computer vision. MRF models were popularized by Besag to model spatial interactions on a lattice system [1]. They can be used in texture classification and segmentation as well as image restoration [2]. The most important characteristic of MRF modeling is that global patterns can be formed via stochastic propagation of local interactions. Once the form of the model is given, its parameters must be estimated. Over the past years, many methods for estimating MRF parameters have been presented, including simulated annealing [3], maximum likelihood [4], the coding method [1], mean field approximations [5], Bayesian estimation [6] and the least-squares (LS) fit [7].
   Least-squares (LS) and maximum likelihood methods are often used. However, LS is not accurate in estimation, and the maximum likelihood method is time-consuming. Here a method based on MCMC is used to estimate the parameters, and it gives a good solution to the estimation problem.
   The general parameter estimation principle is as follows. Let $\mathbb{F}$ denote a finite set of configurations of a random field $F$, and let $f \in \mathbb{F}$ be an observation of $F$. On $\mathbb{F}$ a family of distributions

$$\Pi = \{\Pi(F; \theta) : \theta \in \Theta\}$$

is considered, where $\Theta \subset \mathbb{R}^d$ is a set of parameters. The 'true' parameter $\theta^* \in \Theta$ is not known and needs to be determined or at least approximated. The only available information is hidden in the observation $f$, which is a realization of $F$. The problem is how to choose $\hat{\theta}$ as a substitute for $\theta^*$ if $f$ is picked at random from $\Pi(F; \theta^*)$.
   In this paper, the estimation of parameters is based on deriving the posterior distribution, which is calculated using the Metropolis–Hastings algorithm. This is a Markov chain Monte Carlo (MCMC) technique [8].
   The paper is arranged as follows. The MRF image model is discussed in Section 2. MCMC parameter estimation is proposed in Section 3. The experiments are shown in Section 4, and the conclusion is given in Section 5.
                                                                        2. MRF modeling

   A lattice is a square array of pixels, or sites, $\{(j, k) : 0 \le j \le N-1,\ 0 \le k \le N-1\}$. We adopt a simple numbering of sites by assigning the sequence number $i = k + Nj$ to site $(j, k)$. Letting $M = N^2$ denote the number of sites,

$$S = \{0, 1, \ldots, M-1\}$$

indexes the set of sites. A random field model is a distribution for the $M$-tuple random vector $F$, which contains a random variable $F(i)$ for the value of site $i$. The sites in $S$ are related to one another via a neighborhood system. A neighborhood system for $S$ is defined as

$$\mathcal{N} = \{\mathcal{N}_i,\ \forall i \in S\},$$

where $\mathcal{N}_i$ is the set of sites neighboring $i$. The neighboring relationship has the following properties:

(1) a site is not neighboring to itself;
(2) the neighboring relationship is mutual.

   A clique $c$ is a set of sites in which all pairs of sites are mutual neighbors. The set of all cliques in a neighborhood system is denoted as $Q$.
   Suppose $F$ is an MRF. Let $f \in \mathbb{F}$ be a realization of $F$. A clique function, or potential function, $V_c(f)$, is associated with each clique, and the energy function $U(f)$ of the MRF can be expressed as the sum of clique functions:

$$U(f) = \sum_{c \in Q} V_c(f).$$

   For a homogeneous MRF, the potential function is independent of location. Thus, the number of clique potentials can be reduced to the number of clique types, that is, each potential corresponds to a clique type.
   Consider a multi-level logistic (MLL) model [4]. Let $L = \{1, 2, \ldots, m\}$ be the label set and $\theta = (\theta_1, \ldots, \theta_n)$ be the parameter vector for clique potentials, where each component corresponds to a clique type. Consider the distribution of Gibbs form,

$$P(F = f, \theta) \propto P(F = f \mid \theta) = Z(\theta)^{-1} \exp(-U(f, \theta)), \tag{1}$$

where $U(f, \theta)$ is the energy function and depends linearly on $\theta$. Suppose $H(f) = (H_1, \ldots, H_n)$ is the histogram of cliques of $f$, where $i = 1, \ldots, n$ indexes the clique type. Let

$$\delta(z) = \begin{cases} 1 & \text{if } z = 0, \\ 0 & \text{otherwise,} \end{cases}$$

$$H_i = \sum_{j \in S} \left[ 2\,\delta(f_j - f_{j'}) - 1 \right], \quad i = 1, 2, \ldots, n,$$

where $j' = c_i(j)$ denotes the neighbor of site $j$ in the $i$th clique $c_i$.
   Then the distributions have the form

$$P(f; \theta) = Z(\theta)^{-1} \exp(\langle \theta, H(f) \rangle), \quad \theta \in \Theta,$$

where $\langle \theta, H(f) \rangle = \sum_{i=1}^{n} \theta_i H_i$ is the inner product of $\theta$ and $H$ (so that $U(f, \theta) = -\langle \theta, H(f) \rangle$), and $Z(\theta) = \sum_{f} \exp(\langle \theta, H(f) \rangle)$ is the normalizing partition function. The conditional probability is as follows:

$$P(f_j \mid f_{\mathcal{N}_j}) = \frac{\exp(\langle \theta, H_j(f) \rangle)}{\sum_{z_j \in L} \exp(\langle \theta, H_j(z_j) \rangle)}, \tag{2}$$

where $H_j(f)$ is the local histogram calculated only in the neighborhood of site $j$, and $H_j(z_j)$ denotes the local histogram obtained by replacing $f_j$ with $z_j$ while the neighborhood of $j$ is fixed. The computation of $Z(\theta)$ is infeasible because there is a combinatorial number of elements in the configuration space. In order to avoid using the partition function $Z(\theta)$, the pseudo-likelihood function

$$PL(f \mid \theta) = \log \prod_{j \in S} P(f_j \mid f_{\mathcal{N}_j}) = \sum_{j \in S} \left[ \langle \theta, H_j(f) \rangle - \log \sum_{z_j \in L} \exp(\langle \theta, H_j(z_j) \rangle) \right] \tag{3}$$

can be used to replace the likelihood function. The pseudo-likelihood does not involve the partition function $Z(\theta)$. Hence it is much easier to calculate.

3. MCMC estimation of MRF parameters

   According to Bayes' theorem, the posterior distribution of $\theta$ conditional on $f$ is

$$P(\theta \mid f) = \frac{P(\theta) P(f \mid \theta)}{\int P(\theta) P(f \mid \theta)\, d\theta} \propto P(\theta) P(f \mid \theta). \tag{4}$$

According to Gilks et al. [8], any features of the posterior distribution are legitimate for Bayesian inference: moments, quantiles, highest posterior density regions, etc. All these quantities can be expressed in terms of posterior expectations of functions of $\theta$. The posterior expectation of a function $g(\theta)$ is

$$E[g(\theta) \mid f] = \frac{\int g(\theta) P(\theta) P(f \mid \theta)\, d\theta}{\int P(\theta) P(f \mid \theta)\, d\theta}. \tag{5}$$

The integrations in this expression are difficult to evaluate in Bayesian inference. Monte Carlo integration
including the Markov chain Monte Carlo (MCMC) approach [6] can be used to deal with the difficulty [8]. The task is to evaluate the expectation

$$E(g(\theta)) = \frac{\int g(\theta) P(\theta)\, d\theta}{\int P(\theta)\, d\theta}. \tag{6}$$

A Markov chain can be adopted for the purpose of evaluation. Suppose we generate a sequence of random variables $\{\theta_0, \theta_1, \theta_2, \ldots\}$. At each time $t \ge 0$, the next state $\theta_{t+1}$ is sampled from a distribution $P(\theta_{t+1} \mid \theta_t)$ which depends only on the current state $\theta_t$ of the chain. This Markov chain is assumed to be time-homogeneous. Thus, the sequence will gradually converge to a unique stationary distribution $\pi(\cdot)$. After a sufficiently long burn-in of, say, $m$ iterations, $\{\theta_t,\ t = m+1, \ldots, n\}$ will be dependent samples approximately from $\pi(\cdot)$. Let

$$\bar{\theta} = \frac{1}{n-m} \sum_{t=m+1}^{n} \theta_t. \tag{7}$$

   This is an ergodic average. Convergence to the required expectation is ensured by the ergodic theorem. Eq. (7) shows how a Markov chain can be used to estimate $E(\theta \mid f)$. Such a Markov chain can be constructed by the Metropolis–Hastings algorithm [8]. At each time $t$, the next state $\theta_{t+1}$ is chosen by first sampling a candidate point $\theta'$ from a proposal distribution $q(\cdot \mid \theta_t)$. The choice of proposal distribution is almost arbitrary; here a multivariate normal distribution centered on the current value $\theta_t$ is adopted. The candidate $\theta'$ is accepted

Fig. 1. Textures used in the experiment. (a) Number of graylevels M = 2, (b) M = 2, (c) M = 4, (d) M = 4.
with probability

$$\alpha(\theta_t, \theta') = \min\left\{1, \frac{P(\theta' \mid f)\, q(\theta_t \mid \theta')}{P(\theta_t \mid f)\, q(\theta' \mid \theta_t)}\right\}.$$

The transition kernel for the Metropolis–Hastings algorithm is

$$P(\theta_{t+1} \mid \theta_t) = q(\theta_{t+1} \mid \theta_t)\, \alpha(\theta_t, \theta_{t+1}) + I(\theta_{t+1} = \theta_t) \left[ 1 - \int q(\theta' \mid \theta_t)\, \alpha(\theta_t, \theta')\, d\theta' \right],$$

where $I(\cdot)$ denotes the indicator function (taking 1 when its argument is true, and 0 otherwise). If the candidate $\theta'$ is accepted, the next state becomes $\theta_{t+1} = \theta'$; otherwise $\theta_{t+1} = \theta_t$. Since $P(\theta \mid f) \propto P(\theta) P(f \mid \theta)$ and the prior $P(\theta)$ can be assumed to be flat when prior information is totally unavailable,

$$\alpha(\theta_t, \theta') = \min\left\{1, \frac{P(\theta' \mid f)\, q(\theta_t \mid \theta')}{P(\theta_t \mid f)\, q(\theta' \mid \theta_t)}\right\} = \min\left\{1, \frac{P(f \mid \theta')\, q(\theta_t \mid \theta')}{P(f \mid \theta_t)\, q(\theta' \mid \theta_t)}\right\}. \tag{8}$$

Since the proposal distribution chosen here is normal and centered on the current value, $q(\theta' \mid \theta_t) = q(\theta_t \mid \theta')$ due to the symmetry of the proposal distribution. Thus, the acceptance probability can be reduced to

$$\alpha(\theta_t, \theta') = \min\left\{1, \frac{P(f \mid \theta')}{P(f \mid \theta_t)}\right\}. \tag{9}$$

Thus, the Metropolis–Hastings algorithm reduces to the Metropolis algorithm. When we use the pseudo-likelihood to represent the likelihood function, we get

$$\alpha(\theta_t, \theta') = \min\left(1, \exp\left(PL(f \mid \theta') - PL(f \mid \theta_t)\right)\right)
= \min\left\{1, \exp\left( \sum_{j \in S} \left[ \langle \theta', H_j(f) \rangle - \langle \theta_t, H_j(f) \rangle - \log \sum_{z_j \in L} \exp(\langle \theta', H_j(z_j) \rangle) + \log \sum_{z_j \in L} \exp(\langle \theta_t, H_j(z_j) \rangle) \right] \right) \right\}. \tag{10}$$

With this acceptance probability, $\theta$ can be approximated effectively.
   The Metropolis–Hastings algorithm can be summarized in the following procedure:

   Initialize $\theta_0$; set $t = 0$ and $T$ = maximum number of iterations
   While $t < T$
      Sample a point $\theta'$ from $q(\cdot \mid \theta_t)$
      Sample a uniform (0, 1) random variable $v$
      If $v \le \alpha(\theta_t, \theta')$, set $\theta_{t+1} = \theta'$. Otherwise set $\theta_{t+1} = \theta_t$
      Increment $t$

4. Experiments

   In order to inspect the performance of the method proposed in this paper, a Gibbs sampler [4] is used to sample textures with specified parameters. Here a second-order neighborhood system is used, and four double-site cliques with parameters $\{\theta_1, \theta_2, \theta_3, \theta_4\}$, corresponding to 0°, 90°, 45° and 135° respectively, are adopted as non-zero parameters. Fig. 1 shows four 128×128 textures generated by the Gibbs sampler. The first two textures are sampled with two graylevels and the next two with four graylevels. The parameters of the four textures are listed in Table 1. In order to get acceptable parameters, the MCMC procedure described in the previous section should be repeated until stability of the Markov chain is reached. The choice of starting values $\theta_0$ will not affect the stationary distribution if the chain is irreducible. In our experiments, $\theta_0$ are chosen randomly. The usual informal approach to detecting convergence is visual inspection of plots of the Monte Carlo output $\{\theta_t,\ t = 1, \ldots, n\}$. Figs. 2–5 show three independent Markov chains for texture 4.

Fig. 2. 1000 iterations with different starting values for estimating $\theta_1$ for texture 4.
Fig. 3. 1000 iterations with different starting values for estimating $\theta_2$ for texture 4.
Fig. 4. 1000 iterations with different starting values for estimating $\theta_3$ for texture 4.

From the

figures, we observe that the length of burn-in depends on $\theta_0$. The Markov chains converge in less than 300 iterations in most examples, according to visual inspection of the monitoring statistics. Here we set the burn-in to $m = 500$. More formal methods for convergence diagnostics can be found in Refs. [9,10]. Deciding on the number of iterations is an important practical matter. The aim is to run the chain long enough to obtain adequate precision in the estimator. Here three chains are run in parallel with different starting values, and their estimates $\bar{\theta}$ from Eq. (7) are compared. If they do not agree adequately, the iteration number $n$ must be increased. Initially, we set $n = 1000$. If the estimates $\bar{\theta}$ do not agree adequately, we increase $n$ by 500 iterations each time until the estimates are similar. We only need to inspect the mean $\bar{\theta}$ and the variance of the Monte Carlo output. In our experiments in Table 1, $n = 1000$ is enough. The results of the MCMC approach in Table 1 are acceptable, where $\sigma$ denotes the average standard deviation of the Markov chains after burn-in. In order to verify the performance of this method, the least-squares (LS) fit method proposed by Derin and Elliott [7] is also used in our experiments. From Table 1, it can be seen that the LS

Table 1
MRF parameter estimation

Textures     Method       $\theta_1$    $\theta_2$    $\theta_3$    $\theta_4$
Texture 1    Specified     1             1            -0.5          -0.5
             LS            0.8448        0.8734       -0.4332       -0.4382
             MCMC          0.9884        0.9899       -0.5076       -0.5078
             $\sigma$      0.0436        0.0323        0.0278        0.0400
Texture 2    Specified     1            -0.8           0.5          -0.5
             LS            0.9949       -0.8157        0.4960       -0.3244
             MCMC          1.0093       -0.8586        0.5569       -0.4522
             $\sigma$      0.0147        0.0235        0.0168        0.0245
Texture 3    Specified     0.3           0.3           0.3           0.3
             LS            0.1152        0.1520        0.1867        0.1444
             MCMC          0.3478        0.2762        0.2960        0.2877
             $\sigma$      0.0266        0.0165        0.0130        0.0086
Texture 4    Specified     0.5           1            -0.5           0.7
             LS            0.0415        0.4364        0             0.4201
             MCMC          0.5525        1.0394       -0.5951        0.6810
             $\sigma$      0.0321        0.0364        0.0490        0.0156
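
As a rough check on the comparison in Table 1, the error of each method against the specified parameters can be computed directly from the tabulated values. The sketch below is an illustration only: the error measure (mean absolute error over the four parameters) and the names `TABLE_1`, `mean_abs_error` and `compare_methods` are assumptions introduced here, not part of the original experiments.

```python
# Values transcribed from Table 1: specified parameters and the LS / MCMC estimates.
TABLE_1 = {
    "Texture 1": {"specified": [1, 1, -0.5, -0.5],
                  "LS": [0.8448, 0.8734, -0.4332, -0.4382],
                  "MCMC": [0.9884, 0.9899, -0.5076, -0.5078]},
    "Texture 2": {"specified": [1, -0.8, 0.5, -0.5],
                  "LS": [0.9949, -0.8157, 0.4960, -0.3244],
                  "MCMC": [1.0093, -0.8586, 0.5569, -0.4522]},
    "Texture 3": {"specified": [0.3, 0.3, 0.3, 0.3],
                  "LS": [0.1152, 0.1520, 0.1867, 0.1444],
                  "MCMC": [0.3478, 0.2762, 0.2960, 0.2877]},
    "Texture 4": {"specified": [0.5, 1, -0.5, 0.7],
                  "LS": [0.0415, 0.4364, 0, 0.4201],
                  "MCMC": [0.5525, 1.0394, -0.5951, 0.6810]},
}


def mean_abs_error(estimate, truth):
    """Mean absolute difference between an estimate and the specified values."""
    return sum(abs(e - t) for e, t in zip(estimate, truth)) / len(truth)


def compare_methods(table=TABLE_1):
    """Return {texture: {method: mean absolute error}} for LS and MCMC."""
    return {
        name: {method: mean_abs_error(rows[method], rows["specified"])
               for method in ("LS", "MCMC")}
        for name, rows in table.items()
    }


if __name__ == "__main__":
    for name, errors in compare_methods().items():
        print(f"{name}: LS {errors['LS']:.4f}  MCMC {errors['MCMC']:.4f}")
```

Under this measure the LS error exceeds the MCMC error for every texture, and the gap is widest for the four-graylevel textures 3 and 4.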
cation and segmentation as well as image restoration. MRF parameter estimation plays an important role in MRF modeling. In order to estimate MRF parameters effectively and efficiently, an MRF parameter estimation method based on MCMC is proposed in this paper. A Markov chain is constructed to sample the MRF parameters via the Monte Carlo method. The MLL model is used as the image model. In order to avoid calculating the normalizing partition function, the pseudo-likelihood function is used to represent the likelihood function. Compared to the least-squares fit method, our method is more accurate and can be used effectively for multi-graylevel texture parameter estimation, as seen from the experiments in the paper. This method can be extended to multi-resolution analysis of texture modeling and segmentation of textured images.
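
The procedure summarized above (an MLL model, the pseudo-likelihood of Eq. (3) in place of the likelihood, and a Metropolis chain over the parameters with the acceptance rule of Eq. (10)) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the clique offsets in `OFFSETS`, the free boundary, the isotropic proposal standard deviation `step`, the starting range and all function names are hypothetical choices made here.

```python
import math
import random

# Assumed (row, col) offsets of the four pair cliques: 0, 90, 45 and 135 degrees.
OFFSETS = [(0, 1), (1, 0), (-1, 1), (1, 1)]


def local_histograms(img, labels):
    """For each site j, store the observed label and H_j(z) for every label z.

    H_j(z) has one entry per clique type: the sum of 2*delta(z - neighbor) - 1
    over the neighbors of j inside the lattice (free boundary, an assumption).
    """
    rows, cols = len(img), len(img[0])
    table = []
    for r in range(rows):
        for c in range(cols):
            per_label = {}
            for z in labels:
                hist = []
                for dr, dc in OFFSETS:
                    total = 0
                    for sr, sc in ((r + dr, c + dc), (r - dr, c - dc)):
                        if 0 <= sr < rows and 0 <= sc < cols:
                            total += 1 if img[sr][sc] == z else -1
                    hist.append(total)
                per_label[z] = hist
            table.append((img[r][c], per_label))
    return table


def pseudo_log_likelihood(theta, table):
    """Eq. (3): sum over sites of <theta, H_j(f)> - log sum_z exp(<theta, H_j(z)>)."""
    total = 0.0
    for observed, per_label in table:
        scores = {z: sum(t * h for t, h in zip(theta, hz))
                  for z, hz in per_label.items()}
        peak = max(scores.values())  # subtract the maximum for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores.values()))
        total += scores[observed] - log_norm
    return total


def metropolis_estimate(img, labels, n_iter=1000, burn_in=500, step=0.05, seed=0):
    """Random-walk Metropolis chain with a flat prior.

    Returns the ergodic average of Eq. (7) over the states after burn-in.
    """
    rng = random.Random(seed)
    table = local_histograms(img, labels)  # does not depend on theta
    theta = [rng.uniform(-1.0, 1.0) for _ in OFFSETS]  # random starting value
    current = pseudo_log_likelihood(theta, table)
    sums = [0.0] * len(theta)
    for t in range(1, n_iter + 1):
        candidate = [x + rng.gauss(0.0, step) for x in theta]
        proposed = pseudo_log_likelihood(candidate, table)
        # Accept with probability min(1, exp(PL(candidate) - PL(current))).
        if math.log(1.0 - rng.random()) <= proposed - current:
            theta, current = candidate, proposed
        if t > burn_in:
            sums = [s + x for s, x in zip(sums, theta)]
    return [s / (n_iter - burn_in) for s in sums]
```

The local histograms are computed once because they depend only on the image, so each iteration costs a single pass over the sites. A call such as `metropolis_estimate(img, labels=[0, 1])` expects `img` as a list of rows whose entries all belong to `labels`.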
Fig. 5. 1000 iterations with different starting values for estimating $\theta_4$ for texture 4.
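
The comparison of chains started from different values, as plotted in these figures, can also be carried out numerically: discard the burn-in, average each chain as in Eq. (7), and lengthen the chains while the averages disagree. The sketch below is hedged accordingly. The agreement tolerance `tol`, the cap `n_max` and the callback `extend(chain, count)`, which is expected to append `count` new states to a chain, are hypothetical, since the text only asks that the estimates "agree adequately".

```python
def ergodic_average(chain, burn_in):
    """Average the recorded states of one chain after dropping the first burn_in."""
    kept = chain[burn_in:]
    dim = len(kept[0])
    return [sum(state[i] for state in kept) / len(kept) for i in range(dim)]


def chains_agree(chains, burn_in, tol):
    """True if, for every parameter, the per-chain averages differ by at most tol."""
    means = [ergodic_average(chain, burn_in) for chain in chains]
    return all(max(column) - min(column) <= tol for column in zip(*means))


def run_until_agreement(extend, chains, burn_in=500, n=1000, step=500,
                        tol=0.1, n_max=5000):
    """Grow all chains to n states, then by `step` at a time, until they agree.

    Returns the chain length at which the averages agree, or None if n_max
    is exceeded first.
    """
    while n <= n_max:
        for chain in chains:
            extend(chain, n - len(chain))
        if chains_agree(chains, burn_in, tol):
            return n
        n += step
    return None
```

The defaults mirror the settings quoted in the experiments (burn-in of 500, an initial length of 1000, increments of 500); each chain is passed in as a list that already holds its starting state.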

Acknowledgements

   We wish to thank the reviewers for their constructive comments and suggestions.

method is effective only for textures with two graylevels, while the MCMC method is effective for all examples in the experiments, even when more graylevels are adopted in the model. From this comparison, the MCMC method proposed in this paper performs much better than the LS method. The MCMC routines are run on a Sun Ultra 2 workstation; each analysis takes less than 3 min to perform 1000
