SPEAKER IDENTIFICATION USING GAUSSIAN MIXTURE MODELS
BASED ON MULTI-SPACE PROBABILITY DISTRIBUTION

Chiyomi Miyajima†, Yosuke Hattori†∗, Keiichi Tokuda†, Takashi Masuko‡, Takao Kobayashi‡, and Tadashi Kitamura†

† Nagoya Institute of Technology, Nagoya 466-8555, Japan
‡ Tokyo Institute of Technology, Yokohama 226-8502, Japan
∗ Present address: DENSO Corporation, Kariya 448-8661, Japan
† {chiyomi, tokuda, kitamura}@ics.nitech.ac.jp  ‡ {masuko, tkobayas}@ip.titech.ac.jp  ∗ hattori@rd.denso.co.jp



ABSTRACT

This paper presents a new approach to modeling speech spectra and pitch for text-independent speaker identification, using Gaussian mixture models based on multi-space probability distribution (MSD-GMM). The MSD-GMM allows us to model continuous pitch values for voiced frames and discrete symbols representing unvoiced frames in a unified framework. Spectral and pitch features are jointly modeled by a two-stream MSD-GMM. We derive maximum likelihood (ML) estimation formulae for the MSD-GMM parameters, and the MSD-GMM speaker models are evaluated on text-independent speaker identification tasks. Experimental results show that the MSD-GMM models the spectral and pitch features of each speaker efficiently and outperforms conventional speaker models.

1. INTRODUCTION

Gaussian mixture models (GMMs) have been successfully applied to speaker modeling in text-independent speaker identification [1]. Such identification systems mainly use spectral features, represented by cepstral coefficients, as speaker features. Pitch features, as well as spectral features, contain much speaker-specific information [2, 3]. Nevertheless, most speaker recognition studies in recent years have focused on spectral features alone. The main reasons are that i) pitch features alone do not give sufficient recognition performance, and ii) pitch values are not defined in unvoiced segments, which complicates speaker modeling and feature integration.

Several works have reported that speaker recognition accuracy can be improved by using pitch features in addition to spectral features [4, 5, 6, 7]. There are essentially two approaches to integrating spectral and pitch information: i) two separate models are used for the spectral and pitch features, and their scores are combined [4, 5]; ii) two separate models are trained for the voiced and unvoiced parts, and their scores are combined [6, 7]. In [7], two separate GMMs are used, where the input observations are concatenations of cepstral coefficients and log F0 for voiced frames, and cepstral coefficients alone for unvoiced frames. Since the probability distribution of the conventional GMM is defined on a single vector space, these two kinds of vectors require their respective models.

(A part of this work was supported by Research Fellowships from the Japan Society for the Promotion of Science for Young Scientists, No. 199808177.)

In this paper, a new speaker modeling technique using a GMM based on multi-space probability distribution (MSD) [8] is introduced. The MSD-GMM allows us to model feature vectors with variable dimensionality, including zero-dimensional vectors, i.e., discrete symbols. Consequently, continuous pitch values for voiced frames and discrete symbols representing "unvoiced" can be modeled by an MSD-GMM in a unified framework, and the spectral and pitch features are jointly modeled by a multi-stream MSD-GMM, i.e., each speaker is modeled by a single statistical model. We derive maximum likelihood (ML) estimation formulae and evaluate the MSD-GMM speaker models on text-independent speaker identification tasks in comparison with conventional GMM speaker models.

The rest of the paper is organized as follows. Section 2 introduces the speaker modeling technique based on the MSD-GMM. Section 3 presents the ML estimation procedure for the MSD-GMM parameters. Section 4 reports experimental results, and Section 5 gives conclusions and future work.

2. MULTI-STREAM MSD-GMM

2.1. Likelihood Calculation

Let us assume that a given observation o_t at time t consists of S information sources (streams). The s-th stream o_ts has a set of space indices X_ts and a random vector x_ts of variable dimensionality, that is,

    o_t = (o_{t1}, o_{t2}, \ldots, o_{tS}),                                  (1)
    o_{ts} = (X_{ts}, x_{ts}).                                               (2)

Note here that X_ts is a subset of all possible space indices {1, 2, ..., G_s}, and all the spaces represented by the indices in X_ts have the same dimensionality as x_ts.

We define the output probability distribution of an S-stream MSD-GMM λ for o_t as

    b(o_t \mid \lambda) = \sum_{m=1}^{M} c_m \prod_{s=1}^{S} p_{ms}(o_{ts}),  (3)

where c_m is the mixture weight for the m-th mixture component. The observation probability of o_ts for mixture m is given by the multi-space probability distribution (MSD) [8]:

    p_{ms}(o_{ts}) = \sum_{g \in X_{ts}} w_{msg} \, N_{msg}^{D_{sg}}(x_{ts}),  (4)
[Fig. 1. An example of the m-th mixture component of a three-stream MSD-GMM.]

where w_msg is the weight for the g-th vector space of the s-th stream and N_{msg}^{D_{sg}}(·) is the D_sg-variate Gaussian function with mean vector μ_msg and covariance matrix Σ_msg (for the case D_sg > 0). For simplicity of notation, we define N_{msg}^{0}(·) ≡ 1 (for the case D_sg = 0). Note here that the multi-space probability distribution (MSD) reduces to a continuous probability distribution when D_sg ≡ n > 0 and to a discrete probability distribution when D_sg ≡ 0. The MSD-GMM can thus be regarded as a generalized GMM, which includes the traditional GMM as a special case when S = 1, G_1 = 1, and D_11 > 0.

For an observation sequence O = (o_1, o_2, ..., o_T), the likelihood of the MSD-GMM λ is given by

    P(O \mid \lambda) = \prod_{t=1}^{T} b(o_t \mid \lambda).                 (5)
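To make (3)-(5) concrete, here is a minimal numpy sketch of the per-frame and sequence likelihoods for a diagonal-covariance MSD-GMM. The function names, the parameter layout, and the use of None to mark a zero-dimensional space are our own illustrative conventions, not part of the paper.

```python
import numpy as np

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def frame_prob(obs, c, w, means, vars_):
    """b(o_t | lambda) of Eq. (3). `obs` is a list of S streams, each a pair
    (space_indices, x) as in Eq. (2); c[m] are mixture weights, w[m][s][g]
    space weights, and means/vars_[m][s][g] Gaussian parameters, with None
    marking a zero-dimensional space where N() is defined to be 1."""
    total = 0.0
    for m in range(len(c)):
        prod = c[m]
        for s, (spaces, x) in enumerate(obs):
            # Eq. (4): weighted sum over the spaces indexed by X_ts.
            p = 0.0
            for g in spaces:
                if means[m][s][g] is None:      # D_sg = 0: space weight alone
                    p += w[m][s][g]
                else:                           # D_sg > 0: weighted Gaussian
                    p += w[m][s][g] * np.exp(
                        log_gauss_diag(x, means[m][s][g], vars_[m][s][g]))
            prod *= p
        total += prod
    return total

def sequence_log_likelihood(O, c, w, means, vars_):
    """log P(O | lambda) of Eq. (5), accumulated in the log domain."""
    return sum(np.log(frame_prob(obs, c, w, means, vars_)) for obs in O)
```

In practice the per-frame probability would also be computed in the log domain throughout to avoid underflow; the direct form is kept here only to mirror (3) and (4).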
Figure 1 illustrates an example of the m-th mixture component of a three-stream MSD-GMM (S = 3). The sample space of the first stream consists of four spaces (G_1 = 4), among which the second and third spaces are triggered by the space indices, and p_m1(o_t1) becomes the sum of the two weighted Gaussians. The second stream has only one space (G_2 = 1) and always outputs its Gaussian as p_m2(o_t2). The third stream consists of two spaces (G_3 = 2), where a zero-dimensional space is selected, and outputs its space weight w_m32 (a discrete probability) as p_m3(o_t3).
2.2. Speaker Modeling Based on MSD-GMM

[Fig. 2. An example of spectral and pitch sequences of the word "/ashi/" spoken by a male speaker: a spectrum panel (0-5 kHz) and a pitch panel (50-200 Hz), with voiced (V) and unvoiced (UV) segments marked.]

Figure 2 shows an example of the spectral and pitch sequences of the Japanese word "/ashi/" spoken by a male speaker. Generally, spectral features are represented by multi-dimensional vectors of cepstral coefficients with continuous values. On the other hand, pitch features are represented by one-dimensional continuous values of log fundamental frequency (log F0) in voiced frames and by discrete symbols representing "unvoiced" in unvoiced frames, because pitch values are defined only in voiced segments.

[Fig. 3. The m-th mixture component in a two-stream MSD-GMM based on spectra and pitch.]

As shown in Fig. 3, each speaker can be modeled by a two-stream MSD-GMM (S = 2); the first stream is for the spectral feature and the second stream is for the pitch feature. The spectral stream has a D-dimensional space (G_1 = 1), and the pitch stream has two spaces (G_2 = 2): a one-dimensional space and a zero-dimensional space for the voiced and unvoiced parts, respectively.
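As a concrete illustration of this two-stream construction, the sketch below converts per-frame cepstra and F0 values into observations of the form (1)-(2). The helper name, the 0-based space indexing, and the convention that f0 == 0 marks an unvoiced frame are assumptions for this sketch, not conventions from the paper.

```python
import numpy as np

def make_two_stream_obs(cepstra, f0):
    """Turn per-frame cepstra and F0 values into two-stream observations
    of the form (1)-(2). Space indices are 0-based here: the spectral
    stream (G1 = 1) always uses its single space; the pitch stream
    (G2 = 2) uses space 0 (1-dimensional, holding log F0) for voiced
    frames and space 1 (0-dimensional) for unvoiced frames."""
    O = []
    for cep, f in zip(cepstra, f0):
        spec_stream = ({0}, np.asarray(cep, dtype=float))
        if f > 0.0:                                       # voiced frame
            pitch_stream = ({0}, np.array([np.log(f)]))
        else:                                             # unvoiced frame
            pitch_stream = ({1}, np.empty(0))
        O.append([spec_stream, pitch_stream])
    return O
```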
3. ML-ESTIMATION FOR MSD-GMM

In a similar way to the ML estimation procedure in [8], P(O | λ) is increased by iterating the maximization of an auxiliary function Q(λ', λ) over λ, improving the current parameters λ', based on the expectation-maximization (EM) algorithm.

3.1. Definition of Q-Function

The log-likelihood of λ for an observation sequence O, a sequence of mixture components i, and a sequence of space indices l can be written as

    \log P(O, i, l \mid \lambda)
      = \sum_{t=1}^{T} \log c_{i_t}
      + \sum_{t=1}^{T} \sum_{s=1}^{S} \log w_{i_t s l_{ts}}
      + \sum_{t=1}^{T} \sum_{s=1}^{S} \log N_{i_t s l_{ts}}^{D_{s l_{ts}}}(x_{ts}),   (6)

where

    i = (i_1, i_2, \ldots, i_T),                                             (7)
    l = (l_1, l_2, \ldots, l_T),                                             (8)
    l_t = (l_{t1}, l_{t2}, \ldots, l_{tS}).                                  (9)

Hence the Q-function is defined as

    Q(\lambda', \lambda)
      = \sum_{\mathrm{all}\ i,l} P(i, l \mid O, \lambda') \log P(O, i, l \mid \lambda)
      = \sum_{\mathrm{all}\ i,l} P(i, l \mid O, \lambda') \sum_{t=1}^{T} \log c_{i_t}
      + \sum_{\mathrm{all}\ i,l} P(i, l \mid O, \lambda') \sum_{t=1}^{T} \sum_{s=1}^{S} \log w_{i_t s l_{ts}}
      + \sum_{\mathrm{all}\ i,l} P(i, l \mid O, \lambda') \sum_{t=1}^{T} \sum_{s=1}^{S} \log N_{i_t s l_{ts}}^{D_{s l_{ts}}}(x_{ts})
      = \sum_{m=1}^{M} \sum_{t=1}^{T} P(i_t = m \mid O, \lambda') \log c_m
      + \sum_{m=1}^{M} \sum_{s=1}^{S} \sum_{g=1}^{G_s} \sum_{t \in \mathcal{T}(O,s,g)} P(i_t = m, l_{ts} = g \mid O, \lambda') \log w_{msg}
      + \sum_{m=1}^{M} \sum_{s=1}^{S} \sum_{g=1}^{G_s} \sum_{t \in \mathcal{T}(O,s,g)} P(i_t = m, l_{ts} = g \mid O, \lambda') \log N_{msg}^{D_{sg}}(x_{ts}),   (10)

where

    \mathcal{T}(O, s, g) = \{ t \mid g \in X_{ts} \}.                        (11)
3.2. Maximization of Q-Function

The first two terms of (10) have the form \sum_{i=1}^{N} u_i \log y_i, which attains its global maximum at the single point

    y_i = \frac{u_i}{\sum_{j=1}^{N} u_j}, \quad i = 1, 2, \ldots, N,         (12)

under the constraints \sum_{i=1}^{N} y_i = 1 and y_i \geq 0. The maximization of the first term of (10) thus leads to the re-estimate of c_m:

    c_m = \frac{\sum_{t=1}^{T} P(i_t = m \mid O, \lambda')}
               {\sum_{m'=1}^{M} \sum_{t=1}^{T} P(i_t = m' \mid O, \lambda')}
        = \frac{1}{T} \sum_{t=1}^{T} P(i_t = m \mid O, \lambda')
        = \frac{1}{T} \sum_{t=1}^{T} \gamma_t(m),                            (13)

where γ_t(m) is the posterior probability of being in the m-th mixture component at time t, that is,

    \gamma_t(m) = P(i_t = m \mid O, \lambda')
                = \frac{c_m \prod_{s=1}^{S} p_{ms}(o_{ts})}{b(o_t \mid \lambda')}.   (14)

Similarly, the second term is maximized by

    w_{msg} = \frac{\sum_{t \in \mathcal{T}(O,s,g)} \xi_{ts}(m, g)}
                   {\sum_{l=1}^{G_s} \sum_{t \in \mathcal{T}(O,s,l)} \xi_{ts}(m, l)},   (15)

where ξ_ts(m, g) is the posterior probability of being in the g-th space of stream s in the m-th mixture component at time t:

    \xi_{ts}(m, g) = P(i_t = m, l_{ts} = g \mid O, \lambda')
                   = P(i_t = m \mid O, \lambda') \, P(l_{ts} = g \mid i_t = m, O, \lambda')
                   = \gamma_t(m) \, \frac{w_{msg} N_{msg}^{D_{sg}}(x_{ts})}{p_{ms}(o_{ts})}.   (16)

The third term is maximized by solving the following equations:

    \frac{\partial}{\partial \mu_{msg}} \sum_{t \in \mathcal{T}(O,s,g)} P(i_t = m, l_{ts} = g \mid O, \lambda') \log N_{msg}^{D_{sg}}(x_{ts}) = 0,   (17)

    \frac{\partial}{\partial \Sigma_{msg}^{-1}} \sum_{t \in \mathcal{T}(O,s,g)} P(i_t = m, l_{ts} = g \mid O, \lambda') \log N_{msg}^{D_{sg}}(x_{ts}) = 0,   (18)

resulting in

    \mu_{msg} = \frac{\sum_{t \in \mathcal{T}(O,s,g)} \xi_{ts}(m, g) \, x_{ts}}
                     {\sum_{t \in \mathcal{T}(O,s,g)} \xi_{ts}(m, g)},       (19)

    \Sigma_{msg} = \frac{\sum_{t \in \mathcal{T}(O,s,g)} \xi_{ts}(m, g) (x_{ts} - \mu_{msg})(x_{ts} - \mu_{msg})^{\top}}
                        {\sum_{t \in \mathcal{T}(O,s,g)} \xi_{ts}(m, g)}.    (20)

The re-estimation is repeated iteratively using the updated λ in place of λ', and the final result is an ML estimate of the MSD-GMM parameters.
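As a worked illustration of one pass of these updates, the sketch below accumulates γ_t(m) of (14) and ξ_ts(m, g) of (16) over a training sequence and then applies (13), (15), (19), and (20) for diagonal covariances. It reuses the parameter layout of the earlier frame_prob sketch; every name here is illustrative, and the variance floor is our own numerical safeguard, not from the paper.

```python
import numpy as np

def em_step(O, c, w, means, vars_, var_floor=1e-6):
    """One EM pass for a diagonal-covariance MSD-GMM."""
    M, T, S = len(c), len(O), len(O[0])
    occ = np.zeros(M)                                    # sum_t gamma_t(m)
    num_w = [[np.zeros(len(w[m][s])) for s in range(S)] for m in range(M)]
    sum_x = [[[None if means[m][s][g] is None else np.zeros_like(means[m][s][g])
               for g in range(len(w[m][s]))] for s in range(S)] for m in range(M)]
    sum_xx = [[[None if means[m][s][g] is None else np.zeros_like(means[m][s][g])
                for g in range(len(w[m][s]))] for s in range(S)] for m in range(M)]

    for obs in O:                                        # E-step
        p = np.ones((M, S))                              # p_ms(o_ts), Eq. (4)
        terms = {}                                       # w_msg * N_msg(x_ts)
        for m in range(M):
            for s, (spaces, x) in enumerate(obs):
                tot = 0.0
                for g in spaces:
                    if means[m][s][g] is None:           # zero-dimensional space
                        val = w[m][s][g]
                    else:
                        var = vars_[m][s][g]
                        val = w[m][s][g] * np.exp(-0.5 * np.sum(
                            np.log(2 * np.pi * var) + (x - means[m][s][g]) ** 2 / var))
                    terms[(m, s, g)] = val
                    tot += val
                p[m, s] = tot
        joint = np.asarray(c) * np.prod(p, axis=1)       # c_m * prod_s p_ms
        gamma = joint / joint.sum()                      # Eq. (14)
        occ += gamma
        for m in range(M):
            for s, (spaces, x) in enumerate(obs):
                for g in spaces:
                    xi = gamma[m] * terms[(m, s, g)] / p[m, s]   # Eq. (16)
                    num_w[m][s][g] += xi
                    if means[m][s][g] is not None:
                        sum_x[m][s][g] += xi * x
                        sum_xx[m][s][g] += xi * x * x

    new_c = occ / T                                      # M-step, Eq. (13)
    for m in range(M):
        for s in range(S):
            w[m][s] = num_w[m][s] / num_w[m][s].sum()    # Eq. (15)
            for g in range(len(w[m][s])):
                if means[m][s][g] is not None and num_w[m][s][g] > 0:
                    mu = sum_x[m][s][g] / num_w[m][s][g]           # Eq. (19)
                    means[m][s][g] = mu
                    vars_[m][s][g] = np.maximum(                   # Eq. (20), diagonal
                        sum_xx[m][s][g] / num_w[m][s][g] - mu * mu, var_floor)
    return new_c, w, means, vars_
```

Note that, as in (15), summing ξ_ts(m, g) over the spaces of a stream recovers γ_t(m), so the re-estimated space weights of each stream normalize to one.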
4. EXPERIMENTAL EVALUATION

4.1. Experimental Conditions

First, a speaker identification experiment was carried out on the ATR Japanese speech database. We used word data spoken by 80 speakers (40 male and 40 female). A set of 216 phonetically balanced words was used for training each speaker model, and 520 common words per speaker were used for testing, giving 41,600 tests in total.

Second, to evaluate the robustness of the MSD-GMM speaker model against inter-session variability, we also conducted a speaker identification experiment on the NTT database. This database consists of sentence data uttered by 35 Japanese speakers (22 male and 13 female) in five sessions over ten months (Aug., Sept., and Dec. 1990; Mar. and June 1991). In each session, 15 sentences were recorded for each speaker. Ten sentences are common to all speakers and all sessions (A-set), and five sentences differ for each speaker and each session (B-set). The duration of each sentence is approximately four seconds. We used 15 sentences (the A-set plus the B-set from the first session) per speaker for training, and 20 sentences (the B-sets from the other four sessions) per speaker for testing, giving 700 tests in total.

The speech data were down-sampled to 10 kHz, windowed at a 10-ms frame rate with a 25.6-ms Blackman window, and parameterized into 13 mel-cepstral coefficients using a mel-cepstral estimation technique [9]. The 12 static parameters excluding the zero-th coefficient were used as the spectral feature. Fundamental frequencies (F0) were estimated at a 10-ms frame rate using the RAPT method [10] with a 7.5-ms correlation window, and log F0 for voiced frames and discrete symbols for unvoiced frames were used as the pitch feature. Speakers were modeled by GMMs or multi-stream MSD-GMMs with diagonal covariances.
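For orientation, a rough stand-in for this front end is sketched below. It substitutes librosa's pYIN for RAPT [10] and MFCCs for the mel-cepstral analysis of [9], so it will not reproduce the paper's features exactly; the file name and pitch search range are placeholders, and the window settings for pYIN are librosa defaults rather than the paper's 7.5-ms correlation window.

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=10000)      # down-sample to 10 kHz
hop = int(0.010 * sr)                                # 10-ms frame rate

# 13 cepstral coefficients per 25.6-ms (256-sample) window; drop c0 to
# keep the 12 static coefficients, as in the text above.
cep = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                           n_fft=256, hop_length=hop)[1:].T

# F0 track; pYIN returns NaN for unvoiced frames, mapped to 0 here so the
# earlier make_two_stream_obs sketch routes them to the 0-dimensional space.
f0, voiced, _ = librosa.pyin(y, fmin=50, fmax=400, sr=sr, hop_length=hop)
f0 = np.where(voiced, f0, 0.0)

T = min(len(cep), len(f0))
obs = make_two_stream_obs(cep[:T], f0[:T])           # from the earlier sketch
```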
[Fig. 4. Comparison of MSD-GMM speaker models with conventional GMM and two-separate-GMM speaker models: speaker identification error rates (%) with 95% confidence intervals on the ATR and NTT databases for 32- and 64-component models.]

4.2. Experimental Results

The MSD-GMM speaker identification system was compared with three kinds of conventional systems. Figure 4 shows speaker identification error rates with 95% confidence intervals (CIs) for 32- and 64-component speaker models. The left and right halves of the figure correspond to the results for the ATR and NTT databases, respectively. In the figure, "GMM" denotes a conventional GMM speaker model using the spectral feature alone. "S+P-GMMs" denotes a speaker model consisting of two GMMs for spectra and pitch. "V+UV-GMMs" denotes a speaker model consisting of two GMMs for the voiced (V) and unvoiced (UV) parts [7], with the optimum numbers of mixture components for the V-GMM and UV-GMM, i.e., 24 (V) + 8 (UV) or 48 (V) + 16 (UV), and a linear combination parameter α = 0.5 (α is the weight for the likelihood of the UV-GMM). "MSD-GMM" denotes the proposed model based on the multi-stream MSD-GMM.

As shown in the figure, the additional use of pitch information significantly improved system performance, and the three systems using both spectral and pitch features performed much better than the conventional GMM system using the spectral feature alone. Among the three, the MSD-GMM system gave the best results, achieving 16% and 18% error reductions (on the ATR database) and 38% and 51% error reductions (on the NTT database) over the GMM system when using 32 and 64 mixture components, respectively. It is also worth noting that the MSD-GMM system requires no combination parameter (such as α) that has to be chosen or tuned heuristically.

5. CONCLUSION

This paper has introduced a new technique for modeling speakers based on the MSD-GMM for text-independent speaker identification. The MSD-GMM can model continuous pitch values of voiced frames and discrete symbols representing "unvoiced" in a unified framework, and spectral and pitch features can be jointly modeled by a multi-stream MSD-GMM. We derived ML estimation formulae for the MSD-GMM parameters and evaluated the MSD-GMM speaker models on text-independent speaker identification tasks. The experimental results demonstrated the high utility of the MSD-GMM speaker model and its robustness against inter-session variability.

The introduction of stream weights into the multi-stream MSD-GMM and the application of this framework to speaker verification systems are subjects for future work.

6. REFERENCES

[1] D.A. Reynolds and R.C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech and Audio Process., 3 (1), 72-83, Jan. 1995.
[2] B.S. Atal, "Automatic speaker recognition based on pitch contours," J. Acoust. Soc. Amer., 52 (6), 1687-1697, 1972.
[3] S. Furui, "Research on individuality features in speech waves and automatic speaker recognition techniques," Speech Communication, 5 (2), 183-197, 1986.
[4] M.J. Carey, E.S. Parris, H. Lloyd-Thomas, and S. Bennett, "Robust prosodic features for speaker identification," Proc. ICSLP'96, vol. 3, pp. 1800-1804, Oct. 1996.
[5] M.K. Sönmez, L. Heck, M. Weintraub, and E. Shriberg, "A lognormal tied mixture model of pitch for prosody-based speaker recognition," Proc. EUROSPEECH'97, vol. 3, pp. 1391-1394, Sept. 1997.
[6] T. Matsui and S. Furui, "Text-independent speaker recognition using vocal tract and pitch information," Proc. ICSLP'90, vol. 1, pp. 137-140, Nov. 1990.
[7] K.P. Markov and S. Nakagawa, "Integrating pitch and LPC-residual information with LPC-cepstrum for text-independent speaker recognition," J. Acoust. Soc. Jpn. (E), 20 (4), 281-291, July 1999.
[8] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, "Hidden Markov models based on multi-space probability distribution for pitch pattern modeling," Proc. ICASSP'99, vol. 1, pp. 229-232, May 1999.
[9] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, "An adaptive algorithm for mel-cepstral analysis of speech," Proc. ICASSP'92, vol. 1, pp. 137-140, Mar. 1992.
[10] D. Talkin, "A robust algorithm for pitch tracking," in Speech Coding and Synthesis, W.B. Kleijn and K.K. Paliwal, eds., Elsevier Science, pp. 495-518, 1995.

More Related Content

What's hot

Computational Tools and Techniques for Numerical Macro-Financial Modeling
Computational Tools and Techniques for Numerical Macro-Financial ModelingComputational Tools and Techniques for Numerical Macro-Financial Modeling
Computational Tools and Techniques for Numerical Macro-Financial Modeling
Victor Zhorin
 
Pricing Exotics using Change of Numeraire
Pricing Exotics using Change of NumerairePricing Exotics using Change of Numeraire
Pricing Exotics using Change of Numeraire
Swati Mital
 
ABC and empirical likelihood
ABC and empirical likelihoodABC and empirical likelihood
ABC and empirical likelihood
Christian Robert
 
Uncertain Volatility Models
Uncertain Volatility ModelsUncertain Volatility Models
Uncertain Volatility Models
Swati Mital
 
P1
P1P1
On the stability and accuracy of finite difference method for options pricing
On the stability and accuracy of finite difference method for options pricingOn the stability and accuracy of finite difference method for options pricing
On the stability and accuracy of finite difference method for options pricing
Alexander Decker
 
ABC and empirical likelihood
ABC and empirical likelihoodABC and empirical likelihood
ABC and empirical likelihood
Christian Robert
 
Sneutrino Cold Dark Matter, Oxford
Sneutrino Cold Dark Matter, OxfordSneutrino Cold Dark Matter, Oxford
Sneutrino Cold Dark Matter, Oxford
Spinor
 
Mm2521542158
Mm2521542158Mm2521542158
Mm2521542158
IJERA Editor
 
11.the comparative study of finite difference method and monte carlo method f...
11.the comparative study of finite difference method and monte carlo method f...11.the comparative study of finite difference method and monte carlo method f...
11.the comparative study of finite difference method and monte carlo method f...
Alexander Decker
 
From L to N: Nonlinear Predictors in Generalized Models
From L to N: Nonlinear Predictors in Generalized ModelsFrom L to N: Nonlinear Predictors in Generalized Models
From L to N: Nonlinear Predictors in Generalized Models
htstatistics
 
20120140506008 2
20120140506008 220120140506008 2
20120140506008 2
IAEME Publication
 
A Physical Approach to Moving Cast Shadow Detection (ICASSP 2009)
A Physical Approach to Moving Cast Shadow Detection (ICASSP 2009)A Physical Approach to Moving Cast Shadow Detection (ICASSP 2009)
A Physical Approach to Moving Cast Shadow Detection (ICASSP 2009)
Jia-Bin Huang
 
"Deep Learning" Chap.6 Convolutional Neural Net
"Deep Learning" Chap.6 Convolutional Neural Net"Deep Learning" Chap.6 Convolutional Neural Net
"Deep Learning" Chap.6 Convolutional Neural Net
Ken'ichi Matsui
 
Multiplicative Interaction Models in R
Multiplicative Interaction Models in RMultiplicative Interaction Models in R
Multiplicative Interaction Models in R
htstatistics
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video search
zukun
 
gnm: a Package for Generalized Nonlinear Models
gnm: a Package for Generalized Nonlinear Modelsgnm: a Package for Generalized Nonlinear Models
gnm: a Package for Generalized Nonlinear Models
htstatistics
 
Learning Moving Cast Shadows for Foreground Detection (VS 2008)
Learning Moving Cast Shadows for Foreground Detection (VS 2008)Learning Moving Cast Shadows for Foreground Detection (VS 2008)
Learning Moving Cast Shadows for Foreground Detection (VS 2008)
Jia-Bin Huang
 

What's hot (18)

Computational Tools and Techniques for Numerical Macro-Financial Modeling
Computational Tools and Techniques for Numerical Macro-Financial ModelingComputational Tools and Techniques for Numerical Macro-Financial Modeling
Computational Tools and Techniques for Numerical Macro-Financial Modeling
 
Pricing Exotics using Change of Numeraire
Pricing Exotics using Change of NumerairePricing Exotics using Change of Numeraire
Pricing Exotics using Change of Numeraire
 
ABC and empirical likelihood
ABC and empirical likelihoodABC and empirical likelihood
ABC and empirical likelihood
 
Uncertain Volatility Models
Uncertain Volatility ModelsUncertain Volatility Models
Uncertain Volatility Models
 
P1
P1P1
P1
 
On the stability and accuracy of finite difference method for options pricing
On the stability and accuracy of finite difference method for options pricingOn the stability and accuracy of finite difference method for options pricing
On the stability and accuracy of finite difference method for options pricing
 
ABC and empirical likelihood
ABC and empirical likelihoodABC and empirical likelihood
ABC and empirical likelihood
 
Sneutrino Cold Dark Matter, Oxford
Sneutrino Cold Dark Matter, OxfordSneutrino Cold Dark Matter, Oxford
Sneutrino Cold Dark Matter, Oxford
 
Mm2521542158
Mm2521542158Mm2521542158
Mm2521542158
 
11.the comparative study of finite difference method and monte carlo method f...
11.the comparative study of finite difference method and monte carlo method f...11.the comparative study of finite difference method and monte carlo method f...
11.the comparative study of finite difference method and monte carlo method f...
 
From L to N: Nonlinear Predictors in Generalized Models
From L to N: Nonlinear Predictors in Generalized ModelsFrom L to N: Nonlinear Predictors in Generalized Models
From L to N: Nonlinear Predictors in Generalized Models
 
20120140506008 2
20120140506008 220120140506008 2
20120140506008 2
 
A Physical Approach to Moving Cast Shadow Detection (ICASSP 2009)
A Physical Approach to Moving Cast Shadow Detection (ICASSP 2009)A Physical Approach to Moving Cast Shadow Detection (ICASSP 2009)
A Physical Approach to Moving Cast Shadow Detection (ICASSP 2009)
 
"Deep Learning" Chap.6 Convolutional Neural Net
"Deep Learning" Chap.6 Convolutional Neural Net"Deep Learning" Chap.6 Convolutional Neural Net
"Deep Learning" Chap.6 Convolutional Neural Net
 
Multiplicative Interaction Models in R
Multiplicative Interaction Models in RMultiplicative Interaction Models in R
Multiplicative Interaction Models in R
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video search
 
gnm: a Package for Generalized Nonlinear Models
gnm: a Package for Generalized Nonlinear Modelsgnm: a Package for Generalized Nonlinear Models
gnm: a Package for Generalized Nonlinear Models
 
Learning Moving Cast Shadows for Foreground Detection (VS 2008)
Learning Moving Cast Shadows for Foreground Detection (VS 2008)Learning Moving Cast Shadows for Foreground Detection (VS 2008)
Learning Moving Cast Shadows for Foreground Detection (VS 2008)
 

Viewers also liked

A
AA
So you think you’re (a)sexual
So you think you’re (a)sexualSo you think you’re (a)sexual
So you think you’re (a)sexual
Gaby Rivera
 
MBHI Updates
MBHI UpdatesMBHI Updates
MBHI Updates
Migdalia Gonzalez
 
Creating A Game
Creating A GameCreating A Game
Creating A Game
YNWAMediaProductions
 
Raising the Chorus: CLA Presentation Nov2010
Raising the Chorus: CLA Presentation Nov2010Raising the Chorus: CLA Presentation Nov2010
Raising the Chorus: CLA Presentation Nov2010
alsonnie
 
Racconto concorso maggio
Racconto concorso maggioRacconto concorso maggio
Racconto concorso maggio
Antonio de Trizio
 

Viewers also liked (10)

Guida publisher
Guida publisherGuida publisher
Guida publisher
 
A
AA
A
 
So you think you’re (a)sexual
So you think you’re (a)sexualSo you think you’re (a)sexual
So you think you’re (a)sexual
 
MBHI Updates
MBHI UpdatesMBHI Updates
MBHI Updates
 
Director Study - Jonathan Glazer
Director Study - Jonathan GlazerDirector Study - Jonathan Glazer
Director Study - Jonathan Glazer
 
Creating A Game
Creating A GameCreating A Game
Creating A Game
 
Raising the Chorus: CLA Presentation Nov2010
Raising the Chorus: CLA Presentation Nov2010Raising the Chorus: CLA Presentation Nov2010
Raising the Chorus: CLA Presentation Nov2010
 
Racconto concorso maggio
Racconto concorso maggioRacconto concorso maggio
Racconto concorso maggio
 
A
AA
A
 
A
AA
A
 

Similar to 10.1.1.11.1180

A New Approach for Speech Enhancement Based On Eigenvalue Spectral Subtraction
A New Approach for Speech Enhancement Based On Eigenvalue Spectral SubtractionA New Approach for Speech Enhancement Based On Eigenvalue Spectral Subtraction
A New Approach for Speech Enhancement Based On Eigenvalue Spectral Subtraction
CSCJournals
 
L046056365
L046056365L046056365
L046056365
IJERA Editor
 
Higher dimension raster 5d
Higher dimension raster 5dHigher dimension raster 5d
Higher dimension raster 5d
Pedram Mazloom
 
Performance Analysis of Adaptive DOA Estimation Algorithms For Mobile Applica...
Performance Analysis of Adaptive DOA Estimation Algorithms For Mobile Applica...Performance Analysis of Adaptive DOA Estimation Algorithms For Mobile Applica...
Performance Analysis of Adaptive DOA Estimation Algorithms For Mobile Applica...
IJERA Editor
 
9517cnc05
9517cnc059517cnc05
9517cnc05
IJCNCJournal
 
MULTI RESOLUTION LATTICE DISCRETE FOURIER TRANSFORM (MRL-DFT)
MULTI RESOLUTION LATTICE DISCRETE FOURIER TRANSFORM (MRL-DFT)MULTI RESOLUTION LATTICE DISCRETE FOURIER TRANSFORM (MRL-DFT)
MULTI RESOLUTION LATTICE DISCRETE FOURIER TRANSFORM (MRL-DFT)
csandit
 
Терагерцовая камера MIT сканирует закрытые книги
Терагерцовая камера MIT сканирует закрытые книгиТерагерцовая камера MIT сканирует закрытые книги
Терагерцовая камера MIT сканирует закрытые книги
Anatol Alizar
 
Comparison and analysis of combining techniques for spatial multiplexing spac...
Comparison and analysis of combining techniques for spatial multiplexing spac...Comparison and analysis of combining techniques for spatial multiplexing spac...
Comparison and analysis of combining techniques for spatial multiplexing spac...
IAEME Publication
 
Comparison and analysis of combining techniques for spatial multiplexingspace...
Comparison and analysis of combining techniques for spatial multiplexingspace...Comparison and analysis of combining techniques for spatial multiplexingspace...
Comparison and analysis of combining techniques for spatial multiplexingspace...
IAEME Publication
 
Pitch features as well as spectral features carry much speaker-specific information [2, 3]. However, most speaker recognition studies in recent years have focused on spectral features alone. The main reasons for this are that i) pitch features alone do not give sufficient recognition performance, and ii) pitch values are not defined in unvoiced segments, which complicates speaker modeling and feature integration.

Several works have reported that speaker recognition accuracy can be improved by using pitch features in addition to spectral features [4, 5, 6, 7]. There are essentially two approaches to integrating spectral and pitch information: i) two separate models are trained for spectral and pitch features and their scores are combined [4, 5]; ii) two separate models are trained for voiced and unvoiced parts and their scores are combined [6, 7]. In [7], two separate GMMs are used, where the input observations are concatenations of cepstral coefficients and $\log F_0$ for voiced frames, and cepstral coefficients alone for unvoiced frames. Since the probability distribution of a conventional GMM is defined on a single vector space, these two kinds of vectors require separate models.

(A part of this work was supported by Research Fellowships from the Japan Society for the Promotion of Science for Young Scientists, No. 199808177.)

A given observation $\boldsymbol{o}_t$ at time $t$ is assumed to consist of $S$ information sources (streams). The $s$-th stream $\boldsymbol{o}_{ts}$ has a set of space indices $X_{ts}$ and a random vector $\boldsymbol{x}_{ts}$ of variable dimensionality, that is,

$\boldsymbol{o}_t = (\boldsymbol{o}_{t1}, \boldsymbol{o}_{t2}, \ldots, \boldsymbol{o}_{tS}),$  (1)
$\boldsymbol{o}_{ts} = (X_{ts}, \boldsymbol{x}_{ts}).$  (2)

Note here that $X_{ts}$ is a subset of all possible space indices $\{1, 2, \ldots, G_s\}$, and all the spaces represented by the indices in $X_{ts}$ have the same dimensionality as $\boldsymbol{x}_{ts}$.

We define the output probability distribution of an $S$-stream MSD-GMM $\lambda$ for $\boldsymbol{o}_t$ as

$b(\boldsymbol{o}_t \mid \lambda) = \sum_{m=1}^{M} c_m \prod_{s=1}^{S} p_{ms}(\boldsymbol{o}_{ts}),$  (3)

where $c_m$ is the mixture weight for the $m$-th mixture component. The observation probability of $\boldsymbol{o}_{ts}$ for mixture $m$ is given by the multi-space probability distribution (MSD) [8]:

$p_{ms}(\boldsymbol{o}_{ts}) = \sum_{g \in X_{ts}} w_{msg}\, \mathcal{N}^{D_{sg}}_{msg}(\boldsymbol{x}_{ts}),$  (4)

where $w_{msg}$ is the weight for the $g$-th vector space of the $s$-th stream, and $\mathcal{N}^{D_{sg}}_{msg}(\cdot)$ is the $D_{sg}$-variate Gaussian density with mean vector $\boldsymbol{\mu}_{msg}$ and covariance matrix $\boldsymbol{\Sigma}_{msg}$ (for the case $D_{sg} > 0$). For simplicity of notation, we define $\mathcal{N}^{0}_{msg}(\cdot) \equiv 1$ (for the case $D_{sg} = 0$). Note that the MSD reduces to a continuous probability distribution when $D_{sg} \equiv n > 0$ and to a discrete distribution when $D_{sg} \equiv 0$. The MSD-GMM can thus be regarded as a generalized GMM that includes the traditional GMM as a special case when $S = 1$, $G_1 = 1$, and $D_{11} > 0$.
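To make Eqs. (3) and (4) concrete, the following Python sketch evaluates the frame-level output probability of an MSD-GMM. It is a minimal illustration under assumed data structures (parameters stored in nested containers keyed by mixture $m$, stream $s$, and space index $g$), not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

# A minimal sketch of Eqs. (3)-(4), not the authors' implementation. The
# parameter layout (lists over mixtures m and streams s, dicts keyed by the
# space index g) and all names are assumptions made for illustration.

def gauss(x, mean, cov):
    # D_sg-variate Gaussian density; defined as 1 when D_sg = 0 (mean is None).
    return 1.0 if mean is None else multivariate_normal.pdf(x, mean=mean, cov=cov)

def msd_frame_prob(frame, c, w, mu, cov):
    """Output probability b(o_t | lambda) of an S-stream MSD-GMM, Eq. (3).

    frame: list of S streams, each a pair (X_ts, x_ts), where X_ts is a set
           of space indices and x_ts is the observed vector (None if the
           selected spaces are zero-dimensional).
    c:     length-M sequence of mixture weights c_m.
    w[m][s][g], mu[m][s][g], cov[m][s][g]: space weights and Gaussian
           parameters (mu and cov are None for zero-dimensional spaces).
    """
    total = 0.0
    for m in range(len(c)):
        prod = c[m]
        for s, (X, x) in enumerate(frame):
            # Eq. (4): p_ms(o_ts) sums the weighted densities over g in X_ts.
            prod *= sum(w[m][s][g] * gauss(x, mu[m][s][g], cov[m][s][g])
                        for g in X)
        total += prod
    return total
```

The sequence likelihood of Eq. (5) below is then simply the product of msd_frame_prob over all frames, accumulated in the log domain in practice.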
For an observation sequence $\boldsymbol{O} = (\boldsymbol{o}_1, \boldsymbol{o}_2, \ldots, \boldsymbol{o}_T)$, the likelihood of the MSD-GMM $\lambda$ is given by

$P(\boldsymbol{O} \mid \lambda) = \prod_{t=1}^{T} b(\boldsymbol{o}_t \mid \lambda).$  (5)

Figure 1 illustrates an example of the $m$-th mixture component of a three-stream MSD-GMM ($S = 3$). The sample space of the first stream consists of four spaces ($G_1 = 4$), among which the second and third spaces are triggered by the space indices, so that $p_{m1}(\boldsymbol{o}_{t1})$ becomes the sum of the two weighted Gaussians. The second stream has only one space ($G_2 = 1$) and always outputs its Gaussian as $p_{m2}(\boldsymbol{o}_{t2})$. The third stream consists of two spaces ($G_3 = 2$); here a zero-dimensional space is selected, and its space weight $w_{m32}$ (a discrete probability) is output as $p_{m3}(\boldsymbol{o}_{t3})$.

[Fig. 1. An example of the m-th mixture component of a three-stream MSD-GMM ($S = 3$; $G_1 = 4$, $G_2 = 1$, $G_3 = 2$).]

2.2. Speaker Modeling Based on MSD-GMM

Figure 2 shows an example of the spectral and pitch sequences of the Japanese word "/ashi/" spoken by a male speaker. Generally, spectral features are represented by multi-dimensional vectors of cepstral coefficients with continuous values. Pitch features, on the other hand, are represented by one-dimensional continuous values of the log fundamental frequency ($\log F_0$) in voiced frames and by discrete symbols representing "unvoiced" in unvoiced frames, because pitch values are defined only in voiced segments.

[Fig. 2. An example of spectral and pitch sequences of a word "/ashi/" spoken by a male speaker (spectrogram in kHz and pitch contour in Hz, with voiced/unvoiced segmentation).]

As shown in Fig. 3, each speaker can be modeled by a two-stream MSD-GMM ($S = 2$): the first stream models the spectral feature and the second the pitch feature. The spectral stream has a single $D$-dimensional space ($G_1 = 1$), and the pitch stream has two spaces ($G_2 = 2$), a one-dimensional space for voiced parts and a zero-dimensional space for unvoiced parts.

[Fig. 3. The m-th mixture component in a two-stream MSD-GMM based on spectra and pitch.]
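As a purely illustrative sketch of this two-stream observation format, the following snippet assembles per-frame observations from cepstral vectors and an F0 track. The inputs cepstra and f0, and the zero-for-unvoiced convention, are assumptions about the front end, not the paper's notation.

```python
import numpy as np

# Hypothetical per-frame observation construction for the two-stream speaker
# model of Fig. 3 (S = 2). Stream 1 is the D-dimensional spectral space
# (G1 = 1); stream 2 has a one-dimensional voiced space (g = 1) and a
# zero-dimensional unvoiced space (g = 2), matching G2 = 2.

def make_two_stream_obs(cepstra, f0):
    """cepstra: (T, D) array of cepstral vectors; f0: (T,) array of F0 in Hz,
    with f0 = 0 marking unvoiced frames (an assumed convention)."""
    obs_seq = []
    for cep, f in zip(cepstra, f0):
        spectral = ({1}, cep)                     # X_t1 = {1}: always defined
        if f > 0:                                 # voiced frame: 1-D log F0
            pitch = ({1}, np.array([np.log(f)]))  # X_t2 = {1}: voiced space
        else:                                     # unvoiced frame: 0-D symbol
            pitch = ({2}, None)                   # X_t2 = {2}: unvoiced space
        obs_seq.append([spectral, pitch])
    return obs_seq
```

Each frame produced this way can be scored directly with a routine like msd_frame_prob above.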
3. ML ESTIMATION FOR MSD-GMM

In a similar way to the ML estimation procedure in [8], $P(\boldsymbol{O} \mid \lambda)$ is increased by iteratively maximizing an auxiliary function $Q(\lambda', \lambda)$ over $\lambda$ to improve the current parameters $\lambda'$, based on the expectation-maximization (EM) algorithm.

3.1. Definition of the Q-Function

The log-likelihood of $\lambda$ for an observation sequence $\boldsymbol{O}$, a sequence of mixture components $\boldsymbol{i}$, and a sequence of space indices $\boldsymbol{l}$ can be written as

$\log P(\boldsymbol{O}, \boldsymbol{i}, \boldsymbol{l} \mid \lambda) = \sum_{t=1}^{T} \log c_{i_t} + \sum_{t=1}^{T} \sum_{s=1}^{S} \log w_{i_t s\, l_{ts}} + \sum_{t=1}^{T} \sum_{s=1}^{S} \log \mathcal{N}^{D_{s\, l_{ts}}}_{i_t s\, l_{ts}}(\boldsymbol{x}_{ts}),$  (6)

where

$\boldsymbol{i} = (i_1, i_2, \ldots, i_T),$  (7)
$\boldsymbol{l} = (\boldsymbol{l}_1, \boldsymbol{l}_2, \ldots, \boldsymbol{l}_T),$  (8)
$\boldsymbol{l}_t = (l_{t1}, l_{t2}, \ldots, l_{tS}).$  (9)

Hence the Q-function is defined as

$Q(\lambda', \lambda) = \sum_{\text{all } \boldsymbol{i}, \boldsymbol{l}} P(\boldsymbol{i}, \boldsymbol{l} \mid \boldsymbol{O}, \lambda') \log P(\boldsymbol{O}, \boldsymbol{i}, \boldsymbol{l} \mid \lambda)$
$= \sum_{m=1}^{M} \sum_{t=1}^{T} P(i_t = m \mid \boldsymbol{O}, \lambda') \log c_m$
$\;+ \sum_{m=1}^{M} \sum_{s=1}^{S} \sum_{g=1}^{G_s} \sum_{t \in T(\boldsymbol{O}, s, g)} P(i_t = m, l_{ts} = g \mid \boldsymbol{O}, \lambda') \log w_{msg}$
$\;+ \sum_{m=1}^{M} \sum_{s=1}^{S} \sum_{g=1}^{G_s} \sum_{t \in T(\boldsymbol{O}, s, g)} P(i_t = m, l_{ts} = g \mid \boldsymbol{O}, \lambda') \log \mathcal{N}^{D_{sg}}_{msg}(\boldsymbol{x}_{ts}),$  (10)

where

$T(\boldsymbol{O}, s, g) = \{\, t \mid g \in X_{ts} \,\}.$  (11)

3.2. Maximization of the Q-Function

The first two terms of (10) have the form $\sum_{i=1}^{N} u_i \log y_i$, which attains its global maximum at the single point

$y_i = \dfrac{u_i}{\sum_{j=1}^{N} u_j}, \quad i = 1, 2, \ldots, N,$  (12)

under the constraints $\sum_{i=1}^{N} y_i = 1$ and $y_i \geq 0$. Maximizing the first term of (10) leads to the re-estimate of $c_m$:

$\hat{c}_m = \dfrac{\sum_{t=1}^{T} P(i_t = m \mid \boldsymbol{O}, \lambda')}{\sum_{m=1}^{M} \sum_{t=1}^{T} P(i_t = m \mid \boldsymbol{O}, \lambda')} = \dfrac{1}{T} \sum_{t=1}^{T} \gamma_t(m),$  (13)

where $\gamma_t(m)$ is the posterior probability of occupying the $m$-th mixture component at time $t$:

$\gamma_t(m) = P(i_t = m \mid \boldsymbol{O}, \lambda') = \dfrac{c_m \prod_{s=1}^{S} p_{ms}(\boldsymbol{o}_{ts})}{b(\boldsymbol{o}_t \mid \lambda')}.$  (14)

Similarly, the second term is maximized by

$\hat{w}_{msg} = \dfrac{\sum_{t \in T(\boldsymbol{O}, s, g)} \xi_{ts}(m, g)}{\sum_{l=1}^{G_s} \sum_{t \in T(\boldsymbol{O}, s, l)} \xi_{ts}(m, l)},$  (15)

where $\xi_{ts}(m, g)$ is the posterior probability of being in the $g$-th space of stream $s$ in the $m$-th mixture component at time $t$:

$\xi_{ts}(m, g) = P(i_t = m, l_{ts} = g \mid \boldsymbol{O}, \lambda') = P(i_t = m \mid \boldsymbol{O}, \lambda')\, P(l_{ts} = g \mid i_t = m, \boldsymbol{O}, \lambda') = \gamma_t(m)\, \dfrac{w_{msg}\, \mathcal{N}^{D_{sg}}_{msg}(\boldsymbol{x}_{ts})}{p_{ms}(\boldsymbol{o}_{ts})}.$  (16)

The third term is maximized by solving the following equations:

$\dfrac{\partial}{\partial \boldsymbol{\mu}_{msg}} \sum_{t \in T(\boldsymbol{O}, s, g)} P(i_t = m, l_{ts} = g \mid \boldsymbol{O}, \lambda') \log \mathcal{N}^{D_{sg}}_{msg}(\boldsymbol{x}_{ts}) = 0,$  (17)

$\dfrac{\partial}{\partial \boldsymbol{\Sigma}^{-1}_{msg}} \sum_{t \in T(\boldsymbol{O}, s, g)} P(i_t = m, l_{ts} = g \mid \boldsymbol{O}, \lambda') \log \mathcal{N}^{D_{sg}}_{msg}(\boldsymbol{x}_{ts}) = 0,$  (18)

resulting in

$\hat{\boldsymbol{\mu}}_{msg} = \dfrac{\sum_{t \in T(\boldsymbol{O}, s, g)} \xi_{ts}(m, g)\, \boldsymbol{x}_{ts}}{\sum_{t \in T(\boldsymbol{O}, s, g)} \xi_{ts}(m, g)},$  (19)

$\hat{\boldsymbol{\Sigma}}_{msg} = \dfrac{\sum_{t \in T(\boldsymbol{O}, s, g)} \xi_{ts}(m, g)\, (\boldsymbol{x}_{ts} - \hat{\boldsymbol{\mu}}_{msg})(\boldsymbol{x}_{ts} - \hat{\boldsymbol{\mu}}_{msg})^{\top}}{\sum_{t \in T(\boldsymbol{O}, s, g)} \xi_{ts}(m, g)}.$  (20)

The re-estimation is repeated iteratively, with $\lambda$ taking the place of $\lambda'$, and the final result is an ML estimate of the MSD-GMM parameters.
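A compact Python sketch of one such EM pass is given below, using the same assumed parameter layout as the earlier snippets. It is a didactic reading of Eqs. (13)-(20), not the authors' code, and omits the numerical safeguards (log-domain accumulation, variance flooring) a real implementation would need.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gauss(x, mean, cov):
    # Convention of Eq. (4): the "density" of a zero-dimensional space is 1.
    return 1.0 if mean is None else multivariate_normal.pdf(x, mean=mean, cov=cov)

def em_step(obs_seq, c, w, mu, cov):
    """obs_seq[t][s] = (X_ts, x_ts); c[m]; w[m][s][g]; mu[m][s][g]; cov[m][s][g].
    Returns updated parameters (flat dicts keyed by (m, s, g) for brevity)."""
    M, T = len(c), len(obs_seq)
    gamma_sum = np.zeros(M)               # numerator of Eq. (13)
    xi_sum, xi_x, xi_xx = {}, {}, {}      # accumulators for Eqs. (15), (19), (20)
    for frame in obs_seq:
        # p_ms(o_ts) for every mixture and stream (Eq. 4)
        p = np.array([[sum(w[m][s][g] * gauss(x, mu[m][s][g], cov[m][s][g])
                           for g in X)
                       for s, (X, x) in enumerate(frame)] for m in range(M)])
        joint = np.asarray(c) * p.prod(axis=1)   # c_m * prod_s p_ms, Eq. (3)
        gamma = joint / joint.sum()              # Eq. (14)
        gamma_sum += gamma
        for m in range(M):
            for s, (X, x) in enumerate(frame):
                for g in X:
                    # Eq. (16): posterior of space g in stream s of mixture m
                    xi = (gamma[m] * w[m][s][g]
                          * gauss(x, mu[m][s][g], cov[m][s][g]) / p[m, s])
                    key = (m, s, g)
                    xi_sum[key] = xi_sum.get(key, 0.0) + xi
                    if mu[m][s][g] is not None:  # statistics only when D_sg > 0
                        xi_x[key] = xi_x.get(key, 0.0) + xi * x
                        xi_xx[key] = xi_xx.get(key, 0.0) + xi * np.outer(x, x)
    c_new = gamma_sum / T                        # Eq. (13)
    w_new, mu_new, cov_new = {}, {}, {}
    for (m, s, g), num in xi_sum.items():
        denom = sum(v for (m2, s2, _), v in xi_sum.items()
                    if m2 == m and s2 == s)
        w_new[(m, s, g)] = num / denom           # Eq. (15)
        if (m, s, g) in xi_x:
            mean = xi_x[(m, s, g)] / num         # Eq. (19)
            mu_new[(m, s, g)] = mean
            # Eq. (20), written in the equivalent form E[x x^T] - mean mean^T
            cov_new[(m, s, g)] = xi_xx[(m, s, g)] / num - np.outer(mean, mean)
    return c_new, w_new, mu_new, cov_new
```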
4. EXPERIMENTAL EVALUATION

4.1. Experimental Conditions

First, a speaker identification experiment was carried out on the ATR Japanese speech database. We used word data spoken by 80 speakers (40 males and 40 females). A phonetically balanced set of 216 words was used for training each speaker model, and 520 common words per speaker were used for testing; the number of tests was 41600 in total.

Second, to evaluate the robustness of the MSD-GMM speaker model against inter-session variability, we also conducted a speaker identification experiment on the NTT database. The database consists of sentence data uttered by 35 Japanese speakers (22 males and 13 females) in five sessions over ten months (Aug., Sept., and Dec. 1990; Mar. and June 1991). In each session, 15 sentences were recorded for each speaker: ten sentences are common to all speakers and all sessions (A-set), and five sentences differ for each speaker and each session (B-set). The duration of each sentence is approximately four seconds. We used 15 sentences (the A-set and B-set from the first session) per speaker for training and 20 sentences (the B-sets from the other four sessions) per speaker for testing; the number of tests was 700 in total.

The speech data were down-sampled to 10 kHz, windowed at a 10-ms frame rate with a 25.6-ms Blackman window, and parameterized into 13 mel-cepstral coefficients using a mel-cepstral estimation technique [9]. The 12 static parameters, excluding the zero-th coefficient, were used as the spectral feature. Fundamental frequencies ($F_0$) were estimated at a 10-ms frame rate using the RAPT method [10] with a 7.5-ms correlation window, and $\log F_0$ for voiced frames and discrete symbols for unvoiced frames were used as the pitch feature. Speakers were modeled by GMMs or multi-stream MSD-GMMs with diagonal covariances.
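The identification step itself is the usual maximum-likelihood classification over the enrolled speaker models. A hypothetical sketch follows; log_b is assumed to return $\log b(\boldsymbol{o}_t \mid \lambda)$ from Eq. (3), and it and the speaker_models dict are illustrative names, not part of the paper.

```python
# Hypothetical decision rule for the experiments above: score a test
# utterance against every enrolled speaker model and return the best match.

def identify(obs_seq, speaker_models, log_b):
    # Log-domain form of Eq. (5): the sequence log-likelihood is the sum of
    # per-frame log-probabilities; the argmax over speakers is the decision.
    scores = {name: sum(log_b(o, lam) for o in obs_seq)
              for name, lam in speaker_models.items()}
    return max(scores, key=scores.get)
```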
4.2. Experimental Results

The MSD-GMM speaker identification system was compared with three kinds of conventional systems. Figure 4 shows speaker identification error rates with 95% confidence intervals (CIs) for 32- and 64-component speaker models; the left and right halves of the figure correspond to the results for the ATR and NTT databases, respectively. In the figure, "GMM" denotes a conventional GMM speaker model using the spectral feature alone. "S+P-GMMs" denotes a speaker model consisting of two GMMs, one for spectra and one for pitch. "V+UV-GMMs" denotes a speaker model consisting of two GMMs for the voiced (V) and unvoiced (UV) parts [7], with the optimum numbers of mixture components for the V-GMM and UV-GMM, i.e., 24 (V) + 8 (UV) or 48 (V) + 16 (UV), and a linear combination parameter α = 0.5 (α is the weight for the likelihood of the UV-GMM). "MSD-GMM" denotes the proposed model based on the multi-stream MSD-GMM.

[Fig. 4. Comparison of MSD-GMM speaker models with conventional GMM and two-separate-GMM speaker models: identification error rates (%) with 95% CIs on the ATR and NTT databases versus the number of mixture components (32, 32+32, 24+8; 64, 64+64, 48+16).]

As shown in the figure, the additional use of pitch information significantly improved system performance, and the three systems using both spectral and pitch features performed much better than the conventional GMM system using the spectral feature alone. Among the three, the MSD-GMM system gave the best results, achieving 16% and 18% error reductions on the ATR database, and 38% and 51% error reductions on the NTT database, over the GMM system when using 32 and 64 mixture components, respectively. Note also that the MSD-GMM system requires no combination parameter (such as α) that must be chosen or tuned heuristically.

5. CONCLUSION

This paper has introduced a new technique for modeling speakers based on the MSD-GMM for text-independent speaker identification. The MSD-GMM can model continuous pitch values of voiced frames and discrete symbols representing "unvoiced" in a unified framework, and spectral and pitch features can be jointly modeled by a multi-stream MSD-GMM. We derived the ML estimation formulae for the MSD-GMM parameters and evaluated the MSD-GMM speaker models on text-independent speaker identification tasks. The experimental results demonstrated the high utility of the MSD-GMM speaker model and also showed its robustness against inter-session variability.

The introduction of stream weights into the multi-stream MSD-GMM and the application of this framework to speaker verification systems are subjects for future work.

6. REFERENCES

[1] D.A. Reynolds and R.C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech and Audio Process., 3 (1), 72–83, Jan. 1995.
[2] B.S. Atal, "Automatic speaker recognition based on pitch contours," J. Acoust. Soc. Amer., 52 (6), 1687–1697, 1972.
[3] S. Furui, "Research on individuality features in speech waves and automatic speaker recognition techniques," Speech Communication, 5 (2), 183–197, 1986.
[4] M.J. Carey, E.S. Parris, H. Lloyd-Thomas, and S. Bennett, "Robust prosodic features for speaker identification," Proc. ICSLP'96, vol. 3, pp. 1800–1804, Oct. 1996.
[5] M.K. Sönmez, L. Heck, M. Weintraub, and E. Shriberg, "A lognormal tied mixture model of pitch for prosody-based speaker recognition," Proc. EUROSPEECH'97, vol. 3, pp. 1391–1394, Sept. 1997.
[6] T. Matsui and S. Furui, "Text-independent speaker recognition using vocal tract and pitch information," Proc. ICSLP'90, vol. 1, pp. 137–140, Nov. 1990.
[7] K.P. Markov and S. Nakagawa, "Integrating pitch and LPC-residual information with LPC-cepstrum for text-independent speaker recognition," J. Acoust. Soc. Jpn. (E), 20 (4), 281–291, July 1999.
[8] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, "Hidden Markov models based on multi-space probability distribution for pitch pattern modeling," Proc. ICASSP'99, vol. 1, pp. 229–232, May 1999.
[9] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, "An adaptive algorithm for mel-cepstral analysis of speech," Proc. ICASSP'92, vol. 1, pp. 137–140, Mar. 1992.
[10] D. Talkin, "A robust algorithm for pitch tracking," in Speech Coding and Synthesis, W.B. Kleijn and K.K. Paliwal, eds., Elsevier Science, pp. 495–518, 1995.