10.1.1.11.1180

SPEAKER IDENTIFICATION USING GAUSSIAN MIXTURE MODELS
BASED ON MULTI-SPACE PROBABILITY DISTRIBUTION

Chiyomi Miyajima†, Yosuke Hattori†,∗, Keiichi Tokuda†, Takashi Masuko‡, Takao Kobayashi‡, and Tadashi Kitamura†

†
Nagoya Institute of Technology, Nagoya 466-8555, Japan
‡
Tokyo Institute of Technology, Yokohama 226-8502, Japan
∗
Present address: DENSO Corporation, Kariya 448-8661, Japan
†
{chiyomi, tokuda, kitamura}@ics.nitech.ac.jp, ‡ {masuko, tkobayas}@ip.titech.ac.jp, ∗ hattori@rd.denso.co.jp

ABSTRACT In this paper a new speaker modeling technique using a
GMM based on multi-space probability distribution (MSD) [8] is
This paper presents a new approach to modeling speech spectra introduced. The MSD-GMM allows us to model feature vectors
and pitch for text-independent speaker identification using Gaus- with variable dimensionality including zero-dimensional vectors,
sian mixture models based on multi-space probability distribution i.e., discrete symbols. Consequently, continuous pitch values
(MSD-GMM). The MSD-GMM allows us to model continuous for voiced frames and discrete symbols representing “unvoiced”
pitch values for voiced frames and discrete symbols represent- can be modeled using an MSD-GMM in a unified framework,
ing unvoiced frames in a unified framework. Spectral and pitch and spectral and pitch features are jointly modeled by a multi-
features are jointly modeled by a two-stream MSD-GMM. We stream MSD-GMM, i.e., each speaker is modeled by a single
derive maximum likelihood (ML) estimation formulae for the statistical model. We derive maximum likelihood (ML) esti-
MSD-GMM parameters, and the MSD-GMM speaker models mation formulae and evaluate the MSD-GMM speaker models
are evaluated for text-independent speaker identification tasks. for text-independent speaker identification tasks comparing with
Experimental results show that the MSD-GMM can efficiently conventional GMM speaker models.
model spectral and pitch features of each speaker and outper- The rest of the paper is organized as follows. In Section
forms conventional speaker models. 2, we introduce a speaker modeling technique based on MSD-
GMM. Section 3 presents the ML estimation procedure for MSD-
1. INTRODUCTION GMM parameters. Section 4 reports experimental results, and
Section 5 gives conclusions and future works.
Gaussian mixture models (GMMs) have been successfully ap-
plied to speaker modeling in text-independent speaker identifica- 2. MULTI-STREAM MSD-GMM
tion [1]. Such identification systems mainly use spectral features
represented by cepstral coefficients as speaker features. Pitch fea- 2.1. Likelihood Calculation
Let us assume that a given observation Ót at time t consists
tures as well as spectral features contain much speaker specific
of S information sources (streams). The s-th stream Óts has
information [2, 3]. However, most of speaker recognition studies
in recent years have focused on using only spectral features. The
a set of space indices Xts and a random vector with variable
dimensionality Üts , that is
main reasons for this are i) the use of pitch features alone could
not give enough recognition performance and ii) pitch values are
not defined in unvoiced segments and this complicates speaker
modeling and feature integration.
Ót = (Ót1 , Ót2 , . . . , ÓtS ), (1)
Several works have reported that speaker recognition accu- Óts = (Xts , Üts ) . (2)
racy can be improved by the use of pitch features in addition
Note here that Xts is a subset of all possible space indices
to spectral features [4, 5, 6, 7]. There are essentially two ap-
{1, 2, . . . , Gs }, and all the spaces represented by the indices in
Xts have the same dimensionality as Üts .
proaches to integrating spectral and pitch information: i) two
separate models are used for spectral and pitch features and their
We define the output probability distribution of an S-stream
MSD-GMM λ for Ót as
scores are combined [4, 5], ii) two separate models for voiced
and unvoiced parts are trained and their scores are combined
[6, 7]. In [7], two separate GMMs are used, where the input ob- M S
servations are concatenations of cepstral coefficients and log F0 b(Ót | λ) = cm pms (Óts ), (3)
for voiced frames and cepstral coefficients alone for unvoiced m=1 s=1
frames. Since the probability distribution of the conventional
GMM is defined on a single vector space, these two kinds of where cm is the mixture weight for the m-th mixture component.
vectors require their respective models. The observation probability of Óts for mixture m is given by the
multi-space probability distribution (MSD) [8] :
A part of this work was supported by Research Fellowships from
pms (Óts ) = wmsg Nmsg (Üts ),
sg D
the Japan Society for the Promotion of Science for Young Scientists, (4)
No. 199808177. g∈Xts

Stream 1 (G1 = 4) t
/ashi/

Frequency (Hz) Frequency (kHz)
Observation at time t 0

Spectrum
5
wm11 1
N m11 ( ) 2
o t = (o t1, o t2, o t3 )
5
wm12 3
Nm12 (xt1 ) 4
5
wm13 pm1 (ot1 )
5 UV V UV V UV
5-dimensional Nm13 (xt1 ) 200
wm14 150

Pitch
8
Nm14 ( ) 100
ot1 = ( { 2, 3 } , xt1 )
50
ot2 = ( { 1 } , xt2 ) 7-dimensional Stream 2 (G2 = 1)
wm21 = 1
ot3 = ( { 2 } , xt3 ) 7
Nm21 (xt2 ) Fig. 2. An example of spectral and pitch sequences of a word
“/ashi/” spoken by a male speaker.
Stream 3 (G3 = 2) pm2 (ot2 )
0-dimensional
8 wm31 o t = (o t1, o t2 ) Spectrum (G1 = 1)
Nm31 ( )
pm3 (ot3 ) D-dimensional D wm11 = 1 Nm11 (xt1 )
D
0
Nm32 (xt3 ) w Nm11 (xt1)
m32
ot1 = ( { 1 } , x t1 )
Pitch (G2 = 2)
pm1 (ot1 ) = wm12 Nm12 (xt1 )
5
+ wm13 Nm13 (xt1 )
5 ot2 = ( { 1 } , x t2 ) V 1 wm21
Stream 1: Nm21 (xt2)
Stream 2: pm2 (ot2 ) = Nm21 (xt2 )
7
0 wm21 Nm21 (xt2 )
1
1-dimensional Nm22 ( ) w
or UV m22 or
Stream 3: pm3 (ot3 ) = wm32 0-dimensional wm22

Fig. 3. The m-th mixture component in a two-stream MSD-
Fig. 1. An example of the m-th mixture component of a three-
GMM based on spectra and pitch.
stream MSD-GMM.

where wmsg is the weight for the g-th vector space of the s-th voiced frames and discrete symbols representing “unvoiced” in
Dsg unvoiced frames because pitch values are defined only in voiced
stream and Nmsg ( · ) is the Dsg -variate Gaussian function with
segments.
mean vector msg and covariance matrix Σmsg (for the case
0 As shown in Fig. 3, each speaker can be modeled by a
Dsg > 0). For simplicity of notation, we define Nmsg ( · ) ≡ 1 two-stream MSD-GMM (S = 2); the first stream is for the
(for the case Dsg = 0). Note here that the multi-space probabil- spectral feature and the second stream is for the pitch feature.
ity distribution (MSD) is equivalent to continuous probability dis- The spectral stream has a D-dimensional space (G1 = 1), and
tribution and discrete probability distribution when Dsg ≡ n > 0 the pitch stream has two spaces (G2 = 2) : a one-dimensional
and Dsg ≡ 0, respectively. Also, an MSD-GMM is assumed to space and a zero-dimensional space for voiced and unvoiced
be a generalized GMM, which includes the traditional GMM as parts, respectively.
a special case when S = 1, G1 = 1, and D11 > 0.
For an observation sequence Ç = (Ó1 , Ó2 , . . . , ÓT ), the
likelihood of MSD-GMM λ is given by 3. ML-ESTIMATION FOR MSD-GMM

T In a similar way to the ML-estimation procedure in [8], P (Ç | λ)
P (Ç | λ) = b(Ót | λ). (5) is increased by iterating the maximization of an auxiliary function
t=1 Q(λ , λ) over λ to improve current parameters λ based on the
expectation maximization (EM) algorithm.
Figure 1 illustrates an example of the m-th mixture compo-
nent of a three-stream MSD-GMM (S = 3). The sample space of
the first stream consists of four spaces (G1 = 4), among which 3.1. Definition of Q-Function
the second and third spaces are triggered by the space indices The log-likelihood of λ for an observation sequence Ç , a se-
and pm1 (Ót1 ) becomes the sum of the two weighted Gaussians. quence of mixture components and a sequence of space indices
The second stream has only one space (G2 = 1) and always out- Ð can be written as
puts its Gaussian as pm2 (Ót2 ). The third stream consists of two
spaces (G3 = 2), where a zero-dimensional space is selected, T

and outputs its space weight wm32 (a discrete probability) as log P (Ç, , Ð | λ) = log cit
pm3 (Ót3 ). t=1
T S T S
log Nit slts (Üts ),
D slts
2.2. Speaker Modeling Based on MSD-GMM + log wit slts + (6)
t=1 s=1 t=1 s=1
Figure 2 shows an example of spectral and pitch sequences of
a Japanese word “/ashi/” spoken by a Japanese male speaker. where
Generally, spectral features are represented by multi-dimensional = (i1 , i2 , . . . , iT ), (7)
vectors of cepstral coefficients with continuous values. On the
other hand, pitch features are represented by one-dimensional Ð = (Ð1 , Ð2 , . . . , ÐT ), (8)
continuous values of log fundamental frequencies (log F0 ) in Ðt = (lt1 , lt2 , . . . , ltS ). (9)

Hence the Q-function is defined as Similarly, the second term is maximized as
Q(λ , λ) = P ( , Ð | Ç, λ ) log P (Ç , , Ð | λ) ξts (m, g)
all i,l t∈T (O ,s,g)
T wmsg = Gs
, (15)
= P ( , Ð | Ç, λ ) log cit ξts (m, l)
all i,l t=1 l=1 t∈T (O ,s,l)
T S
+ P ( , Ð | Ç, λ ) log wit slts where ξts (m, g) is the posterior probability of being in the g-th
all i,l t=1 s=1
space of stream s in the m-th mixture component at time t:
T S ξts (m, g) = P (it = m, lts = g | Ç, λ)
P ( , Ð | Ç, λ ) log Nit slts (Üts )
D
slts
+
all i,l t=1 s=1 = P (it = m | Ç, λ)P (lts = g | it = m, Ç, λ)
wmsg Nmsg (Üts )
sg D
= γt (m) .
pms (Óts )
M T (16)
= P (it = m | Ç , λ ) log cm
m=1 t=1 The third term is maximized by solving following equations:
∂
P (it = m, lts = g | Ç, λ )
M S Gs
+ ∂ msg t∈T (O ,s,g)

· log Nmsg (Üts ) = 0,
m=1 s=1 g=1 t∈T (O ,s,g) D
sg
(17)
P (it = m, lts = g | Ç, λ ) log wmsg ∂
P (it = m, lts = g | Ç, λ )
M S Gs ∂Σ−1
msg
t∈T (O ,s,g)
· log Nmsg (Üts ) = 0,
D
+ sg
(18)
m=1 s=1 g=1 t∈T (O ,s,g)
resulting in
P (it = m, lts = g | Ç, λ ) log Nmsg (Üts ),
sg D
(10)
where ξts (m, g)Üts
t∈T (O ,s,g)
T (Ç, s, g) = {t | g ∈ Xts } . (11) msg = Gs
, (19)
ξts (m, l)

È
3.2. Maximization of Q-Function l=1 t∈T (O ,s,l)

The first two terms of (10) have the form N ui log yi , which
i=1
attains a global maximum at the single point ξts (m, g)(Üts − msg )( Üts − msg )
ui t∈T (O ,s,g)
yi = N , for i = 1, 2, . . . , N, (12) Σmsg = .
Gs
uj ξts (m, l)

È
j=1 l=1 t∈T (O ,s,l)
(20)
N
under the constraints yi = 1 and yi ≥ 0. The maximiza-
i=1 The re-estimation is repeated iteratively using λ in place of λ
tion of the first term of (10) leads to the re-estimate of cm : and the final result is an ML estimation of the MSD-GMM.
T
P (it = m | Ç , λ ) 4. EXPERIMENTAL EVALUATION
t=1
cm = M T 4.1. Experimental Conditions
P (it = m | Ç , λ )
m=1 t=1
First, a speaker identification experiment was carried out using
T
the ATR Japanese speech database. We used word data spoken by
1
P (it = m | Ç , λ )
80 speakers (40 males and 40 females). Phonetically-balanced
=
T t=1
216 words are used for training each speaker model, and 520
T
common words per speaker are used for testing. The number of
1 tests was 41600 in total.
= γt (m), (13)
T t=1
Second, to evaluate the robustness of the MSD-GMM speaker
model against inter-session variability, we also conducted a speaker
where γt (m) is the posterior probability of being in the m-th identification experiment using the NTT database. The database
mixture component at time t, that is consists of sentence data uttered by 35 Japanese speakers (22
S males and 13 females) on five sessions over ten months (Aug.,
cm pms (Óts ) Sept., Dec. 1990, Mar., June 1991). In each session, 15 sen-
tences were recorded for each speaker. Ten sentences are com-
γt (m) = P (it = m | Ç , λ) =
s=1
.
b(Ót )
(14) mon to all speakers and all sessions (A-set), and five sentences

10 better performance than the conventional GMM system using
ATR 95% CI NTT a spectral feature alone. Among the three systems, the MSD-
Speaker identification error rate (%)

database database
GMM GMM system gave the best results, and achieved 16 % and 18 %
8 S+P-GMMs error reductions (for the ATR database) and 38 % and 51 % error
V+UV-GMMs reductions (for the NTT database) over the GMM system when
MSD-GMM using 32 and 64 mixture models, respectively. It is also noted
6 that the MSD-GMM system requires no combination parameter
6.2

(such as α) which has to be chosen or tuned heuristically.
5.8
5.4

5.2

5.1

6.4
4
4.5
4.4

5. CONCLUSION
4.2

5.6

4.4

4.3
2 This paper has introduced a new technique for modeling speakers

3.7
3.4

3.4
based on MSD-GMM for text-independent speaker identification.

3.1
The MSD-GMM can model continuous pitch values of voiced
0 frames and discrete symbols representing “unvoiced” in a unified
32
32+32
24+8
32

64
64+64
48+16
64

32
32+32
24+8
32

64
64+64
48+16
64
framework. Spectral and pitch features can be jointly modeled
by a multi-stream MSD-GMM. We derived the ML estimation
Number of mixture components
formulae for the MSD-GMM parameters and evaluated the MSD-
GMM speaker models for text-independent speaker identification
tasks. The experimental results demonstrated the high utility of
Fig. 4. Comparison of MSD-GMM speaker models with con- the MSD-GMM speaker model and also proved its robustness
ventional GMM and two-separate GMM speaker models. against the inter-session variability.
Introduction of stream weights to the multi-stream MSD-
GMM and application of this framework to speaker verification
are different for each speaker and each session (B-set). The dura- systems will be subjects for future works.
tion of each sentence is approximately four second. We used 15
sentences (A-set + B-set from the first session) per speaker for 6. REFERENCES
training, and 20 sentences (B-set from the other four sessions)
per speaker for testing. The number of tests was 700 in total. [1] D.A. Reynolds and R.C. Rose, “Robust text-independent
The speech data were down-sampled to 10 kHz, windowed speaker identification using Gaussian mixture speaker mod-
at a 10-ms frame rate with a 25.6-ms Blackman window, and pa- els,” IEEE Trans. Speech and Audio Process., 3 (1), 72–83,
rameterized into 13 mel-cepstral coefficients using a mel-cepstral Jan. 1995.
estimation technique [9]. The 12 static parameters excluding the [2] B.S. Atal, “Automatic speaker recognition based on pitch
zero-th coefficient were used as a spectral feature. Fundamen- contours,” J. Acoust. Soc. Amer., 52 (6), 1687–1697, 1972.
tal frequencies (F0 ) were estimated at a 10-ms frame rate using [3] S. Furui, “Research on individuality features in speech
the RAPT method [10] with a 7.5-ms correlation window, and waves and automatic speaker recognition techniques,”
log F0 for the voiced frames and discrete symbols for unvoiced Speech Communication, 5 (2), 183–197, 1986.
frames were used as a pitch feature. Speakers were modeled by [4] M.J. Carey, E.S. Parris, H. Lloyd-Thomas, and S. Bennett,
GMMs or multi-stream MSD-GMMs with diagonal covariances. “Robust prosodic features for speaker identification,” Proc.
ICSLP’96, vol.3, pp.1800–1804, Oct. 1996.
4.2. Experimental Results [5] M.K. S¨ nmez, L. Heck, M. Weintraub, and E. Shriberg,
o
“A lognormal tied mixture model of pitch for prosody-
The MSD-GMM speaker identification system was compared based speaker recognition,” Proc. EUROSPEECH’97, vol.3,
with three kinds of conventional systems. Figure 4 shows speaker pp.1391–1394, Sept. 1997.
identification error rates with 95 % confidence intervals (CIs) [6] T. Matsui and S. Furui, “Text-independent speaker recog-
when using 32 and 64 component speaker models. The left and nition using vocal tract and pitch information,” Proc.
right halves of the figure correspond to the results for the ATR ICSLP’90, vol.1, pp.137–140, Nov. 1990.
and NTT databases, respectively. In the figure, “GMM” denotes [7] K.P. Markov and S. Nakagawa, “Integrating pitch and
a conventional GMM speaker model using a spectral feature LPC-residual information with LPC-cepstrum for text-
alone. “S+P-GMMs” represents a speaker model consisting of independent speaker recognition,” J. Acoust. Soc. Jpn., (E),
two GMMs for spectra and pitch. “V+UV-GMMs” is a speaker 20 (4), 281–291, July 1999.
model consisting of two GMMs for voiced (V) and unvoiced [8] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi,
(UV) parts [7] with the optimum numbers of mixture compo- “Hidden Markov models based on multi-space probability
nents for the V-GMM and the UV-GMM, i.e., 24 (V)+8 (UV) or distribution for pitch pattern modeling,” Proc. ICASSP’99,
48 (V)+16 (UV), and a linear combination parameter α = 0.5 vol.1, pp.229–232, May 1999.
(α is the weight for the likelihood of the UV-GMM). “MSD- [9] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, “An adap-
GMM” denotes the proposed model based on the multi-stream tive algorithm for mel-cepstral analysis of speech,” Proc.
MSD-GMM. ICASSP’92, vol.1, pp.137–140, Mar. 1992.
As shown in the figure, the additional use of pitch infor- [10] D. Talkin, “A robust algorithm for pitch tracking,” in
mation significantly improved the system performance, and the Speech Coding and Synthesis, W.B. Kleijn and K.K. Pali-
three systems using both spectral and pitch features gave much wal eds., Elsevier Science, pp.495–518, 1995.

10.1.1.11.1180

Recommended

Recommended

More Related Content

What's hot

What's hot (18)

Viewers also liked

Viewers also liked (10)

Similar to 10.1.1.11.1180

Similar to 10.1.1.11.1180 (20)

10.1.1.11.1180