Available online at www.sciencedirect.com
Speech Communication 51 (2009) 1169–1179
www.elsevier.com/locate/specom

A Multi-Space Distribution (MSD) and two-stream tone modeling approach to Mandarin speech recognition

Yao Qian *, Frank K. Soong
Microsoft Research Asia, Beijing 100190, China

Received 12 May 2008; received in revised form 28 July 2009; accepted 5 August 2009

Abstract

Tone plays an important role in recognizing spoken tonal languages like Chinese. However, the discontinuity of F0 at transitions between voiced and unvoiced regions has traditionally been a hurdle in creating a succinct statistical tone model for automatic speech recognition and synthesis. Various heuristic approaches have been proposed to get around the problem, but with limited success. The Multi-Space Distribution (MSD) proposed by Tokuda et al., which models the two probability spaces, discrete for the unvoiced region and continuous for the voiced F0 contour, as a linearly weighted mixture, has been successfully applied to Hidden Markov Model (HMM)-based text-to-speech synthesis. We extend MSD to Mandarin Chinese tone modeling for speech recognition. The tone features and spectral features are further separated into two streams and corresponding stream-dependent models are trained. Finally, two separate decision trees are constructed by clustering the corresponding stream-dependent HMMs. The MSD and two-stream modeling approach is evaluated on large-vocabulary, continuously read and spontaneous Mandarin speech databases, and its robustness is further investigated on a noisy, continuous Mandarin digit database with eight types of noise at five different SNRs. Experimental results show that our MSD and two-stream tone modeling approach can significantly improve recognition performance over a toneless baseline system.
The relative tonal syllable error rate (TSER) reductions are 21.0%, 8.4% and 17.4% for the large-vocabulary read, spontaneous and noisy digit speech recognition tasks, respectively. Compared with a conventional system where F0 contours are interpolated in unvoiced segments, our approach improves recognition performance by 9.8%, 7.4% and 13.3% in relative TSER reduction on the corresponding tasks, respectively.
© 2009 Elsevier B.V. All rights reserved.

Keywords: Tone model; Mandarin speech recognition; Multi-Space Distribution (MSD); Noisy digit recognition; LVCSR

* Corresponding author. E-mail addresses: yaoqian@microsoft.com (Y. Qian), frankkps@microsoft.com (F.K. Soong).
doi:10.1016/j.specom.2009.08.001

1. Introduction

Mandarin, like other Chinese dialects, is a monosyllabically paced tonal language. Each Chinese character, the basic morphemic unit in written Chinese, is pronounced as a tonal syllable, i.e., a base syllable plus a lexical tone. All Mandarin syllables have the structural form (consonant)–vowel–(consonant), where only the vowel nucleus is obligatory. If we consider only the phonemic composition of a syllable without tone, the syllable is referred to as a base syllable. Following the convention of Chinese phonology, each base syllable is divided into two parts, namely Initial and Final. The Initial (onset) includes what precedes the vowels, while the Final includes the vowel (nucleus) and what follows it (coda). Most Initials are unvoiced, and thus the tones are carried primarily by the Finals. Proper tonal syllable recognition is critical to distinguishing homonyms of the same base syllable in applications where strong contextual information is generally unavailable, e.g., recognizing the name of a person or a place. A recognizer with high tonal syllable recognition accuracy has many useful applications, e.g., objective evaluation of a speaker's tonal language proficiency. It should be obvious that tone plays an important role in perceiving a Chinese tonal syllable. However, to construct a succinct tone model, which is critical for automatic tonal
syllable recognition, is not trivial. The discontinuity in the F0 contour between voiced and unvoiced regions has made the modeling difficult. Heuristic approaches that interpolate F0 in unvoiced segments to get around the discontinuity problem have been proposed (Hirst and Espesser, 1993; Chen et al., 1997; Chang et al., 2000; Freij and Fallside, 1988; Wang et al., 1997; Tian et al., 2004; Lei et al., 2006). The interpolated F0 can be generated from a quadratic spline function (Hirst and Espesser, 1993), an exponential decay function towards the running F0 average (Chen et al., 1997), or a probability density function (pdf) with a large variance (Chang et al., 2000; Freij and Fallside, 1988). These approaches are instrumentally effective since F0 information can be augmented as extra information alongside short-time spectral features frame-synchronously. As a result, the concatenated spectral and pitch features are used frame-synchronously in one-pass Viterbi decoding. However, the artificially interpolated F0 values do not reflect the actual tone, and the critical voicing/unvoicing information, which is in principle useful for recognizing phonetic units, is lost. Furthermore, in terms of the corresponding time window size, the spectral (segmental) feature is distinctive within a phonetic or phone segment, while the pitch (supra-segmental) feature is embedded in a longer time window of a word, a phrase or a sentence. By using a two-stream approach we can model spectral and pitch features more appropriately than with a single stream (Ho et al., 1999; Seide and Wang, 2000). There are many other approaches that model tone and spectral information separately (Qian et al., 2006; Lin et al., 1996; Peng and Wang, 2005; Zhang et al., 2005).
The tone features are usually derived from syllables with force-aligned boundaries, and tone models are incorporated in a post-processing stage after the first decoding pass. A longer time window can then be used explicitly to take neighboring tone information into consideration (Qian et al., 2007). To integrate the tone information into the search process, rescoring of lattices or N-best lists output from recognition is usually adopted. In this paper, we adopt Multi-Space Distribution (MSD) based tone modeling for Mandarin speech recognition. The MSD was originally proposed by Tokuda et al. to model the discontinuous pitch contours in a statistical manner and was successfully applied to HMM-based speech synthesis (Tokuda et al., 2002). We extended the MSD model to speaker-independent Mandarin ("Putonghua") tone recognition (Wang et al., 2006; Qiang et al., 2007). The tone features and spectral features are further separated into two streams, and stream-dependent models are built (clustered) in two separate decision trees. The MSD is seamlessly integrated into the HMM modeling process, which is the predominant technique for acoustic modeling in ASR training. The resultant model, the so-called MSD-HMM, applies naturally to one-pass Viterbi decoding in continuous speech recognition. We test the effectiveness of the MSD approach on a large-vocabulary, continuously read and spontaneous speech database and further evaluate its robustness on a noisy, continuous Mandarin digit database. The rest of the paper is organized as follows. In Section 2, the MSD approach to Mandarin Chinese tone modeling is reviewed and its application to noisy speech recognition is investigated. Experimental results and analysis are presented in Section 3. In Section 4, we give our conclusions.

2. Mandarin speech recognition with MSD based tone models

2.1. MSD for tone modeling

Multi-Space Distribution (MSD) was first proposed by Tokuda et al.
to model stochastically the piece-wise continuous F0 trajectory and was applied to HMM-based speech synthesis (Tokuda et al., 2002). It assumes that the observation space X of an event is made up of G sub-spaces. Each sub-space X_g has a prior probability p(X_g), and all the priors sum to one: \sum_{g=1}^{G} p(X_g) = 1. An observation vector o consists of a set of space indices I and a random variable x \in R^n, i.e., o = (I, x), and x is distributed in each sub-space according to an underlying pdf p_g(V(o)), where V(o) = x. The dimensionality of the observation vector can differ between sub-spaces. The observation probability of o is defined by

b(o) = \sum_{g \in S(o)} p(X_g) \, p_g(V(o))    (1)

where S(o) = I. The index set S(o) of the sub-spaces that observation o belongs to is determined by the extracted features x at each time instant of observation. A mixture of K Gaussians can be seen as a special case of MSD, i.e., K sub-spaces of MSD with the same dimensionality and a Gaussian distribution in each sub-space; the mixture weight c_k associated with the kth Gaussian component can be regarded as the prior probability of the kth sub-space, c_k = p(X_k). F0, the fundamental frequency or pitch, is the most relevant feature for recognizing tonal languages. But F0, a continuous variable, only exists in the voiced regions of speech signals. In unvoiced segments, where no pitch harmonics exist, a discrete random variable is adequate to characterize the unvoiced property. Fig. 1 shows two tonal syllables "ti2 gan4" (the numerical labels denote their tone types: tone 2 and tone 4) in their triphone representation; the F0 contours only span the voiced segments in t-i2+g and g-an4+r. A conventional statistical model can characterize a feature as either continuous or discrete, but not both. Therefore, the discontinuity of F0 between voiced and unvoiced segments makes tone modeling difficult.
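The observation probability of Eq. (1), with a delta-function unvoiced sub-space and Gaussian voiced sub-spaces as described above, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names and the use of `None` to mark an unvoiced frame are conventions of this sketch.

```python
import math

def gaussian(x, mean, var):
    """Univariate Gaussian density N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def msd_observation_prob(f0, weights, means, variances):
    """MSD output probability b(o) of Eq. (1) for a single frame.

    weights[0] is the prior of the zero-dimensional unvoiced sub-space
    (its pdf is a Kronecker delta, taken as 1); weights[1:], together with
    means and variances, parameterize the one-dimensional voiced Gaussian
    sub-spaces. An unvoiced frame is marked by f0 = None (our convention).
    """
    if f0 is None:
        # Only the discrete unvoiced sub-space contributes.
        return weights[0]
    # Voiced frame: weighted sum over the continuous sub-spaces.
    return sum(w * gaussian(f0, m, v)
               for w, m, v in zip(weights[1:], means, variances))
```

With weights [0.3, 0.7] (priors summing to one), an unvoiced frame simply scores the unvoiced prior 0.3, while a voiced frame is scored by the 0.7-weighted Gaussian.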
MSD effectively characterizes the piece-wise continuous F0 contour without resorting to unnecessary heuristic assumptions. In the voiced region, F0 is regarded as sequential, one-dimensional observations generated from
several one-dimensional Gaussian sub-spaces, while in the unvoiced region F0 is treated as a yes–no, indicator-like discrete symbol. We use Gaussian mixtures, the most common form in speech recognition systems, to characterize the output distributions. The MSD assumes that the output pdf of the zero-dimensional, unvoiced sub-space is a Kronecker delta function, while continuous F0 in the one-dimensional, voiced sub-space has a Gaussian mixture distribution.

[Fig. 1. F0 contour of tonal syllable "ti2 gan4" and a schematic representation of using MSD for tone modeling.]

Fig. 1 also gives a schematic representation of the MSD-based tone HMM. In the unvoiced part at the beginning, the consonant onset "t", the mixture weight representing the unvoiced sub-space is close to one, while the summed weight of the Gaussian mixture components corresponding to the voiced sub-spaces is close to zero. At the voiced syllable Final "an4", the opposite is true. MSD tone modeling does not need any artificial contour interpolation of F0 trajectories. It models the original F0 features in a probabilistic manner, and no hard decisions are needed.

2.2. Stream-dependent state tying

In LVCSR, context-dependent phone models, e.g., triphone models, are commonly used to capture the acoustic co-articulation between neighboring phones. To deal with the data-sparseness problem in estimating context-dependent phones, model parameters are usually tied together; e.g., state tying via decision-tree clustering is widely used in current LVCSR systems. While spectral features like MFCC represent the vocal tract information, tone features reflect the vibration frequency, or absence of vibration, of the vocal cords. They can be modeled through two independent data streams.
Moreover, the spectral (segmental) feature is distinctive within a phonetic or phone segment, while the pitch (supra-segmental) feature is embedded in a longer time window of a syllable, a word, a phrase or even a sentence. The co-articulation effects of spectral and tone features, i.e., their context dependencies, are different (Xu and Liu, 2006). Accordingly, it is more reasonable to perform state tying in the two streams separately. We design two question sets corresponding to tonal and phonetic context dependency, respectively. Decision-tree based clustering is then used to automatically find appropriate clusters for state tying. Each tonal syllable is divided into Initial and Tonal Final in the dictionary, and separate decision trees are built for each Initial and Tonal Final. An example of stream-dependent state tying based on decision-tree clustering is shown in Fig. 2, which illustrates the state-tying process performed on state 2 of all triphones with central phone "y". Two decision trees, spectral and pitch, are grown for this state using their own question sets. Going down from the tops of the two trees, different questions are used to split the data samples (states). We find that the pitch feature stream is mainly concerned with questions of tonal context, while the questions for the spectral feature stream query more of the segmental context.

2.3. MSD based tone modeling in noise

MSD is effective for modeling the piece-wise continuous F0 trajectory of a speech signal. But making the front-end feature extraction more robust, in producing correct pitch estimates and proper voiced/unvoiced decisions, is also important for providing correct features at low SNRs. Fig. 3 shows the F0 contour, spectrogram and speech waveform of a digit sequence "9(jiu2) 9(jiu3) 3(san1)" corrupted by additive street noise at 5 dB SNR. Due to the noise contamination, the F0 contour of the second digit "9" (jiu3) cannot be
successfully extracted, and the incorrect raw F0 values are erroneously interpolated. Such interpolation of F0 based on mis-tracked pitch can have a detrimental impact on recognition performance. However, the MSD-based HMM, designed to model the piece-wise continuous F0 contour stochastically, is more robust to noisy F0 features in the recognition process. Since voiced and unvoiced observations are evaluated with either a continuous Gaussian mixture or discrete probabilities, mis-detection of pitch can still have negative, but not disastrous, effects on the likelihood computation. For example, the missing F0 in the second digit "9" is evaluated as a stochastic event with a lower probability in MSD. If the MSD model is trained on both clean and noisy data, it is even more robust to pitch extraction errors. A popular tone feature preprocessing employs a continuation algorithm (Chen et al., 1997) to interpolate the missing F0 values in unvoiced regions. The pitch is interpolated by running an exponential decay function towards the running average, plus a random noise. The target value of the exponential decay function is usually set to the first F0 value of the next voiced segment. The entire interpolated F0 contour is then smoothed with a low-pass filter.

[Fig. 2. An example of stream-dependent state tying based on decision-tree clustering.]

[Fig. 3. F0 contour, spectrogram and waveform of a digit sequence "9 9 3" in street noise at 5 dB SNR.]

[Fig. 4. F0 contours of a digit sequence "9 9 3" contaminated by street noise at 5 dB SNR, with/without F0 interpolation.]

[Fig. 5. F0 contours of a digit sequence "9 9 3" (clean), with/without F0 interpolation.]

Figs.
4 and 5 show the F0 contours of a digit sequence "9 9 3" in clean and noisy conditions, respectively, with and without F0 interpolation. The interpolated F0 values depend on both the preceding and succeeding F0 values. Consequently, the interpolated F0 contour may deviate significantly from the true F0 values (if the frames are in fact voiced but missed by the pitch tracking algorithm) or take artificial values in truly unvoiced regions. Furthermore, interpolating F0 values over a long unvoiced region can be difficult for real-time applications, i.e., obtaining the target F0 value from the next voiced region requires a long look-ahead.

2.4. Tonal syllable recognition

The recognition process for tonal syllables can be written as

\hat{M} = \arg\max_{M} \prod_t P(q_t \mid q_{t-1}) \left[ \sum_k c^s_{kq_t} N\left(o^s_t; \mu^s_{kq_t}, \Sigma^s_{kq_t}\right) \right] \cdot \left[ \sum_k c^p_{kq_t} N\left(o^p_t; \mu^p_{kq_t}, \Sigma^p_{kq_t}\right) \right]    (2)
where M represents a tonal syllable sequence; q_t is the state at time t; o_t is divided into two streams, o^s_t for the spectral features and o^p_t for the pitch features; \sum_k c^s_{kq_t} N(o^s_t; \mu^s_{kq_t}, \Sigma^s_{kq_t}) is a mixture of Gaussians for the spectral features, where c^s_{kq_t} is the kth mixture weight; and \sum_k c^p_{kq_t} N(o^p_t; \mu^p_{kq_t}, \Sigma^p_{kq_t}) is an MSD trained on the pitch-related features with sub-space weights c^p_{kq_t}. In the implementation, we still use a Gaussian mixture distribution to represent the MSD, where the output pdf of the zero-dimensional, unvoiced sub-space is taken as N(o) = 1 and the other, voiced sub-spaces have Gaussian distributions. At state q_t, the spectral and pitch features access their own decision trees to fetch the corresponding parameters, but they share the same HMM state transition probabilities.

3. Experimental results and analysis

3.1. Speech databases

The recognition experiments are performed on a speaker-independent, large-vocabulary, continuously read and spontaneous Mandarin Chinese speech database (BJ2003) and a noisy speech database of connected Chinese digits (CNDigits). Detailed descriptions of the databases are given in the following subsections.

3.2. Experimental set-up

The configurations for the experiments are as follows:

(1) MFCC-39: 39 MFCC features; one stream; baseline without F0 features.
(2) INTP-42-1S: 39 MFCC and F0 + ΔF0 + ΔΔF0; one stream; interpolated F0 used for unvoiced speech; HMM for F0 modeling; baseline with F0 features.
(3) INTP-42-2S: the same setting as INTP-42-1S except with two streams.
(4) MSD-42-2S: 39 MFCC and F0 + ΔF0 + ΔΔF0; two streams; no F0 interpolation for unvoiced speech; MSD-HMM for F0 modeling.
(5) MSD-43-2S: MSD-42-2S + normalized duration.
(6) MSD-44-2S: MSD-43-2S + long-span pitch.

3.1.1. Gender-dependent read and spontaneous speech database

There are a total of 490 speakers (gender balanced) in BJ2003.
For each speaker, both read and spontaneous speech utterances were recorded. The read speech part was collected by asking the speaker to read through a set of Chinese texts, including modern novels and classical Chinese writings. For the spontaneous speech part, the speaker was asked to speak freely on a set of given topics, for example, "Describe your daily life in Beijing". The training data contain 50 h of read speech and 100 h of spontaneous speech from 230 male and 230 female speakers. Four thousand utterances from the remaining 16 male and 14 female speakers are designated as testing data.

3.1.2. Noisy speech database

CNDigits consists of 8000 digit strings for training and 39,480 digit strings for testing. The training set consists of clean speech (1600 sentences) and four kinds of noise: waiting room, street, bus and lounge. There are four subsets for each type of noise, each containing 400 sentences from 120 female and 200 male speakers, at SNRs from 5 to 20 dB in steps of 5 dB. The testing set covers the four noises seen in the training set (matched noise: waiting room, street, bus and lounge) and four additional noises unseen in training (mismatched noise: platform, shop, outside and exit), with five subsets for each type of noise, each containing 987 sentences from 56 female and 102 male speakers, at SNRs from 0 to 20 dB in steps of 5 dB.

MFCC-39 is considered the baseline without F0 features. It employs the standard 39-dimensional MFCC feature vectors. For INTP-42 and MSD-42, the feature vectors are appended with the instantaneous log F0 value and its first- and second-order derivatives. In INTP-42, the F0 features of unvoiced speech frames are obtained by interpolation with an exponential decay function (Chen et al., 1997), the conventional method of F0 interpolation for modeling.
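For reference, the conventional continuation scheme used by INTP-42 (exponential decay toward a target value, as described in Section 2.3) might look roughly like the sketch below. `alpha` is a hypothetical decay rate, and the random-noise term and final low-pass filtering of the published recipe are omitted for clarity.

```python
def interpolate_f0(f0, alpha=0.9):
    """Fill unvoiced gaps (None) in an F0 track by exponential decay.

    Each gap decays from the last voiced value toward the first F0 value
    of the next voiced segment (the usual target choice). A sketch only:
    the random noise and low-pass smoothing steps are left out.
    """
    out = list(f0)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        # Locate the unvoiced gap out[i:j].
        j = i
        while j < n and out[j] is None:
            j += 1
        prev = out[i - 1] if i > 0 else None
        target = out[j] if j < n else prev   # no following voiced segment
        if target is None:                   # track is entirely unvoiced
            target = prev = 0.0
        if prev is None:                     # leading gap: start at target
            prev = target
        for k in range(i, j):
            # One decay step from the previous value toward the target.
            prev = target + alpha * (prev - target)
            out[k] = prev
        i = j
    return out
```

For example, a gap between voiced values 100 and 200 Hz with alpha = 0.5 is filled with 150 and 175, approaching the target of the next voiced segment.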
INTP-42-1S is the baseline with F0 features, where the tone and spectral features are modeled in one stream and share the same decision trees, while in INTP-42-2S the tone and spectral features are separated into two streams and stream-dependent models are built (clustered) in two separate decision trees. For MSD-42-2S, no F0 value is assigned to unvoiced frames, and MSD is used to model the resulting partially continuous feature parameters. MSD-43-2S and MSD-44-2S are enhanced versions of MSD-42-2S, with the inclusion of a duration feature and a long-span feature parameter (Zhou et al., 2004). At each frame, the duration feature is computed as the interval length from the starting frame of the current voiced segment to the present frame, normalized with respect to the average tone duration (Zhou et al., 2004). The long-span pitch is computed by normalizing the pitch value with respect to the average pitch over the last ten frames of the preceding voiced segment. For each configuration, gender-dependent models for read and spontaneous speech are trained separately on the corresponding training data. In the baseline without F0 features, MFCC-39, all models are cross-word triphone HMMs with three emitting states. The phone set used is Phn187, which contains 187 phones (Initials and tonal Finals). A dictionary mapping tonal syllables to phones is used, with no multiple pronunciation variants. Decision-tree based clustering is applied for context-dependent state tying, giving about 3000 tied states after clustering. Each state has 32 Gaussian mixture components. For the noisy speech database, a whole-word HMM was trained for each of the ten Chinese digits ("0" to "9" as "ling2", "yi1 or yao1", "er4", "san1",
"si4", "wu3", "liu4", "qi1", "ba1" and "jiu3"). Each model consists of 10 left-to-right states without skipping. Each state output pdf is a mixture of three diagonal-covariance Gaussians. Since we focus on evaluating acoustic model performance, tonal syllables are used as the decoding output, which itself has useful applications such as Mandarin proficiency testing (Zhang et al., 2006). An unconstrained free tonal-syllable loop grammar is used for decoding read and spontaneous speech. A free word (digit or tonal syllable) loop is employed in the noisy digit speech decoding.

3.3. The performance of F0 extraction

The main difference between the MSD-HMM and the conventional interpolation method is that the MSD-HMM preserves the voiced/unvoiced information along with F0 in tone modeling. We evaluate the performance of F0 extraction, especially the voiced/unvoiced decisions, for noisy and spontaneous speech. F0 extraction is done on a short-time basis using the robust algorithm for pitch tracking (RAPT) (Talkin, 1995). The development of CNDigits closely follows that of Aurora 2 (Hirsch and Pearce, 2000): the eight types of noise are added to clean speech signals at five different SNRs, and each noisy digit string has a corresponding clean one. We use the F0 values extracted from clean speech as the reference and compare the mismatch of voiced/unvoiced decisions between clean and noisy digit strings. The percentage of voiced/unvoiced errors in F0 extraction, averaged over matched and mismatched noises at all SNRs, is shown in Fig. 6, which indicates that F0 tracking performance degrades significantly with decreasing SNR. For the BJ2003 database, no hand-marked reference is available as ground truth for evaluating F0 tracking performance. As a result, we assume that all frames of vowel segments, which are obtained from forced alignment, are voiced.
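The voiced/unvoiced mismatch measure of Fig. 6 amounts to a per-frame comparison against the clean-speech reference, which can be sketched as follows. This is a minimal illustration; the frame format, with `None` marking an unvoiced frame, is a convention of this sketch.

```python
def vuv_error_rate(ref_f0, test_f0):
    """Percentage of frames whose voiced/unvoiced decision differs.

    ref_f0 and test_f0 are per-frame F0 tracks of equal length, with None
    (or a non-positive value) marking an unvoiced frame; the clean-speech
    track serves as the reference, as in the Fig. 6 evaluation.
    """
    assert len(ref_f0) == len(test_f0)

    def voiced(v):
        return v is not None and v > 0

    swaps = sum(voiced(r) != voiced(t) for r, t in zip(ref_f0, test_f0))
    return 100.0 * swaps / len(ref_f0)
```

For instance, if two of four frames flip their voicing decision under noise, the error rate is 50%.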
The system used for forced alignment is our baseline without F0 features, MFCC-39. Table 1 lists the percentage of frames that fail in F0 tracking over all frames of vowel segments. The table shows that pitch tracking performance on read speech is slightly better than on spontaneous speech.

[Fig. 6. The percentage of voiced/unvoiced swapping errors at different SNRs.]

Table 1. The percentage of F0 tracking errors in vowel segments.

                 Male (%)   Female (%)
Read             22.90      19.44
Spontaneous      24.90      21.53

3.4. Experimental results

3.4.1. Gender-dependent read and spontaneous speech database

Table 2 shows the tonal-syllable error rates (TSER) attained using the different configurations on the BJ2003 database. MFCC-39 is used here as the baseline for both spontaneous and read speech. Two-stream tone modeling (INTP-42-2S) outperforms one stream (INTP-42-1S) by 0.25% and 1.4% in absolute TSER reduction for spontaneous and read speech recognition, respectively. For read speech, MSD-42-2S improves the TSER from 46.80% and 45.24% to 39.44% and 36.61%, i.e., relative TSER reductions of 15.73% and 19.08%, for the male and female parts of the database, respectively. It slightly outperforms INTP-42-2S. For spontaneous speech, the effectiveness of MSD-42-2S is more prominent than that of INTP-42-2S: MSD-42-2S reduces the absolute TSER by 3.56% and 4.03% for male and female speech, respectively, while only 0.3% and 1.62% corresponding TSER reductions are obtained by INTP-42-2S. The duration feature (MSD-43-2S) and the long-span pitch feature (MSD-44-2S) further improve the TSER for both read and spontaneous speech. Fig. 7 shows the average recognition performance, measured in relative TSER reduction compared with MFCC-39, for five feature configurations: INTP-42-1S, INTP-42-2S, MSD-42-2S, MSD-43-2S and MSD-44-2S on BJ2003. The maximum improvement, a 23.14% relative TSER reduction, is obtained for read female speech.
It also shows that the effectiveness of F0 features in improving recognition is smaller for spontaneous than for read speech, which may be due to the lower baseline recognition performance on spontaneous speech.

3.4.2. Noisy speech database

The baseline recognition performance (MFCC-39) in matched and mismatched noise conditions at various SNRs is shown in Table 3. Recognition performance degrades with decreasing SNR. The baseline system achieves an average 4.1% word (digit or tonal syllable) error rate (WER) in the clean condition. Fig. 8 shows the relative WER reductions, averaged over all noise conditions, of INTP-42-1S, INTP-42-2S, MSD-42-2S, MSD-43-2S and MSD-44-2S compared with MFCC-39. Among the five configurations, MSD-42-2S achieves the best performance, yielding 19.54% and 15.39% relative WER reductions averaged over all SNRs for matched and mismatched noises, respectively. INTP-42 can also improve recognition performance over the
baseline, but in a more limited way compared with MSD-42, and the two-stream tone models (INTP-42-2S) only slightly outperform one-stream modeling (INTP-42-1S). Incorporating more F0-related features, MSD-43-2S and MSD-44-2S, does not improve recognition performance further. This may be due to the fact that the pitch extraction module fails to detect voiced/unvoiced boundaries properly in noisy conditions; the duration and long-span pitch features are less reliable than the instantaneous F0 feature. Therefore, the maximum improvement on the CNDigits corpus is obtained from MSD tone modeling. To compare the performance of the MSD-HMM with the interpolated-F0 based HMM in noisy conditions, we list the detailed numbers at various SNRs in matched and mismatched noise conditions in Tables 4 and 5. The breakdown of the recognition performance of INTP-42-2S and MSD-42-2S in clean and noisy conditions at SNRs from 20 down to 0 dB is illustrated in Figs. 9 and 10. Fig. 9 shows that MSD-based tone modeling can significantly improve noisy Chinese digit recognition performance at SNRs from 20 to 5 dB.

Table 2. Tonal-syllable error rate (TSER) of the six configurations for BJ2003.

              Spontaneous (%)        Read (%)
              Male      Female       Male      Female
MFCC-39       69.01     60.55        46.80     45.24
INTP-42-1S    68.87     59.26        41.32     39.28
INTP-42-2S    68.71     58.93        39.94     37.86
MSD-42-2S     65.45     56.52        39.44     36.61
MSD-43-2S     64.79     55.91        39.00     35.59
MSD-44-2S     63.88     54.83        37.98     34.77

[Fig. 7. The average relative TSER reduction of INTP-42-1S, INTP-42-2S, MSD-42-2S, MSD-43-2S and MSD-44-2S for BJ2003, compared with MFCC-39.]

Table 3. Word error rate (WER) using MFCC-39 features on the CNDigits corpus.
         Matched noise (%)                       Mismatched noise (%)
         Waiting room  Street  Bus    Lounge     Platform  Shop   Outside  Exit
20 dB    3.86          4.37    3.64   4.24       3.62      4.55   3.63     4.04
15 dB    4.83          4.59    3.69   4.65       4.00      5.36   4.26     4.52
10 dB    8.79          5.41    3.96   5.73       5.07      8.71   5.19     5.41
5 dB     17.39         6.87    4.57   9.17       8.59      16.83  7.58     7.49
0 dB     32.53         11.07   6.23   19.05      17.93     34.48  15.12    14.49

[Fig. 8. The average relative WER reduction of INTP-42-1S, INTP-42-2S, MSD-42-2S, MSD-43-2S and MSD-44-2S for CNDigits, compared with MFCC-39.]
Table 4. WER of INTP-42-2S for CNDigits.

         Matched noise (%)                       Mismatched noise (%)
         Waiting room  Street  Bus    Lounge     Platform  Shop   Outside  Exit
20 dB    3.86          3.08    2.50   2.98       2.75      3.46   2.59     2.94
15 dB    5.37          3.40    2.44   3.38       3.31      4.33   3.21     3.14
10 dB    8.39          4.69    2.61   4.70       5.18      7.62   5.09     4.42
5 dB     18.29         8.83    3.46   10.45      11.81     15.88  11.77    8.49
0 dB     38.99         20.55   5.12   27.21      27.54     39.09  26.48    22.04

Table 5. WER of MSD-42-2S for CNDigits.

         Matched noise (%)                       Mismatched noise (%)
         Waiting room  Street  Bus    Lounge     Platform  Shop   Outside  Exit
20 dB    2.81          3.38    2.65   3.19       2.61      3.38   2.63     3.10
15 dB    4.33          3.60    2.71   3.48       3.45      4.19   3.03     3.29
10 dB    6.73          4.60    2.92   4.26       4.57      7.17   4.71     4.13
5 dB     14.28         6.70    3.47   7.94       8.41      14.22  8.21     6.54
0 dB     30.38         13.66   4.59   18.87      19.22     32.15  17.43    14.97

[Fig. 9. Relative WER reduction of MSD-42-2S for CNDigits compared with MFCC-39.]

[Fig. 11. Mean of the mixture weight for the unvoiced sub-space in the states of unvoiced and voiced phone models.]

This may be due to the fact that at such a low SNR the pitch extraction module fails to track the F0 contour. Fig. 10 shows that the recognition performance of F0 interpolation is much worse than the baseline at low SNRs, e.g., 5 and 0 dB, indicating that the interpolation method suffers more recognition performance loss from the deteriorated pitch estimates at those two SNRs.

[Fig. 10. Relative WER reduction of INTP-42-2S for CNDigits compared with MFCC-39.]

3.5. Results analysis and discussion
However, at 0 dB SNR, which is not included in the training data, the recognition performance is worse than that of MFCC-39 in mismatched noise conditions. We analyze the mixture weight values of unvoiced subspace in the states of unvoiced and voiced phones for MSD-HMM trained on male read speech. Their mean values are given in Fig. 11, in which we find the values of state 1 and state 3 in unvoiced phone model are lower than that of state 2, and opposite phenomena are observed in voiced phone model. We think that state 1 and state 3 are in a
transition between unvoiced and voiced segments, so they are less distinct than the central states in terms of their voiced/unvoiced characteristics.

In Table 2, we notice that MSD-42-2S performs much better on spontaneous speech than on read speech when compared with INTP-42-2S. We further analyze the base syllable (the syllable ignoring the tone label) error rate (BSER) and the tone error rate (TER) of the three configurations for BJ 2003, as shown in Table 6. Both INTP-42-2S and MSD-42-2S significantly improve TER over the baseline MFCC-39 system for spontaneous and read speech recognition. However, INTP-42-2S worsens the BSER of spontaneous speech, from the baseline 53.02% to 54.82% in the case of male speech. The high speaking rate, complex co-articulation patterns and pronunciation variations largely degrade recognition performance for spontaneous speech (Shinozaki et al., 2001; Fosler-Lussier and Morgan, 1999). A Chinese syllable has an Initial–Final structure, and most Initials are unvoiced. MSD-based F0 models naturally preserve the voiced/unvoiced information, which can indicate syllable boundaries and hence assist syllable recognition, especially when the spectral models do not fit the test data well.

Table 6
BSER (TER) in the three configurations: MFCC-39, INTP-42-2S and MSD-42-2S for BJ2003.

              Spontaneous (%)                  Read (%)
              Male            Female           Male            Female
MFCC-39       53.02 (49.09)   42.80 (44.49)    27.24 (34.42)   25.50 (32.40)
INTP-42-2S    54.82 (47.16)   42.81 (41.40)    26.12 (25.79)   24.00 (24.14)
MSD-42-2S     50.94 (44.46)   40.00 (40.09)    24.19 (26.34)   21.39 (23.87)

The recognition error patterns, or confusion matrices, generated by interpolation-based and MSD-based F0 modeling in lounge noise at 5 dB SNR are compared in Tables 7 and 8, for INTP-42-2S and MSD-42-2S, respectively. MSD can significantly reduce digit deletion errors, from 6.9% to 4.5%.
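These overall deletion rates can be reproduced from the "Del" columns of Tables 7 and 8. A minimal sketch, where the total of 7895 reference digits is obtained by summing all entries of either table's reference rows:

```python
# "Del" columns of Tables 7 (INTP-42-2S) and 8 (MSD-42-2S), in reference-digit
# order: ling, yao, yi, er, san, si, wu, liu, qi, ba, jiu.
DEL_INTP = [55, 0, 77, 37, 11, 9, 290, 13, 6, 9, 36]
DEL_MSD = [29, 0, 61, 37, 4, 3, 181, 11, 3, 13, 16]
TOTAL_REF_DIGITS = 7895  # total reference digits in the 5 dB lounge-noise test

def deletion_rate(del_counts, total_refs):
    """Deletion error rate (%) over all reference digits."""
    return 100.0 * sum(del_counts) / total_refs

intp_del = deletion_rate(DEL_INTP, TOTAL_REF_DIGITS)  # ~6.9%
msd_del = deletion_rate(DEL_MSD, TOTAL_REF_DIGITS)    # ~4.5%
```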
The majority of deletion errors are associated with the semi-vowel, low-pitch digit "5 (wu3)". The digit "5" is deleted frequently because it is not well separated from preceding or succeeding digits, i.e., there is no unvoiced consonant to "protect" it from being merged together with adjacent digits.

Table 7
The confusion matrix of the recognition results using INTP-42-2S under 5 dB SNR, lounge noise (rows: reference; columns: recognized).

       ling  yao   yi   er  san   si   wu  liu   qi   ba  jiu  Del
ling    708    1   15    0    0    1    6    3    2    0    0   55
yao       0   66    0    1    0    0    2    0    0    2    0    0
yi        2    3  610    0    0    1    0    2   10    1    0   77
er        0    2    0  687    1    1    3    3    0   20    0   37
san       0    0    1    1  838   17    0    0    0    2    0   11
si        0    0    0    1    4  730    0    0    2    0    0    9
wu        1    0    2    1    0    4  539    0    2    0    1  290
liu      15    2    2    1    0    0    5  683    1    0    6   13
qi        1    0    6    0    0    7    0    0  740    0    2    6
ba        0    0    0    6    1    1    0    0    0  821    0    9
jiu       5    3    0    1    1    0    5    6   23    0  709   36
Ins       1    0    9    2    2   16   24    1    6    0    0

Table 8
The confusion matrix of the recognition results using MSD-42-2S under 5 dB SNR, lounge noise (rows: reference; columns: recognized).

       ling  yao   yi   er  san   si   wu  liu   qi   ba  jiu  Del
ling    730    0   15    0    0    0    8    7    1    0    1   29
yao       0   62    0    4    0    0    1    3    0    1    0    0
yi        3    2  627    0    0    4    0    2    7    0    0   61
er        0    2    0  695    1    1    1    2    0   15    0   37
san       0    0    1    0  846   17    0    0    0    2    0    4
si        0    0    1    0    4  736    0    0    2    0    0    3
wu        1    0    1    0    1    5  648    0    1    0    2  181
liu      13    2    1    1    0    0    6  692    1    0    1   11
qi        0    0    6    0    0    9    0    0  743    0    1    3
ba        0    0    0    4    1    0    0    0    0  820    0   13
jiu       3    1    0    0    1    2    2    7   19    0  738   16
Ins       4    0    8    3    2   22   22    0    3    0    0

In this paper, we mainly focus on tonal syllable recognition of Chinese speech for applications such as the Mandarin proficiency test. The performance of Chinese LVCSR is usually measured by the Chinese character error rate (CER), since the definition of a word in Chinese is
somewhat vague. We integrate a language model (LM) with a dictionary of 60k words into the decoding process to test whether our MSD-HMM and two-stream tone modeling approach is still effective for general LVCSR. In the dictionary, words have an average of 1.1 pronunciations per word and an average length of 2.3 Mandarin characters. The LM was trained on a large text corpus (1714M Mandarin words) including news, novels, poems, and data collected from the World Wide Web. The word probabilities are smoothed by Good-Turing discounting and back-off smoothing. The perplexities of the bi-gram and tri-gram LMs used in decoding are 398 and 280, respectively. The bi-gram LM is employed for the first-pass search, while the tri-gram LM is used for rescoring the lattice generated by the first pass. Both bi-gram and tri-gram LMs are used for the gender-dependent read speech database. The recognition results show that MSD-HMM and two-stream tone modeling (MSD-44-2S) outperforms the baseline without F0 features (MFCC-39) by 2.6% and 1.5% in CER with bi-gram and tri-gram LMs, respectively. Compared with the conventional interpolation baseline system (INTP-42-1S), our system improves the CER from 13.2% and 9.5% to 12.3% and 8.9%, respectively. The breakdown of CER for the male and female read speech subsets of BJ 2003 is shown in Table 9. Since it is not trivial to find proper LMs for spontaneous speech and noisy digit speech recognition, we did not measure CER with LMs on those sets.

Table 9
CER of MFCC-39, INTP-42-1S, MSD-44-2S with bi-gram and tri-gram LMs for male and female read speech subsets of BJ 2003.

              Bi-gram (%)          Tri-gram (%)
              Male      Female     Male      Female
MFCC-39       16.91     12.85      12.09     8.71
INTP-42-1S    15.00     11.44      11.15     7.79
MSD-44-2S     14.03     10.56      10.55     7.21

4. Conclusions and future work

We propose an MSD and two-stream approach to tone modeling and apply it to speech recognition of tonal languages. The approach is highly effective in modeling the semi-continuous F0 trajectory and is distinctive in: (1) modeling the original F0 features without artificially interpolating F0 over the discontinuous unvoiced regions; (2) modeling tone and spectral features in two separate streams with stream-dependent state tying. The MSD-HMM, two-stream approach achieves a significant performance improvement in tonal syllable recognition on large-vocabulary, continuously read and spontaneous Mandarin speech and on noisy, continuous Mandarin digit speech. MSD-HMM tone modeling, which captures the instantaneous F0 information, is seamlessly integrated into the one-pass Viterbi decoding of continuous speech recognition. F0 information over a horizontal, longer time span can also be used to build explicit tone models for rescoring the decoding lattice in a second-pass search; this can further improve tonal syllable recognition beyond MSD-HMM alone, as demonstrated in our experiments on continuous read speech (Wang et al., 2006).

Acknowledgements

The authors appreciate the help of Prof. Keiichi Tokuda and Dr. Heiga Zen, Department of Computer Science, Nagoya Institute of Technology, Nagoya, Japan, for providing the MSD training tool in the HTS software via the website: http://hts.ics.nitech.ac.jp/. They also want to thank Yuting Yeung, Sheng Qiang and Huanliang Wang for their contributions to this research during their internships at Microsoft Research Asia.

References

Chang, E., Zhou, J.L., Di, S., Huang, C., Lee, K.-F., 2000. Large vocabulary Mandarin speech recognition with different approaches in modeling tones. In: Proc. ICSLP 2000, pp. 983–986.
Chen, C.J., Gopinath, R.A., Monkowski, M.D., Picheny, M.A., Shen, K., 1997. New methods in continuous Mandarin speech recognition. In: Proc. Eurospeech 1997, pp. 1543–1546.
Fosler-Lussier, E., Morgan, N., 1999.
Effects of speaking rate and word frequency on pronunciations in conversational speech. Speech Comm. 29, 137–158.
Freij, G.J., Fallside, F., 1988. Lexical stress recognition using hidden Markov models. In: Proc. ICASSP 1988, pp. 135–138.
Hirsch, H.G., Pearce, D., 2000. The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In: ISCA ITRW ASR.
Hirst, D., Espesser, R., 1993. Automatic modeling of fundamental frequency using a quadratic spline function. Travaux de l'Institut de Phonétique d'Aix 15, 71–85.
Ho, T.H., Liu, C.J., Sun, H., Tsai, M.Y., Lee, L.S., 1999. Phonetic state tied-mixture tone modeling for large vocabulary continuous Mandarin speech recognition. In: Proc. EuroSpeech 1999, pp. 883–886.
Lei, X., Siu, M., Hwang, M.-Y., Ostendorf, M., Lee, T., 2006. Improved tone modeling for Mandarin broadcast news speech recognition. In: Proc. InterSpeech 2006, pp. 1237–1240.
Lin, C.H., Wu, C.H., Ting, P.Y., Wang, H.M., 1996. Frameworks for recognition of Mandarin syllables with tones using sub-syllabic units. Speech Comm. 18 (2), 175–190.
Peng, G., Wang, W.S.-Y., 2005. Tone recognition of continuous Cantonese speech based on support vector machines. Speech Comm. 45, 49–62.
Qian, Y., Soong, F.K., Lee, T., 2006. Tone-enhanced generalized character posterior probability (GCPP) for Cantonese LVCSR. In: Proc. ICASSP 2006, Vol. 1, pp. 133–136.
Qian, Y., Lee, T., Soong, F.K., 2007. Tone recognition in continuous Cantonese speech using supratone models. J. Acoust. Soc. Amer. 121 (5), 2936–2945.
Qiang, S., Qian, Y., Soong, F.K., Xu, C.-F., 2007. Robust F0 modeling for Mandarin speech recognition in noise. In: Proc. InterSpeech 2007, pp. 1801–1804.
Seide, F., Wang, N.J.C., 2000. Two-stream modeling of Mandarin tones. In: Proc. ICSLP 2000, pp. 495–498.
Shinozaki, T., Hori, C., Furui, S., 2001. Towards automatic transcription of spontaneous presentations. In: Proc. Eurospeech 2001, pp. 491–494.
Talkin, D., 1995. A robust algorithm for pitch tracking (RAPT). In: Speech Coding and Synthesis. Elsevier Science B.V., Amsterdam, pp. 495–518.
Tian, Y., Zhou, J.-L., Chu, M., Chang, E., 2004. Tone recognition with fractionized models and outlined features. In: Proc. ICASSP 2004, Vol. 1, pp. 105–108.
Tokuda, K., Masuko, T., Miyazaki, N., Kobayashi, T., 2002. Multi-space probability distribution HMM. IEICE Trans. Inf. Systems E85-D (3), 455–464.
Wang, H.M., Ho, T.H., Yang, R.C., Shen, J.L., Bai, B.R., Hong, J.C., Chen, W.P., Yu, T.L., Lee, L.-S., 1997. Recognition of continuous Mandarin speech for Chinese language with very large vocabulary using limited training data. IEEE Trans. Speech Audio Process. 5, 195–200.
Wang, H.L., Qian, Y., Soong, F.K., Zhou, J.-L., Han, J.-Q., 2006. A multi-space distribution (MSD) approach to speech recognition of tonal languages. In: Proc. ICSLP 2006, pp. 1047–1050.
Wang, H.L., Qian, Y., Soong, F.K., Zhou, J.-L., Han, J.-Q., 2006. Improved Mandarin speech recognition by lattice rescoring with enhanced tone models. In: Proc. ISCSLP 2006, Springer LNAI 4274, pp. 445–453.
Xu, Y., Liu, F., 2006. Tonal alignment, syllable structure and coarticulation: toward an integrated model. Italian J. Linguist. 18, 125–159.
Zhang, J.-S., Nakamura, S., Hirose, K., 2005. Tone nucleus-based multi-level robust acoustic tonal modeling of sentential F0 variations for Chinese continuous speech tone recognition. Speech Comm. 46, 440–454.
Zhang, L., Huang, C., Chu, M., Soong, F.K., Zhang, X., Chen, Y., 2006. Automatic detection of tone mispronunciation in Mandarin. In: Proc. ISCSLP 2006, Springer LNAI 4274, pp. 590–601.
Zhou, J.-L., Tian, Y., Shi, Y., Huang, C., Chang, E., 2004. Tone articulation modeling for Mandarin spontaneous speech recognition. In: Proc. ICASSP 2004, pp. 997–1000.