Thesis

Published in: Technology

  1. 1. Noise Robust Speech Recognition of Missing or Uncertain Data
     José Andrés González López
     Advisors: Dr. Antonio M. Peinado Herreros, Dr. Ángel M. Gómez García
     Dpt. Signal Theory, Telecommunications and Networking, University of Granada
     Ph.D. Defence, February 25th, 2013 (49 slides)
  2. 2. Outline
     1. Introduction
     2. Feature Compensation based on Stereo Data
     3. Feature Compensation based on a Masking Model
     4. Temporal Modelling and Uncertainty Decoding
     5. Conclusions
  3. 3. Outline (section divider; same five sections as slide 2)
  4. 4. Robust ASR. The performance of ASR (Automatic Speech Recognition) systems degrades when training and testing conditions differ. This mismatch can be due to different factors:
     • Language complexity: grammar, vocabulary, spontaneous speech, ...
     • Speaker variability: accent, age, gender, ...
     • Environmental factors: background noise, channel distortion, room acoustics, ...
     In this work, we focus on the environmental factors, especially on the background noise and the channel distortion. Effect of noise on speech: noise modifies the speech distributions and causes loss of information.
  5. 5. Approaches for Noise Robustness. Different approaches to achieve noise robustness: robust feature extraction, model adaptation and feature modification. Feature compensation enhances the noisy features used for speech recognition. $y_t$ and $\hat{x}_t$ are, respectively, the feature vectors for noisy speech and estimated clean speech at time $t$. Uncertainty: information about the reliability of $\hat{x}_t$.
  6. 6. Objectives. Development of a set of compensation techniques for speech feature enhancement. To do this, a Bayesian estimation framework is adopted. Two different approaches for estimating clean speech are explored:
     • Feature compensation based on stereo data: clean and noisy recordings are used to derive a set of transformations applied to noisy speech.
     • Feature compensation based on a masking model: parametric models of speech degradation are used to estimate clean speech.
     Finally, an uncertainty decoding approach and temporal modelling of speech are also investigated.
  7. 7. Outline: Section 2, Feature Compensation based on Stereo Data (subsections: Introduction, MMSE Estimation, Experimental Results)
  8. 8. Introduction. Stereo data: simultaneous recordings of clean and noisy speech signals,
     $$(X, Y) = (\langle x_1, y_1 \rangle, \langle x_2, y_2 \rangle, \ldots, \langle x_T, y_T \rangle)$$
     The stereo data is used to learn the statistical relationship between the clean and noisy feature spaces. As a result, a set of transformations is derived to enhance speech in a certain acoustic environment. Acoustic environment: combination of additive and convolutional noises at a given SNR.
  9. 9. MMSE Estimation (I). MMSE estimation is chosen to obtain suitable estimates for the clean feature vectors,
     $$\hat{x} = E[x | y] = \int x \, p(x | y) \, dx$$
     Problem: $p(x | y)$ must be expressed in a convenient form. Solution: clean and noisy feature spaces are represented by VQ codebooks $\mathcal{M}_x$ and $\mathcal{M}_y$, respectively.
  10. 10. MMSE Estimation (II). Using these codebooks, the MMSE estimation can be expressed as
     $$\hat{x} = \sum_{k_x=1}^{M_x} P(k_x | k_y^*) \, \hat{x}^{(k_x)}$$
     • $P(k_x | k_y^*)$: mapping between the clean and noisy cells for a certain environment. Estimated using stereo data.
     • $\hat{x}^{(k_x)} = E[x | y, k_x, k_y^*]$: 3 alternatives (Q-VQMMSE, S-VQMMSE and W-VQMMSE).
  11. 11. Computation of $\hat{x}^{(k_x)}$.
     • Q-VQMMSE: assumes that both spaces are quantized. Also, this approach assumes that the spaces are independent. Then, $\hat{x}^{(k_x)} = \mu_x^{(k_x)}$.
     • S-VQMMSE: a correction is applied to $y$,
       $$\hat{x}^{(k_x)} = y - \left( \mu_y^{(k_y^*)} - \mu_x^{(k_x)} \right) = \mu_x^{(k_x)} + \underbrace{\left( y - \mu_y^{(k_y^*)} \right)}_{\Delta}$$
       where $\Delta$ is the quantization error.
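The Q- and S-VQMMSE estimators above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, $k_y^*$ is assumed to be the noisy codeword nearest to $y$ in Euclidean distance, and the codebooks and mapping table are toy values rather than anything trained on stereo data.

```python
def nearest_cell(v, codebook):
    """Index of the codeword closest to v (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(v, codebook[k])))


def vq_mmse(y, mu_x, mu_y, p_kx_given_ky, variant="S"):
    """VQ-based MMSE estimate of the clean vector for one noisy vector y.

    mu_x[kx], mu_y[ky]   : clean / noisy codewords (lists of floats)
    p_kx_given_ky[ky][kx]: P(kx | ky), assumed learned from stereo data
    variant              : "Q" uses the clean codeword alone,
                           "S" adds the quantization error y - mu_y[ky*]
    """
    ky = nearest_cell(y, mu_y)  # assumption: ky* is the nearest noisy cell
    delta = [yi - mi for yi, mi in zip(y, mu_y[ky])]
    x_hat = [0.0] * len(y)
    for kx, weight in enumerate(p_kx_given_ky[ky]):
        for d in range(len(y)):
            partial = mu_x[kx][d] + (delta[d] if variant == "S" else 0.0)
            x_hat[d] += weight * partial
    return x_hat


# Toy one-dimensional example (all numbers are made up for illustration).
toy_mu_x = [[0.0], [10.0]]
toy_mu_y = [[2.0], [11.0]]
toy_map = [[0.9, 0.1], [0.2, 0.8]]
q_estimate = vq_mmse([3.0], toy_mu_x, toy_mu_y, toy_map, variant="Q")
s_estimate = vq_mmse([3.0], toy_mu_x, toy_mu_y, toy_map, variant="S")
```

Because the weights $P(k_x | k_y^*)$ sum to one, the S variant in this sketch simply shifts the Q estimate by the quantization error.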
  12. 12. Improving the Mapping Accuracy. Subregion modelling: $C_y^{(k_x, k_y)}$ is the subset of the noisy cell $k_y$ whose corresponding clean vectors belong to $k_x$. Similarly, $C_x^{(k_x, k_y)}$ is the subset of $k_x$ whose corresponding noisy vectors are $C_y^{(k_x, k_y)}$.
  13. 13. Whitening-transformation based VQMMSE. W-VQMMSE assumes that the subregions of both feature spaces are Gaussian distributed, e.g.
     $$C_x^{(k_x, k_y)} \sim \mathcal{N}\left( \mu_x^{(k_x, k_y)}, \Sigma_x^{(k_x, k_y)} \right)$$
     Computation of $E[x | y, k_x, k_y]$: the following whitening transformation is applied,
     $$E[x | y, k_x, k_y] = \mu_x^{(k_x, k_y)} + \left( \Sigma_x^{(k_x, k_y)} \right)^{1/2} \left( \Sigma_y^{(k_x, k_y)} \right)^{-1/2} \left( y - \mu_y^{(k_x, k_y)} \right)$$
     After some manipulations the MMSE estimation becomes
     $$\hat{x} = A^{(k_y^*)} y + b^{(k_y^*)}$$
     where the parameters of the affine transformation can be precomputed offline for each noisy cell $k_y = 1, \ldots, M_y$.
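The manipulation that leads to the affine form is not shown on the slide. One plausible reading, assuming the whitening-based conditional mean is simply substituted into the codebook sum $\hat{x} = \sum_{k_x} P(k_x | k_y^*) \, E[x | y, k_x, k_y^*]$, is the following; the shorthand $T^{(k_x, k_y)}$ is introduced here only for compactness.

```latex
\begin{align*}
T^{(k_x, k_y)} &\equiv \left( \Sigma_x^{(k_x, k_y)} \right)^{1/2}
                       \left( \Sigma_y^{(k_x, k_y)} \right)^{-1/2} \\
\hat{x} &= \sum_{k_x=1}^{M_x} P(k_x | k_y^*)
           \left[ \mu_x^{(k_x, k_y^*)} + T^{(k_x, k_y^*)}
                  \left( y - \mu_y^{(k_x, k_y^*)} \right) \right] \\
        &= \underbrace{\left[ \sum_{k_x=1}^{M_x} P(k_x | k_y^*) \, T^{(k_x, k_y^*)} \right]}_{A^{(k_y^*)}} y
         + \underbrace{\sum_{k_x=1}^{M_x} P(k_x | k_y^*)
           \left[ \mu_x^{(k_x, k_y^*)} - T^{(k_x, k_y^*)} \mu_y^{(k_x, k_y^*)} \right]}_{b^{(k_y^*)}}
\end{align*}
```

Under this reading, neither $A^{(k_y^*)}$ nor $b^{(k_y^*)}$ depends on $y$, which is consistent with the statement that both can be precomputed offline per noisy cell.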
  14. 14. Experimental Setup.
     • Recognition task: based on the Aurora2 noisy digits database.
     • Acoustic environments: 9 noises at 7 SNRs (clean, 20, 15, 10, 5, 0, and -5 dB).
     • Speech features: ETSI FE Standard (13 MFCCs + Δ + Δ²).
     • Front-end speech models: codebooks with 256 components.
     • SPLICE and MEMLIN are also evaluated (i.e. GMM-based MMSE estimation).
     • A priori knowledge on the acoustic environment is assumed.
  15. 15. FE Results. Word accuracy (WAcc, %) per SNR condition:

     System       Clean   20 dB   15 dB   10 dB   5 dB    0 dB    -5 dB   Avg.
     Baseline     99.02   90.79   75.53   50.70   25.86   11.27    6.18   50.83
     Matched      99.02   98.66   98.29   97.02   92.16   75.78   34.88   92.38
     SPLICE       99.02   98.09   95.87   88.88   70.62   39.04   15.99   78.50
     MEMLIN       99.02   98.36   97.01   92.43   78.26   47.03   18.76   82.62
     Q-VQMMSE     96.19   93.72   90.21   81.24   61.82   31.33   14.39   71.66
     S-VQMMSE     99.02   97.93   96.28   90.57   74.70   43.02   18.57   80.50
     iW-VQMMSE    99.02   98.23   96.79   91.60   76.82   46.60   20.02   82.01
     dW-VQMMSE    99.02   98.33   97.06   92.43   78.70   48.88   20.26   83.08
     fW-VQMMSE    99.02   98.37   97.15   92.88   79.61   50.04   20.89   83.61

     • Matched: HMMs trained under the same conditions as in testing.
     • iW-, dW-, fW-: identity, diagonal and full covariance matrices.
     • MEMLIN and iW-VQMMSE behave almost identically, but our proposal is more efficient.
     • When the dynamic features are also processed, MEMLIN and fW-VQMMSE achieve similar results: 87.67 % vs. 87.31 %.
  16. 16. AFE Results. Word accuracy (WAcc, %) per SNR condition:

     System       Clean   20 dB   15 dB   10 dB   5 dB    0 dB    -5 dB   Avg.
     Baseline     99.02   90.79   75.53   50.70   25.86   11.27    6.18   50.83
     AFE          99.22   98.24   96.95   93.68   84.37   62.46   29.53   87.14
     Matched      99.02   98.66   98.29   97.02   92.16   75.78   34.88   92.38
     Q-VQMMSE     95.60   93.56   91.28   85.25   70.23   39.20   12.84   75.90
     S-VQMMSE     99.22   98.32   97.39   94.71   86.30   63.07   27.46   87.96
     iW-VQMMSE    99.22   98.61   97.93   95.89   89.19   69.46   32.62   90.22
     dW-VQMMSE    99.22   98.70   98.05   96.19   89.93   71.47   34.94   90.87
     fW-VQMMSE    99.22   98.65   97.99   96.10   89.92   72.29   36.57   90.99

     • AFE: ETSI Advanced Front-End.
     • The proposed techniques are applied to the features extracted by AFE.
     • The combined systems AFE+VQMMSE increase the robustness of AFE against noise.
  17. 17. Outline: Section 3, Feature Compensation based on a Masking Model (subsections: Speech Masking Model, TGI, MMSR, Noise Model Estimation)
  18. 18. Introduction. Speech degradation model: an analytical model that relates $y$ with $x$ and $n$ (the additive noise vector). Model-based compensation: the degradation model is used to derive the MMSE estimator.
     • Advantage: no stereo data is required. Thus, unknown distortions can be mitigated.
     • Limitation: MMSE estimation only tackles the distortions considered in the degradation model, e.g. additive and convolutional noises.
     • Limitation: noise needs to be estimated.
     We will only consider the robustness to additive noise here.
  19. 19. Speech Masking Model. In the log-Mel domain, the degradation model is approximated by
     $$y = \log(e^x + e^n)$$
     This model can be rewritten as
     $$y = \max(x, n) + \varepsilon(x, n)$$
     Disregarding $\varepsilon(x, n)$, the speech masking model is
     $$y \approx \max(x, n)$$
     [Figure: distribution of $\varepsilon(x, n)$ in Aurora2; horizontal axis $\varepsilon(x, n)$, vertical axis probability.]
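The size of the term that the masking model discards can be checked numerically. The sketch below works only with the approximated model $y = \log(e^x + e^n)$ stated above, for which $\varepsilon(x, n) = \log(1 + e^{-|x - n|})$ lies between 0 and $\log 2$; the function names are hypothetical, and residuals measured on real data (as in the Aurora2 figure) need not obey this bound.

```python
import math


def log_add(x, n):
    """Numerically stable log(exp(x) + exp(n)) for scalar log-Mel energies."""
    return max(x, n) + math.log1p(math.exp(-abs(x - n)))


def masking_residual(x, n):
    """epsilon(x, n) = log(exp(x) + exp(n)) - max(x, n)."""
    return log_add(x, n) - max(x, n)


# The residual peaks at log(2) when speech and noise have equal energy
# and vanishes quickly as one of the two sources dominates.
equal_energy = masking_residual(5.0, 5.0)
speech_dominant = masking_residual(10.0, 0.0)
```

This is why $y \approx \max(x, n)$ is a reasonable approximation whenever speech and noise energies in a channel differ by more than a few units in the log domain.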
  20. 20. Spectral Reconstruction: Problems. According to the speech masking model, the observation can be rearranged into $y = (y_r, y_u)$.
     • Reliable features ($x_r \approx y_r$): speech is dominant.
     • Unreliable features ($-\infty \le x_u \le y_u$): speech is masked by noise.
     Thus, feature compensation can be reformulated as two different problems:
     1. Segregation of the noisy spectra into speech and noise. This yields a mask where the reliable and unreliable features are identified.
     2. Spectral reconstruction, i.e. estimating the speech energy in the unreliable features.
     Two alternative techniques are proposed here: TGI only deals with problem 2; MMSR addresses both 1 and 2.
  21. 21. Truncated-Gaussian based Imputation. TGI estimates the speech energy in the unreliable regions of the observed spectrogram. To do this, the correlation between features is exploited. Prerequisites: the segregation binary mask is known in advance. After spectral reconstruction, MFCC features can be computed and used for recognition.
  22. 22. MMSE Estimation of the Unreliable Features. MMSE estimation is used again to reconstruct the unreliable features,
     $$\hat{x}_u = E[x_u | x_r = y_r, -\infty \le x_u \le y_u]$$
     Speech model: $p(x)$ is modelled as a Gaussian Mixture Model (GMM),
     $$p(x) = \sum_{k=1}^{M} P(k) \, \mathcal{N}\left( x; \mu^{(k)}, \Sigma^{(k)} \right)$$
     Applying this model, the MMSE estimation is given by
     $$\hat{x}_u = \sum_{k=1}^{M} P(k | y_r, y_u) \, \hat{x}_u^{(k)}$$
     Problem: computation of $P(k | y_r, y_u)$ and $\hat{x}_u^{(k)}$.
  23. 23. Posterior Computation. After applying Bayes' rule, the posterior can be expressed as
     $$P(k | y_r, y_u) = \frac{p(y_r, y_u | k) \, P(k)}{\sum_{k'=1}^{M} p(y_r, y_u | k') \, P(k')}$$
     $p(y_r, y_u | k)$ is factorized as the following product,
     $$p(y_r, y_u | k) = p(y_r | k) \int_{-\infty}^{y_u} p(x_u | y_r, k) \, dx_u$$
     • $p(y_r | k) = \mathcal{N}(y_r; \mu_r^{(k)}, \Sigma_r^{(k)})$: marginal PDF of the reliable features.
     • $p(x_u | y_r, k) = \mathcal{N}(x_u; \mu_{u|r}^{(k)}, \Sigma_{u|r}^{(k)})$: conditional PDF of the unreliable features given the reliable ones.
  24. 24. Partial Estimates. According to the speech masking model, $x_u \in (-\infty, y_u]$. Thus,
     $$\hat{x}_u^{(k)} = \int_{-\infty}^{y_u} x_u \, p(x_u | y_r, k) \, dx_u$$
     Independence is assumed to solve the integral. The partial estimate $\hat{x}_u^{(k)} = \tilde{\mu}^{(k)}(y)$ corresponds to the mean of a right-truncated Gaussian PDF.
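A minimal sketch of the TGI steps described above (posterior, truncated-Gaussian partial estimates, weighted sum) is given below. It is not the thesis implementation: it assumes diagonal covariances, so that the conditional PDF of each unreliable feature reduces to its marginal, it normalizes the truncated density by its probability mass below $y_u$, and all function names and toy parameters are hypothetical.

```python
import math


def std_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def truncated_mean(mu, sigma, upper):
    """Mean of N(mu, sigma^2) restricted to (-inf, upper]."""
    beta = (upper - mu) / sigma
    mass = std_cdf(beta)
    if mass < 1e-300:  # almost no mass below the bound: fall back to the bound
        return upper
    return mu - sigma * std_pdf(beta) / mass


def tgi_reconstruct(y, reliable, gmm):
    """Impute the unreliable entries of y given a binary mask.

    gmm is a list of (prior, means, stds) tuples with diagonal covariances.
    Raises ZeroDivisionError if every component likelihood underflows.
    """
    weights = []
    for prior, means, stds in gmm:
        lik = prior
        for yi, is_rel, m, s in zip(y, reliable, means, stds):
            z = (yi - m) / s
            # reliable: Gaussian density at y; unreliable: mass below y
            lik *= std_pdf(z) / s if is_rel else std_cdf(z)
        weights.append(lik)
    total = sum(weights)
    x_hat = []
    for i, (yi, is_rel) in enumerate(zip(y, reliable)):
        if is_rel:
            x_hat.append(yi)
        else:
            x_hat.append(sum(w / total * truncated_mean(means[i], stds[i], yi)
                             for w, (_, means, stds) in zip(weights, gmm)))
    return x_hat


# Toy two-component GMM; the reliable first feature selects the second component.
toy_gmm = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [10.0, 10.0], [1.0, 1.0])]
toy_estimate = tgi_reconstruct([10.0, 12.0], [True, False], toy_gmm)
```

In the toy example the reliable feature makes the posterior concentrate on the second component, so the imputed value is essentially that component's truncated mean and never exceeds the observed upper bound.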
  25. 25. Example. [Figure: four spectrogram panels (Mel channel vs. time in s, 0.5 to 3.0 s) for the utterance "eight six zero one one six two": Clean, Noisy (0 dB), Oracle mask, TGI reconstruction.]
  26. 26. Experimental Setup.
     • Databases: Aurora2 and Aurora4.
     • The 3 test sets (A, B and C) of Aurora2 are considered.
     • Aurora4: 5000-word recognition task based on the Wall Street Journal corpus. Two testing conditions: Test 01-07 includes utterances with artificially added acoustic noise (random SNR between 10 dB and 20 dB); Test 08-14 combines acoustic noise with different microphones.
     • TGI is evaluated using both oracle (OR) and estimated (EST) binary masks.
     • Noise estimation: linear interpolation of the first and last frames of the utterance.
     • Front-end speech model: GMM with 256 components.
  27. 27. Experimental Results. [Figure: WAcc (%) on Aurora2 and Aurora4 for Baseline, CBR-OR, TGI-OR, CBR-EST and TGI-EST.]
     • CBR: Cluster-Based Reconstruction (Raj et al., 2004).
     • TGI outperforms CBR when oracle masks are used. The difference is small when the masks are estimated.
     • Large margin for improvement between OR and EST ⇒ a more robust approach for speech/noise segregation is required.
  28. 28. Masking-Model based Spectral Reconstruction. As we have seen, TGI achieves excellent results when oracle masks are used. However, its performance diminishes when the masks are estimated ⇒ the noise estimation errors can be magnified by the hard decision implemented by the binary masks. MMSR uses the noise estimates directly in the MMSE estimation. Advantages with respect to TGI: no a priori segregation mask is required. Therefore, the feature reliability and the speech energy in the unreliable regions are jointly estimated.
  29. 29. MMSR: Diagram.
     • $\mathcal{M}_x$: GMM with $M_x$ Gaussians.
     • $\mathcal{M}_n$: GMM with $M_n$ Gaussians (alternatively, a noise estimate $n_t \sim \mathcal{N}_n(\hat{n}_t, \Sigma_{n,t})$ for each frame).
     • MMSE estimation:
       $$\hat{x} = \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n | y) \, \hat{x}^{(k_x, k_n)}$$
  30. 30. Posterior Computation. Applying Bayes' rule, $P(k_x, k_n | y) \propto p(y | k_x, k_n) P(k_x) P(k_n)$. Independence assumption: $p(y | k_x, k_n)$ for the whole feature vector is expressed as the product of the scalar likelihoods $p(y | k_x, k_n)$ of every observed feature $y$. According to the masking model, the scalar likelihood is computed as
     $$p(y | k_x, k_n) = \underbrace{p_x(y | k_x) \, P_n(n \le y | k_n)}_{p(x = y, \, n \le y \, | \, k_x, k_n)} + \underbrace{p_n(y | k_n) \, P_x(x < y | k_x)}_{p(n = y, \, x < y \, | \, k_x, k_n)}$$
     The first term is the probability that speech is dominant; the second term is the probability that noise is dominant.
  31. 31. Partial Estimates. Contrary to TGI, the reliability of the observed feature $y$ is unknown in MMSR. Hence, both the reliable and unreliable cases are taken into account,
     $$\hat{x}^{(k_x, k_n)} = w^{(k_x, k_n)} \, y + \left( 1 - w^{(k_x, k_n)} \right) \tilde{\mu}_x^{(k_x)}$$
     The first term is the estimate for high SNRs; the second is the estimate for masked speech (i.e. the truncated PDF mean). $w^{(k_x, k_n)} = P(x = y, n < y | k_x, k_n)$ is the normalized speech presence probability.
  32. 32. MMSR: Mask Estimation. MMSR can also be considered as a robust method for speech segregation. To see this, we reproduce here the final expression for the MMSE estimator,
     $$\hat{x} = \underbrace{\left[ \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n | y) \, w^{(k_x, k_n)} \right]}_{m} y + \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n | y) \left( 1 - w^{(k_x, k_n)} \right) \tilde{\mu}_x^{(k_x)}$$
     $m \in [0, 1]$ acts as a soft mask: $m \approx 1$ for the reliable features and $m \approx 0$ for the unreliable ones. Advantages regarding other methods: parameter free; mask estimation is fully integrated within the reconstruction.
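For a single log-Mel feature, the MMSR posterior, partial estimates and soft mask can be combined as in the sketch below. It is an illustration under stated assumptions rather than the thesis code: speech and noise are scalar Gaussian mixtures given as (prior, mean, standard deviation) tuples, $w^{(k_x, k_n)}$ is taken as the speech-dominant term divided by the full per-feature likelihood, and the function names are hypothetical.

```python
import math


def norm_pdf(v, mu, sigma):
    z = (v - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def norm_cdf(v, mu, sigma):
    return 0.5 * math.erfc(-(v - mu) / (sigma * math.sqrt(2.0)))


def truncated_mean(mu, sigma, upper):
    """Mean of N(mu, sigma^2) restricted to (-inf, upper]."""
    mass = norm_cdf(upper, mu, sigma)
    if mass < 1e-300:  # almost no mass below the bound: fall back to the bound
        return upper
    return mu - sigma * sigma * norm_pdf(upper, mu, sigma) / mass


def mmsr_scalar(y, speech_gmm, noise_gmm):
    """Return (x_hat, m) for one observed feature y.

    Raises ZeroDivisionError if every likelihood term underflows to zero.
    """
    total = mask_acc = masked_acc = 0.0
    for p_x, mu_x, sd_x in speech_gmm:
        for p_n, mu_n, sd_n in noise_gmm:
            # p(x = y, n <= y | kx, kn): speech is dominant
            speech_dom = norm_pdf(y, mu_x, sd_x) * norm_cdf(y, mu_n, sd_n)
            # p(n = y, x < y | kx, kn): noise is dominant
            noise_dom = norm_pdf(y, mu_n, sd_n) * norm_cdf(y, mu_x, sd_x)
            prior = p_x * p_n
            total += prior * (speech_dom + noise_dom)
            # posterior * w and posterior * (1 - w), before normalization
            mask_acc += prior * speech_dom
            masked_acc += prior * noise_dom * truncated_mean(mu_x, sd_x, y)
    m = mask_acc / total
    return m * y + masked_acc / total, m


# Toy single-component models (values chosen only for illustration).
x_clean, m_clean = mmsr_scalar(12.0, [(1.0, 12.0, 1.0)], [(1.0, 3.0, 1.0)])
x_masked, m_masked = mmsr_scalar(8.0, [(1.0, 2.0, 2.0)], [(1.0, 8.0, 1.0)])
```

In the first toy case the observation sits on the speech mean and far above the noise, so the soft mask is close to 1 and the observation is kept; in the second the observation matches the noise model, the mask is close to 0, and the estimate falls back to the truncated speech mean.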
  33. 33. Experimental Results. [Figure: WAcc (%) on Aurora2 and Aurora4 for Baseline, TGI-OR, TGI-EST, MMSR and VTS.]
     • VTS: well-known model-based compensation technique (Moreno, 1996).
     • MMSR outperforms TGI-EST and is upper-bounded by TGI-OR.
     • VTS is slightly better than MMSR ⇒ more accurate noise models can reduce the gap.
  34. 34. MMSR: Diagram
  35. 35. MMSR: Diagram
  36. 36. EM-based Noise Model Estimation. Objective: estimate the noise model used in MMSR. Noise model: GMM with $M_n$ Gaussians,
     $$\mathcal{M}_n = \left\{ \pi_n^{(1)}, \mu_n^{(1)}, \Sigma_n^{(1)}, \ldots, \pi_n^{(M_n)}, \mu_n^{(M_n)}, \Sigma_n^{(M_n)} \right\}$$
     where $\pi_n^{(k_n)}$ ($k_n = 1, \ldots, M_n$) are the component priors. Maximum Likelihood estimation:
     $$\hat{\mathcal{M}}_n = \operatorname*{argmax}_{\mathcal{M}_n} \, p(y_1, \ldots, y_T | \mathcal{M}_n, \mathcal{M}_x)$$
     Direct optimization of this expression is infeasible ⇒ an iterative EM approach is used.
  37. 37. Overview. Problems:
     • The oracle mask is unknown ⇒ the soft mask estimated by MMSR is used.
     • Treatment of the speech-dominated regions: the noise in these regions can be estimated using the model obtained in the previous iteration.
  38. 38. Experimental Results. [Figure: WAcc (%) vs. number of noise GMM components for Aurora2 (2 to 8 components, WAcc axis 85 to 86.5 %) and Aurora4 (2 to 10 components, WAcc axis 68 to 69.5 %); curves: estimated noise and GMM noise model.]
     • A small but consistent performance improvement is achieved when using GMM noise models in MMSR.
     • GMMs are worse than the estimated noise in 2 cases. 1-Gaussian GMMs: unable to properly model non-stationary noises. Complex GMMs: not enough data to robustly estimate the GMM parameters.
  39. 39. Outline: Section 4, Temporal Modelling and Uncertainty Decoding (subsections: Temporal Modelling, Uncertainty Decoding)
  40. 40. Temporal Modelling. More accurate MMSE estimates are obtained with better speech models. Here, the temporal correlation of speech is considered. Two alternative approaches:
     • Patch-based modelling: short segments of speech are modelled instead of single frames.
     • HMM modelling: the previous speech models (GMMs or VQ codebooks) are augmented with transition probabilities. Then,
       $$\hat{x}_t = \sum_{k=1}^{M} P(k | y_1, \ldots, y_t, \ldots, y_T) \, E[x | y_t, k]$$
  41. 41. Experimental Results. [Figure: WAcc (%) on Aurora2 and Aurora4 for TGI-OR, PATCH-OR, HMM-OR, TGI-EST, PATCH-EST and HMM-EST.]
     • The PATCH and HMM approaches are applied in combination with TGI.
     • Spectral reconstruction benefits from temporal redundancy, especially at low SNRs.
     • The HMM-based modelling achieves the best results.
  42. 42. Uncertainty Decoding (I). The accuracy of MMSE estimation depends on many factors, such as the SNR of the signal, stationarity of the noise, etc. Inaccurate $\hat{x}_t$ can degrade the performance of ASR. Two objectives:
     1. Estimate the uncertainty/reliability of $\hat{x}_t$.
     2. Account for this information in the recognizer.
  43. 43. Uncertainty Decoding (II).
     Uncertainty of $\hat{x}$: depends on the $p(x | y)$ that appears in the MMSE estimator.
     • If $p(x | y) = \delta_y(x)$, then we will consider that $\hat{x}$ is fully reliable.
     • If $p(x | y)$ is uniformly distributed, then $\hat{x}$ is badly estimated.
     How to measure the uncertainty of $\hat{x}$?
     • Entropy of $p(x | y)$.
     • Variance of the MMSE estimate: $\Sigma_{\hat{x}}$.
     Exploitation in the recognizer:
     • Soft-data decoding: $\Sigma_{\hat{x}}$ increases the variance of the Gaussians in the acoustic model.
     • Weighted Viterbi Algorithm: the exponential factor $\rho \in [0, 1]$ used to weight the observation probabilities of $\hat{x}$ is obtained after applying a sigmoid function to $\mathrm{MSE} = \mathrm{tr}(\Sigma_{\hat{x}})$.
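The two ways of exploiting the uncertainty in the recognizer can be illustrated for a single diagonal-covariance Gaussian of the acoustic model. This is a rough sketch under assumptions that are not in the slides: the sigmoid is taken to be decreasing in the MSE, its `slope` and `midpoint` parameters are hypothetical tuning values, and the function names are made up for the example.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def soft_data_loglik(x_hat, var_hat, mean, var):
    """Soft-data decoding: the estimate's variance inflates the model variance."""
    inflated = [v + u for v, u in zip(var, var_hat)]
    return log_gauss_diag(x_hat, mean, inflated)


def reliability_weight(var_hat, slope=1.0, midpoint=1.0):
    """Exponent rho in [0, 1] from a sigmoid of MSE = tr(Sigma_x_hat).

    Assumption: rho decreases with the MSE, so reliable frames get rho near 1.
    """
    mse = sum(var_hat)
    z = min(slope * (mse - midpoint), 700.0)  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(z))


def weighted_loglik(x_hat, var_hat, mean, var, slope=1.0, midpoint=1.0):
    """Weighted Viterbi: b(x)^rho in the probability domain is rho * log b(x)."""
    rho = reliability_weight(var_hat, slope, midpoint)
    return rho * log_gauss_diag(x_hat, mean, var)


# Toy frame: a confident estimate and a very uncertain one.
rho_reliable = reliability_weight([0.0, 0.0], slope=10.0, midpoint=1.0)
rho_unreliable = reliability_weight([5.0, 5.0], slope=10.0, midpoint=1.0)
```

With zero estimate variance the soft-data likelihood reduces to the ordinary Gaussian likelihood, while a small weight $\rho$ flattens the observation score of an unreliable frame so that it influences the Viterbi path less.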
  44. 44. Experimental Results. [Figure: WAcc (%) on Aurora2 and Aurora4 for Baseline, TGI-OR, UD-OR, TGI-EST and UD-EST.]
     • UD: TGI + Weighted Viterbi Algorithm.
     • OR vs. EST: oracle masks and oracle uncertainties vs. estimated masks and uncertainties.
     • The recognition performance is improved after accounting for the uncertainty, especially in Aurora4.
  45. 45. Outline: Section 5, Conclusions
  46. 46. Conclusions (I). The performance of ASR is severely affected by noise. To improve the robustness of ASR to noise, a feature compensation approach has been adopted in this thesis. Stereo-data based compensation: stereo recordings are used to estimate a set of transformations that are later applied to noisy speech.
     • Excellent results for the environments seen during training.
     • Efficient implementation without a significant performance degradation when VQ codebooks are used.
     • The proposed techniques can be used to reduce the residual noise of other robust techniques.
  47. 47. Conclusions (II). Model-based compensation: a model that considers the distortion of speech as a masking problem is used to derive two reconstruction techniques.
     • TGI estimates the masked regions in the noisy spectra. Good results if the masking pattern is perfectly known; otherwise its performance is significantly affected.
     • MMSR uses clean speech and noise models to enhance noisy speech. Unlike TGI, mask estimation is an integrated part of the reconstruction algorithm.
     • An EM-based iterative algorithm has been proposed to estimate the noise models used by MMSR.
     Finally, several approaches to account for temporal correlations and to decode uncertain speech evidence were also investigated.
  48. 48. Future Work.
     • Speech masking model vs. perceptual masking.
     • EM algorithm: joint estimation of additive and convolutional noises.
     • Using more information in MMSR, e.g. pitch, onset/offset position, etc.
     • Joint speaker and noise compensation.
  49. 49. Thank you!
