Gaussian Process Experts for Voice Conversion

Published at Interspeech, August 2011

Conventional approaches to voice conversion typically use a GMM to represent the joint probability density of source and target features. This model is then used to perform spectral conversion between speakers. This approach is reasonably effective but can be prone to overfitting and oversmoothing of the target spectra. This paper proposes an alternative scheme that uses a collection of Gaussian process experts to perform the spectral conversion. Gaussian processes are robust to overfitting and oversmoothing and can predict the target spectra more accurately. Experimental results indicate that the objective performance of voice conversion can be improved using the proposed approach.

Transcript of "Gaussian process experts for voice conversion"

Gaussian Process Experts for Voice Conversion
Nicholas Pilkington, Heiga Zen, Mark J. F. Gales (Toshiba Research Europe Ltd)

1. Introduction

GMM-based voice conversion (VC)
· Overview
  - Train a GMM on the joint PDF of source & target features using a parallel speech corpus
  - Derive the conditional PDF of target features given source features
  - Predict target features from the conditional PDF
· Limitations
  - Oversmoothing ⇒ poor modeling
  - Overfitting ⇒ poor generalization

Gaussian process (GP)-based VC
· Gaussian process [Rasmussen;06]
  - Non-parametric Bayesian model
  - Many advantages over parametric approaches
    * Flexibility
    * Robustness against overfitting
  ⇒ Can address the limitations of GMM-based VC

2. GMM-based VC [Kain;98]

Process overview
[Diagram: joint vectors z^*_n are built from source (x^*) and target (y^*) training frames; a GMM models their joint PDF, and the conditional PDFs p(y_t | x_t, \hat{\lambda}^{(z)}) convert source test frames x_1, ..., x_T into \hat{y}_1, ..., \hat{y}_T]

1. Make joint vectors from source & target speaker training data:
   z_t = [x_t^\top, y_t^\top]^\top
2. Model the joint PDF of source & target data by a GMM:
   p(z_t \mid \lambda^{(z)}) = \sum_{m=1}^{M} w_m \, \mathcal{N}(z_t; \mu_m^{(z)}, \Sigma_m^{(z)})
   \mu_m^{(z)} = \begin{bmatrix} \mu_m^{(x)} \\ \mu_m^{(y)} \end{bmatrix}, \quad \Sigma_m^{(z)} = \begin{bmatrix} \Sigma_m^{(xx)} & \Sigma_m^{(xy)} \\ \Sigma_m^{(yx)} & \Sigma_m^{(yy)} \end{bmatrix}, \quad \hat{\lambda}^{(z)} = \arg\max_{\lambda^{(z)}} p(z^* \mid \lambda^{(z)})
3. Derive the conditional PDF given x from the joint PDF:
   p(y_t \mid x_t, \hat{\lambda}^{(z)}) = \sum_{m=1}^{M} P(m \mid x_t, \hat{\lambda}^{(z)}) \, p(y_t \mid x_t, m, \hat{\lambda}^{(z)})
   p(y_t \mid x_t, m, \hat{\lambda}^{(z)}) = \mathcal{N}(y_t; \mu_m^{(y|x_t)}, \Sigma_m^{(y|x_t)})
   \mu_m^{(y|x_t)} = \mu_m^{(y)} + \Sigma_m^{(yx)} (\Sigma_m^{(xx)})^{-1} (x_t - \mu_m^{(x)})
   \Sigma_m^{(y|x_t)} = \Sigma_m^{(yy)} - \Sigma_m^{(yx)} (\Sigma_m^{(xx)})^{-1} \Sigma_m^{(xy)}
4. Predict target speech features from the conditional PDF by MMSE:
   \hat{y}_t = \sum_{m=1}^{M} P(m \mid x_t, \hat{\lambda}^{(z)}) \, \mu_m^{(y|x_t)}
⇒ Discontinuity due to frame-by-frame mapping

3. GMM-based VC w/ dynamic features [Toda;07]

Use static & dynamic experts for conversion
1. Include both static & dynamic features in the joint vectors:
   \Delta x_t = x_{t+1} - x_t, \quad \Delta y_t = y_{t+1} - y_t, \quad z_t = [x_t^\top, y_t^\top, \Delta x_t^\top, \Delta y_t^\top]^\top
2. Model the joint PDF of source & target data by a GMM:
   \mu_m^{(z)} = [\mu_m^{(x)\top}, \mu_m^{(y)\top}, \mu_m^{(\Delta x)\top}, \mu_m^{(\Delta y)\top}]^\top
   \Sigma_m^{(z)} = \begin{bmatrix} \Sigma_m^{(xx)} & \Sigma_m^{(xy)} & 0 & 0 \\ \Sigma_m^{(yx)} & \Sigma_m^{(yy)} & 0 & 0 \\ 0 & 0 & \Sigma_m^{(\Delta x \Delta x)} & \Sigma_m^{(\Delta x \Delta y)} \\ 0 & 0 & \Sigma_m^{(\Delta y \Delta x)} & \Sigma_m^{(\Delta y \Delta y)} \end{bmatrix}
3. Derive static & dynamic experts given x from the joint PDF:
   Static: \mathcal{N}(y_t; \mu_{\hat{m}_t}^{(y|x_t)}, \Sigma_{\hat{m}_t}^{(y|x_t)})
   Dynamic: \mathcal{N}(\Delta y_t; \mu_{\hat{m}_t}^{(\Delta y|\Delta x_t)}, \Sigma_{\hat{m}_t}^{(\Delta y|\Delta x_t)})
4. Predict target speech features while satisfying both experts:
   \hat{y} = \arg\max_{y} \prod_{t=1}^{T} \mathcal{N}(y_t; \mu_{\hat{m}_t}^{(y|x_t)}, \Sigma_{\hat{m}_t}^{(y|x_t)}) \, \mathcal{N}(\Delta y_t; \mu_{\hat{m}_t}^{(\Delta y|\Delta x_t)}, \Sigma_{\hat{m}_t}^{(\Delta y|\Delta x_t)})
⇒ Maximization can be solved efficiently using the speech parameter generation algorithm

4. Gaussian Processes

Typical form of parametric prediction models:
   y_t = f(x_t; \lambda) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)
   p(y_t \mid x_t, \lambda) = \mathcal{N}(y_t; f(x_t; \lambda), \sigma^2)
⇒ Instead, use a non-parametric kernel-based model

· Gaussian process (GP)
  - Probabilistic, non-parametric Bayesian model ⇒ no need to fix a parametric form a priori
  - A distribution over functions
  - The covariance matrix is given by the Gramian matrix via the kernel function
[Figure: samples drawn from a GP prior and from the corresponding GP posterior]

· GP regression
  - The mapping function is now a sample from a GP:
    f(x) \sim \mathcal{GP}(m(x), k(x, x'))
    where m(x) is the mean function and k(x, x') is the covariance (kernel) function
  - GP predictive distribution, given training inputs x^* = (x_1^*, \ldots, x_N^*) and targets y^*:
    p(y_t \mid x_t, x^*, y^*) = \mathcal{N}(y_t; \mu(x_t), \Sigma(x_t))
    \mu(x_t) = m(x_t) + k_t^\top [K^* + \sigma^2 I]^{-1} (y^* - \mu^*)
    \Sigma(x_t) = k(x_t, x_t) + \sigma^2 - k_t^\top [K^* + \sigma^2 I]^{-1} k_t
    \mu^* = [m(x_1^*), m(x_2^*), \ldots, m(x_N^*)]^\top
    K^* = [k(x_i^*, x_j^*)]_{i,j=1}^{N} \quad \text{(Gramian matrix)}
    k_t = [k(x_1^*, x_t), k(x_2^*, x_t), \ldots, k(x_N^*, x_t)]^\top

5. GP-based VC

Using GPs for VC
[Diagram: joint vectors from source & target training data feed GP regression; the Gramian matrix yields predictive distributions p(y_t | x_t, x^*, y^*) for the source test frames]

· Use the GP directly
  - MMSE estimate of the target speech features: \hat{y}_t = \mu(x_t)
  ⇒ Discontinuity due to frame-by-frame mapping
· Use static & dynamic GP experts
  - GP experts for static & dynamic features:
    Static: y_t \sim \mathcal{N}(\mu(x_t), \Sigma(x_t))
    Dynamic: \Delta y_t \sim \mathcal{N}(\mu(\Delta x_t), \Sigma(\Delta x_t))
  - Predict target speech features while satisfying both experts:
    \hat{y} = \arg\max_{y} \prod_{t=1}^{T} \mathcal{N}(y_t; \mu(x_t), \Sigma(x_t)) \, \mathcal{N}(\Delta y_t; \mu(\Delta x_t), \Sigma(\Delta x_t))
  - The predictive distribution is still Gaussian
  ⇒ Maximization can be solved efficiently using the speech parameter generation algorithm

6. Implementation

Covariance & mean functions
· Covariance (kernel) function
  - A measure describing the local covariance
  - e.g., linear, squared exponential, polynomial
· Mean function
  - Describes the mean characteristics
  - e.g., m(x) = 0, m(x) = b, m(x) = ax, m(x) = ax + b
  - We used m(x) = ax + b

Partitioning the input space
  - Inverting the N × N Gramian matrix is intractable in memory & computation when the amount of data is large
  - Partition the input space by LBG clustering (e.g., 32 sub-spaces), with one GP expert per sub-space

7. Experiment

Conditions
  Database        CMU ARCTIC speech database
  Speakers        CLB & SLT (CLB ⇒ SLT)
  Training data   50 utterances (34,664 frames)
  Test data       50 utterances not included in the training data
  Sampling freq.  16 kHz
  Frame shift     5 ms
  Feature vector  0th-40th mel-cepstral coefficients, Δ & ΔΔ
  Methods         GMM w/o dynamic features, GMM w/ dynamic features, trajectory GMM, GP w/o dynamic features, GP w/ dynamic features
  Mapping         Spectrum: mapped by the probabilistic model; F0: linearly transformed in the log domain

Objective evaluation (mel-cepstral distortion, dB)

GP-based VC with various covariance (kernel) functions:

  Covariance function              w/o dyn.   w/ dyn.
  Linear                           3.96       4.15
  Linear + ARD                     3.95       4.15
  Matérn                           4.96       5.99
  Neural network                   4.96       5.95
  Polynomial                       4.95       5.80
  Piecewise polynomial, isotropic  4.96       6.00
  Rational quadratic + ARD         4.96       5.98
  Rational quadratic, isotropic    4.96       5.98
  Squared exponential + ARD        4.95       5.98
  Squared exponential, isotropic   4.95       5.98

Performance of conventional approaches:

  #mix    GMM w/o dyn.   GMM w/ dyn.   Trajectory GMM
  2       5.97           5.95          5.90
  4       5.75           5.82          5.81
  8       5.66           5.69          5.63
  16      5.56           5.59          5.52
  32      5.49           5.53          5.45
  64      5.43           5.45          5.38
  128     5.40           5.38          5.33
  256     5.39           5.35          5.35
  512     5.41           5.33          5.42
  1,024   5.50           5.34          5.64

[Figure: converted spectra (0-8 kHz) for GMM, GMM w/ dyn., trajectory GMM, GP, and GP w/ dyn.]
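As a concrete illustration of the conversion step in Section 2, here is a minimal Python/numpy sketch of the MMSE mapping from a trained joint-density GMM. The function name gmm_mmse_convert and the argument layout are illustrative, not from the paper, and GMM training itself (e.g., by EM) is assumed already done.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_mmse_convert(x, weights, means, covs, dx):
    """MMSE conversion of one source frame x with a joint-density GMM.

    x       : source feature vector, shape (dx,)
    weights : mixture weights w_m, shape (M,)
    means   : joint means [mu_m^(x); mu_m^(y)], shape (M, dx + dy)
    covs    : joint covariances Sigma_m^(z), shape (M, dx + dy, dx + dy)
    """
    M = len(weights)
    # Posterior P(m | x_t) from the source-side marginal of each mixture
    post = np.array([
        weights[m] * multivariate_normal.pdf(x, means[m, :dx], covs[m, :dx, :dx])
        for m in range(M)])
    post /= post.sum()
    # MMSE estimate: posterior-weighted sum of conditional means mu_m^(y|x_t)
    y_hat = np.zeros(means.shape[1] - dx)
    for m in range(M):
        cond_mean = means[m, dx:] + covs[m, dx:, :dx] @ np.linalg.solve(
            covs[m, :dx, :dx], x - means[m, :dx])
        y_hat += post[m] * cond_mean
    return y_hat
```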
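The GP predictive distribution in Section 4 translates almost line for line into numpy. The sketch below assumes a squared-exponential kernel and the linear mean m(x) = a·x + b that the poster reports using; hyperparameter estimation (e.g., by marginal-likelihood maximization) is omitted and all names are illustrative.

```python
import numpy as np

def sq_exp_kernel(A, B, ell=1.0, sf=1.0):
    """Squared-exponential kernel k(x, x') = sf^2 exp(-|x - x'|^2 / (2 ell^2))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return sf**2 * np.exp(-0.5 * d2 / ell**2)

def gp_predict(x_star, y_star, x_t, a, b, sigma2, kernel=sq_exp_kernel):
    """GP predictive mean and variance for one output dimension.

    x_star : training source features, shape (N, D)
    y_star : training target features (one dimension), shape (N,)
    x_t    : test source features, shape (T, D)
    a, b   : parameters of the linear mean function m(x) = a.x + b
    """
    mu_star = x_star @ a + b                     # mu* = [m(x*_1) ... m(x*_N)]
    K = kernel(x_star, x_star)                   # Gramian matrix K*
    k_t = kernel(x_star, x_t)                    # k_t, one column per test frame
    # Cholesky solve of [K* + sigma^2 I] alpha = (y* - mu*)
    L = np.linalg.cholesky(K + sigma2 * np.eye(len(x_star)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_star - mu_star))
    mu = (x_t @ a + b) + k_t.T @ alpha           # mu(x_t)
    v = np.linalg.solve(L, k_t)
    var = kernel(x_t, x_t).diagonal() + sigma2 - np.sum(v**2, axis=0)  # Sigma(x_t)
    return mu, var
```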
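Sections 3 and 5 both end with a maximization over whole trajectories under static and dynamic Gaussian experts. Because every factor is Gaussian, the maximizer is the solution of a banded linear system, which is what the speech parameter generation algorithm exploits. Below is a single-dimension sketch, assuming per-frame scalar variances and the first-difference delta Δy_t = y_{t+1} − y_t defined in Section 3; as a boundary choice not specified by the poster, the delta expert is applied only for t = 1, ..., T−1.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def parameter_generation(mu_s, var_s, mu_d, var_d):
    """argmax_y prod_t N(y_t; mu_s[t], var_s[t]) N(dy_t; mu_d[t], var_d[t]).

    mu_s, var_s : static predictive means/variances, shape (T,)
    mu_d, var_d : dynamic predictive means/variances, shape (T - 1,)
    """
    T = len(mu_s)
    # Stack the static (identity) and delta (first-difference) windows into W
    I = sp.identity(T, format="csr")
    D = sp.diags([-np.ones(T - 1), np.ones(T - 1)], [0, 1],
                 shape=(T - 1, T), format="csr")
    W = sp.vstack([I, D]).tocsr()
    # Precision-weighted normal equations: (W' P W) y = W' P mu
    P = sp.diags(np.concatenate([1.0 / var_s, 1.0 / var_d]))
    mu = np.concatenate([mu_s, mu_d])
    return spsolve((W.T @ P @ W).tocsc(), W.T @ P @ mu)
```

The system matrix W'PW is tridiagonal here, so the solve is linear in T, matching the efficiency claim on the poster.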
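Finally, Section 6's partitioning can be sketched by clustering the source training frames and running one GP expert per cluster. The poster partitions with LBG; plain k-means is substituted here as a stand-in (both yield a Voronoi partition of the input space), gp_predict refers to the sketch above, and the default of 32 sub-spaces matches the poster's example.

```python
import numpy as np
from sklearn.cluster import KMeans

def partitioned_gp_predict(x_star, y_star, x_t, n_parts=32, **gp_kwargs):
    """Train and apply one GP expert per sub-space of the source space."""
    km = KMeans(n_clusters=n_parts, n_init=10).fit(x_star)
    test_part = km.predict(x_t)
    mu, var = np.empty(len(x_t)), np.empty(len(x_t))
    for c in range(n_parts):
        tr, te = km.labels_ == c, test_part == c
        if te.any():
            # Each expert inverts only its own, much smaller Gramian matrix
            mu[te], var[te] = gp_predict(x_star[tr], y_star[tr], x_t[te],
                                         **gp_kwargs)
    return mu, var
```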
