Decision tree-based context clustering based on cross validation and hierarchical priors
 

ICASSP, May 2011

The standard, ad-hoc stopping criteria used in decision tree-based context clustering are known to be sub-optimal and require parameters to be tuned. This paper proposes a new approach for decision tree-based context clustering based on cross validation and hierarchical priors. Combination of cross validation and hierarchical priors within decision tree-based context clustering offers better model selection and more robust parameter estimation than conventional approaches, with no tuning parameters. Experimental results on HMM-based speech synthesis show that the proposed approach achieved significant improvements in naturalness of synthesized speech over the conventional approaches.

    Presentation Transcript

    • Decision Tree-Based Context Clustering Based on Cross Validation and Hierarchical Priors
      Heiga Zen & Mark J. F. Gales
      Toshiba Research Europe Ltd.
      Cambridge Research Lab
      ICASSP 2011 @ Prague, Czech Republic
      May 26th, 2011
    • Context-Dependent Acoustic Modeling
      • Speech is too variable to use context-independent HMMs
      • Use context-dependent HMMs
      • ASR → left & right phonetic contexts (triphone)
      • TTS → phonetic, prosodic, & grammatical contexts (full-context)
      • Issues
      • Too many parameters → robustness of estimated parameters
      • Unseen models → generalization ability
      • Use the same model for different (but similar) contexts
      • How to build parameter-tying structure?
      2
    • Decision Tree-Based Context Clustering [1]
      • Binary tree that clusters states (distributions)
      • Y/N questions about contexts (see the sketch after this slide)
      • Advantages
      • Effective treatment of unseen models
      • Expert knowledge as questions
      • Control model complexity by size of trees
      • Issues
      • Ad-hoc stopping criterion
      • No regularization


      [Figure: decision tree with yes/no questions (C=voiced?, R=silence?, L=“w”?, L=“gy”?) clustering context-dependent states such as k-+b/A:… and t-e+h/A:… into leaf nodes]


      3
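
As a rough illustration of how such yes/no context questions can be implemented, the sketch below tests one context field of a full-context label against a set of values; the label layout, regex, helper names, and phone sets are assumptions for illustration only, not the paper's actual question set.

```python
import re

# Hypothetical full-context label layout "L-C+R/...": left, centre, right phone.
LABEL_RE = re.compile(r"^(?P<L>[^-]+)-(?P<C>[^+]+)\+(?P<R>[^/]+)")

def make_question(field, values):
    """A question is a yes/no test: does context `field` of the label lie in `values`?"""
    def ask(label):
        m = LABEL_RE.match(label)
        return m is not None and m.group(field) in values
    return ask

# Illustrative questions in the spirit of the slide (phone sets are made up).
q_c_voiced  = make_question("C", {"b", "d", "g", "v", "z", "a", "e", "o"})
q_r_silence = make_question("R", {"sil", "sp"})

print(q_c_voiced("t-e+h/A:..."))     # True: centre phone "e" is in the voiced set
print(q_r_silence("k-a+sil/A:..."))  # True: right phone is silence
```

Each question therefore induces a binary partition of the states pooled at a node, which is all the clustering procedure on the following slides needs.
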
    • Proposed Work
      Decision tree-based context clustering using cross validation (CV) & hierarchical priors
      • CV [2] to approximate test-set log likelihood
      • Better generalization & automatic stopping criterion
      • Hierarchical priors similar to structural MAP (SMAP) [4]
      • More robust parameter estimation
      • Combination of CV & SMAP = CVSMAP
      → Automatic determination of hyper-parameters
      4
    • Outline
      Background
      Decision tree-based clustering with cross validation & hierarchical priors
      • Procedure
      • ML / MDL
      • CV
      • CVSMAP
      Experiments
      • Setup
      • Results
      Conclusions
      5
    • Procedure of Tree-Based Context Clustering
      1. Pool all data together to form the root node
      2. For each leaf node, select the best question, i.e. the one whose split of the node into two maximizes (or minimizes) the objective function
      3. Repeat 2 until the stopping criterion is met
      4. Tie parameters within each leaf node (a code sketch of this loop follows this slide)
      [Figure: the same decision tree as on slide 3, with full-context states such as k-+b/A:… and t-e+h/A:… assigned to its leaf nodes]
      6
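
A minimal sketch of the greedy loop above, assuming states carry a context label plus sufficient statistics and that `gain` is whichever objective (ML, MDL, CV, or CVSMAP) is in use; all names are illustrative rather than the authors' implementation.

```python
def split_node(states, ask):
    """Partition the states of a node with one yes/no context question."""
    yes = [s for s in states if ask(s["label"])]
    no  = [s for s in states if not ask(s["label"])]
    return yes, no

def grow_tree(states, questions, gain, min_gain=0.0):
    """Greedy tree-based clustering: repeatedly apply the best split to the best leaf
    until no split improves the objective by more than min_gain."""
    leaves = [states]                        # step 1: pool everything at the root
    splits = []
    while True:
        best = None
        for i, leaf in enumerate(leaves):    # step 2: best question for each leaf
            for name, ask in questions:
                yes, no = split_node(leaf, ask)
                if not yes or not no:
                    continue                 # question does not actually split this node
                g = gain(leaf, yes, no)
                if best is None or g > best[0]:
                    best = (g, i, name, yes, no)
        if best is None or best[0] <= min_gain:
            return leaves, splits            # step 3: stopping criterion met
        g, i, name, yes, no = best
        splits.append((name, g))
        leaves[i:i+1] = [yes, no]            # replace the split leaf by its two children
```

Step 4 then shares one output distribution across all states that fall in the same leaf. The approaches on the next slides differ only in what `gain` computes.
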
    • ML [1] / MDL [3] Approaches
      ML approach
      • Objective function: self-test log likelihood
      • Stopping criterion: ad-hoc thresholds
      MDL approach
      • Objective function: self-test log likelihood
      • Stopping criterion: penalty term for model complexity
      → In practice, the penalty term is empirically scaled
      [Figure: in the ML/MDL approaches, the same training data provides the 0th/1st/2nd order stats used both to ML-estimate each leaf Gaussian and to compute its self-test log likelihood] (a code sketch of these objectives follows this slide)
      7
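
Under the same sufficient-statistics assumptions, a sketch of the ML objective and an MDL-flavoured variant. The self-test log likelihood of a node depends only on its occupancy and ML variance; the MDL penalty is written with an explicit scale `alpha` to reflect the slide's note that it is empirically scaled in practice, and its exact form follows [3] rather than this sketch.

```python
import numpy as np

def leaf_loglik_ml(occ, sum1, sum2):
    """Self-test log likelihood of a node: a diagonal Gaussian is ML-fitted to, and
    scored on, the data summarized by the node's 0th/1st/2nd order stats."""
    mean = sum1 / occ
    var = np.maximum(sum2 / occ - mean ** 2, 1e-20)   # variance floor
    return -0.5 * occ * np.sum(np.log(2.0 * np.pi * var) + 1.0)

def gain_ml(parent, yes, no):
    """ML objective: increase in self-test log likelihood from the split."""
    return leaf_loglik_ml(*yes) + leaf_loglik_ml(*no) - leaf_loglik_ml(*parent)

def gain_mdl(parent, yes, no, dim, root_occ, alpha=1.0):
    """MDL-style objective: the same gain minus a complexity penalty for the extra
    Gaussian, roughly proportional to its parameter count and the log data size."""
    return gain_ml(parent, yes, no) - alpha * dim * np.log(root_occ)
```
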
    • Cross Validation Approach [2]
      [Figure: in the CV approach, the training data is split into folds; each fold is held out and scored against a Gaussian ML-estimated from the other folds' 0th/1st/2nd order stats, and the per-fold CV log likelihoods are summed into the total CV log likelihood] (a code sketch follows this slide)
      8
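
A sketch of the CV log likelihood of one node, again from 0th/1st/2nd order statistics: accumulate the stats per fold, score each held-out fold against a Gaussian ML-estimated from the remaining folds, and sum. Function names are illustrative.

```python
import numpy as np

def loglik_from_stats(occ, sum1, sum2, mean, var):
    """Log likelihood of held-out data, summarized by (occ, sum1, sum2),
    under a diagonal Gaussian N(mean, var)."""
    sq = sum2 - 2.0 * mean * sum1 + occ * mean ** 2    # sum of squared deviations
    return np.sum(-0.5 * occ * np.log(2.0 * np.pi * var) - 0.5 * sq / var)

def cv_loglik(fold_stats):
    """Total CV log likelihood of a node; fold_stats is a list of (occ, sum1, sum2)."""
    tot_occ  = sum(f[0] for f in fold_stats)
    tot_sum1 = sum(f[1] for f in fold_stats)
    tot_sum2 = sum(f[2] for f in fold_stats)
    ll = 0.0
    for occ, s1, s2 in fold_stats:
        tr_occ, tr_s1, tr_s2 = tot_occ - occ, tot_sum1 - s1, tot_sum2 - s2
        mean = tr_s1 / tr_occ                              # train on the other folds...
        var = np.maximum(tr_s2 / tr_occ - mean ** 2, 1e-20)
        ll += loglik_from_stats(occ, s1, s2, mean, var)    # ...test on the held-out fold
    return ll
```

A split's CV gain is then `cv_loglik(yes) + cv_loglik(no) - cv_loglik(parent)`, and growing stops when no question gives a positive gain, which is the automatic stopping criterion discussed on the next slide.
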
    • Advantages & Drawbacks
      Advantages
      • CV log likelihood is more reliable
      • Self-test log likelihood → same data for training & test
      • CV log likelihood → different data for training & test
      • Clear stopping criterion
      • CV log likelihood decreases → overfitting → stop
      Drawbacks
      • Still based on ML estimates of Gaussians
      → Unreliable if data is small
      • Series of hard decisions based on the unreliable estimates
      → May not yield robust parameters & good model size
      9
    • Proposed CVSMAP Approach
      [Figure: as in the CV approach, but the 0th/1st/2nd order stats at the parent node are normalized & scaled to form a prior, each leaf Gaussian is MAP-estimated, and the held-out folds again give a total CV log likelihood] (a code sketch of the MAP estimate follows this slide)
      10
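
A sketch of the SMAP-style MAP estimate that replaces the ML Gaussian inside the CV scoring: the parent node's statistics act as a hierarchical prior weighted by a hyper-parameter tau. This uses a standard conjugate-prior form; the paper's exact normalization & scaling of the parent stats may differ, so treat it as illustrative only.

```python
import numpy as np

def map_gaussian(occ, sum1, sum2, prior_mean, prior_var, tau):
    """MAP mean/variance of a leaf Gaussian with a parent-node prior of weight tau;
    small-occupancy leaves are pulled toward the parent, large ones toward the ML fit."""
    mean = (tau * prior_mean + sum1) / (tau + occ)
    sq = sum2 - 2.0 * mean * sum1 + occ * mean ** 2      # deviations from the MAP mean
    var = (tau * (prior_var + (prior_mean - mean) ** 2) + sq) / (tau + occ)
    return mean, np.maximum(var, 1e-20)
```

Here `prior_mean` and `prior_var` would be derived from the (normalized & scaled) stats of the node being split, and the resulting MAP Gaussian of each child is scored on the held-out folds exactly as in `cv_loglik` above.
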
    • Automatic Determination of Hyper-Parameters
      • CV is often used to determine values of tuning parameters
      • CVSMAP performs CV at all splits & has a hyper-parameter
      • Hyper-parameters can therefore be determined at each split by CV
      • For each split, the hyper-parameter value that maximizes the CV log likelihood is selected from pre-defined candidate values
      • Question selection & evaluation of the stopping criterion are then based on the CV log likelihood obtained with the selected value (see the sketch after this slide)
      11
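
A sketch of the per-split hyper-parameter search: evaluate every candidate value, keep the one with the highest CVSMAP objective, and use that value both for question selection and for the stopping test. `cvsmap_gain` is a placeholder for a function combining the MAP estimate and CV scoring sketched above; the candidate grid is the one listed on the experimental-conditions slide.

```python
TAU_CANDIDATES = [1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 10, 100, 1e3, 1e4, 1e5]

def best_cvsmap_split(parent_folds, yes_folds, no_folds, cvsmap_gain):
    """Pick, for one candidate split, the hyper-parameter maximizing the CV log
    likelihood gain; the returned gain is what question selection then compares."""
    best_tau, best_gain = None, float("-inf")
    for tau in TAU_CANDIDATES:
        g = cvsmap_gain(parent_folds, yes_folds, no_folds, tau)
        if g > best_gain:
            best_tau, best_gain = tau, g
    return best_tau, best_gain
```
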
    • Experiments
      • ASR or TTS?
      • ASR
      • 2 contexts (triphone)
      • State-output distributions are mixture of Gaussians (GMM)
      • Performance is not so sensitive to decision trees
      • TTS
      • Many contexts
      • State-output distributions are single Gaussian
      • Performance is sensitive to decision trees
      → Evaluation on TTS
      12
    • Experimental Conditions
      • Setup
      • US English, professional female speaker
      • 16 kHz sampling, 5 ms frame shift
      • 4,624 training / 508 test utterances
      • 40-order mel-cepstrum, log F0, 5-band aperiodicity, delta, & delta-delta
      • 5-state left-to-right no-skip HSMM
      • Speech parameter generation with global variance term [5]
      • Training process
      • Repeat standard speaker-dependent training (5 embedded reestimation + tree reconstruction based on MDL) 5 times
      • Untie parameter sharing structure + 1 embedded reestimation
      • Run decision tree-based context clustering based on the MDL, CVML, SMAP, & CVSMAP criteria
      13
    • Experimental Conditions (cont’d)
      • Compared algorithms
      • MDL [3]
      • SMAP – Tree-based clustering using hierarchical priors
      • CVML [2]
      • CVSMAP – Tree-based clustering using CV & hierarchical priors
      • Details
      • 10-fold CV
      • Candidate values of hyper-parameters = (0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000)
      • The hyper-parameter was fixed to 1, 10, or 100 in SMAP
      • All parameters (mel-cepstrum, log F0, band aperiodicity, & duration) were clustered by the algorithm
      • #models=176,531, #questions=3,294
      • No embedded reestimation after clustering
      14
    • Test-Set Log Likelihood
      15
    • Subjective Preference Listening Test
      • 100 sentences randomly selected from 508 test sentences
      • Max # of samples per subject = 40, listener / subject = 3
      • Carried out on Amazon Mechanical Turk
      CVSMAP achieved a small improvement over CVML
      16
    • Can Oversmoothing Problem be Relaxed?
      CVSMAP can yield larger model sizes
      → Is oversmoothing relaxed?
      Slightly reduced, but not as good as GV
      17
    • Summary
      Tree-based clustering based on CV & hierarchical priors
      • Better model selection by CV
      • More robust parameter estimation by SMAP-like hierarchical prior
      • Split-by-split automatic determination of hyper-parameters by combination of CV & SMAP
      → Fully automatic
      • Better test-set log likelihood than MDL & CVML
      • Slightly better naturalness than CVML
      • Large model size can relax oversmoothing
      • But not as powerful as GV (PoE)
      Future plans
      • Model compaction by the proposed framework
      18
    • Thanks!
      19
    • References
      [1] J.J. Odell, “The use of context in large vocabulary speech recognition,” PhD thesis, Cambridge University, 1995.
      [2] T. Shinozaki, “HMM state clustering based on efficient cross validation,” Proc. ICASSP, 2006.
      [3] K. Shinoda and T. Watanabe, “Acoustic modeling based on the MDL criterion for speech recognition,” Proc. Eurospeech, 1997.
      [4] K. Shinoda and C.-H. Lee, “A structural Bayes approach to speaker adaptation,” IEEE Trans. SAP, vol.9, no.8, 2001.
      [5] T. Toda and K. Tokuda, “Speech parameter generation algorithm considering global variance for HMM-based speech synthesis,” IEICE Trans. Inf. & Syst., 2007.
      20