Machine Learning for Speech & Language Processing

Transcript

  • 1. Machine Learning for Speech & Language Processing
    Mark Gales
    28 June 2004
    Cambridge University Engineering Department
    Foresight Cognitive Systems Workshop
  • 2. Overview
    • Machine learning.
    • Feature extraction:
      – Gaussianisation for speaker normalisation.
    • Dynamic Bayesian networks:
      – multiple data stream models;
      – switching linear dynamical systems for ASR.
    • SVMs and kernel methods:
      – rational kernels for text classification.
    • Reinforcement learning and Markov decision processes:
      – spoken dialogue system policy optimisation.
  • 3. Machine Learning
    • One definition is (Mitchell): “A computer program is said to learn from experience (E) with some class of tasks (T) and a performance measure (P) if its performance at tasks in T as measured by P improves with E”; alternatively: “Systems built by analysing data sets rather than by using the intuition of experts”.
    • Multiple dedicated conferences:
      – {International, European} Conference on Machine Learning;
      – Neural Information Processing Systems;
      – International Conference on Pattern Recognition; etc.
    • As well as sessions in other conferences:
      – ICASSP - machine learning for signal processing.
  • 4. The “Machine Learning” Community
    “You should come to NIPS. They have lots of ideas. The Speech Community has lots of data.”
    • Some categories from Neural Information Processing Systems:
      – clustering;
      – dimensionality reduction and manifolds;
      – graphical models;
      – kernels, margins, boosting;
      – Monte Carlo methods;
      – neural networks;
      – ...
      – speech and signal processing.
    • Speech and language processing is just an application.
  • 5. Too Much of a Good Thing?
    “You should come to NIPS. They have lots of ideas. Unfortunately, the Speech Community has lots of data.”
    • Text data: used to train the ASR language model:
      – large news corpora available;
      – systems built on > 1 billion words of data.
    • Acoustic data: used to train the ASR acoustic models:
      – > 2000 hours of speech data (∼ 20 million words, ∼ 720 million frames of data);
      – rapid transcriptions / closed-caption data.
    • Solutions are required to be scalable:
      – this heavily influences (limits!) the machine learning approaches used;
      – additional data masks many problems!
  • 6. Feature Extraction
    • Low-dimensional non-linear projection (example from LLE).
      [Figure: scatter plots of a data set before and after the LLE projection.]
    • Feature transformation (Gaussianisation).
  • 7. Gaussianisation for Speaker Normalisation
    1. Linear projection and “decorrelation of the data” (heteroscedastic LDA).
    2. Gaussianise the data for each speaker:
       [Figure: source PDF → source CDF → target CDF → target PDF]
       (a) construct a Gaussian mixture model for each dimension;
       (b) non-linearly transform using cumulative distribution functions.
    • May be viewed as a higher-moment version of mean and variance normalisation:
      – a single-component/dimension GMM equals CMN plus CVN.
    • Performance gains on state-of-the-art tasks.
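    As a rough sketch of step 2, the Python below (assuming a small per-dimension GMM and the scipy/scikit-learn APIs; not the exact recipe behind the slides) maps a single feature dimension through the GMM CDF and then the inverse standard-normal CDF:

```python
# Hedged sketch of per-dimension Gaussianisation (assumed 2-component GMM per dimension).
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def gaussianise_dimension(x, n_components=2):
    """Map one feature dimension to an approximately N(0, 1) distribution.

    Fit a 1-D GMM, evaluate its CDF at each sample, then apply the
    inverse standard-normal CDF (probit)."""
    gmm = GaussianMixture(n_components=n_components).fit(x.reshape(-1, 1))
    weights = gmm.weights_
    means = gmm.means_.ravel()
    stds = np.sqrt(gmm.covariances_.ravel())
    # The GMM CDF is the weighted sum of the component Gaussian CDFs.
    cdf = sum(w * norm.cdf(x, m, s) for w, m, s in zip(weights, means, stds))
    return norm.ppf(np.clip(cdf, 1e-6, 1 - 1e-6))
```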
  • 8. Bayesian Networks
    • Bayesian networks are a method to show conditional independence:
      [Figure: network with arcs Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → Wet Grass, Rain → Wet Grass.]
      – whether the grass is wet, W, depends on whether the sprinkler was used, S, and whether it has rained, R;
      – whether the sprinkler was used (or it rained) depends on whether it is cloudy, C.
    • W is conditionally independent of C given S and R.
    • Dynamic Bayesian networks handle variable-length data.
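    The network above factorises the joint distribution as P(C, S, R, W) = P(C) P(S|C) P(R|C) P(W|S, R). As an illustration only (the conditional probability tables below are invented numbers, not from the slides), marginalising out C, S and R gives the probability that the grass is wet:

```python
# Illustrative only: the CPT values are made up for this sketch.
#   P(C, S, R, W) = P(C) P(S|C) P(R|C) P(W|S, R)
import itertools

P_C = {True: 0.5, False: 0.5}
P_S_given_C = {True: 0.1, False: 0.5}          # P(S=on  | C)
P_R_given_C = {True: 0.8, False: 0.2}          # P(R=yes | C)
P_W_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.0}  # P(W=wet | S, R)

p_wet = 0.0
for c, s, r in itertools.product([True, False], repeat=3):
    p_s = P_S_given_C[c] if s else 1 - P_S_given_C[c]
    p_r = P_R_given_C[c] if r else 1 - P_R_given_C[c]
    p_wet += P_C[c] * p_s * p_r * P_W_given_SR[(s, r)]
print(f"P(grass wet) = {p_wet:.3f}")
```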
  • 9. Hidden Markov Model - A Dynamic Bayesian Network
    [Figure: (a) standard HMM phone topology with states 1-5, transition probabilities a_ij and output distributions b_j(); (b) the HMM drawn as a dynamic Bayesian network over states q_t and observations o_t.]
    • Notation for DBNs:
      – circles: continuous variables;  squares: discrete variables;
      – shaded: observed variables;  non-shaded: unobserved variables.
    • Observations are conditionally independent of other observations given the state.
    • States are conditionally independent of other states given the previous state.
    • Poor model of the speech process - piecewise constant state-space.
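    These two conditional independence assumptions are what make exact inference cheap. As a minimal sketch (toy log-domain parameters assumed), the forward recursion below computes the observation log-likelihood in O(T N²):

```python
# Minimal sketch of the HMM forward recursion, exploiting the two
# conditional independences listed above.
import numpy as np

def forward(log_a, log_b, log_pi):
    """log_a[i, j] = log P(q_{t+1}=j | q_t=i)
       log_b[t, j] = log P(o_t | q_t=j)
       log_pi[j]   = log P(q_1=j)
       Returns log P(o_1, ..., o_T)."""
    T, N = log_b.shape
    alpha = log_pi + log_b[0]                      # log P(o_1, q_1=j)
    for t in range(1, T):
        # log-sum-exp over the previous state, then add the emission term
        alpha = log_b[t] + np.logaddexp.reduce(alpha[:, None] + log_a, axis=0)
    return np.logaddexp.reduce(alpha)
```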
  • 10. Alternative Dynamic Bayesian Networks
    Switching linear dynamical system:
      [Figure: DBN with discrete states q_t, continuous states x_t and observations o_t.]
      • discrete and continuous state-spaces;
      • observations conditionally independent given the continuous and discrete states;
      • exponential growth of paths, O(N_s^T), ⇒ approximate inference required.
    Multiple data stream DBN:
      [Figure: DBN with two discrete state streams q_t^(1), q_t^(2) and observations o_t.]
      • e.g. factorial HMM / mixed memory model;
      • asynchronous data are common:
        – speech and video/noise;
        – speech and brain activation patterns;
      • the observation depends on the state of both streams.
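    For concreteness, a rough generative sketch of the switching LDS (toy dimensions and noise; the parameters are placeholders, not a model from the slides): the discrete state switches the linear dynamics driving the continuous state, which in turn generates each observation.

```python
# Rough generative sketch of a switching linear dynamical system.
import numpy as np

rng = np.random.default_rng(0)

def sample_slds(A, C, trans, T, x_dim):
    """A[q], C[q]: per-discrete-state dynamics / observation matrices.
    trans[i, j]: P(q_{t+1}=j | q_t=i)."""
    q = 0
    x = np.zeros(x_dim)
    obs = []
    for _ in range(T):
        q = rng.choice(len(trans), p=trans[q])                  # discrete switch
        x = A[q] @ x + rng.normal(size=x_dim)                   # continuous state
        obs.append(C[q] @ x + rng.normal(size=C[q].shape[0]))   # observation
    return np.array(obs)

# e.g. two discrete states, 2-D continuous state, 1-D observations:
# A = [0.9 * np.eye(2), 0.5 * np.eye(2)]; C = [np.ones((1, 2))] * 2
# trans = np.array([[0.95, 0.05], [0.05, 0.95]])
# obs = sample_slds(A, C, trans, T=100, x_dim=2)
```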
  • 11. SLDS Trajectory Modelling
    [Figure: MFCC c1 against frame index for frames from the phrase “SHOW THE GRIDLEY'S ...”
     (phones: sh ow dh ax g r ih dd l iy z), comparing the true trajectory with HMM and SLDS trajectories.]
    • Unfortunately it doesn't currently classify better than an HMM!
  • 12. Support Vector Machines
    [Figure: two-class data with the decision boundary, margin width and support vectors marked.]
    • SVMs are a maximum-margin, binary classifier:
      – related to minimising the generalisation error;
      – unique solution (compare to neural networks);
      – may be kernelised - training/classification is a function of the dot-product (x_i . x_j).
    • Successfully applied to many tasks - how to apply them to speech and language?
  • 13. The “Kernel Trick”
    [Figure: two panels of two-class data with support vectors, class margins and decision boundaries marked.]
    • The SVM decision boundary is linear in the feature-space:
      – it may be made non-linear using a non-linear mapping φ(), e.g.
          φ([x1, x2]) = [x1², √2 x1 x2, x2²],   K(x_i, x_j) = φ(x_i) . φ(x_j)
    • Efficiently implemented using a kernel: K(x_i, x_j) = (x_i . x_j)²
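    A quick numerical check of the identity above (the input values are chosen purely for illustration): the explicit quadratic feature map and the kernel K(x, z) = (x . z)² give the same dot-product.

```python
# Illustrative check that phi(x) . phi(z) == (x . z)^2 for 2-D inputs.
import numpy as np

def phi(x):
    # [x1^2, sqrt(2) x1 x2, x2^2], as on the slide
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(x, z):
    return np.dot(x, z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
assert np.isclose(np.dot(phi(x), phi(z)), K(x, z))   # both give 16.0
```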
  • 14. String Kernel
    • For speech and text processing the input space has variable dimension:
      – use a kernel to map from a variable to a fixed length;
      – Fisher kernels are one example for acoustic modelling;
      – string kernels are an example for text.
    • Consider the words cat, cart, bar and a character string kernel:

                   c-a   c-t   c-r   a-r   r-t   b-a   b-r
        φ(cat)      1     λ     0     0     0     0     0
        φ(cart)     1     λ²    λ     1     1     0     0
        φ(bar)      0     0     0     1     0     1     λ

      K(cat, cart) = 1 + λ³,   K(cat, bar) = 0,   K(cart, bar) = 1
    • Successfully applied to various text classification tasks:
      – how to make the process efficient (and more general)?
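    A small Python sketch reproducing the table's kernel values. It is restricted, as the slide is, to the seven listed bigram features, and assumes the convention that a character pair separated by k skipped characters scores λ^k:

```python
# Illustrative gappy-bigram feature map, limited to the slide's feature columns.
from collections import defaultdict

FEATURES = ["c-a", "c-t", "c-r", "a-r", "r-t", "b-a", "b-r"]

def gappy_bigrams(word, lam):
    feats = defaultdict(float)
    for i in range(len(word)):
        for j in range(i + 1, len(word)):
            key = word[i] + "-" + word[j]
            if key in FEATURES:
                feats[key] += lam ** (j - i - 1)   # lambda per skipped character
    return feats

def string_kernel(w1, w2, lam):
    f1, f2 = gappy_bigrams(w1, lam), gappy_bigrams(w2, lam)
    return sum(f1[k] * f2[k] for k in f1 if k in f2)

lam = 0.5
print(string_kernel("cat", "cart", lam))   # 1 + lam**3 = 1.125
print(string_kernel("cat", "bar", lam))    # 0
print(string_kernel("cart", "bar", lam))   # 1.0
```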
  • 15. Weighted Finite-State Transducers
    • A weighted finite-state transducer is a weighted directed graph:
      – transitions are labelled with an input symbol, an output symbol and a weight.
    • An example transducer, T, for calculating the value of binary numbers (a = 0, b = 1):
      [Figure: two-state transducer with arcs a:a/1, b:b/1 and a:a/2, b:b/2; state 1 initial, state 2 final with weight 1.]

        Input   State Seq.   Output   Weight
        bab     1 1 2        bab      1
        bab     2 1 1        bab      4

      For this sequence the output weight is w[bab ∘ T] = 1 + 4 = 5.
    • Standard (highly efficient) algorithms exist for various operations:
      – composing transducers, T1 ∘ T2;
      – inverse, T⁻¹: swap the input and output symbols in the transducer.
    • May be used for efficient implementation of string kernels.
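    A sketch of the sum-over-paths weight computation for a small WFST. The arcs below are one possible reconstruction of a binary-value transducer (a = 0, b = 1) with the behaviour described above; they are assumptions, not necessarily the exact weights on the slide.

```python
# Sum-over-paths weight of an input string under a toy WFST.
from collections import defaultdict

# arcs[state] = list of (input_symbol, output_symbol, weight, next_state)
arcs = {
    1: [("a", "a", 1.0, 1), ("b", "b", 1.0, 1), ("b", "b", 1.0, 2)],
    2: [("a", "a", 2.0, 2), ("b", "b", 2.0, 2)],
}
start, final = 1, {2: 1.0}   # state 2 is final with weight 1

def string_weight(s):
    """Forward pass: alpha[state] = summed weight of all partial paths so far."""
    alpha = defaultdict(float)
    alpha[start] = 1.0
    for sym in s:
        nxt = defaultdict(float)
        for state, w in alpha.items():
            for in_sym, _, arc_w, n_state in arcs.get(state, []):
                if in_sym == sym:
                    nxt[n_state] += w * arc_w
        alpha = nxt
    return sum(w * final.get(state, 0.0) for state, w in alpha.items())

print(string_weight("bab"))   # 5.0, as on the slide
```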
  • 16. Rational Kernels
    • A transducer, T, for the string kernel (gappy bigram) over the vocabulary {a, b}:
      [Figure: three-state transducer; ε-output self-loops of weight 1 on states 1 and 3 and of weight λ on state 2, with matching a:a/1 and b:b/1 arcs from state 1 to 2 and from 2 to 3 (final, weight 1).]
    • The kernel is: K(O_i, O_j) = w[O_i ∘ (T ∘ T⁻¹) ∘ O_j]
    • This form can also handle uncertainty in decoding:
      – lattices can be used rather than the 1-best output (O_i).
    • This form encompasses various standard feature-spaces and kernels:
      – bag-of-words and N-gram counts, gappy N-grams (string kernel).
    • Successfully applied to a multi-class call classification task.
  • 17. Reinforcement Learning
    • Reinforcement learning is a class of training methods:
      – the problem is defined by payoffs;
      – it aims to learn the policy that maximises the payoff;
      – no need for a mathematical model of the environment.
    [Figure: the reinforcement-learning loop (payoff function, control policy, sensors, actuators, environment) alongside the analogous spoken dialogue system (payoff function, dialogue system, ASR/SU, TTS, user).]
    • Dialogue policy learning fits nicely within this framework.
  • 18. Example Dialogue
    S1: Welcome to NJFun. How may I help you?
    U1: I'd like to find um winetasting in Lambertville in the morning.
        [ASR: I'd like to find out wineries in the Lambertville in the morning]
    S2: Did you say you are interested in Lambertville?
    U2: Yes.
    S3: Did you want to go in the morning?
    U3: Yes.
    S4: I found a winery near Lambertville that is open in the morning. It is Poor Richard's Winery in Lambertville.
    • Variety of action choices available:
      – mixed versus system initiative;
      – explicit versus implicit confirmation.
  • 19. Markov Decision Process
    • The SDS is modelled as an MDP:
      – system state and action at time t: S_t and A_t;
      – transition function: user and ASR/SU model, P(S_t | S_{t-1}, A_{t-1}).
    [Figure: the dialogue policy π(S_t) selects action A_t; the user plus ASR/SU model P(S_t | S_{t-1}, A_{t-1}) produces the next state after a turn delay, yielding reward R_t.]
    • Select the policy to maximise the expected total reward:
      – total reward R_t: the sum of instantaneous rewards from t to the end of the dialogue;
      – value function (expected reward) for policy π in state S: V^π(S).
  • 20. Q-Learning
    • In reinforcement learning use the Q-function, Q^π(S, A):
      – the expected reward from taking action A in state S and then following policy π.
    • The best action under π in state S_t is obtained from
        π̂(S_t) = argmax_A Q^π(S_t, A)
    • The transition function is not normally known - one-step Q-learning algorithm:
      – learn Q^π(S, A) rather than the transition function;
      – estimate using the difference between actual and estimated values.
    • How to specify the reward: the simplest form assigns it to the final state:
      – positive value for task success;
      – negative value for task failure.
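    A minimal tabular sketch of the one-step update (the learning rate, discount and ε-greedy exploration are assumed choices; the reward follows the slide's convention of a terminal payoff for task success or failure):

```python
# Tabular one-step Q-learning sketch with epsilon-greedy action selection.
import random
from collections import defaultdict

Q = defaultdict(float)               # Q[(state, action)]
alpha, gamma, epsilon = 0.1, 0.95, 0.1

def choose_action(state, actions):
    """Epsilon-greedy action selection over the current Q estimates."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, next_actions):
    """Move Q(s, a) towards r + gamma * max_a' Q(s', a')."""
    future = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * future        # future term is 0 at the end of a dialogue
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```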
  • 21. Partially Observed MDP
    • The state-space is required to encapsulate all the information needed to make a decision:
      – the state-space can become very large, e.g. the transcript of the dialogue to date, etc.;
      – its size must be compressed - usually an application-specific choice;
      – if the state-space is too small an MDP is not appropriate.
    • Also, user beliefs cannot be observed:
      – decisions are required on incomplete information (POMDP);
      – use of a belief state - the value function becomes
          V^π(B) = Σ_S B(S) V^π(S)
        where B(S) gives the belief in a state.
    • Major problem: how to obtain sufficient training data?
      – build a prototype system and then refine it;
      – build a user model to simulate user interaction.
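    The belief-weighted value above is just an expectation over states; a tiny sketch (the two states and their numbers are purely illustrative, not from the slides):

```python
# V_pi(B) = sum_S B(S) * V_pi(S)
def belief_value(belief, value):
    """belief: dict state -> probability; value: dict state -> V_pi(state)."""
    return sum(p * value[s] for s, p in belief.items())

# Hypothetical two-state belief over the user's goal:
print(belief_value({"wants_morning": 0.7, "wants_evening": 0.3},
                   {"wants_morning": 4.0, "wants_evening": 1.0}))  # 3.1
```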
  • 22. Machine Learning for Speech & Language Processing
    Only a few examples have been briefly described.
    • Markov chain Monte Carlo techniques:
      – Rao-Blackwellised Gibbs sampling for the SLDS is one example.
    • Discriminative training criteria:
      – use criteria more closely related to WER (MMI, MPE, MCE).
    • Latent variable models for language modelling:
      – latent semantic analysis (LSA) and probabilistic LSA.
    • Boosting-style schemes:
      – generate multiple complementary classifiers and combine them.
    • Minimum Description Length & the evidence framework:
      – automatically determine the number of model parameters and the configuration.
  • 23. Some Standard Toolkits
    • Hidden Markov Model Toolkit (HTK):
      – building state-of-the-art HMM-based systems;
      – http://htk.eng.cam.ac.uk/
    • Graphical Models Toolkit (GMTK):
      – training and inference for graphical models;
      – http://ssli.ee.washington.edu/~bilmes/gmtk/
    • Finite-state transducer toolkit (FSM):
      – building, combining and optimising weighted finite-state transducers;
      – http://www.research.att.com/sw/tools/fsm/
    • Support vector machine toolkit (SVMlight):
      – training and classifying with SVMs;
      – http://svmlight.joachims.org/
  • 24. (Random) References
    • Machine Learning, T. Mitchell, McGraw Hill, 1997.
    • Nonlinear dimensionality reduction by locally linear embedding, S. Roweis and L. Saul, Science, vol. 290, no. 5500, 22 Dec. 2000, pp. 2323-2326.
    • Gaussianization, S. Chen and R. Gopinath, NIPS 2000.
    • An Introduction to Bayesian Network Theory and Usage, T. A. Stephenson, IDIAP-RR 03, 2000.
    • A Tutorial on Support Vector Machines for Pattern Recognition, C. J. C. Burges, Knowledge Discovery and Data Mining, 2(2), 1998.
    • Switching Linear Dynamical Systems for Speech Recognition, A-V. I. Rosti and M. J. F. Gales, Technical Report CUED/F-INFENG/TR.461, December 2003.
    • Factorial hidden Markov models, Z. Ghahramani and M. I. Jordan, Machine Learning, 29:245-275, 1997.
    • The Nature of Statistical Learning Theory, V. Vapnik, Springer, 2000.
    • Finite-State Transducers in Language and Speech Processing, M. Mohri, Computational Linguistics, 23:2, 1997.
    • Weighted Automata Kernels - General Framework and Algorithms, C. Cortes, P. Haffner and M. Mohri, EuroSpeech 2003.
    • An Application of Reinforcement Learning to Dialogue Strategy Selection in a Spoken Dialogue System for Email, M. Walker, Journal of Artificial Intelligence Research, vol. 12, pp. 387-416, 2000.