Deep Learning for Speech Recognition
Vikrant Tomar
Founder, Fluent.ai
vt@fluent.ai
We are hiring!
Outline
- Introduction
- General overview of speech recognition framework
- Conventional GMM-HMM based systems
- Deep neural networks in speech
- ConvNets
- RNNs/LSTMs and End-to-end learning
- New interesting stuff
Intro 1: What is speech recognition?
- Dream: A machine should be able to develop a functional equivalent of the
speaker’s intended message as effortlessly as humans can
- In other words: The goal is to find the most likely sequence of symbols such as
words or sub-word speech units from a stream of acoustic data.
Intro 2: How is deep learning for speech different from deep learning for images?
- Speech is a temporal signal: the order of the sequence carries information
- One-dimensional signal carrying many kinds of information:
  - Speaker
  - Accent and language
  - Age and health
  - Environment
- Issues:
  - Noise and background conditions
  - Accents
  - Recording devices
Overview: Statistical Framework for speech recognition
- Formally, an ASR system maps the sequence of observation vectors, X, to the optimum sequence of words, Ŵ:
- $\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} p(X \mid W)\, P(W)$
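To make the decision rule concrete, here is a minimal sketch that picks the word sequence maximizing log p(X|W) + log P(W), the usual combination of an acoustic model score and a language model score. The candidate transcriptions, their scores, and the `lm_weight` parameter are made-up illustrations, not values from the talk.

```python
# Hypothetical candidates with made-up scores:
# log p(X | W) from an acoustic model, log P(W) from a language model.
candidates = {
    "recognize speech": {"log_acoustic": -12.0, "log_lm": -4.0},
    "wreck a nice beach": {"log_acoustic": -11.5, "log_lm": -9.0},
    "recognise peach": {"log_acoustic": -13.0, "log_lm": -8.5},
}

def decode(cands, lm_weight=1.0):
    """Return the word sequence maximizing log p(X|W) + lm_weight * log P(W)."""
    return max(
        cands,
        key=lambda w: cands[w]["log_acoustic"] + lm_weight * cands[w]["log_lm"],
    )

print(decode(candidates))                 # acoustic + language model
print(decode(candidates, lm_weight=0.0))  # acoustic score only
```

With the language model switched off, the acoustically closest but implausible candidate wins, which is why the prior P(W) matters in practice.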
Overview 2: System Architecture
System Architecture : Feature extraction & spectrogram
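As a rough sketch of the spectrogram step, the code below cuts a signal into overlapping windowed frames and takes the log power of each frame's Fourier transform. The 25 ms frame, 10 ms hop, Hamming window, and FFT size are common conventions assumed here, not values taken from the slides. A naive DFT in standard-library Python keeps the example self-contained; a real front end would use an FFT routine from a numerical library.

```python
import cmath
import math

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def log_spectrogram(signal, sample_rate, frame_ms=25, hop_ms=10, n_fft=256):
    """Return a list of frames, each a list of log-power values per frequency bin."""
    frame_len = sample_rate * frame_ms // 1000
    hop_len = sample_rate * hop_ms // 1000
    window = hamming(frame_len)
    n_bins = n_fft // 2 + 1  # keep non-negative frequencies only
    spec = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = [s * w for s, w in zip(signal[start:start + frame_len], window)]
        row = []
        for k in range(n_bins):
            # Naive DFT of the (implicitly zero-padded) windowed frame.
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n, x in enumerate(frame)
            )
            row.append(math.log(abs(coeff) ** 2 + 1e-10))
        spec.append(row)
    return spec

# Toy input: 0.1 s of a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000.0 * n / sample_rate) for n in range(800)]
spec = log_spectrogram(tone, sample_rate)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]), peak_bin * sample_rate / 256)
```

Each row of `spec` is one time frame and each column one frequency bin; the tone shows up as a peak in the bin centred on 1 kHz.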
GMM-HMM based systems
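In a GMM-HMM system, each HMM state scores an acoustic frame with a Gaussian mixture, and Viterbi decoding finds the most likely state path. The sketch below shows that idea on a toy two-state, left-to-right HMM with one-dimensional features. All parameters and observations are invented for illustration; real systems use multi-dimensional features and many more states.

```python
import math

NEG_INF = float("-inf")

def log_gaussian(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_gmm(x, components):
    """log sum_k w_k N(x; mean_k, var_k), computed with the log-sum-exp trick."""
    terms = [math.log(w) + log_gaussian(x, m, v) for w, m, v in components]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def viterbi(observations, log_init, log_trans, state_gmms):
    """Return the most likely state sequence for the observations."""
    n_states = len(state_gmms)
    score = [log_init[s] + log_gmm(observations[0], state_gmms[s])
             for s in range(n_states)]
    backptr = []
    for x in observations[1:]:
        prev, score, ptr = score, [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            score.append(prev[best] + log_trans[best][s] + log_gmm(x, state_gmms[s]))
            ptr.append(best)
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Toy model: (weight, mean, variance) per mixture component, per state.
state_gmms = [
    [(0.5, 0.0, 1.0), (0.5, 1.0, 1.0)],  # state 0
    [(0.5, 5.0, 1.0), (0.5, 6.0, 1.0)],  # state 1
]
log_init = [0.0, NEG_INF]                       # always start in state 0
log_trans = [[math.log(0.7), math.log(0.3)],    # 0 -> 0, 0 -> 1
             [NEG_INF, 0.0]]                    # 1 -> 1 only (left-to-right)
path = viterbi([0.2, 0.8, 0.5, 5.1, 5.9, 5.5], log_init, log_trans, state_gmms)
print(path)
```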
Deep neural networks in speech
- A few different approaches:
  - Tandem
  - Hybrid
  - End-to-end
- Old but new
Tandem DNN: DNN -- GMM -- HMM
Hybrid DNN - HMM
- Good source: Hinton et al., Deep neural networks for acoustic modeling in speech recognition, 2012.
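In the hybrid setup, the DNN outputs state posteriors P(s|x), while the HMM decoder expects likelihoods p(x|s). A common fix is to divide each posterior by the state prior, giving a "scaled likelihood" that differs from p(x|s) only by a factor independent of the state. The sketch below shows that conversion; the logits and priors are made-up numbers, and the function names are illustrative.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_log_likelihoods(logits, state_priors):
    """log P(s|x) - log P(s) for each HMM state s."""
    posteriors = softmax(logits)
    return [math.log(p) - math.log(prior)
            for p, prior in zip(posteriors, state_priors)]

logits = [2.0, 1.0, 0.0]   # hypothetical DNN outputs for one frame
priors = [0.7, 0.2, 0.1]   # hypothetical state priors, e.g. from alignment counts

posteriors = softmax(logits)
scaled = scaled_log_likelihoods(logits, priors)
best_posterior = max(range(3), key=lambda s: posteriors[s])
best_scaled = max(range(3), key=lambda s: scaled[s])
print(best_posterior, best_scaled)
```

In this toy frame the most probable state under the posterior is the frequent state 0, but after dividing by the priors the rarer state 1 scores highest, which is the effect the prior normalization is meant to have.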
Hybrid CNN - HMM
- Good source: Abdel-Hamid et al., Convolutional neural networks for speech recognition, 2014
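One motivation for CNNs in acoustic modelling is that convolving along the frequency axis and then max-pooling makes the output tolerant to small spectral shifts, for example between speakers. The sketch below illustrates this on a toy vector of filter-bank energies. The kernel, pooling size, and inputs are invented for illustration and are not the configuration from the cited paper.

```python
def conv1d_valid(features, kernel):
    """Slide the kernel along the frequency axis (no padding)."""
    k = len(kernel)
    return [sum(features[i + j] * kernel[j] for j in range(k))
            for i in range(len(features) - k + 1)]

def relu(xs):
    return [max(0.0, x) for x in xs]

def max_pool(xs, size):
    """Non-overlapping max pooling along frequency."""
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]

def conv_layer(features, kernel, pool_size):
    return max_pool(relu(conv1d_valid(features, kernel)), pool_size)

peak_detector = [-1.0, 2.0, -1.0]  # responds to a local spectral peak

# The same spectral peak, placed in band 2 and then shifted to band 3.
frame_a = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
frame_b = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]

out_a = conv_layer(frame_a, peak_detector, pool_size=3)
out_b = conv_layer(frame_b, peak_detector, pool_size=3)
print(out_a, out_b)
```

Both frames produce the same pooled output, because the shifted peak still falls inside the same pooling region.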
Hybrid CNN - HMM -- Partial weight sharing
Some benchmarks
RNNs and end-to-end models
- RNNs:
  - Well suited to speech because they model sequences
  - However, plain RNNs struggle to capture long-term dependencies
  - Cause: vanishing gradients
  - Solutions: LSTMs and GRUs
- End-to-end models have a simpler overall architecture
- CTC: Connectionist temporal classification
A. Graves and N. Jaitly, “Towards End-to-End Speech Recognition with Recurrent Neural Networks”, 2014
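CTC lets the network emit one label per frame, including a special blank, and then maps the frame-level sequence to a transcript by merging repeated labels and dropping blanks. Below is a minimal sketch of that collapsing rule together with greedy (best-path) decoding. The blank symbol `_`, the toy alphabet, and the probabilities are assumptions for illustration; practical systems usually use beam search rather than greedy decoding.

```python
from itertools import groupby

BLANK = "_"

def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeats, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)

def greedy_decode(frame_probs, alphabet, blank=BLANK):
    """Pick the most probable label in each frame, then collapse."""
    best = [alphabet[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    return ctc_collapse(best, blank)

print(ctc_collapse("hh_e_ll_lloo"))

alphabet = [BLANK, "a", "b"]
frame_probs = [
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.7, 0.2],    # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
    [0.2, 0.1, 0.7],    # b
]
print(greedy_decode(frame_probs, alphabet))
```

The blank is what allows genuine double letters: "ll_ll" collapses to "ll", whereas "llll" would collapse to a single "l".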
New interesting stuff
- Baidu Deep Speech: uses bi-directional RNNs to map directly to characters
- IBM 2015/2016 and Microsoft 2016: deep CNNs with 3 x 3 kernels, similar to VGG net
- CLDNN: Conv + LSTMs + Fully Connected
Baidu Lab: Deep Speech, 2014 and Deep Speech 2, 2015
Sainath et al., Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks, 2015
Xiong et al., The Microsoft 2016 Conversational Speech Recognition System, 2016
Saon et al., The IBM 2015/16 English Conversational Telephone Speech Recognition System, 2015/16
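One commonly cited reason for stacking small 3 x 3 kernels in VGG-style networks is that a stack of stride-1 layers covers the same receptive field as one larger kernel while using fewer weights. The arithmetic can be sketched as below. Whether this was the main motivation in the IBM and Microsoft systems is not stated in the slides, and the helper names are illustrative.

```python
def stacked_receptive_field(n_layers, kernel=3):
    """Receptive field per axis of n stacked stride-1 convolutions."""
    return 1 + n_layers * (kernel - 1)

def weights_per_channel_pair(n_layers, kernel=3):
    """Weights for one input/output channel pair across the stack (biases ignored)."""
    return n_layers * kernel * kernel

# Two 3 x 3 layers see a 5 x 5 region; compare their weight count with one 5 x 5 kernel.
print(stacked_receptive_field(2), weights_per_channel_pair(2), 5 * 5)
```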
Conclusion and resources
- Lots of exciting work; most concepts are shared with other deep learning communities
- Good starting point: http://www.recognize-speech.com
- You can use any toolbox you like to start:
  - TensorFlow, Torch, Theano etc.
  - Kaldi, CURRENNT
  - Older stuff: CMU-Sphinx, RWTH-ASR, HTK
- Free(-ish) datasets: http://www.openslr.org/resources.php
- Contact: vt@fluent.ai (Hiring Scientists)