Deep Learning for Speech Recognition - Vikrant Singh Tomar

Deep Learning for Speech Recognition
Vikrant Tomar
Founder, Fluent.ai
vt@fluent.ai
We are hiring!

Outline
- Introduction
- General overview of speech recognition framework
- Conventional GMM-HMM based systems
- Deep neural networks in speech
- ConvNets
- RNNs/LSTMs and End-to-end learning
- New interesting stuff

Intro 1: What is speech recognition?
- Dream: a machine should be able to develop a functional equivalent of the speaker’s intended message as effortlessly as humans can
- In other words: the goal is to find the most likely sequence of symbols, such as words or sub-word speech units, from a stream of acoustic data

Intro 2: How is deep learning for speech different from deep learning for images?
- Speech is a temporal signal: the order of the samples carries information
- One-dimensional signal that carries many kinds of information:
  - Speaker
  - Accent and language
  - Age and health
  - Environment
- Issues:
  - Noise and background conditions
  - Accents
  - Recording devices

Overview: Statistical Framework for speech recognition
- Formally, an ASR system maps the sequence of observation vectors, X, to the optimum sequence of words, Ŵ:
- $\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)$
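
The mapping from X to Ŵ can be illustrated with a toy decoder. This is only a sketch: it assumes the usual Bayes decomposition into an acoustic score P(X|W) and a language-model score P(W), and the candidate transcripts, their scores, and the function name `decode` are all made up for illustration.

```python
import math

def decode(acoustic_loglik, lm_logprob):
    """Return the hypothesis W that maximizes log P(X|W) + log P(W)."""
    return max(acoustic_loglik, key=lambda w: acoustic_loglik[w] + lm_logprob[w])

# Hypothetical log-likelihoods log P(X|W) for three candidate transcripts.
acoustic_loglik = {
    "recognize speech": -12.0,
    "wreck a nice beach": -11.5,
    "recognise peach": -14.0,
}
# Hypothetical language-model probabilities P(W), stored as logs.
lm_logprob = {
    "recognize speech": math.log(0.6),
    "wreck a nice beach": math.log(0.1),
    "recognise peach": math.log(0.3),
}

best = decode(acoustic_loglik, lm_logprob)
print(best)
```

In this toy setup the acoustic score alone slightly prefers the second candidate, and the language-model term is what tips the decision. A real recognizer searches a far larger hypothesis space instead of enumerating candidates.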

Overview 2: System Architecture

System Architecture: Feature extraction & spectrogram
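
A spectrogram front end can be sketched in a few lines of NumPy: cut the signal into short overlapping frames, window each frame, and take the log power spectrum. The 25 ms frame, 10 ms hop, 512-point FFT, Hamming window, and the function name `log_spectrogram` are assumptions chosen for illustration, not values taken from the talk.

```python
import numpy as np

def log_spectrogram(signal, sample_rate=16000, frame_ms=25, hop_ms=10, n_fft=512):
    """Return a (frames x frequency bins) log power spectrogram."""
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)      # samples between frame starts
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hamming(frame_len)
    # Slice the signal into overlapping frames and taper each one.
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft zero-pads each frame to n_fft and keeps the non-negative frequencies.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    return np.log(power + 1e-10)  # small offset avoids log(0)

# Usage: one second of a synthetic 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)
spec = log_spectrogram(tone, sample_rate)
peak_bin = int(spec.mean(axis=0).argmax())
print(spec.shape, peak_bin * sample_rate / 512)
```

Practical systems usually go one step further and apply a mel filterbank (and often a DCT, giving MFCCs) on top of the power spectrum; that step is omitted here.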

Deep neural networks in speech
- A few different approaches:
- Tandem
- Hybrid
- End-to-end
- Old but new

Tandem DNN: DNN -- GMM -- HMM

Hybrid DNN - HMM
- Good source: Hinton et al., Deep neural networks for acoustic modeling in speech recognition, 2012.
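
In the hybrid setup the DNN outputs state posteriors P(s|x), while the HMM decoder expects likelihoods p(x|s). A common recipe, described in the source above, is to divide each posterior by the state prior P(s), which gives the likelihood up to a constant that does not depend on the state. A minimal sketch, with made-up numbers and a hypothetical helper name `scaled_log_likelihoods`:

```python
import math

def scaled_log_likelihoods(posteriors, priors):
    """log p(x|s) up to a state-independent constant: log P(s|x) - log P(s)."""
    return [math.log(p) - math.log(q) for p, q in zip(posteriors, priors)]

# Hypothetical DNN softmax output for one frame over three HMM states.
posteriors = [0.70, 0.20, 0.10]
# Hypothetical state priors, e.g. relative state frequencies in the training alignment.
priors = [0.80, 0.15, 0.05]

scores = scaled_log_likelihoods(posteriors, priors)
best_state = max(range(len(scores)), key=scores.__getitem__)
print(scores, best_state)
```

The state with the highest posterior is not the state with the highest scaled likelihood here, because the first state is also very frequent a priori. These per-frame scores stand in for the GMM likelihoods during HMM decoding.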

Hybrid CNN - HMM
- Good source: Abdel-Hamid et al., Convolutional neural networks for speech recognition, 2014

Hybrid CNN - HMM -- Partial weight sharing

RNNs and End-to-end models
- RNNs:
  - Well suited to speech because they model sequences
  - However, plain RNNs struggle to capture long-term dependencies
  - Cause: vanishing gradients
  - Solutions: LSTMs and GRUs
- End-to-end models have a simpler overall architecture
- CTC: Connectionist Temporal Classification
A. Graves et al., “Towards End-to-End Speech Recognition with Recurrent Neural Networks”, 2014
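
CTC lets the network emit one label per frame, including a special blank, and then maps the frame-level path to an output string by merging repeated labels and dropping blanks. The sketch below shows that collapse rule with greedy (best-path) decoding. The alphabet, the per-frame probabilities, and the function names are made up for illustration, and greedy decoding is a simplification: practical systems typically use beam search, often with a language model.

```python
from itertools import groupby

BLANK = "_"  # assumed symbol for the CTC blank label

def ctc_collapse(path):
    """Merge consecutive repeats, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(s for s in merged if s != BLANK)

def ctc_greedy_decode(frame_probs, alphabet):
    """Pick the most probable label in each frame, then collapse the path."""
    path = [alphabet[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    return ctc_collapse(path)

alphabet = [BLANK, "a", "c", "t"]
# Hypothetical per-frame label probabilities (one row per frame).
frame_probs = [
    [0.10, 0.10, 0.70, 0.10],  # c
    [0.20, 0.10, 0.60, 0.10],  # c
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.70, 0.10, 0.10],  # a
    [0.20, 0.60, 0.10, 0.10],  # a
    [0.10, 0.10, 0.10, 0.70],  # t
    [0.90, 0.05, 0.03, 0.02],  # blank
]
decoded = ctc_greedy_decode(frame_probs, alphabet)
print(decoded)
```

The blank is what makes doubled letters representable: `ctc_collapse("a_a")` yields "aa", whereas `ctc_collapse("aa")` yields "a".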

New interesting stuff
- Baidu Deep Speech: uses bi-directional RNNs to map audio directly to characters
- IBM 2015/2016 and Microsoft 2016: deep CNNs with 3 x 3 kernels, similar to VGG net
- CLDNN: Conv + LSTMs + Fully Connected
Baidu Lab: Deep Speech, 2014, and Deep Speech 2, 2015
Sainath et al., Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks, 2015
Xiong et al., The Microsoft 2016 Conversational Speech Recognition System, 2016
Saon et al., The IBM 2015/16 English Conversational Telephone Speech Recognition System, 2015/16

Conclusion and resources
- Lots of exciting work; most concepts are similar to those used in other deep learning communities
- Good starting point: http://www.recognize-speech.com
- You can start with any toolbox you like:
  - TensorFlow, Torch, Theano, etc.
  - Kaldi, CURRENNT
  - Older toolkits: CMU-Sphinx, RWTH-ASR, HTK
- Free(-ish) datasets: http://www.openslr.org/resources.php
- Contact: vt@fluent.ai (Hiring Scientists)
