Build your own ASR engine

Speech as a modality
High throughput (130 words per minute)
Natural
Hands-free
Need to be mindful that ASR makes errors
NLP on top of ASR output needs to be able to correct those errors

Dialogue systems
[Diagram: speech -> ASR -> text -> Language Understanding (NLP/NLU) -> user intents + confidence -> Dialogue Manager -> system intent -> Generation -> text -> Speech synthesis (TTS) -> speech, plus other outputs]

Inside the recognizer
[Diagram: Feature extractor -> Search -> Text ("This is a pen."). The search is constrained by the Acoustic model (trained on transcribed speech data), the Lexicon (words and pronunciations), and the Language Model (trained on text data).]

Inside the recognizer
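[Diagram: Feature extractor (MFCC) -> Beam search -> Text, N-best, or Lattice ("This is a pen."). The beam search runs over the HCLG FST (graph), built from the Acoustic model, the Lexicon, and the Language Model (trained on text data).]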

The feature - MFCC (Mel Frequency Cepstral Coefficient)
[Diagram: ASR features: |STFT| -> log -> DCT -> … -> coefficient selection]
Image source: https://www.webmd.com/cold-and-flu/ear-infection/picture-of-the-ear#1
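The MFCC steps can be sketched in NumPy as follows. This is a minimal illustration, not the exact recipe of any toolkit: the mel filterbank step is the standard one for MFCCs even though the slide's diagram elides it, and the frame length, hop, filter count, and number of kept coefficients are assumed values chosen for 16 kHz audio.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def dct_ii(x, n_out):
    """Unnormalized DCT-II along the last axis, keeping the first n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return x @ basis.T

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    # Slice the waveform into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft))            # |STFT|
    mel_energy = magnitude ** 2 @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_mel = np.log(mel_energy + 1e-10)                        # log
    return dct_ii(log_mel, n_ceps)                              # DCT + coefficient selection

# One second of a 440 Hz tone as a stand-in for real speech.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
features = mfcc(tone)
```

Each row of `features` is one frame; keeping only the first few DCT coefficients is the "coefficient selection" step.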

Acoustic modeling
Utterance-level transcription: “This is a cat”
[Diagram: the audio of "This is a cat" split into segments; phoneme labels shown in the figure: /th/ /iy/ /iy/ /s/ /s/ /a/ /ae/ /t/ /k/]
Phoneme: a distinct unit of speech sound
Forced alignment: the process of assigning segments of the audio to each sound
Automatic if you have an ASR system
ASR typically models speech at the phoneme level

HMM and GMM in speech
• Each phoneme is split into parts, and each part is modeled separately
• This is modeled with a Hidden Markov Model (HMM) and a Gaussian Mixture Model (GMM)

Decoding
[Figure: unknown audio (????????) with candidate phonemes and their acoustic scores.
One path: /th/ 10, /iy/ 20, /iy/ 10, /s/ 1, /s/ 2, /a/ 3, /ae/ 35, /t/ 2, /k/ 1. Total score 84
Other candidates: /ae/ 50, /iy/ 10, /t/ 2, /t/ 3, /a/ 3, /ae/ 15, /n/ 1, /k/ 1, /iy/ 20. Total score 105]
The acoustic model gives scores to possible sequences of phonemes
Exhaustive search is intractable: use beam search (trade-off between accuracy and compute)
How do we go back to words?
Can we trust the AM that much?
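Beam search over per-frame phoneme scores can be sketched as below. This is a toy illustration under stated assumptions: the frame scores are made up (higher is better), and `transition_score` is a hypothetical hook standing in for whatever the HMM, lexicon, and LM would contribute; a real decoder searches a graph rather than a plain list of frames.

```python
import heapq

def beam_search(frames, beam_width=2, transition_score=None):
    """frames: list of {phoneme: acoustic score} dicts, one per time step.

    Keeps only the `beam_width` best partial sequences after each frame.
    Returns (total score, phoneme sequence) tuples, best first.
    """
    if transition_score is None:
        transition_score = lambda prev, cur: 0.0  # hypothetical hook, no-op by default
    beam = [(0.0, [])]
    for candidates in frames:
        expanded = []
        for total, seq in beam:
            prev = seq[-1] if seq else None
            for phoneme, score in candidates.items():
                new_total = total + score + transition_score(prev, phoneme)
                expanded.append((new_total, seq + [phoneme]))
        # Pruning step: a wider beam is more accurate but costs more compute.
        beam = heapq.nlargest(beam_width, expanded, key=lambda h: h[0])
    return beam

# Toy per-frame scores, for illustration only.
frames = [
    {"/th/": 10, "/ae/": 50},
    {"/iy/": 20, "/t/": 2},
    {"/s/": 2, "/n/": 1},
]
hypotheses = beam_search(frames, beam_width=2)
best_score, best_sequence = hypotheses[0]
```

With a beam width of 1 this degenerates to a greedy search; the pruned hypotheses are exactly where the accuracy loss comes from.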

The lexicon
• A dictionary saying how a word can be pronounced
• Can have multiple pronunciations
• Must use the same phoneme units as the AM
Phoneme lexicon
กรรไกร : k a n^ kr ai z^
กรณี : k a z^ r a z^ n ii z^
กรณี : k or z^ r a z^ n ii z^
เพลา : p ae z^ l aa z^
เพลา : pl ao z^
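A lexicon like the one above can be held as a mapping from each word to a list of pronunciations. The sketch below parses the "word : phonemes" layout used on this slide (an assumption about the file format, not a toolkit requirement) and checks the phoneme set against the acoustic model; `check_phone_set` and its arguments are hypothetical names for illustration.

```python
from collections import defaultdict

def parse_lexicon(lines):
    """Parse 'word : p1 p2 ...' lines into {word: [pronunciation, ...]}."""
    lexicon = defaultdict(list)
    for line in lines:
        word, _, phones = line.partition(":")
        lexicon[word.strip()].append(phones.split())
    return dict(lexicon)

def check_phone_set(lexicon, am_phones):
    """Return phonemes used in the lexicon that the acoustic model does not know."""
    used = {p for prons in lexicon.values() for pron in prons for p in pron}
    return used - set(am_phones)

entries = [
    "กรรไกร : k a n^ kr ai z^",
    "กรณี : k a z^ r a z^ n ii z^",
    "กรณี : k or z^ r a z^ n ii z^",
    "เพลา : p ae z^ l aa z^",
    "เพลา : pl ao z^",
]
lexicon = parse_lexicon(entries)
```

Here `lexicon["กรณี"]` holds two pronunciations, and `check_phone_set` returning a non-empty set would signal a mismatch with the AM's phoneme units.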

The grapheme lexicon
• Represent letters (graphemes) as sound units
• The pronunciation is simply the word's spelling, letter by letter
• Works quite well for many languages
• Thai is somewhat problematic
[Figure: phoneme lexicon vs. grapheme lexicon]
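Building a grapheme lexicon needs no linguist, only a word list. A minimal sketch, assuming one unit per Unicode character (the function names are made up for illustration):

```python
def grapheme_pronunciation(word):
    """The 'pronunciation' is the spelling itself, one unit per character."""
    return list(word)

def build_grapheme_lexicon(words):
    """Map each word to a single pronunciation made of its letters."""
    return {word: [grapheme_pronunciation(word)] for word in words}

grapheme_lexicon = build_grapheme_lexicon(["cat", "pen"])
```

Note that splitting by character treats Thai vowel and tone marks as separate units, since they are separate code points.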

Grapheme vs Phoneme
WER by lexicon type:
Language    Phoneme lexicon   Grapheme lexicon   Diff
Kazakh      76.8%             77.0%              +0.2
Kurmanji    85.5%             85.1%              -0.4
Telugu      86.3%             87.0%              +0.7
Cebuano     75.7%             75.9%              +0.2
Lao         67.3%             69.9%              +2.6
Haitian     52.0%             52.3%              +0.3
Assamese    58.6%             58.5%              -0.1
English     8.0%              8.5%               +0.5
• Not the end of the world if you do not have a lexicon
• Can be slightly improved with some (rule-based) knowledge about the language
• The more regular the spelling, the smaller the gap
D. Harwath and J. Glass. Speech recognition without a lexicon - bridging the gap between graphemic and phonetic systems. In Proc. InterSpeech, 2014.
V. Le, L. Lamel, A. Messaoudi, W. Hartmann, J. Gauvain, C. Woehrling, J. Despres, and A. Roy. Developing STT and KWS systems using limited language resources. In Proc. InterSpeech, 2014.
E. Chuangsuwanich. Multilingual Techniques for Low Resource Automatic Speech Recognition. MIT, June 2016.
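The table above compares systems by word error rate (WER). As a reminder of what that number measures, here is a minimal sketch that computes WER as the word-level edit distance divided by the reference length; whitespace tokenization is an assumption and would not be appropriate for unsegmented Thai text.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

wer = word_error_rate("this is a pen", "this is pen")
```

Because insertions count as errors, WER can exceed 100% on very poor output.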

Automatic Grapheme-to-Phoneme (G2P)
• Given a small lexicon generated by linguists, learn a model to predict the pronunciation of new words
• Sometimes called an L2S (Letter-to-Sound) model
• Can use acoustic data to improve the quality of the generated pronunciations

G2P example
• Trained with 5k pronunciations using Sequitur
• Can produce multiple candidate pronunciations
• PyThaiNLP also has G2P support (?)

Language Model (LM)
• Specifies the grammar of a valid sentence
• Can be strict
• Or probabilistic n-grams
[Example grammar: ขอ (ถอน | ฝาก) เงิน, i.e. "request to (withdraw | deposit) money"]
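The probabilistic option can be sketched as a bigram model. This is a toy illustration, assuming pre-segmented text and add-one smoothing; the two training sentences are the phrases from this slide, and the function names are made up.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count bigrams over whitespace-segmented sentences with <s> and </s> markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])               # history counts
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    vocab = {t for s in sentences for t in s.split()} | {"</s>"}
    return unigrams, bigrams, vocab

def log_prob(sentence, model):
    """Log probability of a sentence under the add-one smoothed bigram model."""
    unigrams, bigrams, vocab = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        # Add-one smoothing gives unseen bigrams a small non-zero probability.
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + len(vocab)))
    return total

model = train_bigram(["ขอ ถอน เงิน", "ขอ ฝาก เงิน"])
in_order = log_prob("ขอ ถอน เงิน", model)
scrambled = log_prob("เงิน ถอน ขอ", model)
```

Unlike a strict grammar, the scrambled sentence is not rejected outright; it just receives a lower score than the in-order one.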

Deep learning in ASR?
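[Diagram: Feature extractor (MFCC) -> Beam search -> Text, N-best, or Lattice ("This is a pen."). The beam search runs over the HCLG FST (graph), built from the Acoustic model, the Lexicon, and the Language Model (trained on text data).]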

Simpler features
[Diagram: traditional ASR features: |STFT| -> log -> DCT -> … -> coefficient selection]
Image source: https://www.webmd.com/cold-and-flu/ear-infection/picture-of-the-ear#1

Simpler features
[Diagram: features for deep learning: |STFT| -> …]

Deep learning in AM
Three main approaches
• Hybrid DNN-HMM
• Tandem
• End-to-end

Hybrid DNN-HMM approach
• A typical speech recognizer uses the GMM-HMM framework
• Emission probabilities are modeled by a GMM
• Instead, model emission probabilities with a DNN
• The DNN gives posteriors, while the GMM gives likelihoods. Convert DNN outputs to (scaled) likelihoods by dividing out the state priors.
[Diagram: DNN-HMM: a DNN (or LSTM, etc.) provides the emission probabilities b1, b2, b3 for HMM states s1, s2, s3]
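The posterior-to-likelihood conversion follows from Bayes' rule: p(x|s) is proportional to p(s|x) / p(s), so in the log domain the state log priors are subtracted. A minimal sketch with made-up numbers; in practice the priors are usually estimated from state frequencies in the training alignments.

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_priors):
    """log p(x|s) + const = log p(s|x) - log p(s), per frame and state."""
    return log_posteriors - np.log(state_priors)

# Toy DNN outputs for 2 frames over 3 HMM states (each row sums to 1).
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.6, 0.3]])
priors = np.array([0.5, 0.3, 0.2])  # assumed state priors, for illustration only

emission_scores = scaled_log_likelihoods(np.log(posteriors), priors)
```

The result is only a scaled likelihood (p(x) is dropped), which is fine for decoding because that term is the same for every state in a frame.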

Tandem approach
• Use the DNN to generate good features to feed into the usual GMM-HMM framework.
• Typically done by placing a narrow hidden layer (bottleneck layer) in the network.
[Diagram: input features -> network with a bottleneck layer -> typical GMM-HMM]

Deep learning in LM
• Use neural language models instead of n-grams
• The neural LM can only be used in rescoring (e.g. of the N-best list or lattice)
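N-best rescoring can be sketched as re-ranking first-pass hypotheses with an extra LM score. This is an illustration under assumptions: `lm_score` stands in for any neural LM that returns a log probability for a sentence, and the scores and weight below are made up.

```python
def rescore_nbest(nbest, lm_score, lm_weight=0.5):
    """Re-rank hypotheses by first-pass score plus weighted LM log probability.

    nbest: list of (text, first_pass_score) pairs, higher is better.
    lm_score: callable mapping text to a log probability (hypothetical neural LM).
    """
    rescored = [(text, score + lm_weight * lm_score(text)) for text, score in nbest]
    return sorted(rescored, key=lambda item: item[1], reverse=True)

# Toy first-pass output and a lookup table standing in for a neural LM.
nbest = [("this is a pan", -10.0), ("this is a pen", -10.5)]
toy_lm = {"this is a pan": -8.0, "this is a pen": -4.0}

reranked = rescore_nbest(nbest, toy_lm.get, lm_weight=0.5)
best_text = reranked[0][0]
```

The first pass only has to get the right answer somewhere in the list; the slower LM then scores a handful of candidates instead of the whole search space.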

End-to-End models
• BIG network that goes from waveform to characters
• Still needs LM rescoring
• Needs large amounts of data (1000+ hours)
• People have also tried smaller datasets (100 hours)
Chan, W. Listen, Attend and Spell, 2015
Code available on GitHub (Transformer): https://github.com/tensorflow/tensor2tensor

Factors making ASR hard
Each factor runs from easier to harder:
Vocabulary size: small (10) -> large (1000+)
Type of speech: disconnected words -> news broadcast -> conversational
Domain: fixed domain -> open domain
Quality: close talking -> far field
Speakers: speaker dependent -> speaker independent

Why build your own ASR?
• You can better specify your domain and vocabulary
• Google speech API only supports emphasizing words/phrases
• You can better specify your type of recordings
• Google now has specialized models for English
• You can adapt to the speaker
• Google prefers speaker-independent models
• You can run locally on device

Powered by Kaldi
https://www.nist.gov/itl/iad/mig/openkws16-evaluation

Smartvid.io
https://blogs.nvidia.com/blog/2017/05/10/these-six-ai-startups-just-snagged-a-share-of-1-5-million-in-cash-prizes/

Snowboy (kitt.ai)
https://techcrunch.com/2017/07/05/baidu-acquires-natural-language-startup-kitt-ai-maker-of-chatbot-engine-chatflow/

How much data?
E. Chuangsuwanich, Multilingual Techniques for Low Resource Automatic Speech Recognition, MIT, June 2016.

But no open dataset
• English has a 1000-hour open dataset
• Gives deployment-worthy models
• At least four startups I know of started from this
• Transfer learning gives huge leverage
• Let’s build shared resources
• Mozilla Common Voice https://voice.mozilla.org/
• Recordings from ASR class https://github.com/ekapolc/gowajee_corpus/
• 40hr spelling corpora planned for release soon(™)
• Thai crowdsourcing platform solution soon(™)

Links
• ASR course at Chula (all materials + videos)
• https://github.com/ekapolc/ASR_course
• How to create your own Kaldi ASR and deploy it
• Starting AM, G2P, and docker images
• https://github.com/ekapolc/ASR_classproject
• Kaldi
• http://kaldi-asr.org/
• Snowboy for hotword detection
• https://snowboy.kitt.ai/

Commercial time
Chula Engineering is accepting applicants for Master's and PhD programs
M Computer Science
M Computer Engineering
M Software Engineering
PhD Computer Science
Wide range of Big data/AI courses being offered
https://www.cp.eng.chula.ac.th/
Talks from the faculty on what they are working on
https://www.youtube.com/watch?v=XU1PWNeLv4o

Current projects
Thai NLP basic capabilities
Sentence segmentation
Word correction/normalization
Word segmentation
NER
Domain focus: social media, chat, financial info

Current projects
National platform to assess students’ English
Both written (essays) and spoken
Speech rehab platform for stroke patients
Biomedical text mining
Precision medicine research

Precision medicine
[Figure: synthetic neoantigen vaccines and engineered T cells. Source: Sahin and Tureci. Science (2018)]

Precision medicine
[Figure source: Kobayashi and van den Elsen. Nature Reviews Immunology (2012).]
Model a peptide as a string: ABEASOEWL
Will the receptor, also a string (…AEOAENWOAIRPEERW....), accept the peptide?
Use deep learning
