Speech to Text Conversion
Raaj Tilak Sarma
Student at Tezpur Central University
Intern at Senseforth.ai (June-July ’18)

ASR, OR AUTOMATIC SPEECH RECOGNITION

WHY IS ASR STILL A DIFFICULT PROBLEM?
❏ SPEAKING STYLE
❏ ENVIRONMENT
❏ SPEAKER CHARACTERISTICS LIKE AGE AND GENDER

ASR SYSTEMS
❏ GENERIC
  ❏ Recognizes the entire vocabulary of a language
❏ DOMAIN-SPECIFIC
  ❏ Recognizes a limited vocabulary that makes sense in a particular domain, like banking

What are PHONEMES?
hello ----> HH EH L OW
world ----> W ER L D

m(i) = 401.25, 622.50, 843.75, 1065.00, 1286.25, 1507.50, 1728.74, 1949.99, 2171.24, 2392.49, 2613.74, 2834.99, . . . .
MFCC COEFFICIENTS
39 SUCH FEATURES
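
The m(i) values above look like equally spaced points on the mel scale, which is where MFCC filterbanks are usually placed. A minimal sketch, assuming the common conversion m = 1125 ln(1 + f/700) and a 300 Hz to 8000 Hz band split into 10 filters (the formula, the band edges and the filter count are assumptions inferred from the numbers, not stated on the slide):

```python
import math


def hz_to_mel(f_hz):
    """Convert Hz to mels (assumed formula: 1125 * ln(1 + f / 700))."""
    return 1125.0 * math.log(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (math.exp(m / 1125.0) - 1.0)


def mel_points(low_hz=300.0, high_hz=8000.0, n_filters=10):
    """Return n_filters + 2 points equally spaced on the mel scale."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (n_filters + 1)
    return [low + i * step for i in range(n_filters + 2)]


points = mel_points()
print([round(m, 2) for m in points])           # mel values, compare with m(i)
print([round(mel_to_hz(m)) for m in points])   # the same points back in Hz
```

The 39 features are commonly 13 cepstral coefficients plus their first and second time derivatives; the slide does not say which split was used.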

Gaussian Mixture Model (GMM)
Hidden Markov Model (HMM)

Let’s take a look at the history of ASR
RADIO REX (1922)
500 Hz acoustic energy

IBM SHOEBOX - 1961
❏ DIGIT RECOGNIZER
❏ 16-word vocabulary

HARPY system developed at CMU - 1976
❏ 1000-word vocabulary
❏ Used FST with nodes representing words and phones

HIDDEN MARKOV MODELS (HMM) - 1980s (still in use)

IMAGINE A SITUATION
WORK:
❏ WALK
❏ SHOP
❏ CLEAN
WEATHER:
❏ RAINY
❏ SUNNY

WE CAN ASK THREE QUESTIONS:
1) EVALUATION PROBLEM
If I know your sequence of work for the last few days, how likely is that sequence given my model of the weather (for example, how likely you are to take a ‘WALK’ when it is ‘RAINY’)?
2) DECODING PROBLEM
If I know your sequence of work for the last few days, what is the most likely sequence of weather conditions?
3) LEARNING PROBLEM
If I know your sequence of work (and, where available, the sequence of WEATHER conditions) for the last few days, which weather and work probabilities best explain what I observed?
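
The evaluation problem can be sketched with the forward algorithm. This is a minimal illustration; every probability below is hypothetical and chosen only to make the example concrete:

```python
# Hypothetical model: hidden WEATHER states, observed WORK.
START = {"RAINY": 0.6, "SUNNY": 0.4}
TRANS = {
    "RAINY": {"RAINY": 0.7, "SUNNY": 0.3},
    "SUNNY": {"RAINY": 0.4, "SUNNY": 0.6},
}
EMIT = {
    "RAINY": {"WALK": 0.1, "SHOP": 0.4, "CLEAN": 0.5},
    "SUNNY": {"WALK": 0.6, "SHOP": 0.3, "CLEAN": 0.1},
}


def forward_likelihood(observations):
    """Probability of the observed work sequence, summed over all weather paths."""
    # alpha[s] = P(observations so far, current weather = s)
    alpha = {s: START[s] * EMIT[s][observations[0]] for s in START}
    for obs in observations[1:]:
        alpha = {
            s: sum(alpha[p] * TRANS[p][s] for p in alpha) * EMIT[s][obs]
            for s in START
        }
    return sum(alpha.values())


likelihood = forward_likelihood(["WALK", "SHOP", "CLEAN"])
print(likelihood)
```

The loop keeps one running total per weather state, so the cost grows linearly with the number of days instead of exponentially with the number of possible weather sequences.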

FORWARD-BACKWARD ALGORITHM
VITERBI ALGORITHM
EM ALGORITHM
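
For the decoding problem, a minimal Viterbi sketch over the same WORK/WEATHER situation. The probabilities are hypothetical and serve only as an illustration:

```python
# Hypothetical model: hidden WEATHER states, observed WORK.
START = {"RAINY": 0.6, "SUNNY": 0.4}
TRANS = {
    "RAINY": {"RAINY": 0.7, "SUNNY": 0.3},
    "SUNNY": {"RAINY": 0.4, "SUNNY": 0.6},
}
EMIT = {
    "RAINY": {"WALK": 0.1, "SHOP": 0.4, "CLEAN": 0.5},
    "SUNNY": {"WALK": 0.6, "SHOP": 0.3, "CLEAN": 0.1},
}


def viterbi(observations):
    """Return (probability, path) of the single most likely weather sequence."""
    # best[s] = (probability of the best path ending in s, that path)
    best = {s: (START[s] * EMIT[s][observations[0]], [s]) for s in START}
    for obs in observations[1:]:
        new_best = {}
        for s in START:
            # Pick the predecessor that maximizes the path probability into s.
            prob, path = max(
                ((best[p][0] * TRANS[p][s], best[p][1]) for p in START),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * EMIT[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


prob, path = viterbi(["WALK", "SHOP", "CLEAN"])
print(path, prob)
```

With these made-up numbers, the most likely weather for WALK, SHOP, CLEAN is SUNNY, RAINY, RAINY. In a recognizer, the same search runs over phoneme states with audio frames as the observations.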

BUT HOW DOES IT HELP IN
SPEECH RECOGNITION?

❏ Work ⇔ Audio signal frames
❏ Weather Conditions ⇔ Phonemes

Slide taken from Rita Singh, School of CSE, CMU

Ball B AO L B,AO(B,K),K

USING DEEP NEURAL NETWORKS
(from 2010 onwards)

REPLACE GMMs WITH DEEP NEURAL NETWORKS (DNN)

But neural networks were tried for speech recognition as early as the 90s, so why the gap until 2010?

❏ 20K+ Vocab Problem
❏ Conversational speech
❏ DATA
❏ Computational Power
❏ Read speech
❏ Spontaneous speech
❏ Wall Street Journal failed (20 hours of recordings)
❏ Statistical models OUTPERFORMED neural networks and sometimes still do

❏ GEOFFREY HINTON - Reducing the Dimensionality of Data with Neural Networks (2006)
❏ Investigation of Full-Sequence Training of Deep Belief Networks (2010)
  ❏ Tried to recognize phonemes (was ok)
❏ Conversational Speech Transcription Using Context-Dependent Deep Neural Networks
  ❏ Large vocab
  ❏ 30% WER

What did I have to do?
Build a domain-specific speech-to-text converter that could work with Indian English.
Here the domain was ‘Banking’.
What did I use?
CMU Sphinx4 in Java

THE THREE MODELS. . .
❏ ACOUSTIC MODEL
  ❏ BUILD MY SEQUENCE OF PHONEMES
❏ DICTIONARY MODEL
  ❏ BUILD MY WORD
❏ LANGUAGE MODEL
  ❏ BUILD MY SENTENCE/TRANSCRIPTION

The dictionary model (or lexicon)

The language model
CORPUS USED

1-grams:
-1.1114 </s> -0.3010
-1.1114 <s> -0.2028
-3.1796 A -0.3007
-2.4014 ABOUT -0.2967
-2.1004 ACCOUNT -0.2661
-2.8785 ADDRESS -0.2661
-2.4806 AGAINST -0.2996
-2.3345 AND -0.2990
-2.8785 BEYOND -0.2996
2-grams:
-0.7782 ABOUT LOAN 0.0000
-0.7782 ABOUT PLATINUM 0.0000
-0.7782 ABOUT RECURRING 0.0000
-0.3010 ACCOUNT </s> -0.3010
-0.3010 ADDRESS </s> -0.3010
3-grams:
-0.3010 CAN YOU TELL
-0.5441 CARD BILL </s>
-0.6690 CARD BILL PAYMENT
-0.3010 CARD EXPIRED </s>
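
In this ARPA-style listing, the first column is a log10 probability and the trailing column, where present, is a log10 backoff weight. A minimal sketch of how a bigram score could be looked up with backoff, using a few entries copied from the listing above (the function name and the reduced tables are illustrative, not part of any toolkit):

```python
# word -> (log10 probability, log10 backoff weight), copied from the 1-grams.
UNIGRAMS = {
    "</s>": (-1.1114, -0.3010),
    "ACCOUNT": (-2.1004, -0.2661),
    "ADDRESS": (-2.8785, -0.2661),
    "AND": (-2.3345, -0.2990),
}
# (previous word, word) -> log10 probability, copied from the 2-grams.
BIGRAMS = {
    ("ACCOUNT", "</s>"): -0.3010,
    ("ADDRESS", "</s>"): -0.3010,
}


def bigram_log10(prev, word):
    """log10 P(word | prev), backing off to the unigram when the bigram is unseen."""
    if (prev, word) in BIGRAMS:
        return BIGRAMS[(prev, word)]
    word_logprob, _ = UNIGRAMS[word]
    _, prev_backoff = UNIGRAMS[prev]
    return prev_backoff + word_logprob


seen = 10 ** bigram_log10("ACCOUNT", "</s>")   # bigram listed in the model
unseen = 10 ** bigram_log10("ACCOUNT", "AND")  # not listed, so backoff applies
print(seen, unseen)
```

A log10 value of -0.3010 corresponds to a probability of about 0.5, which is why it shows up so often in a model trained on a small corpus.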

The acoustic model
❏ en-us model
  ❏ Continuous: every senone has its own set of Gaussians; slightly slower
  ❏ PTM (phonetically tied mixtures): fewer Gaussians; faster
❏ en-in model
  ❏ Continuous

ADAPTING THE ACOUSTIC MODEL. . .

REQUIREMENTS
❏ Training data in WAV files
❏ Sampled at 16 kHz, mono
❏ Their transcriptions, along with their audio IDs
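
Before adapting, it may help to check that each recording really matches the expected format. A minimal sketch using Python's standard `wave` module, assuming 16 kHz, mono, 16-bit samples (the 16-bit width is an assumption; the helper names are made up for this example):

```python
import os
import tempfile
import wave


def is_adaptation_ready(path, rate=16000, channels=1, sample_width=2):
    """True if the WAV file has the expected rate, channel count and sample width."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == rate
            and wav.getnchannels() == channels
            and wav.getsampwidth() == sample_width
        )


def write_silence(path, rate, channels=1, seconds=0.1):
    """Write a short silent 16-bit WAV file, used here only as test input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds) * channels)


with tempfile.TemporaryDirectory() as folder:
    good_path = os.path.join(folder, "good.wav")
    bad_path = os.path.join(folder, "bad.wav")
    write_silence(good_path, rate=16000)
    write_silence(bad_path, rate=44100, channels=2)
    good = is_adaptation_ready(good_path)
    bad = is_adaptation_ready(bad_path)

print(good, bad)
```

Files that fail the check would need to be resampled or downmixed before they are used for adaptation.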
Types of adaptation
❏ MLLR adaptation
❏ MAP adaptation
❏ MAP with MLLR

Word Error Rate
REF: What a bright day
HYP: What a day
REF: What a day
HYP: What a bright day
REF: What a bright day
HYP: What a light day
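
The three REF/HYP pairs above show a deletion, an insertion and a substitution. A minimal sketch of word error rate as word-level edit distance divided by the number of reference words (the function name is illustrative):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dist[i][j] = edits needed to turn the first i ref words into the first j hyp words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete every remaining reference word
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)


print(wer("What a bright day", "What a day"))        # one deletion
print(wer("What a day", "What a bright day"))        # one insertion
print(wer("What a bright day", "What a light day"))  # one substitution
```

Because the denominator is the reference length, insertions can push WER above 100% on very short references.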

Adapting the US English acoustic
model

Why use the Indian English acoustic model for adaptation?
The US acoustic model does not support certain phones, like ‘AX’ and ‘OH’!
US English dictionary:
ACCOUNT ah k aw n t
ACCOUNT(2) eh k aw n t
ACCOUNT(3) ih k aw n t
Indian English dictionary:
ACCOUNT ah k aw n t
ACCOUNT(2) eh k aw n t
ACCOUNT(3) ih k aw n t
ACCOUNT(4) ax k aw n t
US English dictionary:
DEPOSIT d ah p aa z ih t
DEPOSIT(2) d ih p aa z ah t
Indian English dictionary:
DEPOSIT d ah p aa z ih t
DEPOSIT(2) d ih p aa z ah t
DEPOSIT(3) d ih p oh z ih t
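
The mismatch can be checked mechanically. A minimal sketch that parses dictionary entries like the ones above and reports phones missing from a phone set. The US phone list below is the standard 39-phone CMU set written in lowercase, and treating it as the phone set of the en-us model is an assumption; the function names are made up for this example:

```python
import re
from collections import defaultdict

# WORD or WORD(n), then whitespace, then the phone sequence.
ENTRY = re.compile(r"^(?P<word>[^\s(]+)(?:\(\d+\))?\s+(?P<phones>.+)$")

# Assumed phone set of the US English model (standard 39 CMU phones).
US_PHONES = set(
    "aa ae ah ao aw ay b ch d dh eh er ey f g hh ih iy jh k l m n ng "
    "ow oy p r s sh t th uh uw v w y z zh".split()
)


def parse_dictionary(lines):
    """Map each word to the list of its pronunciation variants."""
    pronunciations = defaultdict(list)
    for line in lines:
        match = ENTRY.match(line.strip())
        if match:
            pronunciations[match["word"]].append(match["phones"].split())
    return dict(pronunciations)


def unsupported_phones(pronunciations, phone_set):
    """Phones used in the dictionary that the given phone set lacks."""
    return {
        phone
        for variants in pronunciations.values()
        for variant in variants
        for phone in variant
        if phone not in phone_set
    }


entries = parse_dictionary([
    "ACCOUNT ah k aw n t",
    "ACCOUNT(2) eh k aw n t",
    "ACCOUNT(3) ih k aw n t",
    "ACCOUNT(4) ax k aw n t",
    "DEPOSIT d ah p aa z ih t",
    "DEPOSIT(2) d ih p aa z ah t",
    "DEPOSIT(3) d ih p oh z ih t",
])
missing = unsupported_phones(entries, US_PHONES)
print(sorted(missing))
```

Any pronunciation that uses a missing phone cannot be scored by that acoustic model, which is the motivation for adapting the Indian English model instead.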

Adapting the Indian acoustic model

❏ PHRASES
“I NEED TO BLOCK MY”
“PLEASE HELP ME BLOCK MY”
❏ ENTITIES
“CREDIT CARD”
“DEBIT CARD”
“I NEED TO BLOCK MY CREDIT CARD”

I HAVE FULL SENTENCES RECORDED BY 9 PEOPLE.
LET’S ADD “ENTITY” RECORDINGS TO MY DATASET AND ADAPT!

LET’S CHECK THE LIVE SPEECH
RECOGNIZER

Observations and Scope for Further Improvement
❏ MORE DATA
❏ MORE ENTITY/PHRASE RECORDINGS THAN SENTENCE RECORDINGS
❏ EQUAL AMOUNTS OF DATA FROM MALE AND FEMALE SPEAKERS
❏ BETTER RECORDING ENVIRONMENT
❏ NOISE FILTERING
❏ INVESTING IN GPUs
❏ DATA COLLECTED FOR ADAPTATION CAN LATER BE USED FOR DEEP LEARNING BASED APPROACHES

Thank you for your patience :-)
