Built a speech-to-text converter using CMU Sphinx4 and adapted an Indian English acoustic model to recordings collected through a website built for audio data collection.
Minimum Word Error Rate - 10.59%
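For reference, word error rate is conventionally computed as (substitutions + deletions + insertions) divided by the number of reference words, using a word-level edit distance. A minimal Java sketch of that calculation follows. The class and method names are illustrative assumptions, not taken from the project, which may have used a different scoring tool.

```java
final class WordErrorRate {
    /** Splits on whitespace; a blank string yields zero words. */
    static String[] tokenize(String s) {
        String t = s.trim();
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }

    /** Word-level Levenshtein distance: substitutions + deletions + insertions. */
    static int editDistance(String[] ref, String[] hyp) {
        int[] prev = new int[hyp.length + 1];
        for (int j = 0; j <= hyp.length; j++) prev[j] = j;
        for (int i = 1; i <= ref.length; i++) {
            int[] cur = new int[hyp.length + 1];
            cur[0] = i;
            for (int j = 1; j <= hyp.length; j++) {
                int sub = prev[j - 1] + (ref[i - 1].equals(hyp[j - 1]) ? 0 : 1);
                int del = prev[j] + 1;      // reference word missing from hypothesis
                int ins = cur[j - 1] + 1;   // extra word in hypothesis
                cur[j] = Math.min(sub, Math.min(del, ins));
            }
            prev = cur;
        }
        return prev[hyp.length];
    }

    /** WER = edit distance / number of reference words. */
    static double wer(String reference, String hypothesis) {
        String[] ref = tokenize(reference);
        if (ref.length == 0) throw new IllegalArgumentException("empty reference");
        return (double) editDistance(ref, tokenize(hypothesis)) / ref.length;
    }
}
```

For example, a hypothesis that drops one word from a four-word reference gives a WER of 0.25.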
IMPORTANT: Various resources were used to create these slides, including other presentations. I would like to credit their authors and thank them for improving my understanding:
1) https://www.youtube.com/watch?v=q67z7PTGRi8&t=1558s
Presentation by Professor Preethi Jyothi, IITB
2) ASR slides by Rita Singh, CMU
3) Stanford seminar by Alex Acero, Apple
4. WHY IS ASR STILL A DIFFICULT PROBLEM?
❏ STYLE
❏ ENVIRONMENT
❏ SPEAKER CHARACTERISTICS LIKE AGE AND GENDER
5. ASR SYSTEMS
❏ GENERIC
❏ Recognizes the entire vocabulary of a language
❏ DOMAIN-SPECIFIC
❏ Recognizes a limited vocabulary that makes sense in a particular domain, such as banking
17. WE CAN ASK THREE QUESTIONS:
1) EVALUATION PROBLEM
Given the model, how likely is the sequence of activities (such as ‘WALK’) that I observed over the last few days?
2) DECODING PROBLEM
If I know your sequence of activities for the last few days, what is the most likely sequence of weather conditions?
3) LEARNING PROBLEM
If I know only your sequences of activities, which model parameters (how the WEATHER changes, and which activities go with which weather) best explain them?
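The first two questions can be sketched on a toy weather/activity HMM. The forward algorithm answers the evaluation problem and the Viterbi algorithm answers the decoding problem. All probabilities, state names and activity names below are hypothetical values chosen for illustration and do not come from the slides. The learning problem (typically solved with Baum-Welch) is not shown.

```java
final class WeatherHmm {
    // Hypothetical toy parameters. Hidden state 0 = RAINY, 1 = SUNNY.
    static final double[] START = {0.6, 0.4};
    static final double[][] TRANS = {{0.7, 0.3}, {0.4, 0.6}};
    // Observed activity 0 = WALK, 1 = SHOP, 2 = CLEAN.
    static final double[][] EMIT = {{0.1, 0.4, 0.5}, {0.6, 0.3, 0.1}};

    /** Evaluation: probability of the observed activity sequence under the model. */
    static double forward(int[] obs) {
        int n = START.length;
        double[] alpha = new double[n];
        for (int s = 0; s < n; s++) alpha[s] = START[s] * EMIT[s][obs[0]];
        for (int t = 1; t < obs.length; t++) {
            double[] next = new double[n];
            for (int s = 0; s < n; s++) {
                double sum = 0.0;
                for (int p = 0; p < n; p++) sum += alpha[p] * TRANS[p][s];
                next[s] = sum * EMIT[s][obs[t]];
            }
            alpha = next;
        }
        double total = 0.0;
        for (double a : alpha) total += a;
        return total;
    }

    /** Decoding: most likely hidden weather sequence for the observed activities. */
    static int[] viterbi(int[] obs) {
        int n = START.length, len = obs.length;
        double[][] v = new double[len][n];
        int[][] back = new int[len][n];
        for (int s = 0; s < n; s++) v[0][s] = START[s] * EMIT[s][obs[0]];
        for (int t = 1; t < len; t++) {
            for (int s = 0; s < n; s++) {
                int best = 0;
                for (int p = 1; p < n; p++) {
                    if (v[t - 1][p] * TRANS[p][s] > v[t - 1][best] * TRANS[best][s]) best = p;
                }
                v[t][s] = v[t - 1][best] * TRANS[best][s] * EMIT[s][obs[t]];
                back[t][s] = best;
            }
        }
        int last = 0;
        for (int s = 1; s < n; s++) if (v[len - 1][s] > v[len - 1][last]) last = s;
        int[] path = new int[len];
        path[len - 1] = last;
        for (int t = len - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
        return path;
    }
}
```

With these toy numbers, the activities WALK, SHOP, CLEAN decode to SUNNY, RAINY, RAINY, and the forward probability of that activity sequence is about 0.0336.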
26. But neural networks were tried for speech recognition as early as the 90s, so why the gap until 2010?
27. ❏ 20K+ Vocab Problem
❏ Conversational speech
❏ DATA
❏ Computational Power
❏ Read speech
❏ Spontaneous speech
❏ Wall Street Journal failed (20 hours of recordings)
❏ Statistical models OUTPERFORMED and still do sometimes
29. ❏ GEOFFREY HINTON - Reducing the Dimensionality of Data with Neural Networks (2006)
❏ Investigation of Full-Sequence Training of Deep Belief Networks (2010)
❏ Tried to recognize phonemes (was OK)
❏ Conversational Speech Transcription Using Context-Dependent Deep Neural Networks
❏ Large vocab
❏ 30% WER
32. What did I have to do?
Build a domain-specific speech-to-text converter that works with Indian English.
Here the domain was ‘Banking’.
What did I use?
CMU Sphinx4 in Java
33. THE THREE MODELS. . .
❏ ACOUSTIC MODEL
❏ BUILD MY SEQUENCE OF PHONEMES
❏ DICTIONARY MODEL
❏ BUILD MY WORD
❏ LANGUAGE MODEL
❏ BUILD MY SENTENCE/TRANSCRIPTION
36. 1-grams:
-1.1114 </s> -0.3010
-1.1114 <s> -0.2028
-3.1796 A -0.3007
-2.4014 ABOUT -0.2967
-2.1004 ACCOUNT -0.2661
-2.8785 ADDRESS -0.2661
-2.4806 AGAINST -0.2996
-2.3345 AND -0.2990
-2.8785 BEYOND -0.2996
2-grams:
-0.7782 ABOUT LOAN 0.0000
-0.7782 ABOUT PLATINUM 0.0000
-0.7782 ABOUT RECURRING 0.0000
-0.3010 ACCOUNT </s> -0.3010
-0.3010 ADDRESS </s> -0.3010
3-grams:
-0.3010 CAN YOU TELL
-0.5441 CARD BILL </s>
-0.6690 CARD BILL PAYMENT
-0.3010 CARD EXPIRED </s>
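In this ARPA-style listing, each line holds a log10 probability, the n-gram, and (for lower orders) a log10 backoff weight. A minimal sketch of how a bigram score could be looked up with backoff is given below, using a few of the 1-gram and 2-gram entries above. The class and method names are illustrative assumptions, and this is not how Sphinx4 implements its language model internally.

```java
import java.util.HashMap;
import java.util.Map;

final class BackoffBigram {
    // word -> {log10 probability, log10 backoff weight}, copied from the 1-grams above.
    static final Map<String, double[]> UNIGRAMS = new HashMap<>();
    // "prev word" -> log10 probability, copied from the 2-grams above.
    static final Map<String, Double> BIGRAMS = new HashMap<>();

    static {
        UNIGRAMS.put("</s>", new double[]{-1.1114, -0.3010});
        UNIGRAMS.put("ACCOUNT", new double[]{-2.1004, -0.2661});
        UNIGRAMS.put("ADDRESS", new double[]{-2.8785, -0.2661});
        UNIGRAMS.put("AND", new double[]{-2.3345, -0.2990});
        BIGRAMS.put("ACCOUNT </s>", -0.3010);
        BIGRAMS.put("ADDRESS </s>", -0.3010);
    }

    /** log10 P(word | prev): use the bigram if listed, else backoff(prev) + unigram(word). */
    static double logProb(String prev, String word) {
        Double bigram = BIGRAMS.get(prev + " " + word);
        if (bigram != null) return bigram;
        double[] prevEntry = UNIGRAMS.get(prev);
        double[] wordEntry = UNIGRAMS.get(word);
        if (wordEntry == null) throw new IllegalArgumentException("unknown word: " + word);
        double backoff = (prevEntry == null) ? 0.0 : prevEntry[1];
        return backoff + wordEntry[0];
    }
}
```

Here `logProb("ACCOUNT", "</s>")` reads the listed bigram directly, while `logProb("ACCOUNT", "AND")` is not listed and falls back to the backoff weight of ACCOUNT plus the unigram score of AND.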
37. The acoustic model
❏ En-us model
❏ Continuous
Every senone has its own set of Gaussians, slightly slower
❏ PTM
Fewer Gaussians, faster
❏ En-in model
❏ Continuous
39. REQUIREMENTS
❏ Training data in WAV files
❏ Sampled at 16 kHz, mono
❏ Their transcriptions along with their audio IDs
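A recording can be checked against these requirements before adaptation. The sketch below uses the standard `javax.sound.sampled` API and assumes 16 kHz mono audio. The 16-bit signed little-endian PCM check is an additional assumption based on what CMU Sphinx tools typically expect and is not stated on the slide. The class and method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.UnsupportedAudioFileException;

final class WavFormatCheck {
    /** True if the format is 16 kHz, mono, 16-bit signed little-endian PCM (assumed target). */
    static boolean isAdaptationReady(AudioFormat f) {
        return AudioFormat.Encoding.PCM_SIGNED.equals(f.getEncoding())
                && f.getSampleRate() == 16000f
                && f.getSampleSizeInBits() == 16
                && f.getChannels() == 1
                && !f.isBigEndian();
    }

    /** Reads only the header of the WAV file and checks its format. */
    static boolean isAdaptationReady(File wav)
            throws IOException, UnsupportedAudioFileException {
        return isAdaptationReady(AudioSystem.getAudioFileFormat(wav).getFormat());
    }
}
```

Files that fail the check would need to be resampled or converted with an external audio tool before being used as training data.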
Types of adaptation
❏ MLLR adaptation
❏ MAP adaptation
❏ MAP with MLLR
44. Why use the Indian English acoustic model for adaptation?
The US acoustic model does not support certain phones such as ‘AX’ and ‘OH’!
US English dictionary:
ACCOUNT ah k aw n t
ACCOUNT(2) eh k aw n t
ACCOUNT(3) ih k aw n t
DEPOSIT d ah p aa z ih t
DEPOSIT(2) d ih p aa z ah t
Indian English dictionary:
ACCOUNT ah k aw n t
ACCOUNT(2) eh k aw n t
ACCOUNT(3) ih k aw n t
ACCOUNT(4) ax k aw n t
DEPOSIT d ah p aa z ih t
DEPOSIT(2) d ih p aa z ah t
DEPOSIT(3) d ih p oh z ih t
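Dictionary entries like the ones above list a word, an optional variant number in parentheses, and its phones. The sketch below shows how such lines could be parsed and how phones missing from an acoustic model's phone set could be detected. The class name, method names and the idea of passing the model's phone set as a plain `Set` are assumptions for illustration and are not part of Sphinx4.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class PronunciationDictionary {
    /** Parses lines such as "ACCOUNT(4) ax k aw n t" into word -> list of pronunciations. */
    static Map<String, List<List<String>>> parse(List<String> lines) {
        Map<String, List<List<String>>> dict = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) continue; // skip blank or malformed lines
            // Strip the variant marker, e.g. "ACCOUNT(4)" -> "ACCOUNT".
            String word = parts[0].replaceAll("\\(\\d+\\)$", "");
            List<String> phones =
                    new ArrayList<>(Arrays.asList(parts).subList(1, parts.length));
            dict.computeIfAbsent(word, k -> new ArrayList<>()).add(phones);
        }
        return dict;
    }

    /** Returns the phones used in the dictionary that the model's phone set lacks. */
    static Set<String> unsupportedPhones(Map<String, List<List<String>>> dict,
                                         Set<String> modelPhones) {
        Set<String> missing = new TreeSet<>();
        for (List<List<String>> variants : dict.values()) {
            for (List<String> phones : variants) {
                for (String p : phones) {
                    if (!modelPhones.contains(p)) missing.add(p);
                }
            }
        }
        return missing;
    }
}
```

Run against the Indian English entries with a phone set that lacks `ax` and `oh`, `unsupportedPhones` would report exactly those two phones, which is the mismatch described on this slide.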
51. Observations and Further Scope for Improvement
❏ MORE DATA
❏ MORE ENTITY/PHRASE RECORDINGS THAN SENTENCES
❏ EQUAL AMOUNTS OF DATA FOR MALE AND FEMALE SPEAKERS
❏ BETTER RECORDING ENVIRONMENT
❏ NOISE FILTERING
❏ INVESTING IN GPUS
❏ DATA COLLECTED FOR ADAPTATION CAN LATER BE USED FOR DEEP LEARNING BASED APPROACHES