Speech to Text conversion
Raaj Tilak Sarma
Student at Tezpur Central University
Intern at Senseforth.ai (June-July’18)
GOAL
ASR, OR AUTOMATIC SPEECH RECOGNITION
WHY IS ASR STILL A DIFFICULT PROBLEM?
❏ STYLE
❏ ENVIRONMENT
❏ SPEAKER CHARACTERISTICS LIKE AGE AND GENDER
ASR SYSTEMS
❏ GENERIC
❏ Recognizes the entire vocabulary of a language
❏ DOMAIN-SPECIFIC
❏ Recognizes a limited vocabulary that makes sense in a particular domain, like banking.
What are PHONEMES?
hello ----> H EH L OW
world ----> W ER L D
m(i) = 401.25, 622.50, 843.75, 1065.00, 1286.25, 1507.50, 1728.74, 1949.99, 2171.24, 2392.49, 2613.74, 2834.99, . . . .
MFCC COEFFICIENTS
39 SUCH FEATURES
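The m(i) values above look like points spaced equally on the mel scale, which is how the filterbank used for MFCCs is usually laid out. A minimal sketch, assuming the common conversion mel = 1125 ln(1 + f/700) and a frequency range of 300 Hz to 8000 Hz with 10 filters (the range, the filter count and the function names are assumptions, not stated on the slide):

```python
import math

def hz_to_mel(f_hz):
    # One common form of the Hz -> mel conversion (assumed here).
    return 1125.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (math.exp(m / 1125.0) - 1.0)

def mel_points(low_hz, high_hz, n_filters):
    # n_filters triangular filters need n_filters + 2 edge points,
    # spaced equally on the mel scale.
    lo, hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (hi - lo) / (n_filters + 1)
    return [lo + i * step for i in range(n_filters + 2)]

# Assumed range and filter count; these roughly reproduce the listed values.
points = mel_points(300.0, 8000.0, 10)
edges_hz = [mel_to_hz(m) for m in points]

for m, f in zip(points, edges_hz):
    print(f"{m:8.2f} mel  ->  {f:8.2f} Hz")
```

The points are equally spaced in mel but grow further apart in Hz, so the filters are narrow at low frequencies and wide at high frequencies.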
Gaussian Mixture Model (GMM)
Hidden Markov Model (HMM)
Let’s take a look at the History of ASR
RADIO REX (1922)
500 Hz acoustic energy
IBM SHOEBOX - 1961
❏ DIGIT RECOGNIZER
❏ 16 word vocab
HARPY system developed at CMU - 1976
❏ 1000 word vocabulary
❏ Used FST with nodes representing WORDS and Phones
HIDDEN MARKOV MODELS (HMM) - 1980s (still being used)
WORK:
❏ WALK
❏ SHOP
❏ CLEAN
WEATHER:
❏ RAINY
❏ SUNNY
IMAGINE A SITUATION
WE CAN ASK THREE QUESTIONS:
1) EVALUATION PROBLEM
If I know the model (how the weather changes and how it affects your work), how likely is the sequence of work I observed over the last few days?
2) DECODING PROBLEM
If I know your sequence of work for the last few days, what is the most likely sequence of weather conditions?
3) LEARNING PROBLEM
If I know only your sequence of work for the last few days, which model parameters (how the weather changes from day to day, and how likely each kind of work is in each weather) explain it best?
FORWARD-BACKWARD ALGORITHM
VITERBI ALGORITHM
EM ALGORITHM
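The first two questions can be sketched on the weather/work example. This is a minimal illustration, not the code used in the project: the probabilities below are made-up values chosen only for the demonstration, and the forward pass shown is the forward half of the forward-backward algorithm. The learning problem (EM) is not shown.

```python
STATES = ["RAINY", "SUNNY"]

# Hypothetical model parameters (assumed for illustration only).
START = {"RAINY": 0.6, "SUNNY": 0.4}
TRANS = {
    "RAINY": {"RAINY": 0.7, "SUNNY": 0.3},
    "SUNNY": {"RAINY": 0.4, "SUNNY": 0.6},
}
EMIT = {
    "RAINY": {"WALK": 0.1, "SHOP": 0.4, "CLEAN": 0.5},
    "SUNNY": {"WALK": 0.6, "SHOP": 0.3, "CLEAN": 0.1},
}

def forward(obs):
    """Evaluation: probability of the observed work sequence under the model."""
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for o in obs[1:]:
        alpha = {
            s: EMIT[s][o] * sum(alpha[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return sum(alpha.values())

def viterbi(obs):
    """Decoding: most likely weather sequence for the observed work sequence."""
    best = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    backpointers = []
    for o in obs[1:]:
        step, ptr = {}, {}
        for s in STATES:
            # Best previous state to arrive in s from.
            prev = max(STATES, key=lambda p: best[p] * TRANS[p][s])
            ptr[s] = prev
            step[s] = best[prev] * TRANS[prev][s] * EMIT[s][o]
        best = step
        backpointers.append(ptr)
    last = max(STATES, key=lambda s: best[s])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, best[last]

work = ["WALK", "SHOP", "CLEAN"]
likelihood = forward(work)
weather, weather_prob = viterbi(work)
print(likelihood, weather, weather_prob)
```

With these assumed numbers, a day of walking is best explained by sunny weather and a day of cleaning by rainy weather, which is what the decoded path reflects.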
BUT HOW DOES IT HELP IN SPEECH RECOGNITION?
❏ Work ⇔ Audio signal frames
❏ Weather Conditions ⇔ Phonemes
Slides taken from Rita Singh, School of CSE, CMU
Ball B AO L B,AO(B,K),K
USING DEEP NEURAL NETWORKS (from 2010 onwards)
REPLACE GMMs WITH DEEP NEURAL NETWORKS (DNN)
But neural networks were tried for speech recognition as early as the 90s, so why the gap until 2010?
❏ 20K+ Vocab Problem
❏ Conversational speech
❏ DATA
❏ Computational Power
❏ Read speech
❏ Spontaneous speech
❏ Wall Street Journal failed (20 hours rec)
❏ Statistical models OUTPERFORMED and still do sometimes
AND THEN CAME THE DNN WAVE
❏ GEOFFREY HINTON - Reducing the dimensionality of data with Neural Nets (2006)
❏ Investigation of full-sequence training of Deep Belief Networks (2010)
❏ Tried to recognize phonemes (was ok)
❏ Conversational Speech Transcription using Context Dependent Deep Neural Networks
❏ Large vocab
❏ 30% WER
What did I have to do?
Build a domain-specific speech-to-text converter that could work with Indian English.
Here the domain was ‘Banking’.
What did I use?
CMU Sphinx4 in Java
THE THREE MODELS. . .
❏ ACOUSTIC MODEL
❏ BUILD MY SEQUENCE OF PHONEMES
❏ DICTIONARY MODEL
❏ BUILD MY WORD
❏ LANGUAGE MODEL
❏ BUILD MY SENTENCE/TRANSCRIPTION
The dictionary model (OR LEXICON)
The language model
CORPUS USED
1-grams:
-1.1114 </s> -0.3010
-1.1114 <s> -0.2028
-3.1796 A -0.3007
-2.4014 ABOUT -0.2967
-2.1004 ACCOUNT -0.2661
-2.8785 ADDRESS -0.2661
-2.4806 AGAINST -0.2996
-2.3345 AND -0.2990
-2.8785 BEYOND -0.2996
2-grams:
-0.7782 ABOUT LOAN 0.0000
-0.7782 ABOUT PLATINUM 0.0000
-0.7782 ABOUT RECURRING 0.0000
-0.3010 ACCOUNT </s> -0.3010
-0.3010 ADDRESS </s> -0.3010
3-grams:
-0.3010 CAN YOU TELL
-0.5441 CARD BILL </s>
-0.6690 CARD BILL PAYMENT
-0.3010 CARD EXPIRED </s>
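In the listing above, the first number on each line is a log10 probability and the trailing number, where present, is a log10 back-off weight. A minimal sketch of how a decoder could score a word pair from these numbers, using a handful of the entries shown (the function name and the tiny tables are illustrative, and only bigram-to-unigram back-off is covered):

```python
# A few entries copied from the listing: word -> (log10 prob, log10 back-off weight)
UNIGRAMS = {
    "</s>": (-1.1114, -0.3010),
    "ABOUT": (-2.4014, -0.2967),
    "ACCOUNT": (-2.1004, -0.2661),
    "AND": (-2.3345, -0.2990),
}

# (previous word, word) -> log10 prob
BIGRAMS = {
    ("ABOUT", "LOAN"): -0.7782,
    ("ACCOUNT", "</s>"): -0.3010,
}

def bigram_log10(prev, word):
    """log10 P(word | prev), backing off to the unigram when the pair is unseen."""
    if (prev, word) in BIGRAMS:
        return BIGRAMS[(prev, word)]
    word_logp, _ = UNIGRAMS[word]
    _, prev_backoff = UNIGRAMS[prev]
    return prev_backoff + word_logp

seen = bigram_log10("ACCOUNT", "</s>")   # listed bigram
unseen = bigram_log10("ACCOUNT", "AND")  # not listed, so back off

print(f"P(</s> | ACCOUNT) = {10 ** seen:.3f}")
print(f"P(AND  | ACCOUNT) = {10 ** unseen:.5f}")
```

A value of -0.3010 is a probability of about 0.5, so in this corpus roughly half of the occurrences of ACCOUNT end the sentence.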
The acoustic model
❏ En-us model
❏ Continuous
Every senone has its own set of Gaussians, slightly slower
❏ PTM
Fewer Gaussians, faster
❏ En-in model
❏ Continuous
ADAPTING THE ACOUSTIC MODEL. . .
REQUIREMENTS
❏ Training data in wav files
❏ Sampled at 16 kHz, mono
❏ Their transcriptions along with their audio ids
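Recordings in the wrong format are an easy mistake to make, so it can help to check every file before adaptation. A minimal sketch using Python's standard `wave` module, assuming the target format is 16 kHz, mono, 16-bit PCM (the 16-bit sample width and the helper names are assumptions; the short silent files are generated only to exercise the check):

```python
import os
import tempfile
import wave

def check_wav(path, rate=16000, channels=1, sample_width=2):
    """Return a list of format problems; an empty list means the file is fine."""
    problems = []
    with wave.open(path, "rb") as w:
        if w.getframerate() != rate:
            problems.append(f"sample rate is {w.getframerate()} Hz, expected {rate} Hz")
        if w.getnchannels() != channels:
            problems.append(f"{w.getnchannels()} channels, expected {channels}")
        if w.getsampwidth() != sample_width:
            problems.append(f"{8 * w.getsampwidth()}-bit samples, expected {8 * sample_width}-bit")
    return problems

def write_silence(path, rate, channels=1, sample_width=2, seconds=0.1):
    """Write a short silent wav file (demo helper only)."""
    n_frames = int(rate * seconds)
    with wave.open(path, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)
        w.setframerate(rate)
        w.writeframes(b"\x00" * (n_frames * channels * sample_width))

with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silence(good, 16000)
    write_silence(bad, 8000, channels=2)
    good_problems = check_wav(good)
    bad_problems = check_wav(bad)

print(good_problems)
print(bad_problems)
```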
Types of adaptation
❏ MLLR adaptation
❏ MAP adaptation
❏ MAP with MLLR
DATA COLLECTION
MEASURING ACCURACY
Word Error Rate
REF: What a bright day
HYP: What a day
REF: What a day
HYP: What a bright day
REF: What a bright day
HYP: What a light day
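The three pairs above show a deletion, an insertion and a substitution. WER counts the minimum number of such word edits needed to turn the reference into the hypothesis and divides by the number of reference words. A minimal sketch of that computation (whitespace tokenization and the function name are simplifying assumptions):

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edits needed to turn the first i ref words into the first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[len(r)][len(h)] / len(r)

deletion = wer("What a bright day", "What a day")
insertion = wer("What a day", "What a bright day")
substitution = wer("What a bright day", "What a light day")
print(deletion, insertion, substitution)
```

Because the denominator is the reference length, the inserted word costs more here (1 error in 3 reference words) than the deleted or substituted one (1 error in 4).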
Adapting the US English acoustic model
Why use the Indian English acoustic model for adaptation?
The US acoustic model does not support certain phones like ‘AX’ and ‘OH’!
ACCOUNT ah k aw n t
ACCOUNT(2) eh k aw n t
ACCOUNT(3) ih k aw n t
ACCOUNT ah k aw n t
ACCOUNT(2) eh k aw n t
ACCOUNT(3) ih k aw n t
ACCOUNT(4) ax k aw n t
DEPOSIT d ah p aa z ih t
DEPOSIT(2) d ih p aa z ah t
DEPOSIT d ah p aa z ih t
DEPOSIT(2) d ih p aa z ah t
DEPOSIT(3) d ih p oh z ih t
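The entries above follow the usual pronunciation-dictionary layout: a word, an optional variant number in parentheses, then its phones. A minimal sketch of reading such entries and finding phones that an acoustic model does not cover (the `SUPPORTED_PHONES` set is a hypothetical stand-in for the model's phone list, and the function names are illustrative):

```python
import re

DICTIONARY_LINES = [
    "ACCOUNT ah k aw n t",
    "ACCOUNT(2) eh k aw n t",
    "ACCOUNT(3) ih k aw n t",
    "ACCOUNT(4) ax k aw n t",
    "DEPOSIT d ah p aa z ih t",
    "DEPOSIT(2) d ih p aa z ah t",
    "DEPOSIT(3) d ih p oh z ih t",
]

# Hypothetical phone set of an acoustic model without 'ax' and 'oh'.
SUPPORTED_PHONES = {"aa", "ah", "aw", "d", "eh", "ih", "k", "n", "p", "t", "z"}

def parse_dictionary(lines):
    """Map each word to the list of its pronunciation variants."""
    lexicon = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word = re.sub(r"\(\d+\)$", "", parts[0])  # drop the "(2)" variant marker
        lexicon.setdefault(word, []).append(parts[1:])
    return lexicon

def unsupported_phones(lexicon, supported):
    """Phones used by the dictionary but missing from the model's phone set."""
    used = {p for variants in lexicon.values() for phones in variants for p in phones}
    return used - supported

lexicon = parse_dictionary(DICTIONARY_LINES)
missing = unsupported_phones(lexicon, SUPPORTED_PHONES)
print(sorted(missing))
```

A check like this flags dictionary entries that an acoustic model cannot score before any adaptation run is started.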
Adapting the Indian acoustic model
❏ PHRASES
“I NEED TO BLOCK MY”
“PLEASE HELP ME BLOCK MY”
❏ ENTITIES
“CREDIT CARD”
“DEBIT CARD”
“I NEED TO BLOCK MY CREDIT CARD”
I HAVE FULL SENTENCES RECORDED BY 9 PEOPLE
LET’S ADD “ENTITY” RECORDINGS TO MY DATASET AND ADAPT!
LET’S CHECK THE LIVE SPEECH RECOGNIZER
Observations and Further Scope for Improvement
❏ MORE DATA
❏ MORE ENTITY/PHRASE RECORDINGS THAN SENTENCES
❏ EQUAL AMOUNTS OF DATA FOR MALE AND FEMALE SPEAKERS
❏ BETTER RECORDING ENVIRONMENT
❏ NOISE FILTERING
❏ INVESTING IN GPUs
❏ DATA COLLECTED FOR ADAPTATION CAN LATER BE USED FOR DEEP LEARNING BASED APPROACHES
Thank you for your patience :-)
