Automatic Detection of Sentence Boundaries and Disfluencies in Speech Recognition Techniques
• Ankit Sharma - 1MJ10EC013
Speech Processing
Speech is one of the most intriguing signals that humans work
with every day.
• Purpose of speech processing:
– To understand speech as a means of communication;
– To represent speech for transmission and reproduction;
– To analyze speech for automatic recognition and extraction of information;
– To discover some physiological characteristics of the talker.
Automatic speech recognition
• What is the task?
• What are the main difficulties?
• How is it approached?
• How good is it?
• How much better could it be?
Speech production process in humans
[Figure: text (concept) is turned into speech. A sound source (voiced: pulse; unvoiced: noise) driven by air flow is shaped by the frequency transfer characteristics. The speech information (fundamental frequency, voiced/unvoiced, frequency transfer characteristics) modulates the carrier wave.]
How might computers do it?
• Digitization
• Acoustic analysis of the speech signal
• Linguistic interpretation
[Figure: acoustic waveform, acoustic signal, speech recognition]
[Figure: Microsoft Speech Recognition – Windows 7]
Digitization
• Analog to digital conversion
• Sampling and quantizing
• Use filters to measure energy levels for various points on the frequency spectrum
• Knowing the relative importance of different frequency bands (for speech) makes this process more efficient
– E.g. high-frequency sounds are less informative, so they can be analyzed using broader frequency bands (log scale)
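The digitization steps above can be sketched roughly as follows. This is a minimal illustration, not the method of any particular recognizer: the 16 kHz sampling rate, the 16-bit depth, the 440 Hz test tone, and the mel-scale formula used for the log-like band spacing are all assumptions that do not appear in the slides.

```python
import math

def quantize(sample, bits=16):
    """Map a float in [-1.0, 1.0] to a signed integer of the given bit depth."""
    max_int = 2 ** (bits - 1) - 1
    clipped = max(-1.0, min(1.0, sample))
    return int(round(clipped * max_int))

def hz_to_mel(hz):
    # Common mel-scale approximation (assumed here as the "log scale").
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(num_bands, low_hz, high_hz):
    """Band edges equally spaced in mel, so bands get wider as frequency rises."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / num_bands
    return [mel_to_hz(low_mel + i * step) for i in range(num_bands + 1)]

sample_rate = 16000  # assumed sampling rate in Hz
# Sampling: evaluate a 440 Hz tone at discrete sample times (10 ms worth).
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(160)]
# Quantizing: store each sample as a 16-bit integer.
pcm = [quantize(s) for s in tone]
# Filter bank layout: 8 bands between 0 Hz and the Nyquist frequency.
edges = mel_band_edges(8, 0.0, sample_rate / 2)
widths = [edges[i + 1] - edges[i] for i in range(8)]
```

The band widths grow with frequency, which is one way to spend less resolution on the less informative high-frequency region.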
Separating speech from background noise
• Noise-cancelling microphones
– Two mics, one facing the speaker, the other facing away
– Ambient noise is roughly the same for both mics
• Knowing which bits of the signal relate to speech
– Spectrograph analysis
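The two-microphone idea can be sketched in a highly idealized form. The sketch assumes the ambient noise is sample-for-sample identical at both microphones and that the rear microphone picks up no speech at all; real noise-cancelling hardware needs adaptive filtering, which is not shown. The sample values are made up for illustration.

```python
def subtract_ambient(front_mic, rear_mic):
    """Estimate speech by subtracting the rear (noise-only) signal from the front one."""
    return [f - r for f, r in zip(front_mic, rear_mic)]

# Hypothetical signals: the front mic hears speech plus noise, the rear mic noise only.
speech = [0.0, 0.5, -0.5, 0.25]
noise = [0.1, -0.1, 0.2, 0.0]
front = [s + n for s, n in zip(speech, noise)]
rear = list(noise)

estimate = subtract_ambient(front, rear)
```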
Variability in individuals’ speech
• Variation among speakers due to
– Vocal range (f0, and pitch range – see later)
– Voice quality (growl, whisper, physiological elements such as nasality, adenoidality, etc.)
– ACCENT! (especially vowel systems, but also consonants, allophones, etc.)
• Variation within speakers due to
– Health, emotional state
– Ambient conditions
• Speech style: formal read vs spontaneous
Detection of Sentence Boundaries and Disfluencies
Divide speech into frames
• Speech is a non-stationary signal, but over short intervals it can be assumed to be quasi-stationary
• Divide speech into short-time frames (e.g., 5 ms shift, 25 ms length)
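Framing with a 25 ms window and a 5 ms shift can be sketched as below. The 16 kHz sampling rate and the function name `frame_signal` are assumptions for illustration; the frame length and shift are the values given on the slide.

```python
def frame_signal(samples, sample_rate, frame_ms=25.0, shift_ms=5.0):
    """Split a sample sequence into overlapping short-time frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames

signal = list(range(1600))            # stand-in for 100 ms of audio at 16 kHz
frames = frame_signal(signal, 16000)  # 400-sample frames, 80-sample shift
```

Each frame overlaps its neighbour by 20 ms, so features computed per frame change smoothly even though the signal as a whole is non-stationary.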
Approaches to ASR
• Template based
• Neural network based
• Statistics based
Statistics-based approach
• Collect a large corpus of transcribed speech recordings
• Train the computer to learn the correspondences (“machine learning”)
• At run time, apply statistical processes to search through the space of all possible solutions, and pick the statistically most likely one
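Picking the statistically most likely solution can be sketched as an argmax over candidate transcriptions, scoring each by the sum of an acoustic log-probability and a language-model log-probability. The two candidate strings and all probability values below are invented for illustration; a real system searches a far larger space.

```python
import math

def most_likely(candidates):
    """candidates maps a hypothesis to (acoustic_logprob, lm_logprob)."""
    return max(candidates, key=lambda w: candidates[w][0] + candidates[w][1])

# Hypothetical scores: log P(audio | words) and log P(words).
candidates = {
    "recognize speech": (math.log(0.020), math.log(0.0010)),
    "wreck a nice beach": (math.log(0.025), math.log(0.00001)),
}

best = most_likely(candidates)
```

Here the second hypothesis fits the audio slightly better, but the language model makes the first one more likely overall.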
What is a corpus?
A corpus can be defined as a collection of texts assumed to be
representative of a given language, put together so that it can be
used for linguistic analysis. Usually the assumption is that the
language stored in a corpus is naturally occurring, that it is
gathered according to explicit design criteria, with a specific
purpose in mind, and with a claim to represent natural chunks of
language selected according to a specific typology.
“nowadays the term 'corpus' nearly always implies
the additional feature of 'machine-readable'”.
Statistics-based approach
• Acoustic and Lexical Models
• Analyse training data in terms of relevant features
• Learn from a large amount of data the different possibilities
– different phone sequences for a given word
– different combinations of elements of the speech signal for a given phone/phoneme
• Combine these into a Hidden Markov Model expressing the probabilities
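To make the Hidden Markov Model idea concrete, here is a small Viterbi decoder that finds the most probable state sequence for a sequence of observations. The two states (`ph1`, `ph2`), the two observation symbols (`low`, `high`), and every probability are toy values assumed for illustration; they are not taken from the slides or from any trained model.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path and its log-probability."""
    # Each column maps state -> (best log-prob of a path ending here, previous state).
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]

# Toy model: two phone-like states emitting coarse acoustic symbols.
states = ["ph1", "ph2"]
start_p = {"ph1": 0.9, "ph2": 0.1}
trans_p = {"ph1": {"ph1": 0.6, "ph2": 0.4}, "ph2": {"ph1": 0.1, "ph2": 0.9}}
emit_p = {"ph1": {"low": 0.8, "high": 0.2}, "ph2": {"low": 0.3, "high": 0.7}}

path, logp = viterbi(["low", "low", "high", "high"], states, start_p, trans_p, emit_p)
```

In a recognizer the transition and emission probabilities would be learned from the training corpus rather than written by hand.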
HMM-based speech synthesis system (HTS)
[Figure: block diagram in two parts. Training part: excitation parameters and spectral parameters are extracted from the speech signal in the speech database and, together with labels, are used to train context-dependent HMMs and state duration models. Synthesis part: text analysis turns the input text into labels, excitation and spectral parameters are generated from the HMMs, and excitation generation followed by a synthesis filter produces the synthesized speech.]
HMMs for some words
• Identify individual phonemes
• Identify words
• Identify sentence structure and/or meaning
Performance errors
• Performance “errors” include
– Non-speech sounds
– Hesitations
– False starts, repetitions
• Filtering implies handling at syntactic level or above
• Some disfluencies are deliberate and have pragmatic effect – this is not something we can handle in the near future
Disfluencies
Disfluencies: standard terminology (Levelt)
• Reparandum: thing repaired
• Interruption point (IP): where speaker breaks off
• Editing phase (edit terms): uh, I mean, you know
• Repair: fluent continuation
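The four terms can be captured in a small data structure. This is only an illustrative sketch: the class name `Disfluency`, its fields, and the example utterance are hypothetical and not part of the slides.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Disfluency:
    prefix: List[str]         # fluent words before the disfluency
    reparandum: List[str]     # the words being repaired
    editing_terms: List[str]  # e.g. "uh", "I mean", "you know"
    repair: List[str]         # the fluent continuation
    suffix: List[str]         # fluent words after the repair

    @property
    def interruption_point(self) -> int:
        """Word index where the speaker breaks off (just after the reparandum)."""
        return len(self.prefix) + len(self.reparandum)

    def as_spoken(self) -> List[str]:
        return (self.prefix + self.reparandum + self.editing_terms
                + self.repair + self.suffix)

    def cleaned(self) -> List[str]:
        # Dropping the reparandum and the editing terms leaves the intended utterance.
        return self.prefix + self.repair + self.suffix

# Hypothetical utterance: "I want a flight to Boston uh I mean to Denver on Friday"
utterance = Disfluency(
    prefix="I want a flight".split(),
    reparandum="to Boston".split(),
    editing_terms="uh I mean".split(),
    repair="to Denver".split(),
    suffix="on Friday".split(),
)
```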
Prosodic characteristics of disfluencies
• Fragments are good cues to disfluencies
• Prosody:
– Pause duration is shorter in disfluent silence than in fluent silence
– F0 increases from the end of the reparandum to the beginning of the repair, but the change is only minor
– Repair interval offsets have a minor prosodic phrase boundary, even in the middle of an NP:
– Show me all n- | round-trip flights | from Pittsburgh | to Atlanta
Syntactic Characteristics of Disfluencies
• The repair often has the same structure as the reparandum
• Both are Noun Phrases (NPs) in this example:
• So if we could automatically find the IP, we could find and correct the reparandum!
Disfluencies in language modeling
• Should we “clean up” disfluencies before training the LM (i.e. skip over disfluencies)?
• Filled pauses
– Does United offer any [uh] one-way fares?
• Repetitions
– What what are the fares?
• Deletions
– Fly to Boston from Boston
• Fragments (we’ll come back to these)
– I want fl- flights to Boston.
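One naive way to "clean up" the simpler cases before language-model training is sketched below. It assumes fragments are transcribed with a trailing hyphen, uses a two-word filled-pause list, and treats any immediately repeated word as a repetition; all three are simplifying assumptions. It does not handle deletions such as "Fly to Boston from Boston", and it would wrongly remove legitimate repeats.

```python
FILLED_PAUSES = {"uh", "um"}  # assumed, minimal filled-pause list

def clean_for_lm(words):
    """Drop filled pauses, word fragments, and immediate word repetitions."""
    cleaned = []
    for word in words:
        token = word.lower()
        if token in FILLED_PAUSES:
            continue  # filled pause
        if token.endswith("-"):
            continue  # word fragment such as "fl-"
        if cleaned and cleaned[-1].lower() == token:
            continue  # immediate repetition such as "What what"
        cleaned.append(word)
    return cleaned

filled = clean_for_lm("Does United offer any uh one-way fares".split())
repeated = clean_for_lm("What what are the fares".split())
fragment = clean_for_lm("I want fl- flights to Boston".split())
```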
Detection of disfluencies
• Decision tree at wi-wj boundary
– Pause duration
– Word fragments
– Filled pause
– Energy peak within wi
– Amplitude difference between wi and wj
– F0 of wi
– F0 differences
– Whether wi accented
• Results:
– 78% recall / 89.2% precision
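For reference, recall and precision for a boundary detector can be computed as sketched here. The boundary indices are invented purely to show the arithmetic; they do not reproduce the reported 78% / 89.2% figures.

```python
def precision_recall(predicted, reference):
    """Precision and recall of predicted boundary positions against a reference."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical word-boundary indices marked as interruption points.
reference_ips = {3, 7, 12, 20}
predicted_ips = {3, 7, 15}

precision, recall = precision_recall(predicted_ips, reference_ips)
```

High precision with lower recall, as on the slide, means the detector rarely raises false alarms but misses some interruption points.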
Recent work: EARS Metadata Evaluation (MDE)
• Sentence-like Unit (SU) detection:
– Find end points of SU
– Detect subtype (question, statement, backchannel)
• Edit word detection:
– Find all words in reparandum (words that will be removed)
• Filler word detection
– Filled pauses (uh, um)
– Discourse markers (you know, like, so)
– Editing terms (I mean)
• Interruption point detection
Liu et al 2003
Kinds of disfluencies
• Repetitions
– I * I like it
• Revisions
– We * I like it
• Restarts (false starts)
– It’s also * I like it
MDE transcription
• Conventions:
– ./ for statement SU boundaries,
– <> for fillers,
– [] for edit words,
– * for IP (interruption point) inside edits
• And <uh> <you know> wash your clothes wherever you are ./ and [ you ] * you really get used to the outdoors ./
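A transcript marked up with these four conventions can be turned into clean sentence-like units with a few string operations. This sketch handles only the markers listed above (`./`, `<>`, `[]`, `*`); the function name `clean_mde` is hypothetical, and full MDE annotation has further marks that are ignored here.

```python
import re

def clean_mde(annotated):
    """Strip fillers, edit words, and IP marks, then split into SUs at './'."""
    text = re.sub(r"<[^>]*>", " ", annotated)  # drop fillers such as <uh>
    text = re.sub(r"\[[^\]]*\]", " ", text)    # drop edit words such as [ you ]
    text = text.replace("*", " ")              # drop interruption-point marks
    units = [" ".join(su.split()) for su in text.split("./")]
    return [su for su in units if su]

example = ("And <uh> <you know> wash your clothes wherever you are ./ "
           "and [ you ] * you really get used to the outdoors ./")

sentence_units = clean_mde(example)
```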
Recent works to improve quality
Vocoding
– MELP-style / CELP-style excitation
– LF model
– Sinusoidal models
Acoustic model
– Segment models, trajectory models
– Model combination (product of experts)
– Minimum generation error training
– Bayesian modeling
Oversmoothing
– Pre & postfiltering
– Improvements of GV
– Hybrid approaches
& more…
Other challenging topics
Non-professional speakers
• AVM + adaptation (CSTR)
Too little speech data
• VTLN-based rapid speaker adaptation (Titech, IDIAP)
Noisy recordings
• Spectral subtraction & AVM + adaptation (CSTR)
No labels
• Un- / Semi-supervised voice building (CSTR, NICT, CMU, Toshiba)
Insufficient knowledge of the language or accent
• Letter (grapheme)-based synthesis (CSTR)
• No prosodic contexts (CSTR, Titech)
Wrong language
• Cross-lingual speaker adaptation (MSRA, EMIME)
• Speaker & language adaptive training (Toshiba)