This document discusses techniques for automatic speech recognition, including the detection of sentence boundaries and disfluencies. It covers:
1) The speech recognition process, including digitization, acoustic analysis, and linguistic interpretation of the speech signal.
2) Statistics-based approaches, which use large speech corpora to train models that learn the correspondences between speech and text.
3) Challenges in speech recognition, including variability between individuals, detecting sentence boundaries and disfluencies, and current performance, which still leaves room for improvement.
Speech Recognition and Removal of Disfluencies
1. Automatic Detection of Sentence Boundaries
and Disfluencies in Speech Recognition
Techniques
• Ankit Sharma – 1MJ10EC013
2. Speech Processing
Speech is one of the most intriguing signals that humans work
with every day.
• Purpose of speech processing:
– To understand speech as a means of communication;
– To represent speech for transmission and reproduction;
– To analyze speech for automatic recognition and extraction of
information
– To discover some physiological characteristics of the talker.
3. Automatic speech recognition
What is the task?
What are the main difficulties?
How is it approached?
How good is it?
How much better could it be?
4. Speech production process in humans
[Figure: the human speech production process. Air flow from the lungs excites a sound source (voiced: pulse train; unvoiced: noise), and the vocal tract's frequency transfer characteristics shape the magnitude spectrum. Speech information (fundamental frequency, the voiced/unvoiced decision, and the frequency transfer characteristics) modulates the carrier wave, turning text (a concept) into speech.]
5. How might computers do it?
Digitization
Acoustic analysis of the speech
signal
Linguistic interpretation
[Figure: acoustic waveform → acoustic signal → speech recognition]
7. Digitization
Analog to digital conversion
Sampling and quantizing
Use filters to measure energy levels for various
points on the frequency spectrum
Knowing the relative importance of different
frequency bands (for speech) makes this process
more efficient
E.g., high-frequency sounds are less informative, so
they can be covered with broader frequency bands (a log
scale)
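The log-scale treatment of frequency bands above is commonly implemented with the mel scale, where bands are equally spaced perceptually and therefore progressively wider in Hz at high frequencies. A minimal sketch, using the widely cited O'Shaughnessy formula (the 8 kHz upper edge and 26 bands are assumed example values):

```python
import math

def hz_to_mel(f_hz):
    # O'Shaughnessy formula, widely used in speech front ends
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse mapping, back from mel to Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(f_min_hz, f_max_hz, n_bands):
    """Band edges equally spaced on the mel scale, hence
    progressively wider bands in Hz toward high frequencies."""
    m_lo, m_hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (m_hi - m_lo) / (n_bands + 1)
    return [mel_to_hz(m_lo + i * step) for i in range(n_bands + 2)]

edges = mel_band_edges(0.0, 8000.0, 26)
# The first bands are narrow (tens of Hz); the last are hundreds of Hz wide.
```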
8. Separating speech from background noise
Noise cancelling microphones
Two mics, one facing speaker, the other facing away
Ambient noise is roughly same for both mics
Knowing which bits of the signal relate to speech
Spectrograph analysis
9. Variability in individuals’ speech
Variation among speakers due to
Vocal range (f0, and pitch range – see later)
Voice quality (growl, whisper, physiological elements
such as nasality, adenoidality, etc)
ACCENT !!! (especially vowel systems, but also
consonants, allophones, etc.)
Variation within speakers due to
Health, emotional state
Ambient conditions
Speech style: formal read vs spontaneous
11. Divide speech into frames
Speech is a non-stationary signal
… but can be assumed to be quasi-stationary
Divide speech into short-time frames (e.g., 5ms shift, 25ms length)
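The framing step can be sketched directly from the slide's values (25 ms frames, 5 ms shift); the 16 kHz sample rate is an assumed example value:

```python
def frame_signal(samples, sample_rate=16000, frame_ms=25, shift_ms=5):
    """Split a waveform into overlapping short-time frames.

    With 25 ms frames and a 5 ms shift, adjacent frames overlap by
    20 ms, so the quasi-stationarity assumption holds within a frame.
    """
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)       # 80 samples at 16 kHz
    frames = []
    for start in range(0, len(samples) - frame_len + 1, shift):
        frames.append(samples[start:start + frame_len])
    return frames

one_second = [0.0] * 16000          # one second of (silent) audio
frames = frame_signal(one_second)   # 196 frames of 400 samples each
```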
13. Statistics-based approach
Collect a large corpus of transcribed speech recordings
Train the computer to learn the correspondences
(“machine learning”)
At run time, apply statistical processes to search
through the space of all possible solutions, and pick
the statistically most likely one
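The "pick the statistically most likely one" step is, at its core, an argmax over candidate transcriptions, scoring each by acoustic likelihood times language-model prior. A toy sketch (the candidate strings and all log-probabilities below are invented for illustration):

```python
# Each candidate transcription W gets (log P(A|W), log P(W)):
# an acoustic-model score and a language-model score.
candidates = {
    "recognize speech":   (-12.1, -4.0),
    "wreck a nice beach": (-11.8, -9.5),
}

def best_hypothesis(cands):
    # argmax_W  log P(A|W) + log P(W)   (product of probabilities in log domain)
    return max(cands, key=lambda w: cands[w][0] + cands[w][1])

winner = best_hypothesis(candidates)
# The language model rescues the acoustically similar but implausible string.
```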
14. What is a corpus?
A corpus can be defined as a collection of texts,
assumed to be representative of a given language,
put together so that it can be used for
linguistic analysis. Usually the assumption is
that the language stored in a corpus is
naturally occurring: gathered according
to explicit design criteria, with a specific
purpose in mind, and with a claim to represent
natural chunks of language selected according
to a specific typology.
“Nowadays the term ‘corpus’ nearly always implies
the additional feature of ‘machine-readable’.”
15. Statistics-based approach
Acoustic and Lexical Models
Analyse training data in terms of relevant features
Learn from large amount of data different possibilities
different phone sequences for a given word
different combinations of elements of the speech signal for a
given phone/phoneme
Combine these into a Hidden Markov Model expressing
the probabilities
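Decoding with the resulting HMM is typically done with the Viterbi algorithm, which finds the most probable state sequence for an observation sequence. A minimal sketch; the two-state model and every probability below are invented toy values, not a real acoustic model:

```python
import math

def viterbi(obs, states, log_init, log_trans, log_emit):
    """Most likely HMM state sequence for an observation sequence (log domain)."""
    V = [{s: log_init[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        prev, col, ptr = V[-1], {}, {}
        for s in states:
            # Best predecessor state for s at this time step
            p = max(states, key=lambda q: prev[q] + log_trans[q][s])
            col[s] = prev[p] + log_trans[p][s] + log_emit[s][o]
            ptr[s] = p
        V.append(col)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):       # trace back the best path
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy two-state model: silence tends to emit "low"-energy frames,
# the vowel "ah" tends to emit "high"-energy frames.
states = ["sil", "ah"]
log_init = {"sil": math.log(0.8), "ah": math.log(0.2)}
log_trans = {"sil": {"sil": math.log(0.7), "ah": math.log(0.3)},
             "ah":  {"sil": math.log(0.3), "ah": math.log(0.7)}}
log_emit = {"sil": {"low": math.log(0.9), "high": math.log(0.1)},
            "ah":  {"low": math.log(0.1), "high": math.log(0.9)}}

path = viterbi(["low", "low", "high", "high"], states, log_init, log_trans, log_emit)
```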
16. HMM-based speech synthesis system (HTS)
[Figure: block diagram of the HTS system. Training part: excitation parameters and spectral parameters are extracted from a speech database and, together with labels, used to train context-dependent HMMs and state duration models. Synthesis part: text analysis converts the input text into labels; excitation and spectral parameters are generated from the trained HMMs; excitation generation and a synthesis filter then produce the synthesized speech.]
18. Identify individual phonemes
Identify words
Identify sentence structure and/or meaning
19. Performance errors
Performance “errors” include
Non-speech sounds
Hesitations
False starts, repetitions
Filtering them out implies handling at the syntactic level or above
Some disfluencies are deliberate and have a pragmatic
effect; this is not something we can handle in the
near future
21. Disfluencies:
standard terminology (Levelt)
Reparandum: the thing repaired
Interruption point (IP): where speaker breaks off
Editing phase (edit terms): uh, I mean, you know
Repair: fluent continuation
22. Prosodic characteristics of
disfluencies
Fragments are good cues to disfluencies
Prosody:
Pause duration is shorter in disfluent silence than fluent silence
F0 increases from the end of the reparandum to the beginning of the repair,
but only by a small amount
Repair interval offsets have a minor prosodic phrase boundary, even in the
middle of an NP:
Show me all n- | round-trip flights | from Pittsburgh | to Atlanta
23. Syntactic Characteristics of
Disfluencies
The repair often has same structure as reparandum
Both are Noun Phrases (NPs) in this example:
So if we could automatically find the IP, we could find and correct the reparandum!
24. Disfluencies in language
modeling
Should we “clean up” disfluencies before training the LM
(i.e., skip over disfluencies)?
Filled pauses
Does United offer any [uh] one-way fares?
Repetitions
What what are the fares?
Deletions
Fly to Boston from Boston
Fragments (we’ll come back to these)
I want fl- flights to Boston.
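A rough sketch of the "clean up before LM training" idea, covering only the easy cases above: drop filled pauses and collapse exact adjacent repetitions. The filled-pause inventory is an assumption, and deletions like "Fly to Boston from Boston" need syntactic or semantic analysis, so they are deliberately not handled:

```python
FILLED_PAUSES = {"uh", "um"}  # assumed inventory

def clean_for_lm(utterance):
    """Naive disfluency cleanup before LM training: remove filled
    pauses and collapse exact adjacent word repetitions."""
    out = []
    for w in utterance.lower().split():
        token = w.strip("[]")        # tolerate bracketed fillers like [uh]
        if token in FILLED_PAUSES:
            continue                 # filled pause: drop
        if out and out[-1] == token:
            continue                 # repetition: "what what" -> "what"
        out.append(token)
    return " ".join(out)

clean_for_lm("Does United offer any [uh] one-way fares?")
clean_for_lm("What what are the fares?")
```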
25. Detection of disfluencies
Decision tree at the wi–wj word boundary
pause duration
Word fragments
Filled pause
Energy peak within wi
Amplitude difference between wi and wj
F0 of wi
F0 differences
Whether wi accented
Results:
78% recall/89.2% precision
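A hand-written toy decision rule over the boundary features listed above gives the flavor of such a classifier. Every threshold here is invented for illustration; the actual system learns its tree from data (and reached the 78% recall / 89.2% precision quoted above):

```python
def boundary_is_disfluent(feats):
    """Toy decision rule at a wi-wj boundary.

    feats is a dict mirroring the slide's feature list:
    word_fragment, filled_pause, pause_ms, f0_diff_hz, wi_accented.
    All thresholds are hypothetical.
    """
    if feats.get("word_fragment") or feats.get("filled_pause"):
        return True                        # strong lexical cues
    if feats.get("pause_ms", 0) < 50:      # disfluent silences tend to be short
        if feats.get("f0_diff_hz", 0) > 20 and not feats.get("wi_accented", False):
            return True                    # F0 jump into the repair
    return False

boundary_is_disfluent({"word_fragment": True})
boundary_is_disfluent({"pause_ms": 300, "f0_diff_hz": 5})
```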
26. Recent work: EARS Metadata
Evaluation (MDE)
Sentence-like Unit (SU) detection:
find end points of SU
Detect subtype (question, statement, backchannel)
Edit word detection:
Find all words in reparandum (words that will be removed)
Filler word detection
Filled pauses (uh, um)
Discourse markers (you know, like, so)
Editing terms (I mean)
Interruption point detection
(Liu et al., 2003)
27. Kinds of disfluencies
Repetitions
I * I like it
Revisions
We * I like it
Restarts (false starts)
It’s also * I like it
28. MDE transcription
Conventions:
./ for statement SU boundaries,
<> for fillers,
[] for edit words,
* for IP (interruption point) inside edits
And <uh> <you know> wash your clothes wherever
you are ./ and [ you ] * you really get used to the
outdoors ./
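The conventions above are mechanical enough that a cleaned transcript can be recovered with a small parser; a sketch that strips fillers, edit words, and IP markers and splits on SU boundaries:

```python
import re

def strip_mde(annotated):
    """Recover cleaned sentences from MDE-style annotation:
    <...> fillers and [...] edit words are removed, * (IP) markers
    dropped, and ./ treated as a statement SU boundary."""
    text = re.sub(r"<[^>]*>", " ", annotated)   # fillers: <uh>, <you know>
    text = re.sub(r"\[[^\]]*\]", " ", text)     # edit words: [ you ]
    text = text.replace("*", " ")               # interruption points
    return [" ".join(s.split()) for s in text.split("./") if s.split()]

strip_mde("And <uh> <you know> wash your clothes wherever you are ./ "
          "and [ you ] * you really get used to the outdoors ./")
```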
29. Recent work to improve quality
Vocoding
– MELP-style / CELP-style excitation
– LF model
– Sinusoidal models
Acoustic model
– Segment models, trajectory models
– Model combination (product of experts)
– Minimum generation error training
– Bayesian modeling
Oversmoothing
– Pre & postfiltering
– Improvements of GV (global variance)
– Hybrid approaches
& more…
30. Other challenging topics
Non-professional speakers
• AVM + adaptation (CSTR)
Too little speech data
• VTLN-based rapid speaker adaptation (Titech, IDIAP)
Noisy recordings
• Spectral subtraction & AVM + adaptation (CSTR)
No labels
• Un- / Semi-supervised voice building (CSTR, NICT, CMU, Toshiba)
Insufficient knowledge of the language or accent
• Letter (grapheme)-based synthesis (CSTR)
• No prosodic contexts (CSTR, Titech)
Wrong language
• Cross-lingual speaker adaptation (MSRA, EMIME)
• Speaker & language adaptive training (Toshiba)