Outline
Define the problem
What is speech?
Feature Selection
Models
Early methods
Modern statistical models
Current State of ASR
Future Work
The ASR Problem
There is no single ASR problem
The problem depends on many factors
Microphone: close-mic, throat-mic, microphone array, audio-visual
Sources: band-limited, background noise, reverberation
Speaker: speaker-dependent, speaker-independent
Language: open/closed vocabulary, vocabulary size, read/spontaneous speech
Output: transcription, speaker ID, keywords
What is Speech?
An analog signal produced by humans
You can think of the speech signal as being decomposed into a source and a filter
The source is the vocal folds in voiced speech
The filter is the vocal tract and articulators
Feature Selection
As in any data-driven task, the data must be represented in some format
Cepstral features have been found to perform well
They represent the "frequency of the frequencies" (the spectrum of the log spectrum)
Mel-frequency cepstral coefficients (MFCCs) are the most common variety
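A minimal sketch of extracting MFCC features, assuming the third-party librosa library; the filename and parameter values are illustrative, not from the slides:

```python
# Sketch: MFCC extraction with librosa (assumed dependency).
import librosa

# Load an audio file; "utterance.wav" is a hypothetical filename.
signal, sample_rate = librosa.load("utterance.wav", sr=16000)

# Compute 13 MFCCs per analysis frame; each column is one frame's
# feature vector describing the "frequency of the frequencies".
mfccs = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13)

print(mfccs.shape)  # (13, number_of_frames)
```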
Where do we stand?
Defined the multiple problems associated with ASR
Described how speech is produced
Illustrated how speech can be represented in an ASR system
Now that we have the data, how do we recognize the speech?
Radio Rex
First known attempt at speech recognition
A toy from 1922
Worked by analyzing the signal strength at 500 Hz
Actual speech recognition systems
Originally thought to be a relatively simple task requiring a few years of concerted effort
In 1969, "Whither speech recognition?" is published
A DARPA project ran from 1971-1976 in response to the statements in the Pierce article
We can examine a few general systems
Template-Based ASR
Originally only worked for isolated words
Performs best when training and testing conditions match
For each word we want to recognize, we store a template or example based on actual data
Each test utterance is checked against the templates to find the best match
Uses the Dynamic Time Warping (DTW) algorithm
Dynamic Time Warping
Create a similarity matrix for the two utterances
Use dynamic programming to find the lowest-cost path, as in the sketch below
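A minimal DTW sketch (illustrative, not the slides' implementation): build the frame-by-frame distance matrix for two feature sequences, then accumulate the lowest-cost alignment with dynamic programming:

```python
import numpy as np

def dtw_cost(a, b):
    """a, b: feature sequences of shape (num_frames, num_features)."""
    n, m = len(a), len(b)
    # Pairwise Euclidean distances between frames (the similarity matrix).
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # acc[i, j] = cost of the best path aligning a[:i+1] with b[:j+1].
    acc = np.full((n, m), np.inf)
    acc[0, 0] = dist[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1, j] if i > 0 else np.inf,                 # step in a only
                acc[i, j - 1] if j > 0 else np.inf,                 # step in b only
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # step in both
            )
            acc[i, j] = dist[i, j] + best_prev
    return acc[-1, -1]  # total cost of the lowest-cost warping path
```

At test time, the utterance is scored against every stored template with dtw_cost and the lowest-cost template wins.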
Hearsay-II
One of the systems developed during the DARPA program
A blackboard-based system utilizing symbolic problem solvers
Each problem solver was called a knowledge source (KS)
A complex scheduler was used to decide when each KS should be called
DARPA Results
The Hearsay-II system performed much better than the other two similar competing systems
However, only one system met the performance goals of the project
The Harpy system was also a CMU-built system
In many ways it was a predecessor to the modern statistical systems
Acoustic Model
For each frame of data, we need some way of describing the likelihood of it belonging to any of our classes
Two methods are commonly used
A multilayer perceptron (MLP) gives the likelihood of a class given the data
A Gaussian mixture model (GMM) gives the likelihood of the data given a class
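A sketch of the GMM half of this, assuming scikit-learn: one mixture is fit per class on that class's training frames, and score_samples then gives the log-likelihood of the data given the class. The frames below are random stand-ins for real MFCC vectors:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in 13-dimensional MFCC frames for one phone class.
train_frames = rng.normal(size=(500, 13))

# One GMM per class models p(frame | class).
gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(train_frames)

test_frames = rng.normal(size=(10, 13))
log_likelihoods = gmm.score_samples(test_frames)  # log p(frame | class) per frame
```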
Pronunciation Model
While the pronunciation model can be very complex, it is typically just a dictionary
The dictionary contains the valid pronunciations for each word
Examples:
Cat: k ae t
Dog: d ao g
Fox: f aa k s
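As a sketch, the dictionary above maps directly onto a plain lookup table (phone symbols in ARPAbet style; the helper name is illustrative):

```python
# The pronunciation model as a plain dictionary.
lexicon = {
    "cat": ["k", "ae", "t"],
    "dog": ["d", "ao", "g"],
    "fox": ["f", "aa", "k", "s"],
}

def phones_for(word):
    """Look up the valid pronunciation for a word."""
    return lexicon[word.lower()]
```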
Language Model
Now we need some way of representing the likelihood of any given word sequence
Many methods exist, but n-grams are the most common
N-gram models are trained by simply counting the occurrences of word sequences in a training set
N-grams
A unigram is the probability of any word in isolation
A bigram is the probability of a word given the previous word
Higher-order n-grams continue in a similar fashion
A backoff probability is used for any unseen data, as in the sketch below
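A minimal bigram sketch trained by counting, with a crude fixed-weight backoff standing in for a proper scheme such as Katz backoff; the toy corpus and weight are illustrative:

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat".split()

unigrams = Counter(corpus)                     # word counts
bigrams = Counter(zip(corpus, corpus[1:]))     # word-pair counts
total = sum(unigrams.values())

def bigram_prob(prev, word, backoff_weight=0.4):
    if (prev, word) in bigrams:
        # Seen bigram: relative frequency estimate.
        return bigrams[(prev, word)] / unigrams[prev]
    # Unseen bigram: back off to a scaled unigram probability
    # (not properly normalized; a real model discounts the seen mass).
    return backoff_weight * unigrams.get(word, 0) / total

print(bigram_prob("the", "cat"))  # seen bigram
print(bigram_prob("cat", "dog"))  # unseen -> backoff
```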
How do we put it together?
We now have models for the three parts of our equation: the acoustic, pronunciation, and language models
We need a framework to join these models together
The standard framework used is the Hidden Markov Model (HMM)
Markov Model
A state model using the Markov property
The Markov property states that the future depends only on the present state
Models the likelihood of transitions between states in a model
Given the model, we can determine the likelihood of any sequence of states
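A sketch of scoring a state sequence under a Markov model: by the Markov property, the sequence likelihood is just the initial-state probability times the one-step transition probabilities. The states and values here are made up for illustration:

```python
import numpy as np

states = ["sil", "speech"]
initial = np.array([0.8, 0.2])          # P(first state)
# transition[i, j] = P(next state j | current state i)
transition = np.array([[0.7, 0.3],
                       [0.1, 0.9]])

def sequence_likelihood(seq):
    idx = [states.index(s) for s in seq]
    prob = initial[idx[0]]
    for prev, cur in zip(idx, idx[1:]):
        prob *= transition[prev, cur]
    return prob

print(sequence_likelihood(["sil", "speech", "speech"]))  # 0.8 * 0.3 * 0.9
```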
Hidden Markov Model
Similar to a Markov model, except the states are hidden
We now have observations tied to the individual states
We no longer know the exact state sequence given the data
Allows for the modeling of an underlying unobservable process
HMMs for ASR
First we build an HMM for each phone
Next we combine the phone models based on the pronunciation model to create word-level models
Finally, the word-level models are combined based on the language model
We now have a giant network with potentially thousands or even millions of states
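A sketch of the composition step, with made-up names: each phone contributes a short left-to-right chain of HMM states, and the pronunciation dictionary dictates the order in which the chains are concatenated into a word-level model (the language-model combination is omitted):

```python
STATES_PER_PHONE = 3  # a common left-to-right HMM topology

lexicon = {"cat": ["k", "ae", "t"]}

def word_states(word):
    """Concatenate per-phone HMM states into one word-level chain."""
    return [f"{phone}_{i}"
            for phone in lexicon[word]
            for i in range(STATES_PER_PHONE)]

print(word_states("cat"))
# ['k_0', 'k_1', 'k_2', 'ae_0', 'ae_1', 'ae_2', 't_0', 't_1', 't_2']
```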
Decoding
Decoding happens in the same way as in the previous example
For each time frame we need to maintain two pieces of information:
The likelihood of being at any state
The previous state for every state
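A minimal Viterbi sketch in NumPy that maintains exactly those two pieces of information per frame: the best log-likelihood of being in each state, and a backpointer to the previous state; the best path is then read out by tracing back. All inputs are illustrative:

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """log_trans[i, j] = log P(state j | state i);
    log_obs[t, s] = log-likelihood of frame t under state s."""
    T, S = log_obs.shape
    score = np.zeros((T, S))                # best log-prob of any path ending in s at t
    backptr = np.zeros((T, S), dtype=int)   # previous state on that best path
    score[0] = log_init + log_obs[0]
    for t in range(1, T):
        # candidates[i, j] = score of being in i at t-1, then moving to j.
        candidates = score[t - 1][:, None] + log_trans
        backptr[t] = np.argmax(candidates, axis=0)
        score[t] = candidates[backptr[t], np.arange(S)] + log_obs[t]
    # Trace back from the best final state.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```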
State of the Art
What works well
Constrained vocabulary systems
Systems adapted to a given speaker
Systems in anechoic environments without background noise
Systems expecting read speech
What doesn't work
Large unconstrained vocabulary
Noisy environments
Conversational speech
Future Work
Better representations of audio based on human perception
Better representations of acoustic elements based on articulatory phonology
Segmental models that do not rely on the simple frame-based approach