Introduction to Automatic Speech Recognition
Outline

Define the problem

What is speech?

Feature Selection

Models
 Early methods
 Modern statistical models

Current State of ASR

Future Work
The ASR Problem

There is no single ASR problem

The problem depends on many factors
 Microphone: Close-mic, throat-mic, microphone
array, audio-visual
 Sources: band-limited, background noise,
reverberation
 Speaker: speaker dependent, speaker
independent
 Language: open/closed vocabulary, vocabulary
size, read/spontaneous speech
 Output: Transcription, speaker id, keywords
Performance Evaluation

Accuracy
 Percentage of tokens correctly recognized

Error Rate
 Complement of accuracy: 100% minus accuracy

Token Type
 Phones
 Words*
 Sentences
 Semantics?
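When the token is the word, the error rate is usually computed from an edit-distance alignment between a reference transcript and the recognizer output, counting substitutions, deletions and insertions. Because insertions count as errors, this word error rate can exceed 100%. The sketch below assumes whitespace-separated words; the function names are illustrative and not taken from any toolkit.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions turning ref into hyp."""
    # prev[j] is the distance between the ref prefix processed so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Edit distance between the word sequences, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)
```

For example, word_error_rate("the cat sat on the mat", "the cat sat on mat") counts one deletion against six reference words.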
What is Speech?

Analog signal produced by humans

You can think about the speech signal being
decomposed into the source and filter

The source is the vocal folds in voiced speech

The filter is the vocal tract and articulators
Speech Production
Speech Visualization
Feature Selection

As in any data-driven task, the data must be
represented in some format

Cepstral features have been found to perform
well

They represent the "frequency of the
frequencies": the spectrum of the log
spectrum of the signal

Mel-frequency cepstral coefficients (MFCC)
are the most common variety
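A minimal sketch of the idea, assuming one short frame of real-valued samples: the real cepstrum is the inverse Fourier transform of the log-magnitude spectrum. This is not a full MFCC front end. MFCCs additionally pass the spectrum through a mel-spaced filterbank and use a cosine transform. The naive DFT, the floor value and the function names are assumptions made to keep the example self-contained; the mel formula shown is one commonly used variant.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (adequate for a short frame)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_cepstrum(frame, floor=1e-10):
    """Inverse DFT of the log-magnitude spectrum of one frame."""
    n = len(frame)
    # The floor avoids log(0) for bins with no energy
    log_mag = [math.log(max(abs(c), floor)) for c in dft(frame)]
    # The log-magnitude spectrum of a real frame is real and even, so its
    # inverse DFT is real; only the real part is kept.
    return [sum(log_mag[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def hz_to_mel(f_hz):
    """One common formula for converting frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)
```

By construction of this formula, 1000 Hz maps to roughly 1000 mel, and the scale grows more slowly than Hz at higher frequencies.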
Where do we stand?

Defined the multiple problems associated with
ASR

Described how speech is produced

Illustrated how speech can be represented in
an ASR system

Now that we have the data, how do we
recognize the speech?
Radio Rex

First known attempt at speech recognition

A toy from 1922

Worked by analyzing the signal strength at
500 Hz
Actual speech recognition
systems

Originally thought to be a relatively simple
task requiring a few years of concerted effort

1969: John Pierce's “Whither Speech
Recognition?” is published

A DARPA project ran from 1971-1976 in
response to the statements in the Pierce
article

We can examine a few general systems
Template-Based ASR

Originally only worked for isolated words

Performs best when training and testing
conditions are closely matched

For each word we want to recognize, we
store a template or example based on actual
data

Each test utterance is checked against the
templates to find the best match

Uses the Dynamic Time Warping (DTW)
algorithm
Dynamic Time Warping

Create a distance matrix between the frames
of the two utterances

Use dynamic programming to find the lowest
cost path
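The two steps can be sketched as follows. For brevity the example assumes each frame is a single number and uses the absolute difference as the local distance; real systems compare feature vectors such as MFCCs. The names dtw_cost and recognize are illustrative.

```python
def dtw_cost(a, b, dist=lambda x, y: abs(x - y)):
    """Cost of the cheapest alignment path between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # acc[i][j] = lowest accumulated cost of aligning a[:i] with b[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(a[i - 1], b[j - 1])
            acc[i][j] = local + min(acc[i - 1][j],      # a advances, b repeats
                                    acc[i][j - 1],      # b advances, a repeats
                                    acc[i - 1][j - 1])  # both advance
    return acc[n][m]


def recognize(utterance, templates):
    """Return the label of the stored template with the lowest DTW cost."""
    return min(templates, key=lambda label: dtw_cost(utterance, templates[label]))
```

With templates {"up": [1, 2, 3], "down": [3, 2, 1]}, the stretched utterance [1, 2, 2, 3] aligns to "up" at zero cost, which is the time-warping behavior that makes templates usable for speech spoken at different rates.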
Hearsay-II

One of the systems developed during the
DARPA program

A blackboard-based system utilizing symbolic
problem solvers

Each problem solver was called a knowledge
source (KS)

A complex scheduler was used to decide
when each KS should be called
DARPA Results

The Hearsay-II system performed much
better than the two other similar competing
systems

However, only one system, Harpy, met the
performance goals of the project
 Harpy was also built at CMU
 In many ways it was a predecessor to the
modern statistical systems
Modern Statistical ASR
Acoustic Model

For each frame of data, we need some way
of describing the likelihood of it belonging to
any of our classes

Two methods are commonly used
 Multilayer perceptron (MLP) gives the posterior
probability of a class given the data
 Gaussian Mixture Model (GMM) gives the
likelihood of the data given a class
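A minimal sketch of the GMM case, assuming diagonal covariances (each feature dimension has its own variance) and parameters that have already been trained. The function names and argument layout are illustrative, not those of any toolkit.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | class) for a Gaussian mixture; weights are assumed to sum to 1."""
    terms = [math.log(w) + log_gaussian(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    # log-sum-exp keeps the sum numerically stable for very negative terms
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))
```

One such mixture would be evaluated per class (for example per phone state) for every frame, and the resulting scores passed on to the decoder.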
Gaussian Distribution
Pronunciation Model

While the pronunciation model can be very
complex, it is typically just a dictionary

The dictionary contains the valid
pronunciations for each word

Examples:
 Cat: k ae t
 Dog: d ao g
 Fox: f aa k s
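As a sketch, such a dictionary can be held as a mapping from each word to a list of phone sequences, so that a word may have more than one valid pronunciation. The LEXICON contents, including the second pronunciation of "dog", are illustrative assumptions.

```python
# Hypothetical lexicon: each word maps to a list of pronunciations,
# and each pronunciation is a list of phone symbols.
LEXICON = {
    "cat": [["k", "ae", "t"]],
    "dog": [["d", "ao", "g"], ["d", "aa", "g"]],  # second variant is illustrative
}


def pronunciations(word):
    """Return all phone sequences for a word (case-insensitive)."""
    try:
        return LEXICON[word.lower()]
    except KeyError:
        raise KeyError(f"out-of-vocabulary word: {word!r}") from None


def expand(words):
    """Expand a word sequence into every phone sequence the lexicon allows."""
    sequences = [[]]
    for word in words:
        sequences = [seq + pron
                     for seq in sequences
                     for pron in pronunciations(word)]
    return sequences
```

A word missing from the lexicon raises an error here, which mirrors the closed-vocabulary limitation of a plain dictionary.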
Language Model

Now we need some way of representing the
likelihood of any given word sequence

Many methods exist, but n-grams are the
most common

N-gram models are trained by counting how
often word sequences occur in a training set
N-grams

A unigram is the probability of any word in
isolation

A bigram is the probability of a given word
given the previous word

Higher order n-grams continue in a similar
fashion

A backoff probability, taken from a lower
order n-gram, is used for any word sequence
not seen in training
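A minimal sketch of a bigram model trained by counting. The backoff here is deliberately simplified: an unseen bigram falls back to a scaled, add-one-smoothed unigram score, and the fixed backoff_weight is an assumption, so the scores are not properly normalized. Real systems typically use discounting schemes such as Katz or Kneser-Ney smoothing.

```python
from collections import Counter


class BigramModel:
    """Bigram language model estimated from raw counts."""

    def __init__(self, sentences, backoff_weight=0.4):
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.backoff_weight = backoff_weight  # assumed constant, not estimated
        for sentence in sentences:
            words = ["<s>"] + sentence.split() + ["</s>"]
            self.unigrams.update(words)
            self.bigrams.update(zip(words, words[1:]))
        self.total = sum(self.unigrams.values())

    def unigram_prob(self, word):
        # Add-one smoothing so unseen words still get a small non-zero score
        return (self.unigrams[word] + 1) / (self.total + len(self.unigrams) + 1)

    def bigram_prob(self, prev, word):
        if self.bigrams[(prev, word)] > 0:
            # Seen bigram: relative frequency of word after prev
            return self.bigrams[(prev, word)] / self.unigrams[prev]
        # Unseen bigram: back off to a scaled unigram score
        return self.backoff_weight * self.unigram_prob(word)
```

Trained on "the cat sat" and "the dog sat", the model gives bigram_prob("the", "cat") as 0.5 because "the" is followed by "cat" in one of its two occurrences.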
How do we put it together?

We now have models to represent the three
parts of our equation

We need a framework to join these models
together

The standard framework used is the Hidden
Markov Model (HMM)
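The equation itself is not reproduced in the text. A standard way of writing the recognition problem, assumed here to be the one the slides refer to, is shown below, where X is the sequence of acoustic feature vectors, W a word sequence and Q a phone sequence (these symbols are assumptions):

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \sum_{Q}
          \underbrace{p(X \mid Q)}_{\text{acoustic model}}\,
          \underbrace{P(Q \mid W)}_{\text{pronunciation model}}\,
          \underbrace{P(W)}_{\text{language model}}
```

The denominator p(X) from Bayes' rule is dropped because it does not depend on W.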
Markov Model

A state model using the Markov property
 The Markov property states that the future
depends only on the present state, not on the
earlier history

Models the likelihood of transitions between
states in a model

Given the model, we can determine the
likelihood of any sequence of states
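As a sketch, the likelihood of a state sequence is the initial-state probability multiplied by one transition probability per step. Log probabilities are used to avoid underflow on long sequences. The dictionary layout of initial and transitions is an assumption.

```python
import math


def sequence_log_prob(states, initial, transitions):
    """Log probability of a state sequence under a first-order Markov model."""
    logp = math.log(initial[states[0]])
    for prev, curr in zip(states, states[1:]):
        # Markov property: each step depends only on the previous state
        logp += math.log(transitions[prev][curr])
    return logp
```

For a hypothetical two-state weather model with P(sun) = 0.5 initially, P(sun to sun) = 0.8 and P(sun to rain) = 0.2, the sequence sun, sun, rain has probability 0.5 x 0.8 x 0.2 = 0.08.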
Hidden Markov Model

Similar to a Markov model except the states
are hidden

We now have observations tied to the
individual states

We no longer know the exact state sequence
given the data

Allows for the modeling of an underlying
unobservable process
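Because the state sequence is unknown, the likelihood of the observations has to be summed over every possible hidden state sequence. The forward algorithm does this efficiently, one frame at a time. The sketch below assumes discrete observation symbols and dictionary-based parameters; in ASR the emission term would instead come from the acoustic model.

```python
def forward_likelihood(observations, initial, transitions, emissions):
    """P(observations) under an HMM, summed over all hidden state sequences."""
    states = list(initial)
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: initial[s] * emissions[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: emissions[s][obs]
                    * sum(alpha[p] * transitions[p][s] for p in states)
                 for s in states}
    return sum(alpha.values())
```

A practical implementation would work with log probabilities or rescale alpha at each frame, since the raw products shrink quickly as the utterance gets longer.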
HMMs for ASR

First we build an HMM for each phone

Next we combine the phone models based
on the pronunciation model to create word
level models

Finally, the word level models are combined
based on the language model

We now have a giant network with potentially
thousands or even millions of states
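The first two steps can be sketched as follows, assuming three emitting states per phone in a left-to-right topology and a fixed self-loop probability. Both are illustrative assumptions; in a trained system the transition probabilities are estimated from data. The "<exit>" marker is a hypothetical name for the point where the next word model would be attached.

```python
STATES_PER_PHONE = 3  # assumption: three emitting states per phone


def word_state_names(pronunciation):
    """Name the HMM states for one pronunciation (a list of phones)."""
    # The position prefix keeps names unique when a phone occurs twice in a word
    return [f"{pos}:{phone}_{i}"
            for pos, phone in enumerate(pronunciation)
            for i in range(STATES_PER_PHONE)]


def left_to_right_transitions(state_names, self_loop=0.5):
    """Each state either repeats (self-loop) or advances to the next state."""
    transitions = {}
    for i, name in enumerate(state_names):
        following = state_names[i + 1] if i + 1 < len(state_names) else "<exit>"
        transitions[name] = {name: self_loop, following: 1.0 - self_loop}
    return transitions
```

A three-phone word such as "cat" (k ae t) therefore becomes a chain of nine states, which shows how quickly the combined network grows once every word and every language-model transition is included.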
Decoding

Decoding happens in the same way as the
previous example: a Viterbi search for the
most likely state sequence

For each time frame we need to maintain two
pieces of information
 The likelihood of the best path ending in
each state
 The best previous state for every state
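These two pieces of information are what the Viterbi algorithm stores: a best-path score per state and a backpointer per state for each frame. The sketch below assumes discrete observation symbols and dictionary-based parameters. Real decoders work in the log domain and prune unlikely states, since the network is far too large to search exhaustively.

```python
def viterbi(observations, initial, transitions, emissions):
    """Most likely hidden state sequence and its probability."""
    states = list(initial)
    # best[s]: probability of the best path that ends in state s
    best = {s: initial[s] * emissions[s][observations[0]] for s in states}
    backpointers = []  # one dict per later frame: state -> best previous state
    for obs in observations[1:]:
        pointers, new_best = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] * transitions[p][s])
            pointers[s] = prev
            new_best[s] = best[prev] * transitions[prev][s] * emissions[s][obs]
        backpointers.append(pointers)
        best = new_best
    # Trace back from the best final state using the stored backpointers
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]
```

Only the scores of the current frame are kept, but every frame's backpointers must be stored so the winning path can be recovered at the end.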
State of the Art

What works well
 Constrained vocabulary systems
 Systems adapted to a given speaker
 Systems in anechoic environments without
background noise
 Systems expecting read speech

What doesn't work
 Large unconstrained vocabulary
 Noisy environments
 Conversational speech
Future Work

Better representations of audio based on
human hearing

Better representation of acoustic elements
based on articulatory phonology

Segmental models that do not rely on the
simple frame-based approach
Resources

Hidden Markov Model Toolkit (HTK)
 http://htk.eng.cam.ac.uk/

CHiME (a freely available dataset)
 http://spandh.dcs.shef.ac.uk/projects/chime/PCC/datasets.html

Machine Learning Lectures
 http://www.stanford.edu/class/cs229/
 http://www.youtube.com/watch?v=UzxYlbK2c7E
