speech recognition system of modern world.ppt

Introduction to Automatic
Speech Recognition

Outline

Define the problem

What is speech?

Feature Selection

Models
 Early methods
 Modern statistical models

Current State of ASR

Future Work

The ASR Problem

There is no single ASR problem

The problem depends on many factors
 Microphone: Close-mic, throat-mic, microphone
array, audio-visual
 Sources: band-limited, background noise,
reverberation
 Speaker: speaker dependent, speaker
independent
 Language: open/closed vocabulary, vocabulary
size, read/spontaneous speech
 Output: Transcription, speaker id, keywords

Performance Evaluation

Accuracy
 Percentage of tokens correctly recognized

Error Rate
 Complement of accuracy: the percentage of
tokens recognized incorrectly

Token Type
 Phones
 Words*
 Sentences
 Semantics?
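
When words are the token type, error rate is usually reported as word error rate (WER): the number of substitutions, deletions, and insertions needed to turn the reference transcript into the recognizer output, divided by the number of reference words. The sketch below is a minimal illustration; the function names are hypothetical and whitespace tokenization is an assumption.

```python
def edit_distance(ref, hyp):
    """Minimum substitutions, deletions, and insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]  # i deletions turn the first i reference tokens into nothing
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER = edit distance over words / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)
```

For the reference "the cat sat on the mat" and the output "the cat sit on mat" there is one substitution and one deletion, so the WER is 2/6. Because insertions are counted, WER can exceed 100%.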

What is Speech?

Analog signal produced by humans

The speech signal can be thought of as
decomposed into a source and a filter

The source is the vocal folds in voiced speech

The filter is the vocal tract and articulators

Feature Selection

As in any data-driven task, the data must be
represented in some format

Cepstral features have been found to perform
well

They describe the spectrum of the log
spectrum: how quickly energy varies across
frequency

Mel-frequency cepstral coefficients (MFCC)
are the most common variety
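
One common recipe for MFCCs is: window a short frame, take its power spectrum, pool the spectrum with triangular filters spaced on the mel scale, take the log, and apply a discrete cosine transform. The sketch below follows that recipe for a single frame using only NumPy. The filter count (26), coefficient count (13), and the mel formula constants are conventional choices assumed here, not values taken from the slides, and all function names are hypothetical.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):    # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):   # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank


def dct_ii(x, n_coeffs):
    """Unnormalized DCT-II of a 1-D array, keeping the first n_coeffs terms."""
    n = len(x)
    k = np.arange(n_coeffs)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    return basis @ x


def mfcc_frame(frame, sample_rate, n_filters=26, n_coeffs=13):
    """MFCCs for one frame of audio samples (1-D array)."""
    n_fft = len(frame)
    windowed = frame * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(windowed)) ** 2 / n_fft
    mel_energies = mel_filterbank(n_filters, n_fft, sample_rate) @ power
    log_energies = np.log(mel_energies + 1e-10)  # small floor avoids log(0)
    return dct_ii(log_energies, n_coeffs)
```

A full front end would slide this over overlapping frames (for example 25 ms every 10 ms) and often append delta features; those steps are omitted here.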

Where do we stand?

Defined the multiple problems associated with
ASR

Described how speech is produced

Illustrated how speech can be represented in
an ASR system

Now that we have the data, how do we
recognize the speech?

Radio Rex

First known attempt at speech recognition

A toy from 1922

Worked by analyzing the signal strength at
500 Hz

Actual speech recognition
systems

Originally thought to be a relatively simple
task requiring a few years of concerted effort

1969: Pierce's “Whither Speech
Recognition?” is published

A DARPA project ran from 1971-1976 in
response to the statements in the Pierce
article

We can examine a few general systems

Template-Based ASR

Originally only worked for isolated words

Performs best when training and testing
conditions are matched

For each word we want to recognize, we
store a template or example based on actual
data

Each test utterance is checked against the
templates to find the best match

Uses the Dynamic Time Warping (DTW)
algorithm

Dynamic Time Warping

Create a matrix of frame-to-frame distances
for the two utterances

Use dynamic programming to find the lowest
cost path
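
The two steps above can be sketched as follows. This is a minimal version assuming Euclidean distance between feature frames and the three basic step directions (horizontal, vertical, diagonal) with no slope constraints; the function names and the template dictionary are hypothetical.

```python
import numpy as np


def dtw_cost(a, b):
    """Lowest cumulative cost of aligning two utterances.

    a and b are arrays of shape (frames, features); they may differ in length.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Local cost matrix: Euclidean distance between every pair of frames.
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    n, m = local.shape
    total = np.full((n + 1, m + 1), np.inf)
    total[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of the three paths that can reach (i, j).
            total[i, j] = local[i - 1, j - 1] + min(total[i - 1, j],
                                                    total[i, j - 1],
                                                    total[i - 1, j - 1])
    return total[n, m]


def recognize(utterance, templates):
    """Return the word whose stored template aligns most cheaply."""
    return min(templates, key=lambda word: dtw_cost(utterance, templates[word]))
```

Because the path may repeat frames of either utterance, a slowly spoken word can still match a shorter template of the same word at zero or low cost.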

Hearsay-II

One of the systems developed during the
DARPA program

A blackboard-based system utilizing symbolic
problem solvers

Each problem solver was called a knowledge
source (KS)

A complex scheduler was used to decide
when each KS should be called

DARPA Results

The Hearsay-II system performed much
better than the two other similar competing
systems

However, only one system met the
performance goals of the project
 That system, Harpy, was also built at CMU
 In many ways it was a predecessor to the
modern statistical systems

Acoustic Model

For each frame of data, we need some way
of describing the likelihood of it belonging to
any of our classes

Two methods are commonly used
 Multilayer perceptron (MLP) gives the posterior
probability of a class given the data
 Gaussian Mixture Model (GMM) gives the
likelihood of the data given a class
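
As an illustration of the GMM case, the sketch below computes the log-likelihood of one feature frame under a mixture with diagonal covariances, which is a common simplification assumed here rather than something stated in the slides. The function and parameter names are hypothetical.

```python
import numpy as np


def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | class) for a diagonal-covariance Gaussian mixture.

    x:         feature vector, shape (D,)
    weights:   mixture weights summing to 1, shape (K,)
    means:     component means, shape (K, D)
    variances: per-dimension variances, shape (K, D)
    """
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    # Log density of x under each Gaussian component.
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_components = np.log(weights) + log_norm + log_exp
    # Log-sum-exp over components, shifted by the maximum for stability.
    peak = np.max(log_components)
    return float(peak + np.log(np.sum(np.exp(log_components - peak))))
```

In a recognizer, one such mixture would be evaluated per class for every frame, and the resulting scores passed on to the decoder.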

Pronunciation Model

While the pronunciation model can be very
complex, it is typically just a dictionary

The dictionary contains the valid
pronunciations for each word

Examples:
 Cat: k ae t
 Dog: d ao g
 Fox: f aa k s
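
A dictionary of this kind maps naturally onto a lookup table. The sketch below uses two of the entries above and stores a list of pronunciations per word so that variants can be added; the structure and the function name are illustrative assumptions, not a format defined by any particular toolkit.

```python
# Each word maps to a list of valid pronunciations (phone sequences).
PRONUNCIATIONS = {
    "cat": [["k", "ae", "t"]],
    "dog": [["d", "ao", "g"]],
}


def words_to_phones(words, lexicon=PRONUNCIATIONS):
    """Expand a word sequence into phones using the first listed pronunciation."""
    phones = []
    for word in words:
        variants = lexicon.get(word.lower())
        if variants is None:
            raise KeyError(f"no pronunciation for {word!r}")
        phones.extend(variants[0])
    return phones
```

Words missing from the dictionary raise an error here; a closed-vocabulary system can treat that as a configuration mistake, while an open-vocabulary system needs some other fallback.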

Language Model

Now we need some way of representing the
likelihood of any given word sequence

Many methods exist, but ngrams are the
most common

Ngram models are trained by simply
counting the occurrences of word sequences
in a training set

Ngrams

A unigram is the probability of any word in
isolation

A bigram is the probability of a word
given the previous word

Higher order ngrams continue in a similar
fashion

A backoff probability is used for any unseen
data
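
Counting and backoff can be sketched for the bigram case as follows. The backoff used here is a deliberately simplified one: an unseen bigram falls back to the unigram relative frequency scaled by a fixed factor, so the scores are not properly normalized probabilities as they would be with a scheme such as Katz backoff. The factor 0.4, the sentence markers, and the function names are assumptions for illustration.

```python
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams, with sentence start and end markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def bigram_score(prev, word, unigrams, bigrams, backoff=0.4):
    """Relative frequency of (prev, word); back off to the unigram if unseen."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    total = sum(unigrams.values())
    # A word never seen in training still scores 0 in this sketch.
    return backoff * unigrams[word] / total
```

Trained on "the cat sat" and "the dog sat", the seen bigram ("the", "cat") scores 1/2, while the unseen bigram ("cat", "dog") backs off to 0.4 × 1/10.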

How do we put it together?

We now have models to represent the three
parts of our equation

We need a framework to join these models
together

The standard framework used is the Hidden
Markov Model (HMM)
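
The equation itself does not appear in the text above. A common way to write it, given here as an assumption about what was intended, uses A for the acoustic observations, Q for a phone sequence, and W for a word sequence:

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \sum_{Q} P(A \mid Q)\, P(Q \mid W)\, P(W)
```

Here P(A | Q) is the acoustic model, P(Q | W) is the pronunciation model, and P(W) is the language model. The second form follows from Bayes' rule, dropping P(A) because it does not depend on W, and assuming the acoustics depend on the words only through the phones.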

Markov Model

A state model using the Markov property
 The Markov property states that the next state
depends only on the present state, not on the
earlier history

Models the likelihood of transitions between
states in a model

Given the model, we can determine the
likelihood of any sequence of states
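
Under the Markov property, the likelihood of a state sequence is the initial-state probability multiplied by one transition probability per step. A minimal sketch, with hypothetical state names and probabilities chosen only for illustration:

```python
import math


def sequence_log_prob(states, initial, transitions):
    """Log-probability of a state sequence under a first-order Markov model.

    initial[s] is P(first state = s); transitions[a][b] is P(next = b | current = a).
    All probabilities used must be non-zero, since math.log(0) raises an error.
    """
    log_prob = math.log(initial[states[0]])
    for prev, curr in zip(states, states[1:]):
        log_prob += math.log(transitions[prev][curr])
    return log_prob


# Toy two-state model (values are made up for the example).
INITIAL = {"rain": 0.5, "sun": 0.5}
TRANSITIONS = {
    "rain": {"rain": 0.7, "sun": 0.3},
    "sun": {"rain": 0.2, "sun": 0.8},
}
```

For the sequence sun, sun, rain the probability is 0.5 × 0.8 × 0.2 = 0.08. Working in log space avoids underflow on long sequences.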

Hidden Markov Model

Similar to a Markov model except the states
are hidden

We now have observations tied to the
individual states

We no longer know the exact state sequence
given the data

Allows for the modeling of an underlying
unobservable process
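
Because the state sequence is unknown, the likelihood of an observation sequence has to sum over every possible state sequence. The forward algorithm does this efficiently, one frame at a time. The sketch below assumes discrete observation symbols and a small emission table for simplicity, whereas an ASR system would use an acoustic model to score each frame; the names are hypothetical.

```python
import numpy as np


def forward_likelihood(initial, transitions, emissions, observations):
    """P(observations) under a discrete HMM, summed over all state sequences.

    initial:      shape (S,), P(first state)
    transitions:  shape (S, S), transitions[i, j] = P(next = j | current = i)
    emissions:    shape (S, V), emissions[i, v] = P(symbol v | state i)
    observations: sequence of symbol indices
    """
    initial = np.asarray(initial, dtype=float)
    transitions = np.asarray(transitions, dtype=float)
    emissions = np.asarray(emissions, dtype=float)
    # alpha[i] = P(observations so far, current state = i)
    alpha = initial * emissions[:, observations[0]]
    for obs in observations[1:]:
        alpha = (alpha @ transitions) * emissions[:, obs]
    return float(alpha.sum())
```

For long sequences the alpha values underflow, so practical implementations rescale at each frame or work in log space.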

HMMs for ASR

First we build an HMM for each phone

Next we combine the phone models based
on the pronunciation model to create word
level models

Finally, the word level models are combined
based on the language model

We now have a giant network with potentially
thousands or even millions of states

Decoding

Decoding happens in the same way as the
previous example

For each time frame we need to maintain two
pieces of information
 The likelihood of being at any state
 The previous state for every state
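
Keeping those two pieces of information per frame is what the Viterbi algorithm does, which is the usual choice for HMM decoding and is assumed here to be the procedure meant. The sketch below uses a small discrete HMM; a real decoder would add pruning (beam search) to cope with the size of the network. Names are hypothetical.

```python
import numpy as np


def viterbi(initial, transitions, emissions, observations):
    """Most likely state path and its log-probability for a discrete HMM."""
    with np.errstate(divide="ignore"):  # log(0) becomes -inf, which is fine
        log_init = np.log(np.asarray(initial, dtype=float))
        log_trans = np.log(np.asarray(transitions, dtype=float))
        log_emit = np.log(np.asarray(emissions, dtype=float))
    n_states = len(log_init)
    n_frames = len(observations)
    # score[t, s]: best log-likelihood of any path ending in state s at frame t.
    score = np.empty((n_frames, n_states))
    # backpointer[t, s]: the previous state on that best path.
    backpointer = np.zeros((n_frames, n_states), dtype=int)
    score[0] = log_init + log_emit[:, observations[0]]
    for t in range(1, n_frames):
        candidates = score[t - 1][:, None] + log_trans  # indexed [prev, curr]
        backpointer[t] = np.argmax(candidates, axis=0)
        best = candidates[backpointer[t], np.arange(n_states)]
        score[t] = best + log_emit[:, observations[t]]
    # Trace back from the best final state.
    path = [int(np.argmax(score[-1]))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(backpointer[t, path[-1]]))
    path.reverse()
    return path, float(score[-1].max())
```

The traceback through the stored previous states recovers the single best path, from which the recognized phone and word sequence can be read off.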

State of the Art

What works well
 Constrained vocabulary systems
 Systems adapted to a given speaker
 Systems in anechoic environments without
background noise
 Systems expecting read speech

What doesn't work
 Large unconstrained vocabulary
 Noisy environments
 Conversational speech

Future Work

Better representations of audio based on
human hearing

Better representation of acoustic elements
based on articulatory phonology

Segmental models that do not rely on the
simple frame-based approach

Resources

Hidden Markov Model Toolkit (HTK)
 http://htk.eng.cam.ac.uk/

CHiME (a freely available dataset)
 http://spandh.dcs.shef.ac.uk/projects/chime/PCC/datasets.html

Machine Learning Lectures
 http://www.stanford.edu/class/cs229/
 http://www.youtube.com/watch?v=UzxYlbK2c7E
