Theories of Speech Perception 
Presenters: Syeda Urooj, Asma Agha, Yasmeen Jamil, Rahat Umer
SPEECH PERCEPTION 
 Articulatory phonetics 
   Production based 
   Place and manner of articulation 
 Acoustic phonetics 
   Based on the acoustic signal 
   Formants, transitions, co-articulation, etc.
Speech Production to Perception 
 Acoustic cues are extracted, stored in sensory memory, and then 
mapped onto linguistic information. 
 Air is pushed up through the larynx, across the vocal cords, and into the 
mouth and nose, where different types of sounds are produced. 
 The different qualities of the sounds are represented in formants. 
 The formants and other features are mapped onto phonemes (a toy 
sketch of such a mapping follows).
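To make that last mapping step concrete, here is a minimal sketch, assuming a nearest-prototype mapping from the first two formants onto a vowel category. The F1/F2 prototype values are rough illustrative averages and are not part of the original slides.

```python
# A minimal sketch (not from the slides) of mapping formant values onto
# phoneme categories. The F1/F2 prototypes are rough, illustrative averages
# for an adult male speaker and are assumptions for demonstration only.

VOWEL_PROTOTYPES = {          # vowel: (F1 Hz, F2 Hz), approximate
    "i": (270, 2290),
    "a": (730, 1090),
    "u": (300, 870),
}

def classify_vowel(f1: float, f2: float) -> str:
    """Return the vowel whose (F1, F2) prototype is nearest to the input."""
    def distance(proto):
        pf1, pf2 = proto
        return (f1 - pf1) ** 2 + (f2 - pf2) ** 2
    return min(VOWEL_PROTOTYPES, key=lambda v: distance(VOWEL_PROTOTYPES[v]))

print(classify_vowel(290, 2200))  # -> "i"
```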
Theoretical approaches
Theories of Speech Perception 
 Theories of speech perception must be able to account for 
certain facts about the acoustic speech signal, e.g.: 
 There is inter-speaker and intra-speaker variability among 
signals that convey information about equivalent phonetic 
events. 
 The acoustic speech signal is continuous even though it is 
perceived as and represents a series of discrete units. 
 Speech signals contain cues that are transmitted very 
quickly (20 to 25 sounds per second) and simultaneously.
Scope of the problem 
 Speech perception involves the mapping of speech 
acoustic signals onto linguistic messages (e.g., 
phonemes, distinctive features, syllables, words, 
phrases…)
Types of Theories: 
Theories of speech perception fall into one of three broad 
classes: 
 Motor Theories: Perception involves processes related to the 
production of speech. Examples include Motor Theory and 
Analysis-by-Synthesis. 
 Direct Perception: Perception recovers the sound producing 
objects directly. Examples include Fowler’s Direct Realist 
Approach. 
 Stage Theories: Perception involves a sequence of transforms 
from sound to object. Examples include TRACE and LAFS.
Motor Theory of Speech Perception 
(Liberman et al., 1967; Liberman & Mattingly, 1985) 
 "...overlapping activity of several neural networks - those that 
supply control signals to the articulators, and those that process 
incoming neural patterns from the ear..." and "... that 
information can be correlated by these networks and passed 
through them in either direction." (Liberman et al., 1967) 
 …"the candidate signal descriptions are computed by an analogue 
of the production process—an internal, innately specified vocal-tract 
synthesizer…—that [incorporates] complete information about the 
anatomical and physiological characteristics of the vocal tract 
and also about the articulatory and acoustic consequences of 
linguistically significant gestures." (Liberman & Mattingly, 1985, 
revised theory)
Motor Theory 
Motor Theory has, as its core, the premise that 
perception involves a reference to articulation. This 
view is often associated with the idea that speech is 
somehow “special” and involves specialized, species-specific 
mechanisms in perception.
Motor Theory 
 This model was developed in 1967 by Liberman and 
colleagues 
 The basic principle of this model lies in how speech sounds are 
produced in the speaker's vocal tract. 
 The Motor Theory proposes that a listener specifically 
perceives a speaker's phonetic gestures while they are 
speaking. 
 Speech is perceived in humans by means of a specialized 
speech module.
Motor Theory (… contd) 
 A phonetic gesture is a representation of the speaker's 
vocal tract constriction while producing a speech sound. 
 Each phonetic gesture is produced uniquely in the vocal 
tract. 
 The different places in the vocal tract at which gestures are 
produced permit the speaker to produce salient phonemes for 
listeners to perceive. 
 The Motor Theory model functions by using separate 
embedded models within the main model. It is the 
interaction of these models that makes Motor Theory 
possible.
Motor Theory (… contd) 
According to the motor theory of speech perception: 
 We have a special system for processing speech. 
 Perception and production are closely linked. 
 Motor commands in the brain that control movements of 
the muscles used to speak help us perceive speech. 
 Humans are born with a module that connects sounds with 
mental commands - we have an innate speech processing 
module.
The Speech Chain(s)
Analysis-by-Synthesis Theory of Speech 
Perception 
(Stevens and Halle, 1967) 
 Stevens and Halle (1967) have postulated that 
"... the perception of speech involves the internal 
synthesis of patterns according to certain rules, and a 
matching of these internally generated patterns 
against the pattern under analysis. ...moreover, ...the 
generative rules in the perception of speech [are] in 
large measure identical to those utilized in speech 
production, and that fundamental to both processes 
[is] an abstract representation of the speech event."
Analysis-by-Synthesis Model 
 In this model the incoming acoustic signal is subjected 
to an initial analysis at the periphery of the auditory 
system. 
 This information is then passed upward to a master 
control unit and is there processed along with certain 
contextual constraints derived from preceding 
segments. 
 This produces a hypothesized abstract representation 
defined in terms of a set of generative rules.
Analysis-by-Synthesis Model 
 This representation is then used to generate motor commands, but 
during speech perception overt articulation is inhibited; instead, 
the commands produce a hypothetical auditory pattern, which is 
passed to a comparator module and compared with the original 
signal held in a temporary store. If a mismatch occurs, the 
procedure is repeated until a suitable match is found (a toy 
sketch of this loop appears below).
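The loop just described (hypothesize, synthesize, compare, repeat) can be summarized in a short sketch. This is a generate-and-test simplification, not Stevens and Halle's model: the candidate set, the internal synthesizer, and the distance measure below are all stand-in assumptions.

```python
# A minimal sketch (not from the slides) of the analysis-by-synthesis loop:
# hypothesize a representation, synthesize the auditory pattern it predicts,
# compare against the stored input, and repeat until the mismatch is small.

from typing import Callable, Iterable, List

def analysis_by_synthesis(
    input_pattern: List[float],
    candidates: Iterable[str],                      # hypothesized abstract representations
    synthesize: Callable[[str], List[float]],       # stand-in for the internal synthesizer
    tolerance: float = 1.0,
) -> str:
    """Return the first hypothesis whose synthesized pattern matches the input."""
    def mismatch(a: List[float], b: List[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best, best_err = None, float("inf")
    for hypothesis in candidates:                   # generate-and-test loop
        err = mismatch(synthesize(hypothesis), input_pattern)
        if err < tolerance:
            return hypothesis                       # suitable match found
        if err < best_err:
            best, best_err = hypothesis, err
    return best                                     # otherwise, the closest candidate

# Toy usage: "synthesis" just looks up a stored auditory pattern per phoneme.
toy_patterns = {"b": [1.0, 0.2], "p": [1.0, 0.8], "d": [0.1, 0.2]}
print(analysis_by_synthesis([0.95, 0.25], toy_patterns, lambda h: toy_patterns[h]))  # -> "b"
```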
Analysis-by-Synthesis Model (after Stevens, 1972)
Direct Realism Theory of Speech Perception 
(Carol Fowler, 1986) 
The direct realist theory of speech perception is a part 
of the more general theory of direct realism, which 
postulates that perception allows us to have direct 
awareness of the world because it involves direct 
recovery of the distal source of the event that is 
perceived.
Direct Realism Theory 
 The theory asserts that the objects of perception are actual vocal 
tract movements, or gestures, and not abstract phonemes or (as 
in the Motor Theory) events that are causally antecedent to these 
movements, i.e. intended gestures. 
 Listeners perceive gestures not by means of a specialized decoder 
(as in the Motor Theory) but because information in the acoustic 
signal specifies the gestures that form it. 
 By claiming that the actual articulatory gestures that produce 
different speech sounds are themselves the units of speech 
perception, the theory bypasses the problem of lack of 
invariance.
You say you have a theory? 
The result of underestimating the complexity of perceptual processing in a theory.
Stage Theories 
 A diverse set of theories that do not assume a link 
between production and perception. 
 The role and nature of the segmental (phonetic) 
representation varies across these theories.
Stage Theories – Key Elements 
Coding is based on auditory processes. 
All use intermediate representations, though the nature of the 
representations is diverse. 
All use an information-processing framework 
(perception is the result of a sequence of 
transformations); a toy pipeline sketch follows.
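As a minimal illustration of this information-processing framework (and only that; the stages and features below are invented placeholders), a stage theory can be caricatured as a fixed pipeline of transformations:

```python
# A minimal sketch (illustrative only) of the information-processing framework
# common to stage theories: perception as a fixed sequence of transformations,
# each stage consuming the previous stage's intermediate representation.

def extract_features(signal):          # auditory coding stage
    return ["voiced", "stop", "labial"] if signal else []

def features_to_segment(features):     # segmental (phonetic) stage
    return "b" if "labial" in features and "voiced" in features else "?"

def segment_to_percept(segment):       # later linguistic stage
    return f"/{segment}/"

def perceive(signal):
    # Each stage only sees the output of the stage before it (no feedback).
    return segment_to_percept(features_to_segment(extract_features(signal)))

print(perceive([0.2, 0.4, 0.1]))  # -> "/b/"
```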
LAFS THEORY 
LAFS – Lexical Access From Spectra
LAFS - Lexical Access From Spectra 
(Klatt, 1979) 
 In LAFS, Klatt proposed that the input is an auditory 
representation of the signal. This representation is a 
series of spectral sections. 
 A finite-state network parses the input. The path 
through the network that results from parsing an 
input is a word. That is, this system maps a sequence 
of spectral sections onto a word. Parts of the network 
that correspond to sequences of spectral sections are 
isomorphic to “diphones” (a type of context-sensitive 
allophone); a simplified sketch of this lexical mapping follows.
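The sketch below is a heavily simplified illustration of the LAFS idea, not Klatt's actual network: it matches a sequence of coarse spectral sections directly against stored spectral-template sequences for whole words, whereas real LAFS decodes through a precompiled finite-state network of diphone spectral templates.

```python
# A minimal sketch (an assumption-heavy simplification of LAFS): a sequence of
# spectral sections is matched directly against stored spectral-template
# sequences for whole words, and lexical access returns the best-scoring word.

from typing import Dict, List

Spectrum = List[float]   # one coarse spectral section (e.g., band energies)

def word_score(input_frames: List[Spectrum], template: List[Spectrum]) -> float:
    """Sum of frame-by-frame spectral distances (assumes equal lengths)."""
    return sum(
        sum((x - y) ** 2 for x, y in zip(frame, ref))
        for frame, ref in zip(input_frames, template)
    )

def lexical_access(input_frames: List[Spectrum],
                   lexicon: Dict[str, List[Spectrum]]) -> str:
    """Map the spectral-section sequence directly onto the best-matching word."""
    return min(lexicon, key=lambda w: word_score(input_frames, lexicon[w]))

# Toy lexicon: two "words", each stored as three 2-band spectral sections.
lexicon = {
    "ba": [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]],
    "da": [[0.1, 0.9], [0.4, 0.6], [0.2, 0.8]],
}
print(lexical_access([[0.85, 0.15], [0.65, 0.3], [0.25, 0.75]], lexicon))  # -> "ba"
```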
LAFS - Key Elements 
 The invariant for perception is a characterization of the 
spectral shape, over time. 
 The “perceptual unit” is the context-sensitive allophone, 
but listeners have no direct access to this representation 
(phonetic perception is lexically mediated). 
 Processing is controlled by a temporal parsing process 
(implemented as a finite state machine). 
 Note that Hillenbrand’s model of vowel recognition is 
similar to LAFS.
TRACE THEORY
TRACE 
(McClelland & Elman, 1986) 
 Elman and McClelland proposed TRACE as a stage 
model that consists of an auditory (ear) front end, 
auditory feature extraction, a phonetic level, and a 
lexical level. 
 TRACE is implemented in a connectionist architecture 
and has both ascending and descending (feedback) 
connections as well as connections within each level; a toy 
interactive-activation sketch follows. 
 TRACE is both a theory and, in its two versions, a 
model of perception.
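The following toy interactive-activation sketch (an assumption-laden simplification, not the published TRACE simulation) illustrates the bottom-up, top-down (feedback), and within-level competition connections just described, using an invented two-word lexicon and hand-picked weights.

```python
# A minimal sketch of the interactive-activation idea behind TRACE: phoneme and
# word nodes accumulate activation from the level below, words feed activation
# back down to their phonemes, and nodes at the same level compete via
# inhibition. The inventory and weights are illustrative assumptions.

WORDS = {"bat": ["b", "a", "t"], "pat": ["p", "a", "t"]}
PHONEMES = ["b", "p", "a", "t"]

def run_trace(bottom_up: dict, steps: int = 10,
              excite: float = 0.1, feedback: float = 0.05, inhibit: float = 0.05):
    """bottom_up: feature-level support for each phoneme (0..1)."""
    phon = {p: 0.0 for p in PHONEMES}
    word = {w: 0.0 for w in WORDS}
    for _ in range(steps):
        # Bottom-up: features -> phonemes, phonemes -> words.
        for p in PHONEMES:
            phon[p] += excite * bottom_up.get(p, 0.0)
        for w, ps in WORDS.items():
            word[w] += excite * sum(phon[p] for p in ps) / len(ps)
        # Top-down feedback: words -> their component phonemes.
        for w, ps in WORDS.items():
            for p in ps:
                phon[p] += feedback * word[w]
        # Lateral inhibition (competition) within each level.
        for level in (phon, word):
            total = sum(level.values())
            for k in level:
                level[k] = max(0.0, level[k] - inhibit * (total - level[k]))
    return phon, word

# Ambiguous /b/-/p/ input, slightly favouring /b/: "bat" should win.
phon, word = run_trace({"b": 0.6, "p": 0.4, "a": 1.0, "t": 1.0})
print(max(word, key=word.get))  # -> "bat"
```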
TRACE – Key Elements 
 Invariant cues are not required. Perception is a result 
of a cascade of stages involving a one-to-many and 
many-to-one mapping (behaves like a prototype 
system). 
 There are two variants of TRACE. One uses a triphone 
(context-sensitive allophone) representation and the 
other an abstract phoneme. 
 Feedback and competition among nodes at the same 
level are used to stabilize perception.