Theories of Speech Perception 
Presenters: Syeda Urooj, Asma Agha, Yasmeen Jamil, Rahat Umer
SPEECH PERCEPTION 
 Articulatory phonetics 
   Production based 
   Place and manner of articulation 
 Acoustic phonetics 
   Based on the acoustic signal 
   Formants, transitions, co-articulation, etc.
Speech Production to Perception 
 Acoustic cues are extracted, stored in sensory memory, and then 
mapped onto linguistic information. 
 Air is pushed up through the larynx, across the vocal cords, and into the 
mouth and nose, where different types of sounds are produced. 
 The different qualities of the sounds are represented in formants. 
 The formants and other features are mapped onto phonemes (a toy 
sketch of such a mapping follows).
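To make that last mapping step concrete, here is a minimal sketch, assuming a nearest-prototype mapping from the first two formants onto a vowel category. The F1/F2 prototype values are rough illustrative averages and are not part of the original slides.

```python
# A minimal sketch (not from the slides) of mapping formant values onto
# phoneme categories. The F1/F2 prototypes are rough, illustrative averages
# for an adult male speaker and are assumptions for demonstration only.

VOWEL_PROTOTYPES = {          # vowel: (F1 Hz, F2 Hz), approximate
    "i": (270, 2290),
    "a": (730, 1090),
    "u": (300, 870),
}

def classify_vowel(f1: float, f2: float) -> str:
    """Return the vowel whose (F1, F2) prototype is nearest to the input."""
    def distance(proto):
        pf1, pf2 = proto
        return (f1 - pf1) ** 2 + (f2 - pf2) ** 2
    return min(VOWEL_PROTOTYPES, key=lambda v: distance(VOWEL_PROTOTYPES[v]))

print(classify_vowel(290, 2200))  # -> "i"
```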
Theoretical approaches
Theories of Speech Perception 
 Theories of speech perception must be able to account for 
certain facts about the acoustic speech signal, e.g.: 
 There is inter-speaker and intra-speaker variability among 
signals that convey information about equivalent phonetic 
events. 
 The acoustic speech signal is continuous even though it is 
perceived as and represents a series of discrete units. 
 Speech signals contain cues that are transmitted very 
quickly (20 to 25 sounds per second) and simultaneously.
Scope of the problem 
 Speech perception involves the mapping of speech 
acoustic signals onto linguistic messages (e.g., 
phonemes, distinctive features, syllables, words, 
phrases…)
Types of Theories: 
Theories of speech perception fall into one of three broad 
classes: 
 Motor Theories: Perception involves processes related to the 
production of speech. Examples include Motor Theory and 
Analysis-by-Synthesis. 
 Direct Perception: Perception recovers the sound producing 
objects directly. Examples include Fowler’s Direct Realist 
Approach. 
 Stage Theories: Perception involves a sequence of transforms 
from sound to object. Examples include TRACE and LAFS.
Motor Theory of Speech Perception 
(Liberman et al., 1967; Liberman & Mattingly, 1985) 
 "...overlapping activity of several neural networks - those that 
supply control signals to the articulators, and those that process 
incoming neural patterns from the ear..." and "... that 
information can be correlated by these networks and passed 
through them in either direction." (Liberman et al., 1967) 
 …"the candidate signal descriptions are computed by an analogue 
of the production process—an internal, innately specified vocal-tract 
synthesizer…—that [incorporates] complete information about the 
anatomical and physiological characteristics of the vocal tract 
and also about the articulatory and acoustic consequences of 
linguistically significant gestures." (Liberman & Mattingly, 1985, 
revised theory)
Motor Theory 
Motor Theory has, as its core, the premise that 
perception involves a reference to articulation. This 
view is often associated with the idea that speech is 
somehow “special” and involves specialized, species-specific 
mechanisms in perception.
Motor Theory 
 This model was developed in 1967 by Liberman and 
colleagues 
 The basic principle of this model lies in how speech sounds are 
produced in the speaker's vocal tract. 
 The Motor Theory proposes that a listener specifically 
perceives a speaker's phonetic gestures while they are 
speaking. 
 Speech is perceived in humans by means of a specialized 
speech module.
Motor Theory (… contd) 
 A phonetic gesture is a representation of the speaker's 
vocal tract constriction while producing a speech sound. 
 Each phonetic gesture is produced uniquely in the vocal 
tract. 
 The different places in the vocal tract at which gestures are 
produced permit the speaker to produce salient phonemes for 
listeners to perceive. 
 The Motor Theory model functions by using separate 
embedded models within the main model. It is the 
interaction of these models that makes Motor Theory 
possible.
Motor Theory (… contd) 
According to the motor theory of speech perception: 
 We have a special system for processing speech. 
 Perception and production are closely linked. 
 Motor commands in the brain that control movements of 
the muscles used to speak help us perceive speech. 
 Humans are born with a module that connects sounds with 
mental commands - we have an innate speech processing 
module.
The Speech Chain(s)
Analysis-by-Synthesis Theory of Speech 
Perception 
(Stevens and Halle, 1967) 
 Stevens and Halle (1967) have postulated that 
"... the perception of speech involves the internal 
synthesis of patterns according to certain rules, and a 
matching of these internally generated patterns 
against the pattern under analysis. ...moreover, ...the 
generative rules in the perception of speech [are] in 
large measure identical to those utilized in speech 
production, and that fundamental to both processes 
[is] an abstract representation of the speech event."
Analysis-by-Synthesis Model 
 In this model the incoming acoustic signal is subjected 
to an initial analysis at the periphery of the auditory 
system. 
 This information is then passed upward to a master 
control unit and is there processed along with certain 
contextual constraints derived from preceding 
segments. 
 This produces a hypothesized abstract representation 
defined in terms of a set of generative rules.
Analysis-by-Synthesis Model 
 This representation is then used to generate motor commands, but 
during speech perception overt articulation is inhibited; instead, 
the commands produce a hypothetical auditory pattern, which is 
passed to a comparator module and compared with the original 
signal held in a temporary store. If a mismatch occurs, the 
procedure is repeated until a suitable match is found (a toy 
sketch of this loop appears below).
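The loop just described (hypothesize, synthesize, compare, repeat) can be summarized in a short sketch. This is a generate-and-test simplification, not Stevens and Halle's model: the candidate set, the internal synthesizer, and the distance measure below are all stand-in assumptions.

```python
# A minimal sketch (not from the slides) of the analysis-by-synthesis loop:
# hypothesize a representation, synthesize the auditory pattern it predicts,
# compare against the stored input, and repeat until the mismatch is small.

from typing import Callable, Iterable, List

def analysis_by_synthesis(
    input_pattern: List[float],
    candidates: Iterable[str],                      # hypothesized abstract representations
    synthesize: Callable[[str], List[float]],       # stand-in for the internal synthesizer
    tolerance: float = 1.0,
) -> str:
    """Return the first hypothesis whose synthesized pattern matches the input."""
    def mismatch(a: List[float], b: List[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best, best_err = None, float("inf")
    for hypothesis in candidates:                   # generate-and-test loop
        err = mismatch(synthesize(hypothesis), input_pattern)
        if err < tolerance:
            return hypothesis                       # suitable match found
        if err < best_err:
            best, best_err = hypothesis, err
    return best                                     # otherwise, the closest candidate

# Toy usage: "synthesis" just looks up a stored auditory pattern per phoneme.
toy_patterns = {"b": [1.0, 0.2], "p": [1.0, 0.8], "d": [0.1, 0.2]}
print(analysis_by_synthesis([0.95, 0.25], toy_patterns, lambda h: toy_patterns[h]))  # -> "b"
```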
Analysis-by-Synthesis Model (after Stevens, 1972)
Direct Realism Theory of Speech Perception 
(Carol Fowler, 1986) 
The direct realist theory of speech perception is a part 
of the more general theory of direct realism, which 
postulates that perception allows us to have direct 
awareness of the world because it involves direct 
recovery of the distal source of the event that is 
perceived.
Direct Realism Theory 
 The theory asserts that the objects of perception are actual vocal 
tract movements, or gestures, and not abstract phonemes or (as 
in the Motor Theory) events that are causally antecedent to these 
movements, i.e. intended gestures. 
 Listeners perceive gestures not by means of a specialized decoder 
(as in the Motor Theory) but because information in the acoustic 
signal specifies the gestures that form it. 
 By claiming that the actual articulatory gestures that produce 
different speech sounds are themselves the units of speech 
perception, the theory bypasses the problem of lack of 
invariance.
You say you have a theory? 
The result of underestimating the complexity of perceptual processing in a theory.
Stage Theories 
 A diverse set of theories that do not assume a link 
between production and perception. 
 The role and nature of the segmental (phonetic) 
representation varies across these theories.
Stage Theories – Key Elements 
Coding is based on auditory processes. 
All use intermediate representations, though the nature of the 
representations is diverse. 
All use an information-processing framework 
(perception is the result of a sequence of 
transformations); a toy pipeline sketch follows.
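As a minimal illustration of this information-processing framework (and only that; the stages and features below are invented placeholders), a stage theory can be caricatured as a fixed pipeline of transformations:

```python
# A minimal sketch (illustrative only) of the information-processing framework
# common to stage theories: perception as a fixed sequence of transformations,
# each stage consuming the previous stage's intermediate representation.

def extract_features(signal):          # auditory coding stage
    return ["voiced", "stop", "labial"] if signal else []

def features_to_segment(features):     # segmental (phonetic) stage
    return "b" if "labial" in features and "voiced" in features else "?"

def segment_to_percept(segment):       # later linguistic stage
    return f"/{segment}/"

def perceive(signal):
    # Each stage only sees the output of the stage before it (no feedback).
    return segment_to_percept(features_to_segment(extract_features(signal)))

print(perceive([0.2, 0.4, 0.1]))  # -> "/b/"
```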
LAFS THEORY 
LAFS – Lexical Access From Spectra
LAFS - Lexical Access From Spectra 
(Klatt, 1979) 
 In LAFS, Klatt proposed that the input is an auditory 
representation of the signal. This representation is a 
series of spectral sections. 
 A finite-state network parses the input. The path 
through the network that results from parsing an 
input is a word. That is, this system maps a sequence 
of spectral sections onto a word. Parts of the network 
that correspond to sequences of spectral sections are 
isomorphic to “diphones” (a type of context-sensitive 
allophone); a simplified sketch of this lexical mapping follows.
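The sketch below is a heavily simplified illustration of the LAFS idea, not Klatt's actual network: it matches a sequence of coarse spectral sections directly against stored spectral-template sequences for whole words, whereas real LAFS decodes through a precompiled finite-state network of diphone spectral templates.

```python
# A minimal sketch (an assumption-heavy simplification of LAFS): a sequence of
# spectral sections is matched directly against stored spectral-template
# sequences for whole words, and lexical access returns the best-scoring word.

from typing import Dict, List

Spectrum = List[float]   # one coarse spectral section (e.g., band energies)

def word_score(input_frames: List[Spectrum], template: List[Spectrum]) -> float:
    """Sum of frame-by-frame spectral distances (assumes equal lengths)."""
    return sum(
        sum((x - y) ** 2 for x, y in zip(frame, ref))
        for frame, ref in zip(input_frames, template)
    )

def lexical_access(input_frames: List[Spectrum],
                   lexicon: Dict[str, List[Spectrum]]) -> str:
    """Map the spectral-section sequence directly onto the best-matching word."""
    return min(lexicon, key=lambda w: word_score(input_frames, lexicon[w]))

# Toy lexicon: two "words", each stored as three 2-band spectral sections.
lexicon = {
    "ba": [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]],
    "da": [[0.1, 0.9], [0.4, 0.6], [0.2, 0.8]],
}
print(lexical_access([[0.85, 0.15], [0.65, 0.3], [0.25, 0.75]], lexicon))  # -> "ba"
```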
LAFS - Key Elements 
 The invariant for perception is a characterization of the 
spectral shape, over time. 
 The “perceptual unit” is the context-sensitive allophone, 
but listeners have no direct access to this representation 
(phonetic perception is lexically mediated). 
 Processing is controlled by a temporal parsing process 
(implemented as a finite state machine). 
 Note that Hillenbrand’s model of vowel recognition is 
similar to LAFS.
TRACE THEORY
TRACE 
(McClelland & Elman, 1986) 
 Elman and McClelland proposed TRACE as a stage 
model that consists of an auditory (ear) front end, 
auditory feature extraction, a phonetic level, and a 
lexical level. 
 TRACE is implemented in a connectionist architecture 
and has both ascending and descending (feedback) 
connections as well as connections within each level; a toy 
interactive-activation sketch follows. 
 TRACE is both a theory and, in its two versions, a 
model of perception.
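The following toy interactive-activation sketch (an assumption-laden simplification, not the published TRACE simulation) illustrates the bottom-up, top-down (feedback), and within-level competition connections just described, using an invented two-word lexicon and hand-picked weights.

```python
# A minimal sketch of the interactive-activation idea behind TRACE: phoneme and
# word nodes accumulate activation from the level below, words feed activation
# back down to their phonemes, and nodes at the same level compete via
# inhibition. The inventory and weights are illustrative assumptions.

WORDS = {"bat": ["b", "a", "t"], "pat": ["p", "a", "t"]}
PHONEMES = ["b", "p", "a", "t"]

def run_trace(bottom_up: dict, steps: int = 10,
              excite: float = 0.1, feedback: float = 0.05, inhibit: float = 0.05):
    """bottom_up: feature-level support for each phoneme (0..1)."""
    phon = {p: 0.0 for p in PHONEMES}
    word = {w: 0.0 for w in WORDS}
    for _ in range(steps):
        # Bottom-up: features -> phonemes, phonemes -> words.
        for p in PHONEMES:
            phon[p] += excite * bottom_up.get(p, 0.0)
        for w, ps in WORDS.items():
            word[w] += excite * sum(phon[p] for p in ps) / len(ps)
        # Top-down feedback: words -> their component phonemes.
        for w, ps in WORDS.items():
            for p in ps:
                phon[p] += feedback * word[w]
        # Lateral inhibition (competition) within each level.
        for level in (phon, word):
            total = sum(level.values())
            for k in level:
                level[k] = max(0.0, level[k] - inhibit * (total - level[k]))
    return phon, word

# Ambiguous /b/-/p/ input, slightly favouring /b/: "bat" should win.
phon, word = run_trace({"b": 0.6, "p": 0.4, "a": 1.0, "t": 1.0})
print(max(word, key=word.get))  # -> "bat"
```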
TRACE – Key Elements 
 Invariant cues are not required. Perception is a result 
of a cascade of stages involving a one-to-many and 
many-to-one mapping (behaves like a prototype 
system). 
 There are two variants of TRACE. One uses a triphone 
(context-sensitive allophone) representation and the 
other an abstract phoneme. 
 Feedback and competition among nodes at the same 
level are used to stabilize perception.