The document provides an overview of speech perception and the acoustic and neural coding of speech sounds. It discusses:
- The basics of speech perception including acoustic cues, linearity/segmentation problems, and lack of invariance due to contextual variation.
- How speech is coded in the auditory nerve based on place and temporal theories, including frequency, intensity, temporal coding and representation of vowels and consonants.
- Speech coding in higher levels of the auditory pathway including the cochlear nucleus, superior olivary complex, lateral lemniscus, inferior colliculus, medial geniculate body and auditory cortex.
- Previous exam questions on describing and explaining the coding of speech in the auditory pathway.
1. UNIT 1
SPEECH PERCEPTION
(Introduction to Speech Perception, Acoustics of Speech in relation to
production, Coding of Speech in the auditory pathway)
SUBMITTED TO: MS. VINI ABHIJITH GUPTA, DEPT. OF AUDIOLOGY, MVSCOSH
SUBMITTED BY: HIMANI BANSAL, MASLP IIND YEAR, MVSCOSH
2. DEFINITION
“Speech perception is defined as the process by which a
perceiver tries to identify the talker's underlying language
patterns on the basis of speech sounds and movements. The
ultimate goal of speech perception is to determine the
meaning and intent behind the spoken message.”
- Arthur Boothroyd (1998)
3. BASICS OF SPEECH PERCEPTION
ACOUSTIC CUES:
• The speech sound signal contains a number of acoustic cues that are used in speech perception.
• The cues differentiate speech sounds belonging to different phonetic categories.
• For example, VOT is a primary cue signalling the difference between voiced and voiceless stop
consonants, such as "b" and "p".
LINEARITY AND SEGMENTATION PROBLEM:
• The linearity of speech is difficult to see in the physical speech signal: a speech sound is influenced by the ones that precede it and the ones that follow it.
• This influence can even be exerted at a distance of two or more segments (and across syllable- and
word-boundaries).
• The problem of segmentation arises: one encounters serious difficulties trying to delimit a stretch of
speech signal as belonging to a single perceptual unit. E.g., The acoustic properties of the phoneme /d/
will depend on the identity of the following vowel
4. BASICS OF SPEECH PERCEPTION
LACK OF INVARIANCE:
Reliable, constant relations between a phoneme of a language and its acoustic manifestation in speech are difficult to find. There are several reasons for this:
Context-induced variation - Phonetic environment affects the acoustic properties of speech sounds; e.g., the VOT values marking the boundary between voiced and voiceless stops are different for labial, alveolar and velar stops.
Variation due to differing speech conditions- One important factor that
causes variation is differing speech rate. Many phonemic contrasts are constituted by
temporal characteristics (short vs. long vowels or consonants, affricates vs. fricatives,
stops vs. glides, voiced vs. voiceless stops, etc.) and they are certainly affected by
changes in speaking tempo.
Variation due to different speaker identity- The resulting acoustic
structure of concrete speech productions depends on the physical and psychological
properties of individual speakers. Men, women, and children generally produce voices
having different pitch. Dialect and foreign accent cause variation as well.
5. BASICS OF SPEECH PERCEPTION
PERCEPTUAL CONSTANCY & NORMALIZATION
• Listeners perceive vowels and consonants produced under different conditions and by different speakers as constant categories. They filter out the noise (i.e., variation) to arrive at the underlying category.
• Vocal tract normalization - Vocal-tract-size differences result in formant-frequency variation across speakers; therefore, a listener has to adjust his/her perceptual system to the acoustic characteristics of a particular speaker.
• Speech rate normalization - Listeners are believed to adjust their perception of duration to the current tempo of the speech they are listening to.
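The effect of vocal tract normalization can be illustrated with a small sketch. This is not a model of what listeners actually do; it applies log-mean (Nearey-style) formant normalization to hypothetical formant values for two speakers whose vocal tracts differ only in overall scale:

```python
import numpy as np

def log_mean_normalize(formants_hz):
    """Log-mean (Nearey-style) normalization for one speaker's formant data.
    Subtracting the speaker's mean log formant removes a uniform scale
    factor such as that caused by vocal-tract length."""
    logf = np.log(np.asarray(formants_hz, dtype=float))
    return logf - logf.mean()

# Hypothetical F1/F2 values (Hz) for three vowels from an adult speaker, and
# the same vowels from a child whose shorter vocal tract scales formants by 1.3.
adult = np.array([[300.0, 2300.0], [700.0, 1200.0], [500.0, 1000.0]])
child = adult * 1.3

# After normalization the two vowel spaces coincide.
print(np.allclose(log_mean_normalize(adult), log_mean_normalize(child)))  # True
```

Because a uniform scaling adds the same constant to every log formant, subtracting the speaker's own log mean removes it, which is why the two normalized vowel spaces match exactly in this idealized case.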
CATEGORICAL PERCEPTION
• Categorical perception is involved in processes of perceptual differentiation. We perceive speech sounds categorically; that is, we are more likely to notice differences between categories (phonemes) than within categories.
• In identification and discrimination tests, listeners show different sensitivity to the same relative increase in VOT (considering an artificial continuum between a voiceless and a voiced bilabial stop where each new step differs from the preceding one in the amount of VOT), depending on whether or not the boundary between categories is crossed.
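The boundary effect described above can be sketched with a toy identification function. The continuum values and the 25 ms boundary are illustrative assumptions, not measured data:

```python
import numpy as np

# Hypothetical /b/-/p/ continuum: VOT from 0 to 60 ms in 10 ms steps.
continuum_ms = np.arange(0, 70, 10)
BOUNDARY_MS = 25.0   # assumed category boundary; real boundaries vary by place

def identify(vot_ms):
    """Label a token as voiced (b) or voiceless (p) relative to the boundary."""
    return "p" if vot_ms > BOUNDARY_MS else "b"

def easily_discriminated(vot1_ms, vot2_ms):
    """Categorical prediction: a pair is easy to discriminate only when its
    members are identified as belonging to different categories."""
    return identify(vot1_ms) != identify(vot2_ms)

labels = [identify(v) for v in continuum_ms]
print(labels)                          # ['b', 'b', 'b', 'p', 'p', 'p', 'p']
print(easily_discriminated(10, 20))    # False: same category, 10 ms apart
print(easily_discriminated(20, 30))    # True: crosses the boundary, 10 ms apart
```

The two discrimination pairs differ by the same 10 ms of VOT, yet only the pair that straddles the boundary is predicted to be discriminable, which is the signature pattern of categorical perception.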
6. BASICS OF SPEECH PERCEPTION
TOP-DOWN INFLUENCES ON SPEECH PERCEPTION
• The process of speech perception is not necessarily uni-directional.
• Higher-level language processes connected with morphology, syntax, or semantics may interact with basic speech perception processes to aid in recognition of speech sounds.
• It may not even be possible for a listener to recognize phonemes before recognizing higher units, such as words.
HEARING v/s LISTENING v/s PERCEPTION
• Hearing is a process involving nerves and muscles; it is a peripheral phenomenon in which we do not attend to the sound, we only hear it.
• Listening is a learned behaviour built on hearing: it involves attending, discriminating, understanding, and remembering, and it engages only the auditory system.
• Perception describes how the brain receives, processes, and interprets information from the eyes, ears, nose and other sensory organs.
7. ACOUSTICS OF SPEECH IN RELATION TO
PRODUCTION
When a person has the urge or intention to speak, she or he forms a sentence in the brain and maps the sequence of phonemes to the physiological movements required to produce that sequence of phonemes.
The physical activity begins by contracting the lungs, pushing air out through the throat and the oral and nasal cavities. Airflow by itself is not audible as sound - sound is an oscillation in air pressure.
To obtain a sound, we therefore need to obstruct the airflow so as to create an oscillation or turbulence. Oscillations are primarily produced when the vocal folds are appropriately tensed. This produces voiced sounds and is perhaps the most characteristic property of speech signals.
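The source-filter account above (a glottal oscillation shaped by vocal tract resonances) can be illustrated with a minimal synthesis sketch. The pulse rate, formant frequencies and bandwidths below are illustrative values, and the two-pole resonator is a textbook simplification rather than a production model:

```python
import numpy as np

FS = 16000                         # sample rate (Hz)
F0 = 120                           # glottal pulse rate (Hz), illustrative value

# Source: the vibrating vocal folds, idealized as an impulse train.
source = np.zeros(FS // 2)         # 0.5 s of samples
source[::FS // F0] = 1.0

def resonator(x, freq, bw, fs=FS):
    """Two-pole resonator modelling a single formant (freq and bw in Hz)."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    a1, a2 = 2 * r * np.cos(theta), -r * r
    y = np.zeros(len(x))
    for i in range(len(x)):
        y[i] = x[i]
        if i >= 1:
            y[i] += a1 * y[i - 1]
        if i >= 2:
            y[i] += a2 * y[i - 2]
    return y

# Filter: cascade two formant resonators (roughly /a/-like values).
vowel = resonator(resonator(source, 700, 80), 1200, 90)

# Energy concentrates near the formants rather than at high frequencies.
spec = np.abs(np.fft.rfft(vowel))
freqs = np.fft.rfftfreq(len(vowel), 1 / FS)

def band_energy(lo, hi):
    return spec[(freqs >= lo) & (freqs < hi)].sum()

print(band_energy(600, 800) > band_energy(2900, 3100))  # True
```

The obstruction of airflow (the impulse train) supplies the oscillation; the vocal tract (the resonator cascade) merely shapes its spectrum, which is the separation of source and filter the paragraph above describes.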
8. STUDIES
Weismer, Jeng, Laures, Kent (2001) conducted a study on acoustic and intelligibility
characteristics of sentence production in neurogenic speech disorders. This study
concluded that the temporal variables typically differentiated the amyotrophic lateral
sclerosis group, but not the Parkinson’s disease groups from the controls and that
vowel spaces were smaller for both neurogenic groups as compared to controls but
only significantly so for the amyotrophic lateral sclerosis speakers.
Snowling, Lervag, Nash & Hulme (2017) conducted a study on the longitudinal relationship between speech perception, phonological skills and reading in children at high risk of dyslexia. This study concluded that there was no significant indirect effect of speech perception on reading via phoneme awareness, suggesting that its effects are separable from those of phoneme awareness.
Johnson & Sjerps (2021) investigated speaker normalization in speech perception. They concluded that auditory spectral analysis and encoding remove some talker differences, and that contrast coding in an auditory/phonetic frame of reference seems to apply before lexical processing begins.
9. CODING OF SPEECH IN THE AUDITORY
PATHWAY
SPEECH CODING-
• The process of transforming the speech signal into a more compressed form.
• The properties of sounds are represented in the normal auditory system in the spatial and temporal patterns of nerve spikes in the
auditory nerve and higher centres of the auditory pathway.
• To be a code, a specific aspect of the neural response pattern should be used by the brain to determine one or more properties of a
stimulus, and changes in that aspect of the response pattern should affect the perception of the stimulus.
10. SPEECH CODING IN AUDITORY NERVE
By the time the acoustic signal is transduced into neural impulses in the auditory
nerve, the following modifications have taken place:
1. Narrow-band filtering (by the cochlea),
2. Half-wave rectification (from the chemical response properties of hair cells), and
3. Low-pass filtering (from the loss of high frequencies due to limits on neural synchrony).
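These three modifications can be sketched as a toy processing chain for one auditory channel. The FFT band-pass mask and the moving-average smoother are crude stand-ins for cochlear filtering and the loss of synchrony to high frequencies, not physiological models:

```python
import numpy as np

FS = 16000
t = np.arange(FS) / FS                                       # 1 s of samples
signal = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)

def fft_bandpass(x, lo, hi, fs=FS):
    """1. Narrow-band filtering: keep only the band one cochlear place passes."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1 / fs)
    X[(f < lo) | (f > hi)] = 0
    return np.fft.irfft(X, len(x))

def half_wave_rectify(x):
    """2. Half-wave rectification: hair cells respond to one direction of motion."""
    return np.maximum(x, 0.0)

def lowpass(x, win=8):
    """3. Low-pass filtering: smoothing stands in for limits on neural synchrony."""
    return np.convolve(x, np.ones(win) / win, mode="same")

# One channel centred near 500 Hz, passed through the three stages in order.
channel = lowpass(half_wave_rectify(fft_bandpass(signal, 400, 600)))
print(channel.min() >= 0)   # True: the rectified, smoothed output is non-negative
```

Running many such channels with different band-pass centres would yield the kind of multi-channel representation the auditory nerve delivers to the brainstem.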
Frequency Coding:
1. Neurons in the centre of the auditory nerve bundle are tuned to low frequencies; neurons near the edge of the bundle are tuned to high frequencies.
2. The spatial mapping of frequency along the cochlea is transformed into a spatial mapping of frequency in the auditory nerve.
3. According to place theories of hearing (Helmholtz, 1863), peaks in the acoustic spectrum of a sound result in peaks of response in the populations of auditory nerve fibres with characteristic frequencies corresponding to the peak frequencies.
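The cochlea's spatial frequency map can be illustrated with the Greenwood (1990) frequency-position function for the human cochlea; the constants below are Greenwood's published human values:

```python
import numpy as np

def greenwood_cf(x):
    """Greenwood (1990) frequency-position function for the human cochlea.
    x is the proportional distance from the apex (0 = apex, 1 = base);
    returns the characteristic frequency (Hz) at that place."""
    A, a, k = 165.4, 2.1, 0.88      # Greenwood's human constants
    return A * (10 ** (a * np.asarray(x, dtype=float)) - k)

# Low frequencies map to the apex, high frequencies to the base.
for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"x = {x:.2f}  CF ~ {float(greenwood_cf(x)):8.1f} Hz")
```

The function spans roughly 20 Hz at the apex to about 20 kHz at the base, matching the human audible range, and it is this monotonic map that the auditory nerve inherits as its tonotopic organization.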
11. SPEECH CODING IN AUDITORY NERVE
Intensity & Temporal Coding
• Sound level is coded in terms of neural firing rate. Loudness may be related to the total spike rate evoked by a sound.
• The relative levels of the different frequency components in complex sounds (such as speech) are also carried in the detailed time pattern of nerve spikes. In response to a sine wave, nerve spikes tend to be phase locked, or synchronized, to the stimulating waveform.
• A given nerve fibre does not necessarily fire on every cycle of the stimulus but, when spikes do occur, they occur at roughly the same phase of the waveform each time. Thus, the time intervals between spikes are (approximately) integer multiples of the period of the stimulating waveform.
• Any change in the spectral composition of a complex sound results in a change in the pattern of phase locking as a function of CF, provided the spectral change is in the frequency region below 4 to 5 kHz.
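The claim that interspike intervals fall at integer multiples of the stimulus period can be illustrated with an idealized (jitter-free) simulation; the 0.3 firing probability per cycle is an arbitrary illustrative value:

```python
import numpy as np

FREQ = 400.0                  # stimulating sine wave (Hz)
PERIOD = 1.0 / FREQ

# Idealized phase locking: the fibre may skip cycles, but every spike falls
# at the same stimulus phase, so spike times sit on a grid of whole periods.
rng = np.random.default_rng(0)
fired_cycles = np.nonzero(rng.random(200) < 0.3)[0]
spike_times = fired_cycles * PERIOD

# Interspike intervals, expressed in units of the stimulus period.
isis = np.diff(spike_times)
multiples = isis / PERIOD
print(np.allclose(multiples, np.round(multiples)))   # True: integer multiples
```

A real fibre adds a little timing jitter, which is why the text says the intervals are only approximately integer multiples; the histogram of these intervals is what reveals the stimulus period to downstream neurons.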
Coding of Pitch of Harmonic & Inharmonic Complex Tones
Pitch may be represented either:
• In the profiles of average discharge rates of auditory nerve fibres as a function of their point of innervation along the cochlea, or
• In the fine temporal patterns of discharge.
Rate place models - a fibre shows a higher discharge rate when its CF coincides with a harmonic of the fundamental frequency of a complex tone than when the CF falls between two harmonics.
Temporal models - prominent interspike intervals occur at the fundamental period in response to complex periodic tones; auditory nerve fibres can also show interspike intervals corresponding to the pitch of inharmonic tones.
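The rate-place account can be sketched with a toy model in which a fibre's driven rate reflects the stimulus energy near its CF. The triangular weighting and 50 Hz bandwidth are illustrative simplifications, not cochlear tuning curves:

```python
import numpy as np

F0 = 200.0                                  # fundamental of a complex tone (Hz)
harmonics = F0 * np.arange(1, 6)            # components at 200, 400, ... 1000 Hz

def rate_profile(cfs, bw=50.0):
    """Driven rate grows with the component energy falling inside a
    triangular band (width bw, in Hz) centred on each fibre's CF."""
    cfs = np.asarray(cfs, dtype=float)
    rates = np.zeros_like(cfs)
    for h in harmonics:
        rates += np.clip(1 - np.abs(cfs - h) / bw, 0, None)
    return rates

# A fibre whose CF sits on a harmonic fires; one between harmonics does not.
print(float(rate_profile([400.0])[0]))   # 1.0: CF on the 2nd harmonic
print(float(rate_profile([500.0])[0]))   # 0.0: CF between harmonics
```

Scanning CFs from low to high would trace a rate profile with peaks at the resolved harmonics, which is the spatial pattern a rate-place pitch model reads out.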
12. SPEECH CODING IN AUDITORY NERVE
Representation of Vowels:
A representation of vowel spectra based on the individual firing rates of ANFs is inadequate for transmitting information known to be
conveyed in humans at levels typical of conversational speech (60–70 dB).
At higher presentation levels, firing rates can saturate, resulting in nearly flat frequency-rate curves that do not resolve frequencies in
complex sounds.
Formant frequencies in vowels are well preserved in a temporal place representation, and this formant structure remains preserved at high stimulus levels, where the rate place representation degrades. Gross spectral features of vowels are well preserved in both representations.
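The saturation effect can be sketched with a toy rate-level function; the threshold, dynamic range and maximum rate below are assumed round numbers, not measured fibre data:

```python
import numpy as np

def firing_rate(level_db, threshold_db=20.0, max_rate=250.0, range_db=20.0):
    """Toy rate-level function: rate climbs over ~20 dB above threshold,
    then saturates. All parameter values are assumed for illustration."""
    drive = np.clip((np.asarray(level_db, dtype=float) - threshold_db) / range_db,
                    0, 1)
    return max_rate * drive

# A formant peak and a trough 10 dB below it, at a moderate and a high level.
for level in (30, 70):
    peak, trough = firing_rate(level), firing_rate(level - 10)
    print(level, float(peak - trough))
```

At the 30 dB level the peak-trough rate difference is large (125 spikes/s here), but at 70 dB both fibres saturate and the difference collapses to zero: the frequency-rate profile goes flat, exactly the failure of the rate representation described above.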
Representation of Stop Consonants:
Much of the information conveyed by speech is carried by consonants, many of which involve rapid spectral changes.
E.g.: stop consonants are characterized by a brief burst of noise at the release of the stop, followed by a rapid formant-frequency transition.
Temporal measures can be used to represent the spectra of the formant transitions in a consonant-vowel syllable as well as in the steady vowel.
Representation of Fricative Consonants:
Unvoiced fricatives generally have their major energy at higher frequencies and are generated by a noise excitation of the vocal tract.
The studies of Delgutte (1980, 1981) on auditory nerve fibre responses to fricatives suggest that fricatives can at least be discriminated on the basis of a rate place code.
Delgutte (1981) has also shown that short-term adaptation effects can be important in the representation of certain fricative features.
13. SPEECH CODING IN COCHLEAR NUCLEAR
COMPLEX
The cochlear nucleus is composed of a variety of different cell types, including pyramidal, octopus, stellate and spherical cells. The major categories of response patterns are:
Primary-like responses are characterized by a high rate of discharge at stimulus onset followed by a gradual decline to a more or less steady response through the remainder of the stimulus.
Onset responses are characterized by a single spike or a brief burst of spikes at stimulus onset, with little or no discharge during the remainder of the stimulus burst.
Chopper responses are characterized by regular fluctuations in response rate that are synchronized with stimulus onset.
Pauser responses give an onset spike followed by a pause before discharge resumes; buildup responses are similar except that the onset spike is missing.
The Onset-S pattern is characterized by an onset burst followed by a gradual decline in activity through the rest of the stimulus burst; the decline in response rate is more rapid than in primary-like units and less rapid than in other onset units.
14. SPEECH CODING IN COCHLEAR NUCLEAR
COMPLEX
The encoding of speech has been
studied most extensively in the AVCN.
Of the two basic synaptic configurations
of auditory nerve inputs to the AVCN,
the bushy cells are located in the
anterior portion of the nucleus.
Anatomic studies have traced the
projections of AVCN bushy cells to the
superior olive, which is believed to be an
important site of binaural processing in
the central auditory system.
Binaural temporal cues are essential for
the accurate localization of low-frequency
sounds and the perception of pitch.
Stellate cells can be found throughout the
more posterior regions of the AVCN.
Primary-like neurons with high
spontaneous rates encode the formant
structure for the vowel at low levels, but
saturation effects degrade such
representation at high levels.
Primary-like neurons with low
spontaneous rates fail to respond to low
levels of stimulation, but excellent peak-
to-trough rate differences are observed
at high vowel levels.
15. SUPERIOR
OLIVARY
COMPLEX
The superior olive is the first place in the auditory pathway where we see binaural cells. The nuclei of the SOC have tonotopic organization; the LSO and MSO appear to have been studied most extensively.
The LSO has a unique tonotopic arrangement, with the higher frequencies located medially. The discharge patterns observed in post-stimulus time histograms of the SOC are varied but, for the most part, would be classified as "chopper" patterns.
LATERAL
LEMNISCUS
Most of the neurons of the dorsal segment of the LL can be activated binaurally; however, most of the neurons of the ventral segment can be activated only by contralateral stimulation.
Brugge et al. reported definite tonotopic organization for both the dorsal and ventral nuclei. In both nuclei, the low frequencies are dorsal and the high frequencies are ventral.
INFERIOR
COLLICULUS
The IC is highly tonotopic, with low frequencies located dorsally and high frequencies progressing in a ventrolateral direction.
Benevento and Coleman classified four different neural populations in the IC:
• Neurons sensitive to interaural intensity differences.
• Neurons sensitive to interaural time differences.
• Neurons sensitive to neither interaural time nor intensity differences.
• Neurons sensitive to both interaural time and intensity differences.
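Sensitivity to interaural time differences can be illustrated with a cross-correlation sketch, conceptually similar to coincidence-detection models of binaural processing. The noise signal and the 250 microsecond delay are synthetic:

```python
import numpy as np

FS = 48000
rng = np.random.default_rng(1)
left = rng.standard_normal(FS // 10)          # 100 ms of broadband noise
DELAY = 12                                    # samples: 12 / 48000 s = 250 us ITD
right = np.concatenate([np.zeros(DELAY), left[:-DELAY]])

# Cross-correlate over physiologically plausible lags (about +/- 1 ms).
max_lag = FS // 1000
lags = np.arange(-max_lag, max_lag + 1)
corr = [np.dot(left[max_lag:-max_lag],
               right[max_lag + k: len(right) - max_lag + k])
        for k in lags]
best_lag = int(lags[int(np.argmax(corr))])
print(best_lag)   # 12: the lag of maximum coincidence recovers the imposed ITD
```

Each lag plays the role of a neuron tuned to one interaural delay; the neuron whose internal delay cancels the acoustic delay fires most, which is how a population of ITD-sensitive cells can signal source direction.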
16. MEDIAL GENICULATE BODY:
Tonotopic organization of the ventral segment of the medial geniculate body is such that
low frequencies are located laterally and high frequencies medially.
The reticular formation appears to play a role in auditory alertness, reflexes and habituation. It also suppresses background noise, allowing concentration on foreground signals.
AUDITORY CORTEX:
The cortex is composed of billions of nerve cells; however, there are primarily only three types: pyramidal, stellate, and fusiform.
There is distinct tonotopic organization in the auditory cortex. The auditory cortex is better
suited to respond to complex than to simple acoustic stimuli.
17. STUDIES
• Hermann, Burkhard & Johnsrude (2022), "Neural signature of regularity in sounds is reduced in older adults": the sensitivity of neural populations in auditory cortex differs between younger and older adults.
• Fitzpatrick, Carrier, Turgeon, Olmstead & McAfee (2022), "Benefits of auditory-verbal intervention for adult cochlear implant users": participants recommended reducing the intensity of intervention to facilitate participation.
• Begus, Zhou & Zhao (2022), "Encoding of speech in convolutional layers and the brain stem based on language experience": the technique can be used to compare encoding between the human brain and intermediate convolutional layers for any acoustic property.
• Johnson & Sjerps (2021), "Speaker normalization in speech perception": listeners maintain a stable representation of acoustic voice properties to provide a frame of reference for further interpretation.
• Preisig, Riecke & Adelman (2021), "Categorical encoding of speech sounds beyond auditory cortices": the emergence of categorical speech sounds implicates decision-making mechanisms and auditory-motor transformations acting on sensory inputs.
18. REFERENCES
1. The handbook of speech perception by: David B, Pisoni and Robert E. Remez (2006).
2. https://psychology.fandom.com/wiki/Speech_perception
3. https://www.sfu.ca/sonic-studio-webdav/cmns/Handbook%20Tutorial/SpeechAcoustics.html
4. http://kunnampallilgejo.blogspot.com/2012/09/acoustic-theory-of-speech-production.html
5. Computers networks and inventive communication (687-702), 2022
6. Journal of positive behaviour intervention 24(1) 69-84, 2022
7. International journal of audiology,1-10, 2022
8. www.scholar.google.com
9. Biorxiv,2021
19. QUESTIONS ASKED IN PREVIOUS
YEARS
1. Describe coding of speech in auditory pathway - 16 Marks (2018, 2012)
2. Explain the coding of speech in different parts of auditory system - 16 Marks
(2011)
3. Short note on categorical perception – 4 Marks (2021, 2014, 2013)
4. Discuss the neurophysiology of speech perception – 16 Marks (2021)
5. Discuss on the physiological representation of speech in the auditory pathway –
16 Marks (2014, 2009)
6. Discuss the coding of speech in the brainstem – 16 Marks (2013, 2011)
7. Short note on coding of speech in cochlea – 4 Marks (2011)