Main Ideas from Unit 9
1.1. The Physics of Speech Production
Let’s start by defining sound as the perceived changes in air pressure detected by
the ear. The formation of any sound requires a vibrating medium to be set in motion by
some kind of energy.
In the case of the human speech mechanism, the function of the vibrator is often
fulfilled by the vocal folds, which are activated by air pressure from the lungs. In addition,
the resonating chambers of the pharynx, mouth and, in some cases, the nasal cavities,
modify any sound produced in the larynx. Once the sound is produced and comes out of
the speaker’s mouth, it’s transmitted through the air and conveyed to the listener’s ears
by means of waves.
Before discussing the nature of speech sound waves, we must first understand
sound wave motion. We’re surrounded by air that consists of many small particles.
These air particles are set into vibration by the action of the vocal tract movements.
Obviously these particles move about rapidly in random directions, but we can explain the
generation and transmission of most sound waves without taking such random motion
into account, by assuming that each particle has some sort of average stable position
from which it is displaced by the passage of a sound wave.
A sound wave is defined as a disturbance moving through a medium such as air,
without permanent displacement of the particles of that medium.
The type of motion that tuning forks undergo is called simple harmonic motion or
sinusoidal motion. It’s useful to study a graph that plots the nature of this simple
vibratory motion. This graph is called a time-domain waveform, or just a waveform, and it
shows the amplitude of displacement as a function of time, i.e., how the amplitude of a
sound signal varies with time.
Tuning forks vibrate sinusoidally and, hence, the waveform of the corresponding
sound wave is also sinusoidal. A sinusoid is periodic because it repeats itself at regular
intervals of time. This simplest periodic waveform can be described by referring to:
a) Amplitude: the absolute value of the maximum displacement from the zero
value during a single period of vibration. In other words, if a mass (air
particles) is displaced from its rest position and allowed to vibrate, the
distance of the mass from its rest position is called displacement, and
the maximum displacement is called the amplitude of the vibration.
b) Frequency: the number of complete cycles that take place in one
second. [When a waveform repeats itself a number of times, each
complete repetition is called a cycle.] It is measured in cycles per second,
but in current usage the term “cycles per second” has been replaced by
“hertz” (abbreviated Hz), in honour of the 19th-century German physicist
Heinrich Hertz, who proved that electricity can be transmitted in
electromagnetic waves, which led to the development of the wireless
telegraph and radio.
c) Period: the time taken to complete one cycle of vibration. In other words,
the duration of a single cycle. From the frequency of vibration we can
calculate the period, as there is a simple relationship between the two:
the period is simply one divided by frequency. So, if the frequency of
vibration is 2 Hz its period would be ½ a second.
Let’s also take a look at other properties of that sinusoid. These are:
d) Phase: the point reached in the cycle of oscillation at a particular
moment. We measure it on a scale from 0 to 360 degrees.
e) Wavelength: the distance the wave travels in one cycle of vibration of the
air particles. One can measure it from any point in a cycle to the
corresponding point in the next cycle. The wavelength depends on the
frequency of vibration and the velocity of sound wave propagation in the
medium (in air, about 344 metres per second). [Wavelength equals
the velocity of sound divided by the frequency: λ = c/f.]
f) Axis: the geometrical line that crosses the wave transversally.
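The relationships between frequency, period, phase and wavelength in the list above can be checked numerically. What follows is a minimal Python sketch, not part of the original unit: the function names are illustrative, the speed of sound is the 344 m/s figure quoted in the text, and the phase function assumes that a cycle starts at time 0.

```python
SPEED_OF_SOUND_M_PER_S = 344.0  # value for air quoted in the text


def period_s(frequency_hz: float) -> float:
    """Period T = 1 / f: duration of one cycle, in seconds."""
    return 1.0 / frequency_hz


def wavelength_m(frequency_hz: float,
                 speed_m_per_s: float = SPEED_OF_SOUND_M_PER_S) -> float:
    """Wavelength = c / f: distance travelled in one cycle, in metres."""
    return speed_m_per_s / frequency_hz


def phase_degrees(time_s: float, frequency_hz: float) -> float:
    """Position within the current cycle on the 0-360 degree scale,
    assuming a cycle starts at time 0 (an assumption of this sketch)."""
    cycles_elapsed = time_s * frequency_hz
    return (cycles_elapsed % 1.0) * 360.0


if __name__ == "__main__":
    print(period_s(2.0))               # period of a 2 Hz vibration
    print(wavelength_m(100.0))         # wavelength of a 100 Hz tone in air
    print(phase_degrees(0.125, 2.0))   # a quarter of the way through a cycle
```

For the 2 Hz example in item (c), `period_s(2.0)` gives half a second, as stated there.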
Tuning forks generate pure tones. A pure tone has one frequency of vibration.
However, most sound sources, unlike the tuning fork, produce vibrations that are
complex, that is, they generate more than one frequency. Fortunately, Jean Baptiste
Joseph Fourier, a French mathematician of the early nineteenth century, showed in 1822
that any non-sinusoidal wave, no matter how complicated, could be represented as the
sum of a number of sinusoidal waves of different frequencies, amplitudes and phases.
Each of these simple sinusoidal waves is called a ‘Fourier component’. This method of
analysing complex waves is called Fourier analysis.
There are two kinds of complex sound waves:
a) Periodic sound waves are those in which the pattern of vibration
repeats itself. They are made up of sinusoids with different frequency,
amplitude and/or phase. The sine wave with the lowest frequency is called the
fundamental, and its frequency the fundamental frequency. The other sinusoids are
called harmonics, and their frequencies are whole-number multiples of the
fundamental frequency.
b) Aperiodic sound waves are those in which the vibration is random and has no
repeatable pattern. They have components at all frequencies, and these
frequencies are not harmonically related as they are for periodic sounds.
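To illustrate how a complex periodic wave is built from sinusoids, here is a short Python sketch. It is only an illustration: the 100 Hz fundamental and the amplitudes of the three components are invented values, and `complex_wave` is a hypothetical helper name.

```python
import math


def complex_wave(t_s, components):
    """Displacement at time t_s of a wave built by summing sinusoids.

    components: list of (frequency_hz, amplitude, phase_rad) tuples.
    """
    return sum(
        amplitude * math.sin(2.0 * math.pi * frequency_hz * t_s + phase_rad)
        for frequency_hz, amplitude, phase_rad in components
    )


# Hypothetical complex periodic wave: a 100 Hz fundamental plus its
# second and third harmonics (whole-number multiples of 100 Hz).
F0_HZ = 100.0
HARMONIC_COMPONENTS = [
    (1 * F0_HZ, 1.00, 0.0),
    (2 * F0_HZ, 0.50, 0.0),
    (3 * F0_HZ, 0.25, 0.0),
]

if __name__ == "__main__":
    t = 0.0013
    one_period = 1.0 / F0_HZ
    # The two printed values agree (up to rounding): the pattern of
    # vibration repeats once per period of the fundamental.
    print(complex_wave(t, HARMONIC_COMPONENTS))
    print(complex_wave(t + one_period, HARMONIC_COMPONENTS))
```

Because every component is a whole-number multiple of the fundamental, the summed wave repeats with the period of the fundamental, which is what makes it periodic.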
With a more complex waveform we cannot determine the frequencies and the
amplitudes of the components by visual inspection of the waveform. Therefore, we
need a graphic alternative to the waveform: the amplitude spectrum in the
frequency domain, often shortened to amplitude spectrum. It shows the frequency
content of the sound signal and how the amplitudes of the components of that signal vary
with frequency. It plots amplitude (in dB) on the vertical axis as a function of
frequency on the horizontal axis.
The amplitude spectrum is called a line amplitude spectrum [or just a line spectrum]
when it shows the components of complex periodic waves represented by a set of lines
[the location of a particular line in the frequency domain (horizontal axis) identifies the
frequency of that component, and the height of the line along the amplitude scale (vertical
axis) identifies the amplitude].
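A line spectrum can be computed from a sampled waveform with a discrete Fourier transform. The sketch below uses only the Python standard library and a deliberately slow, direct transform; the sample rate, the signal and the function name `line_spectrum` are assumptions chosen for illustration, and amplitudes are reported on a linear scale rather than in dB.

```python
import cmath
import math


def line_spectrum(samples, sample_rate_hz):
    """Return {frequency_hz: amplitude} for frequencies below half the sample rate.

    Uses a direct (slow) discrete Fourier transform, which is adequate for
    short illustrative signals.
    """
    n = len(samples)
    spectrum = {}
    for k in range(n // 2):
        total = sum(
            samples[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n)
        )
        # A sinusoid of amplitude A contributes A * n / 2 to its own bin,
        # except at 0 Hz, where the scale factor is n.
        scale = n if k == 0 else n / 2
        spectrum[k * sample_rate_hz / n] = abs(total) / scale
    return spectrum


SAMPLE_RATE_HZ = 1000
N_SAMPLES = 100  # 0.1 s of signal, giving one spectral line every 10 Hz
SIGNAL = [
    1.0 * math.sin(2 * math.pi * 100 * i / SAMPLE_RATE_HZ)
    + 0.5 * math.sin(2 * math.pi * 200 * i / SAMPLE_RATE_HZ)
    for i in range(N_SAMPLES)
]

if __name__ == "__main__":
    for freq, amp in line_spectrum(SIGNAL, SAMPLE_RATE_HZ).items():
        if amp > 0.01:
            print(f"{freq:6.1f} Hz  amplitude {amp:.2f}")
```

The only lines that stand out are at 100 Hz and 200 Hz, with heights matching the amplitudes of the two components that were summed.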
The continuous amplitude spectrum, or just continuous spectrum, represents complex
aperiodic waves in which energy is present at all frequencies between certain frequency
limits. Thus, we no longer draw a separate line for each component, but a single curve.
The height of this curve at any frequency represents the energy in the wave near that
frequency.
Complex waves can also be analysed and displayed by a spectrogram, whose value
lies in its ability to depict the rapid variations that characterise the speech signal.
It shows how the frequency components of a signal change over time with a
three-dimensional display: frequency is shown on the vertical axis, time on the horizontal axis,
and the energy at any frequency level either by the density of the blackness in a black and
white display, or by colours in a colour display.
A researcher can choose the type of information he wishes to see displayed: the
harmonic structure (which is ideal for studying the sound source, i.e., the fundamental
frequency and harmonics of the glottal source) or the formant structure (which is ideal for
studying the filter or system, that is, the resonances of the vocal tract). So, in general the
choice is between a narrowband spectrogram, which shows the harmonic structure, and a
wideband spectrogram, which shows the formant structure, depending on the filter
bandwidth settings on the analysis device.
There are several basic things to look for when reading a spectrogram:
1. A blank space in the patterns means that the sound has ceased; if it lasts
more than about 150 ms this is due to a pause on the part of the speaker, but
many shorter silences occur which are due to the articulation of plosive
consonants.
2. The second distinction is between periodic and aperiodic sounds, in phonetic
terms between voiced and voiceless sounds. We can detect successive openings
and closings of the vocal folds by the vertical striations in the spectrogram.
The distance between any two successive striations will in fact give us the period of the
vocal fold vibration, and by taking the inverse of this we can arrive at the
fundamental frequency.
3. The formants are well-defined bars throughout most of the sequence. The
formants are the resonance frequencies of the vocal tract.
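The first two reading rules above lend themselves to a small worked example. The Python sketch below is illustrative only: the function names and the striation times are invented, and the 150 ms figure is the approximate threshold given in the text, not a hard boundary.

```python
PAUSE_THRESHOLD_MS = 150.0  # approximate figure given in the text


def classify_silence(duration_ms, pause_threshold_ms=PAUSE_THRESHOLD_MS):
    """Label a blank stretch of the spectrogram by its duration."""
    if duration_ms > pause_threshold_ms:
        return "pause"
    return "possible plosive closure"


def vibration_frequency_hz(striation_times_s):
    """Frequency of vocal fold vibration from successive striation times.

    The mean spacing between striations is taken as the period, and the
    frequency is its inverse.
    """
    intervals = [
        later - earlier
        for earlier, later in zip(striation_times_s, striation_times_s[1:])
    ]
    mean_period_s = sum(intervals) / len(intervals)
    return 1.0 / mean_period_s


if __name__ == "__main__":
    print(classify_silence(80.0))
    print(classify_silence(200.0))
    # Hypothetical striations measured 8 ms apart
    print(vibration_frequency_hz([0.000, 0.008, 0.016, 0.024]))
```

Striations 8 ms apart correspond to a period of 0.008 s, and therefore to vocal fold vibration at about 125 Hz.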
1.2. Acoustic Theory of Speech Production
We move on now from the physics of speech production to sound generation: the
Acoustic Theory of Speech Production.
When we partially understand a system, such as the speech production mechanism, we
sometimes make a model of it that is a simplification of the system. By testing the model
under various conditions to see whether it behaves like the system we may learn
something about the system.
Understanding the acoustic theory of speech production paves the way for a
discussion of speech analysis. Knowing the ways in which speech sounds are formed
helps, firstly, to determine appropriate analysis methods and measurements. Secondly, it
helps in relating acoustic measures for a sound segment to the underlying articulation of
that segment in a quantitative and coherent manner. In this sense, the acoustic theory of
speech production is central to the analysis of speech.
According to the Acoustic Theory of Speech Production (based on the models
proposed by Chiba and Kajiyama in 1941 and by Gunnar Fant in 1960), the acoustic
behaviour and properties of the human vocal tract in speech production are traditionally
considered in terms of a source and filter model. In the light of this model, the speech
signal can be viewed acoustically as the result of the properties of the sound source,
modified by the properties of the vocal tract that functions as a filter. Both the properties
of the source and those of the vocal tract can vary –and indeed do continuously vary
during speech. What’s more, they can be independently varied.
The use of this model of source and filter provides a convenient functional division
of the mechanisms that are active in the process of generating speech sounds.
- The Source
The speech sounds that we know as vowels, diphthongs, semivowels and nasals use
a voice source. But the vocal folds can also be used to generate an aperiodic or noise source,
which is the basis for whispered speech. Aperiodic sources can also be generated at
various locations within the supraglottal vocal tract, for example, by forcing the airstream
through a constriction formed by the articulators, as in any voiceless fricative.
We can also have combined sound sources. The periodic source produced by the
vibration of the vocal folds can be combined with the noise source produced by
channelling the airflow through a constriction, as is heard in voiced fricative sounds.
- The filter
As we have already said, the resonating chambers of the pharynx, mouth and, in
some cases, the nasal cavities, modify any sound produced in the larynx.
Resonance is an important concept that needs to be explained. It is the term used
to describe the phenomenon of one system being set into motion by the vibrations of
another: one system causes a second system to ‘resonate’, and the second system is called the
‘resonator’. Everything that vibrates has a natural frequency of vibration, like a tuning
fork, or, in some cases, has several natural frequencies of vibration, as happens with the vocal
tract.
The vocal tract has been defined as a variable resonator. The large resonating cavities
are the pharynx, the mouth or oral cavity, and when the velum is lowered, the nasal
cavity. But the air spaces between the lips, between the teeth and the cheeks, and within
the larynx are also resonators. The most remarkable characteristic of the human vocal
resonators is that their shape and areas can be varied by the movements of the
articulators. That is why it is very difficult to describe the acoustics of speech production.
This inherent difficulty led to a number of theories being proposed to describe speech
production, like the linear source-filter theory.
During vowel production the vocal tract approximates a tube closed at one end
(glottis) and open at the other end (lips). The lowest natural frequency at which such a
tube resonates will have a wavelength 4 times the length of the tube. Therefore, if the
human vocal tract is about 17 cm long, the wavelength of the lowest resonant frequency
at which the air within such a tube would naturally vibrate would be 68 cm. From the
wavelength we can determine the frequency of vibration (f = c/λ = 344/0.68), which would
be about 500 Hz (506 Hz to be precise). The values of the other resonant frequencies will be
odd-number multiples of this lowest resonant frequency: about 1500 Hz, 2500 Hz, etc.
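The arithmetic of the uniform-tube model can be reproduced in a few lines of Python. This is a sketch under the assumptions stated in the text (a uniform tube closed at one end, 17 cm long, with sound travelling at 344 m/s); the function name `tube_resonances_hz` is hypothetical.

```python
def tube_resonances_hz(length_m, count=3, speed_m_per_s=344.0):
    """Resonances of a uniform tube closed at one end and open at the other.

    The lowest resonance has a wavelength of 4 * length; the higher ones are
    odd-number multiples (3, 5, ...) of the lowest.
    """
    lowest_hz = speed_m_per_s / (4.0 * length_m)
    return [(2 * n - 1) * lowest_hz for n in range(1, count + 1)]


if __name__ == "__main__":
    for resonance_hz in tube_resonances_hz(0.17):
        print(round(resonance_hz))
```

For a 17 cm tube the lowest resonance comes out at about 506 Hz, and the next two lie near 1500 Hz and 2500 Hz. A shorter tube, such as a child's vocal tract, would give higher values throughout.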
Tubes, then, resonate naturally at certain frequencies when energised, and those
frequencies depend upon the configuration and the length of the tube. A human vocal
tract is similar to the sort of acoustic resonator I’ve been describing. However, it shows an
important difference: the vocal tract is not a tube with a uniform cross-sectional area
throughout its length [apart from having soft and absorbent walls that produce damping].
Therefore, its resonant frequencies are no longer evenly spaced. They are spaced irregularly
depending on the exact shape of the tube. The resonances of the vocal tract are called
formants.
When the vocal fold pulses are applied at one end of the vocal tract, which resonates
at certain frequencies depending on its configuration, the resulting sound will be a
product of the two. [If we want to be rigorous with the terminology, when the glottal
source with its many harmonics (termed the source function) is filtered according to the
frequency response (see footnote 1) of the vocal tract (termed the transfer function), the harmonics of the
glottal sound wave which are at or near the spectral peaks of the transfer function of the
vocal tract are resonated, and those distant from the resonant frequencies of the tract
lose energy and are greatly attenuated.] Therefore, the resonant frequencies of the sound
that emerges at the end of the tract (the lips) will be determined by the configuration of
the vocal tract, whereas the harmonics will be the same as those of the sound at the source (the
glottis).
Now we can understand that, according to the source and filter model, the vocal
tract functions as a frequency-selective filter.
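The idea of the output being a product of source and filter can be sketched as follows. Everything specific in this Python example is an assumption made for illustration: the bell-shaped `resonance_gain` curve is a hypothetical stand-in for a real vocal tract transfer function, the source harmonics are all given the same amplitude, and the formant values are those of the uniform tube discussed earlier.

```python
def resonance_gain(frequency_hz, formants_hz, bandwidth_hz=100.0):
    """Hypothetical transfer function: gain is 1.0 at a formant and falls
    off with distance from the nearest one."""
    return max(
        1.0 / (1.0 + ((frequency_hz - formant_hz) / bandwidth_hz) ** 2)
        for formant_hz in formants_hz
    )


def filter_source(f0_hz, formants_hz, n_harmonics=30, source_amplitude=1.0):
    """Output spectrum {harmonic_hz: amplitude} = source amplitude * filter gain."""
    return {
        k * f0_hz: source_amplitude * resonance_gain(k * f0_hz, formants_hz)
        for k in range(1, n_harmonics + 1)
    }


if __name__ == "__main__":
    formants = [500.0, 1500.0, 2500.0]
    for f0 in (100.0, 125.0):
        output = filter_source(f0, formants)
        strongest = sorted(output, key=output.get, reverse=True)[:3]
        print(f0, sorted(strongest))
```

Changing the fundamental from 100 Hz to 125 Hz changes which harmonics are present in the output, but the strongest ones still sit at or near the same resonant frequencies. This is the sense in which the harmonics come from the source and the spectral peaks come from the filter.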
2. Auditory Phonetics
2.1. Hearing Range
We started by defining sound as the perceived changes in air pressure detected by
the ear. This implies that our ears are sensitive to some of the changes in the air pressure
around us, but not to all of them.
People differ in the frequency range to which their ears are tuned, but in general a young,
healthy human ear can hear sound waves whose frequencies lie between about 20 Hz and 20,000
Hz (20 kHz). Sound waves at much higher frequencies do exist, but they are inaudible to
human beings. [“Dog whistles” generate ultrasonic sounds whose sound waves are heard
by dogs but are inaudible to us. Those too low in frequency to be heard are called
infrasonic sounds.]
The intensity at which a sound is just distinguishable from silence is called the
absolute threshold of hearing and the intensity at which people begin to feel pain as well
as hear it is the threshold of feeling. The useful range of hearing for any individual,
although it varies with frequency, is usually taken to be the area between the person’s
absolute threshold and the threshold of feeling. More than that limit it will cause physical
[Footnote 1: The frequency response graph shows how the vocal tract affects the amplitudes (in dB) of the
frequency components of any signal that passes through it.]
Most quantities are measured in terms of fixed units. When we measure sound
intensity we use watts per square centimetre. However, most of the time it is more convenient
to measure intensities using a decibel scale. The decibel [literally one-tenth of a bel, a
unit named in honour of Alexander Graham Bell (1847-1922), the American inventor
of the telephone and educator of the deaf] refers to a ratio between two intensities. For example, 0 dB means that
the two sound intensities are the same. 1 dB means that the higher intensity is 26%
greater than the lower one. Similarly, a sound intensity of 2 dB is 26% greater than that
of 1 dB, and so on.
So, a 1 dB step in intensity always corresponds to a fixed percentage change
regardless of the base from which we start. Thus, a 1 dB step at the threshold of hearing
is a very, very small change. But at a normal conversational speech intensity, where the
sound level is 60 dB (an intensity one million times greater than at the threshold), the 26%
intensity change for 1 dB is 1 million times greater than the 1 dB change at the hearing threshold.
As it happens, the dB logarithmic scale is very convenient as it allows us to
compress the enormous range of intensities into manageable proportions. The strongest
sound we can hear without pain is as much as 10 million million times greater in
intensity than a just audible sound. This huge intensity ratio corresponds to 130 dB.
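The figures quoted in this subsection follow from the standard definition of the decibel as ten times the base-10 logarithm of an intensity ratio. The Python sketch below, with illustrative function names, reproduces them.

```python
import math


def intensity_ratio_to_db(ratio):
    """Decibels corresponding to a ratio of two sound intensities."""
    return 10.0 * math.log10(ratio)


def db_to_intensity_ratio(level_db):
    """Intensity ratio corresponding to a level difference in decibels."""
    return 10.0 ** (level_db / 10.0)


if __name__ == "__main__":
    print(db_to_intensity_ratio(0))      # equal intensities
    print(db_to_intensity_ratio(1))      # about 1.26, i.e. 26% greater
    print(db_to_intensity_ratio(60))     # one million
    print(intensity_ratio_to_db(1e13))   # 10 million million, about 130 dB
```

Each additional decibel multiplies the intensity by the same factor of roughly 1.26, which is why the scale compresses a ratio of 10 million million into only 130 steps.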
2.2. Physical vs. Subjective Qualities of Speech Sounds
So far, we have talked about the intensity and the frequency of tones. Both are
physical characteristics of sound and can easily be measured in the laboratory.
Corresponding to these objective physical characteristics, which are inherent in the sound
wave itself and can be measured independently of any human observer, are the subjective
qualities of loudness and pitch, which are characteristics of the sensations evoked in a
human listener and cannot be measured without a live listener.
Frequency relates to pitch. When the frequency of vibration is increased, we hear
a rise in pitch, and when the frequency of vibration is decreased, we hear a lowering of
pitch, although the relationship is not linear. How do listeners judge the pitch of a
complex sound that contains more than one frequency? The pitch is judged by listeners to
correspond to the fundamental frequency.
But the pitch of a sound depends not only on the frequency of vibration but also on
the length, tension and width of the vocal folds.
The intensity of a signal is directly related to its loudness. As intensity is
increased the sound is judged by listeners to be louder.
Voice quality is the subjective quality primarily connected with the number,
distribution and intensity of the component frequencies of the complex tone. This
objective quality accounts in part for the differences of voice quality by which we are able
to recognize speakers, but also for the distinctions of timbre between vowels or
Finally, length is related to the duration of sounds. Again, there is not a total
correspondence between them as the duration between different sounds may vary greatly
in absolute terms but we can only appreciate differences (between short and long) that are
3. Acoustic Cues to Speech Perception
To summarise the wealth of information on acoustic cues to the perception of
speech segments, it may be helpful to divide the cues into those important to the
perception of manner, place and voicing distinctions.
To identify the MANNER of a speech sound, listeners determine whether the
sound is harmonically structured with no noise (which signals vowels, approximants or
nasals) or whether the sound contains an aperiodic component (which signals stops,
fricatives or affricates). The periodic, harmonically structured classes present acoustic
cues in energy regions that are relatively low in frequency. In contrast, the aperiodic,
noisy classes of speech sounds are cued by energy that is relatively high in frequency.
How does the listener further separate the harmonically structured vowels, nasals
and approximants? The main manner cues available are the relative intensity of formants
and formant frequency changes:
1. The nasal consonants have formants that contrast strongly in intensity with
those of the neighbouring vowels. [The nasal formants are weakened because
of the antiresonances produced in the nasal cavity]. In addition, there is the
distinctive low frequency resonance, the nasal murmur below 500 Hz.
2. Approximants display formants that glide from one frequency to another
compared to the relatively more steady state formants of vowels and nasals.
Their transitions are more rapid than those of any diphthong.
3. All English vowel sounds in normal speech are voiced. Therefore, the
spectrograms will show vertical striations throughout the stretch
corresponding to the vowel sound. There is also the presence of a voice bar
below 250 Hz, related to the presence of vibration of vocal folds.
During their production, the vocal tract exhibits at least two and usually
three well-marked resonances, the vowel formants. The most important cues
for the perception of vowels lie in the frequencies and patterning of the
speaker’s formants. As one would suspect, there appear to be certain
relationships between the formants of the vowels and the cavities of the vocal
tract:
1. First formant (F1): The first formant is the most sensitive to changes in
mouth opening. There is a direct relation between the frequency of F1
and the overall opening of the vocal tract –measured by the distance
between the highest point of the body of the tongue and the closest part
of the palate-: the wider the overall opening, the higher the frequency of
F1, and vice versa.
2. Second formant (F2): It is the most affected by changes within the oral
cavity. There is a direct relation between back-and-up retracting of the
tongue and F2 frequency lowering, and vice versa. Besides, the second
formant seems to be inversely related to the length of the front cavity:
there is a lowering of the F2 as the front cavity gets bigger, and vice
versa.
3. Lip rounding: There is also a direct relation between lip rounding and
formant frequency lowering: the more the lips are rounded (and
protruded) the more the frequencies of the formants are lowered, and
vice versa.
In conclusion, if we follow a progression from close to open vowels, there is
a marked gradual approximation of the frequencies of F1 and F2 (rising of the F1
and lowering of the F2). With the shift from front to back articulation, F1 and F2
come much closer together. They move backwards together.
4. Diphthongs are described as vowel sounds of changing resonance, so it is not
surprising to find that relatively rapid changes in the formants of the vowels
are sufficient cues to diphthong perception. This is also the case of three
One manner cue for the classes of sounds having an aperiodic component (the
stops, fricatives and affricates) is the duration of the noise, which is transient for stops,
but lasts longer for affricates and lasts the longest for fricatives.
The presence of noise is the principal acoustic cue to the perception of fricative
manner. Other cues to the stop manner of articulation are the presence of silence,
relatively rapid formant transitions and a release-burst just before the vowel. The burst is
very short when there is no aspiration, and it appears in the spectrogram as an explosion
bar just before the vowel. Affricates are stops with a fricative release, so they contain the
acoustic cues to perception that are found in both stops and fricatives.
The acoustic cues for PLACE of articulation depend more upon a single parameter
of sound frequency.
1. For vowels and approximants, the formant relationships serve to indicate
tongue placement, mouth opening, and vocal tract length. Approximant
production is mainly reflected in the frequency changes in F2. The semivowel
/j/ begins with the highest F2, with /r/ and /l/ in the middle frequencies, and
/w/ relatively low. F3 serves to contrast the acoustic results of tongue tip
placement for /r/ and /l/.
2. For stops, fricatives and affricates, two prominent acoustic cues to place of
articulation are the F2 transitions into neighbouring vowels and the frequency
of the noise components. In general, a second formant transition with a low
frequency locus (see footnote 2) cues a labial sound, one with a higher locus cues an alveolar
percept, and one with a varied, vowel-dependent locus cues a palatal or velar
percept. The F2 transition is also used to cue the difference between the
non-sibilant fricatives. As far as the frequency of the noise components is
concerned, low frequency spectral peaks [which would correspond to the vowel
formants] cue labial percepts, high frequency peaks cue alveolar percepts, and
mid-frequency range peaks cue palatal or velar percepts.
Finally, the acoustic cues for consonant VOICING depend more upon relative
durations and timing of events than upon frequency or intensity differences, with the
exception of the presence or absence of the voice bar (glottal pulsing):
1. To recognise this contrast in stops, listeners use several cues: the presence or
absence of voice bar and aspiration, the duration of the silence marking the
stop closure (shorter in voiced stops) and the timing of the initiation of
phonation relative to stop release (VOT) (shorter in voiced stops).
2. Fricatives and affricates are perceived as voiceless when the frication is of
relatively long duration, and, in the case of affricates, when the closure
duration is also relatively long.
3. The duration of the preceding vowel can cue the voicing contrast of a following
consonant: a voiced consonant will be perceived when the preceding vowel is
relatively long in duration, and a voiceless consonant will be perceived when
the duration of the preceding vowel is relatively short.
[Footnote 2: The locus is the frequency area to which all the transitions point.]
One thing which spectrographic analysis makes abundantly clear is that the
acoustic aspect of speech is one of continuous change, as indeed it must be since it is the
fruit of continuous movement on the part of the speech mechanism. This can be seen by
analysing the suprasegmental or prosodic features of speech, including intonation,
stress and juncture, which usually occur simultaneously with two or more segmental phonemes.
They are overlaid upon syllables, words, phrases and sentences.
The physical features of fundamental frequency, amplitude and duration are
perceived by listeners in terms of variations and contrasts in pitch, loudness and length.
It is important to distinguish between them, as linguistic percepts such
as intonation or stress do not have a simple and direct representation in the acoustical
signal.
1. Stress: there are three acoustic characteristics associated with heavily stressed
syllables: (a) They have a higher fundamental frequency; (b) They are of greater duration
and (c) they are of greater intensity than weakly stressed syllables.
Although they all contribute to the perception of stress, they do not do so equally.
Fundamental frequency is the most powerful cue to stress, followed by duration and
intensity.
2. The perception of Intonation requires the ability to track pitch changes, as
what we perceive as intonation are the pitch changes caused by the alterations in the rate
of vocal fold vibration. In acoustic terms these changes manifest themselves as alterations
to the fundamental frequency of the speech signal. Low pitched and high pitched sounds
can be distinguished from each other in the spectrogram by the different spacing of their
harmonics. Because the harmonics are whole-number multiples of the fundamental frequency,
the wider spacing of the harmonics indicates a high-pitched sound compared
to the more closely spaced harmonics of a low-pitched one. Therefore, we could distinguish a
rising intonation from a falling intonation. The end of a declarative sentence, for example,
will be marked by a decrease in the fundamental frequency and in intensity. It is possible
to measure the fundamental frequency from spectrograms by calculating it from the
harmonics, but this is not easy and is limited in terms of time.
3. The prosodic feature of internal juncture (marking the difference between “a
name” and “an aim”) can be cued by a number of acoustic features, such as silence, vowel
lengthening and the presence of voicing or aspiration. If the sequence is articulated so
that the [n] is more closely linked to the preceding schwa than to the following diphthong,
the [n] will be of greater duration. In addition, speakers often insert a glottal stop before the
diphthong, inhibiting coarticulation between the nasal and the following vocalic sequence.
4. Finally, sounds may appear to be of different length. Clearly, it is possible to
measure the duration of sounds or syllables in the spectrogram. Speech sounds vary in
intrinsic duration. In general, diphthongs and tense vowels are described as “intrinsically”
long sounds. The lax vowels, in contrast, are described as intrinsically short sounds. The
continuant consonants (fricatives, nasals, semivowels) are longer than the burst of the
plosives, etc. However, even when the duration can be measured, variations of duration in
acoustic terms may not correspond to our linguistic judgements of length. Absolute
duration values vary according to many factors such as the speaking rate, the stress, the
voicing setting, etc., but in the English system no more than two degrees of length are ever
significant and all absolute durations will be interpreted in terms of this relationship.
Then, what is more important is to realise that the variations of length perceived
in any utterance constitute one manifestation of the rhythmic pattern which is
characteristic of English and is so fundamentally different from the flow of other
languages, such as Spanish, where syllables tend to be of much more even length.