Unit 9 Summary 2009


Unit about Acoustics

Main Ideas from Unit 9

1. Acoustic Phonetics

1.1. The Physics of Speech Production

Let's start by defining sound as the perceived changes in air pressure detected by the ear. The formation of any sound requires a vibrating medium to be set in motion by some kind of energy. In the case of the human speech mechanism, the function of vibrator is often fulfilled by the vocal folds, which are activated by air pressure from the lungs. In addition, the resonating chambers of the pharynx, the mouth and, in some cases, the nasal cavities modify any sound produced in the larynx. Once the sound is produced and leaves the speaker's mouth, it is transmitted through the air and conveyed to the listener's ears by means of waves.

Before discussing the nature of speech sound waves, we must first understand sound wave motion. We are surrounded by air, which consists of many small particles. These air particles are set into vibration by the action of the vocal tract movements. These particles move about rapidly in random directions, but we can explain the generation and transmission of most sound waves without taking such random motion into account, assuming instead that each particle has an average stable position from which it is displaced by the passage of a sound wave. A sound wave is defined as a disturbance moving through a medium such as air, without permanent displacement of the particles.

The type of motion that tuning forks undergo is called simple harmonic motion or sinusoidal motion. It is useful to study a graph that plots this simple vibratory motion. This graph is called a time-domain waveform, or just waveform, and it shows the amplitude of displacement as a function of time, i.e., how the amplitude of a sound signal varies with time. Tuning forks vibrate sinusoidally and, hence, the waveform of the corresponding sound wave is also sinusoidal. A sinusoid is periodic because it repeats itself at regular intervals of time.
This, the simplest periodic waveform, can be described by referring to three parameters:

a) Amplitude: the absolute value of the degree of displacement from a zero value during a single period of vibration. In other words, if a mass (air particles) is displaced from its rest position and allowed to vibrate, the distance of the mass from its rest position is called displacement, and the maximum displacement is called the amplitude of the vibration.

b) Frequency: the number of complete cycles that take place in one second. [When a waveform repeats itself a number of times, each complete repetition is called a cycle.] It is measured in cycles per second, but in current usage the term "cycles per second" has been renamed "hertz" (abbreviated Hz) in honour of the 19th-century German physicist Heinrich Hertz, who proved that electricity can be transmitted in electromagnetic waves, a discovery that led to the development of the wireless telegraph and radio.

c) Period: the time taken to complete one cycle of vibration; in other words, the duration of a single cycle. From the frequency of vibration we can calculate the period, as there is a simple relationship between the two: the period is simply one divided by the frequency (T = 1/f). So, if the frequency of vibration is 2 Hz, its period is half a second.

Let's also take a look at other properties of the sinusoid:

d) Phase: the point in the cycle of oscillation at a particular moment. We measure it on a scale from 0 to 360 degrees.

e) Wavelength: the distance the wave travels in one cycle of vibration of the air particles. One can measure it from any point in a cycle to the corresponding point in the next cycle. The wavelength depends on the frequency of vibration and the velocity of sound wave propagation in the medium (in the case of air, 344 metres per second): the wavelength equals the velocity of sound divided by the frequency (λ = c/f).

f) Axis: the geometrical line that crosses the wave transversally.

Tuning forks generate pure tones. A pure tone has one frequency of vibration. However, most sound sources, unlike the tuning fork, produce vibrations that are complex; that is, they generate more than one frequency.
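The period and wavelength relations above are easy to verify numerically. A minimal sketch (the 344 m/s figure is the speed of sound in air used in the text; the function names are just illustrative):

```python
SPEED_OF_SOUND = 344.0  # m/s in air, as given in the text

def period(frequency_hz):
    """Period T = 1/f, in seconds."""
    return 1.0 / frequency_hz

def wavelength(frequency_hz, c=SPEED_OF_SOUND):
    """Wavelength = c/f, in metres."""
    return c / frequency_hz

# A 2 Hz vibration has a period of half a second, as stated above.
print(period(2.0))        # 0.5
# A 100 Hz tone travelling in air has a wavelength of 3.44 m.
print(wavelength(100.0))  # 3.44
```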
Fortunately, Jean Baptiste Joseph Fourier, a French mathematician of the early nineteenth century, showed in 1822 that any non-sinusoidal wave, no matter how complicated, can be represented as the sum of a number of sinusoidal waves of different frequencies, amplitudes and phases. Each of these simple sinusoidal waves is called a 'Fourier component', and this method of analysing complex waves is called Fourier analysis.

There are two kinds of complex sound waves:

a) Periodic sound waves: those in which the pattern of vibration repeats itself. They are made up of sinusoids of different frequency, amplitude and/or phase. The sinusoid with the lowest frequency is called the fundamental frequency. The other sinusoids are called the harmonics, which are whole-number multiples of the fundamental frequency.

b) Aperiodic sound waves: those in which the vibration is random and has no repeatable pattern. They have components at all frequencies, and these frequencies are not harmonically related as they are for periodic sounds.

With a more complex waveform we cannot determine the frequencies and amplitudes of the components by visual inspection of the waveform. We therefore need another graphic alternative to the waveform: the amplitude spectrum in the frequency domain, often shortened to amplitude spectrum. It shows the frequency content of the sound signal and how the amplitudes of the components of that signal vary with frequency. The axes are amplitude (in dB) as a function of frequency. The amplitude spectrum is called a line amplitude spectrum [or just line spectrum] when it shows the components of complex periodic waves represented by a set of lines [the location of a particular line along the frequency axis (horizontal) identifies the frequency of that component, and the height of the line along the amplitude scale (vertical axis) identifies its amplitude]. The continuous amplitude spectrum, or just continuous spectrum, represents complex aperiodic waves in which energy is present at all frequencies between certain frequency limits. Thus, we no longer draw a separate line for each component but a single curve, whose height at any frequency represents the energy in the wave near that frequency.

Complex waves can also be analysed and displayed by a spectrogram, whose value lies in its ability to depict the rapid variations that characterise the speech signal.
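Fourier's idea can be sketched in a few lines: summing a fundamental and some whole-number harmonics yields a complex wave that repeats with the period of the fundamental. This is only a toy synthesis with arbitrary illustrative amplitudes, not an analysis method:

```python
import math

def complex_wave(t, f0, harmonics):
    """Sum sinusoidal components. `harmonics` maps each whole-number
    multiple of f0 to an (amplitude, phase) pair."""
    return sum(a * math.sin(2 * math.pi * n * f0 * t + ph)
               for n, (a, ph) in harmonics.items())

# Fundamental 100 Hz plus 2nd and 3rd harmonics (arbitrary amplitudes).
parts = {1: (1.0, 0.0), 2: (0.5, 0.0), 3: (0.25, 0.0)}

# The sum repeats with the period of the fundamental, T = 1/100 s = 0.01 s:
t = 0.0037
same = abs(complex_wave(t, 100.0, parts) -
           complex_wave(t + 0.01, 100.0, parts)) < 1e-9
print(same)  # True
```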
It shows how the frequency content of a signal changes over time in a three-dimensional display: frequency is shown on the vertical axis, time on the horizontal axis, and the energy at any frequency level either by the density of the blackness (in a black-and-white display) or by colours (in a colour display). A researcher can choose the type of information to be displayed: the harmonic structure (ideal for studying the sound source, i.e., the fundamental frequency and harmonics of the glottal source) or the formant structure (ideal for studying the filter or system, that is, the resonances of the vocal tract). So, in general, the choice is between a narrowband spectrogram and a wideband spectrogram, depending on the filter bandwidth settings of the analysis device.
There are several basic things to bear in mind when reading a spectrogram:

- A blank space in the patterns means that the sound has ceased. If it lasts more than about 150 ms, this is due to a pause on the part of the speaker, but many shorter silences occur which are due to the articulation of plosive consonant sounds.
- The second distinction is between periodic and aperiodic sounds, in phonetic terms between voiced and voiceless sounds. We can detect successive openings and closings of the vocal folds by the vertical striations in the spectrogram. The distance between any two striations gives us the period of the vocal fold vibration, and by taking its inverse we arrive at the frequency.
- The formants are well-defined bars throughout most of the sequence. The formants are the resonance frequencies of the vocal tract.

1.2. Acoustic Theory of Speech Production

We move on now from the physics of speech production to sound generation: the Acoustic Theory of Speech Production. When we only partially understand a system, such as the speech production mechanism, we sometimes make a model of it, that is, a simplification of the system. By testing the model under various conditions to see whether it behaves like the system, we may learn something about the system.

Understanding the acoustic theory of speech production paves the way for a discussion of speech analysis. Knowing the ways in which speech sounds are formed helps, firstly, to determine appropriate analysis methods and measurements; secondly, it helps in relating acoustic measures for a sound segment to the underlying articulation of that segment in a quantitative and coherent manner. In this sense, the acoustic theory of speech production is central to the analysis of speech.
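The striation arithmetic just mentioned is a one-line calculation: the inverse of the time between two successive glottal striations is the fundamental frequency. A sketch (the 8 ms spacing below is only an illustrative value):

```python
def f0_from_striation_spacing(spacing_seconds):
    """Fundamental frequency as the inverse of the glottal period,
    where the period is read off as the spacing between two
    successive vertical striations on a wideband spectrogram."""
    return 1.0 / spacing_seconds

# Striations 8 ms apart imply vocal folds vibrating at about 125 Hz.
print(f0_from_striation_spacing(0.008))  # ~125 Hz
```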
According to the Acoustic Theory of Speech Production (based on the models proposed by Chiba and Kajiyama in 1941 and by Gunnar Fant in 1960), the acoustic behaviour and properties of the human vocal tract in speech production are traditionally considered in terms of a source-and-filter model. In the light of this model, the speech signal can be viewed acoustically as the result of the properties of the sound source, modified by the properties of the vocal tract, which functions as a filter. Both the properties of the source and those of the vocal tract can vary, and indeed do vary continuously during speech; what is more, they can be varied independently. This source-and-filter model provides a convenient functional division of the mechanisms that are active in the process of generating speech sounds.
- The Source

The speech sounds that we know as vowels, diphthongs, semivowels and nasals use a voice source. But the vocal folds can also be used to generate an aperiodic or noise source, which is the basis for whispered speech. Aperiodic sources can also be generated at various locations within the supraglottal vocal tract, for example by forcing the airstream through a constriction formed by the articulators, as in any voiceless fricative. We can also have combined sound sources: the periodic source produced by the vibration of the vocal folds can be combined with the noise source produced by channelling the airflow through a constriction, as is heard in voiced fricative sounds.

- The Filter

As we have already said, the resonating chambers of the pharynx, the mouth and, in some cases, the nasal cavities modify any sound produced in the larynx. Resonance is an important concept that needs to be explained. It is the term used to describe the phenomenon of one system being set into motion by the vibrations of another: one system causes a second system to 'resonate', and the second system is called the 'resonator'. Everything that vibrates has a natural frequency of vibration, like a tuning fork, or, in some cases, several frequencies of vibration, as happens with the vocal tract.

The vocal tract has been defined as a variable resonator. The large resonating cavities are the pharynx, the mouth or oral cavity and, when the velum is lowered, the nasal cavity. But the air spaces between the lips, between the teeth and the cheeks, and within the larynx are also resonators. The most remarkable characteristic of the human vocal resonators is that their shapes and areas can be varied by the movements of the articulators. That is why it is very difficult to describe the acoustics of speech production. This inherent difficulty led to a number of theories being proposed to describe speech production, like the linear source-filter theory.
During vowel production the vocal tract approximates a tube closed at one end (the glottis) and open at the other (the lips). The lowest natural frequency at which such a tube resonates has a wavelength four times the length of the tube. Therefore, if the human vocal tract is about 17 cm long, the wavelength of the lowest resonant frequency at which the air within such a tube would naturally vibrate would be 68 cm. From the wavelength we can determine the frequency of vibration, which would be about 500 Hz (506 Hz to be precise). The other frequency components will be odd-number multiples of this lowest resonant frequency: 1500 Hz, 2500 Hz, etc. So tubes resonate naturally at certain frequencies when energised, and those frequencies depend upon the configuration and the length of the tube.

A human vocal tract is similar to the sort of acoustic resonator just described. However, it shows an important difference: the vocal tract is not a tube with a uniform cross-sectional area throughout its length [apart from having soft and absorbent walls that produce damping]. Therefore, its resonant frequencies are no longer uniformly spaced; they are spaced irregularly depending on the exact shape of the tube. The resonances of the vocal tract are called formants.

When the vocal fold pulses are applied at one end of the vocal tract, which resonates at certain frequencies depending on its configuration, the resulting sound will be a product of the two. [To be rigorous with the terminology: when the glottal source with its many harmonics (termed the source function) is filtered according to the frequency response of the vocal tract (termed the transfer function), the harmonics of the glottal sound wave which are at or near the spectral peaks of the transfer function are resonated, and those distant from the resonant frequencies of the tract lose energy and are greatly attenuated. The frequency response graph shows how the vocal tract affects the amplitudes (in dB) of the frequency components of any signal that passes through it.] Therefore, the resonant frequencies of the sound that emerges at the end of the tract (the lips) are determined by the configuration of the vocal tract, whereas the harmonics are the same as those of the sound at the source (the glottis). Now we can understand that, according to the source-and-filter model, the vocal tract functions as a frequency-selective filter.

2. Auditory Phonetics

2.1. Hearing Range

We started by defining sound as the perceived changes in air pressure detected by the ear. This implies that our ears are sensitive to some of the changes in the air pressure around us, but not to all of them. People differ in the frequency range to which their ears are tuned, but in general a young, healthy human ear can hear sound waves whose frequencies lie between about 20 Hz and 20,000 Hz (20 kHz). Sound waves at much higher frequencies do exist, but they are inaudible to human beings.
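Before moving on, the quarter-wavelength tube calculation from the filter discussion can be sketched numerically (a minimal model of a uniform tube closed at one end, using the text's figures of c = 344 m/s and a 17 cm tract):

```python
SPEED_OF_SOUND = 344.0  # m/s in air, as used in the text

def tube_resonances(length_m, n_resonances=3, c=SPEED_OF_SOUND):
    """Resonances of a uniform tube closed at one end and open at the
    other: f_k = (2k - 1) * c / (4 * L), i.e. odd multiples of the
    lowest resonance (whose wavelength is 4 times the tube length)."""
    f1 = c / (4.0 * length_m)
    return [(2 * k - 1) * f1 for k in range(1, n_resonances + 1)]

# A 17 cm vocal tract gives roughly the 500/1500/2500 Hz pattern
# quoted in the text (506 Hz to be precise for the lowest).
print([round(f) for f in tube_resonances(0.17)])  # [506, 1518, 2529]
```

Note that this uniform-tube model only approximates a neutral vowel; as the text goes on to say, the real tract's varying cross-section shifts these formants irregularly.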
["Dog whistles" generate ultrasonic sounds whose sound waves are heard by dogs but are inaudible to us. Sounds too low in frequency to be heard are called subsonic sounds.]

The intensity at which a sound is just distinguishable from silence is called the absolute threshold of hearing, and the intensity at which people begin to feel pain as well as hear the sound is the threshold of feeling. The useful range of hearing for any individual, although it varies with frequency, is usually taken to be the area between the person's absolute threshold and the threshold of feeling. Beyond that limit, sound will cause physical damage.

Most quantities are measured in terms of fixed units. When we measure sound intensity we use watts per square centimetre. However, it is often more convenient to measure intensities using a decibel scale. The decibel [literally one tenth of a bel, a unit named in honour of Alexander Graham Bell (1847-1922), the American inventor of the telephone and educator of the deaf] expresses a ratio. For example, 0 dB means that two sound intensities are the same; 1 dB means that the higher intensity is 26% greater than the lower one. Similarly, a sound intensity of 2 dB is 26% greater than one of 1 dB, and so on. So a 1 dB step in intensity always corresponds to a fixed percentage change, regardless of the base from which we start. Thus, a 1 dB step at the threshold of hearing is a very, very small change; but at a normal conversational speech intensity, where the sound level is 60 dB (one million times greater), the 26% intensity change for 1 dB is one million times greater than the 1 dB change at the hearing threshold. As it happens, the logarithmic dB scale is very convenient, as it allows us to compress the enormous range of intensities into manageable proportions: the strongest sound we can hear without pain is as much as ten million million times greater in intensity than a just audible sound. This huge intensity ratio corresponds to 130 dB.

2.2. Physical vs. Subjective Qualities of Speech Sounds

So far we have talked about the intensity and the frequency of tones. Both are physical characteristics of sound and can easily be measured in the laboratory.
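The decibel arithmetic above can be checked with the standard intensity-ratio definition, dB = 10 log10(I/I0) (a sketch; the function names are just illustrative):

```python
import math

def db_from_ratio(intensity_ratio):
    """Decibels for a given intensity ratio: dB = 10 * log10(I / I0)."""
    return 10.0 * math.log10(intensity_ratio)

def ratio_from_db(db):
    """Intensity ratio corresponding to a decibel value."""
    return 10.0 ** (db / 10.0)

print(db_from_ratio(1.0))            # 0.0 -> equal intensities
print(round(ratio_from_db(1.0), 3))  # 1.259 -> a 1 dB step is a ~26% increase
print(ratio_from_db(60.0))           # one million: conversation vs. threshold
print(ratio_from_db(130.0))          # ten million million (10^13)
```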
Corresponding to these objective physical characteristics, which are inherent in the sound wave itself and can be measured independently of any human observer, are the subjective qualities of loudness and pitch, which are characteristics of the sensations evoked in a human listener and cannot be measured without a live listener.

Frequency relates to pitch. When the frequency of vibration is increased we hear a rise in pitch, and when the frequency of vibration is decreased we hear a lowering of pitch, although the relationship is not linear. How do listeners judge the pitch of a complex sound that contains more than one frequency? The pitch is judged by listeners to correspond to the fundamental frequency. But the pitch of a sound depends not only on the frequency of vibration but also on the length, tension and width of the vocal folds.

The intensity of a signal is directly related to its loudness: as intensity is increased, the sound is judged by listeners to be louder.
Voice quality is the subjective quality primarily connected with the number, distribution and intensity of the component frequencies of the complex tone. This quality accounts in part for the differences of voice quality by which we are able to recognise speakers, but also for the distinctions of timbre between vowels or consonants.

Finally, length is related to the duration of sounds. Again, there is not a total correspondence between them, as the duration of different sounds may vary greatly in absolute terms, but we can only appreciate differences (between short and long) that are linguistically significant.

3. Acoustic Cues to Speech Perception

3.1. Segmentals

To summarise the wealth of information on acoustic cues to the perception of speech segments, it may be helpful to divide the cues into those important to the perception of manner, place and voicing distinctions.

To identify the MANNER of a speech sound, listeners determine whether the sound is harmonically structured with no noise (which signals vowels, approximants or nasals) or whether it contains an aperiodic component (which signals stops, fricatives or affricates). The periodic, harmonically structured classes present acoustic cues in energy regions that are relatively low in frequency. In contrast, the aperiodic, noisy classes of speech sounds are cued by energy that is relatively high in frequency.

Periodic Sounds

How does the listener further separate the harmonically structured vowels, nasals and approximants? The main manner cues available are the relative intensity of the formants and formant frequency changes:

1. The nasal consonants have formants that contrast strongly in intensity with those of the neighbouring vowels. [The nasal formants are weakened because of the antiresonances produced in the nasal cavity.] In addition, there is the distinctive low-frequency resonance, the nasal murmur, below 500 Hz.

2. Approximants display formants that glide from one frequency to another, compared with the relatively more steady-state formants of vowels and nasals. Their transitions are more rapid than those of any diphthong.

3. All English vowel sounds in normal speech are voiced. Therefore, the spectrogram will show vertical striations throughout the stretch corresponding to the vowel sound. There is also the presence of a voice bar below 250 Hz, related to the vibration of the vocal folds.
During their production, the vocal tract exhibits at least two and usually three well-marked resonances, the vowel formants. The most important cues for the perception of vowels lie in the frequencies and patterning of the speaker's formants. As one would suspect, there appear to be certain relationships between the formants of the vowels and the cavities of the vocal tract:

1. First formant (F1): the first formant is the most sensitive to changes in mouth opening. There is a direct relation between the frequency of F1 and the overall opening of the vocal tract (measured by the distance between the highest point of the body of the tongue and the closest part of the palate): the wider the overall opening, the higher the frequency of F1, and vice versa.

2. Second formant (F2): it is the most affected by changes within the oral cavity. There is a direct relation between back-and-up retraction of the tongue and F2 frequency lowering, and vice versa. Besides, the second formant seems to be inversely related to the length of the front cavity: F2 lowers as the front cavity gets bigger, and vice versa.

3. Lip rounding: there is also a direct relation between lip rounding and formant frequency lowering: the more the lips are rounded (and protruded), the more the frequencies of the formants are lowered, and vice versa.

In conclusion, if we follow a progression from close to open vowels, there is a marked gradual approximation of the frequencies of F1 and F2 (a rising of F1 and a lowering of F2). With the shift from front to back articulation, F1 and F2 come much closer together; they move backwards together.

4. Diphthongs are described as vowel sounds of changing resonance, so it is not surprising to find that relatively rapid changes in the formants of the vowels are sufficient cues to diphthong perception. This is also the case for vowel sounds with three elements (triphthongs).
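The F1/F2 rules of thumb above can be turned into a toy classifier. The formant values below are approximate textbook averages for illustration (not data from this unit), and the 500 Hz / 1200 Hz cut-offs are hypothetical thresholds chosen only to make the example work:

```python
# Approximate average (F1, F2) values in Hz for three vowels;
# illustrative textbook figures, not measurements from this unit.
VOWELS = {
    "i (close front)": (270, 2290),
    "a (open back)":   (730, 1090),
    "u (close back)":  (300,  870),
}

def describe(f1, f2):
    """Apply the rules of thumb from the text: F1 rises with mouth
    opening; F2 falls as the tongue is retracted toward the back.
    The 500/1200 Hz thresholds are hypothetical, for illustration."""
    openness = "open" if f1 > 500 else "close"
    backness = "back" if f2 < 1200 else "front"
    return openness, backness

for label, (f1, f2) in VOWELS.items():
    print(label, "->", describe(f1, f2))
```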
Aperiodic Sounds

One manner cue for the classes of sounds having an aperiodic component (the stops, fricatives and affricates) is the duration of the noise, which is transient for stops, lasts longer for affricates and lasts longest for fricatives. The presence of noise is the principal acoustic cue to the perception of fricative manner. Other cues to the stop manner of articulation are the presence of silence, relatively rapid formant transitions and a release burst just before the vowel. The burst is very short when there is no aspiration, and it appears in the spectrogram as an explosion bar just before the vowel. Affricates are stops with a fricative release, so they contain the acoustic cues to perception that are found in both stops and fricatives.

The acoustic cues for PLACE of articulation depend more upon a single parameter of sound frequency:

1. For vowels and approximants, the formant relationships serve to indicate tongue placement, mouth opening and vocal tract length. Approximant production is mainly reflected in the frequency changes of F2. The semivowel /j/ begins with the highest F2, with /r/ and /l/ in the middle frequencies and /w/ relatively low. F3 serves to contrast the acoustic results of tongue tip placement for /r/ and /l/.

2. For stops, fricatives and affricates, two prominent acoustic cues to place of articulation are the F2 transitions into neighbouring vowels and the frequency of the noise components. In general, a second formant transition with a low-frequency locus [the frequency area to which all the transitions point] cues a labial sound, one with a higher locus cues an alveolar percept, and one with a varied, vowel-dependent locus cues a palatal or velar percept. The F2 transition is also used to cue the difference between the non-sibilant fricatives. As far as the frequency of the noise components is concerned, low-frequency spectral peaks [which would correspond to the vowel formants] cue labial percepts, high-frequency peaks cue alveolar percepts, and mid-frequency-range peaks cue palatal or velar percepts.

Finally, the acoustic cues for consonant VOICING depend more upon relative durations and timing of events than upon frequency or intensity differences, with the exception of the presence or absence of the voice bar (glottal pulsing):

1. To recognise this contrast in stops, listeners use several cues: the presence or absence of the voice bar and of aspiration, the duration of the silence marking the stop closure (shorter in voiced stops) and the timing of the initiation of phonation relative to stop release (VOT) (shorter in voiced stops).

2. Fricatives and affricates are perceived as voiceless when the frication is of relatively long duration and, in the case of affricates, when the closure duration is also relatively long.
3. The duration of the preceding vowel can cue the voicing contrast of a following consonant: a voiced consonant will be perceived when the preceding vowel is relatively long in duration, and a voiceless consonant will be perceived when the duration of the preceding vowel is relatively short.

3.2. Suprasegmentals

One thing which spectrographic analysis makes abundantly clear is that the acoustic aspect of speech is one of continuous change, as indeed it must be, since it is the fruit of continuous movement on the part of the speech mechanism. This can be seen by analysing the suprasegmentals or prosodic features of speech, including intonation, stress and juncture, which usually occur simultaneously with two or more segmental phonemes; they are overlaid upon syllables, words, phrases and sentences. The physical features of fundamental frequency, amplitude and duration are perceived by listeners in terms of variations and contrasts in pitch, loudness and length. It is important to keep these distinctions in mind, as linguistic percepts such as intonation or stress do not have a simple and direct representation in the acoustic signal.

1. Stress: there are three acoustic characteristics associated with heavily stressed syllables: (a) they have a higher fundamental frequency; (b) they are of greater duration; and (c) they are of greater intensity than weakly stressed syllables. Although all three contribute to the perception of stress, they do not do so equally: fundamental frequency is the most powerful cue to stress, followed by duration and amplitude.

2. The perception of intonation requires the ability to track pitch changes, as what we perceive as intonation are the pitch changes caused by alterations in the rate of vocal fold vibration. In acoustic terms these changes manifest themselves as alterations to the fundamental frequency of the speech signal.
Low-pitched and high-pitched sounds can be distinguished from each other in the spectrogram by the different spacing of their harmonics: wider spacing of the harmonics indicates a high-pitched sound, compared with the more closely spaced harmonics of a low-pitched one. Therefore, we can distinguish a rising intonation from a falling intonation. The end of a declarative sentence, for example, will be marked by a decrease in fundamental frequency and in intensity. It is possible to measure the fundamental frequency from spectrograms by calculating it from the harmonics, but this is not easy and is limited in terms of time.

3. The prosodic feature of internal juncture (marking the difference between "a name" and "an aim") can be cued by a number of acoustic features, such as silence, vowel lengthening and the presence of voicing or aspiration. If the sequence is articulated so that the [n] is more closely linked to the preceding schwa than to the following diphthong, the [n] will be of greater duration. In addition, speakers often insert a glottal stop before the diphthong, inhibiting coarticulation between the nasal and the following vocalic sequence.

4. Finally, sounds may appear to be of different lengths. Clearly, it is possible to measure the duration of sounds or syllables in the spectrogram. Speech sounds vary in intrinsic duration: in general, diphthongs and tense vowels are described as "intrinsically" long sounds, while the lax vowels are described as intrinsically short sounds; the continuant consonants (fricatives, nasals, semivowels) are longer than the burst of the plosives, etc. However, even when duration can be measured, variations of duration in acoustic terms may not correspond to our linguistic judgements of length. Absolute duration values vary according to many factors, such as speaking rate, stress, voicing setting, etc., but in the English system no more than two degrees of length are ever significant, and all absolute durations will be interpreted in terms of this relationship. What is more important, then, is to realise that the variations of length perceived in any utterance constitute one manifestation of the rhythmic pattern which is characteristic of English and is so fundamentally different from the flow of other languages, such as Spanish, where syllables tend to be of much more even length.