Current developments in phonetics Louis C.W. Pols Institute of Phonetic Sciences (IFA) Amsterdam Center for Language and Communication (ACLC)
Overview My job:  Aspects of Phonetics Size of Phonetics Community Ken Stevens and Phonetics My choices of topics Conclusions
Aspects of Phonetics: Phonetics is fantastic and interdisciplinary; it is about the speech signal and much more: spoken language, spoken communication; phonemes and prosody; speaking and listening; mental storage and retrieval; speech acquisition and speech pathology; speech technology, speech databases; languages of the world, dialects; and more, e.g. laboratory phonology, evaluating cochlear implants, or designing Web avatars
My choices, given other intros: Phonetics as a basic science; Computational modeling, computational phonetics; Knowledge from annotated, and preferably freely accessible, speech corpora; Phonetics as an interdisciplinary science
Size of Phonetics Community: 1000+ participants at speech conferences like ICSLP & Eurospeech (now Interspeech under ISCA), ICASSP, LREC, ICPhS (under IPA); numerous workshops (see ISCA and FoNETiks newsletters, and News section in SpeCom); IPA ~1000 members, ISCA ~1350 members; phonetics community at least 10 times bigger; books; journals; LDC & ELRA; ICPhS’03 Barcelona: 50 countries (USA 158; FR 81; GE 73; UK 71; JAP 46; SP 45; SW 41; NE 31; CAN 25; RU 19; IT 17; FI 14; AU 12; BRA 12; CH 12); Ba-Ma system: less specialization in Phonetics
Ken Stevens and Phonetics: ESCA medal at Eurospeech’95 in Madrid; on average one paper per year in JASA; special issue of JPhon, “Quantal Theory” (1989); 1998 masterpiece “Acoustic Phonetics”; regular keynote speaker at conferences; many international contacts (also in Europe); many good students worldwide
Banquet, Eurospeech’95, Madrid [Photo. Labels: E’95 chairman, ESCA medalist, ESCA president. Pictured: J.M. Pardo, Ken Stevens, Mrs. Pardo, Louis Pols]
Textbook Phonetics Summer course in English Phonetics (UCL): phonemic systems (vowels and consonants) segmental analysis (allophonic processes) word stress weakening and coarticulation processes sentence stress (accent, tonal stress) intonation and meaning similar in most textbooks
Invariance Symp., MIT 1983 Invariance and variability in speech processes (Perkell & Klatt, 1986) also Leitmotiv for my Amsterdam group perception of dynamic speechlike sounds (vW)  formant dynamics (van Son) appropriate context (van Son) acoustic vowel reduction (van Bergem) efficiency of speech (van Son)
DL (difference limen) for short speech-like transitions. Adapted from van Wieringen & Pols (1998), “Discrimination of short and rapid speechlike transitions”, Acta Acustica 84, 520-528 [Figure. Conditions: complex vs. simple stimuli; short vs. longer transitions; initial vs. final position]
Static vs. dynamic V recogn.: see Weenink (2001), “Vowel normalizations with the TIMIT acoustic phonetic speech corpus”, IFA Proc. 24, 117-123; 438 males, both train & test sentences (TIMIT); 35,385 vowel segments, hand segmented; 13 monophthongal vowel categories; 1-Bark bandfilter anal. (18), intensity normal.; 3 frames per segment: central and 25 ms L/R
Some results: Vowel classif. (%) with discriminant functions

Condition                                # Items   Static, 1 frame   Dynamic, 3 frames
Original, 438x13x(1…25)                  35,385    59.3              66.9
speaker normalized                       35,385    62.2              69.2
V centers per speaker, 438x13            5,374     78.9              90.1
speaker normalized                       5,374     87.9              94.5
Perceiving (speech) dynamics: vowel perception with or without transitions?; our claims (vSon, IFA Proc. 17 (1993)): only evidence for compensatory processes (i.e. perceptual overshoot and dynamic specification) when in an appropriate context; synthetic isolated dynamic formant tracks lead to perceptual undershoot (= averaging); silent-center studies are ambiguous; concl.: info in formant dynamics is only used when V’s are heard in an appropriate context
 
Vowel identification compare V responses for dynamic stimuli with those for static stimuli calculate net shift in V responses per onglide (CV), complete (CVC), or offglide (VC) result: responses  average  over the trailing part of the formant track see Pols & vSon, “Acoustics and perception of dynamic vowel segments”, Speech Comm.
Perceptual undershoot Net shift in vowel responses to tokens with curved formant tracks vs. stationary tokens. All values significant, except small open triangles
Local context and C & V identification 120 CVC fragments taken from a read text various segments per CVC-fragment (50ms V-kernel and beyond) both accented and unaccented vowels subjects identified (pre- or post-vocalic) consonant or vowel in CV-, VC-, or CVC-segments vSon & Pols (1999), “Perisegmental speech improves consonant and vowel identification”, Speech Comm. 29, 1-22
 
Error rates of vowel identification for the individual stimulus token types. Long-short vowel errors (/ɑ-a:, ɔ-o:/) are ignored
Results: phoneme identification benefits from extra speech; left context more beneficial than right context; better identification in CV when the other member of the pair was also identified correctly (context effect)
Effect of (lack of) context: 100 Dutch listeners identifying V segments; “Vowel contrast reduction”, K-vBeinum (1980)
$ASC = \frac{1}{n} \sum_{i=1}^{n} |LF_i - \overline{LF}|^2$ (total variance), $LF_i = 100 \cdot {}^{10}\!\log F_i$

3 conditions              Measure   M1     M2     F1     F2     Av.
isolated V (3)            ASC       433    404    447    634    480
                          %         95.2   88.9   88.0   86.4   89.6
words (5)                 ASC       406    320    374    529    407
                          %         88.1   78.8   84.9   85.3   84.3
unstr., free conv. (10)   ASC       174    119    209    255    189
                          %         31.2   28.7   33.3   38.9   33.0
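The ASC definition on this slide can be illustrated with a minimal sketch. It assumes that each vowel is a point in the (LF1, LF2) plane and that ASC is the mean squared distance of those points to their centroid; the formant values below are hypothetical and are not taken from K-vBeinum (1980).

```python
import math

def log_formant(f_hz):
    """LF = 100 * log10(F), the log-frequency scale given on the slide."""
    return 100.0 * math.log10(f_hz)

def asc(vowels):
    """Mean squared distance of vowel points to their centroid.

    `vowels` is a list of (F1, F2) pairs in Hz. Treating ASC as the total
    variance in the (LF1, LF2) plane is an assumption of this sketch.
    """
    points = [(log_formant(f1), log_formant(f2)) for f1, f2 in vowels]
    n = len(points)
    c1 = sum(p[0] for p in points) / n  # centroid, LF1 coordinate
    c2 = sum(p[1] for p in points) / n  # centroid, LF2 coordinate
    return sum((p[0] - c1) ** 2 + (p[1] - c2) ** 2 for p in points) / n

# Hypothetical (F1, F2) values in Hz for three corner vowels.
clear_speech = [(300, 2300), (750, 1300), (320, 800)]
reduced_speech = [(400, 1800), (600, 1400), (420, 1100)]

print("ASC clear:  ", round(asc(clear_speech)))
print("ASC reduced:", round(asc(reduced_speech)))
```

A vowel system that shrinks toward its centroid gives a smaller ASC, which is the direction of the trend in the table from isolated vowels to unstressed vowels in free conversation.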
Historical biases: R. Plomp (2002), “The intelligent ear. On the nature of sound perception”; biases in research: dominance of simple stimuli (e.g., phonemes); preference for microscopic approach (e.g., phoneme discrimination rather than intelligibility); emphasis on psychophysical rather than cognitive aspects of hearing; use of clean signals in lab (rather than acoustic reality of outside world with its disruptive sounds)
Computational Phonetics R. Moore (1995) 13th ICPhS, Stockholm unify the emerging theoretical and practical developments in speech technology with the established knowledge and practices in phonetic sciences Sagisaka et al. (1997), “Computing prosody. Computational models for processing spontaneous speech” Klatt (1987), vSanten (1997), Wang (1997), duration modeling vBergem (1993), Acoustic and lexical vowel reduction Steeneken (1992), Speech Transmission Index
Stylized formant contour: $F_2(t) = c_0 + c_1 t + c_2 t^2$ (second-order polynomial), with normalized time $t$ running from $-1$ ($F_{onset}$) through $0$ ($F_{center} = c_0$) to $1$ ($F_{offset}$)
$F_2(t) = \overline{F}_2(t) + \alpha_{2p}(t) + \beta_{2t}(t) + \gamma_{2ɑ}(t)$ for @ in /p@tɑ/
$F_2(-1) = 1352$ Hz; $F_2(0) = 1435$ Hz; $F_2(1) = 1485$ Hz
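As a worked example, the three F2 values on this slide determine the second-order polynomial exactly, assuming the contour passes through the onset, center, and offset values at normalized times -1, 0, and 1. The function names are illustrative only. With the slide's values this gives c0 = 1435 Hz, c1 = 66.5 Hz, and c2 = -16.5 Hz.

```python
def fit_quadratic(f_onset, f_center, f_offset):
    """Coefficients of F(t) = c0 + c1*t + c2*t**2 through t = -1, 0, 1.

    Assumes the contour passes exactly through the three given values.
    """
    c0 = f_center                               # F(0) = c0
    c1 = (f_offset - f_onset) / 2.0             # from F(1) - F(-1) = 2*c1
    c2 = (f_offset + f_onset) / 2.0 - f_center  # from F(1) + F(-1) = 2*c0 + 2*c2
    return c0, c1, c2

def contour(coeffs, t):
    """Evaluate the stylized contour at normalized time t."""
    c0, c1, c2 = coeffs
    return c0 + c1 * t + c2 * t * t

# F2 values in Hz for the schwa example on the slide.
coeffs = fit_quadratic(1352.0, 1435.0, 1485.0)
print(coeffs)
```

The positive c1 reflects the overall rise in F2 across the vowel, and the small negative c2 reflects a slight downward curvature.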
Schwa realization The schwa is not just a centralized vowel but something that is completely assimilated with its phonemic context
Human word intelligibility vs. noise from Ph.D thesis H. Steeneken (1992) ‘ On measuring and predicting speech intelligibility’
Knowledge from Annotated Sp. Corp.: knowledge cast in rules vs. knowledge derived from intelligent searches in DBs; vSanten (1997), greedy algorithm; Greenberg et al. (2003), Switchboard; Oostdijk et al. (2002), 1000 hrs., 10M words, Spoken Dutch Corpus (CGN); vSon et al. (2001), 5.5 hrs., IFA corpus; Intas915 project (Dutch, Finnish, Russian)
Freq. effects vs. vowel reduction: -log2(word frequency) vs. acoustic vowel reduction (in terms of Duration, F1F2Dist, CoG, and Intensity) for Du, Fi, Ru [Figure. Two panels, read speech and spontaneous speech; bars per language (Dutch, Finnish, Russian) and per measure; vertical axis: correlation coefficient R]
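The kind of correlation plotted on this slide can be sketched as follows. This is only an illustration: it assumes a plain Pearson correlation between -log2 of a relative word frequency and one reduction measure (vowel duration), and all numbers are hypothetical rather than taken from the Intas915 data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def neg_log2_frequency(rel_freq):
    """-log2 of a relative word frequency (rarer words give larger values)."""
    return -math.log2(rel_freq)

# Hypothetical relative word frequencies and vowel durations (ms).
rel_freqs = [0.02, 0.005, 0.001, 0.0002, 0.00005]
durations_ms = [55.0, 62.0, 60.0, 71.0, 78.0]

r = pearson_r([neg_log2_frequency(f) for f in rel_freqs], durations_ms)
print("R =", round(r, 3))
```

A positive R here means that vowels in rarer words are longer, i.e. less reduced, in this made-up sample.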
Phonetics an Interdisciplinary Science some examples phonetics is a contributor to many signal and data processing techniques as well as pattern recognition techniques use of source-filter model to describe early speech development laryngectomized speech, production and evaluation turn switches in conversational dialogs progress in vowel production in babies
Early speech development: vBeinum, Clement, vdDikkenberg, Developmental Sc. 4, 61-70 (2001)

Stage                 Average onset (in weeks)
Stage I               0
Stage II              6
Stage III             10
Stage IV              20
Stage V (babbling)    31
Stage VI (‘words’)    40
Tracheoesophageal speech C. van As, Ph.D thesis (2001)
Turn switches in conversation: shift in phonetics from isolated stimuli to conversational speech; quantitative modelling of the identification of turn-relevant places (TRP’s); integration process of temporally unfolding information at different levels in speech, from conversation acts and semantics to prosody, phonetics and visual cues; use of laryngograph to detect preparatory glottal closure that precedes most TRP’s; new project Rob van Son (start Jan. 2004)
Progression in V production of babies: especially in the first year of life; utterances difficult to identify as phon. seq.; spectro-temporal analyses difficult because of very high pitch; formant measurements biased by expectations; pitch-related bandfilter analysis (automatic); 5 normal-hearing and 5 hearing-impaired; vdStelt et al. (2003)
Spectral measurements [Figure. Panels: normal-hearing child at 5 & 24 mo.; hearing-impaired child at 5 & 24 mo.; vowels i, u, a]
Conclusions importance of dynamic information implications of (lack of) (local) context interdisciplinary nature of phonetics need for large, annotated, and freely accessible speech corpora generalization via computational phonetics phonetics and phonology (Patricia Keating)
