
From sound to grammar: theory, representations and a computational model


From sound to grammar: theory, representations and a computational model

Marco A. Piccolino-Boniforti
Clare Hall
8th February 2014

This dissertation is submitted for the degree of Doctor of Philosophy at the University of Cambridge.
Contents

Abstract
Declaration
Acknowledgements

1 Introduction: From sound to grammar
  1.1 From sound to grammar
  1.2 Thesis outline

2 Background: Variability and invariance
  2.1 The study of variability and invariance
  2.2 Traditional approaches
    2.2.1 Minimal invariant units
    2.2.2 The role of context
    2.2.3 Beads on a string
  2.3 Challenges
    2.3.1 Indexical variation
    2.3.2 Linguistic factors
    2.3.3 Auditory processes
    2.3.4 Automatic speech recognition

3 Theoretical framework: A rational prosodic analysis
  3.1 Analytic foundations
    3.1.1 Rational analysis
    3.1.2 Bayes’ theorem
    3.1.3 Firthian prosodic analysis
  3.2 Central assumptions
    3.2.1 Specificity of task and environment
    3.2.2 Optimal behaviour
    3.2.3 Auditory patterns and linguistic features

4 Assessment: Perceptual-magnet effect
  4.1 Motivation
    4.1.1 The perceptual-magnet effect
    4.1.2 Context-dependent PME
  4.2 Computations
    4.2.1 Feldman et al.’s rational model
    4.2.2 A multi-class extension
  4.3 Simulations
    4.3.1 Method
    4.3.2 Results and discussion

5 Representations: Auditory processes and linguistic categories
  5.1 Auditory processes
    5.1.1 Cochlear model
    5.1.2 Auditory primal sketch
  5.2 Linguistic categories
    5.2.1 The relevance vector machine
    5.2.2 RVM: an example

6 Evaluation: Binary classification tasks
  6.1 Data
  6.2 Simulation A: relevance vector machine
    6.2.1 Motivation
    6.2.2 Research questions
    6.2.3 Method
    6.2.4 Results
  6.3 Simulation B: auditory primal sketch
    6.3.1 Motivation
    6.3.2 Research questions
    6.3.3 Method
    6.3.4 Results
  6.4 Simulation C: cochlear model
    6.4.1 Motivation
    6.4.2 Research questions
    6.4.3 Method
    6.4.4 Results
  6.5 Discussion

7 Model: Predicting prefixes
  7.1 Motivation
    7.1.1 Acoustic cues
    7.1.2 Behavioural evidence
      7.1.2.1 Increased word identification in noise
      7.1.2.2 Predictive looks at target images
  7.2 Computations
    7.2.1 Goal
    7.2.2 Environment
    7.2.3 Constraints
    7.2.4 A formal model
  7.3 Processes and representations
    7.3.1 Fine-tuned learned pattern
    7.3.2 Prefix-like prosody
    7.3.3 Other model components

8 Simulation: Linking the computational model to a behavioural experiment
  8.1 Motivation
  8.2 Implementation
    8.2.1 Overview
    8.2.2 Segmentation
    8.2.3 Feature extraction and concatenation
    8.2.4 Training
    8.2.5 Recognition
  8.3 Input
    8.3.1 Dataset
    8.3.2 Parameter choices
      8.3.2.1 Segmentation
      8.3.2.2 Feature extraction
      8.3.2.3 Training
  8.4 Method
  8.5 Results
  8.6 Discussion

9 Conclusion: Main contributions

Bibliography
List of Figures

4.1.1 Illustration of PME for equally spaced stimuli in one-dimensional acoustic space (top) and corresponding representation in perceptual space (bottom). Stimuli closer to the prototype (stimulus 0) are attracted more, and thus are less discriminable from neighbouring stimuli.
4.1.2 Listeners’ individual Ps (circles, joined by continuous lines) and NPs (squares, joined by dashed lines) for the three allophonic contexts (F2 variation). Data from Barrett (1997). Each line joining values of Ps and NPs across subjects shows the great individual variability in terms of absolute values. Despite the variability, the values of Ps and NPs for each listener tend to spread over all available acoustic space. This is discussed in greater detail at the end of section 4.3.1.
4.2.1 Behaviour of the Feldman and Griffiths (2007) model in the case of one category (left) and multiple categories (right).
4.3.1 Histogram plots for F2 onset values of, respectively, /u:/, /lu:/ and /ju:/.
4.3.2 A sample plot of S (circles) and E[T|S] (squares) for a continuum of stimuli varying along the F2 onset axis. Solid lines show p(c|S) for /u:/, /lu:/ and /ju:/ (from left to right respectively), while dotted lines show the probability density function (multiplied by 100 for visibility) for each category.
4.3.3 The measures of displacement (top left), warping (bottom left) and identification (right, solid curves) for an idealised subject in the case of three categories (from left to right: /u:/, /lu:/, /ju:/). Category prior distributions based on prototypes are indicated in the right pane by the dotted lines.
4.3.4 Individual results for two subjects (left: S1, right: S2) for PME simulations with constant category variance (7234) and three levels of noise σ²_S (top: 1000, middle: 5000, bottom: 10000). For an explanation of the plots see figure 4.3.2 and the text in this section.
4.3.5 Individual results for two subjects (left: S9, right: S10) for PME simulations with constant category variance (7234) and three levels of noise σ²_S (top: 1000, middle: 5000, bottom: 10000). For an explanation of the plots see figure 4.3.2 and the text in this section.
4.3.6 Individual results for the first six subjects (top: S1, S2; middle: S3, S4; bottom: S5, S6) for PME simulations with constant category variance (7234) and the highest level of noise (σ²_S = 10000). For an explanation of the plots see figure 4.3.2 and the text in this section.
4.3.7 Individual results for the last four subjects (top: S7, S8; bottom: S9, S10) for PME simulations with constant category variance (7234) and the highest level of noise (σ²_S = 10000). For an explanation of the plots see figure 4.3.2 and the text in this section.
5.1.1 Rhythmogram for the word instability. From top to bottom: spectrogram, waveform and rhythmogram (event and prominence detection).
5.1.2 Main processing stages to produce a rhythmogram (word: instability). From top to bottom: waveform, hair cell model output (activation in the auditory nerve), modulation spectrogram (multiresolution amplitude modulation), rhythmogram (event and prominence detection).
5.2.1 F1 and F2 onset values for [u:] from /u:/, /lu:/ and /ju:/.
5.2.2 Binary RVM classifiers: F1 and F2 onset values for [u:] from /u:/, /lu:/ and /ju:/. Grey stars represent relevance vectors retained by the models. The black dotted line represents evaluation of the RVM decision function at category membership probability = 0.5. Left panel: categories /u:/ (white circles) vs. non-/u:/ (black triangles). Right panel: categories /ju:/ (black triangles) vs. non-/ju:/ (white circles).
5.2.3 Binary RVM classifier: F1 and F2 onset values for [u:] from /u:/, /lu:/ and /ju:/. Categories /lu:/ (black triangles) vs. non-/lu:/ (white circles). Grey stars represent relevance vectors retained by the model. The black dotted line represents evaluation of the RVM decision function at category membership probability = 0.5.
5.2.4 Categories /lu:/ vs. non-/lu:/; compare to figure 5.2.3. Left panel: two instances from /lu:/ with very low F1 values have been assigned to the /lu:/ category. Right panel: two previously correctly classified instances from /lu:/ with very low F1 values have been assigned to the non-/lu:/ category.
5.2.5 Categories /lu:/ vs. non-/lu:/; compare to figure 5.2.3. Five instances from the non-/lu:/ category with very high F2 values have been assigned to the competing category to simulate an upper threshold.
6.2.1 Simulation A: classification accuracy and sparsity of RVM and SVM. Top: area under the curve (AUC): accuracy. Bottom: number of decision vectors (DV): sparsity. Each of the S1...S5 bar charts represents a model trained on a single speaker. All values averaged over 5 train/test splits.
6.3.1 Simulation B: classification accuracy and sparsity of APS vs. energy. Top: area under the curve (AUC): accuracy. Bottom: number of decision vectors (DV): sparsity. Each of the S1...S5 bar charts represents a model trained on a single speaker. All values averaged over 5 train/test splits.
6.4.1 Simulation C: classification accuracy and sparsity of APS with cochlear model (CM) vs. APS without cochlear model (NCM). Top: area under the curve (AUC): accuracy. Bottom: number of decision vectors (DV): sparsity. Each of the S1...S5 bar charts represents a model trained on a single speaker. All values averaged over 5 train/test splits.
7.1.1 Spectrograms showing acoustic differences between mistimes (true prefix, top) and mistakes (pseudo-prefix, bottom) in the context of the same utterance (I’d be surprised if Tess mistimes/mistakes it). See section 7.1.1 for details. From Smith et al. (2012).
7.2.1 A graphical model of prefix prediction. See section 7.2.4 for an explanation.
8.2.1 Components of the model introduced in chapter 7 that were implemented for the simulation presented in this chapter (solid lines).
8.2.2 Overview of the model implementation’s architecture. See section 8.2.1 for details. Thin lines on oscillograms represent acoustic chunks of increasing length.
8.2.3 The segmentation, feature extraction and feature concatenation processes as implemented. See sections 8.2.2 and 8.2.3 for explanation.
8.2.4 Resampling procedures in the feature extraction process. See section 8.2.3 for details.
8.2.5 A schematic representation of the training procedure. See section 8.2.4 for explanation.
8.2.6 A schematic representation of the recognition procedure. See section 8.2.5 for details.
8.4.1 A sample plot showing curves of proportion of looks to targets (solid lines) and competitors (dashed lines) for the match (grey) and mismatch (black) conditions. Data from Hawkins et al. (in prep).
8.4.2 Eye-tracking results for mis/dis from Hawkins et al. (in prep) in terms of proportion of looks to targets (and competitors) for group M1 (left) and group M2 (right). Group M1 was chosen for comparison with model output. See text for explanation.
8.4.3 A plot showing bias looks to targets involving a true prefix when listening to either a true (grey line) or pseudo (black line) prefix for the M1 group.
8.5.1 RVM model output for the three kinds of feature vectors: APS, MFCC and APS+MFCC. The left panels show average true prefix class probabilities for input tokens of true prefixes (grey line) and pseudo prefixes (black line). The right panels show number of relevance vectors (RV) retained and area under the ROC curve for each model step.
Abstract

Marco A. Piccolino-Boniforti
From sound to grammar: theory, representations and a computational model

This thesis contributes to the investigation of the sound-to-grammar mapping by developing a computational model in which complex acoustic patterns can be represented conveniently, and exploited for simulating the prediction of English prefixes by human listeners.

The model is rooted in the principles of rational analysis and Firthian prosodic analysis, and formulated in Bayesian terms. It is based on three core theoretical assumptions: first, that the goals to be achieved and the computations to be performed in speech recognition, as well as the representation and processing mechanisms recruited, crucially depend on the task a listener is facing, and on the environment in which the task occurs. Second, that whatever the task and the environment, the human speech recognition system behaves optimally with respect to them. Third, that internal representations of acoustic patterns are distinct from the linguistic categories associated with them.

The representational level exploits several tools and findings from the fields of machine learning and signal processing, and interprets them in the context of human speech recognition. Because of their suitability for the modelling task at hand, two tools are dealt with in particular: the relevance vector machine (Tipping, 2001), which is capable of simulating the formation of linguistic categories from complex acoustic spaces, and the auditory primal sketch (Todd, 1994), which is capable of extracting the multi-dimensional features of the acoustic signal that are connected to prominence and rhythm, and of representing them in an integrated fashion. Model components based on these tools are designed, implemented and evaluated.

The implemented model, which accepts recordings of real speech as input, is compared in a simulation with the qualitative results of an eye-tracking experiment. The comparison provides useful insights about model behaviour, which are discussed.

Throughout the thesis, a clear distinction is drawn between the computational, representational and implementation devices adopted for model specification.
Declaration

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text.

This dissertation does not exceed 80,000 words, including footnotes, references and appendices, but excluding bibliographies, as required by the Degree Committee of the Faculty of Modern and Medieval Languages.
Acknowledgements

This research was funded by an ESR fellowship of the EU MRTN-035561 research training network Sound to Sense. I am particularly grateful to my supervisor and coordinator of Sound to Sense, Sarah Hawkins, who inspired me with her passion for interdisciplinary research, encouraged me during particularly hard times, challenged me intellectually and ultimately made this opportunity of professional and personal growth possible. I am also very thankful to my advisor, Dennis Norris, who was always very approachable and supportive, and from whom I tried to grasp the gifts of sharp thinking and clarity of expression.

The interdisciplinary nature of my project required me to gather knowledge in many areas. I greatly benefited from the workshops and discussions with many senior researchers in Sound to Sense, in particular Guy Brown, Richard Ogden and Martin Cooke, who also welcomed me for a research stay. I am also grateful for the discussions and fun time with fellow early stage researchers, and particularly to Bogdan Ludusan for his contribution to my work and to Meghan Clayards for sharing her analyses. I also give thanks to Rachel Baker for sharing data and analyses, and to my colleagues at the phonetics lab and Linguistics department for fostering a positive, supportive and fun environment.

Finally, I would have never managed to accomplish this daunting task without the loving support of my family, my girlfriend Silvia, my colleagues Marco and Sergio, and the so many wonderful friendships that I was blessed with during my stay in Cambridge and back home.
1 Introduction: From sound to grammar

1.1 From sound to grammar

When we listen to someone speaking, e.g. during a telephone conversation, we can pay attention to a number of different things: the words they are saying, their accent, their sex, their age, their mood, their physical and even mental condition. We extract this wealth of information from a single source (the speaker), often simultaneously. Despite the fact that we might occasionally get some of this information wrong, in most cases, even in the presence of noise, we succeed in a task whose “inner workings” turn out to be quite complex to understand.

Not only can we extract different kinds of information from the same person: we can also extract the same kind of information from different persons. So, for example, we are able to recognise one and the same word even when it is pronounced by two people of different age, sex or geographical origin; or to recognise two individuals as female despite differences in the words they are saying, their voice quality, their pitch. This is possible because the recognition of speech relies on a subtle relationship between variability and invariance.

Speech researchers have been trying to shed light on this complex relationship for some 60 years now (Jusczyk and Luce, 2002). So far, however, many of the questions aimed at a better understanding of it still lack an adequate answer (Luce and McLennan, 2005).

Some of these questions concern the relationship between acoustic patterns and grammatical function, in short the sound-to-grammar mapping. While most models of spoken word recognition to date postulate an obligatory stage of phonemic analysis as the only “beneficiary” of acoustic information, beyond which recognition becomes a matter of
pattern matching on combinations of symbols, an increasing body of experimental data suggests that acoustic patterns can be informative and drive recognition well beyond the phonemic level of analysis. So, for example, a complex acoustic pattern can be a direct cue to a grammatical category such as a morpheme (see e.g. Baker, 2008).

This thesis contributes to the investigation of the sound-to-grammar mapping by developing a computational model in which complex acoustic patterns can be represented conveniently, and exploited for simulating the prediction of specific grammatical features by human listeners. The computational model described here is based on the following central theoretical assumptions about human speech recognition. These assumptions, whose rationale requires some explanation, are illustrated in greater detail in chapter 3:

1. the specific characteristics of the recognition process are strictly connected to the particular task in the context of which speech recognition happens (3.2.1);
2. the recognition process can be interpreted as a problem of optimal decision making while reasoning under uncertainty (3.2.2);
3. there is a distinction in memory between the representation of acoustic patterns and the linguistic features associated with them (3.2.3).

These assumptions impose important constraints on the fundamental properties of the representational and processing tools and techniques used for modelling, which will be introduced in chapter 5.

An important contribution of this thesis to the analysis of the sound-to-grammar mapping lies in making explicit connections between findings and methods from various fields (most notably theoretical linguistics, experimental phonetics, experimental psychology, machine learning and signal processing) in pursuit of the common goal of developing a computational model. Another important contribution is the development of an implemented
architecture that supports running simulations with real speech and comparing the results with behavioural data from human listeners.

1.2 Thesis outline

Chapter 2 - Background: Variability and invariance

An investigation of the relationship between sound and grammar is necessarily concerned with the broader issue of variability and invariance in speech recognition. I first introduce the study of this issue (2.1) and show how researchers in human speech recognition have dealt with it in the past (2.2). I then describe some issues from the behavioural, linguistic, neuro-physiological and engineering perspectives that challenge these traditional approaches (2.3) and motivate the development of improved theories and computational models.

Chapter 3 - Theoretical framework: A rational prosodic analysis

I first introduce a theoretical framework for the study of the sound-to-grammar mapping that is based on Rational Analysis (3.1.1), Bayesian statistics (3.1.2) and Firthian Prosodic Analysis (3.1.3). I then discuss the central theoretical assumptions that form the foundations of a computational model for the sound-to-grammar mapping (3.2): 1) a proper characterisation of speech recognition should account for the specific task pursued by listeners, and for the environment in which the task is performed (3.2.1); 2) human speech recognition can be cast as a problem of optimal decision making while reasoning under uncertainty (3.2.2); 3) acoustic patterns and linguistic categories are not the same thing, and they should not be confused in models of human speech recognition (3.2.3).
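As a one-line illustration of the Bayesian framing just mentioned (the notation here is mine, not the thesis's), the listener's problem can be written as inference over candidate linguistic categories c given acoustic evidence a:

\[
p(c \mid a) \;=\; \frac{p(a \mid c)\, p(c)}{\sum_{c'} p(a \mid c')\, p(c')}
\]

where the prior p(c) encodes expectations set by the task and the environment, the likelihood p(a | c) encodes learned associations between categories and acoustic patterns, and optimal behaviour, in the sense of assumption 2, consists in acting on the resulting posterior.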
Chapter 4 - Assessment: Perceptual-magnet effect

Modelling the perceptual-magnet effect (PME) helps to investigate the mapping between acoustic information and linguistic categories. After introducing the PME, I consider work suggesting that the behaviour of listeners can be accounted for by assuming context-dependent prototypes (4.1), rather than phonemic categories. Although a recently proposed rational, Bayesian model of the PME (4.2) elegantly explains the behaviour of listeners in the case of very simplified data, the simulations I present (4.3) show that the existence of context-dependent prototypes poses important challenges for the way in which phonological and grammatical categories are represented in most current psycholinguistic models.

Chapter 5 - Representations: Auditory processes and linguistic categories

I introduce some modelling tools and techniques from the fields of machine learning and signal processing, which are compatible with the theoretical principles outlined in chapter 3 and well suited to representing complex auditory patterns and the associated linguistic categories. Because of their suitability for the implementation of the model developed in chapter 7, two kinds of representations are dealt with. For the representation of auditory processes (5.1), I first introduce a cochlear model (5.1.1) whose output is fed to the auditory primal sketch (5.1.2). The auditory primal sketch is capable of extracting the multi-dimensional features of the acoustic signal that are connected to prominence and rhythm, and of representing them in an integrated fashion. For the representation of linguistic categories associated with complex auditory patterns, I introduce the relevance vector machine (5.2), a sparse Bayesian machine learning technique based on the concept of prototypical exemplars.
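For orientation, the rational model of the PME referred to above is sketched here in the form usually given by Feldman and Griffiths (2007); the notation is reproduced for convenience and may differ in detail from the thesis's own:

\[
T \mid c \sim \mathcal{N}(\mu_c, \sigma_c^2), \qquad S \mid T \sim \mathcal{N}(T, \sigma_S^2),
\]
\[
\mathbb{E}[T \mid S, c] \;=\; \frac{\sigma_c^2\, S + \sigma_S^2\, \mu_c}{\sigma_c^2 + \sigma_S^2},
\qquad
\mathbb{E}[T \mid S] \;=\; \sum_{c} p(c \mid S)\, \mathbb{E}[T \mid S, c],
\]

where T is the talker's intended target, S the noisy stimulus the listener receives, μ_c and σ_c² the mean and variance of category c, and σ_S² the perceptual noise. The shrinkage of E[T | S] towards category means produces the perceptual warping characteristic of the PME; these are the quantities S, E[T|S] and p(c|S) plotted in figure 4.3.2, with the multi-class extension letting c range over context-dependent categories such as /u:/, /lu:/ and /ju:/. The relevance vector machine used for linguistic categories in chapter 5 can likewise be summarised, following Tipping (2001), as a kernel classifier y(x) = σ(Σ_n w_n K(x, x_n) + w_0) whose independent Gaussian weight priors, each with its own precision hyperparameter, prune most weights to zero, so that only a sparse set of relevance vectors is retained.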
Chapter 6 - Evaluation: Binary classification tasks

I design and implement model components based on the modelling tools described in chapter 5, in order to test their suitability for inclusion in the computational model described in the next chapter, and their relative advantages over other established modelling techniques and tools. The implemented components are evaluated by means of probabilistic binary classification tasks. The dataset used is first described (6.1). In the first simulation, a model component based on the relevance vector machine is evaluated against one based on the support vector machine, a modelling technique which is more widespread but seems less suited to simulating aspects of human speech recognition (6.2). In the second simulation, a model component based on the auditory primal sketch is evaluated against a model that simply extracts the energy envelope of the signal (6.3). Finally, in the third simulation, a model component based on the auditory primal sketch without the cochlear model is compared to an otherwise identical component in which the cochlear model is included (6.4). The general outcomes of the evaluation are then briefly discussed (6.5).

Chapter 7 - Model: Predicting prefixes

I develop a computational model of prefix prediction for British English in which it is assumed that listeners, by analysing fine-tuned, learned auditory patterns in the proper prosodic and grammatical context, can set prefix prediction as an intermediate task in order to fulfil higher-level goals. The model is first motivated (7.1) in terms of acoustic analyses (7.1.1) and behavioural experiments (7.1.2). The computational aspects of the model are dealt with in terms of goal (7.2.1), environment (7.2.2) and constraints (7.2.3). The model is then given a formal description with the aid of a Bayesian network (7.2.4). Those model components that are implemented in the simulation are also described in terms of processes and representations (7.3).
Chapter 8 - Simulation: Linking the computational model to a behavioural experiment

I implement those model components that enable a qualitative comparison between model output and the output of the eye-tracking experiment described in section 7.1.2.2. The motivation for the simulation is first explained (8.1). I then give a detailed account of the system architecture devised for implementing the model, with the various stages it involves (8.2). I further describe the dataset and model parameters used in the simulation (8.3), and explain the method used for the qualitative comparison (8.4). I finally present the results of the comparison (8.5) and discuss them (8.6).

Chapter 9 - Conclusion: Main contributions

The main contributions of this thesis to the investigation of the sound-to-grammar mapping, and more generally to the study of human speech recognition, are summarised.
2 Background: Variability and invariance

2.1 The study of variability and invariance

Our understanding of how speech recognition works on a neuro-physiological basis is, at present, quite fragmentary (see Young, 2008, for a recent review). This, however, is not a major obstacle to its characterisation on a functional or formal basis (see Marr, 1982), and available neuro-physiological insights can be put to good use to constrain hypotheses about the functional and formal properties of speech recognition, insofar as they contribute to explaining variability and invariance. Functional and formal characterisations of speech recognition constitute a great deal of the work conducted in the last six decades in fields as diverse as psychology, linguistics and statistical pattern recognition.

When considering the role of variability and invariance, a researcher can take one of several positions between two hypothetical extremes. One extreme would consider variability as always inherently “bad”, because it is random or irrelevant. The other extreme would consider it as always inherently “good”, because it is systematic and informative. Evidently, neither extreme is defensible, since both would imply, for listeners, the inability to make any kind of useful generalisation. Experimental evidence, reviewed in the following sections, shows that in fact some variability is random and some is systematic, some is irrelevant and some is informative. Determining what is irrelevant and what is informative, however, depends on the exact characterisation of the task listeners are faced with (function), and thus of the mapping process that enables the task to be accomplished (form). Advances in speech recognition research can be characterised precisely as refinements in knowledge of the mapping process triggered by
observations about function and form.

2.2 Traditional approaches

2.2.1 Minimal invariant units

First published in 1951, Jakobson, Fant and Halle's Preliminaries to Speech Analysis (Jakobson et al., 1951) represented an innovative blend of theoretical and experimental work on the investigation of the properties of speech. The study quickly became popular, thanks also to an international conference on speech communication held at MIT in the following year (Perkell and Klatt, 1986, Preface). The book's influence on the kind of questions asked by researchers in speech communication has been long-lasting.

The primary goal of the Preliminaries was to propose questions about the nature of the “ultimate discrete entities of language”, i.e. about linguistic form. What made it particularly interesting to practitioners of several disciplines, as compared to other linguistic investigations with the same goal (Twaddell, 1935; Trubetzkoy, 1939; Jones, 1950), was the great attention paid to the articulatory, acoustic and perceptual correlates of the units they identified as the ultimate discrete components of language: distinctive features. A distinctive feature was characterised as a choice faced by a listener between two polar qualities of the same category (see Jakobson et al., 1951, p. 3).

In their tentative sketch, the authors gave a systematic account of many articulatory and acoustic correlates of distinctive features. The development of the sound spectrograph at Bell Laboratories (Potter et al., 1947) was invaluable in the determination of the acoustic correlates (Fant, 2004). As to the articulatory aspects, the analysis was influenced by the work of Chiba and Kajiyama (Chiba and Kajiyama, 1941; Fant, 2004). Jakobson et al.'s definition, however, was based on a perceptual criterion: a “choice faced by a listener”. Even the categories they adopted followed a terminology based on perception, despite the explicit acknowledgement that their auditory observations were
not based on a systematic experimental survey.

It was only with the development of the Pattern Playback machine at the Haskins Laboratories (Cooper et al., 1951) that more detailed knowledge could be gathered about the mapping between acoustic stimuli and perceptual judgements on the identification of (synthetic) phonemes and syllables, intended as bundles of distinctive features. The Pattern Playback was a synthesiser in which a tone wheel modulated a light source at about 50 harmonically related frequencies. A transparent or reflective spectrogram, usually hand-painted, filtered specific portions of this harmonic source, which were passed to a photo-tube and converted to sound. The great novelty introduced by the Pattern Playback was the flexibility it gave researchers in the manipulation of sounds. This simple yet powerful technique was key to the discovery of fundamental perceptual phenomena, such as categorical perception (Liberman et al., 1957), the role of spectral loci (Delattre et al., 1955), and that of the main spectral prominences of the transient relative to the vocalic part (Liberman et al., 1952) for the perception of the occlusive in CV syllables. A limitation of the method was its unsuitability for faithfully reproducing aperiodic portions of the spectrogram.

The model of language developed in the Preliminaries followed a computational perspective that was explicitly formulated according to the principles of the newly born research field of information theory (Shannon, 1948). It was the authors' intention to establish a codebook that could faithfully and efficiently represent the transmission of spoken messages. Distinctive features seemed the appropriate unit of analysis for this endeavour. Jakobson and colleagues pursued this goal by identifying and “stripping away” all acoustic variability in the speech signal that was considered redundant, while keeping those acoustic correlates that were deemed essential to the definition of the invariant units of analysis. The same purpose underlay the experimental work at the Haskins Labs (Liberman et al., 1952) and found its linguistic counterpart in structuralist approaches to the analysis of language, including previous work by Jakobson himself
(Bloomfield, 1933; Jakobson, 1939; Harris, 1951).

Evidently, the notion of redundancy could only be elaborated with respect to a certain task, or functional criterion. All the work contained in the Preliminaries assumed that this task was a “test of the intelligibility of speech, [where] an English speaking announcer pronounces isolated root words (bill, put, fig, etc.), and an English speaking listener endeavors to recognize them correctly” (Jakobson et al., 1951, p. 1). This was most likely a choice dictated by experimental and analytical constraints. However, such a task has little to do with actual speech communication. The authors themselves, in the introduction to the book, highlighted the difference between the two tasks very clearly. Fant offered a critical retrospective of the featural approach, coming to the conclusion that “the hunt for maximum economy often leads to solutions that impair the phonetic reality of features” and that “a simple one-to-one relationship between phonetic events and phonological entities is exceptional” (Fant, 1986, p. 482).

While many at MIT and Haskins were exploring the question of minimal units, researchers at the Harvard Psycho-Acoustic Laboratory and elsewhere were realising the importance of context (intended both as the whole acoustic neighbourhood and as the number of possible lexical choices available to the listener) for the recognition of word stimuli, both in isolation and in relation to a whole sentence.

2.2.2 The role of context

Miller et al. (1951) found that 1) level of background noise, 2) number of lexical items to be considered, 3) word vs. non-word status and 4) syntactic/semantic context all had a great influence on the intelligibility scores of spoken stimuli: less background noise, a smaller number of available choices, word status and previous context improved recognition, with noise level and number of available choices influencing the noise threshold for intelligibility.
Ladefoged and Broadbent (1957) demonstrated the role of the preceding acoustic context for the identification of the vowel in one out of four possible monosyllabic, synthesised words in English. In their study, subjects listened to the carrier sentence Please say what this word is, synthesised with the Parametric Artificial Talker (Lawrence, 1953). Acoustic parameters in vowel formants were varied, as if the sentence were uttered by different talkers. The sentence was followed by an acoustic token, which was exactly the same across the different carrier sentences. Despite this fact, 97% of the listeners, for example, recognised the token as bit after one version of the carrier sentence, whereas 92% of them recognised the same token as bet after another version of the carrier. This study was regarded as positive evidence for the theory of Joos (1948), according to which “the phonetic quality of a vowel [i.e. the acoustic correlates of a perceptual category] depends on the relationship between the formant frequencies for that vowel and the formant frequencies of other vowels pronounced by the same speaker” (Ladefoged and Broadbent, 1957, p. 99).

In a different study, Miller and Selfridge (1950) investigated the role of the units of analysis from an information-theoretical point of view. While Jakobson et al.'s focus was on the codebook (that is, on representational issues concerning the identity of the invariant units), the main interest of Miller and Selfridge was rather in the code, intended as a concatenation of atomic units. For the authors' purposes, the units could have been phonemes, words, or any other element amenable to being represented sequentially. In that experimental study, the authors investigated the role of what they named “verbal context” on the recall of spoken passages of text by listeners. To this purpose, they devised so-called nth-order approximations to the English language, i.e. statistical models of a language based on knowledge of the relative frequency of successive units (phones, syllables, words), up to the nth unit. To implement these models, Miller and Selfridge presented a sequence of n words to an English speaker and asked her/him to complete the sequence with one more word.
The resulting sequence of n+1 words was then presented to another speaker, who completed it with a further word. The completed sequences were then recorded by a male speaker and played to listeners. All listeners heard sequences of various lengths (10, 20, 30 and 50 words) and various orders of approximation and were asked, after listening to each sequence, to write down as many words in the correct order as they could remember. Miller and Selfridge's main findings were that both a higher order of approximation and a shorter sequence correlated with higher recall scores, with the two factors interacting: higher-order approximations seemed to help recall especially with longer sequences.

Studies like these, albeit very different in methodology and immediate goals, all acknowledged the fact that the amount of contextual information available to listeners strongly influences the recognition of the message at various levels of analysis.
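A present-day analogue of Miller and Selfridge's procedure is an n-gram language model: instead of chaining human speakers, each word is sampled conditioned on the previous n−1 words of a corpus. The sketch below (my own illustration with a toy corpus, not code or data from the thesis) generates such an "nth-order approximation".

```python
import random
from collections import defaultdict

def train_ngram(words, n):
    """Record the observed continuations of every (n-1)-word history."""
    table = defaultdict(list)
    for i in range(len(words) - n + 1):
        history, nxt = tuple(words[i:i + n - 1]), words[i + n - 1]
        table[history].append(nxt)
    return table

def generate(table, n, length, seed=None):
    """Sample a sequence in which each word depends only on the previous n-1 words."""
    rng = random.Random(seed)
    history = list(rng.choice(list(table)))     # random starting history
    out = history[:]
    while len(out) < length:
        nxt = rng.choice(table.get(tuple(history), ["."]))  # fall back if history unseen
        out.append(nxt)
        history = out[-(n - 1):]
    return " ".join(out)

# Toy corpus, purely for illustration.
corpus = "the listener hears the word and the word is recognised by the listener".split()
table = train_ngram(corpus, n=3)                # third-order (trigram-like) approximation
print(generate(table, n=3, length=12, seed=0))
```

Higher values of n yield passages that look increasingly English-like, which is exactly the gradient of "verbal context" whose effect on recall Miller and Selfridge measured.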
2.2.3 Beads on a string

The identification of minimal invariant units and the investigation of their concatenative properties constitute the so-called beads-on-a-string view of speech recognition. This view has formed the basis for most accounts of human and automatic speech recognition to this day (Luce and McLennan, 2005). Current psycholinguistic models of spoken word recognition that rely on this principle include Trace (McClelland and Elman, 1986), Shortlist A (Norris, 1994), and PARSYN (Luce et al., 2000). It is also an integral part of the widespread Hidden Markov Model (HMM) approach to automatic speech recognition (Baker, 1975; Jelinek et al., 1975). As noted above, its origins have connections to information theory and structural linguistics. Crucially for our discussion about the relationship between sound and grammar, this approach postulates 1) a mapping between the acoustic signal and a sequence of discrete, abstract units, all of which belong to the same level of linguistic analysis (in most cases either distinctive-featural or phonemic); and 2) a concatenation of these units as input to further levels of analysis (e.g. the word level).

Early accounts of human speech recognition that adopted this approach were Fry (1959) and Halle & Stevens (1962). The account offered by Fry formed the basis for one of the first automatic speech recognisers ever built, the speech typewriter described in Denes (1959). Accounts of human speech recognition that adopt the beads-on-a-string view differ in many respects, particularly as to the mapping function used to obtain the sequence of symbolic units from the acoustic stream; however, they all share the assumption that acoustic information only serves the purpose of guiding the recognition of minimal units, and does not intervene directly in the determination of other kinds of linguistic structure. The examples presented in section 2.3 suggest that this might not be the case: rather, human listeners seem to rely on acoustic cues to gather information about other kinds of linguistic structure as well, including grammatical categories.
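To see how strictly the beads-on-a-string assumption is built into the standard HMM formulation of automatic speech recognition (shown here only as an illustration, not as notation used in the thesis), recognition is typically framed as

\[
\hat{W} \;=\; \arg\max_{W}\; p(W)\, p(O \mid W),
\qquad
p(O \mid W) \;\approx\; \sum_{q_{1:T}} \prod_{t=1}^{T} p(o_t \mid q_t)\, p(q_t \mid q_{t-1}),
\]

where O = o_1 ... o_T are the acoustic observation frames and q_1 ... q_T is a sequence of phone-sized (or sub-phone) states determined by the pronunciation of the word string W. Acoustic evidence enters only through the per-frame emission terms p(o_t | q_t) attached to those minimal units; everything above them is symbol concatenation. The conditional-independence assumption visible in the product is one of the limitations returned to in section 2.3.4.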
2.3 Challenges

2.3.1 Indexical variation

A beads-on-a-string view requires a certain degree of abstraction of the units of analysis, which arises from discretisation. This in turn implies that what is usually termed indexical variation, e.g. differences in speaking rate, among talkers or in affective states (Luce and McLennan, 2005), is not accounted for by the units, and thus unexplained variance within the units increases. If indexical variation did not influence the recognition of minimal units, it would be an independent issue. However, this does not seem to be the case. The study by Ladefoged and Broadbent (1957) mentioned above, which ultimately simulated a difference among talkers, already pointed to this. Additional evidence was collected in further studies.

Peters (1955) found that messages uttered by a single talker in noise were reliably more intelligible than messages uttered by multiple talkers. Creelman (1957) found an inverse relationship between performance on the identification of words and the number of talkers. Findings like these were confirmed in later studies, such as Mullennix et al. (1989).

Several studies highlighted the influence of speaking rate on the recognition of phonemes for which the rate of temporal change had been found to be relevant. Liberman and colleagues (Liberman et al., 1956; Miller and Liberman, 1979) showed this for the [w]/[b] distinction. Verbrugge and Shankweiler (1977) cross-spliced syllables from contexts at fast speaking rates into contexts at slower speaking rates. This affected vowel identification: for example, subjects misidentified [a] as [2].

Investigating the role of variation in talker, speech rate and amplitude in recognition memory, Bradlow et al. (1999) found that variation in both talker and speech rate had an influence on recognition judgements (old vs. new), while amplitude did not. In the case of an old word, however, listeners were reliably able to indicate whether it was repeated by the same talker, at the same rate or at the same amplitude. This hinted at the fact that some kinds of information (amplitude, in this case) are nonetheless stored even when they do not influence recognition judgements.

Thus, there is evidence to postulate some interaction between indexical variation and phoneme or word recognition. This was the main motivation for the development of alternative approaches to speech recognition. Some approaches postulate the retention of a very large number of encountered patterns (exemplars) in long-term memory (Goldinger, 1998). Such a formulation accounts for generalisation effects by postulating analogical processes between stored and new exemplars at the time of recognition. Other approaches postulate the co-existence of multiple units of linguistic analysis, which would be triggered according to the task at hand and/or the phonological structure of a specific language (3.1.3). These two perspectives are not necessarily at odds, as they mainly differ at the representational level (5.2). The next section will present selected examples in which a direct sound-to-grammar mapping seems necessary to explain the
behaviour of listeners.

2.3.2 Linguistic factors

Indexical variation is not the only kind of variability that is not adequately accommodated in mainstream models of human speech recognition: acoustic variability due to linguistic factors other than phonemic identity also needs to be accounted for in such models, since there is plenty of evidence that listeners are sensitive to it. Being able to account for and exploit this kind of variability is the main motivation for the research presented in this thesis.

A long research tradition acknowledges the fact that, in many languages, acoustic cues not directly mappable onto phonemes or distinctive features play an important role in the perceptual identification of morphological, lexical and syntactic boundaries. For example, in English, specific segment and syllable duration relationships may signal an upcoming pause or prosodic boundary, such as the end of an utterance (see e.g. Klatt, 1976). In several cases, different acoustic features of variable granularity, arranged into complex configurations, contribute together to the definition of linguistic structure, e.g. in signalling word segmentation (Smith, 2004, for English).

Other experiments show that listeners are also sensitive to subtle but systematic variations in acoustic parameters which are linked to differences in prosodic structure, which in turn are triggered by lexical differences. For example, Salverda et al. (2003), using the visual paradigm (Tanenhaus and Spivey-Knowlton, 1996) and cross-splicing, found that subjects were sensitive to acoustic differences, particularly in duration, which were due to the monosyllabic vs. polysyllabic nature of a word (e.g. ham- as in ham vs. hamster). Kemps et al. (2005a) arrived at similar conclusions for morphologically complex vs. morphologically simple words (e.g. Dutch singular/plural nouns: boek- as in boek [buk] vs. boeken [buk@]). Baker (2007a; 2008) investigated the perception of true vs. pseudo prefixes in English, e.g. dis- as in distasteful (true, i.e. productive and with
clear compositional meaning) vs. distinctive (pseudo). In a fill-the-gap listening experiment in noise, she found that cross-splicing some true prefixes onto pseudo-prefixed stems, and vice versa, indeed had a negative impact on recognition performance. In this case, although there was an interaction with sentence focus (nuclear vs. post-nuclear stress on the accented syllable), some variation should clearly be attributed to morphological differences.

These and many other findings suggest that models of human speech recognition should account for many more sources of variability than phonemic identity alone, and that these sources are not limited to indexical properties, but include prosodic structure as a manifestation of grammatical differences at various levels.

2.3.3 Auditory processes

One of the main limitations of most models of spoken word recognition is their reliance upon strings of segments (either features or phonemes) as input to the model (2.2.3). In addition to the increasing amount of behavioural evidence about the role of phonetic detail in speech recognition (2.3), psycho-acoustic and neuro-physiological studies also show that sound waves undergo substantial, partly still unexplained transformations along their journey through the auditory nerves and onto the cerebral cortex. While a complete account of these transformations is both impossible and inappropriate in this context, it is still worthwhile to review the main findings that document the way acoustic information is encoded during recognition. Some of these turn out to be informative when it comes to the design of processing stages in models (7.3), and to the understanding of the role of variability. While most neuro-physiological evidence about auditory mechanisms comes from laboratory animals rather than humans, it seems that some of these mechanisms, particularly those at the auditory periphery, also apply to humans (Pickles, 2008). A comprehensive review of the findings of the last 25 years on the neural representation of speech can be found in Young (2008), which
constitutes the main information source for this section.

Neuro-physiological data shows that the brain performs elaborate transformations on the input signal. Moreover, these transformations are not limited to feature extraction, but seem to suggest the formation of auditory objects as a response to specific behavioural needs. While details of the representations in the auditory cortex still escape us, and acknowledging that, because of the interaction with the language areas, auditory areas in human brains might behave in even more complex ways than animal data suggests (see e.g. Zatorre and Gandour, 2008), neuro-physiological data provides an independent source of evidence of the direct relationship between complex acoustic patterns and meaning-bearing linguistic categories.

A major finding that seems to encompass, to different degrees, all levels of the neural representation of speech is the so-called tonotopic nature of the representation: in the auditory system, different frequency bands are analysed separately. This happens during the conversion from mechanical to neural signal, at the interface between the inner ear and the auditory nerve. Hair cells are arranged along the whole extension of the basilar membrane; because of this disposition, the fibres of the auditory nerve departing from them respond to specific frequencies, i.e. their discharge rates become high only in the presence of excitation that falls within a certain frequency range. For this reason, many models have interpreted the basilar-membrane / hair-cell analysis of the signal as a bank of bandpass filters (Patterson et al., 1988). This approximation can be useful, but it is a gross simplification in several respects. First, frequency selectivity varies significantly with sound pressure level; second, auditory fibres undergo saturation effects; finally, complex interactions among auditory fibres give rise to inhibitory effects: excitation of fibres with a certain best frequency can suppress the excitatory levels of neighbouring fibres. Inhibitory mechanisms are still not fully understood, particularly when it comes to higher neural regions.
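To make the bandpass-filterbank idealisation concrete, here is a minimal sketch of a bank of log-spaced bandpass filters as a crude stand-in for basilar-membrane frequency analysis. This is my own illustration, not the cochlear model used in the thesis; Butterworth bands are chosen only to keep the sketch short, whereas a gammatone filterbank in the spirit of Patterson et al. (1988) would be the more standard auditory choice.

```python
import numpy as np
from scipy.signal import butter, lfilter

def bandpass_filterbank(signal, fs, n_channels=16, f_lo=100.0, f_hi=6000.0):
    """Crude cochlea-like analysis: split a waveform into log-spaced frequency bands.

    Illustrates the 'bank of bandpass filters' idealisation only; real auditory
    filters are level-dependent, saturating and mutually inhibitory.
    """
    centres = np.geomspace(f_lo, f_hi, n_channels)      # log-spaced centre frequencies
    outputs = []
    for fc in centres:
        lo, hi = fc / 2 ** 0.25, fc * 2 ** 0.25          # half-octave band around fc
        b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        outputs.append(lfilter(b, a, signal))
    return centres, np.stack(outputs)                    # shape: (n_channels, n_samples)

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    tone = np.sin(2 * np.pi * 440 * t)                   # 1 s test tone at 440 Hz
    centres, bands = bandpass_filterbank(tone, fs)
    # Energy should concentrate in the channel whose passband contains 440 Hz.
    print(centres[np.argmax((bands ** 2).mean(axis=1))])
```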
Fibres in the auditory nerve can also be classified according to their dynamic range and thresholds, i.e. their activity span, from the lowest sound pressure levels at which discharge rates are observed to the levels at which they attain saturation. From a temporal perspective, the neural encoding at the auditory periphery displays effects such as stimulus onset enhancement (a sudden increase in neural activity as a response to the onset of a sound after silence) and inhibitory effects among successive sounds. There is also evidence of specialised mechanisms to extract amplitude modulation information (Joris et al., 2004). Despite these transformations and non-linearities, the transmission of information along the auditory nerve can be considered quite faithful to the original signal, and thus easily interpretable.

The same is not true higher up along the auditory pathways. While data about these levels of representation is more fragmentary, it is still possible to determine some of their salient characteristics. Fibres of the auditory nerve terminate in the cochlear nucleus, which contains between five and ten neural subsystems operating in parallel. The neurons that make up this nucleus are of different kinds. Among the few studied, primary-like and chopper neurons display different responses to the input from the auditory nerve. Primary-like neurons behave similarly to neurons in the auditory nerve, i.e. they offer a response which is quite faithful and has high frequency resolution, as their main role is to transmit acoustic information to other centres of the brain for auditory localisation. Chopper neurons, on the other hand, seem to be more robust to ambient noise and to differences in sound pressure levels. They achieve this by being sensitive to low pressure levels, while at the same time having a mechanism to regulate their dynamic range in order to avoid saturation. This behaviour has suggested the hypothesis that chopper neurons might possess a switching mechanism that regulates their responses to input auditory nerve fibres of various thresholds and dynamic ranges. Both primary-like and chopper neurons display higher gain levels than neurons in the auditory nerve. Combined with the tonotopic
  33. 33. 2.3. Challenges 19 architecture, this results in an improved spectral representation of prominent events like vowel formants. The cochlear nucleus is one of the structures connected to the inferior colliculus. The inferior colliculus is quite characteristic in the kind of response to amplitude modula- tion that it provides. While neurons in lower areas of the auditory system provide a fairly straightforward representation of amplitude modulation, responses in the inferior colliculus are mostly transient, and they are observed particularly in relation to transi- ent events in the input signal: conversely, acoustic portions representing steady states (e.g. the central parts of many vowels) are not accompanied by significant neural activ- ity. This has been interpreted as a mechanism of perceptual enhancement for acoustic events like bursts in stops with respect to vocalic portions. While aspects of tonotopic organisation are also observable in the auditory cortex, responses at that level are not as easy to correlate to inputs as they are in the auditory nerve, cochlear nucleus and inferior colliculus, despite the transformations undergone in these earlier stages. Young (2008) lists three reasons for this. First, in animals (e.g. marmosets), cortical neurons seem to be selective for sounds that are important for the species, like the vocalisations of other conspecifics, as opposed to the same and similar sounds when perceived by other species. That is, cortical neurons seem to respond to sounds as meaningful objects, rather than to their bare spectral and temporal features. Second, despite the tonotopic organisation, neurons in the auditory cortex are highly adaptable, in the sense that their characteristic frequency can shift if a certain task demands it, sometimes only temporarily. Moreover, in their degree of response, neurons are also sensitive to stimulus frequency. Third, simple models based on the response of cortical neurons to particular sets of stimuli characterised by similar spectro-temporal features do not seem to have a high predictive power, thus suggesting that the way neurons respond to sounds is more
  34. 34. 2.3. Challenges 20 complex. Thus, also neuro-physiological evidence strongly suggests that the mapping between acoustic patterns and linguistic units is a very complex one. While it is still impossible to model all these transformations, they should at least be acknowledged in models of speech recognition that try to give an account of representations, processes and they way these are implemented in the brain. 2.3.4 Automatic speech recognition Moving beyond beads-on-a-string is not only a theoretical necessity imposed by the explanation of data like those presented in the previous sections. With constant de- velopments in automatic speech recognition technology, researchers in that area have increasingly become aware of the intrinsic limitations of a classical HMM framework, where context-free or context-dependent acoustic models of phones (or syllables) are the only interface between the acoustic signal and linguistic categories (Jurafsky and Martin, 2009). The main trigger of this awareness has been the issue of pronunciation variability, mostly intended as variability due to geographical or sociolinguistic factors. Traditionally, this issue has been tackled by explicitly listing in a dictionary several pro- nunciations for the same word. This solution, however, has proven to be unsatisfactory, particularly when dealing with spontaneous speech (Ostendorf, 1999). For this reason, many researchers have been considering alternative approaches (Baker et al., 2009). The HMM framework has proven to be a flexible and powerful formalism for the mod- elling of many aspects of speech recognition. Yet, its intrinsic limitations are well known and constitute a major bottleneck for bridging the gap between human and machine performance (Ostendorf, 1999). Among these, the most relevant ones include the lack of embedded mechanisms for the modelling of event durations and the assumption of conditional independence for successive acoustic observations. A further limitation of standard HMM architectures is given by the blending of acoustic detail due to the rep-
  35. 35. 2.3. Challenges 21 resentation of the acoustic space via mixtures of Gaussians or other kinds of distributions based on summary statistics. Many alternatives have been proposed to overcome these issues. Some of these, rather than discarding the HMM framework, try to enhance its capabilities (Ostendorf et al., 1996; Deng et al., 2006, II.B). Other proposals aim at a different characterisation of the recognition process. Among the latter, a few proposals concentrate their attention on the modelling of the articulatory aspects of speech, e.g. by trying to model vocal tract dynamics (Deng et al., 2005, 2006). Such proposals are interesting in that they try to give a unified account of production and perception, along the lines of popular theories of speech recognition such as those of Liberman et al. (1967; 1985) and Fowler (1990). These theories, on the other hand, are controversial, and reliance upon production mechanisms is not necessary in order to account for many aspects of speech recognition (Jusczyk and Luce, 2002). From the point of view of the discussion here, it is thus more interesting to look at architectures that allow more freedom regarding the nature of the perceptual units involved in recognition, while at the same time remaining agnostic about the relationship between production and perception. Among these, two interesting implemented proposals are template-based systems (De Wachter et al., 2007; Demange and Van Compernolle, 2009; Maier and Moore, 2007) and graphical models for ASR (Bilmes, 2003; Bilmes and Bartels, 2005).
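To make the conditional-independence limitation mentioned above concrete, the following toy sketch (in Python; all probabilities are invented) computes the likelihood of an observation sequence with the standard HMM forward recursion. The comments mark the point at which each frame is scored independently of all previous frames given the current state, which is precisely the assumption that prevents a standard HMM from capturing longer-domain acoustic patterns.

import numpy as np

A = np.array([[0.7, 0.3],   # state transition probabilities
              [0.2, 0.8]])
B = np.array([[0.6, 0.4],   # emission probabilities P(observation | state)
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])   # initial state distribution

def forward(observations):
    # Standard HMM forward recursion: returns P(observation sequence).
    alpha = pi * B[:, observations[0]]
    for o in observations[1:]:
        # Each new frame is scored by B[:, o] alone: given the current state,
        # the observation is assumed independent of everything that came before.
        # Durations are only modelled implicitly, through the geometric
        # distribution induced by the self-transition probabilities in A.
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(forward([0, 1, 1, 0]))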
  36. 36. 3 Theoretical framework: A rational prosodic analysis 3.1 Analytic foundations 3.1.1 Rational analysis A cognitive system can be described from several perspectives: for example, its purpose and the goals it strives to achieve; the mechanisms adopted to achieve those goals; and the physical properties that make those mechanisms work effectively. According to Marr (1982), the levels of explanation of any information processing system are usually only loosely coupled: thus, it should be possible to describe a cognitive system from a particular perspective, while only sketching the others. This assumption is indeed crucial for any endeavour that strives to model a relatively complex system. Anderson (1990; 1991) acknowledges this independence principle as one of the found- ations of what he termed a rational analysis of cognitive systems. Anderson’s rational analysis assumes that a cognitive system has a purpose, which can be described by form- ally defining the task it has to achieve and the environment in which it operates. As the name implies, a rational analysis assumes that cognitive systems behave rationally in taking decisions. According to Anderson’s terminology, “rational” means that the system, which is optimally adapted to its environment and to the task, makes use of all available information about the task and the environment to fulfil its goals. While the characterisation of cognitive systems in terms of their purpose has been widely accepted, the concepts of optimality and rationality are seen by many as not ac- counting for behavioural data about irrational and non-optimal decision making (Kahne- 22
  37. 37. 3.1. Analytic foundations 23 man and Tversky, 1973; Kahneman et al., 1982; and Lopes, 1991 for a critical review). However, Chase et al. (1998) have argued that giving more relevance to the constraints that the environment imposes on the cognitive system and to simple approximations to optimal solutions (a bounded rationality, as they call it) accommodates for these dis- crepancies, by helping to make it clearer what it means to be optimal for a particular system. Rational analyses have been developed for the characterisation of many aspects of perceptual and cognitive systems (Chater and Oaksford, 1999; Oaksford and Chater, 2008): from causal relations (Griffiths, 2005) to associative memory (Anderson and Schooler, 1991), from continuous speech recognition (Norris and McQueen, 2008) to category learning (Sanborn et al., 2010). 3.1.2 Bayes’ theorem An important advantage of a rational analysis over a mechanistic explanation of a cog- nitive system is that it can be readily expressed in formal terms by using Bayes’ theorem. This is particularly useful when information from the environment and the task that is available to the cognitive system is uncertain or incomplete, as is the case for perceptual systems (2.1). In the analysis of speech recognition, the great variability found in acous- tic patterns must be harmonised with the persistence of the linguistic and extra-linguistic categories identified (2.3). By adopting probabilistic reasoning, we can associate an am- biguous acoustic pattern to a set of competing linguistic structures (hypotheses) with different degrees of confidence, and also update confidence scores as soon as new, per- haps disambiguating acoustic evidence for or against a particular linguistic hypothesis becomes available. Bayesian principles (see e.g. Griffiths and Yuille, 2008) constitute a powerful tool for probabilistic reasoning and hypothesis testing. While in a frequentist approach hypo- theses are evaluated exclusively upon previous evidence, in the Bayesian framework a
  38. 38. 3.1. Analytic foundations 24 hypothesis can be given a prior probability, independently from the observed evidence. In the case of speech recognition, this means that hypotheses about linguistic categories can be constrained by many additional linguistic and non-linguistic factors, that we can loosely define as “context”. Bayesian probability theory has been at the core of HMM automatic speech recognition technology for more than twenty years (Jurafsky and Martin, 2009). More recently, and particularly after the work of Anderson, it has gained great popularity also for the modelling of cognitive systems (see e.g. Griffiths and Tenenbaum, 2006, and the early discussion in Watanabe, 1985). Scharenborg et al. (2005) give a unified account of human and automatic speech recognition, showing how describing the task of speech recognition as reasoning under uncertainty in a Bayesian setting helps to bridge the gap between modelling endeavours in ASR and HSR, despite differences at the implementation level (human brains vs. computers). Finally, Norris and McQueen (2008) show convincingly how a formulation of continuous spoken word recognition as a Bayesian problem of optimal decision making accounts elegantly for many effects that in other modelling frameworks would require special treatment. For the particular purpose of this thesis, probabilistic Bayesian modelling offers sev- eral advantages over other kinds of statistical modelling. First of all, at the core of the Bayesian framework is a treatment of data and hypotheses in probabilistic terms. As already described (2.1), such a treatment is required by the very nature of the problem at hand (high degrees of random variability in the acoustic patterns; inherent ambiguity of certain linguistic structures). Second, it is desirable to constrain the scoring of hy- potheses about linguistic categories based on the broad “context” in which they operate because, as we will see (3.2), the linguistic categories that are recruited during recogni- tion are assumed to be task- and environment-specific. Probabilities offer a convenient mechanism for doing so: frequency effects can be easily modelled with prior probabilities, and contextual effects by incorporating other sources of evidence. Finally, because of
  39. 39. 3.1. Analytic foundations 25 the underlying probabilistic reasoning, Bayesian modelling can be applied equally well to various kinds of representations for data and hypotheses: atomic symbols, scalar values, discrete and continuous distributions, complex structures like graphs. Particular kinds of Bayesian models offer additional advantages. Those offered by sparse Bayesian models are discussed in section 5.2. 3.1.3 Firthian prosodic analysis A beads-on-a-string view, in which speech is treated as a concatenation of homogeneous units, is not sufficient to account for human performance in the recognition of examples like those in 2.3.2. Those examples show that listeners’ judgements are driven, to various degrees, by diverse cues that cannot be located on a single segment, or that cannot be related to short term spectral properties. The discussion in 2.3.1 has also pointed out that what is usually considered as indexical variation has in fact a direct influence on recognition performance, and hence cannot be excluded from a comprehensive model of human speech recognition. While the urge to overcome beads-on-a-string represents an element of relative novelty in psycholinguistic modelling (Luce and McLennan, 2005), in descriptive linguistics the question has been extensively investigated since at least the 1940s. Some of the accounts elaborated in that context, however, have remained fairly marginal and less widespread than works that, regarding the issue of variability and invariance, were based on more “orthodox” views (e.g. Chomsky and Halle, 1968). A framework for linguistic analysis known as Firthian Prosodic Analysis provides particularly helpful insights in this respect. The linguistic framework known as Firthian Prosodic Analysis, or simply Prosodic Analysis (Palmer, 1970, henceforth FPA), was developed by J.R. Firth and his co-workers at the School of Oriental and African Studies in London (Firth, 1948). Its development was motivated by the unsuitability, according to Firthians, of classical methods of phon- emic analysis (e.g. Pike, 1947) to the description of many regularities in languages. Firth
  40. 40. 3.1. Analytic foundations 26 attributed the classical analyses that considered the phonology of a language as a unitary system of phonemic contrasts (the beads-on-a-string of section 2.2.3) to the influence of Roman script, noting how other writing systems, based on different principles, were more suited to a more economical description of the languages they had been developed for. Firth mainly disputed the mostly paradigmatic and mono-systemic nature of phonemic approaches. Firth started by considering how certain acoustic patterns (’phonetic exponents’ in FPA terminology) are more economically and profitably described by referring primarily to their collocation within a certain linguistic structure (their syntagmatic properties), rather than to their spectral similarity to other segments occurring in a different context (their paradigmatic properties). For example, in British English, in words like pat and tap, from an acoustic point of view there are potentially many more commonalities (e.g. in terms of degree of aspiration, duration, intensity) between syllable-initial [p] and [t] vs. syllable-final [p] and [t] than between both [p]’s vs. both [t]’s. Such commonalities are determined, in this specific case, by syllabic structure, and can thus be predicted quite independently from the actual segmental content. In this example, the syllable is a suitable context for the prediction of many phonological properties of the word, which in turn determine many of its observable acoustic patterns. Other properties, conversely, would require the consideration of a wider context in which the word is embedded. This clear distinction between ’sounds’ (segments in a traditional sense) and ’prosodies’ (the properties of a given phonological context) allows one to dispense with the transform- ational rules that became one of the main points of interest in Chomsky and Halle’s The Sound Pattern of English (Chomsky and Halle, 1968) and, under different formula- tions, of many successive generative approaches to phonological theory (e.g. Prince and Smolensky, 1993). Particularly relevant for our investigation of the sound-to-grammar mapping is the nature, within a Firthian analysis, of context. Prosodies can be associated with linguistic
  41. 41. 3.1. Analytic foundations 27 units of all kinds. We might have prosodies which serve the purpose of delimiting a syllable, or a word, but also prosodies that mark grammatical categories, like verb vs. noun (e.g. a ’stress’ prosody in many bisyllabic English words, like re’bel vs. ’rebel), active vs. passive (e.g. a ’nasality’ prosody in the Eritrean language Bilin, see Robins, 1970), or prefix vs. non-prefix (Ogden et al., 2000). In addition, some prosodies might be associated with aspects of speech that usually fall under the label of indexical variation (2.3.1): mood, gender, register etc. Thus, in FPA terms, a language is a collection of interacting subsystems, rather than a monolithic, hierarchical system. In FPA, there is a clear distinction between phonological structure and the phonetic manifestations thereof. A prosody, which is an aspect, or element, of phonological struc- ture, will be manifested at the acoustic level by phonetic exponents (acoustic patterns). That is, a prosody can be thought of as an invariant (and thus abstract) element associ- ated with a particular linguistic context, that is realised acoustically by a co-occurrence, or relation, of acoustic features forming a consistent acoustic pattern. The linguistic context is of great importance for the definition of a prosody. A similar acoustic pat- tern appearing in two different linguistic contexts is not automatically considered to be associated to the same prosody: in such a case, acoustic similarity might have no relev- ance whatsoever from a phonological point of view. This view is at odds with the one presented in section 2.2.3. While in most cases both a generative approach and FPA, albeit very differently, are able to adequately explain the same linguistic data, in some circumstances the FPA ap- proach succeeds where a standard phonemic analysis is not straightforward. An example of this is the data of Hawkins and Nguyen (2004) on coda voicing. By examining pairs such as led and let, they found that in addition to the well known effects on vowel dur- ation, in non-rhotic varieties of British English coda voicing also affects the duration of /l/, which is longer, and F2 and centre of gravity as measured at /l/’s onset, which are mostly lower, as compared to the voiceless condition. The influence of coda voicing on
  42. 42. 3.1. Analytic foundations 28 /l/ onset cannot be easily linked to anticipatory co-articulation, and thus is not easily motivated without giving the right weight to the broader linguistic structure. An FPA analysis accounts very naturally for this phonetic behaviour by interpreting coda voicing as a property of the whole syllable, which is thus manifested by several, and possibly non-adjacent, acoustic cues. Besides the weight given to linguistic context, FPA's polysystematicity also gives a consistent explanation of linguistic contrasts which would otherwise seem rather opaque. An example is given by pairs of prefixed words that differ in historical origin (Hawkins and Smith, 2001). Unknown, unnatural and innate all contain prefixes which are monosyllabic and nasal. Furthermore, all words bear primary stress on the second syllable. Despite these similarities, the first two words are rhythmically very different from the third, in that they display a longer /n/. The difference is however easily explained if one considers the origin of the prefixes (Germanic in the former case, Latinate in the latter) and thus postulates two co-existing linguistic systems. Cases like these, together with the other examples (provided in section 2.3) concerning phonetic detail signalling prosodic boundaries through rich phonetic exponents, differences between morphologically simple and complex words, and prefixed vs. pseudo-prefixed words, suggest that in terms of explanatory power, an FPA-style approach to the modelling of variability offers substantial advantages over a beads-on-a-string one. However, there are several issues connected to its adoption. In the first place, one must not forget that FPA is a framework for linguistic analysis, which does not make any claim regarding the exact nature of the psychological processes and representations driving human speech recognition (Firth, 1948). This said, there seems to be at least some evidence suggesting that an FPA-style analysis might actually be adopted by listeners: for a start, the data presented in 2.3 requires an analysis of this sort; additionally, some neuro-physiological evidence seems to support it too (Hawkins and Smith, 2001). A second difficulty is given by the non-formalised, non-exhaustive nature of FPA descriptions. As already noted, an information-processing system can be characterised
  43. 43. 3.1. Analytic foundations 29 at several levels (Marr, 1982). A computational model of an information-processing system requires at least 1) an explicit statement about the task to be carried out by the system, and 2) the development of representational devices and procedures to carry out the simulation. In an FPA analysis, neither aspect is usually dealt with in great detail. A computational model which adopts an FPA-style approach, however, should tackle both aspects explicitly. The discussion of this issue will form the core of chapter 7. There I will show that a Bayesian perspective of the kind adopted in Norris & McQueen (2008) represents an elegant solution to the explicit formulation of the computational task, and that the same probabilistic framework underlying it also allows us to make use of representational devices of various kinds towards the same task, thus preserving the spirit of a Firthian analysis. Although not mainstream, the Firthian approach to the analysis of spoken language has found its way into more recent linguistic accounts. Among these, we might recall Declarative Phonology (Coleman, 1998) and the work of Local and colleagues (Kelly and Local, 1989; Ogden, 1999; Local, 2003). Much of this theoretical work has been applied to models of speech production and implemented in speech synthesis systems, first with YorkTalk (Coleman, 1990) and later with ProSynth (Ogden et al., 2000). Polysp, a descriptive model of human speech recognition by Hawkins and Smith, is largely based on Firthian principles (Hawkins and Smith, 2001; Hawkins, 2003). One account that, despite not having explicit connections to FPA, nonetheless shares some of its features is Jusczyk's WRAPSA model of speech recognition development (Jusczyk, 1993; 2000). Automatic speech recognition systems that, despite being quite different from each other, might represent suitable tools for the implementation of an FPA-style approach include graphical models for ASR (Bartels and Bilmes, 2010) and Leuven's template-based speech recognizer (De Wachter et al., 2007).
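As a foretaste of how the pieces introduced in this section fit together, the following minimal sketch (in Python; the hypotheses, cue names and probability values are invented for illustration, and the actual model is developed in chapter 7) scores two competing linguistic hypotheses in Bayesian terms, P(hypothesis | cues) being proportional to P(cues | hypothesis) P(hypothesis): each hypothesis receives a prior, and beliefs are updated as successive, possibly non-adjacent phonetic exponents of the same prosody are observed. For simplicity the cues are treated as conditionally independent given the hypothesis.

hypotheses = ("prefixed", "non-prefixed")
prior = {"prefixed": 0.3, "non-prefixed": 0.7}          # e.g. derived from lexical statistics

# Likelihood of each cue under each hypothesis; the cues need not be adjacent
# in the signal, since each is a phonetic exponent of the same prosody.
likelihood = {
    "long_nasal":       {"prefixed": 0.8, "non-prefixed": 0.3},
    "lowered_f2_onset": {"prefixed": 0.6, "non-prefixed": 0.4},
}

def update(beliefs, cue):
    # One application of Bayes' rule, renormalised over the two hypotheses.
    unnorm = {h: beliefs[h] * likelihood[cue][h] for h in hypotheses}
    total = sum(unnorm.values())
    return {h: unnorm[h] / total for h in hypotheses}

beliefs = dict(prior)
for cue in ("long_nasal", "lowered_f2_onset"):
    beliefs = update(beliefs, cue)
    print(cue, beliefs)   # confidence scores are revised as evidence accumulates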
  44. 44. 3.2. Central assumptions 30 3.2 Central assumptions The theoretical framework that shapes the computational model of the sound-to-grammar mapping introduced in the next chapters is based on three central assumptions: 1. the goals to be achieved and the computations to be performed in speech recognition, as well as the representation and processing mechanisms recruited, crucially depend on the task a listener is facing, and on the environment in which the task occurs (3.2.1); 2. whatever the task and the environment, the human speech recognition system behaves optimally with respect to them (3.2.2); 3. internal representations of acoustic patterns are distinct from the linguistic features associated with them (3.2.3). The following sections will better qualify, and provide evidence for, these claims. 3.2.1 Specificity of task and environment In a rational analysis perspective, the definition of optimality, and the analysis itself, depend crucially on the task that the cognitive system is facing (3.1.1). This means that the structure of the information processing system being recruited to accomplish the task might be substantially different depending on the task faced by the listener and on the specific characteristics of the environment in which the task occurs (Hawkins and Nguyen, 2004; Norris and Kinoshita, 2008). Continuous spoken word recognition has been, and still is, at the core of most modelling efforts both in psycholinguistics and engineering (Pisoni and Levi, 2007; Baker et al., 2009). In a natural environment, however, spoken word recognition as usually intended is the main goal of just one kind of task: dictation, i.e. the derivation of a lawful sequence of written words from an acoustic input.
  45. 45. 3.2. Central assumptions 31 Continuous spoken word recognition per se is thus the primary object of enquiry, both from a psychological and an engineering perspective, only if the task to be explained and simulated corresponds to dictation. It cannot, however, be automatically assumed as a primary goal for all tasks which involve the recognition of continuous speech. According to the specific task and environment, the role played by spoken word recognition is greater or smaller: in some cases it constitutes a necessary intermediate goal, along with other goals; in other cases it plays an auxiliary, perhaps marginal role. An example supporting this argument is presented by Hawkins and Smith (2001). Most speakers of English possess a wide variety of expressions for conveying the meaning of "I don't know". Each of these varieties, however, is perceived as appropriate only in a specific environment and conveys different kinds of semantic and pragmatic information. These varieties might range from a usually rude I... do... not... know to a rather stylised intonation and rhythm configuration with very weak segmental articulation that signals the fact that the speaker is not very engaged in the conversation: [ə̃ə̃ə̃]. This very context-specific acoustic pattern, because of its uniqueness, does not necessarily require a familiar listener to recognise the sequence of words "I", "don't" and "know" as her main goal. For any task that differs from dictation, the necessity and importance of spoken word recognition as a goal should thus constitute a hypothesis to be tested in its own right, and assessed by experiment. Just as spoken word recognition appears to be central to some of the tasks which involve listening, and less important to others, so the recognition of grammatical structures of other kinds should not be expected to differ in this respect. Even in the case of an easily characterisable task like dictation, the nature of the environment may vary: the kinds of acoustic patterns that one might expect to encounter, and the type and number of linguistic structures that one might need to recruit, will be very different if dictation involves writing down some telephone number or address, as opposed e.g. to writing down a dictation passage at school, or having a business decision dictated with the intention of writing a letter.
  46. 46. 3.2. Central assumptions 32 Firthian Prosodic Analysis is a convenient tool to envisage the recruitment of different language structures based on the task and environment in which speech recognition op- erates. As already noted, in FPA, linguistic structures are organised into self-contained, albeit interacting subsystems, and particular linguistic contrasts are triggered only by context-specific phonological contrasts, or prosodies, and their sometimes complex acous- tic manifestations (3.1.3). By carefully considering task and environment, we can de- velop computational models of speech recognition which are limited enough in scope to make the most out of detailed phonetic descriptions, linguistic analyses, behavioural experiments, and simulations. We can then integrate various models, always considering carefully the respective tasks and environments and, if necessary, revising the models in order to accommodate for any emerging interaction. By postulating an optimal beha- viour of listeners with respect to the task and environment, as envisaged by a rational analysis (3.1.1), and by exploiting the Bayesian framework through the probabilistic in- terpretation of hypotheses and combinations thereof (3.1.2), we have a principled way to express the models formally, to implement them, and to perform this kind of integration. 3.2.2 Optimal behaviour While defining the task a listener is facing, and characterising the environment in which it is performed, we need to specify what it means for a listener to behave optimally with respect to task and environment (3.1.1). Norris and McQueen (2008) give two examples of task-specific optimal behaviour: in tasks requiring speeded decisions, it would amount to “making the fastest decision possible while achieving a given level of accuracy”; whereas in tasks which require a response based on a fixed amount of perceptual evidence, it could be defined as “selecting the response (word) that is most probable, given the available input” (p.358). Whereas a definition of optimality in most experimental settings (where most aspects of task and environment are strictly controlled) is fairly trivial, a formal definition of
  47. 47. 3.2. Central assumptions 33 what is optimal behaviour in more natural contexts becomes quite challenging. Let us recall the already introduced example of dictation, and particularly of having an address dictated on the phone. If part of the environment in which the task is performed is a very expensive international call, and the listener is the caller, she might accept a lower level of confidence in the correctness of the address, with the aim of keeping the conversation as short as possible, and hence paying less. This in turn should be weighted by other contextual factors, such as how wealthy (and stingy) the person is, how important it is for her to get the address right, and whether she knows that she can double-check the address at a later point with an online mapping service. Each of these factors could be important in determining what constitutes an optimal strategy for achieving the task of recognising and writing down the address and, consequently, how speech recognition will be performed (degree of attention to the subtleties of the acoustic signal, reliance upon previous knowledge, degree of adaptation to the voice of the speaker, success measures for the task). A complex optimality criterion like this is largely dependent on factors that are not easily observable, and which are hard to capture in a simple, general-purpose model of speech recognition. This seems a compelling reason to develop computational models that are at first small in scope, and are then gradually expanded and connected to include new combinations of tasks and environments, thus enabling what Luce and McLennan call “cumulative progress” in the understanding of human speech recognition (Luce and McLennan, 2005). 3.2.3 Auditory patterns and linguistic features Finding a way to account gracefully for both generalisation properties and preservation of detail in human speech recognition is arguably one of the most prominent topics among researchers in the field (Luce and McLennan, 2005; Pisoni and Levi, 2007). The discussion about variability and invariance in chapter 2 showed how this issue has been tackled in the past.
  48. 48. 3.2. Central assumptions 34 Early theoretical accounts based on a beads-on-a-string approach (2.2.3) placed the burden of this conversion almost exclusively onto the phonemic level. This simplification was also adopted by many subsequent psycholinguistic models. A few theories and models conceded the role of privileged unit of analysis of this conversion to the word, as was the case for the original Cohort theory (Marslen-Wilson and Welsh, 1978). All those accounts, however, tended to identify the unit of analysis, rather than the units. As already discussed, such a rigid interpretation of human speech recognition fails to account for numerous phenomena, such as: the role played in recognition by indexical factors (2.3.1); physiological data about the nature of auditory representations along the auditory nerve and in the brain (2.3.3); and behavioural data about the role of grammar (2.3.2). At the other end of the spectrum, some psycholinguistic accounts tried to do away completely with abstract representations by holding all exemplars and the phonetic de- tail they carry in memory, and envisaging recognition as an analogical process (e.g. Goldinger, 1998). While accounts of this latter type can do justice to some data, es- pecially those regarding the influence of speaker identity on word recognition (Nygaard et al., 1994), they fail to account for the combinatorial and generalisation properties of human language. The task- and environment-specific, optimal-behaviour approach adopted in this thesis, as outlined in the previous sections, does not encourage any kind of general-purpose account about the nature of mental representations, let alone their hardware implement- ation in the brain. Despite not having a strong position about the specific nature of mental representations, the theoretical framework adopted here assumes some form of distinction between the internal representations of auditory patterns and the linguistic and indexical features associated with them. For example, the internal representation of a single auditory pattern could be recalled to provide at once information about a non-canonical acoustic realisation of a specific lexical item, the sex of the speaker as-
  49. 49. 3.2. Central assumptions 35 sociated with that particular realisation, and her identity. Once again, we believe that the whole picture can only emerge after the careful analysis and modelling of several specific computational tasks, their environments and constraints. Since there are many possible combinations of tasks, environments and constraints, the picture will certainly be a complex one. Since, on the other hand, the concepts of task- and environment-specificity and of optimal behaviour cut across all these combinations and build upon the same principles, we should expect a certain amount of convergence at the representation level.
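The distinction just drawn can be illustrated with a deliberately simple data structure (a hypothetical sketch in Python; field names and values are invented, and no claim is made here about how such information is actually stored or implemented): the internal representation of an auditory pattern is kept separate from, but linked to, the bundle of linguistic and indexical features associated with it, so that the same stored pattern can be interrogated for either kind of information.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class AuditoryPattern:
    # Internal representation of the acoustic event, e.g. a feature vector.
    features: List[float]

@dataclass
class Episode:
    pattern: AuditoryPattern
    linguistic: Dict[str, str] = field(default_factory=dict)   # e.g. lexical item, prosody
    indexical: Dict[str, str] = field(default_factory=dict)    # e.g. speaker sex, identity

episode = Episode(
    pattern=AuditoryPattern(features=[0.12, 0.34, 0.29]),
    linguistic={"word": "unknown", "realisation": "non-canonical"},
    indexical={"speaker_sex": "female", "speaker_id": "S07"},
)
# The same stored pattern can be interrogated for either kind of category.
print(episode.linguistic["word"], episode.indexical["speaker_sex"])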
  50. 50. 4 Assessment: Perceptual-magnet effect 4.1 Motivation In this chapter I introduce a perceptual phenomenon known as the perceptual-magnet effect (PME), and describe an account of it from the literature, which is based on a rational analysis (3.1.1). While, as already mentioned, rational accounts like the one presented here are gaining popularity, and this model in particular offers an elegant explanation of PME for somewhat artificial datasets, I make the case for the use of more "natural" data in the development of such accounts, to prevent a twofold risk: on one hand, that of equating acoustic features with phonological categories (3.2.3); on the other hand, that of concentrating too much on the computational aspects without any specification of the representational and processing devices involved. Recent efforts to include these aspects too in rational models (see e.g. Sanborn et al., 2010) should indeed be welcomed. The increasing convergence of HSR and ASR methods (Scharenborg et al., 2005) is also highly beneficial in this respect. The model of prefix prediction developed in chapter 7, which builds upon the theoretical framework of chapter 3 and on the tools presented in chapter 5, tries to be as explicit as possible in dealing with both the computational and representational aspects. 4.1.1 The perceptual-magnet effect Perceptual-magnet effect (henceforth PME) is a term that has become common in the psychological literature since it was introduced for the first time by Kuhl (1991). It is used to describe the shrinkage of perceptual space, manifested as reduced discrimina- 36
  51. 51. 4.1. Motivation 37 tion, around vowels and liquids whose quality listeners consider prototypical (i.e., good). According to PME, two sounds which are separated by a certain acoustic distance are less easily discriminable when they are in the proximity of good exemplars (prototypes) of the category they belong to. Figure 4.1.1: Illustration of PME for equally spaced stimuli in one-dimensional acoustic space (top) and corresponding representation in perceptual space (bottom). Stimuli closer to the prototype (stimulus 0) are attracted more, and thus are less discriminable from neighbouring stimuli. Consider for example figure 4.1.1. It represents a series of 9 stimuli which vary along one acoustic dimension, say F2 frequency at the steady state of /u:/ in British English. The top vector (circles) shows the stimuli in acoustic space, where there is a constant increase in F2 mean frequency: the stimuli are thus equally spaced. The bottom vector
  52. 52. 4.1. Motivation 38 (squares), by contrast, shows the stimuli as PME would predict they are perceived by a native listener of BE: under the influence of the category’s prototype (stimulus 0 in the figure), which might correspond, in this case, to the mean F2 frequency of all the /u:/ vowels that the listener has heard before when listening to other BE speakers, stimuli which are closer to the prototype in acoustic space tend to be squeezed in perceptual space, whereas stimuli farther away from the prototype in acoustic space are much more clearly distinguishable from each other in perceptual space. Questions about the existence and nature of this kind of perceptual warping have given rise to a lively debate among scholars, particularly during the second half of the 1990s. The major points of criticism raised by those who are not convinced about the existence of a PME in real speech address the experimental methods used in order to assess the PME and its generalisability. With respect to experimental methodology, sceptics pointed out that in almost all cases investigators elicited judgements about isolated, synthetic sounds, which varied along one or two parameters at most (usually F1 and/or F2). One example of this kind of critique can be found in Lotto et al. (1998) (with interesting follow- ups in Guenther, 2000 and Lotto, 2000). The second argument brought forward by PME opposers concerned the difficulty of generalising PME across sounds, sound classes and languages. Although several studies (mainly conducted by Kuhl, Iverson and co- workers) postulate a PME for some language-specific vowel phonemes (e.g. American English /i:/: Iverson and Kuhl, 1995; German /i/: Diesch et al., 1999; Swedish /y/: Kuhl, 1992) and liquids (American English /r/ and /l/: Iverson and Kuhl, 1996, Iverson et al., 2003), other studies did not find any such effect, and thus questioned at least its generalisability (several Australian English vowels: Thyer et al., 2000; American English /i:/: Lotto et al., 1998, Frieda et al., 1999 and Lively and Pisoni, 1997). Most authors who do not agree with a PME analysis explain experimental results that seem to support it in terms of the more classical categorical perception (Liberman et al., 1957). In other classes of sounds, most notably stops and fricatives, listeners’ perception seems to be
  53. 53. 4.1. Motivation 39 strictly categorical. Explaining vowel data with categorical perception has the advantage of providing a unified account for both consonants and vowels. Most of the experimental work done on the PME starts from the assumption that prototypes are the best instance, as judged by listeners of a specific language, of a certain phoneme. In the next section I describe a study by Barrett-Jones and Hawkins (Barrett, 1997; Hawkins and Barret Jones, 2004) that challenges this assumption by showing how contextual information affects the way listeners perceive prototypes, and ultimately brings to the foreground the question about the nature of linguistic units. 4.1.2 Context-dependent PME Barrett-Jones and Hawkins (Barrett, 1997; Hawkins and Barret Jones, 2004) investigated the nature of prototypes in human speech recognition. In her thesis, Barrett-Jones wanted to test whether context sensitivity, in terms of allophonic variation, would affect PME. If the PME were found to be context-dependent, this would have had some implications regarding the nature of the phonological units of representation of speech sounds. She tested this hypothesis by eliciting goodness ratings and similarity judgements from listeners for the Southern British English vowel /u:/ in three different allophonic contexts: isolation (/u:/ ), preceding lateral (/lu:/ ) and preceding palatal glide (/ju:/ ). These three syllables also happen to be (pseudo-) words in SBE (ooh!, Lou/loo, you), a fact which gives an added degree of naturalness to the laboratory experiments. After having recorded 50 tokens for each of the three monosyllables from a 29-year-old male speaker of SBE, Barrett-Jones analysed them in terms of F1, F2 and F3 frequency at the beginning and at the end of the vocalic portion. The measurements were taken as outlined in Barrett-Jones (1997, p. 59), and they served in order to synthesise stimuli for the perceptual experiments. As, unsurprisingly, F2 onset seemed the cue that mostly distinguished the three allophones from each other, stimuli were synthesised by varying
  54. 54. 4.1. Motivation 40 this parameter systematically (from 800 Hz to 1600 Hz), with minimal variation over the remaining parameters, which was necessary due to naturalness concerns (see Barrett-Jones' thesis for further details). In a first experiment Barrett-Jones asked subjects (8, different for each monosyllable) to give comparative judgements about which one was the better exemplar for all possible pairs of stimuli of the same monosyllable (excluding X-X pairs, thus giving 9×9 − 9 = 72 pairs). She then assigned one point to each of the "winning" stimuli. In giving their judgements, subjects were asked to focus on the vowel. This experiment showed that the most-preferred (or prototypes, P) and rarely-preferred (or non-prototypes, NP) tokens differed for each of the three contexts. A further outcome of this experiment was that individual differences were indeed quite large. This observation led Barrett-Jones to have ten randomly-selected subjects repeat the experiment for all three contexts. A plot of these listeners' judgements based on her data is shown in figure 4.1.2. These data hint at the fact that while inter-speaker absolute values for Ps and NPs can still contain some useful information, attention to individual values and relative distances is very important in order to have a proper understanding of the phenomenon. The data presented here will be described at the end of section 4.3.1 and in the simulations following it. In a subsequent experiment Barrett-Jones used the individually-measured Ps and NPs in order to test for differences in discriminability, which could have offered evidence for PME. The measure of discriminability adopted was d' (d-prime: Green and Swets, 1966). The results showed that discrimination was consistently worse around Ps. As a last, crucial step in order to ascertain the plausibility of context-dependent perceptual magnets, Barrett-Jones performed another discrimination test in which the same subject was asked to give same/different judgements for two distinct blocks of stimuli. The first block tested for discriminability around the synthesised, subject-specific prototype (as previously found through comparative judgements) for one of the three contexts, whereas the second block tested for discriminability around a "candidate" prototype which took
