Neural Phonetic Typewriter



The "Neural" Phonetic Typewriter
Teuvo Kohonen
Helsinki University of Technology
March 1988, 0018-9162/88/0300-0011$01.00 © 1988 IEEE

Based on a neural network processor for the recognition of phonetic units of speech, this speaker-adaptive system transcribes dictation using an unlimited vocabulary.

In 1930 a Hungarian scientist, Tihamer Nemes, filed a patent application in Germany for the principle of making an optoelectrical system automatically transcribe speech. His idea was to use the optical sound track on a movie film as a grating to produce diffraction patterns (corresponding to speech spectra), which then could be identified and typed out. The application was turned down as "unrealistic." Since then the problem of automatic speech recognition has occupied the minds of scientists and engineers, both amateur and professional.

Research on speech recognition principles has been pursued in many laboratories around the world, academic as well as industrial, with various objectives in mind. One ambitious goal is to implement automated query systems that could be accessed through public telephone lines, because some telephone companies have observed that telephone operators spend most of their time answering queries. An even more ambitious plan, adopted in 1986 by the Japanese national ATR (Advanced Telecommunication Research) project, is to receive speech in one language and to synthesize it in another, on line. The dream of a phonetic typewriter that can produce text from arbitrary dictation is an old one; it was envisioned by Nemes and is still being pursued today. Several dozen devices, even special microcircuits, that can recognize isolated words from limited vocabularies with varying accuracy are now on the market. These devices have important applications, such as the operation of machines by voice, various dispatching services that employ voice-activated devices, and aids for seriously handicapped people. But in spite of big investments and the work of experts, the original goals have not been reached. High-level speech recognition has existed so far only in science fiction.

Recently, researchers have placed great hopes on artificial neural networks to perform such "natural" tasks as speech recognition. This was indeed one motivation for us to start research in this area many years ago at Helsinki University of Technology. This article describes the result of that research: a complete "neural" speech recognition system, which recognizes phonetic units, called phonemes, from a continuous speech signal. Although motivated by neural network principles, the choices in its design must be regarded as a compromise of many technical aspects of those principles. As our system is a genuine "phonetic typewriter" intended to transcribe orthographically edited text from an unlimited vocabulary, it cannot be directly compared with any more conventional, word-based system that applies classical concepts such as dynamic time warping and hidden Markov models.

Why is speech recognition difficult?

Automatic recognition of speech belongs to the broader category of pattern recognition tasks,³ for which, during the past 30 years or so, many heuristic and even sophisticated methods have been tried. It may seem strange that while progress in many other fields of technology has
been astoundingly rapid, research investments in these "natural" tasks have not yet yielded adequate dividends. After initial optimism, the researchers in this area have gradually become aware of the many difficulties to be surmounted.

Human beings' recognition of speech consists of many tasks, ranging from the detection of phonemes from speech waveforms to the high-level understanding of messages. We do not actually hear all speech elements; we realize this easily when we try to decipher foreign or uncommon utterances. Instead, we continuously relate fragmentary sensory stimuli to contexts familiar from various experiences, and we unconsciously test and reiterate our perceptions at different levels of abstraction. In other words, what we believe we hear, we in fact reconstruct in our minds from pieces of received information.

Even in clear speech from the same speaker, distributions of the spectral samples of different phonemes overlap. Their statistical density functions are not Gaussian, so they cannot be approximated analytically. The same phonemes spoken by different persons can be confused too; for example, the /E/ of one speaker might sound like the /n/ of another. For this reason, absolutely speaker-independent detection of phonemes is possible only with relatively low accuracy.

Some phonemes are spectrally clearer and stabler than others. For speech recognition purposes, we distinguish three acoustically different categories:

(1) Vocal (voiced, nonturbulent) phonemes, including the vowels, semivowels (/j/, /v/), nasals (/m/, /n/, /ŋ/), and liquids (/l/, /r/)
(2) Fricatives (/s/, /ʃ/, /z/, etc.)
(3) Plosives (/k/, /p/, /t/, /b/, /d/, /g/, etc.)

The phonemes of the first two categories have rather well-defined, stationary spectra, whereas the plosives are identifiable only on the basis of their transient properties. For instance, for /k,p,t/ there is a silence followed by a short, faint burst of voice characteristic of each plosive, depending on its point of articulation (lips, tongue, palate). The transition of the speech signal to the next phoneme also varies among the plosives.

A high-level automatic speech recognition system also should interpret the semantic content of utterances so that it can maintain selective attention to particular portions of speech. This ability would call for higher thinking processes, not only imitation of the operation of the preattentive sensory system. The first large experimental speech-understanding systems followed this line of thought (see the report of the ARPA Project,⁴ which was completed around 1976), but for commercial application such solutions were too expensive. Machine interpretation of the meaning of complete sentences is a very difficult task; it has been accomplished only when the syntax has been artificially limited. Such "party tricks" may have led the public to believe that practical speech recognition has reached a more advanced level than it has. Despite decades of intensive research, no machine has yet been able to recognize general, continuous speech produced by an arbitrary speaker, when no speech samples have been supplied.

Recognition of the speech of arbitrary speakers is much more difficult than generally believed. Existing commercial speaker-independent systems are restricted to isolated words from vocabularies not exceeding 40 words. Reddy and Zue estimated in 1983 that for speaker-independent recognition of connected speech, based on a 20,000-word vocabulary, a computing power of 100,000 MIPS, corresponding to 100 supercomputers, would be necessary. Moreover, the detailed programs to perform these operations have not been devised. The difficulties would be even greater if the vocabularies were unlimited, if the utterances were loaded with emotions, or if speech were produced under noisy or stressful conditions.

We must, of course, be aware of these difficulties. On the other hand, we would never complete any practical speech recognizer if we had to attack all the problems simultaneously. Engineering solutions are therefore often restricted to particular tasks. For instance, we might wish to recognize isolated commands from a limited vocabulary, or to type text from dictation automatically. Many satisfactory techniques for speaker-specific, isolated-word recognition have already been developed. Systems that type English text from clear dictation with short pauses between the words have been demonstrated.⁶

Typing unlimited dictation in English is another intriguing objective. Systems designed for English recognize words as complete units, and various grammatical forms such as plural, possessive, and so forth can be stored in the vocabulary as separate word tokens. This is not possible in many other languages (Finnish and Japanese, for example) in which the grammar is implemented by inflections and there may be dozens of different forms of the same root word. For inflectional languages the system must construct the text from recognized phonetic units, taking into account the transformations of these units due to coarticulation effects (i.e., a phoneme's acoustic spectrum varies in the context of different phonemes).

Especially in image analysis, but in speech recognition too, many newer methods concentrate on structural and syntactic relationships between the pattern elements, and special grammars for their analysis have been developed. It seems, however, that the first step, preanalysis and detection of primary features such as acoustic spectra, is still often based on rather coarse principles, without careful consideration of the very special statistical properties of the natural signals and their clustering. Therefore, when new, highly adaptive methods such as artificial neural networks are introduced, we assume that their capacities can best be utilized if the networks are made to adapt to the real data, finding relevant features in the signals. This was in fact one of the central assumptions in our research.

To recapitulate, speech is a very difficult stochastic process, and its elements are not unique at all. The distributions of the different phonemic classes overlap seriously, and to minimize misclassification errors, careful statistical as well as structural analyses are needed.

The promise of neural computers

Because the brain has already implemented the speech recognition function
(and many others), some researchers have reached the straightforward conclusion that artificial neural networks should be able to do the same, regarding these networks as a panacea for such "natural" problems. Many of these people believe that the only bottleneck is computing power, and some even expect that all the remaining problems will be solved when, say, optical neural computers, with a vast computing capacity, become feasible. What these people fail to realize is that we may not yet have discovered what biological neurons and neural systems are like. Maybe the machines we call neural networks and neural computers are too simple. Before we can utilize such computing capacities, we must know what and how to compute.

It is true that intriguing simulations of new information-processing functions, based on artificial neural networks, have been made, but most of these demonstrations have been performed with artificial data that are separable into disjoint classes. Difficulties multiply when natural, stochastic data are applied. In my own experience the quality of a neural network must be tested in an on-line connection with a natural environment. One of the most difficult problems is dealing with input data whose statistical density functions overlap, have awkward forms in high-dimensional signal spaces, and are not even stationary. Furthermore, in practical applications the number of samples of input data used for training cannot be large; for instance, we cannot expect that every user has the patience to dictate a sufficient number of speech samples to guarantee ultimate accuracy.

On the other hand, since digital computing principles are already in existence, they should be used wherever they are superior to biological circuits, as in the syntactic analysis of symbolic expressions and even in the spectral analysis of speech waveforms. The discrete Fourier transform has very effective digital implementations.

Our choice was to try neural networks in a task in which the most demanding statistical analyses are performed, namely, in the optimal detection of the phonemes. In this task we could test some new learning methods that had been shown to yield a recognition accuracy comparable to the decision-theoretic maximum, while at the same time performing the computations by simple elements, using a minimal amount of sample data for training.

Acoustic preprocessing

Physiological research on hearing has revealed many details that may or may not be significant to artificial speech recognition. The main operation carried out in human hearing is a frequency analysis based on the resonances of the basilar membrane of the inner ear. The spectral decomposition of the speech signal is transmitted to the brain through the auditory nerves. Especially at lower frequencies, however, each peak of the pressure wave gives rise to separate bursts of neural impulses; thus, some kind of time-domain information also is transmitted by the ear. On the other hand, a certain degree of synchronization of neural impulses to the acoustic signals seems to occur at all frequencies, thus conveying phase information. One therefore might stipulate that the artificial ear contain detectors that mimic the operation of the sensory receptors as fully as possible.

Biological neural networks are able to enhance signal transients in a nonlinear fashion. This property has been simulated in physical models that describe the mechanical properties of the inner ear and chemical transmission in its neural cells. Nonetheless, we decided to apply conventional frequency analysis techniques, as such, to the preprocessing of speech. The main motivations for this approach were that the digital Fourier analysis is both accurate and fast and the fundamentals of digital filtering are well understood. Standard digital signal processing has been considered sufficient in acoustic engineering and telecommunication. Our decision was thus a typical engineering choice. We also believed the self-organizing neural network described here would accept many alternative kinds of preprocessing and compensate for modest imperfections, as long as they occur consistently. Our final results confirmed this belief; at least there were no large differences in recognition accuracies between stationary and transient phonemes.

Briefly, the complete acoustic preprocessor of our system consists of the following stages:

(1) Noise-canceling microphone
(2) Preamplifier with a switched-capacitor, 5.3-kHz low-pass filter
(3) 12-bit analog-to-digital converter with 13.02-kHz sampling rate
(4) 256-point fast Fourier transform, computed every 9.83 ms using a 256-point Hamming window
(5) Logarithmization and filtering of spectral powers by fourth-order elliptic low-pass filters
(6) Grouping of spectral channels into a 15-component real-pattern vector
(7) Subtraction of the average from all components
(8) Normalization of the resulting vector into constant length

Operations 3 through 8 are computed by the signal processor chip TMS 32010 (our design is four years old; much faster processors are now available).

In many speech recognition systems acoustic preprocessing encodes the speech signal into so-called LPC (linear predictive coding) coefficients, which contain approximately the same information as the spectral decomposition. We preferred the FFT because, as will be shown, one of the main operations of the neural network that recognizes the phonemes is to perform metric clustering of the phonemic samples. The FFT, a transform of the signal, reflects its clustering properties better than a parametric code.

We had the option of applying the overall root-mean-square value of the speech signal as the extra sixteenth component in the pattern vector; in this way we expected to obtain more information on the transient signals. The recognition accuracy remained the same, however, within one percent. We believe that the acoustic processor can analyze many other speech features in addition to the spectral ones. Another trick that improved accuracy on the order of two percent was to make the true pattern vector out of two spectra 30 ms apart in the time scale. Since the two samples represent two different states of the signal, dynamic information is added
to the preanalysis.

Because the plosives must be distinguished on the basis of the fast, transient parts of the speech waveform, we selected the spectral samples of the plosives from the transient regions of the signal, on the basis of the constancy of the waveform. On the other hand, there is evidence that the biological auditory system is sensitive not only to the spectral representations of speech but to their particular transient features too, and apparently it uses the nonlinear adaptive properties of the inner ear, especially its hair cells, the different transmission delays in the neural fibers, and many kinds of neural gating in the auditory nuclei (processing stations between the ear and the brain). For the time being, these nonlinear, dynamic neural functions are not understood well enough to warrant the design of standard electronic analogies for them.

Vector quantization

The instantaneous spectral power values on the 15 channels formed from the FFT can be regarded as a 15-dimensional real vector in a Euclidean space. We might think that the spectra of the different phonemes of speech occupy different regions of this space, so that they could be detected by some kind of multidimensional discrimination method. In reality, several problems arise. One of them, as already stated, is that the distributions of the spectra of different phonemic classes overlap, so that it is not possible to distinguish the phonemes by any discrimination method with 100 percent certainty. The best we can do is to divide the space with optimal discrimination borders, relative to which, on the average, the rate of misclassifications is minimized. It turns out that analytical definition of such (nonlinear) borders is far from trivial, whereas neural networks can define them very effectively. Another problem is presented by the coarticulation effects discussed later.

A concept useful for the illustration of these so-called vector space methods for pattern recognition and neural networks is called Voronoi tessellation. For simplicity, consider that the dissimilarity of two or more spectra can be expressed in terms of their vectorial difference (actually the norm of this difference) in an n-dimensional Euclidean space. Figure 1 exemplifies a two-dimensional space in which a finite number of reference vectors are shown as points, corresponding to their coordinates. This space is partitioned into regions, bordered by lines (in general, hyperplanes) such that each partition contains a reference vector that is the nearest neighbor to any vector within the same partition. These lines, or the midplanes of the neighboring reference vectors, constitute the Voronoi tessellation, which defines a set of discrimination or decision surfaces. This tessellation represents one kind of vector quantization, which generally means quantization of the vector space into discrete regions.

Figure 1. Voronoi tessellation partitions a two-dimensional $(\xi_1, \xi_2)$ "pattern space" into regions around reference vectors, shown as points in this coordinate system. All vectors $(\xi_1, \xi_2)$ in the same partition have the same reference vector as their nearest neighbor and are classified according to it. The solid and open circles, respectively, represent reference vectors of two classes, and the discrimination "surface" between them is drawn in bold.

One or more neighboring reference vectors can be made to define a category in the vector space as the union of their respective partitions. Determination of such reference vectors was the main problem on which we concentrated in our neural network research. There are, of course, many classical mathematical approaches to this problem. In very simple and straightforward pattern recognition, samples, or prototypes, of earlier observed vectors are used as such for the reference vectors. For the new or unknown vector, a small number of its nearest prototypes are sought; then majority voting is applied to them to determine classification. A drawback of this method is that for good statistical accuracy an appreciable number of reference vectors are needed. Consequently, the comparison computations during classification, especially if they are made serially, become time-consuming; the unknown vector must be compared with all the reference vectors. Therefore, our aim was to describe the samples by a much smaller representative set of reference vectors without loss of accuracy.

Imagine now that a fixed number of discrete neurons is in parallel, looking at the speech spectrum, or the set of input signals. Imagine that each neuron has a template, a reference spectrum with respect to which the degree of matching with the input spectrum can be defined. Imagine further that the different neurons compete, the neuron with the highest matching score being regarded as the "winner." The input spectrum would then be assigned to the winner in the same way that an arbitrary vector is assigned to the closest reference vector and classified according to it in the above Voronoi tessellation.
There are neural networks in which such templates are formed adaptively, and which perform this comparison in parallel, so that the neuron whose template matches best with the input automatically gives an active response to it. Indeed, the self-organizing process described below defines reference vectors for the neurons such that their Voronoi tessellation sets near-optimal decision borders between the classes; i.e., the fraction of input vectors falling on the wrong side of the borders is minimized. In classical decision theory, theoretical minimization of the probability for misclassification is a standard procedure, and the mathematical setting for it is the Bayes theory of probability. In what follows, we shall thus point out that the vector quantization and nearest neighbor classification resulting in the neural network defines the reference vectors in such a way that their Voronoi tessellation very closely approximates the theoretical Bayesian decision surfaces.

The neural network

Detailed biophysical analysis of the phenomena taking place at the cell membrane of biological neurons leads to systems of nonlinear differential equations with dozens of state variables for each neuron; this would be untenable in a computational application. Obviously it is necessary to simplify the mathematics, while retaining some essentials of the real dynamic behavior. The approximations made here, while reasonably simple, are still rather "neural" and have been influential in many intriguing applications.

Figure 2 depicts one model neuron and defines its signal and state variables. The input signals are connected to the neuron with different, variable "transmittances" corresponding to the coupling strengths of the neural junctions called synapses. The latter are denoted by $\mu_{ij}$ (here $i$ is the index of the neuron and $j$ that of its input). Correspondingly, $\xi_{ij}$ is the signal value (signal activity, actually the frequency of the neural impulses) at the $j$th input of the $i$th neuron.

Figure 2. Symbol of a theoretical neuron and the signal and system variables relating to it. The small circles correspond to the input connections, the synapses.

Each neuron is thought to act as a pulse-frequency modulator, producing an output activity $\eta_i$ (actually a train of neural impulses with this repetition frequency), which is obtained by integrating the input signals according to the following differential equation. (The biological neurons have an active membrane with a capacitance that integrates input currents and triggers a volley of impulses when a critical level of depolarization is achieved.)

$$\frac{d\eta_i}{dt} = \sum_{j=1}^{n} \mu_{ij}\,\xi_{ij} - \gamma(\eta_i) \tag{1}$$

The first term on the right corresponds to the coupling of input signals to the neuron through the different transmittances; a linear, superpositive effect was assumed for simplicity. The last term, $-\gamma(\eta_i)$, stands for a nonlinear leakage effect that describes all nonideal properties, such as saturation, leakage, and shunting effects of the neuron, in a simple way. It is assumed to be a stronger than linear function of $\eta_i$. It is further assumed that the inverse function $\gamma^{-1}$ exists. Then if the $\xi_{ij}$ are held stationary, or they are changing slowly, we can consider the case $d\eta_i/dt \approx 0$, whereby the output will follow the integrated input as in a nonlinear, saturating amplifier according to

$$\eta_i = \sigma\!\left[\sum_{j=1}^{n} \mu_{ij}\,\xi_{ij}\right] \tag{2}$$

Here $\sigma[\cdot]$ is the inverse function of $\gamma$, and it usually has a typical sigmoidal form, with low and high saturation limits and a proportionality range between.

The settling of activity according to Equation 1 proceeds very quickly; in biological circuits it occurs in tens of milliseconds. Next we consider an adaptive process in which the transmittances $\mu_{ij}$ are assumed to change too. This is the effect regarded as "learning" in neural circuits, and its time constants are much longer. In biological circuits this process corresponds to changes in proteins and neural structures that typically take weeks. A simple, natural adaptation law that already has suggested many applications is the following: First, we must stipulate that parametric changes occur very selectively; thus dependence on the signals must be nonlinear. The classical choice made by most modelers is to assume that changes are proportional to the product of input and output activities (the so-called law of Hebb). However, this choice, as such, would be unnatural because the parameters would change in one direction only (notice that the signals are positive). Therefore it is necessary to modify this law, for example, by including some kind of nonlinear "forgetting" term. Thus we can write

$$\frac{d\mu_{ij}}{dt} = \alpha\,\eta_i\,\xi_{ij} - \beta(\eta_i)\,\mu_{ij} \tag{3}$$

where $\alpha$ is a positive constant, the first term is the "Hebbian" term, and the last term represents the nonlinear "forgetting" effect, which depends on the activity $\eta_i$; forgetting is thus "active." As will be pointed out later, the first term defines changes in the $\mu_{ij}$ in such a direction that the neuron tends to become more and more sensitive and selective to the particular combination of signals $\xi_{ij}$ presented at the input. This is the basic adaptive effect.

On the other hand, to stabilize the output activity to a proper range, it seems very profitable for $\beta(\eta_i)$ to be a scalar function with a Taylor expansion in which the constant term is zero. Careful analyses have shown that this kind of neuron becomes selective to the so-called largest principal component of input. For many choices of the functional form, it can further be shown that the $\mu_{ij}$ will automatically become normalized such that the vector formed from the $\mu_{ij}$ during the process tends to a constant length (norm) independent of the signal values that occur in the process. We shall employ this effect a bit later in a further simplification of the model.

One cannot understand the essentials of neural circuits unless one considers their behavior as a collective system. An example occurs in the "self-organizing feature maps" in our speech recognition application. Consider Figure 3a, where a set of neurons forms a layer, and each neuron is connected to its neighbors in the lateral direction. We have drawn the network one-dimensionally for clarity, although in all practical applications it has been two-dimensional. The external inputs, in the simplest model used for pattern recognition, are connected in parallel to all the neurons of this network so that each neuron can simultaneously "look" at the same input. (Certain interesting but much more complex effects result if the input connections are made to different portions of the network and the activation is propagated through it in a sequence.)

Figure 3. (a) Neural network underlying the formation of the phonotopic maps used in speech recognition. (b) The strengths of lateral interaction as a function of distance (the "Mexican hat" function).

The feedback connections are coupled to the neurons in the same way as the external inputs. However, for simplicity, only the latter are assumed to have adaptive synapses. If the feedbacks were adaptive, too, this network would exhibit other more complex effects. It should also be emphasized that the biological synaptic circuits of the feedbacks are different from those of the external inputs. The time-invariant coupling coefficient of the feedback connections, as a function of distance, has roughly the "Mexican hat" form depicted in Figure 3b, as in real neural networks. For negative coupling, signal-inverting elements are necessary; in biological circuits inversion is made by a special kind of inhibitory interneuron. If the external input is denoted

$$I_i = \sum_{j=1}^{n} \mu_{ij}\,\xi_{ij} \tag{4}$$

then the system equation for the network activities $\eta_i$, denoting the feedback coupling from neuron $k$ to neuron $i$ by $w_{ki}$, can be written

$$\frac{d\eta_i}{dt} = I_i + \sum_{k \in S_i} w_{ki}\,\eta_k - \gamma(\eta_i) \tag{5}$$

where $k$ runs over the subset $S_i$ of those neurons that have connections with neuron $i$. A characteristic phenomenon, due to the lateral feedback interconnections, will be observed first: The initial activity distribution in the network may be more or less random, but over time the activity develops into clusters or "bubbles" of a certain dimension, as shown in Figures 4 and 5. If the interaction range is not much less than the diameter of the network, the network activity seems to develop into a single bubble, located around the maximum of the (smoothed) initial activity.

Consider now that there is no external source of activation other than that provided by the input signal connections, which extend in parallel over the whole network. According to Equations 1 and 2, the strength of the initial activation of a neuron is proportional to the dot product $m_i^T x$, where $m_i$ is the vector of the $\mu_{ij}$, $x$ is the vector of the $\xi_{ij}$, and $T$ is the transpose of a vector. (We use here concepts of matrix algebra whereby $m_i$ and $x$ are column vectors.) Therefore, the bubble is formed around those units at which $m_i^T x$ is maximum.

The saturation limits of $\sigma[\cdot]$ defined by Equation 2 stabilize the activities $\eta_i$ to either a low or a high value. Similarly, $\beta(\eta_i)$ takes on either of two values. Without loss of generality, it is possible to rescale the variables $\xi_{ij}$ and $\mu_{ij}$ to make $\eta_i \in \{0, 1\}$, $\beta(\eta_i) \in \{0, \alpha\}$, whereby Equation 3 will be further simplified and split in two equations:

$$\frac{d\mu_{ij}}{dt} = \alpha\,(\xi_{ij} - \mu_{ij}) \quad \text{if } \eta_i = 1 \text{ and } \beta = \alpha \text{ (inside the bubble)} \tag{6a}$$

$$\frac{d\mu_{ij}}{dt} = 0 \quad \text{for } \eta_i = \beta = 0 \text{ (outside the bubble)} \tag{6b}$$

It is evident from Equation 6 that the transmittances $\mu_{ij}$ then adaptively tend to follow up the input signals $\xi_{ij}$. In other words, these neurons start to become selectively sensitized to the prevailing input pattern. But this occurs only when the bubble lies over the particular neuron. For another input, the bubble lies over other neurons, which then become sensitized to that input. In this way different parts of
the network are automatically "tuned" to different inputs.

The network will indeed be tuned to different inputs in an ordered fashion, as if a continuous map of the signal space were formed over the network. The continuity of this mapping follows from the simple fact that the vectors $m_i$ of contiguous units (within the bubbles) are modified in the same direction, so that during the course of the process the neighboring values become smoothed. The ordering of these values, however, is a very subtle phenomenon, the proof or complete explanation of which is mathematically very sophisticated and cannot be given here. The effect is difficult to visualize without, say, an animation film. A concrete example of this kind of ordering is the phonotopic map described later in this article.

Shortcut learning algorithm

In the time-continuous process just described, the weight vectors attain asymptotic values, which then define a vector quantization of the input signal space, and thus a classification of all its vectors. In practice, the same vector quantization can be computed much more quickly from a numerically simpler algorithm. The bubble is equivalent to a neighborhood set $N_c$ of all those network units that lie within a certain radius from a certain unit $c$. It can be shown that the size of the bubble depends on the interaction parameters, and so we can reason that the radius of the bubble is controllable, eventually being definable as some function of time. For good self-organizing results, it has been found empirically that the radius indeed should decrease in time monotonically. Similarly $\alpha = \alpha(t)$ ought to be a monotonically decreasing function of time. Simple but effective choices for these functions have been determined in a series of practical experiments.

As stated earlier, the process defined by Equation 1 normalizes the weight vectors $m_i$ to the same length. Since the bubble is formed around those units at which $m_i^T x$ is maximum, its center also coincides with that unit for which the norm of the vectorial difference $x - m_i$ is minimum.

Combining all the above results, we obtain the following shortcut algorithm. Let us start with random initial values $m_i = m_i(0)$. For $t = 0, 1, 2, \ldots$, compute:

(1) Center of the bubble ($c$):

$$\lVert x(t) - m_c(t) \rVert = \min_i \lVert x(t) - m_i(t) \rVert \tag{7a}$$

(2) Updated weight vectors:

$$m_i(t+1) = m_i(t) + \alpha(t)\,\bigl(x(t) - m_i(t)\bigr) \quad \text{for } i \in N_c$$
$$m_i(t+1) = m_i(t) \quad \text{for all other indices } i \tag{7b}$$

As stated above, $\alpha = \alpha(t)$ and $N_c = N_c(t)$ are empirical functions of time. The asymptotic values of the $m_i$ define the vector quantization. Notice, too, that Equation 7a defines the classification of input according to the closest weight vector to $x$.

We must point out that if $N_c$ contained the index $i$ only, Equations 7a and 7b

Figure 4. Development of the distribution of activity over time ($t$) into a stable "bubble" in a laterally interconnected neural network (cf. Figure 3). The activities of the individual neurons ($\eta_i$) are shown in the logarithmic scale.

Figure 5. "Bubbles" formed in a two-dimensional network viewed from the top. The dots correspond to neurons, and their sizes correspond to their activity. In the picture on the right, the input was changing slowly, and the motion of the bubble is indicated by its "tail."
would superficially resemble the classical vector quantization method called k-means clustering.¹⁰ The present method, however, is more general because the corrections are made over a wider, dynamically defined neighborhood set, or bubble N_c, so that an ordered mapping is obtained. Together with some fine adjustments of the m_i vectors,⁹ spectral recognition accuracy is improved significantly.

Phonotopic maps

For this discussion we assume that a lattice of hexagonally arranged neurons forms a two-dimensional neural network of the type depicted in Figure 3. As already described, the microphone signal is first converted into a spectral representation, grouped into 15 channels. These channels together constitute the 15-component stochastic input vector x, a function of time, to the network. The self-organizing process has been used to create a "topographic," two-dimensional map of speech elements onto the network.

Superficially this network seems to have only one layer of neurons; due to the lateral interactions in the network, however, its topology is in effect even more complicated than that of the famous multilayered Boltzmann machines or backpropagation networks.¹¹ Any neuron in our network is also able to create an internal representation of input information in the same way as the "hidden units" in the backpropagation networks eventually do. Several projects have recently been launched to apply Boltzmann machines to speech recognition. We should learn in the near future how they compete with the design described here.

The input vectors x, representing short-time spectra of the speech waveform, are computed in our system every 9.83 milliseconds. These samples are applied in Equations 7a and 7b as input data in their natural order, and the self-organizing process then defines the m_i, the weight vectors of the neurons. One striking result is that the various neurons of the network become sensitized to spectra of different phonemes and their variations in a two-dimensional order, although teaching was not done by the phonemes; only spectral samples of input were applied. The reason is that the input spectra are clustered around phonemes, and the process finds these clusters.

The maps can be calibrated using spectra of known phonemes. If then a new or unknown spectrum is presented at the inputs, the neuron with the closest transmittance vector m_i gives the response, and so the classification occurs in accordance with the Voronoi tessellation in which the m_i act as reference vectors. The values of these vectors very closely reflect the actual speech signal statistics. Figure 6 shows the calibration result for different phonemic samples as a gray-level histogram of such responses, and Figure 7 shows the map when its neurons are labeled according to the majority voting for a number of different responses.

Figure 6. The signal of natural speech is preanalyzed and represented on 15 spectral channels ranging from 200 Hz to 5 kHz. The spectral powers of the different channel outputs are presented as input to an artificial neural network. The neurons are tuned automatically, without any supervision or extra information, to the acoustic units of speech identifiable as phonemes. In this set of pictures the neurons correspond to the small rectangular subareas. Calibration of the map was made with 50 samples of each test phoneme. The shaded areas correspond to histograms of responses from the map to certain phonemes (white: maximum).

The speech signal is a continuous waveform that makes transitions between various states, corresponding to the phonemes. On the other hand, as stated earlier, the plosives are detectable only as transient states of the speech waveform. For that reason their labeling in Figure 7 is not reliable. Recently we solved the
problem of more accurate detection of plosives and certain other phonemic categories by using special, auxiliary maps in which only a certain category of phonemes was represented, and which were trained by a subset of samples. For this purpose we first detect the presence of such phonemes (as a group) from the waveform, and then we use this information to activate the corresponding map. For instance, the occurrence of /k,p,t/ is indicated by low signal energy, and the corresponding spectral samples are picked from the transient regions following silence. The nasals as a group are detectable by responses obtained from the middle area of the main map.

Figure 7. The neurons, shown as circles, are labeled with the symbols of the phonemes to which they “learned” to give best responses. Most neurons give a unique answer; the double labels here show neurons that respond to two phonemes. Distinction of /k,p,t/ from this map is not reliable and needs the analysis of the transient spectra of these phonemes by an auxiliary map. In the Japanese version there are auxiliary maps for /k,p,t/, /b,d,g/, and /m,n,ŋ/ for more accurate analysis.

Another problem is segmentation of the responses from the map into a standard phonemic transcription. Consider that the spectral samples are taken at regular intervals every 9.83 milliseconds, and they are first labeled in accordance with the corresponding phonemic spectra. These labeled samples are called quasiphonemes; in contrast, the duration of a true phoneme is variable, say, from 40 to 400 milliseconds. We have used several alternative rules for the segmentation of quasiphoneme sequences into true phonemes. One of them is based on the degree of stability of the waveform; most phonemes, apart from the plosives, have a unique stationary state. Another, more heuristic method is to decide that if m out of n successive quasiphonemes are the same, they correspond to a single phoneme; e.g., m = 4 and n = 7 are typical values.

The sequences of quasiphonemes can also be visualized as trajectories over the main map, as shown in Figure 8. Each arrowhead represents one spectral sample. For clarity, the sequence of coordinates shown by arrows has been slightly smoothed to make the curves more continuous. It is clearly discernible that convergence points of the speech waveform seem to correspond to certain (stationary) phonemes.

Figure 8. Sequence of the responses obtained from the phonotopic map when the Finnish word humppila was uttered. The arrows correspond to intervals of 9.83 milliseconds, at which the speech waveform was analyzed spectrally.

This kind of graph provides a new means, in addition to some earlier ones, for the visualization of the phonemes of speech, which may be useful for speech training and therapy. Profoundly deaf people may find it advantageous to have an immediate visual feedback from their speech.

It may be necessary to point out that the phonotopic map is not the same thing as the so-called formant maps used in phonetics. The latter display the speech signal in coordinates that correspond to the two lowest formants, or resonant frequencies of the vocal tract. Neither is this map any kind of principal component graph for phonemes. The phonotopic map displays the images of the complete spectra as points on a plane, the distances of which approximately correspond to the vectorial differences between the original spectra; so this map should rather be regarded as a similarity graph, the coordinates of which have no explicit interpretation.
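The m-out-of-n segmentation rule mentioned earlier in this section can be illustrated with a short Python sketch. It is only one possible reading of the rule: the sliding window, the function name segment_quasiphonemes, and the handling of windows with no dominant label are assumptions made here for illustration, not details given in the article.

```python
from collections import Counter


def segment_quasiphonemes(labels, m=4, n=7):
    """Collapse per-frame quasiphoneme labels into a phoneme sequence.

    A phoneme is emitted when at least m of n successive labels agree
    (m = 4 and n = 7 are the typical values quoted in the text).
    The sliding-window scheme itself is an assumption of this sketch.
    """
    phonemes = []
    current = None  # label of the phoneme most recently emitted
    for start in range(len(labels) - n + 1):
        window = labels[start:start + n]
        label, count = Counter(window).most_common(1)[0]
        if count >= m:
            if label != current:
                phonemes.append(label)
                current = label
        else:
            # No label dominates this window; forget the current phoneme
            # so that the same symbol may be emitted again later.
            current = None
    return phonemes


# Seven frames each of three labels, with 9.83 ms per frame in the article.
frames = list("hhhhhhh" + "uuuuuuu" + "mmmmmmm")
print(segment_quasiphonemes(frames))
```

With the default values a single mislabeled frame inside a steady phoneme does not produce a spurious symbol, which is the kind of robustness the heuristic is meant to provide. Sequences shorter than n frames yield no output in this sketch.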
Actually, the phoneme recognition accuracy can still be improved by three or four percent if the templates m_i are fine-tuned: small corrections to the responding neurons can be made automatically by turning their template vectors toward x if a tentative classification was correct, and away from x if the result was wrong.

Postprocessing in symbolic form

Even if the classification of speech spectra were error-free, the phonemes would not be identifiable from them with 100-percent reliability. This is because there are coarticulation effects in speech: the phonemes are influenced by neighboring phonemes. One might imagine it possible to list and take into account all such variations. But there may be many hundreds of different frames or contexts of neighboring phonemes in which a particular phoneme may occur. Even this, however, is an optimistic figure since the neighbors too may be transformed by other coarticulation effects and errors. Thus, the correction of such transformed phonemes should be made by reference to some kind of context-sensitive stochastic grammar, the rules of which are derived from real examples. I have developed a program code that automatically constructs the grammatical transformation rules on the basis of speech samples and their correct reference transcriptions. A typical error grammar may contain 15,000 to 20,000 rules (productions), and these rules can be encoded as a data structure or stored in an associative memory. The optimal amount of context is determined automatically for each rule separately. No special hardware is needed; the search of the matching rules and their application can be made in real time by efficient and fast software methods, based on so-called hash coding, without slowing down the recognition operation.

Figure 9. The coprocessor board for the neural network and the postprocessing functions.

Figure 10. Block diagram of the coprocessor board. A/D: analog-to-digital converter. TMS320: Texas Instruments 32010 signal processor chip. RAM/ROM: 4K-word random-access memory, 256-word programmable read-only memory. EPROM: 64K-byte electrically erasable read-only memory. DRAM: 512K-byte dual-port random-access memory. SRAM: 96K-byte paged dual-port random-access memory. 80186: Intel microprocessor CPU. 8256: parallel interface.

The two-stage speech recognition system described in this article is a genuine phonetic typewriter, since it outputs orthographic transcriptions for unrestricted utterances, the forms of which only approximately obey certain morphological rules or regularities of a particular language. We have implemented this system for both Finnish and (romanized) Japanese. Both of these languages, like Latin, are characterized by the fact that their orthography is almost identical to their phonemic transcription.
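The hash-coded lookup of context-sensitive correction rules described in this section can be pictured with a small Python sketch. The article does not spell out how the rules are encoded, so everything specific here is an assumption made for illustration: the rule format (a symbol with equally wide left and right contexts), the widest-context-first matching order, the function name apply_rules, and the single made-up example rule. A Python dict stands in for the hash-coded rule store.

```python
def apply_rules(symbols, rules, max_context=2):
    """Correct a phonemic string with context-sensitive replacement rules.

    rules maps (left_context, symbol, right_context) to a replacement
    string, where the contexts are tuples of neighboring symbols.
    This rule format is hypothetical, not the one used in the article.
    """
    corrected = []
    n = len(symbols)
    for i, symbol in enumerate(symbols):
        replacement = symbol  # default: leave the symbol unchanged
        # Try the widest context first, then progressively narrower ones.
        for width in range(max_context, -1, -1):
            if i - width < 0 or i + width >= n:
                continue  # not enough neighbors for this context width
            key = (tuple(symbols[i - width:i]), symbol,
                   tuple(symbols[i + 1:i + 1 + width]))
            if key in rules:  # the dict lookup plays the role of hash coding
                replacement = rules[key]
                break
        corrected.append(replacement)
    return "".join(corrected)


# Made-up rule: an "n" heard between "u" and "p" is rewritten as "m".
example_rules = {(("u",), "n", ("p",)): "m"}
print(apply_rules("hunppila", example_rules))  # prints humppila
```

Because each rule carries its own context inside the key, rules with different amounts of context can coexist in the same table, and each lookup costs roughly constant time regardless of how many thousands of rules are stored.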
As a complete speech recognition device, our system can be made to operate in either of two modes: (1) transcribing dictation of unlimited text, whereby the words (at least in some common idioms) can be connected, since similar rules are applicable for the editing of spaces between the words (at present short pauses are needed to insert spaces); and (2) isolated-word recognition from a large vocabulary.

In isolated-word recognition we first use the phonotopic map and its segmentation algorithm to produce a raw phonemic transcription of the uttered word. Then this transcription is compared with reference transcriptions earlier collected from a great many words. Comparison of partly erroneous symbolic expressions (strings) can be related to many standard similarity criteria. Rapid prescreening and spotting of the closest candidates can again be performed by associative or hash-coding methods; we have introduced a very effective error-tolerant searching scheme called redundant hash addressing, by which a small number of the best candidates, selected from vocabularies of thousands of items, can be located in a few hundred milliseconds (on a personal computer). After that, the more accurate final comparison between the much smaller number of candidates can be made by the best statistical methods.

Hardware implementations and performance

The system's neural network could, in principle, be built up of parallel hardware components that behave according to Equations 5 and 6. For the time being, no such components have been developed. On the other hand, for many applications the equivalent functions from Equations 7a and 7b are readily computable by fast digital signal processor chips; in that case the various neurons only exist virtually, as the signal processors are able to solve their equations by a timesharing principle. Even this operation, however, can be performed in real time, especially in speech processing.

The most central neural hardware of our system is contained on the coprocessor board shown in Figure 9. Its block diagram is shown in Figure 10. Only two signal processors have been necessary: one for the acoustic preprocessor that
produces the input pattern vectors x, and another for timeshared computation of the responses from the neural network. For the time being, the self-organized computation of the templates m_i, or “learning,” is made in an IBM PC AT-compatible host processor, and the transmittance parameters (synaptic transmittances) are loaded onto the coprocessor board. Newer designs are intended to operate as stand-alone systems. A standard microprocessor CPU chip on our board takes care of overall control and data routing and performs some preprocessing operations after FFT (such as logarithmization and normalization), as well as segmenting the quasiphoneme strings and deciding whether the auxiliary transient maps are to be used. Although the 80186 is a not-so-effective CPU, it still has extra capacity for postprocessing operations: it can be programmed to apply the context-sensitive grammar for unlimited text or to perform the isolated-word recognition operations.

The personal computer has been used during experimentation for all postprocessing operations. Nonetheless, the overall recognition operations take place in near real time. In the intended mode of operation the speech recognizer will only assist the keyboard operations and communicate with the CPU through the same channel.

One of the most serious problems with this system, as well as with any existing speech recognizer, is recognition accuracy, especially for an arbitrary speaker. After postprocessing, the present transcription accuracy varies between 92 and 97 percent, depending on speaker and difficulty of text. We performed most of the experiments reported here with half a dozen male speakers, using office text, names, and the most frequent words of the language. The number of tests performed over the years is inestimable. Typically, thousands of words have been involved in a particular series of tests. Enrollment of a new speaker requires dictation of 100 words, and the learning processes can proceed concurrently with dictation. The total learning time on the PC is less than 10 minutes. During learning, the template vectors of the phonotopic map are tuned to the new samples.

Isolated-word recognition from a 1000-word vocabulary is possible with an accuracy of 96 to 98 percent. Since the recognition system forms an intermediate symbolic transcription that can be compared with any standard reference transcriptions, the vocabulary or its active subsets can be defined in written form and changed dynamically during use, without the need of speaking any samples of these words.

All output, for unlimited text as well as for isolated words, is produced in near real time: the mean delay is on the order of 250 milliseconds per word. It should be noticed that contemporary microprocessors already have much higher speeds (typically five times higher) than the chips used in our design.

To the best of our knowledge, this system is the only existing complete speech recognizer that employs neural computing principles and has been brought to a commercial stage, verified by extensive tests. Of course, it still falls somewhat short of expectations; obviously some kind of linguistic postprocessing model would improve its performance. On the other hand, our principal aim was to demonstrate the highly adaptive properties of neural networks, which allow a very accurate, nonlinear statistical analysis of real signals. These properties ought to be a goal of all practical “neurocomputers.”

References

1. W.A. Lea, ed., Trends in Speech Recognition, Prentice-Hall, Englewood Cliffs, N.J., 1980.
2. S.E. Levinson, L.R. Rabiner, and M.M. Sondhi, “An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition,” Bell Syst. Tech. J., Apr. 1983, pp. 1035-1073.
3. P.A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach, Prentice-Hall, London, 1982.
4. D.H. Klatt, “Review of the ARPA Speech Understanding Project,” J. Acoust. Soc. Amer., Dec. 1977, pp. 1345-1366.
5. R. Reddy and V. Zue, “Recognizing Continuous Speech Remains an Elusive Goal,” IEEE Spectrum, Nov. 1983, pp. 84-87.
6. P. Petre, “Speak, Master: Typewriters That Take Dictation,” Fortune, Jan. 7, 1985, pp. 56-60.
7. M.R. Schroeder and J.L. Hall, “Model for Mechanical to Neural Transduction in the Auditory Receptor,” J. Acoust. Soc. Am., May 1974, pp. 1055-1060.
8. R. Meddis, “Simulation of Mechanical to Neural Transduction in the Auditory Receptor,” J. Acoust. Soc. Am., Mar. 1986, pp. 703-711.
9. T. Kohonen, Self-Organization and Associative Memory, Series in Information Sciences, Vol. 8, Springer-Verlag, Berlin-Heidelberg-New York-Tokyo, 1984; 2nd ed. 1988.
10. J. Makhoul, S. Roucos, and H. Gish, “Vector Quantization in Speech Coding,” Proc. IEEE, Nov. 1985, pp. 1551-1588.
11. D.E. Rumelhart, G.E. Hinton, and R.J. Williams, “Learning Internal Representations by Error Propagation,” in Parallel Distributed Processing, Explorations in the Microstructure of Cognition, Volume 1: Foundations, ed. by David E. Rumelhart, James L. McClelland, and the PDP Research Group, MIT Press, Cambridge, Mass., 1986, pp. 318-362.
12. T. Kohonen, “Dynamically Expanding Context, with Application to the Correction of Symbol Strings in the Recognition of Continuous Speech,” Proc. Eighth Int’l Conf. Pattern Recognition, IEEE Computer Society, Washington, D.C., 1986, pp. 1148-1151.

Teuvo Kohonen is a professor on the Faculty of Information Sciences of Helsinki University of Technology, Finland. He is also a research professor of the Academy of Finland, a member of the Finnish Academy of Sciences, and a member of the Finnish Academy of Engineering Sciences. He received his D.Eng. degree in physics from Helsinki University of Technology in 1962. His scientific interests are neural computers and pattern recognition.

Kohonen has written four textbooks, of which Content-Addressable Memories (Springer, 1987) and Self-Organization and Associative Memory (Springer, 1988) are best known. He was the first vice chairman of the International Neural Network Society. Kohonen is a senior member of the IEEE. He was awarded the Eemil Aaltonen Honorary Prize in 1983 and the Cultural Prize of Finnish Commercial Television in 1984.

Kohonen’s address is Helsinki University of Technology, Laboratory of Computer and Information Science, Rakentajanaukio 2 C, SF-02150 Espoo, Finland.