S2.11

A RECURRENT TIME-DELAY NEURAL NETWORK FOR IMPROVED PHONEME RECOGNITION

Fabio GRECO, Andrea PAOLONI, Giacomo RAVAIOLI
Fondazione Ugo Bordoni, Via Baldassarre Castiglione, 59 - 00142 Roma
ABSTRACT
In this work we propose a modification to the well-known Time-Delay Neural Network structure, obtained through a feedback at the first-hidden layer level.

The experiment carried out with the new model, called RTDNN (Recurrent Time-Delay Neural Network), consists of the classification of the unvoiced plosive phonemes. These were extracted from initial and intermediate positions in a list of the most common Italian words, uttered by a male speaker, thus obtaining 250 tokens per phoneme. Training was carried out with a modified variant of Back Propagation, known as BPS (Back Propagation for Sequences), using half of the tokens for learning and the remaining half for the test. The error rate trend thus obtained shows a 28% decrease in a particular range of β (the magnitude of feedback), with values ranging from 5% for the original TDNN model with no feedback to 3.6% for our RTDNN model.
INTRODUCTION
A neural network used for phoneme recognition has to take into account the sequential nature of speech. Unlike static pattern classification, the net must be able to exploit the information coming from the temporal dimension, and decisions must be linked to the way patterns evolve in time rather than to their shape at particular instants.

A solution adopted by many researchers is the extension of the input window, considering a network activated simultaneously by a large portion of speech. In this way they get round the difficulty of explicitly treating time and return to static pattern classification.
Nevertheless, the question of temporal window extension is quite troublesome, and experiments up to now have been based on heuristic assumptions [1, 2].
Initially, it might seem that the window should be wide enough to contain as much contextual information as possible. There are, however, at least three fundamental problems connected to the use of a wide signal window.

The first problem is that contextual information is not equally distributed, since the frames closer to the ones being classified are more important than the ones further away. The net should then exhibit a "forgetting effect" in order to exclude information that is too old and give more importance to the closer frames. The difficulty with a wide signal window lies in the fact that it is not known how to weight the window in order to assure this effect.

The second problem is connected to local speech rate variations. Though with a small window such variations have a negligible effect, with wider windows considerable difficulties come up. In this case, learning to take such variations into account is very difficult, and this leads to a worsening in performance.

The third problem is that a large window requires a net with a large number of weights in the input layer, and consequently a large amount of training time and data.
Time-Delay Neural Networks (TDNNs) [3] avoid the last
two problems with a network activated (in the input layer) by
signal windows of only 30 ms. The use of contextual
information based on a wider window (150 ms) occurs in the
upper network layer, where the extraction of low-level acoustical
features has already been achieved.
So there is the advantage of separating the acoustical feature extraction (sensitive to local signal distortion) from the integration of those features along the time dimension. By means of this separation, the possible effects of temporal variations can be better handled in the upper network layer, since most information is already codified there.
RTDNN
TDNNs, as we said, take into account a context of 150 ms
but low-level signal analysis is limited to a window of 30 ms.
Instead, it would be useful for the internal representation computed by the first-hidden layer neurons to be influenced by a wider portion of speech, so that the context is taken into account to a greater extent in the extraction of acoustical features. Solving this problem merely by extending the 30 ms signal window is not effective, for the reasons given above.
Our solution takes a wider context into account while avoiding the problems related to enlarging the window. The method consists of using a high-level memory, generated through a feedback between the codified information produced by the first-hidden layer neuron outputs at time "t" and the input of the same neurons at time "t+1" (Fig. 1). In the spatially-extended representation of the new model, which we call RTDNN (Recurrent Time-Delay Neural Network), this modification can be seen as a sequence of connections acting horizontally within the first-hidden layer, between a frame of eight neurons and the following one (Fig. 2). The feedback weights are the same in all the time-shifted copies of the connections, thereby preserving the important time-invariance property of the original TDNN model.
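To make the structure concrete, the following is a minimal NumPy sketch of this recurrence (our own illustration; the paper publishes no code). Frame and layer sizes follow Fig. 2: 16 spectral coefficients per frame, a 3-frame input window and 8 first-hidden units per frame; all identifiers are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rtdnn_hidden_layer(frames, W, b, w_fb, output_feedback=True):
    """Forward pass of the RTDNN first-hidden layer.

    frames : (T, 16) array of spectral frames.
    W      : (8, 48) weights over a 3-frame window (3 x 16 inputs).
    b      : (8,) biases.
    w_fb   : (8,) self-feedback weights, shared by every time-shifted copy
             of the connections (this is what keeps the model time-invariant).
    """
    a = np.zeros(8)                      # previous activations, a_i(0) = 0
    outputs = []
    for t in range(frames.shape[0] - 2):
        x = frames[t:t + 3].reshape(-1)  # x(t): frames t, t+1, t+2 concatenated
        net = W @ x + b
        net += w_fb * (sigmoid(a) if output_feedback else a)
        a = net
        outputs.append(sigmoid(a))
    return np.stack(outputs)             # (T-2, 8): one hidden frame per step
```

Setting w_fb to zero removes the memory and recovers the plain TDNN hidden layer.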
The network thus developed has the advantage of coping better with the sequential nature of speech; in this way speech is treated as a sequence of events rather than a succession of static patterns without temporal ordering.

Unlike the TDNN, our model can explicitly exploit, during acoustic feature extraction, information hidden in the temporal structure of speech. This is particularly important for phonemes like plosives or nasals, which exhibit a clear sequential structure. The feedback action, in considering the contribution of closer frames, is qualitatively different from that obtained by extending the analysis window.
Fig. 1: The RTDNN first-hidden layer neuron.

Fig. 2: The RTDNN architecture (output nodes /p/, /t/, /k/; input window of 15 frames = 150 ms).
As can be seen in Fig. 2, the fed-back information is the one coming from the coding performed by the first hidden neurons, not directly from the signal. This information is not limited to a finite, prefixed time interval (determined by the topology of the connections toward the input) but extends backward in time, with an extent determined adaptively by the feedback weight magnitude.

This consideration raises the problem of stability: the contribution of past frames must in fact decrease going backward in time, and this condition is assured only for some values of the feedback weights. Let us see why.
Referring to Fig. 2, let x(t) be the vector constructed as the concatenation of the three input vectors with temporal coordinates t, t+1, t+2. Calling $a_i$ the activation of neuron $i$ of the first hidden layer (the value preceding the sigmoid function), and initially assuming an activation feedback, we can write:

$$a_i(t) = w_{ii}\, a_i(t-1) + \sum_j w_{ij}\, x_j(t) + b_i, \qquad a_i(0) = 0 \qquad (1)$$

where $w_{ii}$ is the feedback weight. By induction we get:

$$a_i(T) = \sum_{t=1}^{T} w_{ii}^{\,T-t}\, P_i(t), \qquad \text{with } P_i(t) = \sum_j w_{ij}\, x_j(t) + b_i \qquad (2)$$
In order to estimate the influence of the signal window at time $t-\tau$ on neuron $i$ at time $t$, let us define the following parameter:

$$\delta(t,\tau) = \frac{\partial a_i(t)}{\partial P_i(t-\tau)} = w_{ii}^{\,\tau} \qquad (3)$$

From this we can see that, in order to ensure the desired stability or "forgetting effect", the feedback weights must be constrained to the range $(-1,+1)$.
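A small numeric check of (2) and (3), under our own variable names: the closed form agrees with the step-by-step recurrence, and the influence $w_{ii}^{\tau}$ of a frame $\tau$ steps back dies out only when the feedback weight lies inside $(-1,+1)$.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.normal(size=20)      # stands in for P_i(t) = sum_j w_ij x_j(t) + b_i

def unrolled(w_fb, P):
    """a_i(T) via the recurrence a(t) = w_fb * a(t-1) + P(t), with a(0) = 0."""
    a = 0.0
    for p in P:
        a = w_fb * a + p
    return a

T = len(P)
for w_fb in (0.5, 1.5):
    closed = sum(w_fb ** (T - t) * P[t - 1] for t in range(1, T + 1))  # eq. (2)
    assert np.isclose(unrolled(w_fb, P), closed)
    print(w_fb, [w_fb ** tau for tau in (1, 5, 10)])   # delta(t, tau), eq. (3)
# w_fb = 0.5: the influence decays (forgetting effect);
# w_fb = 1.5: it grows without bound, i.e. the recurrence is unstable.
```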
Observing formula (2), we notice that the transfer function of the neuron with activation feedback is similar to that of an IIR (Infinite Impulse Response) filter:

$$\sum_{i=0}^{T} a^i z^{-i} \simeq \frac{1}{1 - a z^{-1}} \qquad (4)$$

which becomes an equality for $T$ infinite. The $a$ coefficient is thus related to the feedback weight $w_{ii}$ of the RTDNN neuron.

The transfer function of the TDNN neuron is instead similar to that of a FIR (Finite Impulse Response) filter, because it is affected by the contribution of only a finite number $D$ of frames. This transfer function is of the following kind:

$$\sum_{i=0}^{D} a_i z^{-i} \qquad (5)$$

where now the various $a_i$ are different, and each of them is related to the connection matrix between the TDNN neuron and the input frame at time $t-i$.
In the case of output feedback, formula (1) becomes:

$$a_i(t) = w_{ii}\, f(a_i(t-1)) + \sum_j w_{ij}\, x_j(t) + b_i \qquad (6)$$

where $f(x)$ is the sigmoid function, and it is easily proven that in this case:

$$|\delta(t,\tau)| \le \left(\frac{|w_{ii}|}{4}\right)^{\tau} \qquad (7)$$

If $-4 < w_{ii} < +4$, the effect of past frames will be modulated by decreasing terms, thereby ensuring the desired effect.

As can be seen in (6), output feedback is non-linear, because of the sigmoid function, and the correspondence with IIR filters is not so immediate: the RTDNN neuron with output feedback can thus be described as a kind of non-linear IIR filter. Because of the greater generality of this model, we adopted it in our experiment with the RTDNN.
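The bound (7) follows from the fact that the sigmoid derivative never exceeds 1/4, so each step through $f$ attenuates the influence of the past by at least $|w_{ii}|/4$; a quick numeric check (our illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # maximum value 1/4, reached at z = 0

w_fb = 3.0                        # allowed: -4 < w_fb < +4
z = np.random.default_rng(1).normal(size=1000)
# one-step sensitivity: |da(t)/da(t-1)| = |w_fb * f'(.)| <= |w_fb| / 4
assert np.all(np.abs(w_fb * d_sigmoid(z)) <= abs(w_fb) / 4 + 1e-12)
# chaining tau steps gives |delta(t, tau)| <= (|w_fb| / 4) ** tau, eq. (7)
print((abs(w_fb) / 4) ** np.arange(1, 6))   # decreasing, since |w_fb| < 4
```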
THE LEARNING ALGORITHM
The introduction of feedback into the TDNN structure creates some trouble for the learning algorithm.

The algorithm used to train the RTDNN follows an approach similar to those proposed by Gori [4] and Kuhn [5]. The method, known as BPS (Back-Propagation for Sequences), avoids the backward path in time to the initial point during the back-propagation stage, as required by Rumelhart's "time-unfolding" technique. This comes at the price of some additional variables, computed in the feed-forward stage and propagated forward in time.
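A sketch of the idea as we understand BPS [4], restricted to a single neuron with an output-feedback self-loop (the only recurrent connections in the RTDNN); names and details are our assumptions. The derivative of the activation with respect to the feedback weight obeys its own forward recurrence, so one extra variable per neuron replaces unfolding back through time.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward_with_bps(nets, w_fb):
    """Neuron with output feedback: a(t) = w_fb * f(a(t-1)) + net(t).

    Alongside a(t) we carry g(t) = da(t)/dw_fb, which obeys
        g(t) = f(a(t-1)) + w_fb * f'(a(t-1)) * g(t-1),   g(0) = 0,
    so the gradient information flows forward and no unfolding is needed.
    """
    a, g = 0.0, 0.0
    history = []
    for net in nets:                       # net(t) = sum_j w_ij x_j(t) + b_i
        g = sigmoid(a) + w_fb * d_sigmoid(a) * g
        a = w_fb * sigmoid(a) + net
        history.append((a, g))
    return history

# During back-propagation, dE/dw_fb accumulates dE/da(t) * g(t) over time.
print(forward_with_bps([0.3, -0.1, 0.7], w_fb=2.0))
```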
The bound on the feedback weights, $-4 < w_{ii} < +4$ (necessary to ensure the stability of the model), has been implemented by introducing control variables $k_{ii}$ and defining:

$$w_{ii} = \rho \tanh(k_{ii}), \qquad \text{with } 0 \le \rho \le 4 \qquad (8)$$

The $k_{ii}$ variables, which are unbounded, are therefore the ones modified by the learning algorithm. The $\rho$ parameter introduced in (8) allows the feedback amplitude to be varied continuously within the range $0 \le \rho \le 4$.

By putting $\rho = 0$, the described algorithm becomes the usual Back-Propagation algorithm and the RTDNN net is converted into Waibel's TDNN.
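In code, the constraint (8) is a reparameterization: learning updates the unbounded $k_{ii}$, and the chain rule supplies the extra factor coming from the tanh squashing (a minimal sketch, with our own names):

```python
import numpy as np

rho = 2.8                                 # feedback amplitude, 0 <= rho <= 4

def feedback_weight(k):
    return rho * np.tanh(k)               # eq. (8): |w_ii| always below rho

def grad_wrt_k(dE_dw, k):
    # chain rule: dE/dk = dE/dw * dw/dk = dE/dw * rho * (1 - tanh(k)^2)
    return dE_dw * rho * (1.0 - np.tanh(k) ** 2)

print(feedback_weight(np.array([-2.0, 0.0, 2.0])))   # stays in (-rho, +rho)
# rho = 0 removes the feedback entirely: the net degenerates to Waibel's TDNN.
```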
EXPERIMENT SET-UP
The aim of the experiments was to compare the performance of the new model with that of TDNNs. It was also of interest to measure, for the Italian language, the classification capabilities of both models. To make this comparison we decided to treat the classification of plosive phonemes in our experiment with RTDNNs, since experiments with these phonemes are accurately described in other papers about TDNNs. In particular, we chose the unvoiced plosive phonemes (/p/, /t/, /k/), which are easier to segment.
The network we used in our experiment is the one depicted
in Fig. 2, with three output nodes and feedback connections.
Database
In order to prepare the training and test database, we chose 105 Italian words, among the 10,000 most common ones, which contain the plosives in both initial and intermediate position, trying to represent the different phonetic contexts as much as possible.

In order to increase the representativeness of the database, we prepared another list containing the same words preceded by an article, a verb or a preposition. This second list aims at introducing inter-word coarticulation events in the phonemes, besides intra-word ones.

The 210 words were repeated three times by a non-professional adult male speaker, at intervals of about one month between repetitions. The final numbers of tokens were as follows: 254 /p/, 288 /t/ and 250 /k/, from which we used an evenly distributed number of tokens: 250 per class.
The words were recorded in a quiet laboratory environment with an average S/N ratio of 38 dB. Digital signal conversion was carried out in agreement with the standard defined in the European project SAM (Speech Assessment Methodologies), using the OROS card on a PC and the acquisition program EUROPEC. The signal was sampled at 16 bits with a sampling frequency of 20 kHz.

The 792 plosive tokens were hand-segmented to center the 150 ms input window over the voiced onset which follows the burst. A 512-point FFT was computed every 5 ms inside the window, from which we obtained 16 Bark-scale coefficients over a 46 Hz - 6 kHz range. Coefficients adjacent in time were averaged, yielding an overall rate of one frame every 10 ms; the coefficients were then normalized between -1 and +1 with the average at 0.0.
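A rough NumPy reconstruction of this front end (our illustration). The paper specifies the FFT size, the 5 ms analysis step, the 16 Bark-scale bands over 46 Hz - 6 kHz, the pairwise time averaging and the normalization; the analysis window type, the Bark formula and the rectangular filter shapes below are our assumptions.

```python
import numpy as np

FS = 20000                      # sampling frequency, Hz
NFFT = 512                      # FFT length
HOP = int(0.005 * FS)           # one FFT every 5 ms -> 100 samples

def bark(f):
    # Traunmüller's approximation of the Bark scale (our choice of formula)
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_filterbank(n_bands=16, f_lo=46.0, f_hi=6000.0):
    """Rectangular filters equally spaced on the Bark scale (an assumption)."""
    freqs = np.fft.rfftfreq(NFFT, d=1.0 / FS)
    edges = np.linspace(bark(f_lo), bark(f_hi), n_bands + 1)
    band = np.digitize(bark(np.clip(freqs, f_lo, f_hi)), edges) - 1
    fb = np.zeros((n_bands, len(freqs)))
    for i in range(n_bands):
        fb[i, band == i] = 1.0
    fb[:, (freqs < f_lo) | (freqs > f_hi)] = 0.0   # keep only 46 Hz - 6 kHz
    return fb

def preprocess(signal):
    fb = bark_filterbank()
    frames = []
    for k in range((len(signal) - NFFT) // HOP + 1):
        seg = signal[k * HOP : k * HOP + NFFT] * np.hanning(NFFT)
        frames.append(np.log(fb @ np.abs(np.fft.rfft(seg)) + 1e-6))
    coeffs = np.asarray(frames)
    coeffs = coeffs[: len(coeffs) // 2 * 2]
    coeffs = 0.5 * (coeffs[0::2] + coeffs[1::2])   # average adjacent frames:
    coeffs -= coeffs.mean()                        # one frame every 10 ms
    return coeffs / np.abs(coeffs).max()           # in [-1, +1], average 0.0
```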
Network simulation
In order to reduce computation time, we used an FPS M64/35 array processor, with 12 MFLOPS peak performance [6]. The measured speed of the implemented simulation algorithm (for the learning stage) was 0.79 MCPS (millions of connections modified per second).

Learning was stopped when the total error on the complete set of patterns fell below a threshold of 0.1. Reaching this condition required an average of 100 epochs.
EXPERIMENTAL RESULTS

Comparison between RTDNN and TDNN
In the classification experiment, the whole available set of 250 tokens per phoneme was split in half, using one half for learning and the other half for the test.
Each experiment was run with initial weights chosen randomly in the [-1,+1] range and, with those initial weights fixed, the range of allowed Beta values (the amplitude of feedback) was explored between 0.0 and 4.0 with step 0.2.
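In schematic form, the protocol amounts to the following loop; train, evaluate and n_weights are placeholders for the BPS training procedure, the test-set scoring and the network size, not the authors' code.

```python
import numpy as np

def train(init_weights, rho):
    """Placeholder: BPS training of the RTDNN at feedback amplitude rho."""
    raise NotImplementedError

def evaluate(net):
    """Placeholder: error rate of the trained net on the test half."""
    raise NotImplementedError

betas = np.arange(0.0, 4.0 + 1e-9, 0.2)    # Beta grid: 0.0, 0.2, ..., 4.0
n_runs, n_weights = 10, 10_000              # 10 random initializations
errors = np.zeros((n_runs, len(betas)))

for run in range(n_runs):
    rng = np.random.default_rng(run)
    init = rng.uniform(-1.0, 1.0, size=n_weights)      # fixed within a run
    for j, beta in enumerate(betas):
        errors[run, j] = evaluate(train(init, rho=beta))   # beta = 0: TDNN

mean_error = errors.mean(axis=0)            # the averaged curve of Fig. 3
```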
Fig. 3 shows the results of 10 experiments of this type, each one obtained with a different random set of initial weights. At Beta = 0 (the TDNN case), we obtained an average error rate of 5%, a value comparable to the one obtained by Waibel (2.3%) with a different database [3].

The RTDNN net (corresponding to higher Beta values) instead shows, in the error rate trend, a considerable reduction with respect to the TDNN case. The improvement is particularly marked for Beta values above 2.

Differences in the error rate trend across experiments, depending on the choice of initial weights, led us to carry out a statistical test in order to confirm real systematic variations.
Fig. 3: Averaged error rate vs. BETA (feedback amplitude). The vertical segments are confidence intervals (±1σ).
A Student's t-test for Beta values above 2 gives the following result: the hypothesis that the mean error is significantly smaller than in the TDNN case is accepted with a false-rejection percentage of 5%.

The statistical test thus confirms the significance of the improvements, particularly for Beta = 2.8. At this value, the final experimental result shows a relative error rate decrease of 27% with respect to the original TDNN model.
Shift influence

In order to test the time-invariance feature of TDNNs, and to see whether it also holds for RTDNNs, we carried out a second experiment.

In this case the 15 spectral vectors that form the learning and test patterns were randomly shifted by a maximum amount of ±40 ms. In order to achieve a significant comparison with the "no shift" case, we used for learning and test the same patterns, in the same order, as in the previous experiment.
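A sketch of the shift operation: each 15-frame pattern is displaced by up to ±4 frames (±40 ms at one frame per 10 ms). Zero-padding at the edges is our assumption; in the experiment the window was presumably re-extracted from the surrounding signal.

```python
import numpy as np

def random_shift(pattern, rng, max_shift=4):
    """Shift a (15, 16) pattern by up to +/-4 frames (+/-40 ms)."""
    s = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.zeros_like(pattern)
    if s >= 0:
        shifted[s:] = pattern[:len(pattern) - s]   # content moves later
    else:
        shifted[:s] = pattern[-s:]                 # content moves earlier
    return shifted

pattern = np.zeros((15, 16)); pattern[7] = 1.0     # marker in the middle frame
print(np.argmax(random_shift(pattern, np.random.default_rng(0)).sum(axis=1)))
```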
Fig. 4: Error rate vs. BETA obtained with 40 ms of maximum temporal shift.
Fig. 4 shows the error rate trend versus Beta, obtained in a single experiment. In this case the values have been approximated with a parabola obtained with the least-squares method.
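The fit is an ordinary least-squares polynomial fit; a sketch with illustrative values (not the paper's measurements), whose vertex plays the role of the suggested optimal Beta:

```python
import numpy as np

betas = np.arange(0.0, 4.0 + 1e-9, 0.2)
# Illustrative error rates only, shaped like Fig. 4 -- not measured data.
err = 0.4 * (betas - 2.0) ** 2 + 2.5

coeffs = np.polyfit(betas, err, deg=2)     # least-squares parabola
beta_opt = -coeffs[1] / (2 * coeffs[0])    # vertex of the fitted parabola
print(round(beta_opt, 2))                  # ~2.0, consistent with the text
```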
The experiment confirms the time-invariance property of TDNNs (deducible in Fig. 4 from the value corresponding to Beta = 0) and shows that, also with shifted patterns, RTDNNs retain their advantage over TDNNs, suggesting an optimal Beta value near 2 for error reduction.
CONCLUSIONS
This paper proposed a neural network model based on the introduction of feedback in the TDNN structure. Experiments on the classification of speech segments containing plosive phonemes show that the RTDNN model improves the recognition rate, confirming the superior ability of recurrent networks to handle the sequential nature of speech.

Future work will be devoted to extending the test to the whole corpus of Italian phonemes and to assessing the effect of spontaneous speech.
ACKNOWLEDGMENT
The authors would like to thank Dr. Alex Waibel for the encouragement provided to submit this paper to ICASSP, and Mr. Berardo Savetione for labeling the database.
References
[1] Bourlard, H., Wellekens, C. "Speech pattern discrimination and multilayer perceptrons", Computer Speech and Language, Vol. 3, pp. 1-19, 1989
[2] Elman, J.L., Zipser, D. "Learning the Hidden Structure of Speech", ICS Report 8701, Institute for Cognitive Science, University of California, San Diego, CA, 1987
[3] Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K. "Phoneme Recognition Using Time-Delay Neural Networks", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 37, No. 3, March 1989
[4] Gori, M., Bengio, Y., De Mori, R. "BPS: A Learning Algorithm for Capturing the Dynamic Nature of Speech", Proceedings of the IEEE-IJCNN 89, Washington, 1989
[5] Kuhn, G., Watrous, R.L., Ladendorf, B. "Connected Recognition with a Recurrent Network", Speech Communication, Vol. 9, pp. 41-48, 1990
[6] Corana, A., Rolando, C., Ridella, S. "Neural Network Simulation with High Performances on FPS M64 Series Minisupercomputers", FPS Users European Conference, Stratford-upon-Avon, 25-26 April 1989
[7] Bourlard, H., Wellekens, C.J. "Speech Dynamics and Recurrent Neural Networks", ICASSP 1989, Vol. 1
[8] Greco, F., Ravaioli, G. "An Experiment on Phoneme Classification Through a Time-Delay Neural Network", Proc. of the 3rd Workshop Italiano su Architetture Parallele e Reti Neuronali, Vietri sul Mare, Salerno, Italy, 15-19 May 1990, World Scientific Publishing
[9] Greco, F. "Realizzazione di un modello neuronale per la decodifica acustico-fonetica del parlato continuo" (A neural model for the acoustic-phonetic decoding of continuous speech), Graduation Thesis in Physics, University of Rome, November 1990