Combining Time- and Frequency-domain Convolution in Convolutional Neural Network-based Phone Recognition
Laszlo Toth
ppt: Jaesung Bae (bjsd3@kaist.ac.kr)
Abstract
• Most authors
→ Focus on convolution along the frequency axis
• To gain invariance to speaker and speaking style.
→ Others use convolution along the time axis
• To cover a longer time span of input in a hierarchical manner.
• This paper
→ Combines the two types of network.
• 16.7% phone error rate on the TIMIT phone recognition task.
• A new record.
1. Introduction
• CNN
→ Processes the input in small localized parts, looking for the presence of relevant local features.
→ By pooling
• The features are made more translation tolerant.
• Effect
→ Convolution along the frequency axis
• Works well: gives invariance to speaker and speaking style.
→ Convolution along the time axis
• Not that effective; too much convolution along time is harmful.
• Negligible benefit reported (Abdel-Hamid et al. and Sainath et al.).
• Still, some teams have applied it successfully.
• During pooling, the position information is not discarded.
• So the convolution here is not for shift invariance, but for allowing the model to process a longer time span.
2. Convolutional Neural Networks
• Differences from a standard ANN
→ 1. Locality
• Trained on a time-frequency representation instead of MFCC features.
→ 2. Weight sharing.
• 2.1. Convolution along the frequency axis
→ Input: 40 mel filterbank channels plus the frame-level energy, along with the corresponding delta and delta-delta parameters.
• (The reference feature set used in other works.)
→ Max pooling is used.
→ The number of filters used to cover the whole frequency range of the 40 mel filterbank channels is varied.
→ Full or limited weight sharing is possible.
• Here, limited weight sharing is used.
→ 2 convolutional layers and 4 fully connected layers. (A sketch of the frequency convolution follows below.)
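A minimal sketch of the frequency-domain convolution idea (my own PyTorch illustration, not code from the paper): the static/delta/delta-delta streams are treated as input channels, the kernel spans the full 15-frame context so the convolution only slides along the mel axis, and max pooling is applied along frequency. The filter count of 200 is an arbitrary placeholder; the filter width of 7 and pooling size of 3 come from the experiments later in the slides, and this sketch uses full weight sharing rather than the limited weight sharing the paper actually adopts.

```python
import torch
import torch.nn as nn

N_MEL, N_FRAMES, N_STREAMS = 40, 15, 3       # mel channels, time context, (static, delta, delta-delta)
print("features per frame:", N_STREAMS * (N_MEL + 1))   # 3 * (40 mel + 1 energy) = 123

freq_conv = nn.Sequential(
    # the kernel covers all 15 frames, so the convolution only slides along the frequency axis
    nn.Conv2d(in_channels=N_STREAMS, out_channels=200, kernel_size=(N_FRAMES, 7)),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=(1, 3)),        # pooling size 3 along the frequency axis
)

# a batch of 8 spectro-temporal patches (the energy channel is left out here for simplicity)
x = torch.randn(8, N_STREAMS, N_FRAMES, N_MEL)
print(freq_conv(x).shape)                    # torch.Size([8, 200, 1, 11])
```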
2. Convolutional Neural Networks
(Figure slide: the Fig. 1 network architectures from the paper.)
2. Convolutional Neural Networks
• 2.2. Convolution along the time axis.
→Motivated by hierarchical ANN models.
→ A network is first applied along the frequency axis, and a second network then convolves over its outputs along the time axis.
→ This paper's difference from applying only frequency-domain convolution
• The input blocks are processed by just one layer of neurons in one case, and by a sub-network of several layers in the other.
→ This paper's difference from applying only time-domain convolution
• Instead of pooling with size r, several filters are placed at different positions along time.
• The goal is not shift invariance, but rather to enable the model to hierarchically process a fairly wide range of input without increasing the number of weights (see the sketch below).
• Q: Does this amount to a pooling size of 1?
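A hedged sketch of this time-domain "convolution" (my own PyTorch illustration; the layer sizes are placeholders, not the paper's): the same sub-network is applied to several overlapping blocks of frames, and the upper layers see the concatenated block outputs, so no position information is pooled away.

```python
import torch
import torch.nn as nn

class TimeConvNet(nn.Module):
    def __init__(self, feats=123, block_frames=9, n_blocks=5, hop=5,
                 hidden=512, bottleneck=128, n_classes=61):
        super().__init__()
        self.block_frames, self.n_blocks, self.hop = block_frames, n_blocks, hop
        # sub-network shared by every input block (weight sharing along time)
        self.sub = nn.Sequential(
            nn.Linear(block_frames * feats, hidden), nn.ReLU(),
            nn.Linear(hidden, bottleneck), nn.ReLU(),
        )
        # upper part processes all block outputs together (no pooling over time)
        self.upper = nn.Sequential(
            nn.Linear(n_blocks * bottleneck, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):                      # x: (batch, total_frames, feats)
        outs = []
        for b in range(self.n_blocks):
            start = b * self.hop
            block = x[:, start:start + self.block_frames, :]
            outs.append(self.sub(block.flatten(1)))
        return self.upper(torch.cat(outs, dim=1))

net = TimeConvNet()
x = torch.randn(4, 29, 123)                    # 5 blocks of 9 frames with hop 5 -> 29 frames total
print(net(x).shape)                            # torch.Size([4, 61])
```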
Jaesung Bae(bjsd3@kaist.ac.kr)
2. Convolutional Neural Networks
• 2.3. Convolution along both time and frequency
→ The network shown in Fig. 1a is substituted for the sub-network of Fig. 1b.
→ First, the sub-network is trained; then its output layer is discarded, and the full network is constructed with randomly initialized weights in the upper layers.
→ Only the upper part is trained for 1 iteration.
→ Then the whole network is trained until convergence is reached (a toy sketch of this schedule is shown below).
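A toy sketch of this training schedule (my own illustration with random data and made-up layer sizes): pre-train the sub-network with its own output layer, discard that output layer, stack randomly initialized upper layers on the trained body, update only the upper layers for one iteration, then train everything jointly.

```python
import torch
import torch.nn as nn

sub_body = nn.Sequential(nn.Linear(123, 256), nn.ReLU(), nn.Linear(256, 64), nn.ReLU())
sub_out  = nn.Linear(64, 61)                                   # discarded after pre-training
upper    = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 61))

x, y = torch.randn(32, 123), torch.randint(0, 61, (32,))       # toy data
loss_fn = nn.CrossEntropyLoss()

def train(params, model, n_iters):
    opt = torch.optim.SGD(params, lr=1e-3)
    for _ in range(n_iters):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

# 1) pre-train the sub-network with its own output layer
train(list(sub_body.parameters()) + list(sub_out.parameters()),
      nn.Sequential(sub_body, sub_out), n_iters=10)

# 2) discard sub_out, stack randomly initialised upper layers on the trained body
full = nn.Sequential(sub_body, upper)

# 3) update only the upper part for one iteration ...
train(upper.parameters(), full, n_iters=1)

# 4) ... then train the whole network until convergence (a few iterations here)
train(full.parameters(), full, n_iters=10)
```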
3. Experimental Setting
• 10% of the training set was used as a validation set.
• Evaluation phase
→ The label outputs were mapped to the usual set of 39 labels.
• A bigram language model was used.
→ LM weight: 1.0, phone insertion penalty: 0.0.
• Trained by semi-batch back-propagation with a batch size of 100.
• Frame-level cross-entropy cost (not CTC loss).
• Learning rate: 0.001, halved after any iteration in which the validation loss does not decrease.
• Training was halted when the improvement in the error was smaller than 0.1% in two subsequent iterations (the schedule is sketched below).
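A small sketch (my own reading of the schedule above, with a made-up error sequence) of the learning-rate and stopping rules: start at 0.001, halve the rate after any iteration without improvement, and stop once the improvement stays below 0.1% for two iterations in a row.

```python
def train_schedule(val_errors, lr=0.001, min_gain=0.1):
    best, small_gain_count = float("inf"), 0
    for epoch, err in enumerate(val_errors, start=1):
        gain = best - err
        if gain <= 0:
            lr *= 0.5                      # no improvement -> halve the learning rate
        best = min(best, err)
        small_gain_count = small_gain_count + 1 if gain < min_gain else 0
        print(f"epoch {epoch:2d}  val error {err:5.2f}%  lr {lr:.6f}")
        if small_gain_count >= 2:          # improvement < 0.1% twice in a row -> stop
            break

train_schedule([24.0, 22.5, 21.9, 21.9, 21.85, 21.84])   # toy validation error curve
```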
4. Result and Discussion
• Baseline model: a fully connected network with 4 hidden layers of 2000 ReLUs each (the reference model).
• 4.1. Convolution along time.
→ The architecture of Fig. 1b was investigated in the author's earlier study.
→ 5 input blocks, each covering 9 frames of input context, with an overlap of 4 frames between neighboring blocks (checked in the arithmetic below).
→ Sub-network: 3 layers of 2000 neurons, plus a bottleneck layer of 400 neurons.
→ Upper part of the network: a hidden layer of 2000 neurons.
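A quick sanity check of these numbers (my own arithmetic, not from the paper): with 5 blocks of 9 frames and a 4-frame overlap, neighboring block start positions are 9 - 4 = 5 frames apart, so the blocks together span 29 frames of context.

```python
n_blocks, block_frames, overlap = 5, 9, 4
hop = block_frames - overlap                     # 5-frame hop between block start positions
starts = [b * hop for b in range(n_blocks)]
print(starts)                                    # [0, 5, 10, 15, 20]
print("total context:", starts[-1] + block_frames, "frames")   # 29 frames
```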
4. Result and Discussion
• 4.2. Convolution along frequency.
→ First, an attempt to find the optimal number of convolutional filters.
• Number of filters: 4-8.
• Neighboring filters overlapped by 2-3 mel channels.
→ The filter width along time was set to 15 frames.
• To keep the input context the same as the baseline model.
→ The pooling size was 3.
→ Based on the experiments, 7 filters of width 7 are used (band placement sketched below).
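A tiny helper (my own illustration) showing how a fixed number of filters of a given width can tile the 40 mel channels with overlapping bands. With the final setting of 7 filters of width 7, the bands necessarily overlap, since 7 x 7 = 49 > 40; the even spacing below is just one possible placement, not necessarily the paper's.

```python
def band_starts(n_channels=40, n_filters=7, width=7):
    """Spread band start positions evenly over the usable range [0, n_channels - width]."""
    span = n_channels - width
    return [i * span // (n_filters - 1) for i in range(n_filters)]

starts = band_starts()
print(starts)
for a, b in zip(starts, starts[1:]):
    print(f"band [{a},{a + 7}) overlaps the next band [{b},{b + 7}) by {a + 7 - b} channels")
```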
4. Result and Discussion
• 4.2. Convolution along frequency.
→ Second, find the optimal pooling size.
• A pooling size of 5 gave the best result.
• It might also be possible to use several different pooling sizes in the same model.
• 4.3. Convolution along time and frequency.
→ The same dropout parameters were used for each layer.
→ Dropout rate: 0.25.
• Conclusion
→ 16.7% phone error rate on the TIMIT dataset.
→ Further modifications and experiments would be worth trying.
