Combining Time- and Frequency-domain Convolution in Convolutional Neural Network-based Phone Recognition
Laszlo Toth
ppt: Jaesung Bae (bjsd3@kaist.ac.kr)
Abstract
• Most authors
→ Focus on convolution along the frequency axis
• To gain invariance to speaker and speaking style.
→ Others use convolution along the time axis
• To cover a longer time span of input in a hierarchical manner.
• This paper
→ Combines the two types of network.
• 16.7% phone error rate on the TIMIT phone recognition task.
• A new record.
1. Introduction
• CNN
→ Processes the input in small localized parts, looking for the presence of relevant local features.
→ By pooling
• The features are made more translation tolerant.
• Effect
→ Convolution along the frequency axis
• Works well: gives invariance to speaker and speaking style.
→ Convolution along the time axis
• Not that effective; too much convolution along time is harmful.
• Negligible benefit reported (Abdel-Hamid et al. and Sainath et al.).
• Still, some teams have applied it successfully.
• During pooling, the position information is not discarded.
• So the convolution here is not for shift invariance, but for allowing the model to process a longer time span.
2. Convolutional Neural Networks
• Differences from a standard ANN
→ 1. Locality
• Trained on a time-frequency representation instead of MFCC features.
→ 2. Weight sharing.
• 2.1. Convolution along the frequency axis
→ Input: 40 mel filterbank channels plus the frame-level energy, along with the corresponding delta and delta-delta parameters.
• (The reference feature set used in other works.)
→ Max pooling is used.
→ The number of filters used to cover the whole frequency range of the 40 mel filterbank channels is varied.
→ Full or limited weight sharing is possible.
• Here, limited weight sharing is used.
→ 2 convolutional layers and 4 fully connected layers. (A sketch of the frequency convolution follows below.)
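A minimal sketch of the frequency-domain convolution idea (my own PyTorch illustration, not code from the paper): the static/delta/delta-delta streams are treated as input channels, the kernel spans the full 15-frame context so the convolution only slides along the mel axis, and max pooling is applied along frequency. The filter count of 200 is an arbitrary placeholder; the filter width of 7 and pooling size of 3 come from the experiments later in the slides, and this sketch uses full weight sharing rather than the limited weight sharing the paper actually adopts.

```python
import torch
import torch.nn as nn

N_MEL, N_FRAMES, N_STREAMS = 40, 15, 3       # mel channels, time context, (static, delta, delta-delta)
print("features per frame:", N_STREAMS * (N_MEL + 1))   # 3 * (40 mel + 1 energy) = 123

freq_conv = nn.Sequential(
    # the kernel covers all 15 frames, so the convolution only slides along the frequency axis
    nn.Conv2d(in_channels=N_STREAMS, out_channels=200, kernel_size=(N_FRAMES, 7)),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=(1, 3)),        # pooling size 3 along the frequency axis
)

# a batch of 8 spectro-temporal patches (the energy channel is left out here for simplicity)
x = torch.randn(8, N_STREAMS, N_FRAMES, N_MEL)
print(freq_conv(x).shape)                    # torch.Size([8, 200, 1, 11])
```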
2. Convolutional Neural Networks
(Figure slide: the Fig. 1 network architectures from the paper.)
2. Convolutional Neural Networks
• 2.2. Convolution along the time axis.
→Motivated by hierarchical ANN models.
→ A network is first applied along the frequency axis, and a second network then convolves over its outputs along the time axis.
→ This paper's difference from applying only frequency-domain convolution
• The input blocks are processed by just one layer of neurons in one case, and by a sub-network of several layers in the other.
→ This paper's difference from applying only time-domain convolution
• Instead of pooling with size r, several filters are placed at different positions along time.
• The goal is not shift invariance, but rather to enable the model to hierarchically process a fairly wide range of input without increasing the number of weights (see the sketch below).
• Q: Does this amount to a pooling size of 1?
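A hedged sketch of this time-domain "convolution" (my own PyTorch illustration; the layer sizes are placeholders, not the paper's): the same sub-network is applied to several overlapping blocks of frames, and the upper layers see the concatenated block outputs, so no position information is pooled away.

```python
import torch
import torch.nn as nn

class TimeConvNet(nn.Module):
    def __init__(self, feats=123, block_frames=9, n_blocks=5, hop=5,
                 hidden=512, bottleneck=128, n_classes=61):
        super().__init__()
        self.block_frames, self.n_blocks, self.hop = block_frames, n_blocks, hop
        # sub-network shared by every input block (weight sharing along time)
        self.sub = nn.Sequential(
            nn.Linear(block_frames * feats, hidden), nn.ReLU(),
            nn.Linear(hidden, bottleneck), nn.ReLU(),
        )
        # upper part processes all block outputs together (no pooling over time)
        self.upper = nn.Sequential(
            nn.Linear(n_blocks * bottleneck, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):                      # x: (batch, total_frames, feats)
        outs = []
        for b in range(self.n_blocks):
            start = b * self.hop
            block = x[:, start:start + self.block_frames, :]
            outs.append(self.sub(block.flatten(1)))
        return self.upper(torch.cat(outs, dim=1))

net = TimeConvNet()
x = torch.randn(4, 29, 123)                    # 5 blocks of 9 frames with hop 5 -> 29 frames total
print(net(x).shape)                            # torch.Size([4, 61])
```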
Jaesung Bae(bjsd3@kaist.ac.kr)
2. Convolutional Neural Networks
• 2.3. Convolution along both time and frequency
→ The network shown in Fig. 1a is substituted for the sub-network of Fig. 1b.
→ First, the sub-network is trained; then its output layer is discarded, and the full network is constructed with randomly initialized weights in the upper layers.
→ Only the upper part is trained for 1 iteration.
→ Then the whole network is trained until convergence is reached (a toy sketch of this schedule is shown below).
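A toy sketch of this training schedule (my own illustration with random data and made-up layer sizes): pre-train the sub-network with its own output layer, discard that output layer, stack randomly initialized upper layers on the trained body, update only the upper layers for one iteration, then train everything jointly.

```python
import torch
import torch.nn as nn

sub_body = nn.Sequential(nn.Linear(123, 256), nn.ReLU(), nn.Linear(256, 64), nn.ReLU())
sub_out  = nn.Linear(64, 61)                                   # discarded after pre-training
upper    = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 61))

x, y = torch.randn(32, 123), torch.randint(0, 61, (32,))       # toy data
loss_fn = nn.CrossEntropyLoss()

def train(params, model, n_iters):
    opt = torch.optim.SGD(params, lr=1e-3)
    for _ in range(n_iters):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

# 1) pre-train the sub-network with its own output layer
train(list(sub_body.parameters()) + list(sub_out.parameters()),
      nn.Sequential(sub_body, sub_out), n_iters=10)

# 2) discard sub_out, stack randomly initialised upper layers on the trained body
full = nn.Sequential(sub_body, upper)

# 3) update only the upper part for one iteration ...
train(upper.parameters(), full, n_iters=1)

# 4) ... then train the whole network until convergence (a few iterations here)
train(full.parameters(), full, n_iters=10)
```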
3. Experimental Setting
• 10% of the training set was used as a validation set.
• Evaluation phase
→ The label outputs were mapped to the usual set of 39 labels.
• A bigram language model was used.
→ LM weight: 1.0, phone insertion penalty: 0.0.
• Trained by semi-batch back-propagation with a batch size of 100.
• Frame-level cross-entropy cost (not CTC loss).
• Learning rate: 0.001, halved after any iteration in which the validation loss does not decrease.
• Training was halted when the improvement in the error was smaller than 0.1% in two subsequent iterations (the schedule is sketched below).
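A small sketch (my own reading of the schedule above, with a made-up error sequence) of the learning-rate and stopping rules: start at 0.001, halve the rate after any iteration without improvement, and stop once the improvement stays below 0.1% for two iterations in a row.

```python
def train_schedule(val_errors, lr=0.001, min_gain=0.1):
    best, small_gain_count = float("inf"), 0
    for epoch, err in enumerate(val_errors, start=1):
        gain = best - err
        if gain <= 0:
            lr *= 0.5                      # no improvement -> halve the learning rate
        best = min(best, err)
        small_gain_count = small_gain_count + 1 if gain < min_gain else 0
        print(f"epoch {epoch:2d}  val error {err:5.2f}%  lr {lr:.6f}")
        if small_gain_count >= 2:          # improvement < 0.1% twice in a row -> stop
            break

train_schedule([24.0, 22.5, 21.9, 21.9, 21.85, 21.84])   # toy validation error curve
```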
4. Result and Discussion
• Baseline model: a fully connected network with 4 hidden layers of 2000 ReLUs each (the reference model).
• 4.1. Convolution along time.
→ The architecture of Fig. 1b was investigated in the author's earlier study.
→ 5 input blocks, each covering 9 frames of input context, with an overlap of 4 frames between neighboring blocks (checked in the arithmetic below).
→ Sub-network: 3 layers of 2000 neurons, plus a bottleneck layer of 400 neurons.
→ Upper part of the network: a hidden layer of 2000 neurons.
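A quick sanity check of these numbers (my own arithmetic, not from the paper): with 5 blocks of 9 frames and a 4-frame overlap, neighboring block start positions are 9 - 4 = 5 frames apart, so the blocks together span 29 frames of context.

```python
n_blocks, block_frames, overlap = 5, 9, 4
hop = block_frames - overlap                     # 5-frame hop between block start positions
starts = [b * hop for b in range(n_blocks)]
print(starts)                                    # [0, 5, 10, 15, 20]
print("total context:", starts[-1] + block_frames, "frames")   # 29 frames
```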
4. Result and Discussion
• 4.2. Convolution along frequency.
→ First, an attempt to find the optimal number of convolutional filters.
• Number of filters: 4-8.
• Neighboring filters overlapped by 2-3 mel channels.
→ The filter width along time was set to 15 frames.
• To keep the input context the same as the baseline model.
→ The pooling size was 3.
→ Based on the experiments, 7 filters of width 7 are used (band placement sketched below).
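A tiny helper (my own illustration) showing how a fixed number of filters of a given width can tile the 40 mel channels with overlapping bands. With the final setting of 7 filters of width 7, the bands necessarily overlap, since 7 x 7 = 49 > 40; the even spacing below is just one possible placement, not necessarily the paper's.

```python
def band_starts(n_channels=40, n_filters=7, width=7):
    """Spread band start positions evenly over the usable range [0, n_channels - width]."""
    span = n_channels - width
    return [i * span // (n_filters - 1) for i in range(n_filters)]

starts = band_starts()
print(starts)
for a, b in zip(starts, starts[1:]):
    print(f"band [{a},{a + 7}) overlaps the next band [{b},{b + 7}) by {a + 7 - b} channels")
```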
4. Result and Discussion
• 4.2. Convolution along frequency.
→ Second, find the optimal pooling size.
• A pooling size of 5 gave the best result.
• It might also be possible to use several different pooling sizes in the same model.
• 4.3. Convolution along time and frequency.
→ The same dropout parameters were used for each layer.
→ Dropout rate: 0.25.
• Conclusion
→ 16.7% phone error rate on the TIMIT dataset.
→ Further modifications and experiments would be worth trying.
