1. Deep Learning – Speaker
Recognition for Security & IoT
Sai Kiran Kadam (SK)
Description: Investigation of DNNs/DBNs for Noise-Robust
Speech Emotion Recognition
2. Investigation of Full-Sequence Training of Deep Belief Networks
for Speech Recognition
Abdel-rahman Mohamed, Dong Yu, Li Deng
• Investigates approaches to jointly optimize DBN weights, state-to-state transition parameters, and the language model
using sequential discriminative training
• A DBN is a densely connected, highly nonlinear feature extractor; each hidden layer learns
features that capture higher-order correlations in the original input data.
• Technique: 3-layer and 6-layer DBNs compared under sequence-based and frame-based training.
• Covers RBM, DBN, and conditional-probability concepts; experiments performed on the TIMIT corpus.
• Training set: 462 speakers, with a separate set of 50 speakers for model tuning
• Test set: 192 sentences with 7,333 tokens. Speech was analyzed using a 25-ms Hamming window
at a 10-ms fixed frame rate (12th-order MFCCs).
• Tools: HTK with 183 target class labels (61 phones × 3 states). After decoding, the 61 classes were mapped to
a standard set of 39 classes for scoring with the HResults tool.
• Results:
• Takeaway: use sequence-based training with RNNs/CNNs or GCNs in my thesis
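The 25-ms Hamming window / 10-ms hop analysis described above can be sketched in plain Python. This is a minimal framing pass only, not the paper's pipeline: the mel filterbank and DCT stages that produce the 12th-order MFCCs are omitted, and the 16 kHz sample rate is an assumption consistent with TIMIT.

```python
import math

def frame_signal(signal, sample_rate=16000, win_ms=25, hop_ms=10):
    """Split a waveform into overlapping frames and apply a Hamming window,
    matching the 25-ms window / 10-ms hop analysis used in the paper."""
    win = int(sample_rate * win_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)   # 160 samples at 16 kHz
    # Hamming window: 0.54 - 0.46*cos(2*pi*n/(N-1))
    hamming = [0.54 - 0.46 * math.cos(2 * math.pi * n / (win - 1))
               for n in range(win)]
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frames.append([s * w for s, w in zip(signal[start:start + win], hamming)])
    return frames

# One second of a 440 Hz tone at 16 kHz (synthetic test signal)
sig = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
frames = frame_signal(sig)
print(len(frames), len(frames[0]))  # 98 frames of 400 samples each
```

Each windowed frame would then be passed to an FFT/filterbank stage to produce the MFCC vectors that feed the DBN.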
3. A NETWORK OF DEEP NEURAL NETWORKS FOR DISTANT SPEECH
RECOGNITION
Mirco Ravanelli, Philemon Brakel, Maurizio Omologo, Yoshua Bengio
• Proposes a novel architecture for distant speech recognition (DSR) based on a network of DNNs
• Limitation of state-of-the-art systems: lack of robustness and no cooperation between speech
enhancement (SE) and speech recognition (SR)
• Here, all SE and SR components are jointly trained to mitigate this mismatch
• Technique: network-of-DNNs architecture, jointly trained with the back-propagation algorithm
• Tools: Theano; Kaldi toolkit (s5 recipe, C++)
• Experiments: conducted on TIMIT (phoneme recognition) and the DIRHA English WSJ dataset
• Training set – contaminated with impulse responses, with a reverberation time of 0.7 s
• Test set – DIRHA set of 409 WSJ sentences uttered by six American speakers
• System features: 39 MFCCs computed every 10 ms with a 25-ms frame length
• Results:
Try to implement joint training with back-propagation
Recurrent architectures (RNNs) may outperform the feed-forward DNNs used here
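The joint SE/SR training idea can be illustrated with a toy two-module chain trained end-to-end by back-propagation. Everything below is a made-up minimal sketch, not the paper's architecture: a 4-dimensional feature vector, an elementwise linear "enhancement" front-end, and a logistic "recognition" back-end, with the recognition loss back-propagated through both modules.

```python
import math
import random

random.seed(0)
dim = 4
w_se = [random.uniform(-0.5, 0.5) for _ in range(dim)]  # enhancement weights (elementwise)
w_sr = [random.uniform(-0.5, 0.5) for _ in range(dim)]  # recognition weights

x = [0.9, -0.3, 0.4, 0.1]   # noisy feature vector (illustrative)
y = 1.0                      # target class
lr = 0.5                     # learning rate

def forward(x):
    h = [wi * xi for wi, xi in zip(w_se, x)]        # "enhanced" features
    z = sum(wi * hi for wi, hi in zip(w_sr, h))     # recognition logit
    p = 1.0 / (1.0 + math.exp(-z))                  # sigmoid output
    return h, p

losses = []
for _ in range(50):
    h, p = forward(x)
    losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    dz = p - y                                      # dLoss/dlogit for sigmoid + log-loss
    for i in range(dim):
        # Chain rule: the recognition weight sees h[i]; the enhancement
        # weight sees the gradient routed back through w_sr[i].
        g_sr = dz * h[i]
        g_se = dz * w_sr[i] * x[i]
        w_sr[i] -= lr * g_sr
        w_se[i] -= lr * g_se

print(losses[0] > losses[-1])  # joint loss decreases across both modules
```

The point of the sketch is that the front-end is updated by gradients from the back-end's loss, so the two stages cooperate rather than being optimized in isolation.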
4. AN INVESTIGATION OF DEEP NEURAL NETWORKS FOR NOISE
ROBUST SPEECH RECOGNITION
Michael L. Seltzer, Dong Yu, Yongqiang Wang
• Evaluates the noise robustness of DNN-based acoustic models
• Techniques: DNN noise-aware training and DNN dropout training
• Dataset: Aurora 4, a noisy speech recognition task based on the WSJ0 corpus
• Training set: 7,137 utterances from 83 speakers, recorded at 16 kHz for multi-condition training
• Evaluation/test set: the WSJ0 5K-word task, 330 utterances from 8 speakers
• Test set recorded using both primary and secondary microphones
• These two sets are corrupted by the same six noises used in training at 5–15 dB SNR, for 14 test sets in total
• Results: performance reported as a function of the number of senones and hidden layers
• Takeaway: the dropout technique can be used for speaker recognition; data augmentation may work better than dropout.
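As a sketch of the dropout technique mentioned above: inverted dropout zeroes random units during training and rescales the survivors by 1/(1 − p_drop) so the expected activation is unchanged, while at test time the layer is a pass-through. The activation values and drop rate below are illustrative, not from the paper.

```python
import random

def dropout(activations, p_drop=0.5, train=True):
    """Inverted dropout: randomly zero units during training and rescale
    the survivors by 1/(1 - p_drop); identity at inference time."""
    if not train:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(1)
h = [0.2, -0.5, 1.0, 0.7, 0.3, -0.1, 0.9, 0.4]  # hypothetical hidden-layer activations
print(dropout(h, p_drop=0.5))        # roughly half the units zeroed, survivors doubled
print(dropout(h, train=False) == h)  # inference pass-through: True
```

Because each kept unit is rescaled during training, no compensation is needed at test time, which is what makes the trained network usable as-is for recognition.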
5. Related Research Papers
• L. Deng, M. Seltzer, D. Yu, A. Acero, A. Mohamed, and G. Hinton, “Binary coding of speech spectrograms using a deep
auto-encoder,” in Proc. Interspeech, 2010.
• P. Schwarz, P. Matějka, and J. Černocký, “Hierarchical structures of neural networks for phoneme recognition,” in Proc. ICASSP,
2006, pp. 325–328.
• J. Bilmes and C. Bartels, “Graphical model architectures for speech recognition,” IEEE Signal Processing Magazine, vol. 22,
pp. 89–100, Sept. 2005.
• T. Gao, J. Du, L. R. Dai, and C. H. Lee, “Joint training of front-end and back-end deep neural networks for robust speech
recognition,” in Proc. ICASSP, 2015, pp. 4375–4379.
• A. Ragni and M. J. F. Gales, “Derivative kernels for noise robust ASR,” in Proc. IEEE Workshop on Automatic Speech Recognition
and Understanding, 2011.
• F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Proc.
Interspeech, 2011.
• Y. Xu, J. Du, L. R. Dai, and C. H. Lee, “A regression approach to speech enhancement based on deep neural networks,”
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, Jan. 2015.
• D. Yu, L. Deng, J. Droppo, J. Wu, Y. Gong, and A. Acero, “A minimum-mean-square-error noise reduction algorithm on
mel-frequency cepstra for robust speech recognition,” in Proc. ICASSP, Las Vegas, NV, 2008.