Deep Learning – Speaker Recognition for Security & IoT
Sai Kiran Kadam (SK)
Description: an investigation of DNNs/DBNs for noise-robust speech and emotion recognition
Investigation of Full-Sequence Training of Deep Belief Networks for Speech Recognition
Abdel-rahman Mohamed, Dong Yu, Li Deng
• Investigates approaches to jointly optimize DBN weights, state-to-state transition parameters, and language model scores using sequence-level discriminative training
• A DBN is a densely connected, highly nonlinear feature extractor; each hidden layer learns features that capture higher-order correlations in the original input data.
• Technique: 3-layer and 6-layer DBNs, compared under sequence-based and frame-based training (an RBM pretraining sketch follows this list)
• Covers RBM, DBN, and conditional-probability concepts; experiments performed on the TIMIT corpus
• Training set: 462 speakers, with a separate 50-speaker set for model tuning
• Test set: 192 sentences with 7,333 tokens. Speech was analyzed using a 25-ms Hamming window at a 10-ms frame rate (12th-order MFCCs).
• Tools: HTK, with 183 target class labels (61 phones × 3 states). After decoding, the 61 classes were mapped to a standard set of 39 classes and scored with the HResults tool.
• Results:
• Takeaway: use sequence-based training with RNNs/CNNs or GCNs in my thesis
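The DBN in this paper is built by stacking RBMs trained layer by layer before any frame- or sequence-level discriminative fine-tuning. Below is a minimal NumPy sketch of one contrastive-divergence (CD-1) update for a single binary RBM layer; the layer sizes, learning rate, and binary (rather than Gaussian-Bernoulli) visible units are simplifying assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_visible, n_hidden, lr = 39, 256, 0.01       # illustrative sizes, not the paper's
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_v = np.zeros(n_visible)                     # visible biases
b_h = np.zeros(n_hidden)                      # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0):
    """One CD-1 update on a batch of visible vectors of shape (batch, n_visible)."""
    global W, b_v, b_h
    # Positive phase: hidden probabilities and samples given the data.
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one Gibbs step down to the visible layer and back up.
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # Update: data-driven correlations minus model-driven correlations.
    batch = v0.shape[0]
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / batch
    b_v += lr * (v0 - p_v1).mean(axis=0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)

# One update on a random batch standing in for a batch of acoustic feature frames.
cd1_step(rng.random((32, n_visible)))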
A NETWORK OF DEEP NEURAL NETWORKS FOR DISTANT SPEECH RECOGNITION
Mirco Ravanelli, Philemon Brakel, Maurizio Omologo, Yoshua Bengio
• Proposes a novel architecture for distant speech recognition (DSR) based on a network of DNNs
• Limitations of the state of the art: lack of robustness, and no cooperation between speech enhancement and speech recognition
• Here, the speech-enhancement (SE) and speech-recognition (SR) components are trained jointly to mitigate the mismatch between them
• Technique: a DNN architecture jointly trained with the back-propagation algorithm (a joint-training sketch follows this list)
• Tools: Theano; the Kaldi toolkit (C++), s5 recipe
• Experiments: conducted on TIMIT (phoneme recognition) and the DIRHA-English WSJ dataset
• Training set: contaminated with impulse responses with a reverberation time of about 0.7 s
• Test set: DIRHA set of 409 WSJ sentences uttered by 6 American speakers
• System: features are 39 MFCCs computed every 10 ms with a 25-ms frame length
• Results:
Takeaways: try to implement joint training with back-propagation; RNNs may perform better than the feed-forward DNNs used here
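To make the joint-training idea concrete, here is a minimal PyTorch sketch (the paper itself used Theano and Kaldi) in which a speech-enhancement front-end feeds a phone classifier and a single backward pass updates both. The layer sizes, loss functions, and the 0.2 loss weighting are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

n_feat, n_phones = 39, 183   # 39 MFCCs in, 183 state targets out (assumed sizes)

# Enhancement front-end: maps noisy features toward clean features.
enhance = nn.Sequential(nn.Linear(n_feat, 512), nn.ReLU(), nn.Linear(512, n_feat))
# Recognition back-end: classifies the enhanced features.
recognize = nn.Sequential(nn.Linear(n_feat, 512), nn.ReLU(), nn.Linear(512, n_phones))

opt = torch.optim.Adam(list(enhance.parameters()) + list(recognize.parameters()), lr=1e-3)
mse, xent = nn.MSELoss(), nn.CrossEntropyLoss()

def joint_step(noisy, clean, labels):
    """One joint update: recognition loss plus weighted enhancement loss."""
    opt.zero_grad()
    enhanced = enhance(noisy)
    loss = xent(recognize(enhanced), labels) + 0.2 * mse(enhanced, clean)
    loss.backward()          # gradients flow through both components
    opt.step()
    return loss.item()

# Example call with random stand-in batches.
noisy = torch.randn(32, n_feat)
clean = torch.randn(32, n_feat)
labels = torch.randint(0, n_phones, (32,))
joint_step(noisy, clean, labels)
```

Because the two modules share one computation graph, the recognizer's gradient also shapes the enhancer, which is exactly the cooperation the paper argues is missing when the two systems are trained separately.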
AN INVESTIGATION OF DEEP NEURAL NETWORKS FOR NOISE
ROBUST SPEECH RECOGNITION
Michael L. Seltzer, Dong Yu, Yongqiang Wang
• Evaluates the noise robustness of DNN-based acoustic models
• Techniques: DNN noise-aware training and DNN dropout training (a sketch of both follows this list)
• Dataset: Aurora 4, a noise-robust speech recognition task based on the WSJ0 corpus
• Training set: 7,137 utterances from 83 speakers; 16-kHz multi-condition training
• Evaluation/test set: 330 utterances from 8 speakers, from the WSJ0 5K-word task
• Test data were recorded with both a primary and a secondary microphone
• Both sets were corrupted with the same 6 noise types used in training, at 5–15 dB SNR, giving 14 test sets in total
• Results: performance is reported as a function of the number of senones and hidden layers
• Takeaway: the dropout technique can be applied to speaker recognition, though data augmentation is better than dropout
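As a concrete illustration of the two techniques, the PyTorch sketch below appends a crude per-utterance noise estimate to every input frame, in the spirit of noise-aware training (the mean of the leading frames is one common simple estimator), and applies dropout between hidden layers. The sizes, dropout rate, and noise-estimation heuristic are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

n_feat, n_senones = 39, 1000   # illustrative feature and senone counts

def add_noise_estimate(utterance, n_lead=10):
    """utterance: (frames, n_feat). Append a fixed noise estimate to each frame."""
    # Assume the leading frames contain only noise and average them.
    noise_est = utterance[:n_lead].mean(dim=0, keepdim=True)
    return torch.cat([utterance, noise_est.expand(utterance.shape[0], -1)], dim=1)

# DNN acoustic model over [features; noise estimate], with dropout between layers.
model = nn.Sequential(
    nn.Linear(2 * n_feat, 1024), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(1024, n_senones),
)

frames = add_noise_estimate(torch.randn(200, n_feat))
logits = model(frames)   # dropout is active in training mode; model.eval() disables it
```

Feeding the network an explicit noise estimate lets it normalize its internal representation per condition, while dropout discourages co-adapted hidden units, both aimed at the train/test mismatch the paper studies.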
Related Research Papers
• L. Deng, M. Seltzer, D. Yu, A. Acero, A. Mohamed, and G. Hinton, “Binary coding of speech spectrograms using a deep auto-encoder,” in Proc. Interspeech, 2010.
• P. Schwarz, P. Matějka, and J. Černocký, “Hierarchical structures of neural networks for phoneme recognition,” in Proc. ICASSP, 2006, pp. 325–328.
• J. Bilmes and C. Bartels, “Graphical model architectures for speech recognition,” IEEE Signal Processing Magazine, vol. 22, pp. 89–100, Sept. 2005.
• T. Gao, J. Du, L. R. Dai, and C. H. Lee, “Joint training of front-end and back-end deep neural networks for robust speech recognition,” in Proc. ICASSP, 2015, pp. 4375–4379.
• A. Ragni and M. J. F. Gales, “Derivative kernels for noise robust ASR,” in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011.
• F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Proc. Interspeech, 2011.
• Y. Xu, J. Du, L. R. Dai, and C. H. Lee, “A regression approach to speech enhancement based on deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, Jan. 2015.
• D. Yu, L. Deng, J. Droppo, J. Wu, Y. Gong, and A. Acero, “A minimum-mean-square-error noise reduction algorithm on mel-frequency cepstra for robust speech recognition,” in Proc. ICASSP, Las Vegas, NV, 2008.
