Topic: Deep Learning – Speaker
Recognition for Security & IoT
01/03/2018
Sai Kiran Kadam (SK)
Description: Automatic Text-Independent Speaker
Recognition using DNNs/DBNs for Distant Noise
Robust Speech – Emotion Recognition
SPEAKER IDENTIFICATION & CLUSTERING USING
CONVOLUTIONAL NEURAL NETWORKS
Yanick Lukic, Carlo Vogt, Oliver Dürr, Thilo Stadelmann
• Speaker ID using a CNN; input to the CNN: spectrograms (cepstral analysis)
• Speaker clustering: telling who spoke without prior knowledge of identity
• Technique/Method: apply CNNs to spectrograms to learn speaker-specific features (see the sketch below)
• Libraries used: Python, with LIBROSA (to compute the input spectrograms) and LASAGNE (to build and train the CNN)
• Training: dataset of studio-quality recordings from 630 speakers (192 female, 438 male)
• Experiments & Results:
• Finding the optimal convolutional filter dimensions
• Speaker identification performance: 97.0% accuracy, corresponding to 19 of the 630 speakers misidentified
• Clustering performance: evaluated via misclassification rate
• Use: apply the clustering and convolutional architecture to my work
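A minimal sketch of this spectrogram-plus-CNN idea, assuming a log-power STFT input and an illustrative network (the paper builds its CNN in Lasagne; Keras is used here only for brevity, and the filter sizes, input shape, and 630-way softmax head are my assumptions):

```python
# Hedged sketch: spectrogram front-end (librosa, as in the paper) feeding
# a small CNN speaker classifier. All layer sizes are illustrative.
import librosa
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def log_spectrogram(path, sr=16000, n_fft=512, hop=160):
    y, _ = librosa.load(path, sr=sr)
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    return librosa.power_to_db(S ** 2)        # (257 bins, n_frames)

model = tf.keras.Sequential([
    layers.Input(shape=(257, 100, 1)),        # freq bins x frames x 1 channel
    layers.Conv2D(32, (4, 4), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (4, 4), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(630, activation='softmax'),  # one class per speaker
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```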
RECURRENT NEURAL NETWORKS FOR POLYPHONIC SOUND EVENT
DETECTION IN REAL LIFE RECORDINGS
Giambattista Parascandolo, Heikki Huttunen, Tuomas Virtanen
• Presents an approach to polyphonic sound event detection (SED) in real-life recordings
• Technique: bidirectional LSTM recurrent network (BLSTM-RNN); see the sketch below
• Training data: 103 recordings (each 10–30 min long), 1,133 minutes in total, from 10 real-life contexts, with 8–14 recordings per context
• Testing data: a database of 61 event classes from 10 different real-life contexts
• Results/How good it is: average F1 score of 65.5% on 1-second blocks and 64.7% on single frames, relative improvements over state-of-the-art methods of 6.8% and 15.1%, respectively
• Limitations: overfitting, since the dataset is small relative to the network (mitigate with data augmentation)
• Use: BLSTM-RNN with data augmentation for my thesis
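A hedged BLSTM-RNN sketch for polyphonic SED, assuming per-frame log-mel features and sigmoid (multi-label) frame outputs as the summary describes; the layer sizes and feature dimension are my assumptions:

```python
# Sketch: stacked BLSTM with per-frame sigmoid outputs, since several of
# the 61 event classes can be active at once in polyphonic audio.
import tensorflow as tf
from tensorflow.keras import layers

n_frames, n_feats, n_classes = 100, 40, 61    # e.g. 40 log-mel bands (assumed)

model = tf.keras.Sequential([
    layers.Input(shape=(n_frames, n_feats)),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    # Sigmoid rather than softmax: events may overlap within a frame.
    layers.TimeDistributed(layers.Dense(n_classes, activation='sigmoid')),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
```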
Using Deep Learning for Detecting Spoofing Attacks on Speech
Signals
Alan Godoy, Flávio Simões, José Augusto Stuchi, Marcus de Assis Angeloni, Mário Uliani, Ricardo Violato
• About the Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015), approached with deep neural networks
• Biometric spoofing: a direct attack on a biometric authentication system, perpetrated by presenting a fake/forged biometric sample
• Technique: a DNN used as both a classifier and a feature-extraction module (see the sketch below)
• Feature extraction: DNN-based MLP with a 2668-dimensional input feature vector
• Trained with the back-propagation algorithm and stochastic gradient descent (SGD) optimization
• How good it is: the MLP achieved EER < 0.5%, beating SVM-RBF and GMM baselines
• Limitation: an MLP is not as effective as a CNN or a BLSTM-RNN
• Tradeoffs: a BLSTM-RNN over an MLP avoids losing long-term information but increases EER
• Use: BLSTM-RNN with spoofing detection for security
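A minimal sketch of the MLP countermeasure described above, trained with back-propagation and SGD; the 2668-dimensional input comes from the summary, while the hidden sizes and single sigmoid output (genuine vs. spoofed) are my assumptions:

```python
# Hedged sketch: binary spoofing-detection MLP on a 2668-dim feature vector.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(2668,)),              # feature vector per utterance
    layers.Dense(512, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(1, activation='sigmoid'),    # genuine vs. spoofed
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss='binary_crossentropy', metrics=['accuracy'])
```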
END-TO-END ATTENTION BASED TEXT-DEPENDENT SPEAKER
VERIFICATION
Shi-Xiong Zhang, Zhuo Chen , Yong Zhao, Jinyu Li and Yifan Gong
• Presents an end-to-end system that uses CNNs to extract noise-robust, frame-level, speaker-discriminative features
• These features are combined by an attention mechanism (see the sketch below)
• The CNN and the attention model are jointly optimized with an end-to-end criterion
• Technique: CNN + end-to-end architecture
• Tools: Theano framework with the Keras package (Python)
• Testing: the system is evaluated on the Windows 10 “Hey Cortana” speaker verification task
• The end-to-end architecture has 3 phases:
• Training: 200k utterances from 10k speakers, each with 10–30 utterances
• Enrollment: 6 utterances of “Hey Cortana”
• Evaluation: 60k utterances from 3k target speakers and 3k impostors
• The attention mechanism with a DNN outperforms CNN and LSTM baselines
• Use: possibly apply the attention model + BLSTM-RNN to my work? Not sure yet.
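A minimal attention-pooling sketch: frame-level features (e.g. CNN outputs) are collapsed into one utterance vector via learned weights. This only illustrates the pooling idea; the paper's actual layers, dimensions, and scoring back-end are not reproduced here:

```python
# Hedged sketch: learned attention weights pool variable-length
# frame features into a fixed-size utterance embedding.
import tensorflow as tf
from tensorflow.keras import layers

frames = layers.Input(shape=(None, 128))      # (time, feature), any length
scores = layers.Dense(1)(frames)              # one relevance score per frame
weights = layers.Softmax(axis=1)(scores)      # normalize over time
pooled = layers.Lambda(
    lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([weights, frames])
model = tf.keras.Model(frames, pooled)        # outputs a (batch, 128) vector
```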
Deep Neural Network Embeddings for Text-Independent
Speaker Verification
David Snyder, Daniel Garcia-Romero, Daniel Povey, Sanjeev Khudanpur
• Investigates replacing i-vectors with embeddings from a feed-forward DNN for text-independent speaker verification
• i-vectors: low-dimensional representations of speech that capture speaker and channel characteristics
• A temporal pooling layer captures long-term speaker characteristics, letting the network be trained to discriminate between speakers from variable-length speech segments (see the sketch below)
• Tools: Kaldi speech recognition toolkit (nnet3 neural-network library) – useful for my work
• Training data: telephone speech, 65,000 recordings from 6,500 speakers
• Evaluation: on NIST SRE2010 and SRE2016
• Results:
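A tiny numpy illustration of the temporal (statistics) pooling idea: frame-level activations are reduced to their mean and standard deviation, so utterances of any length map to embeddings of the same size (the 512-dim frame features are an assumption):

```python
# Hedged sketch of statistics pooling over frame-level DNN outputs.
import numpy as np

def stats_pool(frame_feats):
    """frame_feats: (n_frames, feat_dim) -> (2 * feat_dim,) embedding."""
    return np.concatenate([frame_feats.mean(axis=0), frame_feats.std(axis=0)])

# Segments of different lengths yield embeddings of identical size:
short = stats_pool(np.random.randn(50, 512))   # short segment (assumed dims)
long_ = stats_pool(np.random.randn(900, 512))  # much longer segment
assert short.shape == long_.shape == (1024,)
```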
Investigation of Full-Sequence Training of Deep Belief Networks
for Speech Recognition
Abdel-rahman Mohamed, Dong Yu , Li Deng
• Investigates approaches to optimizing DBN weights, state-to-state (transition) parameters, and the language model using sequential discriminative training
• DBN: a densely connected, highly complex nonlinear feature extractor; each hidden layer learns to represent features that capture higher-order correlations in the original input data
• Technique: 3-layer and 6-layer DBNs with sequence-based and frame-based training
• Covers RBM, DBN, and conditional-probability concepts; experiments performed on the TIMIT corpus
• Training set: 462 speakers, with a separate set of 50 speakers for model tuning
• Test set: 192 sentences with 7,333 tokens; speech analyzed with order-12 MFCCs from a 25-ms Hamming window at a 10-ms fixed frame rate (see the front-end sketch below)
• Tools: HTK, with 183 target class labels (61 phones x 3 states); after decoding, the 61 classes were mapped to a standard set of 39 classes for scoring with the HResults tool
• Results: the sequence-based 6-layer DBN outperforms the frame-based DBNs and the sequence-based 3-layer DBN
• Use: sequence-based training with RNNs/CNNs or GCNs in my thesis
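A librosa sketch of the front-end described above: order-12 MFCCs from a 25-ms Hamming window at a 10-ms frame rate (the paper used an HTK pipeline; librosa stands in here, and the file name is a placeholder):

```python
# Hedged sketch: reproduce the 25-ms Hamming / 10-ms shift / order-12
# MFCC analysis with librosa instead of HTK.
import librosa

y, sr = librosa.load('utterance.wav', sr=16000)   # placeholder file
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=12,
    n_fft=int(0.025 * sr),        # 25-ms analysis window (400 samples)
    hop_length=int(0.010 * sr),   # 10-ms frame shift (160 samples)
    window='hamming',
)
print(mfcc.shape)                 # (12, n_frames)
```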
A NETWORK OF DEEP NEURAL NETWORKS FOR DISTANT SPEECH
RECOGNITION
Mirco Ravanelli, Philemon Brakel, Maurizio Omologo , Yoshua Bengio
• Proposes a novel architecture for distant speech recognition (DSR) based on a network of DNNs
• Limitation of state-of-the-art techniques: lack of robustness and no cooperation between speech enhancement and speech recognition
• Here, all components spanning speech enhancement (SE) and speech recognition (SR) are jointly trained to mitigate the mismatch (see the sketch below)
• Technique: a DNN architecture jointly trained with the back-propagation algorithm
• Tools: Theano; Kaldi toolkit (s5 recipe, C++)
• Experiments: conducted on TIMIT (phoneme recognition) and the DIRHA English WSJ dataset
• Training set: contaminated with impulse responses, with a reverberation time of about 0.7 s
• Test set: DIRHA set of 409 WSJ sentences uttered by 6 American speakers
• System features: 39 MFCCs computed every 10 ms with a 25-ms frame length
• Results:
• Use: try to implement joint training with back-propagation; RNNs may perform better than the plain feed-forward networks used here
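A minimal sketch of the joint-training idea: an enhancement network feeds a recognition network, and a single loss at the recognizer output back-propagates through both, so the front-end learns what the recognizer needs. The layer sizes, and the 183 phone-state targets borrowed from the TIMIT setup above, are my assumptions, not the paper's architecture:

```python
# Hedged sketch: speech enhancement (SE) and speech recognition (SR)
# components chained and trained jointly with one back-propagated loss.
import tensorflow as tf
from tensorflow.keras import layers

feats = layers.Input(shape=(None, 39))                   # 39 MFCCs per frame
# SE component: map noisy/reverberant features to cleaned features.
h = layers.Dense(512, activation='relu')(feats)
enhanced = layers.Dense(39)(h)
# SR component: frame-level phone-state posteriors from enhanced features.
h = layers.Dense(512, activation='relu')(enhanced)
posteriors = layers.Dense(183, activation='softmax')(h)  # assumed targets

model = tf.keras.Model(feats, posteriors)
# One loss at the SR output updates both components jointly.
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```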
AN INVESTIGATION OF DEEP NEURAL NETWORKS FOR NOISE
ROBUST SPEECH RECOGNITION
Michael L. Seltzer, Dong Yu, Yongqiang Wang
• Evaluates the noise robustness of DNN-based acoustic models
• Technique: DNN noise-aware training and DNN dropout training (see the dropout sketch below)
• Dataset: Aurora 4, a noise-robust speech recognition task based on the WSJ0 corpus
• Training set: 7,137 utterances from 83 speakers, recorded at 16 kHz, with multi-condition training
• Evaluation/test set: WSJ0 5K-word corpus, 330 utterances from 8 speakers
• Test set recorded using both a primary and a secondary microphone
• Both sets are corrupted with the same 6 noise types used in the training set at 5–15 dB SNR, yielding 14 test sets in total
• Results: performance reported as a function of the number of senones and hidden layers
• Use: the dropout technique could be applied to speaker recognition; data augmentation is better than dropout
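A hedged sketch of the dropout-training idea: randomly zeroing hidden units during training regularizes the acoustic model. The input width (stacked context frames), layer sizes, dropout rate, and senone count are all illustrative assumptions:

```python
# Sketch: DNN acoustic model with dropout between hidden layers.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(440,)),                 # e.g. 11 frames x 40 features
    layers.Dense(2048, activation='sigmoid'),
    layers.Dropout(0.2),                        # active only during training
    layers.Dense(2048, activation='sigmoid'),
    layers.Dropout(0.2),
    layers.Dense(2000, activation='softmax'),   # senone posteriors (assumed)
])
model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy')
```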
Related Research Papers
• Deng, L., Seltzer, M., Yu, D., Acero, A., Mohamed, A., and Hinton, G. “Binary coding of speech spectrograms using deep
auto-encoder,” Proc. Interspeech, 2010.
• Schwarz, P., Matějka, P., and Černocký, J. “Hierarchical structures of neural networks for phoneme recognition,” Proc. ICASSP,
2006, pp. 325–328.
• Bilmes, J. and Bartels, C. “Graphical model architectures for speech recognition,” IEEE Sig. Proc. Mag., vol. 22, Sept. 2005,
pp. 89–100
• T. Gao, J. Du, L. R. Dai, and C. H. Lee, “Joint training of front-end and back-end deep neural networks for robust speech
recognition,” in Proc. of ICASSP, 2015, pp. 4375–4379
• A. Ragni and M. J. F. Gales, “Derivative kernels for noise robust ASR,” in IEEE Workshop on Automatic Speech Recognition
and Understanding, 2011
• F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Proc.
Interspeech, 2011.
• Y. Xu, J. Du, L. R. Dai, and C. H. Lee, “A regression approach to speech enhancement based on deep neural networks,”
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, Jan 2015.
• D. Yu, L. Deng, J. Droppo, J. Wu, Y. Gong, and A. Acero, “A Minimum-mean-square-error noise reduction algorithm on mel-
frequency cepstra for robust speech recognition,” in Proc. of ICASSP, Las Vegas, NV, 2008
Related Research Papers
• D. Garcia-Romero, X. Zhang, A. McCree, and D. Povey, “Improving speaker recognition performance in the domain
adaptation challenge using deep neural networks,” in Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, 2014,
pp. 378– 383.
• O. Novotný, P. Matějka, O. Glembek, O. Plchot, F. Grézl, L. Burget, and J. Černocký, “Analysis of the DNN-based SRE
systems in multi-language conditions,” in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016.
• E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint
text-dependent speaker verification,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International
Conference on. IEEE, 2014, pp. 4052–4056.
• Najim Dehak, Patrick J Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, “Front-end factor analysis for speaker
verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
• Felix Weninger, “Introducing CURRENNT: The Munich open-source CUDA recurrent neural network toolkit,” Journal of Machine
Learning Research, vol. 16, pp. 547–551, 2015.
• Yann LeCun and Yoshua Bengio, “Convolutional networks for images, speech, and time series,” The handbook of brain
theory and neural networks, vol. 3361, no. 10, p. 1995, 1995.
• Ossama Abdel-Hamid, Li Deng, and Dong Yu, “Exploring convolutional neural network structures and optimization
techniques for speech recognition.,” in INTERSPEECH, 2013, pp. 3366–3370.
• Honglak Lee, Peter Pham, Yan Largman, and Andrew Y Ng, “Unsupervised feature learning for audio classification using
convolutional deep belief networks,” in Advances in neural information processing systems, 2009, pp. 1096–1104
