Topic: Deep Learning – Speaker
Recognition for Security & IoT
01/03/2018
Sai Kiran Kadam (SK)
Description: Automatic Text-Independent Speaker
Recognition using DNNs/DBNs for Distant Noise
Robust Speech – Emotion Recognition
SPEAKER IDENTIFICATION & CLUSTERING USING
CONVOLUTIONAL NEURAL NETWORKS
Yanick Lukic, Carlo Vogt, Oliver Dürr, Thilo Stadelmann
• Speaker ID using a CNN; input to the CNN: spectrograms (cepstral analysis)
• Speaker clustering: telling who spoke without prior knowledge of identity
• Technique/Method: apply CNNs to spectrograms to learn speaker-specific features (see the sketch below)
• Libraries used: Python, with LIBROSA (to compute the input spectrograms) and LASAGNE (to build and train the CNN)
• Training: dataset of studio-quality recordings from 630 speakers (192 female, 438 male)
• Experiments & Results:
• Finding the optimal convolutional filter dimensions
• Speaker identification performance: 97.0% accuracy, corresponding to 19 of the 630 speakers misidentified
• Clustering performance: evaluated via misclassification rate
• Use: apply the clustering and convolutional architecture to my work
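A minimal sketch of this spectrogram-plus-CNN idea, assuming a log-power STFT input and an illustrative network (the paper builds its CNN in Lasagne; Keras is used here only for brevity, and the filter sizes, input shape, and 630-way softmax head are my assumptions):

```python
# Hedged sketch: spectrogram front-end (librosa, as in the paper) feeding
# a small CNN speaker classifier. All layer sizes are illustrative.
import librosa
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def log_spectrogram(path, sr=16000, n_fft=512, hop=160):
    y, _ = librosa.load(path, sr=sr)
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    return librosa.power_to_db(S ** 2)        # (257 bins, n_frames)

model = tf.keras.Sequential([
    layers.Input(shape=(257, 100, 1)),        # freq bins x frames x 1 channel
    layers.Conv2D(32, (4, 4), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (4, 4), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(630, activation='softmax'),  # one class per speaker
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```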
RECURRENT NEURAL NETWORKS FOR POLYPHONIC SOUND EVENT
DETECTION IN REAL LIFE RECORDINGS
Giambattista Parascandolo, Heikki Huttunen, Tuomas Virtanen
• Presents an approach to polyphonic sound event detection (SED) in real-life recordings
• Technique: bidirectional LSTM recurrent network (BLSTM-RNN); see the sketch below
• Training data: 103 recordings (each 10–30 min long), 1,133 minutes in total, from 10 real-life contexts, with 8–14 recordings per context
• Testing data: a database of 61 event classes from 10 different real-life contexts
• Results/How good it is: average F1 score of 65.5% on 1-second blocks and 64.7% on single frames, relative improvements over state-of-the-art methods of 6.8% and 15.1%, respectively
• Limitations: overfitting, since the dataset is small relative to the network (mitigate with data augmentation)
• Use: BLSTM-RNN with data augmentation for my thesis
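A hedged BLSTM-RNN sketch for polyphonic SED, assuming per-frame log-mel features and sigmoid (multi-label) frame outputs as the summary describes; the layer sizes and feature dimension are my assumptions:

```python
# Sketch: stacked BLSTM with per-frame sigmoid outputs, since several of
# the 61 event classes can be active at once in polyphonic audio.
import tensorflow as tf
from tensorflow.keras import layers

n_frames, n_feats, n_classes = 100, 40, 61    # e.g. 40 log-mel bands (assumed)

model = tf.keras.Sequential([
    layers.Input(shape=(n_frames, n_feats)),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    # Sigmoid rather than softmax: events may overlap within a frame.
    layers.TimeDistributed(layers.Dense(n_classes, activation='sigmoid')),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
```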
Using Deep Learning for Detecting Spoofing Attacks on Speech
Signals
Alan Godoy, Flávio Simões, José Augusto Stuchi, Marcus de Assis Angeloni, Mário Uliani, Ricardo Violato
• About the Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015), approached with deep neural networks
• Biometric spoofing: a direct attack on a biometric authentication system, perpetrated by presenting a fake/forged biometric sample
• Technique: a DNN used as both a classifier and a feature-extraction module (see the sketch below)
• Feature extraction: DNN-based MLP with a 2668-dimensional input feature vector
• Trained with the back-propagation algorithm and stochastic gradient descent (SGD) optimization
• How good it is: the MLP achieved EER < 0.5%, beating SVM-RBF and GMM baselines
• Limitation: an MLP is not as effective as a CNN or a BLSTM-RNN
• Tradeoffs: a BLSTM-RNN over an MLP avoids losing long-term information but increases EER
• Use: BLSTM-RNN with spoofing detection for security
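A minimal sketch of the MLP countermeasure described above, trained with back-propagation and SGD; the 2668-dimensional input comes from the summary, while the hidden sizes and single sigmoid output (genuine vs. spoofed) are my assumptions:

```python
# Hedged sketch: binary spoofing-detection MLP on a 2668-dim feature vector.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(2668,)),              # feature vector per utterance
    layers.Dense(512, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(1, activation='sigmoid'),    # genuine vs. spoofed
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss='binary_crossentropy', metrics=['accuracy'])
```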
END-TO-END ATTENTION BASED TEXT-DEPENDENT SPEAKER
VERIFICATION
Shi-Xiong Zhang, Zhuo Chen , Yong Zhao, Jinyu Li and Yifan Gong
• Presents an end-to-end system that uses CNNs to extract noise-robust, frame-level, speaker-discriminative features
• These features are combined by an attention mechanism (see the sketch below)
• The CNN and the attention model are jointly optimized with an end-to-end criterion
• Technique: CNN + end-to-end architecture
• Tools: Theano framework with the Keras package (Python)
• Testing: the system is evaluated on the Windows 10 “Hey Cortana” speaker verification task
• The end-to-end architecture has 3 phases:
• Training: 200k utterances from 10k speakers, each with 10–30 utterances
• Enrollment: 6 utterances of “Hey Cortana”
• Evaluation: 60k utterances from 3k target speakers and 3k impostors
• The attention mechanism with a DNN outperforms CNN and LSTM baselines
• Use: possibly apply the attention model + BLSTM-RNN to my work? Not sure yet.
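A minimal attention-pooling sketch: frame-level features (e.g. CNN outputs) are collapsed into one utterance vector via learned weights. This only illustrates the pooling idea; the paper's actual layers, dimensions, and scoring back-end are not reproduced here:

```python
# Hedged sketch: learned attention weights pool variable-length
# frame features into a fixed-size utterance embedding.
import tensorflow as tf
from tensorflow.keras import layers

frames = layers.Input(shape=(None, 128))      # (time, feature), any length
scores = layers.Dense(1)(frames)              # one relevance score per frame
weights = layers.Softmax(axis=1)(scores)      # normalize over time
pooled = layers.Lambda(
    lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([weights, frames])
model = tf.keras.Model(frames, pooled)        # outputs a (batch, 128) vector
```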
Deep Neural Network Embeddings for Text-Independent
Speaker Verification
David Snyder, Daniel Garcia-Romero, Daniel Povey, Sanjeev Khudanpur
• Investigates replacing i-vectors with embeddings from a feed-forward DNN for text-independent speaker verification
• i-vectors: low-dimensional representations of speech that capture speaker and channel characteristics
• A temporal pooling layer captures long-term speaker characteristics, letting the network be trained to discriminate between speakers from variable-length speech segments (see the sketch below)
• Tools: Kaldi speech recognition toolkit (nnet3 neural-network library) – useful for my work
• Training data: telephone speech, 65,000 recordings from 6,500 speakers
• Evaluation: on NIST SRE2010 and SRE2016
• Results:
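A tiny numpy illustration of the temporal (statistics) pooling idea: frame-level activations are reduced to their mean and standard deviation, so utterances of any length map to embeddings of the same size (the 512-dim frame features are an assumption):

```python
# Hedged sketch of statistics pooling over frame-level DNN outputs.
import numpy as np

def stats_pool(frame_feats):
    """frame_feats: (n_frames, feat_dim) -> (2 * feat_dim,) embedding."""
    return np.concatenate([frame_feats.mean(axis=0), frame_feats.std(axis=0)])

# Segments of different lengths yield embeddings of identical size:
short = stats_pool(np.random.randn(50, 512))   # short segment (assumed dims)
long_ = stats_pool(np.random.randn(900, 512))  # much longer segment
assert short.shape == long_.shape == (1024,)
```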
Investigation of Full-Sequence Training of Deep Belief Networks
for Speech Recognition
Abdel-rahman Mohamed, Dong Yu , Li Deng
• Investigates approaches to optimizing DBN weights, state-to-state (transition) parameters, and the language model using sequential discriminative training
• DBN: a densely connected, highly complex nonlinear feature extractor; each hidden layer learns to represent features that capture higher-order correlations in the original input data
• Technique: 3-layer and 6-layer DBNs with sequence-based and frame-based training
• Covers RBM, DBN, and conditional-probability concepts; experiments performed on the TIMIT corpus
• Training set: 462 speakers, with a separate set of 50 speakers for model tuning
• Test set: 192 sentences with 7,333 tokens; speech analyzed with order-12 MFCCs from a 25-ms Hamming window at a 10-ms fixed frame rate (see the front-end sketch below)
• Tools: HTK, with 183 target class labels (61 phones x 3 states); after decoding, the 61 classes were mapped to a standard set of 39 classes for scoring with the HResults tool
• Results: the sequence-based 6-layer DBN outperforms the frame-based DBNs and the sequence-based 3-layer DBN
• Use: sequence-based training with RNNs/CNNs or GCNs in my thesis
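A librosa sketch of the front-end described above: order-12 MFCCs from a 25-ms Hamming window at a 10-ms frame rate (the paper used an HTK pipeline; librosa stands in here, and the file name is a placeholder):

```python
# Hedged sketch: reproduce the 25-ms Hamming / 10-ms shift / order-12
# MFCC analysis with librosa instead of HTK.
import librosa

y, sr = librosa.load('utterance.wav', sr=16000)   # placeholder file
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=12,
    n_fft=int(0.025 * sr),        # 25-ms analysis window (400 samples)
    hop_length=int(0.010 * sr),   # 10-ms frame shift (160 samples)
    window='hamming',
)
print(mfcc.shape)                 # (12, n_frames)
```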
A NETWORK OF DEEP NEURAL NETWORKS FOR DISTANT SPEECH
RECOGNITION
Mirco Ravanelli, Philemon Brakel, Maurizio Omologo , Yoshua Bengio
• Proposes a novel architecture for distant speech recognition (DSR) based on a network of DNNs
• Limitation of state-of-the-art techniques: lack of robustness and no cooperation between speech enhancement and speech recognition
• Here, all components spanning speech enhancement (SE) and speech recognition (SR) are jointly trained to mitigate the mismatch (see the sketch below)
• Technique: a DNN architecture jointly trained with the back-propagation algorithm
• Tools: Theano; Kaldi toolkit (s5 recipe, C++)
• Experiments: conducted on TIMIT (phoneme recognition) and the DIRHA English WSJ dataset
• Training set: contaminated with impulse responses, with a reverberation time of about 0.7 s
• Test set: DIRHA set of 409 WSJ sentences uttered by 6 American speakers
• System features: 39 MFCCs computed every 10 ms with a 25-ms frame length
• Results:
• Use: try to implement joint training with back-propagation; RNNs may perform better than the plain feed-forward networks used here
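A minimal sketch of the joint-training idea: an enhancement network feeds a recognition network, and a single loss at the recognizer output back-propagates through both, so the front-end learns what the recognizer needs. The layer sizes, and the 183 phone-state targets borrowed from the TIMIT setup above, are my assumptions, not the paper's architecture:

```python
# Hedged sketch: speech enhancement (SE) and speech recognition (SR)
# components chained and trained jointly with one back-propagated loss.
import tensorflow as tf
from tensorflow.keras import layers

feats = layers.Input(shape=(None, 39))                   # 39 MFCCs per frame
# SE component: map noisy/reverberant features to cleaned features.
h = layers.Dense(512, activation='relu')(feats)
enhanced = layers.Dense(39)(h)
# SR component: frame-level phone-state posteriors from enhanced features.
h = layers.Dense(512, activation='relu')(enhanced)
posteriors = layers.Dense(183, activation='softmax')(h)  # assumed targets

model = tf.keras.Model(feats, posteriors)
# One loss at the SR output updates both components jointly.
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```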
AN INVESTIGATION OF DEEP NEURAL NETWORKS FOR NOISE
ROBUST SPEECH RECOGNITION
Michael L. Seltzer, Dong Yu, Yongqiang Wang
• Evaluates the noise robustness of DNN-based acoustic models
• Technique: DNN noise-aware training and DNN dropout training (see the dropout sketch below)
• Dataset: Aurora 4, a noise-robust speech recognition task based on the WSJ0 corpus
• Training set: 7,137 utterances from 83 speakers, recorded at 16 kHz, with multi-condition training
• Evaluation/test set: WSJ0 5K-word corpus, 330 utterances from 8 speakers
• Test set recorded using both a primary and a secondary microphone
• Both sets are corrupted with the same 6 noise types used in the training set at 5–15 dB SNR, yielding 14 test sets in total
• Results: performance reported as a function of the number of senones and hidden layers
• Use: the dropout technique could be applied to speaker recognition; data augmentation is better than dropout
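A hedged sketch of the dropout-training idea: randomly zeroing hidden units during training regularizes the acoustic model. The input width (stacked context frames), layer sizes, dropout rate, and senone count are all illustrative assumptions:

```python
# Sketch: DNN acoustic model with dropout between hidden layers.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(440,)),                 # e.g. 11 frames x 40 features
    layers.Dense(2048, activation='sigmoid'),
    layers.Dropout(0.2),                        # active only during training
    layers.Dense(2048, activation='sigmoid'),
    layers.Dropout(0.2),
    layers.Dense(2000, activation='softmax'),   # senone posteriors (assumed)
])
model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy')
```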
Related Research Papers
• Deng, L., Seltzer, M., Yu, D., Acero, A., Mohamed, A., and Hinton, G. “Binary coding of speech spectrograms using deep
auto-encoder,” Proc. Interspeech, 2010.
• Schwarz, P., Matějka, P., and Černocký, J. “Hierarchical structures of neural networks for phoneme recognition,” Proc. ICASSP,
2006, pp. 325–328.
• Bilmes, J. and Bartels, C. “Graphical model architectures for speech recognition,” IEEE Sig. Proc. Mag., vol. 22, Sept. 2005,
pp. 89–100
• T. Gao, J. Du, L. R. Dai, and C. H. Lee, “Joint training of front-end and back-end deep neural networks for robust speech
recognition,” in Proc. of ICASSP, 2015, pp. 4375–4379
• A. Ragni and M. J. F. Gales, “Derivative kernels for noise robust ASR,” in IEEE Workshop on Automatic Speech Recognition
and Understanding, 2011
• F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Proc.
Interspeech, 2011.
• Y. Xu, J. Du, L. R. Dai, and C. H. Lee, “A regression approach to speech enhancement based on deep neural networks,”
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, Jan 2015.
• D. Yu, L. Deng, J. Droppo, J. Wu, Y. Gong, and A. Acero, “A Minimum-mean-square-error noise reduction algorithm on mel-
frequency cepstra for robust speech recognition,” in Proc. of ICASSP, Las Vegas, NV, 2008
Related Research Papers
• D. Garcia-Romero, X. Zhang, A. McCree, and D. Povey, “Improving speaker recognition performance in the domain
adaptation challenge using deep neural networks,” in Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, 2014,
pp. 378– 383.
• O. Novotný, P. Matějka, O. Glembek, O. Plchot, F. Grézl, L. Burget, and J. Černocký, “Analysis of the DNN-based SRE
systems in multi-language conditions,” in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016.
• E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint
text-dependent speaker verification,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International
Conference on. IEEE, 2014, pp. 4052–4056.
• Najim Dehak, Patrick J Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, “Front-end factor analysis for speaker
verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
• Felix Weninger, “Introducing CURRENNT: The Munich open-source CUDA recurrent neural network toolkit,” Journal of Machine
Learning Research, vol. 16, pp. 547–551, 2015.
• Yann LeCun and Yoshua Bengio, “Convolutional networks for images, speech, and time series,” The handbook of brain
theory and neural networks, vol. 3361, no. 10, p. 1995, 1995.
• Ossama Abdel-Hamid, Li Deng, and Dong Yu, “Exploring convolutional neural network structures and optimization
techniques for speech recognition.,” in INTERSPEECH, 2013, pp. 3366–3370.
• Honglak Lee, Peter Pham, Yan Largman, and Andrew Y Ng, “Unsupervised feature learning for audio classification using
convolutional deep belief networks,” in Advances in neural information processing systems, 2009, pp. 1096–1104
