
Speaker ID II (D4L1 Deep Learning for Speech and Language UPC 2017)


Winter School on Deep Learning for Speech and Language. UPC BarcelonaTech ETSETB TelecomBCN.

The aim of this course is to train students in methods of deep learning for speech and language. Recurrent Neural Networks (RNNs) will be presented and analyzed in detail so that students understand the potential of these state-of-the-art tools for time-series processing. Engineering tips and scalability issues will be addressed for tasks such as machine translation, speech recognition, speech synthesis, and question answering. Hands-on sessions will provide development skills so that attendees can become competent in contemporary data analytics tools.


  1. [course site] Day 4 Lecture 1 Speaker ID II Javier Hernando
  2. DL Modeling i-Vectors
  3. DL Modeling i-Vectors. O. Ghahabi, J. Hernando, “Deep Learning Backend for Single and Multi-Session i-Vector Speaker Recognition”, to appear in IEEE Trans. Audio, Speech and Language Processing
  4. Decoder. Problem: a large number of impostor data (negative samples) but very few target data (positive samples). Solutions: global impostor selection; clustering using K-means; equally distributing positive and negative samples among minibatches
  5. DL Modeling i-Vectors
  6. DL Modeling i-Vectors
  7. DL Modeling i-Vectors
  8. DL Modeling i-Vectors
  9. DL Modeling i-Vectors
  10. DL Feature Classification
  11. DL Feature Classification: Speaker Verification, Speaker Identification. Credit: S. H. Yella
  12. DL Feature Classification. P. Safari, O. Ghahabi, J. Hernando, “Restricted Boltzmann Machines for speaker vector extraction and feature classification”, Proc. URSI 2016
  13. DL i-vector Extraction
  14. DL i-vector Extraction
  15. DL ‘speaker-vectors’
  16. RBM vectors. P. Safari, O. Ghahabi, and J. Hernando, “From Features to Speaker Vectors by means of Restricted Boltzmann Machine Adaptation”, Odyssey Speaker and Language Recognition Workshop, June 2016
  17. RBM vectors
  18. RBM vectors
  19. CDBN vectors
  20. CDBN vectors. H. Lee et al., “Unsupervised feature learning for audio classification using convolutional deep belief networks”, Advances in Neural Information Processing Systems, 22:1096–1104, 2009
  21. DL ‘supervector like’ estimation
  22. Tasks
  23. SoA Speaker Diarization
  24. DL in Speaker Diarization
  25. DL Feature Classification
  26. Speaker Clustering: Speaker Comparison; Shallow Speaker Comparison. S. H. Yella et al., “Artificial Neural Network Features for Speaker Diarization”, IEEE Spoken Language Technology Workshop (2014) 402-406. Speaker errors obtained on the AMI and ICSI datasets for matched and mismatched training conditions. MFCC corresponds to baseline clustering using BIC; ANN+MFCC refers to the ANN shown in the right figure.
  27. DL ‘speaker-vectors’
  28. Speaker Embeddings. Mickael Rouvier et al., “Speaker Diarization through Speaker Embeddings”, 23rd European Signal Processing Conference (2015)
  29. Speaker Embeddings. 500-dimensional speaker embeddings rearranged as 10×50. Representation of two utterances from each speaker. 2-D projection of four speaker embeddings using PCA.
  30. DL ‘speaker-vectors’
  31. CNN BN Features. Yanik Lukic et al., “Speaker Identification and Clustering using Convolutional Neural Networks”, 2016 IEEE International Workshop on Machine Learning for Signal Processing. Five speaker representations in two dimensions: the left figure shows the output vector of the softmax layer L8; the right figure corresponds to the same output vector of the L5 dense layer. Different colors are assigned to different speakers.
  32. CNN BN Features ● L5 and L7 sizes depend proportionally on the number of speakers. ● L5 and L7 outperform the softmax layer L8, and L7 is better than L5. ● Training data (speaker amount) must be above 10 × (# speakers) for good performance.
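Slide 4's remedies for class imbalance (summarizing a large impostor set by K-means clustering, then building class-balanced minibatches) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the dimensions, cluster count, and batch size are arbitrary assumptions.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means; the centroids act as a compact 'global impostor' set."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid, then recompute means
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def balanced_minibatches(pos, neg, batch_size, seed=0):
    """Yield minibatches with equal numbers of target and impostor samples,
    oversampling the scarce positive (target) class."""
    rng = np.random.default_rng(seed)
    half = batch_size // 2
    neg_idx = rng.permutation(len(neg))
    for b in range(int(np.ceil(len(neg) / half))):
        ni = neg_idx[b * half:(b + 1) * half]
        pi = rng.choice(len(pos), size=len(ni), replace=True)  # oversample positives
        X = np.vstack([pos[pi], neg[ni]])
        y = np.concatenate([np.ones(len(pi)), np.zeros(len(ni))])
        yield X, y

# toy i-vector-like data: 5 target samples vs 500 impostor samples
rng = np.random.default_rng(1)
pos = rng.normal(1.0, 0.1, size=(5, 8))
neg = rng.normal(0.0, 1.0, size=(500, 8))
neg_centroids = kmeans(neg, k=50)  # summarize 500 impostors by 50 centroids
Xb, yb = next(balanced_minibatches(pos, neg_centroids, batch_size=16))
```

Each yielded batch contains as many target as impostor rows, so the decoder never sees a gradient step dominated by negatives.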
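The verification and identification tasks on slides 10-12 both reduce to scoring a test vector against enrolled speaker vectors; cosine similarity is the usual backend score for i-vector-style representations. A minimal sketch (the threshold and the vectors are toy values, not from the slides):

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two speaker vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(test_vec, target_vec, threshold=0.5):
    """Speaker verification: accept or reject a claimed identity."""
    return cosine_score(test_vec, target_vec) >= threshold

def identify(test_vec, enrolled):
    """Closed-set speaker identification: pick the best-matching enrolled speaker."""
    scores = {spk: cosine_score(test_vec, v) for spk, v in enrolled.items()}
    return max(scores, key=scores.get)

enrolled = {"spk_a": np.array([1.0, 0.0, 0.2]),
            "spk_b": np.array([0.0, 1.0, 0.1])}
test = np.array([0.9, 0.1, 0.2])
```

Verification answers a binary question about one claimed speaker, while identification ranks all enrolled speakers; the scoring function is shared.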
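Slides 16-18 build "RBM vectors" by adapting a Restricted Boltzmann Machine to a speaker's data. The sketch below trains a Bernoulli-Bernoulli RBM with one-step contrastive divergence (CD-1) in NumPy and pools the hidden activations into a fixed-length vector; the sizes, learning rate, and mean-pooling are simplified assumptions, not the paper's exact recipe.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Bernoulli-Bernoulli RBM trained with CD-1; pooled hidden activations
    of an adapted model serve as a fixed-length 'speaker vector'."""
    def __init__(self, n_vis, n_hid, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((n_vis, n_hid))
        self.b = np.zeros(n_vis)   # visible bias
        self.c = np.zeros(n_hid)   # hidden bias
        self.rng = rng

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.c)

    def cd1_step(self, v0, lr=0.05):
        h0 = self.hidden_probs(v0)
        h_sample = (self.rng.random(h0.shape) < h0).astype(float)
        v1 = sigmoid(h_sample @ self.W.T + self.b)   # one-step reconstruction
        h1 = self.hidden_probs(v1)
        # CD-1 approximation to the log-likelihood gradient
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
        self.b += lr * (v0 - v1).mean(axis=0)
        self.c += lr * (h0 - h1).mean(axis=0)
        return np.mean((v0 - v1) ** 2)   # reconstruction error

rng = np.random.default_rng(2)
frames = (rng.random((200, 16)) < 0.3).astype(float)  # toy binarized 'frames'
rbm = RBM(n_vis=16, n_hid=8)
errors = [rbm.cd1_step(frames) for _ in range(100)]
speaker_vector = rbm.hidden_probs(frames).mean(axis=0)  # one vector per utterance
```

Mean-pooling the hidden probabilities turns a variable number of frames into a single fixed-length representation, which is the property the slides exploit for classification.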
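Slide 29's 2-D PCA projection of 500-dimensional speaker embeddings can be reproduced on toy data. The embeddings below are synthetic stand-ins (Gaussian clusters, two utterances per speaker), not outputs of the paper's network:

```python
import numpy as np

def pca_2d(X):
    """Project row vectors onto their top-2 principal components."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data matrix: rows of Vt are principal directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

# toy 'speaker embeddings': 4 speakers x 2 utterances, 500-dim as in the slide
rng = np.random.default_rng(3)
centers = rng.normal(0, 5, size=(4, 500))                       # one center per speaker
emb = np.vstack([c + rng.normal(0, 0.5, size=(2, 500)) for c in centers])
proj = pca_2d(emb)  # 8 points in 2-D; same-speaker utterances land close together
```

Because the between-speaker variance dominates, the top two principal components mostly separate speakers, which is why such plots show tight same-speaker pairs.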