
IV_WORKSHOP_NVIDIA-Audio_Processing


  1. IV WORKSHOP NVIDIA DE GPU E CUDA - Audio Processing using Convolutional Neural Network. Diego Augusto, September 6, 2016
  2. Speech Activity Detection (SAD) ❖ Distinguish speech and noise segments. ❖ Estimate start and end times of speech events. [Figure: waveform]
  3. Speech Activity Detection (SAD) ❖ Distinguish speech and noise segments. ❖ Estimate start and end times of speech events. [Figure: waveform with two speech segments: #1 start 1.2 s, end 2.5 s; #2 start 3.3 s, end 4.9 s]
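The task on slides 2-3 (frame-level speech/noise decisions, then start/end times of speech events) can be illustrated with the simplest baseline from the results slide, an energy threshold. This is a hypothetical sketch, not the presented system; the 25 ms/10 ms framing and the -30 dB threshold are assumed defaults:

```python
import numpy as np

def energy_sad(signal, sr, frame_ms=25, hop_ms=10, threshold_db=-30.0):
    """Label each frame speech (1) or noise (0) by short-time log-energy.

    A deliberately simple baseline: frames whose log-energy lies within
    `threshold_db` of the loudest frame count as speech.
    """
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = max(0, 1 + (len(signal) - frame) // hop)
    energies = np.array([np.sum(signal[i * hop:i * hop + frame] ** 2)
                         for i in range(n_frames)])
    log_e = 10.0 * np.log10(energies + 1e-12)
    return (log_e - log_e.max() > threshold_db).astype(int)

def segments(labels, hop_ms=10):
    """Turn frame labels into (start_sec, end_sec) speech events."""
    events, start = [], None
    for i, v in enumerate(labels):
        if v and start is None:
            start = i
        elif not v and start is not None:
            events.append((start * hop_ms / 1000, i * hop_ms / 1000))
            start = None
    if start is not None:
        events.append((start * hop_ms / 1000, len(labels) * hop_ms / 1000))
    return events
```

On half a second of low-level noise followed by half a second of a loud tone, `segments(energy_sad(sig, sr))` returns a single event starting near 0.5 s.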
  4. Applications ❖ Segmentation of spontaneous speech: ➢ Live language translation. ➢ Speech transmission over audio codecs. ➢ Retrieval of speech in video and social networks.
  5. Applications ❖ Segmentation of spontaneous speech: ➢ Live language translation. ➢ Speech transmission over audio codecs. ➢ Retrieval of speech in video and social networks. ❖ Pre-processing for speech engines: ➢ Speech Recognition - “what is being said?” ➢ Speaker Authentication - “who is speaking?” ➢ Speaker Diarization - “who spoke when?”
  6. Challenges ❖ Wide variety of noise types: ➢ Clicking, motor sounds, background voices. ❖ Voice distortion, overlapping sounds.
  7. Convolutional Neural Network (CNN) ❖ CNN approach: ➢ Features are extracted automatically by the network. ❖ Inspired by the human visual system (visual cortex). ❖ Extracts distinctive features.
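The slides built and trained the network in NVIDIA DIGITS (Caffe) and do not give the architecture. As a rough, hypothetical sketch of the kind of model involved, here is a minimal two-layer CNN in PyTorch that maps a spectrogram patch to a speech/noise decision; all layer sizes and the 64x32 input shape are invented for illustration:

```python
import torch
import torch.nn as nn

class SpeechNoiseCNN(nn.Module):
    """Minimal CNN for frame-wise speech/noise classification.

    Hypothetical architecture for illustration only; the presented
    system used NVIDIA DIGITS/Caffe and its layers are unspecified.
    Input: a 1-channel spectrogram patch of 64 freq bins x 32 frames.
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                 # 64x32 -> 32x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                 # 32x16 -> 16x8
        )
        self.classifier = nn.Linear(32 * 16 * 8, 2)  # 2 classes: noise/speech

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))
```

A batch of shape `(N, 1, 64, 32)` yields logits of shape `(N, 2)`; the convolutional layers learn the "automatically extracted" features the slide refers to.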
  8. CPqD Dataset ❖ > 300 hours of speech and noise, ➢ with ground truth. ❖ Environments: ➢ Phone conversations. ➢ PCs and IoT devices (mobile apps). ❖ Split into two parts: ➢ Development = 75%. ➢ Evaluation = 25%.
  9. Speech/Noise Features [Figure: waveform and its spectrogram]
  10. Speech/Noise Features [Figure: spectrogram with speech frames labeled 1 (0 = noise, 1 = speech)]
  11. Speech/Noise Features [Figure: spectrogram with every frame labeled (0 = noise, 1 = speech)]
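Slides 9-11 show the training targets: each spectrogram frame is paired with a 0 (noise) or 1 (speech) label. A minimal numpy sketch of the spectrogram computation is below; the 25 ms window and 10 ms hop are common speech-processing defaults, not values taken from the slides:

```python
import numpy as np

def log_spectrogram(signal, sr, frame_ms=25, hop_ms=10):
    """Log-magnitude spectrogram: one column of FFT magnitudes per frame.

    Each column would be paired with a 0/1 speech label as the
    CNN's per-frame training target.
    """
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    window = np.hanning(frame)
    cols = []
    for start in range(0, len(signal) - frame + 1, hop):
        spectrum = np.fft.rfft(signal[start:start + frame] * window)
        cols.append(np.log(np.abs(spectrum) + 1e-10))
    return np.stack(cols, axis=1)  # shape: (freq_bins, n_frames)
```

For a 1 s signal at 8 kHz this produces 98 frames of 101 frequency bins each, and a pure 440 Hz tone peaks in bin 11 (440 Hz / 40 Hz bin spacing).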
  12. Deep Learning Platform [Diagram: NVIDIA DIGITS 4 on a GPU GRID K520, Linux 64-bit; stages: manage, development (feature extraction, training), evaluation (testing), reinforcement learning]
  13. NVIDIA DIGITS [Screenshot: monitoring training and testing of the model; example classification output: 99.93 for class 1, 0.07 for class 0]
  14. Evaluation [Figure: ground-truth speech segments over the spectrogram] ❖ Half-Total Error Rate: HTER = (MSR + FAR) / 2 ➢ Miss Speech Rate, MSR (%): ■ (# speech samples not detected as speech / total number of speech samples) x 100 ➢ False Alarm Rate, FAR (%): ■ (# nonspeech samples detected as speech / total number of nonspeech samples) x 100
  15. Evaluation [Figure: hypothesis segments against ground truth over the spectrogram, errors marked FA (false alarm) and MS (miss speech)] ❖ Half-Total Error Rate: HTER = (MSR + FAR) / 2 ➢ Miss Speech Rate, MSR (%): ■ (# speech samples not detected as speech / total number of speech samples) x 100 ➢ False Alarm Rate, FAR (%): ■ (# nonspeech samples detected as speech / total number of nonspeech samples) x 100
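The HTER metric from slides 14-15 is straightforward to compute from frame-level labels; a short sketch:

```python
import numpy as np

def hter(ground_truth, hypothesis):
    """Half-Total Error Rate from frame labels (1 = speech, 0 = noise).

    MSR: fraction of speech frames not detected as speech.
    FAR: fraction of nonspeech frames detected as speech.
    HTER is their mean.
    """
    gt = np.asarray(ground_truth)
    hyp = np.asarray(hypothesis)
    speech = gt == 1
    noise = gt == 0
    msr = np.sum(speech & (hyp == 0)) / np.sum(speech)
    far = np.sum(noise & (hyp == 1)) / np.sum(noise)
    return msr, far, (msr + far) / 2.0
```

For example, with ground truth `[1, 1, 1, 1, 0, 0, 0, 0]` and hypothesis `[1, 1, 0, 1, 1, 0, 0, 0]`, one of four speech frames is missed and one of four noise frames is a false alarm, so MSR = FAR = HTER = 0.25.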
  16. Evaluation ❖ QUT-NOISE-TIMIT: ➢ Large-scale dataset for evaluating SAD algorithms. ❖ Technical challenges and future work: ➢ Automatic adaptation to the environment. ➢ Overlapping sound events. ➢ Applying the CNN approach to other problems. ❖ Results (Features / Classifier / HTER): ➢ Energy / Threshold / 26.3% ➢ MFCC / GMM-HMM / 4.7% ➢ Spectrogram / CNN / 3.2%
  17. References ● J. Sohn, N. S. Kim, and W. Sung, “A statistical model-based voice activity detection,” IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1–3, 1999. ● W. H. Abdulla, Z. Guan, and H. C. Sou, “Noise robust speech activity detection,” in Signal Processing and Information Technology (ISSPIT), 2009 IEEE International Symposium on. IEEE, 2009, pp. 473–477. ● D. B. Dean, S. Sridharan, R. J. Vogt, and M. W. Mason, “The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms,” Proceedings of Interspeech 2010, 2010. ● D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi Speech Recognition Toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011. ● S. Thomas, S. Ganapathy, G. Saon, and H. Soltau, “Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 2519–2523. ● Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014. ● H. Ghaemmaghami, D. Dean, S. Kalantari, S. Sridharan, and C. Fookes, “Complete-linkage clustering for voice activity detection in audio and visual speech,” 2015. ● NVIDIA Deep Learning GPU Training System (DIGITS) 4. Retrieved July 18, 2016, from https://developer.nvidia.com/digits.
  18. www.cpqd.com.br TURNING INTO REALITY Diego Augusto diegoa@cpqd.com.br
