
Multi-talker Speech Separation and Tracing at AI NEXT Conference

AI NEXT Conference 2017 Seattle by Dong Yu
Video: https://www.youtube.com/channel/UCj09XsAWj-RF9kY4UvBJh_A



  1. 1. Dong Yu Distinguished Scientist and Vice General Manager, Tencent AI Lab (work was done while at Microsoft Research) Joint work with Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen Multi-talker Speech Separation and Tracing with Permutation Invariant Training
  2. 2. Outline • Motivation • Problem Setup and Prior Arts • Multi-talker Speech Separation • Experiments • Conclusion 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 2
  3. 3. Outline • Motivation • Problem Setup and Prior Arts • Multi-talker Speech Separation • Experiments • Conclusion 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 3
  4. 4. Frontier Shift • Driven by demand from users to interact with devices without wearing or carrying a close-talk microphone. • Many difficulties hidden by close-talk microphones now surface: • The energy of the speech signal is very low by the time it reaches the microphones. • Interfering signals, such as background noise, reverberation, and speech from other talkers, become so prominent that they can no longer be ignored. 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 4 [Images: close-talk microphone vs. far-field microphone]
  5. 5. ASR in Real World Scenarios • [Diagram: the source signal reaches the microphone with channel distortion, reverberation from surface reflections, and additive noise from other sound sources] 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 5
  6. 6. Cocktail Party Problem • Term coined by Cherry • “One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it ‘the cocktail party problem’…” (Cherry’57) • Human performance is superior to that of machines • “For ‘cocktail party’-like situations… when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers” (Bronkhorst & Plomp’92) • Speech separation problem • Separate and trace audio streams • Sometimes called speech enhancement when dealing with non-speech interference 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 6
  7. 7. Is Speech Separation Work Needed? • Is an end-to-end ASR system sufficient? • Current ASR techniques require a huge amount of training data covering various conditions to train well • Speech separation can be used as an advanced front-end • The speech separation criterion can be used as regularization to aid and speed up training of ASR systems • More applications than ASR • Hearing aids • Cochlear implants • Noise reduction for mobile communication • Audio information retrieval • Is using a microphone array sufficient? • A mic array alone is not sufficient, e.g., when talkers are in the same direction • Many recordings are still collected with a single microphone 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 7
  8. 8. Outline • Motivation • Problem Setup and Prior Arts • Multi-talker Speech Separation • Experiments • Conclusion 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 8
  9. 9. Problem Definition • Source speech streams • Mixed speech • STFT domain • Estimate Mask • Reconstruct with Mask • Ill-posed problem (# constraints < # free params): there are an infinite number of possible X_s(t, f) combinations that lead to the same Y(t, f) • Solution: learn from the training set to look for hidden regularities (complicated soft constraints) 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 9
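As a minimal sketch of the mask-then-reconstruct pipeline this slide describes (in Python, not part of the original deck): the mixture is transformed with an STFT, each estimated mask M_s(t, f) is applied to the mixture magnitude, and the source is resynthesized using the mixture phase. The function name, frame parameters, and the use of scipy are illustrative assumptions; in practice the masks come from a trained network.

```python
import numpy as np
from scipy.signal import stft, istft

def separate_with_masks(mixture, masks, fs=8000, nperseg=512):
    """Apply per-speaker time-frequency masks to a mixed waveform.

    mixture: 1-D array, the mixed speech y(n)
    masks:   list of S arrays of shape (freq, time), the estimated M_s(t, f)
    Returns one reconstructed waveform per speaker.
    """
    # STFT of the mixture: Y(t, f)
    _, _, Y = stft(mixture, fs=fs, nperseg=nperseg)
    sources = []
    for M in masks:
        # Estimated source magnitude: M_s(t, f) * |Y(t, f)|,
        # reconstructed with the mixture phase.
        X_hat = M * np.abs(Y) * np.exp(1j * np.angle(Y))
        _, x_hat = istft(X_hat, fs=fs, nperseg=nperseg)
        sources.append(x_hat)
    return sources
```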
  10. 10. Prior Arts Before Deep Learning Era • Computational auditory scene analysis (CASA) • Use perceptual grouping cues to estimate time-frequency masks • Non-negative matrix factorization (NMF) • Learn a set of non-negative bases during training • Estimate mixing factors during evaluation • Model based approach such as factorial GMM-HMM • Models the interaction between the target and competing speech signals and their temporal dynamics • Spatial filtering with a microphone array • Beamforming: Extract target sound from a specific spatial direction • Independent component analysis: Find a demixing matrix from multiple mixtures of sound sources 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 10
  11. 11. Training Criteria for Deep Learning • Ideal amplitude mask (IAM): M_s(t, f) = |X_s(t, f)| / |Y(t, f)| • Minimize mask estimation error (two problems) • In silence segments X_s(t, f) = 0 and Y(t, f) = 0 → M_s(t, f) is not well defined • Smaller error on masks may not lead to a smaller error on magnitude (which is what we care about) • Minimize magnitude estimation error (used in this study) • Magnitude still estimated through masks: often leads to better performance, esp. when the training set is small 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 11
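A small numpy sketch contrasting the two training criteria on this slide. The function names and array shapes are assumptions; the talk's actual loss is computed inside the training toolkit rather than in numpy.

```python
import numpy as np

def mask_estimation_error(est_masks, ideal_masks):
    """Mask estimation error: compares masks directly. It is ill-defined
    in silent bins (|X_s| = |Y| = 0), and a smaller mask error does not
    guarantee a smaller error on the magnitudes we actually care about."""
    return sum(np.mean((m_hat - m) ** 2)
               for m_hat, m in zip(est_masks, ideal_masks))

def magnitude_estimation_error(est_masks, mix_mag, ref_mags):
    """Magnitude estimation error (the criterion used in the talk): the
    magnitude is still obtained through the mask, X_hat_s = M_hat_s * |Y|,
    but the loss is measured against the clean magnitude |X_s|."""
    return sum(np.mean((m_hat * mix_mag - x_ref) ** 2)
               for m_hat, x_ref in zip(est_masks, ref_mags))
```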
  12. 12. Prior Arts with DL: Speech + Others (many works, OSU, MERL, CUST, etc.) • Basic Architecture: a mix of different types of signals goes in; the network outputs estimated speech and estimated noise/music/other speakers [architecture diagram] 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 12
  13. 13. Prior Arts with DL: Focus on Speech (many works, OSU, MERL, CUST, etc.) • Basic Architecture: a mix of different types of signals goes in (speech + noise, speech + music, a specific speaker + other speakers); the network outputs estimated speech and estimated noise/music/other speakers [architecture diagram] 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 13
  14. 14. Outline • Motivation • Problem Setup and Prior Arts • Multi-talker Speech Separation • Experiments • Conclusion 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 14
  15. 15. Multi-Talker Speech Separation • Label Ambiguity / Label Permutation Problem 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 15 Speaker 1 à output 1 ? Speaker 1 à output 2 ?
  16. 16. Solution 1: Deep Clustering (Hershey, Chen, Roux, Watanabe, 2016) • Learn a unit-norm embedding for each time-frequency bin • If two bins belong to the same speaker they are close in the embedding space, and farther away otherwise • Trained on a large window of frames • Separation is done by clustering the embedding-space representations (i.e., segmenting the bins) • Shortcomings • The pipeline is complicated • Each bin is assumed to belong to one and only one speaker → limits its ability to combine with other techniques 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 16
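For reference, a rough numpy sketch of the deep clustering objective from the cited paper (not part of this deck's own method): the affinity between unit-norm embeddings V should match the affinity of the ideal speaker assignments B, i.e. minimize ||V Vᵀ − B Bᵀ||_F². The low-rank expansion below is an assumption about how one would compute it efficiently.

```python
import numpy as np

def deep_clustering_loss(V, B):
    """Deep clustering objective (Hershey et al., 2016):
    || V V^T - B B^T ||_F^2, expanded so the (bins x bins) affinity
    matrices never have to be formed explicitly.

    V: (n_bins, D) unit-norm embeddings, one per time-frequency bin
    B: (n_bins, S) one-hot assignment of each bin to a speaker
    """
    return (np.linalg.norm(V.T @ V, 'fro') ** 2
            - 2 * np.linalg.norm(V.T @ B, 'fro') ** 2
            + np.linalg.norm(B.T @ B, 'fro') ** 2)
```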
  17. 17. Solution 2: Use Manually Defined Rules (Weng, Yu, Seltzer, Droppo, 14,15) • Use instantaneous energy instead of speaker ID to assign labels: manually designed limited cues 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 17 Low-energy speech High-energy speech
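A hypothetical sketch of the energy-based label assignment idea: at each frame the louder reference source becomes the target of the "high-energy" output and the quieter one the target of the "low-energy" output. Whether the cited work assigns energy per frame or per utterance, and on which features, is not stated on this slide, so treat the details as illustrative.

```python
import numpy as np

def energy_based_targets(x1, x2):
    """Assign training targets by instantaneous energy rather than
    speaker identity.

    x1, x2: (time, freq) magnitude spectrograms of the two sources.
    Returns the frame-wise louder and quieter target streams.
    """
    louder = ((x1 ** 2).sum(axis=1, keepdims=True)
              >= (x2 ** 2).sum(axis=1, keepdims=True))
    high = np.where(louder, x1, x2)   # frame-wise louder source
    low = np.where(louder, x2, x1)    # frame-wise quieter source
    return high, low
```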
  18. 18. Our Solution: Permutation Invariant Training (Yu, Kolbæk, Tan, Jensen, 16, 17) 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 18 • Simple to implement • Can be easily extended to 3 speakers • Error of assignment 1: ||X1 − X̂1||² + ||X2 − X̂2||² • Error of assignment 2: ||X2 − X̂1||² + ||X1 − X̂2||² • Train on the assignment with the smaller error
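A minimal numpy sketch of the permutation invariant loss illustrated above, written for a general number of speakers S: compute the pairwise error between every output stream and every reference source, evaluate every output-to-speaker assignment, and train on the cheapest one. Function and variable names are assumptions; in the talk the loss is defined on magnitude spectra inside the training graph.

```python
import itertools
import numpy as np

def pit_loss(est_mags, ref_mags):
    """Permutation invariant training loss for S separated streams.

    est_mags, ref_mags: lists of S magnitude spectrograms (time x freq).
    Returns the smallest total squared error over all output-to-speaker
    assignments, together with the chosen permutation.
    """
    S = len(ref_mags)
    # Pairwise error of estimated stream i against reference speaker j.
    pair_err = np.array([[np.mean((est_mags[i] - ref_mags[j]) ** 2)
                          for j in range(S)] for i in range(S)])
    best_perm, best_loss = None, np.inf
    for perm in itertools.permutations(range(S)):
        loss = sum(pair_err[i, perm[i]] for i in range(S))
        if loss < best_loss:
            best_perm, best_loss = perm, loss
    return best_loss, best_perm
```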
  19. 19. Testing 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 19 • Default assignment: concatenate output s's frames to form stream s • Optimal assignment: the output of each frame is correctly assigned to its speaker; concatenate the frames belonging to speaker s to form stream s • The gap between them indicates the gain available from additional speaker tracing
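A rough sketch of how the gap between the default and the optimal (oracle) assignment could be measured on a frame sequence. The talk reports this gap in SDR rather than the MSE used here, so the numbers are only indicative; shapes and names are assumptions.

```python
import itertools
import numpy as np

def assignment_gap(est_frames, ref_frames):
    """Compare default and optimal output-to-speaker assignments.

    est_frames, ref_frames: arrays of shape (T, S, F) holding per-frame
    magnitudes for the S output streams and the S reference speakers.
    Default: one pairing is kept for the whole utterance.
    Optimal: an oracle re-pairs the outputs frame by frame (perfect tracing).
    """
    T, S, _ = ref_frames.shape
    perms = list(itertools.permutations(range(S)))
    # Error of every permutation at every frame.
    frame_err = np.array([[np.mean((est_frames[t, list(p)] - ref_frames[t]) ** 2)
                           for p in perms] for t in range(T)])
    default_err = frame_err.mean(axis=0).min()   # best single pairing
    optimal_err = frame_err.min(axis=1).mean()   # best pairing per frame
    return default_err, optimal_err
```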
  20. 20. Outline • Motivation • Problem Setup and Prior Arts • Multi-talker Speech Separation • Experiments • Conclusion 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 20
  21. 21. Experiment Setup: Datasets • WSJ0-2mix and 3-mix • Derived from WSJ0 corpus • 2- and 3-speaker mixtures (artificially generated) • 30h training set, 10h validation set, 5h test set • Mixed at SIRs between 0 dB and 5 dB. • Danish-2mix and 3-mix • Derived from a Danish corpus • 2- or 3-speaker mixtures (artificially generated) • 10k, 1k, 1k+1k utterances in training, validation, and test sets • Mixed at 0dB • WSJ0-2mix-other • Same as WSJ0-2mix but mixed at 0dB 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 21
  22. 22. Models • Implemented using the Microsoft Cognitive Toolkit (CNTK) • Input: 257-dim STFT; Output: 257 x S streams • Segment-based (PIT-S): each segment is independent, no tracing • DNN: 3 hidden layers each with 1024 ReLU units • PIT with tracing (PIT-T): force all frames from the same output layer to belong to the same speaker • LSTM: 3 LSTM layers each with 1792 units • BLSTM: 3 BLSTM layers each with 896 units • Test Conditions • Closed condition (CC): seen speakers • Open condition (OC): unseen speakers 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 22
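The models in the talk were built in CNTK; as a hypothetical sketch of the same BLSTM configuration (3 bidirectional layers of 896 units, 257-bin input, S masked output streams), here is what it might look like in PyTorch. Class and parameter names are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class PITSeparator(nn.Module):
    """BLSTM mask estimator matching the sizes on the slide:
    257-dim STFT magnitude in, S masks of 257 bins out."""
    def __init__(self, n_bins=257, n_speakers=2, hidden=896, layers=3):
        super().__init__()
        self.n_bins, self.n_speakers = n_bins, n_speakers
        self.blstm = nn.LSTM(input_size=n_bins, hidden_size=hidden,
                             num_layers=layers, bidirectional=True,
                             batch_first=True)
        self.mask = nn.Linear(2 * hidden, n_bins * n_speakers)

    def forward(self, mix_mag):              # (batch, time, 257)
        h, _ = self.blstm(mix_mag)
        m = torch.sigmoid(self.mask(h))      # masks in [0, 1]
        m = m.view(mix_mag.size(0), mix_mag.size(1),
                   self.n_speakers, self.n_bins)
        # Estimated magnitudes: each mask applied to the mixture magnitude.
        return m * mix_mag.unsqueeze(2)      # (batch, time, S, 257)
```

The estimated magnitudes from forward() would then be scored with a permutation invariant loss such as the one sketched after slide 18.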
  23. 23. PIT-S Training Behavior: WSJ0-2mix 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 23
  24. 24. PIT-S: SDR Gain (dB) on WSJ0-2MIX 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 24
  25. 25. PIT-T Training Behavior: WSJ0-2mix 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 25
  26. 26. PIT-T: SDR Gain (dB) on WSJ0-2MIX 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 26
  27. 27. SDR (dB) and PESQ Gain Comparison 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 27
  28. 28. Cross Language Behavior on 2-talker Mix 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 28
  29. 29. PIT-T on WSJ0-3mix 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 29
  30. 30. PIT-T Trained with Both 2- and 3-mix 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 30
  31. 31. Examples: 2-talker Mix •Male+Female: •Mix: •S1: •S2: 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 31 •Female+Male: •Mix: •S1: •S2: •Female+Female: •Mix: •S1: •S2: •Male+Male: •Mix: •S1: •S2:
  32. 32. Examples: 3-talker Mix •Male+2Female: •Mix: •S1: •S2: •S3: 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 32 •Female+2Male: •Mix: •S1: •S2: •S3:
  33. 33. Example: Trained on 3-Mix Test on 2-Mix •Diff Gender: •Mix: •S1: •S2: •S3: 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 33 •Same Gender: •Mix: •S1: •S2: •S3:
  34. 34. Example: Trained on 2 and 3-Mix, test on 2-Mix 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 34 •Diff Gender: •Mix: •S1: •S2: •S3: •Same Gender: •Mix: •S1: •S2: •S3:
  35. 35. Outline • Motivation • Problem Setup and Prior Arts • Multi-talker Speech Separation • Experiments • Conclusion 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 35
  36. 36. Conclusion • PIT can solve the label permutation problem • PIT is effective in speech separation without knowing the number of speakers • PIT-trained models generalize well to unseen speakers and languages • PIT is simple to implement • PIT has great potential since it can be easily integrated and combined with other techniques 3/27/17 Dong Yu : Multi-talker Speech Separation and Tracing with Permutation Invariant Training 36 • Classification view (supervised approach) • Segmentation view (deep clustering) • Separation view (PIT) • PIT is an important ingredient in the final solution to the cocktail party problem
