Towards Machine Comprehension of Spoken Content

李宏毅 Hung-yi Lee


  1. TAIPEI | SEP. 21-22, 2016 | 李宏毅 Hung-yi Lee | TOWARDS MACHINE COMPREHENSION OF SPOKEN CONTENT
  2. MULTIMEDIA INTERNET CONTENT • 300 hours of multimedia are uploaded per minute (2015.01); 1,874 courses on Coursera (2016.04). • In this multimedia, the spoken part carries very important information about the content, but nobody is able to go through all the data. • We need machines to listen to the audio data, understand it, and extract useful information for humans. • This talk overviews the technology developed at the NTU Speech Lab.
  3. OVERVIEW (diagram): Spoken Content → Text via Speech Recognition, built on Deep Learning.
  4. Deep Learning for Speech Recognition • Acoustic Model (聲學模型): DNN + HMM (widely used), CTC, sequence-to-sequence learning, DNN + structured SVM [Meng & Lee, ICASSP 10], DNN + structured DNN [Liao & Lee, ASRU 15]. (Figure: (a) DNN phone posteriors used as acoustic vectors; (b) structured SVM over Ψ(x, y); (c) structured DNN mapping an acoustic vector sequence x to a phoneme label sequence y. A sketch of (a) follows.)
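To make option (a) concrete, here is a minimal sketch of a frame-level DNN producing phone/state log posteriors for an HMM decoder. The feature dimension (11 spliced 39-dim MFCC frames) and the state count are illustrative assumptions, not the lab's actual configuration.

```python
import torch
import torch.nn as nn

class DNNAcousticModel(nn.Module):
    """Frame-level DNN: acoustic feature frames in, state posteriors out."""
    def __init__(self, feat_dim=39 * 11, num_states=2000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, num_states),
        )

    def forward(self, frames):                    # (T, feat_dim)
        return self.net(frames).log_softmax(-1)   # per-frame log posteriors

model = DNNAcousticModel()
frames = torch.randn(100, 39 * 11)                # 100 spliced MFCC frames
log_post = model(frames)                          # handed to the HMM decoder
```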
  5. Deep Learning for Speech Recognition • Language Model (語言模型): RNN and LSTM (see http://colah.github.io/posts/2015-08-Understanding-LSTMs/), Neural Turing Machine, attention-based models (sketch below). [Ko & Lee, submitted to ICASSP 17] [Liu & Lee, submitted to ICASSP 17]
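As a companion sketch, a minimal LSTM language model: given a word history, score every candidate next word. The vocabulary size and layer widths are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Predict a distribution over the next word at every position."""
    def __init__(self, vocab_size=10000, emb_dim=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, word_ids):                  # (batch, seq_len)
        h, _ = self.lstm(self.embed(word_ids))
        return self.out(h)                        # next-word logits

lm = LSTMLanguageModel()
logits = lm(torch.randint(0, 10000, (1, 12)))     # score a 12-word history
```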
  6. OVERVIEW (diagram): Spoken Content → Text (Speech Recognition); first application block: Summarization.
  7. SPEECH SUMMARIZATION • Extractive summaries: select the most informative segments of a retrieved audio file to form a compact version, e.g., a 1-hour recording condensed to 10 minutes (sketch below). Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015/Structured%20Lecture/Summarization%20Hidden_2.ecm.mp4/index.html
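A minimal sketch of the extractive idea, assuming segments already carry informativeness scores (the scoring model itself is the interesting part and is not shown): greedily keep the highest-scoring segments until the time budget is filled.

```python
def extractive_summary(segments, budget_sec=600):
    """segments: list of (score, duration_sec, text); keep best within budget."""
    chosen, used = [], 0.0
    for score, dur, text in sorted(segments, reverse=True):
        if used + dur <= budget_sec:
            chosen.append(text)
            used += dur
    return chosen

# Toy usage: four scored segments of a one-hour talk.
print(extractive_summary([(0.9, 120, "seg A"), (0.4, 300, "seg B"),
                          (0.7, 240, "seg C"), (0.8, 180, "seg D")]))
```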
  8. SPEECH SUMMARIZATION • Abstractive summaries: a sequence-to-sequence model reads the input document (a long word sequence x1 … xN) and generates the summary (a short word sequence y1 y2 y3 y4). The machine first comprehends the document, then writes the summary in its own words (sketch below). [Yu & Lee, SLT 16]
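A minimal encoder-decoder sketch of the abstractive setup, with teacher forcing at training time; the dimensions and vocabulary size are assumptions, and the published model may differ.

```python
import torch
import torch.nn as nn

class Seq2SeqSummarizer(nn.Module):
    """Encoder reads the long document; decoder emits the short summary."""
    def __init__(self, vocab=20000, emb=256, hid=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, doc_ids, summary_ids):
        _, state = self.encoder(self.embed(doc_ids))    # read the document
        h, _ = self.decoder(self.embed(summary_ids), state)
        return self.out(h)                              # next-word logits

model = Seq2SeqSummarizer()
doc = torch.randint(0, 20000, (1, 300))    # long input document
summ = torch.randint(0, 20000, (1, 30))    # short target summary
logits = model(doc, summ)                  # teacher-forced training pass
```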
  9. OVERVIEW (diagram): Spoken Content → Text (Speech Recognition); blocks so far: Key Term Extraction, Summarization.
  10. Key Term Extraction [Shen & Lee, Interspeech 16] • An attention-based model: the words x1 … xT of a document are embedded as V1 … VT; attention weights α1 … αT are computed over them, and the weighted sum Σ αi Vi is passed through a hidden layer to the output layer, which predicts the key terms (e.g., DNN, LSTM). The machine first skims the whole document, extracts the main points, then goes back and highlights them (sketch below).
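A minimal sketch of this attention recipe, treating key-term extraction as multi-label classification over a fixed term inventory (the inventory size and dimensions are assumptions):

```python
import torch
import torch.nn as nn

class KeyTermExtractor(nn.Module):
    """Attention over word embeddings, weighted sum, key-term scores."""
    def __init__(self, vocab=20000, emb=256, num_terms=500):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.att = nn.Linear(emb, 1)
        self.out = nn.Linear(emb, num_terms)

    def forward(self, word_ids):                 # (batch, T)
        V = self.embed(word_ids)                 # (batch, T, emb)
        alpha = self.att(V).softmax(dim=1)       # attention weights α1..αT
        doc_vec = (alpha * V).sum(dim=1)         # Σ αi Vi
        return self.out(doc_vec)                 # per-term scores

scores = KeyTermExtractor()(torch.randint(0, 20000, (1, 120)))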
  11. OVERVIEW (diagram): Spoken Content → Text (Speech Recognition); blocks so far: Spoken Content Retrieval, Key Term Extraction, Summarization.
  12. Spoken Content Retrieval • Transcribe the spoken content into text by speech recognition. • Use text retrieval approaches to search the transcriptions. (Figure: the learner's query enters a black box that runs speech recognition models over the spoken content, applies text retrieval, and returns the retrieval result; sketch below.)
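A deliberately simple sketch of the cascade: a toy inverted index over already-produced transcripts. Real systems must also handle recognition errors and ranking; the documents and query here are invented placeholders.

```python
# Build a toy inverted index over speech-recognition transcripts.
def build_index(transcripts):
    index = {}
    for doc_id, text in transcripts.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index

transcripts = {"lec1": "deep learning for speech recognition",
               "lec2": "text retrieval of spoken content"}
index = build_index(transcripts)
hits = index.get("retrieval", set())   # one-term query -> {"lec2"}
```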
  13. Overview Paper • Lin-shan Lee, James Glass, Hung-yi Lee, Chun-an Chan, "Spoken Content Retrieval - Beyond Cascading Speech Recognition with Text Retrieval," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 9, pp. 1389-1420, Sept. 2015. http://speech.ee.ntu.edu.tw/~tlkagk/paper/Overview.pdf • 3-hour tutorial at INTERSPEECH 2016; slides: http://speech.ee.ntu.edu.tw/~tlkagk/slide/spoken_content_retrieval_IS16.pdf
  14. Audio is difficult to browse • Retrieval results for spoken content are usually noisy. • When the system returns the retrieval results, the user doesn't know what he/she got at first glance. (Figure: a list of audio retrieval results.)
  15. OVERVIEW (diagram): blocks so far plus a new Interaction block between the user and retrieval. (Example: the user queries "Deep Learning"; the system asks whether it is related to Machine Learning or to Education.)
  16. Challenges • Given the information entered by the user, which action should be taken: "Is it relevant to XXX?", "Give me an example.", "More precisely, please.", or "Show the results."? The retrieval system learns to take the most effective actions from historical interaction experience.
  17. Deep Reinforcement Learning
  18. Deep Reinforcement Learning • The actions are determined by a neural network. Input: information that helps make the decision. Output: a score for each action (Action A, Action B, …, Action Z); the action with the highest score is taken. The network parameters can be optimized from historical interactions (sketch below).
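A minimal sketch of the action-scoring network; the 32-dim state features and the four dialogue actions are assumptions standing in for whatever the real system uses.

```python
import torch
import torch.nn as nn

# Candidate system actions, mirroring the slide's examples.
ACTIONS = ["ask_relevance", "ask_example", "ask_refinement", "show_results"]

# A small DNN scores every action given the current session state.
policy = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),     # 32-dim retrieval-state features
    nn.Linear(64, len(ACTIONS)),      # one score per action
)

state = torch.randn(32)               # features describing the session
action = ACTIONS[policy(state).argmax().item()]   # take the max-score action
```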
  19. Deep Reinforcement Learning • Comparing different network depths: the task cannot be addressed by a linear model; some depth is needed. With more interaction, deeper networks give better retrieval performance with less user labor. (Figure: retrieval performance versus amount of interaction for different depths.)
  20. OVERVIEW (diagram): blocks so far: Spoken Content Retrieval, Key Term Extraction, Summarization, Interaction; new: Organization.
  21. Today's Retrieval Techniques (Figure: a query returns 752 matches.)
  22. More is less … • Given all the related lectures from different courses, which lecture should the learner go to first? • Learning Map: nodes are lectures on the same topics; edges give the suggested learning order (sketch below). [Shen & Lee, Interspeech 15]
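A minimal sketch of a learning map as a directed graph, with a toy walk producing a suggested study order; the topics here are invented placeholders, not the system's actual nodes.

```python
# Nodes: lecture topics; directed edges: suggested learning order.
learning_map = {
    "linear algebra": ["machine learning basics"],
    "machine learning basics": ["deep learning"],
    "deep learning": [],
}

def suggested_order(start, graph):
    """Depth-first walk yielding one suggested study sequence."""
    order, stack, seen = [], [start], set()
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            order.append(node)
            stack.extend(graph[node])
    return order

print(suggested_order("linear algebra", learning_map))
```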
  23. Demo
  24. OVERVIEW (diagram): blocks so far: Spoken Content Retrieval, Key Term Extraction, Summarization, Interaction, Organization; new: Question Answering.
  25. Spoken Question Answering • The machine answers questions based on the information in spoken content. (Example: Q: "What is a possible origin of Venus' clouds?" A: "Gases released as a result of volcanic activity.")
  26. Spoken Question Answering • TOEFL Listening Comprehension Test by Machine. Question: "What is a possible origin of Venus' clouds?" Audio story: (the original story is 5 min long). Choices: (A) gases released as a result of volcanic activity; (B) chemical reactions caused by high surface temperatures; (C) bursts of radio energy from the planet's surface; (D) strong winds that blow dust into the atmosphere. [Tseng & Lee, Interspeech 16]
  27. Results (Figure: accuracy (%) of naive approaches (1)-(7); Memory Network, proposed by the FB AI group: 39.2%.)
  28. Model Architecture (figure only)
  29. Model Architecture • Question: "what is a possible origin of Venus' clouds?" → semantic analysis → question semantics. • Audio story → speech recognition → semantic analysis, e.g.: "…… It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano ……" • Attention over the story yields the answer representation; the choice most similar to the answer is selected. The model is learned end-to-end (sketch of the final step below).
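A minimal sketch of just the final answer-selection step; the 128-dim random vectors stand in for the learned question/story and choice representations, which the real model produces end-to-end.

```python
import torch
import torch.nn.functional as F

def pick_choice(answer_vec, choice_vecs):
    """Return the index of the choice most similar to the answer vector."""
    sims = [F.cosine_similarity(answer_vec, c, dim=0) for c in choice_vecs]
    return int(torch.stack(sims).argmax())

answer_vec = torch.randn(128)                # from question/story attention
choices = [torch.randn(128) for _ in "ABCD"]
print("ABCD"[pick_choice(answer_vec, choices)])
```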
  30. Results (Figure: accuracy (%) of naive approaches (1)-(7); Memory Network, proposed by the FB AI group: 39.2%; proposed approach: 48.8%.) [Fang & Hsu & Lee, SLT 16] [Tseng & Lee, Interspeech 16]
  31. OVERVIEW (diagram): all blocks so far (Spoken Content Retrieval, Key Term Extraction, Summarization, Question Answering, Interaction, Organization). Is speech recognition essential?
  32. CHALLENGES IN SPEECH RECOGNITION • There are lots of audio files in different languages on the Internet. • Most languages have little annotated data for training speech recognition systems. • Some audio files mix several different languages. • Some languages do not even have a written form. • The out-of-vocabulary (OOV) problem.
  33. OVERVIEW (diagram): same blocks. Is speech recognition essential? Is it possible to directly understand spoken content?
  34. Preliminary Study: Learning from Audio Books • The machine listens to lots of audio books with no prior knowledge, like an infant. [Chung, Interspeech 16]
  35. Preliminary Study: Audio Word to Vector • An audio segment corresponding to an unknown word is mapped to a fixed-length vector.
  36. Preliminary Study: Audio Word to Vector • Audio segments corresponding to words with similar pronunciations lie close to each other in the vector space, learned unsupervised. (Figure: clusters for "ever", "never", "dog", "dogs".)
  37. Sequence-to-sequence Auto-encoder • An RNN encoder reads the acoustic features x1 x2 x3 x4 of an audio segment; the values in its memory then represent the whole segment, giving the audio segment vector we want. How do we train the RNN encoder?
  38. Sequence-to-sequence Auto-encoder • An RNN decoder reconstructs the input acoustic features x1 x2 x3 x4 as outputs y1 y2 y3 y4; the RNN encoder and decoder are jointly trained to minimize reconstruction error (sketch below).
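A minimal sketch of the sequence-to-sequence autoencoder under stated assumptions (39-dim acoustic features, a zero-fed decoder; the paper's exact decoding scheme may differ):

```python
import torch
import torch.nn as nn

class AudioSeq2SeqAE(nn.Module):
    """Encoder compresses a feature sequence to one vector; decoder rebuilds it."""
    def __init__(self, feat_dim=39, hid=128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hid, batch_first=True)
        self.decoder = nn.GRU(feat_dim, hid, batch_first=True)
        self.out = nn.Linear(hid, feat_dim)

    def forward(self, x):                   # x: (batch, T, feat_dim)
        _, z = self.encoder(x)              # z: the audio segment vector
        y, _ = self.decoder(torch.zeros_like(x), z)
        return self.out(y), z               # reconstruction + embedding

model = AudioSeq2SeqAE()
x = torch.randn(8, 50, 39)                  # 8 segments of acoustic frames
recon, vec = model(x)
loss = nn.functional.mse_loss(recon, x)     # joint reconstruction objective
```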
  39. Experimental Results (Figure: for word pairs such as "ever"/"never", the cosine similarity between their RNN-encoder vectors is plotted against the edit distance between their phoneme sequences.)
  40. Experimental Results • The more similar the pronunciation, the higher the cosine similarity (helper sketch below).
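For reference, the similarity measure used in this comparison; the random vectors merely stand in for two learned segment embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v_never, v_ever = np.random.rand(128), np.random.rand(128)
print(cosine_similarity(v_never, v_ever))   # higher for similar-sounding words
```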
  41. Observation • Visualizing the embedding vectors of words: audio segment vectors projected onto 2-D. (Figure: the vectors for "fear", "near", "name", "fame".)
  42. Next Step … • Including semantics? (Figure: a desired space where "flower"/"tree", "dog"/"cat"/"cats", and "walk"/"walked"/"run" group by meaning.)
  43. CONCLUDING REMARKS
  44. CONCLUDING REMARKS • Spoken Content → Text (Speech Recognition), with Spoken Content Retrieval, Key Term Extraction, Summarization, Question Answering, Interaction, and Organization. With deep learning, machines will understand spoken content and extract useful information for humans.
  45. If you want to "deep-learn deep learning" (如果你想 "深度學習深度學習") • My course: Machine Learning and Having It Deep and Structured: http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html • 6-hour version: http://www.slideshare.net/tw_dsconf/ss-62245351 • "Neural Networks and Deep Learning" by Michael Nielsen: http://neuralnetworksanddeeplearning.com/ • "Deep Learning" by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville: http://www.deeplearningbook.org
  46. TAIPEI | SEP. 21-22, 2016 | THANK YOU
