李宏毅 (Hung-yi Lee) / When Speech Processing Meets Deep Learning

He is currently an assistant professor in the Department of Electrical Engineering at National Taiwan University. His research interest is using machine learning to let machines recognize and understand the content of speech signals. Building on deep learning, he works on forward-looking research in spoken content retrieval, automatic organization of spoken content, and extraction of key information from spoken content. These techniques have many applications, such as human-machine interaction, question answering systems, and intelligent online learning platforms. At NTU he has taught the deep-learning-related course「機器學習及其深層與結構化」(Machine Learning and Having It Deep and Structured).

  1. Deep Learning and its Application on Speech Processing (Hung-yi Lee)
  2. Speech Recognition: How do we do speech recognition with deep learning? [diagram: Spoken Content → Speech Recognition → Recognition Output]
  3. People imagine a DNN that maps audio directly to text such as "大家好 我今天…". This is not true: a DNN can only take fixed-length vectors as input and output, while here the input and output are sequences with different lengths.
  4. Recurrent Neural Network: How about a Recurrent Neural Network (RNN)? [diagram: input sequence x1, x2, x3 → output sequence y1, y2, y3, with shared weights Wi (input), Wh (recurrent), Wo (output)]
  5. Recurrent Neural Network: The input is a vector sequence (one acoustic frame every 0.01s) and the output is a character sequence, e.g., "好 好 好 棒 棒 棒 棒 棒". Trimming the repeated characters gives "好棒". Problem: what if the correct transcription is "好棒棒"?
  6. Recurrent Neural Network: Connectionist Temporal Classification (CTC) [Alex Graves, ICML'06][Alex Graves, ICML'14][Haşim Sak, Interspeech'15][Jie Li, Interspeech'15][Andrew Senior, ASRU'15]. Add an extra symbol "φ" representing "null": "好 φ φ 棒 φ φ φ φ" decodes to "好棒", while "好 φ φ 棒 φ 棒 φ φ" decodes to "好棒棒".
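As a hedged illustration of the decoding rule on this slide (only the collapsing step, not the CTC training objective), the following minimal Python sketch merges consecutive repeats and then drops the blank symbol φ. The function name `ctc_collapse` and the toy frame sequences are illustrative assumptions.

```python
def ctc_collapse(frames, blank="φ"):
    """Collapse a per-frame CTC output: merge repeats, then remove blanks."""
    out = []
    prev = None
    for sym in frames:
        if sym != prev:          # merge consecutive repeated symbols
            if sym != blank:     # drop the blank symbol
                out.append(sym)
        prev = sym
    return "".join(out)

# Toy examples from the slide:
print(ctc_collapse(["好", "φ", "φ", "棒", "φ", "φ", "φ", "φ"]))  # -> 好棒
print(ctc_collapse(["好", "φ", "φ", "棒", "φ", "棒", "φ", "φ"]))  # -> 好棒棒
```

Because the blank separates the two "棒" outputs, the repeated character survives, which is exactly how CTC distinguishes "好棒" from "好棒棒".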
  7. Sequence-to-sequence Learning: Both input and output are sequences with different lengths (acoustic feature sequence → character sequence, e.g., "機器學習"). The encoder reads the whole input and produces a vector containing all the information about the input utterance.
  8. Sequence-to-sequence Learning: The decoder then generates the character sequence from that vector (機 器 學 習 慣 性 …), but it doesn't know when to stop.
  9. Sequence-to-sequence Learning: Add a symbol "。" (full stop) as an end-of-sequence token, so the decoder can output 機 器 學 習 。 and stop. [Ilya Sutskever, NIPS'14][Dzmitry Bahdanau, arXiv'15]
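A minimal, hedged PyTorch sketch of the encoder-decoder structure on slides 7-9, with untrained random weights: the encoder compresses an acoustic feature sequence into one vector, and the decoder emits characters greedily until it produces the end symbol "。". The toy vocabulary, feature dimension (39), hidden size, and the reuse of "。" as the start token are assumptions for illustration only.

```python
import torch
import torch.nn as nn

vocab = ["。", "機", "器", "學", "習"]   # toy output vocabulary (assumption)
EOS = 0                                  # index of the end symbol "。"

class Seq2Seq(nn.Module):
    def __init__(self, feat_dim=39, hidden=64, vocab_size=len(vocab)):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRUCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, max_len=10):
        # feats: (1, T, feat_dim) acoustic feature sequence
        _, h = self.encoder(feats)              # summary vector of the whole utterance
        h = h.squeeze(0)                        # (1, hidden)
        token = torch.tensor([EOS])             # start token (simplification)
        result = []
        for _ in range(max_len):
            h = self.decoder(self.embed(token), h)
            token = self.out(h).argmax(dim=-1)  # greedy choice of next character
            if token.item() == EOS:             # "。" tells the decoder to stop
                break
            result.append(vocab[token.item()])
        return "".join(result)

model = Seq2Seq()
print(model(torch.randn(1, 50, 39)))            # untrained, so the output is arbitrary
```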
  10. Spoken Content Retrieval [diagram: Spoken Content → Speech Recognition → Recognition Output → Retrieval → Retrieval Result]
  11. People think: transcribe the spoken content into text with speech recognition models, then use a text retrieval approach to search the transcriptions given the query. [diagram: Spoken Content → Speech Recognition → Text → Text Retrieval → Retrieval Result, with the whole pipeline treated as a black box]
  12. People think: Spoken Content Retrieval = Speech Recognition + Text Retrieval
  13. Problem? • Good spoken content retrieval needs a good speech recognition system. • In real applications such high-quality recognition models are not available, e.g., on YouTube: different languages/accents, different recording environments. • Hope for spoken content retrieval: don't completely rely on accurate speech recognition; retrieve spoken content accurately even under poor speech recognition.
  14. Spoken Content Retrieval: Beyond Cascading? Is the cascade of speech recognition and text retrieval the only solution for spoken content retrieval? [diagram: Spoken Content → Speech Recognition → Recognition Output → Retrieval → Retrieval Result]
  15. Beyond Cascading Speech Recognition and Text Retrieval: 5 directions • Modified Speech Recognition for Retrieval Purposes • Exploiting Information not present in ASR outputs • Directly Matching on Acoustic Level without ASR • Semantic Retrieval of Spoken Content • Interactive Retrieval and Efficient Presentation of Retrieved Objects. Overview paper: "Spoken Content Retrieval: Beyond Cascading Speech Recognition with Text Retrieval" http://speech.ee.ntu.edu.tw/~tlkagk/paper/Overview.pdf
  16. Our Point: Spoken Content Retrieval is not just Speech Recognition + Text Retrieval.
  17. Interact with Humans [diagram: the retrieval pipeline extended with an interaction loop between the user and the system; Beyond Cascading?]
  18. Semantic Analysis [diagram: the retrieval pipeline with a Semantic Analysis module added]
  19. Unsupervised Learning: The machine reads lots of text on the Internet, e.g., "蔡英文 520宣誓就職" and "馬英九 520宣誓就職" (Tsai Ing-wen / Ma Ying-jeou sworn in on May 20). From such contexts it learns that 蔡英文 and 馬英九 are very similar things: "You shall know a word by the company it keeps."
  20. Semantic Analysis: Let the machine read lots of documents; each word is represented as a vector (e.g., dog, cat, rabbit cluster together; jump, run cluster together; flower, tree cluster together).
  21. Semantic Analysis: Even the distances between the vectors have some meaning. Source: http://www.slideshare.net/hustwj/cikm-keynotenov2014
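To make the "distances have meaning" idea concrete, here is a hedged numpy sketch with hand-picked 2-D toy vectors (the values and word list are assumptions, not learned embeddings): related words get nearby vectors and therefore high cosine similarity, unrelated words do not.

```python
import numpy as np

# Toy, hand-picked 2-D "embeddings" purely for illustration (assumption).
vec = {
    "dog":    np.array([0.90, 0.10]),
    "cat":    np.array([0.85, 0.15]),
    "flower": np.array([0.10, 0.90]),
    "tree":   np.array([0.15, 0.85]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vec["dog"], vec["cat"]))      # high: similar words are close
print(cosine(vec["dog"], vec["flower"]))   # low: unrelated words are far apart
```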
  22. Key Term Extraction [diagram: the pipeline with a Key Term Extraction module added] [Interspeech 2015] (with 沈昇勳)
  23. Summarization [diagram: the pipeline with a Summarization module added]
  24. Speech Summarization (Extractive Summaries): Select the most informative segments of the retrieved audio file to form a compact version, e.g., summarizing a 1-hour recording into 10 minutes. Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015/Structured%20Lecture/Summarization%20Hidden_2.ecm.mp4/index.html
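The sketch below is a generic, hedged illustration of the "select the most informative segments under a time budget" idea, not the summarization model from the talk: per-segment importance scores are random stand-ins here, whereas a real system would predict them from the audio or transcript.

```python
import random

# Toy segments with random lengths (seconds) and random importance scores (assumption).
segments = [{"id": i, "length": random.uniform(5, 30), "score": random.random()}
            for i in range(200)]

def extractive_summary(segments, budget_seconds=600):   # e.g., a 10-minute summary
    chosen, used = [], 0.0
    for seg in sorted(segments, key=lambda s: s["score"], reverse=True):
        if used + seg["length"] <= budget_seconds:       # greedily keep top-scoring segments
            chosen.append(seg)
            used += seg["length"]
    return sorted(chosen, key=lambda s: s["id"])         # restore original playback order

summary = extractive_summary(segments)
print(len(summary), "segments,", round(sum(s["length"] for s in summary)), "seconds")
```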
  25. Speech Summarization (Abstractive Summaries): Write the summary in your own words. The machine learns to do abstractive summarization from 2,000,000 training examples [slide: side-by-side human-written and machine-generated summaries]. Contributors: 盧柏儒、徐翊祥 (NTU EE), 葉正杰、周儒杰 (NTU CSIE); TA: 余朗祺.
  26. Question Answering [diagram: the pipeline with a Question Answering module; the user asks a question and receives an answer]
  27. Without Speech Recognition? [diagram: the same pipeline, with the question of whether it can be done without speech recognition]
  28. Outline: Very Brief Introduction of Deep Learning; Towards Machine Comprehension of Spoken Content • Overview • Example I: Speech Question Answering • Example II: Interactive Spoken Content Retrieval • Example III: What can the machine learn from audio without any supervision?
  29. Speech Question Answering: The machine answers questions based on the information in spoken content, e.g., "What is a possible origin of Venus' clouds?"
  30. Speech Question Answering: TOEFL Listening Comprehension Test by Machine. Example question: "What is a possible origin of Venus' clouds?" Choices: (A) gases released as a result of volcanic activity (B) chemical reactions caused by high surface temperatures (C) bursts of radio energy from the planet's surface (D) strong winds that blow dust into the atmosphere. (The original audio story is about 5 minutes long.)
  31. Simple Baselines [bar chart: accuracy (%) of seven naive approaches, including random guessing, (2) selecting the shortest choice as the answer, and (4) selecting the choice whose semantics is most similar to the other choices]. Experimental setup: 717 questions for training, 124 for validation, 122 for testing.
  32. Supervised Learning [bar chart: a Memory Network (proposed by the Facebook AI group) reaches 39.2% accuracy, above the naive approaches]. Interspeech 2016 (with 曾柏翔)
  33. Model Architecture: The question ("what is a possible origin of Venus…") goes through semantic analysis to obtain the question semantics. The audio story goes through speech recognition and semantic analysis (recognized, lemmatized transcript shown on the slide: "It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus's thick cloud cover. And also we have observe burst of radio energy from the planet's surface. These burst be similar to what we see when volcano erupt on earth …"). Attention (畫重點, highlighting) over the story produces an answer representation, and the choice most similar to that answer is selected. Similar to a Memory Network.
  34. Model Architecture: Word-based Attention
  35. Model Architecture: Sentence-based Attention
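The following numpy sketch is a hedged, untrained illustration of the attention-then-select idea on slides 33-35 (memory-network style): attend from the question vector over story sentence vectors, form an attended answer vector, and pick the choice whose vector is most similar. All vectors here are random stand-ins for the outputs of speech recognition and semantic analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
question = rng.normal(size=d)          # question semantics (stand-in)
story    = rng.normal(size=(20, d))    # one vector per story sentence (stand-in)
choices  = rng.normal(size=(4, d))     # semantics of choices (A)-(D) (stand-in)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Sentence-based attention: weight story sentences by similarity to the question ("畫重點").
attn = softmax(story @ question)       # (20,) attention weights
answer_vec = attn @ story              # attended summary of the relevant sentences

# Select the choice most similar to the attended answer representation.
scores = [cosine(answer_vec, c) for c in choices]
print("Predicted choice:", "ABCD"[int(np.argmax(scores))])
```

Word-based attention (slide 34) would apply the same idea at the level of individual word vectors instead of sentence vectors.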
  36. [figure: example visualization labeled with answer choices (A) and (B)]
  37. Supervised Learning [bar chart: Memory Network (proposed by the Facebook AI group): 39.2%; Word-based Attention: 48.3%; both above the naive approaches]. Interspeech 2016 (with 曾柏翔)
  38. Outline: Very Brief Introduction of Deep Learning; Towards Machine Comprehension of Spoken Content • Overview • Example I: Speech Question Answering • Example II: Interactive Spoken Content Retrieval • Example III: What can the machine learn from audio without any supervision?
  39. Interact with Users: Interactive retrieval is helpful. E.g., the user queries "深度學習" (deep learning), and the system asks: do you mean "深度學習" related to machine learning, or "深度學習" related to education?
  40. Audio is hard to browse: When the system returns the retrieval results, the user cannot tell what they got at first glance.
  41. Interactive retrieval of spoken content [diagram: user → query → retrieval over spoken content → retrieval results]: Directly showing the retrieval results is probably not a good idea.
  42. Interactive retrieval of spoken content: Instead, the system can take actions such as "Give me an example.", "Is it relevant to XXX?", "Can you give me another query?", or "Show the results." Given the current situation, which action should be taken?
  43. Interactive retrieval of spoken content: Decide the actions by a policy π(s) [Interspeech 2012][ICASSP 2013]. Features extracted from the retrieval results feed state estimation (the state reflects the degree of clarity of the results), and an action decision module applies the policy: π(s) is a function whose input is the state s and whose output is the action a.
  44. Interactive retrieval of spoken content: A DNN performs state estimation and action decision jointly; it takes the features as input, outputs a score for each action ("Is it relevant to XXX?", "Give me an example.", "Show the results.", …), and the action with the maximum score is taken.
  45. Interactive retrieval of spoken content: The DNN is learned from historical interactions. Goal: maximize the return, defined as retrieval quality minus user labor.
  46. Deep Reinforcement Learning
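A hedged PyTorch sketch of the action-decision DNN described on slides 43-45: retrieval-result features go in, one score per dialogue action comes out, and the highest-scoring action is taken. The layer sizes, feature dimension, and action list are assumptions; in the talk the network is trained with deep reinforcement learning to maximize the return (retrieval quality minus user labor), which is not implemented here.

```python
import torch
import torch.nn as nn

ACTIONS = ["Is it relevant to XXX?", "Give me an example.",
           "Can you give me another query?", "Show the results."]

policy = nn.Sequential(            # state estimation + action decision in one DNN
    nn.Linear(40, 128), nn.ReLU(), # 40 retrieval-result features (assumption)
    nn.Linear(128, len(ACTIONS)),  # one score per possible action
)

features = torch.randn(1, 40)      # features from the current retrieval results (stand-in)
scores = policy(features)
action = ACTIONS[scores.argmax(dim=-1).item()]
print("Chosen action:", action)    # untrained here; RL training would shape these scores
```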
  47. Experimental Results: Broadcast news, semantic retrieval. [chart: retrieval quality (MAP), with optimization target = retrieval quality minus user labor, comparing a hand-crafted approach, the previous method (separate state estimation + decision), and the deep learning approach]. Submitted to Interspeech 2016 (with 吳彥諶、林子翔)
  48. Experimental Results
  49. Outline: Very Brief Introduction of Deep Learning; Towards Machine Comprehension of Spoken Content • Overview • Example I: Speech Question Answering • Example II: Interactive Spoken Content Retrieval • Example III: What can the machine learn from audio without any supervision?
  50. Unsupervised Learning: The machine listens to lots of audio books (TA: ). "Audio Word2Vec: Unsupervised Learning of Audio Segment Representations using Sequence-to-sequence Autoencoder" (accepted by Interspeech 2016)
  51. Audio Word to Vector: Consider an audio segment corresponding to an unknown word; a deep learning model turns it into a vector. (TA: 沈家豪)
  52. Audio Word to Vector: The audio segments corresponding to words with similar pronunciations are close to each other in the vector space.
  53. Audio Word to Vector: The audio segments corresponding to words with similar pronunciations are close to each other, e.g., instances of "ever" and "never" cluster near each other, as do "dog" and "dogs".
  54. Sequence Auto-encoder
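A hedged PyTorch sketch of the kind of sequence-to-sequence autoencoder named on this slide: an RNN encoder compresses a variable-length acoustic feature sequence into a single vector (the audio word vector), and an RNN decoder tries to reconstruct the input from that vector, so training needs no labels. The feature dimension, hidden size, GRU choice, and the zero-input decoding trick are assumptions, not the exact model from the paper.

```python
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    """Encoder-decoder RNN: the encoder's final state is the audio word vector."""
    def __init__(self, feat_dim=39, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)

    def forward(self, x):
        _, z = self.encoder(x)            # z: (1, batch, hidden) audio word vectors
        dec_in = torch.zeros_like(x)      # simplified decoding conditioned only on z
        h, _ = self.decoder(dec_in, z)
        return self.out(h), z.squeeze(0)

model = SeqAutoencoder()
segment = torch.randn(8, 60, 39)          # batch of 8 segments, 60 frames of MFCC-like features
recon, vectors = model(segment)
loss = nn.functional.mse_loss(recon, segment)  # unsupervised: reconstruct the input
loss.backward()
print(vectors.shape)                      # torch.Size([8, 128]) audio word vectors
```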
  55. How to evaluate: For a pair of audio segments (e.g., "never" and "ever"), compare the cosine similarity of their learned vectors with the edit distance between their phoneme sequences.
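A hedged sketch of one half of that evaluation recipe: a standard Levenshtein edit distance over phoneme strings, which would then be compared against the cosine similarity of the corresponding audio vectors. The phoneme sequences below are illustrative approximations, not taken from a real lexicon.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance between two phoneme sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution (0 if equal)
    return dp[-1]

# Illustrative phoneme sequences (assumption):
print(edit_distance(["N", "EH", "V", "ER"], ["EH", "V", "ER"]))  # 1: "never" vs "ever"
print(edit_distance(["N", "EH", "V", "ER"], ["D", "AO", "G"]))   # 4: "never" vs "dog"
# Expected trend (next slide): smaller edit distance <-> larger cosine similarity
# between the corresponding audio word vectors.
```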
  56. Experimental Results [figure: the more similar the pronunciation, the larger the cosine similarity]
  57. Interesting Observation: Projecting the embedding vectors to 2-D [figure: the projections of "day", "days", "say", "says", where the day→days shift parallels the say→says shift].
  58. Spoken Content Retrieval without Speech Recognition: The user speaks a query (e.g., "US President") into a handheld device, and the similarity between the spoken query and the audio files is computed at the signal level. [Hazen, ASRU 09] [Zhang & Glass, ASRU 09] [Chan & Lee, Interspeech 10] [Zhang & Glass, ICASSP 11] [Gupta, Interspeech 11] [Zhang & Glass, Interspeech 11] [Huijbregts, ICASSP 11] [Chan & Lee, Interspeech 11]
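One common way to compare a spoken query with audio at the signal level in the query-by-example literature cited above is dynamic time warping (DTW) over frame-level features. The numpy sketch below computes a plain DTW cost between two toy feature sequences; it is a generic illustration under that assumption, not the specific matching method of the talk.

```python
import numpy as np

def dtw_cost(query, doc):
    """DTW alignment cost between two (frames x dims) feature matrices."""
    Q, D = len(query), len(doc)
    dist = np.linalg.norm(query[:, None, :] - doc[None, :, :], axis=-1)  # frame distances
    acc = np.full((Q + 1, D + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Q + 1):
        for j in range(1, D + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],      # skip a query frame
                                                 acc[i, j - 1],      # skip a document frame
                                                 acc[i - 1, j - 1])  # match both frames
    return acc[Q, D]

rng = np.random.default_rng(0)
query = rng.normal(size=(30, 39))   # 30 frames of MFCC-like features (toy stand-in)
doc = rng.normal(size=(200, 39))    # a longer audio file (toy stand-in)
print(dtw_cost(query, doc))         # lower cost = better signal-level match
```

In practice, subsequence or segmental variants of DTW are typically used so that a short query can match anywhere inside a long recording.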
  59. Spoken Content Retrieval without Speech Recognition: Why do it without speech recognition? • There are lots of audio files in many different languages on the Internet. • Most languages have little annotated data for training speech recognition systems. • Some audio files mix several different languages. • Some languages do not even have a written form.
  60. Spoken Content Retrieval without Speech Recognition
  61. Retrieval Performance
  62. Concluding Remarks: Very Brief Introduction of Deep Learning; Towards Machine Comprehension of Spoken Content • Overview • Example I: Speech Question Answering • Example II: Interactive Spoken Content Retrieval • Example III: What can the machine learn from audio without any supervision?
  63. Thank You for Your Attention
