Ry pyconjp2015 karaoke

Talk given at PyCon JP 2015


  1. 1. 1 PyCon JP 2015 Renyuan Lyu 呂仁園 Chun-Han Lai 賴俊翰 Karaoke-style Read-aloud System Chang Gung Univ. Taiwan Oct 10 (Saturday), 2 p.m.–2:30 p.m. in 会議室1/Conference Room 1
  2. 2. CguTextKaraoke a Karaoke-style Read-aloud System Using Speech Alignment and Text-to-Speech Technology Chun-Han Lai (賴俊翰) Renyuan Lyu (呂仁園) Chang Gung University (長庚大學) Taiwan (台灣) 2
  3. 3. Abstract • A procedure to create a Speech-to-Text synchronization file from an original text-only file – can be used to show highlighted text just like a karaoke machine – very useful for language-learning purposes. • TTS (Text-to-Speech) technology on the cloud, such as Google TTS • Speech-recognition technology, such as HTK, for temporal alignment 3
  4. 4. Introduction • Starting from a text-only file, using a cloud-based text-to-speech (TTS) technology such as Google Translate/TTS, together with a speech-recognition technology such as the Hidden Markov Model Toolkit (HTK), we can generate an associated timed-text file that aligns the text with the speech waveform along the temporal axis. • Python is used not only as the glue linking all the different kinds of software resources, such as Google Translate and HTK, but also as a powerful tool for all the text-processing tasks in this project. • From such a timed-text file, we also provide a JavaScript-based web app and a Python GUI program that display the time-aligned, highlighted text like a karaoke machine at the word level, which is considered very useful for language learning. 4
  5. 5. a Karaoke-style Text Read-aloud System https://www.youtube-nocookie.com/embed/9a5KoXNCagM?start=180 • Karaoke (カラオケ) is a form of interactive entertainment in which an amateur singer sings along with recorded music. • Lyrics are usually displayed on a video screen, along with a moving symbol, changing color, or music video images, to guide the singer. • Here is one of my favorites: https://en.wikipedia.org/wiki/Karaoke 5
  6. 6. Speech Shadowing Technique for Language Learning • The motivation of this project » https://en.wikipedia.org/wiki/Speech_shadowing – Speech shadowing • is a language-learning technique in which subjects repeat speech immediately after hearing it. – The technique is used in language learning. – A demonstration can be viewed at the following YouTube link. • “English Speaking Practice: How to improve your English Speaking and Fluency: SHADOWING” • https://www.youtube.com/watch?v=GVWFGIyNswI 6
  7. 7. Text-to-Speech Synthesis 7 Wikipedia is a multilingual, web-based, free-content encyclopedia project supported by the Wikimedia Foundation and based on a model of openly editable content. The name "Wikipedia" is a portmanteau of the words wiki (a technology for creating collaborative websites, from the Hawaiian word wiki, meaning "quick") and encyclopedia. Wikipedia's articles provide links designed to guide the user to related pages with additional information. Given: a piece of text, e.g., the paragraph above. The goal is to obtain its speech.
  8. 8. Google TTS API in a Python module 8 • pip install gTTS
     from gtts import gTTS
     aText= 'Wikipedia is a multilingual, ...'
     aLang= 'en'
     tts= gTTS(text= aText, lang= aLang)
     tts.save("aSpeech.mp3")
     (diagram: aText → aSpeech.mp3)
     https://github.com/pndurette/gTTS
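Since the later wav-splitting slide relies on the TTS audio coming back sentence by sentence, a minimal sketch of per-sentence synthesis with gTTS might look like the following. The regex-based sentence splitter and the SN-numbered output names are illustrative assumptions, not the speakers' exact code.

    import re
    from gtts import gTTS

    aText = ('Wikipedia is a multilingual, web-based, free-content '
             'encyclopedia project supported by the Wikimedia Foundation. '
             'The name "Wikipedia" is a portmanteau of the words wiki and encyclopedia.')

    # naive sentence split on '.', '!' or '?' followed by whitespace (an assumption)
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', aText) if s]

    for i, sentence in enumerate(sentences):
        tts = gTTS(text=sentence, lang='en')
        tts.save('SN{:04d}.mp3'.format(i))    # SN0000.mp3, SN0001.mp3, ...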
  9. 9. FFmpeg • About FFmpeg – [https://en.wikipedia.org/wiki/FFmpeg] – FFmpeg is a free software project that produces libraries and programs for handling multimedia data. – It is one of the leading multimedia frameworks, able to do many DSP tasks, including ... • decode, encode, • transcode, mux, demux, stream, filter and play 9
  10. 10. 10 ffmpeg -i aSpeech.mp3 -y -vn -acodec pcm_s16le -ac 1 -ar 16000 -f wav aSpeech.wav
     (diagram: aSpeech.mp3 → aSpeech.wav – PCM, 16 bits/sample, little endian, 1 (mono) channel, 16000 samples/sec)
     ffplay aSpeech.wav – verifying by seeing and hearing, or using an interactive audio tool, like Audacity.
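The same conversion can be driven from Python via subprocess, using exactly the flags shown on the slide. A minimal sketch, assuming the ffmpeg executable is on PATH (the mp3_to_wav helper name is ours):

    import subprocess

    def mp3_to_wav(mp3_path, wav_path):
        """Convert mp3 to 16 kHz, mono, 16-bit PCM WAV with ffmpeg."""
        cmd = ['ffmpeg', '-i', mp3_path, '-y',
               '-vn',                    # ignore any video stream
               '-acodec', 'pcm_s16le',   # PCM, 16 bits/sample, little endian
               '-ac', '1',               # 1 (mono) channel
               '-ar', '16000',           # 16000 samples/sec
               '-f', 'wav', wav_path]
        subprocess.run(cmd, check=True)

    mp3_to_wav('aSpeech.mp3', 'aSpeech.wav')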
  11. 11. Audacity (audio editor) • Audacity is a powerful, free, open-source digital audio editor – Its features include: • Recording and playing back sounds • Importing and exporting of WAV, MP3, .... • Viewing and editing via cut, copy, and paste, ... 11 (screenshot: aSpeech.mp3 and aSpeech.wav in Audacity)
  12. 12. Text-to-Speech Alignment 12 Wikipedia is a multilingual, web-based, free-content encyclopedia project supported by the Wikimedia Foundation and based on a model of openly editable content. The name "Wikipedia" is a portmanteau of the words wiki (a technology for creating collaborative websites, from the Hawaiian word wiki, meaning "quick") and encyclopedia. Wikipedia's articles provide links designed to guide the user to related pages with additional information. Given: a piece of text and its speech, e.g., the paragraph above and its audio. The goal is to obtain a ‘Timed-Text’:
     0.000 0.080 sil
     0.080 0.870 wikipedia
     0.870 0.990 is
     0.990 1.080 a
     1.080 2.010 multilingual
     2.010 2.140 sil
     2.160 2.240 sil
     2.240 3.020 webbased
     3.020 3.180 sil
     3.204 3.354 sil
     3.354 4.284 freecontent
     4.284 5.374 encyclopedia
     5.374 5.774 project
     5.774 6.454 supported
     6.454 6.754 by
     6.754 6.904 the
     6.904 7.574 wikimedia
     7.574 8.414 foundation
     8.414 8.514 sil
     8.532 8.622 sil
     8.622 8.852 and
     8.852 9.242 based
     9.242 9.382 on
     9.382 9.432 a
     9.432 9.982 model
     9.982 10.032 of
     10.032 10.592 openly
     10.592 11.212 editable
     11.212 11.802 content
     11.802 11.932 sil
     : : :
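A timed-text file in this "start end word" layout is straightforward to consume. A small sketch (the file name and helper names are assumptions based on the listing above) loads the entries and looks up the word that should be highlighted at playback time t:

    import bisect

    def load_timed_text(path):
        """Read 'start end word' lines (times in seconds) into a list of tuples."""
        entries = []
        with open(path, encoding='utf-8') as f:
            for line in f:
                parts = line.split()
                if len(parts) == 3:
                    entries.append((float(parts[0]), float(parts[1]), parts[2]))
        return entries

    def word_at(entries, t):
        """Return the word whose interval contains playback time t, else None."""
        starts = [start for start, _end, _word in entries]
        i = bisect.bisect_right(starts, t) - 1
        if i >= 0 and entries[i][0] <= t < entries[i][1]:
            return entries[i][2]
        return None

    entries = load_timed_text('aSpeech.txt')   # hypothetical timed-text file name
    print(word_at(entries, 1.5))               # -> 'multilingual'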
  13. 13. Wav splitting 13 At the sentence level, this can be done straightforwardly by extracting the timing information from the TTS mp3 files, which are received sentence by sentence (see the sketch below). (figure: waveform with sentence boundaries marked)
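One way to recover those sentence boundaries, sketched here under the assumption that each sentence has already been converted to its own WAV file, is to accumulate the per-file durations with the standard wave module:

    import wave

    def wav_duration(path):
        """Duration of a WAV file in seconds, via the standard wave module."""
        with wave.open(path, 'rb') as w:
            return w.getnframes() / float(w.getframerate())

    def sentence_boundaries(wav_paths):
        """Cumulative end time of each sentence in the concatenated speech."""
        boundaries, t = [], 0.0
        for path in wav_paths:
            t += wav_duration(path)
            boundaries.append(t)
        return boundaries

    print(sentence_boundaries(['SN0000.wav', 'SN0001.wav', 'SN0002.wav']))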
  14. 14. Phonetic Transcription • Speech recognition technology needs to transcribe text into phonetic symbols, in order to build up phone models. 14 “Wikipedia is a multilingual, web-based, free-content encyclopedia project.” “wikipedia ɪz ə məltilɪŋwəl, wɛb- best, fri- kɑntɛnt ənsɑjkləpidiə prɑdʒɛkt.” ”wikipedia Iz @ m@ltilINw@l, wEb- best, fri- kAntEnt @nsAykl@pidi@ prAdZEkt.” Original English Text: (ASCII only, perhaps!) Transcription in IPA: (needs Unicode) Transcription in SAMPA: (ASCII only, including non-alphabet symbols) http://upodn.com/phon.asp
  15. 15. • Post-processing of the phonetic transcription • To map or simply clean away all undesired symbols from the various styles of output – (usually Unicode, or some non-alphabet symbols) • For plain English (en), – the original text is used, approximately, as the phone sequence. – Although this seems too simple, it has worked well so far. • For Traditional Chinese (zh-tw), – Google Translate was used to get phonetic symbols in Pinyin (拼音, pīnyīn), and then plain roman letters (eliminating the tone marks). • For Japanese (ja), – MeCab has recently been used to get the Katakana (片仮名, カタカナ). – Romkan has been used to transform the katakana into romaji (kunrei). • Thanks to Python, which does most of the work in this stage of processing!! 15
  16. 16. • Phonetic transcription for English – Using the regular expression module 16
     enText= '''Wikipedia is a multilingual, web-based, free-content encyclopedia project.'''
     phn= text2phn_en(enText)
     # phn == 'wikipedia_is_a_multilingual_webbased_freecontent_encyclopedia_project'
     import re
     pats= r'\'|"|-|^_|_$|,|\.|\(|\)'
     phn= re.sub(pats, '', phn)
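The slide only shows the clean-up step. A possible text2phn_en, reconstructed from the input/output pair above (the word-joining rule is an assumption, not the speakers' exact code), could be:

    import re

    def text2phn_en(text):
        """Lowercase, join words with '_', then strip the undesired symbols."""
        phn = '_'.join(text.lower().split())
        pats = r'\'|"|-|^_|_$|,|\.|\(|\)'
        return re.sub(pats, '', phn)

    enText = ('Wikipedia is a multilingual, web-based, '
              'free-content encyclopedia project.')
    print(text2phn_en(enText))
    # wikipedia_is_a_multilingual_webbased_freecontent_encyclopedia_project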
  17. 17. • Phonetic transcription for Traditional Chinese – Using the Google Translate/TTS API 17
     tcText= '維基百科是一個自由內容'
     phn= text2phn_tc(tcText)
     # phn == 'weiji_baike_shi_yige_ziyou_neirong'
     GOOGLE_TTS_URL= 'https://translate.google.com.tw/translate_a/single?dt=bd&dt=ex&dt=at&'
     req= urllib.request.Request(GOOGLE_TTS_URL + data)
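The Google Translate request itself is not reproduced here (its response handling is not shown on the slide). The tone-mark elimination mentioned on the previous slide, however, can be sketched with the standard unicodedata module; the strip_tone_marks helper name and the toned-pinyin input string are illustrative assumptions:

    import unicodedata

    def strip_tone_marks(pinyin):
        """Decompose accented pinyin and drop the combining tone marks."""
        decomposed = unicodedata.normalize('NFD', pinyin)
        return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(strip_tone_marks('wéijī bǎikē shì yīgè zìyóu nèiróng'))
    # weiji baike shi yige ziyou neirong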
  18. 18. • Phonetic transcription for Japanese – Using MeCab and Romkan 18
     jpText= '''ウィキペディアは、 信頼されるフリーなオンライン百科事典、'''
     phn= text2phn_jp(jpText)
     # phn == 'wikipedyia_wa_sil_sinrai_sa_reru_furi-_na_onrain_hyakka_ziten'
     import MeCab
     import romkan
     y= MeCab.Tagger().parse(text)
     ...
     kun= romkan.to_kunrei(phn)
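A possible text2phn_jp along these lines is sketched below. It assumes the IPAdic feature layout, where the katakana reading is the 8th comma-separated field of MeCab's output; other dictionaries lay the fields out differently, and the 'sil' insertion shown on the slide is simplified to skipping punctuation.

    import MeCab
    import romkan

    def text2phn_jp(text):
        """Katakana readings from MeCab, converted to kunrei romaji by romkan."""
        readings = []
        for line in MeCab.Tagger().parse(text).splitlines():
            if line == 'EOS' or not line:
                continue
            surface, _, features = line.partition('\t')
            fields = features.split(',')
            if fields and fields[0] == '記号':
                continue                  # punctuation; the real system maps it to 'sil'
            # assume the IPAdic layout: katakana reading at field index 7
            readings.append(fields[7] if len(fields) > 7 else surface)
        return '_'.join(romkan.to_kunrei(r) for r in readings)

    print(text2phn_jp('ウィキペディアは、信頼されるフリーなオンライン百科事典'))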
  19. 19. At the Halfway Point • a bundle of wav/lab files 19
  20. 20. • The Hidden Markov Model Toolkit (HTK) – http://htk.eng.cam.ac.uk/ – Given a speech utterance with its phone sequence, the speech can be well aligned with the phones by the ‘forced alignment’ technique of the HMM approach. – A set of tools, called HTK, provides a convenient way to apply the HMM approach. 20 Speech recognition technology
  21. 21. • The HTK overview 21
  22. 22. HTK processing (abstract) .... 22 • #[00] setting the working dir • #[01] creating the (hmm) model prototype • #[02] label processing • #[03] feature extraction • #[04] model initialization • #[05] model training • #[06] forced alignment • #[07] post file moving operation
  23. 23. HTK processing (detail).... 23
     #[00] setting the working dir
       dirName= ./_wav/
     #[01] creating the (hmm) model prototype
       CreateHProto.... myHmmPro N = 3 M = 6
     #[02] label processing
       000, 0,----> ._htkhled -A -i spLab00.mlf -n spLab00.lst -S spLab.scp hL
       001, 0,----> ._htkhled -A -i spLab.mlf -n spLab.lst -S spLab.scp hLed.l
       002, 0,----> ._htkhled -A -i spLab_p.mlf -n spLab_p.lst -S spLab.scp -I
     #[03] feature extraction
       003, 0,----> ._htkHCopy -A -C hCopy.conf -S spWav2Mfc.scp 1>> 1.htk.out 2>>
     #[04] model initialization
       004, 1,----> mkdir hmms_p
       005, 0,----> ._htkHCompV -A -m -C hInit.conf -S spMfc.scp -I spLab_p.mlf -M
     #[05] model training
       006, 0,----> ._htkHERest -A -C hErest.conf -S spMfc.scp -p 1 -t 2000.0 -w 3
       007, 0,----> ._htkHERest -A -C hErest.conf -p 0 -t 2000.0 -w 3 -v 0.05 -I sp
       : (repeating several times...) :
     #[06] forced alignment
       016, 0,----> ._htkHVite -A -a -C hVite.conf -S spMfc.scp -d hmms_p/ -i s
     #[07] post file moving operation
       017, 1,----> mkdir outDir
       018, 1,----> copy spLab_aligned.mlf outDir./_wav_aligned.mlf
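Since every step is an external HTK command, the Python driver only needs to build argument lists and hand them to subprocess. This sketch reproduces just the feature-extraction and initialization steps with the options visible above; the remaining command lines are truncated on the slide, so they are not reconstructed here, and the plain tool names HCopy/HCompV are assumed to be on PATH (the slide shows site-specific wrappers such as ._htkHCopy).

    import subprocess

    STEPS = [
        # [03] feature extraction: *.wav -> *.mfc as listed in spWav2Mfc.scp
        ['HCopy',  '-A', '-C', 'hCopy.conf', '-S', 'spWav2Mfc.scp'],
        # [04] model initialization: flat-start the prototype into hmms_p/
        ['HCompV', '-A', '-m', '-C', 'hInit.conf', '-S', 'spMfc.scp',
         '-I', 'spLab_p.mlf', '-M', 'hmms_p', 'myHmmPro'],
    ]

    for cmd in STEPS:
        print('---->', ' '.join(cmd))
        subprocess.run(cmd, check=True)   # stop if an HTK tool fails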
  24. 24. 24 (diagram: HLed runs with spLab.scp – spLab.mlf, spLab.lst, hLed.led; spLab00.mlf, spLab00.lst, hLed00.led; spLab_p.mlf, spLab_p.lst, hLed.led, spLab_p.dic)
  25. 25. 25 (diagram: HCopy – hCopy.conf, spWav2Mfc.scp, *.wav → *.mfc)
  26. 26. HCompV 26 (diagram: HCompV – HCompV.conf, spMfc.scp, spLab_p.mlf, myHmmPro, *.mfc → hmms_p/*)
  27. 27. HERest 27 (diagram: HERest – hErest.conf, spMfc.scp, spLab_p.mlf, spLab_p.lst, *.mfc, hmms_p/*, hmms_p/HER1.acc; N iterations, N=5)
  28. 28. HVite 28 (diagram: HVite – hVite.conf, spMfc.scp, spLab_p.lst, spLab.mlf, spLab_p.dic, hmms_p/, *.mfc → spLab_aligned.mlf)
  29. 29. HTK summary 29 HLed HCopy HCompV HERest HVite HTK Tools #!MLF!# "./_wav/SN0.rec" 0 800000 sil -578.044434 800000 8700000 wikipedia -5636.368652 8700000 9900000 is -855.988770 9900000 10800000 a -693.554871 10800000 20100000 multilingual -7268.197266 20100000 21400000 sil -791.746216 . "./_wav/SN1.rec" 0 800000 sil -541.083069 800000 8600000 webbased -5977.622070 8600000 10200000 sil -1048.225220 . "./_wav/SN2.rec" 0 1500000 sil -1100.892822 1500000 10800000 freecontent -7094.197266 10800000 21700000 encyclopedia -8148.633789 21700000 25700000 project -3247.493896 25700000 32500000 supported -5594.979492 32500000 35500000 by -2412.487305 35500000 37000000 the -1176.310547 37000000 43700000 wikimedia -5128.852051 43700000 52100000 foundation -5995.618164 52100000 53100000 sil -695.872864 . . . spLab_aligned.mlf wavDir/
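To turn spLab_aligned.mlf into the timed text shown earlier, the MLF only needs a small parser. A sketch (parse_mlf is our name for it) that converts HTK's 100-nanosecond label times into seconds and drops the log scores:

    def parse_mlf(path):
        """Parse an HTK master label file into {utterance: [(start_s, end_s, word)]}."""
        timings, words = {}, []
        with open(path) as f:
            for raw in f:
                line = raw.strip()
                if line == '#!MLF!#' or not line:
                    continue
                if line.startswith('"'):          # e.g. "./_wav/SN0.rec"
                    words = []
                    timings[line.strip('"')] = words
                elif line == '.':                 # end of this utterance
                    continue
                else:
                    start, end, word = line.split()[:3]   # drop the log score
                    # HTK label times are in units of 100 ns, hence 1e-7
                    words.append((int(start) * 1e-7, int(end) * 1e-7, word))
        return timings

    aligned = parse_mlf('spLab_aligned.mlf')
    for start, end, word in aligned['./_wav/SN0.rec']:
        print('{:.3f} {:.3f} {}'.format(start, end, word))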
  30. 30. The major algorithm in HTK 30 ‘Holiday Shopping’ = ‘h’+’o’+’l’+’i’+’d’+’ay’+’sil’+’sh’+’o’+’p’+’I’+’ng’ (diagram: HMM models for phones ‘h’, ‘o’, ..., ‘ng’) • Forced Alignment in HTK – 1. Given a Speech signal – 2. Doing the Pronunciation transcription • Pronunciation symbols must be all-ASCII only!! – 3. Training to get the HMM models
  31. 31. 31 (diagram: phone HMMs ‘h’, ‘o’, ..., ‘ng’ in the alignment lattice) – 4. Doing the Viterbi Search for the optimal path (alignment):
  32. 32. 32 #!MLF!# "wavDir/SN0001.rec" 0 800000 sil -567.865356 800000 8700000 wikipedia -5670.471680 8700000 10000000 is -951.059692 10000000 10600000 a -489.843994 10600000 20000000 multilingual -7398.754395 20000000 20700000 sil -416.119415 . "wavDir/SN0002.rec" 0 900000 sil -632.964050 900000 8600000 webbased -6000.767578 8600000 9900000 sil -914.236206 . "wavDir/SN0003.rec" 0 2100000 sil -1373.137817 2100000 9000000 freecontent -5306.260742 9000000 18500000 encyclopedia -6654.958984 18500000 25600000 project -5698.730469 25600000 32700000 supported -5713.494141 32700000 33200000 by -429.306763 33200000 34800000 the -1205.477539 34800000 41500000 wikimedia -5115.318359 41500000 50000000 foundation -6074.208496 50000000 52000000 and -1746.236938 52000000 56200000 based -3267.695801 56200000 57000000 on -585.264404 57000000 57700000 a -577.346130 57700000 63200000 model -3769.413574 63200000 63800000 of -524.015503 63800000 65300000 sil -1129.348633 . wavDir.align
  33. 33. 33 Now it’s time to KaraOke !
  34. 34. A Browser in JavaScript and HTML for Text-KaraOke • https://youtu.be/11-ltx0yv_o 34
  35. 35. A Browser in Python using Tkinter for Text-KaraOke 35
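A minimal Tkinter sketch of the word-level highlighting, not the actual browser shown on the slide: it lays the aligned words into a Text widget and moves a 'karaoke' tag along them with after() callbacks. Audio playback is left out, the timer simply simulates playback time, and the hard-coded timings are just the first few words from the alignment above.

    import tkinter as tk

    timed = [(0.08, 0.87, 'wikipedia'), (0.87, 0.99, 'is'),
             (0.99, 1.08, 'a'), (1.08, 2.01, 'multilingual')]

    root = tk.Tk()
    text = tk.Text(root, font=('Helvetica', 20), height=3, width=40)
    text.pack()

    # insert the words and remember the index range of each one
    positions = []
    for start, end, word in timed:
        begin_index = text.index('end-1c')
        text.insert('end', word + ' ')
        positions.append((start, end, begin_index, text.index('end-2c')))
    text.tag_configure('karaoke', foreground='red', underline=True)

    def tick(t=0.0, step=0.05):
        """Highlight the word active at simulated playback time t."""
        text.tag_remove('karaoke', '1.0', 'end')
        for start, end, begin_index, end_index in positions:
            if start <= t < end:
                text.tag_add('karaoke', begin_index, end_index)
        if t < timed[-1][1]:
            root.after(int(step * 1000), tick, t + step)

    tick()
    root.mainloop()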
  36. 36. Conclusion & Future Work • Make the process more automatic. • Make the user interface friendlier. • Make the program more robust. • We welcome your help to improve it. • Thank you for listening! 36
  37. 37. 37 PyCon JP 2015 Renyuan Lyu 呂仁園 Chun-Han Lai 賴俊翰 Karaoke-style Read-aloud System Oct 10 (Saturday), 2 p.m.–2:30 p.m. in 会議室1/Conference Room 1 Thank you for listening. ご聴取 有り難う 御座いました。 感謝您的收聽。
