江振宇/It's Not What You Say: It's How You Say It!

Chen-Yu Chiang was born in Taipei, Taiwan, in 1980. He received the B.S., M.S., and Ph.D. degrees in communication engineering from National Chiao Tung University (NCTU), Hsinchu, Taiwan, in 2002, 2004, and 2009, respectively. In 2009, he was a Postdoctoral Fellow at the Department of Electrical Engineering, NCTU, where he worked primarily on prosody modeling for automatic speech recognition and text-to-speech systems under the guidance of Prof. Sin-Horng Chen. In 2012, he was a Visiting Scholar at the Center for Signal and Image Processing (CSIP), Georgia Institute of Technology, Atlanta. He is currently the director of the Speech and Multimedia Signal Processing Lab and an assistant professor at the Department of Communication Engineering, National Taipei University. His main research interests are in speech processing, in particular prosody modeling, automatic speech recognition, and text-to-speech systems.

Published in: Data & Analytics

  1. 1. 1 Chen-Yu CHIANG (江振宇), Speech and Multimedia Signal Processing Lab, Department of Communication Engineering, National Taipei University. It isn’t what you said; it’s how you said it! 2016/7/17
  2. 2. 2 An Example - Talking Twin Babies - PART 2 - OFFICIAL VIDEO 2016/7/17 https://www.youtube.com/watch?v=_JmA2ClUvUY
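  3. 3. 3 It isn’t what you said; it’s how you said it! • This saying is really about prosody (韻律). • In short: the falling and rising, pausing and turning, lightness and stress, slowness and speed of speech (抑, 揚, 頓, 挫, 輕, 重, 緩, 急). Audio examples: extreme cases of fast, fluent speech versus slow, halting speech; different speaking styles: read speech, spontaneous speech (usually conversation), and human-to-machine speech. Example sentence: 嘿!芭樂!請你過來看一看。 (“Hey! Guava! Please come over and take a look.” Is “Guava” a person?) Read aloud in a reading style, the sentence sounds odd; spoken naturally, it sounds far more natural and serves its communicative function.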
  4. 4. 4 It isn’t what you said; it’s how you said it! • Thomas Sheridan (Irish stage actor) pointed out the importance of prosody more than 200 years ago: – Children are taught to read sentences, which they do not understand; and as it is impossible to lay the emphasis right, without perfectly comprehending the meaning of what one reads, they get a habit either of reading in a monotone, or if they attempt to distinguish one word from the rest, as the emphasis falls at random, the sense is usually perverted, or changed into nonsense. 2016/7/17
  5. 5. 5 Physical, quantitative measurement of prosody • Prosody can be measured by the following prosodic-acoustic features (韻律聲學參數): – fundamental frequency (F0, 基頻) – duration (時長) – energy or intensity (能量) – pause or silence (靜音) • Prosodic-acoustic features can be measured over the following units: – utterance (語句), discourse (語段), sentence (句子), clause (子句), phrase (片語), word (詞), syllable (音節), initial/final (聲母/韻母), phoneme (音素)
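Two of the features listed above, energy and pause, can be illustrated with a minimal sketch. It is not the measurement procedure used in the talk: the 10 ms frame length, the -40 dB threshold, the 50 ms minimum pause length, and the synthetic test signal are all assumptions chosen for illustration, and the helper names `frame_energy_db` and `find_pauses` are hypothetical.

```python
import math

def frame_energy_db(samples, frame_len, hop):
    """Short-time energy in dB for each frame; a small floor avoids log(0)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        power = sum(s * s for s in frame) / frame_len
        energies.append(10.0 * math.log10(max(power, 1e-10)))
    return energies

def find_pauses(energies_db, threshold_db, min_frames):
    """Return (first_frame, last_frame) for each low-energy run of at least min_frames."""
    pauses, run_start = [], None
    # The trailing +inf sentinel closes a run that reaches the end of the signal.
    for i, e in enumerate(energies_db + [float("inf")]):
        if e < threshold_db:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_frames:
                pauses.append((run_start, i - 1))
            run_start = None
    return pauses

# Synthetic 8 kHz signal (assumed data): 0.2 s tone, 0.1 s silence, 0.2 s tone.
SAMPLE_RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(1600)]
signal = tone + [0.0] * 800 + tone

energies = frame_energy_db(signal, frame_len=80, hop=80)   # 10 ms frames
pauses = find_pauses(energies, threshold_db=-40.0, min_frames=5)
pause_durations_ms = [(last - first + 1) * 10 for first, last in pauses]
print(pauses, pause_durations_ms)
```

Real recordings need a noise-adaptive threshold, and F0 extraction needs a dedicated pitch tracker, which this sketch does not attempt.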
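  6. 6. 6 Example of prosodic-acoustic feature measurement. Example sentence: 科學家 愈來愈 相信 在 生化學 上 而言 眼睛 與 胃 之間 必定 有 密切 的 關聯 (“Scientists increasingly believe that, biochemically speaking, there must be a close connection between the eyes and the stomach”). Figure panels: waveform, spectrogram, F0, energy, time, pause, syllable segmentation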
  7. 7. 7 Important Characteristics of Mandarin Chinese (1/2) • A tonal language (four lexical tones (聲調) and one neutral tone) • The tonality of a monosyllable is mainly characterized by the shape of its fundamental frequency (F0) contour (figure: the tone notation proposed by Yuen Ren Chao, 趙元任) • Tones disambiguate word meanings: 媽 (mā), 麻 (má), 馬 (mǎ), 罵 (mà), 嘛 (ma); 買 (mǎi, “buy”) vs. 賣 (mài, “sell”); 主投 (zhǔtóu) vs. 豬頭 (zhūtóu) • Audio examples: original speech signal; synthesis without tone
  8. 8. 8 Important Characteristics of Mandarin Chinese (2/2) • A syllable-based language, where each syllable carries a lexical tone (聲調). – 411 base syllables combined with tones → 1,300 distinct tonal syllables. • A syllable-timed language – Syllables take approximately equal amounts of time to pronounce. – Syllable structure of Chinese: initial (聲母) + final (韻母) – Initial = consonant; final = [medial] + nucleus + [coda] • English is a stress-timed language, where there is approximately the same amount of time between stressed syllables. Audio examples: native English speaker; non-native English speaker
  9. 9. 9 Tone and Intonation • Ripples on the waves (趙元任) or superposition – Synthesis with tone+intonation – Synthesis without intonation 2016/7/17
  10. 10. 10 Prosody Hierarchy for Mandarin (Tseng, 2005). Hierarchy levels shown in the figure: prosodic phrase group, breath group, prosodic phrase, prosodic word, syllable. Chiu-yu Tseng et al., “Fluent speech prosody: framework and modeling,” Speech Communication, vol. 46, issues 3-4, Special Issue on Quantitative Prosody Modeling for Natural Speech Description and Generation, pp. 284-309, July 2005.
  11. 11. 11 A Modified Prosodic Structure for Mandarin • B4: boundary of a breath group (BG)/prosodic phrase group (PG) • B3: boundary of a prosodic phrase (PPh) • B2: prosodic word (PW) boundary – B2-1: pitch reset – B2-2: short pause – B2-3: duration lengthening • B1: normal syllabic boundary • B0: tightly coupled syllabic boundary
  12. 12. 12 Prosody and Syntax. Figure: syntactic structure (parse tree) versus prosodic structure for the sentence 科學家 愈來愈 相信 在 生化學 上 而言 眼睛 與 胃 之間 必定 有 密切 的 關聯 。 POS tags: 科學家 (Na), 愈來愈 (Dfa), 相信 (VK), 在 (P), 生化學 (Na), 上 (Ng), 而言 (Ng), 眼睛 (Na), 與 (Caa), 胃 (Na), 之間 (Ng), 必定 (D), 有 (V_2), 密切 (VH), 的 (DE), 關聯 (Na). Prosodic structure: 11 PWs grouped into 4 PPhs within one PG/BG. Audio example: original speech signal
  13. 13. 13 Arbitrary Prosody. Figure: the same sentence and syntactic structure as on the previous slide (科學家 愈來愈 相信 在 生化學 上 而言 眼睛 與 胃 之間 必定 有 密切 的 關聯 。), paired with an arbitrary prosodic structure: 10 PWs grouped into 2 PPhs within one PG/BG. Audio example: poor-prosody speech signal
  14. 14. 14 Examples of English Prosodic Structure (ToBI) • BU Radio f1ajrlp4.sph • Hennessy * is the S.J.C.'s | thirty-second * chief justice. / Holding the court system * on the course * he has set / and plotting | its future agenda / won't be an easy job / for his successor. /
  15. 15. 15 Prosody Labeling Example 2016/7/17
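  16. 16. 16 Labeling Example • 謝謝 B2-1 主持人 B2-2 今天的 B2-3 監察人 B3 林委員 B2-2 我們 B2-3 主持人 B2-3 陳委員 B3 還有 B2-3 朱 B2-1 主席 B2-3 宋主席 B2-2 各位 B2-2 在場的 B2-2 朋友 B2-2 還有我們 B2-2 電視機 B2-1 前面 B3 的國人 B2-3 同胞 B4 父老兄弟 B2-3 姐妹 B2-3 朋友 B2-1 們 B2-2 大家 B2-1 晚安 B2-2 大家好 Be • English gloss: “Thank you, moderator; today's supervisor, Commissioner Lin; our host, Commissioner Chen; and Chairman Chu, Chairman Soong; all friends present; and our fellow citizens in front of the television; elders, brothers, sisters, and friends; good evening, everyone; hello, everyone.”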
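A short sketch of how such a break-labeled string can be grouped into larger prosodic units. It assumes that B3, B4 and the final `Be` tag close a prosodic phrase and that B2-x tags only separate prosodic words; treating `Be` as an utterance-final break is an assumption, and the helper `split_prosodic_phrases` is hypothetical rather than part of the PLM system.

```python
import re

# The labeled sentence from the slide (word tokens and break tags alternate).
LABELED = (
    "謝謝 B2-1 主持人 B2-2 今天的 B2-3 監察人 B3 林委員 B2-2 我們 B2-3 "
    "主持人 B2-3 陳委員 B3 還有 B2-3 朱 B2-1 主席 B2-3 宋主席 B2-2 各位 B2-2 "
    "在場的 B2-2 朋友 B2-2 還有我們 B2-2 電視機 B2-1 前面 B3 的國人 B2-3 "
    "同胞 B4 父老兄弟 B2-3 姐妹 B2-3 朋友 B2-1 們 B2-2 大家 B2-1 晚安 B2-2 "
    "大家好 Be"
)

BREAK_TAG = re.compile(r"^B(\d(-\d)?|e)$")   # B0..B4, B2-1..B2-3, Be
PHRASE_BREAKS = {"B3", "B4", "Be"}           # assumed to close a prosodic phrase

def split_prosodic_phrases(labeled):
    """Group prosodic words into phrases, cutting at B3/B4/Be tags."""
    phrases, current = [], []
    for token in labeled.split():
        if not BREAK_TAG.match(token):
            current.append(token)            # a prosodic word
        elif token in PHRASE_BREAKS and current:
            phrases.append(current)
            current = []
    if current:                              # trailing words without a closing tag
        phrases.append(current)
    return phrases

phrases = split_prosodic_phrases(LABELED)
for words in phrases:
    print(len(words), " ".join(words))
```

Cutting only at B4 instead would give the breath group/prosodic phrase group level of the same hierarchy.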
  17. 17. 17 Functions of Prosody • Grammar – It is believed that prosody assists listeners in parsing continuous speech and in the recognition of words, providing cues to syntactic structure, grammatical boundaries and sentence type. • Why did you hit Joe? Why did you hit PAUSE Joe? • Focus – Intonation and stress work together to highlight important words or syllables for contrast and focus. • Discourse – Prosody plays a role in the regulation of conversational interaction and in signaling discourse structure. • Emotion – Prosody is also important in signaling emotions and attitudes. • In short: prosody is the communication protocol of human-to-human communication!
  18. 18. 18 Issues and Applications • Issues concerned in prosody modeling – Labeling of important prosodic cues – Construction of prosody hierarchy – Modeling of syntax-prosody relationship – Prediction of prosodic phrase boundary (break) from text, etc. • Applications – Automatic Speech Recognition (ASR) • Important prosodic cues can be explored from the input utterance to assist in both acoustic and linguistic decoding – Text-to-Speech (TTS) • A good prosody model can be used to generate appropriate prosodic features from the input text 2016/7/17
  19. 19. 19 The Role of Prosody in Spoken Language Processing. Diagram labels: human, computer; input, output; speech, text, meaning; recognition, understanding, generation, synthesis
  20. 20. 20 Prosody Modeling • y = f(x) → prosody generation for TTS – x: input information – y: prosodic-acoustic features (pitch, duration, energy, pause)
  21. 21. 21 Prosody Modeling • y = f(x) → recognition of information carried by prosody – x: prosodic-acoustic features (pitch, duration, energy, pause) – y: information carried by prosody, including
  22. 22. 22 Direct or Indirect Prosody Modeling. Linguistic/para-linguistic/non-linguistic features ↔ prosodic-acoustic features. Engineers use pattern recognition tools to build the relationship between the two (extensive linguistic knowledge is not required)
  23. 23. 23 Direct or Indirect Prosody Modeling. Linguistic/para-linguistic/non-linguistic features ↔ prosody tags (abstract representation of prosody) ↔ prosodic-acoustic features. Linguists can interpret the physical and linguistic meaning of the tags, and the representation may generalize more broadly to all languages (nature?)
  24. 24. 24 Influential Factors on Prosody 2016/7/17 Fujisaki, H., “Information, prosody, and modeling – with emphasis on tonal features of speech,” Proc. Speech Prosody 2004, Nara, Japan, pp. 1-10, 2004.
  25. 25. 25 Conventional Schemes. Fig. 1: prosody modeling via intermediate abstract phonological categories. Fig. 2: direct modeling of target classes. Diagram blocks: speech corpus; feature extraction; prosodic-acoustic features; target class (lexical tone, word boundary, etc.); training of pattern classifier; parameters of pattern classifiers (GMM, DT, NN, ME, etc.)
  26. 26. 26 Proposed Scheme – Unsupervised Prosody Labeling and Modeling (PLM) • Basic idea – Prosody modeling and labeling are conducted jointly on an unlabeled speech database. – The observed features are modeled properly, and the modeled features then determine the prosodic tags objectively, rather than human perception. • Design of the hierarchical prosodic model 1. Representation of the prosody hierarchy by break types and prosodic states 2. Realizing patterns of prosodic constituents – Prosodic state model – Syllable prosodic-acoustic model 3. Exploring the relationship between prosodic tags or boundary types and the acoustic features surrounding junctures – Syllable juncture prosodic-acoustic model 4. Relationship between prosodic structure and syntactic structure – Break-syntax model. Figure label: prosody-labeled database
  27. 27. 27 2016/7/17 Prosody Hierarchical Structure and Prosody Tags • Break types of syllable junctures – demarcate prosodic constituents, i.e. syllable (SYL), prosodic word (PW), prosodic phrase (PPh), breath group (BG) and prosodic phrase group (PG). • Prosodic states of syllables – represent syllable pitch contour, duration and energy level variations resulting from high-level prosodic constituents (>=PW). – a substitution for the effects from high-level linguistic features, such as a word, a phrase or a syntactic tree.
  28. 28. 28 A Modified Prosodic Structure for Mandarin • B4: boundary of a breath group (BG)/prosodic phrase group (PG) • B3: boundary of a prosodic phrase (PPh) • B2: prosodic word (PW) boundary – B2-1: pitch reset – B2-2: short pause – B2-3: duration lengthening • B1: normal syllabic boundary • B0: tightly coupled syllabic boundary. C.-Y. Tseng, S.-H. Pin, Y.-L. Lee, H.-M. Wang, and Y.-C. Chen, “Fluent speech prosody: Framework and modeling,” Speech Commun., special issue on quantitative prosody modeling for natural speech description and generation, 46, 284–309 (2005).
  29. 29. 29 Unsupervised Joint Prosody Labeling and Modeling by Hierarchical Prosodic Model. Chen-Yu Chiang, Sin-Horng Chen, Hsiu-Min Yu, and Yih-Ru Wang, “Unsupervised Joint Prosody Labeling and Modeling for Mandarin Speech,” J. Acoust. Soc. Am., vol. 125, no. 2, pp. 1164-1183, Feb. 2009. Chen-Yu Chiang, Sin-Horng Chen and Yih-Ru Wang, “Advanced Unsupervised Joint Prosody Labeling and Modeling for Mandarin Speech and Its Application to Prosody Generation for TTS,” in Proc. Interspeech 2009, Brighton, UK, Sept. 2009, pp. 504-507.
  30. 30. 30 Features and Parameters Used in the Hierarchical Prosodic Model
      T: prosodic tag. B: break type = {B0, B1, B2-1, B2-2, B2-3, B3, B4}. PS: prosodic state (p: pitch prosodic state; q: duration prosodic state; r: energy prosodic state).
      A: prosodic feature. X: syllable prosodic feature (sp: syllable pitch contour; sd: syllable duration; se: syllable energy level). Y: inter-syllabic prosodic feature (pd: pause duration; ed: energy-dip level). Z: differential prosodic features (pj: normalized pitch jump; dl: normalized duration lengthening factor 1; df: normalized duration lengthening factor 2).
      L: linguistic feature. l: reduced linguistic feature set; t: syllable tone sequence; s: base-syllable type sequence; f: final type sequence; u: utterance sequence.
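  31. 31. 31 Parameterization of Syllable Pitch Contour (in logHz) • Discrete orthogonal polynomial expansion – Basis functions (discrete Legendre polynomials), for $0 \le i \le M$ and $M \ge 3$:
      $\phi_0(i/M) = 1$
      $\phi_1(i/M) = \left[\frac{12M}{M+2}\right]^{1/2} \left[\frac{i}{M} - \frac{1}{2}\right]$
      $\phi_2(i/M) = \left[\frac{180M^3}{(M-1)(M+2)(M+3)}\right]^{1/2} \left[\left(\frac{i}{M}\right)^2 - \frac{i}{M} + \frac{M-1}{6M}\right]$
      $\phi_3(i/M) = \left[\frac{2800M^5}{(M-1)(M-2)(M+2)(M+3)(M+4)}\right]^{1/2} \left[\left(\frac{i}{M}\right)^3 - \frac{3}{2}\left(\frac{i}{M}\right)^2 + \frac{6M^2-3M+2}{10M^2}\cdot\frac{i}{M} - \frac{(M-1)(M-2)}{20M^2}\right]$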
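The expansion above can be sketched in a few lines: a syllable's log-F0 contour sampled at M+1 points is reduced to four coefficients and rebuilt from them. The sketch assumes the usual convention that the basis is orthonormal under the average over the M+1 sample points, so each coefficient is the mean of the contour times a basis function. The function names `legendre_basis`, `encode` and `decode` and the sample contour are illustrative assumptions.

```python
import math

def legendre_basis(M):
    """Return phi[j][i] for j = 0..3 and i = 0..M (requires M >= 3)."""
    if M < 3:
        raise ValueError("M must be at least 3")
    c1 = math.sqrt(12.0 * M / (M + 2))
    c2 = math.sqrt(180.0 * M**3 / ((M - 1) * (M + 2) * (M + 3)))
    c3 = math.sqrt(2800.0 * M**5 /
                   ((M - 1) * (M - 2) * (M + 2) * (M + 3) * (M + 4)))
    basis = [[], [], [], []]
    for i in range(M + 1):
        x = i / M
        basis[0].append(1.0)
        basis[1].append(c1 * (x - 0.5))
        basis[2].append(c2 * (x * x - x + (M - 1) / (6.0 * M)))
        basis[3].append(c3 * (x**3 - 1.5 * x * x
                              + (6 * M * M - 3 * M + 2) / (10.0 * M * M) * x
                              - (M - 1) * (M - 2) / (20.0 * M * M)))
    return basis

def encode(contour):
    """Project a log-F0 contour (M+1 samples) onto the four basis functions."""
    M = len(contour) - 1
    return [sum(f * p for f, p in zip(contour, phi)) / (M + 1)
            for phi in legendre_basis(M)]

def decode(coeffs, M):
    """Rebuild an (M+1)-sample contour from four coefficients."""
    basis = legendre_basis(M)
    return [sum(a * phi[i] for a, phi in zip(coeffs, basis))
            for i in range(M + 1)]

# Assumed example: a linearly falling contour in log-Hz, 11 samples (M = 10).
contour = [5.4 - 0.03 * i for i in range(11)]
coeffs = encode(contour)
rebuilt = decode(coeffs, M=10)
print([round(a, 4) for a in coeffs])
```

The first coefficient is the mean log-F0 level and the second tracks the slope, which is why a four-dimensional vector is a compact description of one syllable's pitch contour.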
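  32. 32. 32 The Design of the Four Models
      Variables: syllable prosodic features X; inter-syllable prosodic features Y; differential prosodic features Z; reduced linguistic feature set l; prosodic state PS; break type B; tone t, syllable type s, final f.
      $P(\mathbf{T}, \mathbf{A} \mid \mathbf{L}) = P(\mathbf{A} \mid \mathbf{T}, \mathbf{L})\, P(\mathbf{T} \mid \mathbf{L}) = P(\mathbf{X}, \mathbf{Y}, \mathbf{Z} \mid \mathbf{B}, \mathbf{PS}, \mathbf{L})\, P(\mathbf{B}, \mathbf{PS} \mid \mathbf{L})$ (general prosodic feature model × general prosody-syntax model)
      $P(\mathbf{X}, \mathbf{Y}, \mathbf{Z} \mid \mathbf{B}, \mathbf{PS}, \mathbf{L}) \approx P(\mathbf{X} \mid \mathbf{B}, \mathbf{PS}, \mathbf{L})\, P(\mathbf{Y}, \mathbf{Z} \mid \mathbf{B}, \mathbf{L})$ (syllable prosodic-acoustic model × syllable juncture prosodic-acoustic model)
      $P(\mathbf{B}, \mathbf{PS} \mid \mathbf{L}) \approx P(\mathbf{PS} \mid \mathbf{B})\, P(\mathbf{B} \mid \mathbf{L})$ (prosodic state model × break-syntax model)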
  33. 33. 33 Syllable Pitch Contour Model (1/3)
      Covariance of observed log-F0: $\mathbf{R}_{sp} = 10^{-4} \times \begin{bmatrix} 565.4 & 23.9 & -25.6 & -0.5 \\ 23.9 & 90.5 & 9.7 & -8.2 \\ -25.6 & 9.7 & 17.8 & -0.9 \\ -0.5 & -8.2 & -0.9 & 5.0 \end{bmatrix}$
      Covariance of residual log-F0: $\mathbf{R}_{sp}^{r} = 10^{-4} \times \begin{bmatrix} 3.5 & 0.2 & -0.2 & 0.0 \\ 0.2 & 31.9 & 2.6 & -1.5 \\ -0.2 & 2.6 & 11.1 & 0.6 \\ 0.0 & -1.5 & 0.6 & 3.7 \end{bmatrix}$
      Figure 2.4: The APs of five tones
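One way to read the two covariance matrices is to compare their diagonals, which are the variances of the four pitch-contour coefficients before and after modeling. The sketch below does that arithmetic. It assumes the residual is what remains after the modeled factors are removed and that the common scale factor cancels in the ratios; the variable names are illustrative.

```python
# Diagonal entries of the two matrices on the slide (common scale factor omitted,
# since it cancels in every ratio below).
OBSERVED_VAR = [565.4, 90.5, 17.8, 5.0]   # observed log-F0 coefficients
RESIDUAL_VAR = [3.5, 31.9, 11.1, 3.7]     # residual log-F0 coefficients

# Share of the total variance (matrix trace) left in the residual.
residual_share = sum(RESIDUAL_VAR) / sum(OBSERVED_VAR)

# Same ratio per coefficient: index 0 is the mean level, index 1 the slope, etc.
per_coefficient = [r / o for r, o in zip(RESIDUAL_VAR, OBSERVED_VAR)]

print(f"residual share of total variance: {residual_share:.1%}")
for j, ratio in enumerate(per_coefficient):
    print(f"coefficient {j}: {ratio:.1%} of observed variance remains")
```

Under these assumptions, the mean-level coefficient is explained almost entirely, while the higher-order shape coefficients keep a larger share of their variance.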
  34. 34. 34 Syllable Pitch Contour Model (2/3) Figure 2.5: The (a) forward and (b) onset coarticulation patterns, $\beta^{f}_{B,tp}$ and $\beta^{f}_{B_b,t}$. Here tp = (i, j) and t = i or j. Example: Tone 1 + Tone 3, high-low mismatch compensation, shown for B0, B1 and B4
  35. 35. 35 Syllable Pitch Contour Model (3/3) Figure 2.5: The (c) backward and (d) offset coarticulation patterns, $\beta^{b}_{B,tp}$ and $\beta^{b}_{B_e,t}$. Here tp = (i, j) and t = i or j. Example: Tone 3 + Tone 3, a tone sandhi example
  36. 36. 36 Patterns of Prosodic Constituents. Figure 3.13: The log-F0 patterns of BG/PG, PPh and PW (three panels; x-axis: length in syllables, 0 to 30; y-axis: log-F0, -0.15 to 0.15). $pm_n = \beta_{PW_n} + \beta_{PPh_n} + \beta_{BG/PG_n} + pm^r_n$
  37. 37. 37 Patterns of Prosodic Constituents. Figure 3.14: The syllable duration patterns of BG/PG, PPh and PW (three panels; x-axis: length in syllables, 0 to 30; y-axis: seconds, -0.02 to 0.06). The slide gives the corresponding decomposition of $dm_n$ into PW, PPh and BG/PG patterns plus the residual $dm^r_n$.
  38. 38. 38 Patterns of Prosodic Constituents. Figure 3.15: The energy level patterns of BG/PG, PPh and PW (three panels; x-axis: length in syllables, 0 to 30; y-axis: dB, -5 to 5). The slide gives the corresponding decomposition of $em_n$ into PW, PPh and BG/PG patterns plus the residual $em^r_n$.
  39. 39. 39 Comparison between Human Labeling and Machine Labeling (1/2) • Human labeling tags: b1: non-break; b2: prosodic word boundary; b3: minor break; b4: major break • Machine labeling tags: B4: boundary of a breath group (BG)/prosodic phrase group (PG); B3: boundary of a prosodic phrase (PPh); B2: prosodic word (PW) boundary (B2-1: pitch reset; B2-2: short pause; B2-3: duration lengthening); B1: normal syllabic boundary; B0: tightly coupled syllabic boundary
  40. 40. 40 Comparison between Human Labeling and Machine Labeling (2/2) 2016/7/17
  41. 41. 41 2016/7/17 Application to ASR (read speech) Sin-Horng Chen, Jyh-Her Yang, Chen-Yu Chiang, Ming-Chieh Liu and Yih-Ru Wang, "A New Prosody-Assisted Mandarin ASR System", IEEE Trans. on Audio, Speech and Language Processing, vol.20, no.6, pp.1669-1684, Aug. 2012.
  42. 42. 42 Proposed Two-Stage Prosody-Assisted ASR 2016/7/17
  43. 43. 43 Experimental Settings • Database for the ASR experiments – TCC300: a large Mandarin read speech database – Training: 274 speakers, 23 hours for acoustic model and prosodic model – Test: 19 speakers, 2 hours • Acoustic model – 411 Syllable HMM (8 states) + silence model + short pause model – MMI training – Trained from TCC300 training set (274 speakers, 23 hours) • Factored LM – NTCIR + Sinica + Panorama, about 1.2 billion words – 60000-word lexicon • Prosodic model – Trained from the subset of TCC300 training set (164 speakers, 8.3 hours) 2016/7/17
  44. 44. 44 Experimental Results
      Recognition performance of the baseline scheme, Scheme 1, and Scheme 2 (%)
      Scheme | WER | CER | SER
      Baseline scheme | 24.4 | 18.1 | 12.0
      Break | 21.3 | 15.0 | 10.2
      Break + Prosodic state | 20.7 | 14.4 | 9.6
      Experimental results of POS decoding (%)
      Scheme | Precision | Recall | F-measure
      Baseline scheme | 93.4 | 76.4 | 84.0
      Break + Prosodic state | 93.4 | 80.0 | 86.2
      Experimental results of PM decoding (%)
      Scheme | Precision | Recall | F-measure
      Baseline scheme | 55.2 | 37.8 | 44.8
      Break + Prosodic state | 61.2 | 53.0 | 56.8
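The numbers above can be put in perspective with two small calculations: the relative error-rate reduction over the baseline, and the F-measure as the harmonic mean of precision and recall. This is only arithmetic on the reported figures; the helper names are illustrative, and small differences from the printed F-measures are expected because the slide's precision and recall are already rounded.

```python
def relative_reduction(baseline, proposed):
    """Relative error-rate reduction in percent."""
    return 100.0 * (baseline - proposed) / baseline

def f_measure(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    return 2.0 * precision * recall / (precision + recall)

# Baseline vs. "Break + Prosodic state", values taken from the tables above.
wer_gain = relative_reduction(24.4, 20.7)
cer_gain = relative_reduction(18.1, 14.4)
ser_gain = relative_reduction(12.0, 9.6)
pos_f = f_measure(93.4, 80.0)
pm_f = f_measure(61.2, 53.0)

print(f"relative WER reduction: {wer_gain:.1f}%")
print(f"relative CER reduction: {cer_gain:.1f}%")
print(f"relative SER reduction: {ser_gain:.1f}%")
print(f"POS F-measure: {pos_f:.1f}, PM F-measure: {pm_f:.1f}")
```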
  45. 45. 45 TABLE VIII. Experimental results of tone decoding (%)
      Scheme | Precision | Recall | F-measure
      Baseline scheme | 87.9 | 87.5 | 87.7
      Break + Prosodic state | 91.9 | 91.6 | 91.7
      Figure: an example of recognition results for a partial paragraph. Eight panels represent, respectively, waveform, prosodic state AP + global mean of syllable log-F0 level, syllable duration, and syllable energy level, break type (B), reference transcription (R), result of baseline scheme (F) and proposed system (P).
  46. 46. 46 2016/7/17 Application to ASR (spontaneous speech) Cheng-Hsien Lin, Meng-Chian Wu, Chung-Long You, Chen-Yu Chiang, Yih-Ru Wang, Sin-Horng Chen, “Prosody Modeling of Spontaneous Mandarin Speech and Its Application to Automatic Speech Recognition,” Speech Prosody 2016, accepted.
  47. 47. 47 Experimental Settings • Database for the ASR experiments – MCDC: 8 hours of dialogues from 16 speakers; texts with PU tags were transcribed and annotated by expert linguists • Acoustic model – Seed tri-phone HMMs are trained on TCC300 and adapted using 80% of MCDC. – CI models for PU: 6 particles (HO, EI, HAN, HEN, HEIN, and MHM) + 2 fillers (unrecognized/foreign speech) – CI models for paralinguistic events: BREATHE, CLEAR_THROAT, LAUGH, NOISE, SMACK, and SWALLOW • Factored LM – A corpus of about 440 million words merged from 5 corpora; words/POS are tagged by an in-house CRF tagger – Adapted with 90% of the MCDC corpus – 60000-word lexicon, including all particles and markers, selected by word frequency
  48. 48. 48 Experimental Results 2016/7/17
  49. 49. 49 Speaking Rate Dependent Hierarchical Prosodic Model. Sin-Horng Chen, Chiao-Hua Hsieh, Chen-Yu Chiang, Hsi-Chun Hsiao, Yih-Ru Wang, Yuan-Fu Liao and Hsiu-Min Yu, “Modeling of Speaking Rate Influences on Mandarin Speech Prosody and Its Application to Speaking Rate-controlled TTS,” IEEE Trans. on Audio, Speech and Language Processing, vol. 22, no. 7, pp. 1158-1171, July 2014.
  50. 50. 50 Introduction • Speaking rate is a prosodic feature that influences many phenomena such as – Syllable duration – Pitch contour – Pause duration – Occurrence frequency of pause • Modeling the effects of speaking rate is an important research issue in – Automatic speech recognition (ASR) – Text-to-speech system (TTS) 2016/7/17
  51. 51. 51 • Objective – Model the influence of speaking rate on speech prosody based on the PLM method • The proposed approach – Speaking rate is treated as a continuous variable, and a single HPM is constructed using the same four corpora – In this study, the speaking rate (SR) of each utterance is defined as its average duration per syllable, disregarding all inter-syllable pauses
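The SR definition above amounts to a one-line average. The sketch below computes it from a list of syllable durations, leaving pauses out simply by not passing them in, and also reports the reciprocal in syllables per second. The function name and the sample durations are assumptions for illustration.

```python
def speaking_rate(syllable_durations_sec):
    """Return (seconds per syllable, syllables per second), pauses excluded."""
    if not syllable_durations_sec:
        raise ValueError("need at least one syllable duration")
    sec_per_syl = sum(syllable_durations_sec) / len(syllable_durations_sec)
    return sec_per_syl, 1.0 / sec_per_syl

# Assumed example utterance: four syllables with pauses between them.
durations = [0.18, 0.22, 0.20, 0.20]   # syllable durations in seconds
pauses = [0.00, 0.35, 0.00]            # inter-syllable pauses, deliberately ignored

sec_per_syl, syl_per_sec = speaking_rate(durations)
print(f"{sec_per_syl:.3f} s/syllable, {syl_per_sec:.2f} syllables/s")
```

Excluding pauses keeps the measure tied to articulation speed, so an utterance with long pauses is not mistaken for slowly articulated speech.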
  52. 52. 52 Experimental Database • SR-Treebank database: – Read speech – The corpus contains four parallel speech datasets uttered by a female professional announcer with fast, normal, median and slow speaking rate. – All utterances are short paragraphs. There are in total 1478 utterances consisting of 203,746 syllables. 2016/7/17
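  53. 53. 53 Break Labeling Examples for Four Parallel Utterances with Various SR. Note: only pause-related break types, i.e. B4 (@), B3 (/) and B2-2 (*), are displayed. English gloss of the sentence: “According to statistics from the Directorate-General of Budget, Accounting and Statistics of the Executive Yuan, for October 1 to 20, the country's export and import values both increased compared with the same period last year,”
      Fast SR: 依據行政院主計處的統計 @,十月份 * 一到二十日 /,我國出口及進口金額 / 比起去年同期 * 均有增加 @,
      Normal SR: 依據行政院主計處的統計 @,十月份 * 一到二十日 /,我國出口 * 及進口金額 / 比起去年同期 * 均有增加 @,
      Median SR: 依據 * 行政院主計處的統計 @,十月份 / 一到 * 二十日 /,我國出口 * 及進口金額 / 比起去年同期 * 均有增加 @,
      Slow SR: 依據 / 行政院 * 主計處的統計 @,十月份 / 一 * 到 * 二十日 @,我國出口 * 及進口金額 / 比起去年同期 * 均有增加 @,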
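Counting the break marks in the four labeled utterances makes the trend explicit. The sketch below only tallies the @, / and * symbols in the strings from the slide; the dictionary layout and the helper `count_breaks` are illustrative assumptions.

```python
from collections import Counter

# The four parallel utterances, with B4 (@), B3 (/) and B2-2 (*) marks.
UTTERANCES = {
    "fast": "依據行政院主計處的統計 @,十月份 * 一到二十日 /,我國出口及進口金額 / 比起去年同期 * 均有增加 @,",
    "normal": "依據行政院主計處的統計 @,十月份 * 一到二十日 /,我國出口 * 及進口金額 / 比起去年同期 * 均有增加 @,",
    "median": "依據 * 行政院主計處的統計 @,十月份 / 一到 * 二十日 /,我國出口 * 及進口金額 / 比起去年同期 * 均有增加 @,",
    "slow": "依據 / 行政院 * 主計處的統計 @,十月份 / 一 * 到 * 二十日 @,我國出口 * 及進口金額 / 比起去年同期 * 均有增加 @,",
}
BREAK_MARKS = {"@": "B4", "/": "B3", "*": "B2-2"}

def count_breaks(text):
    """Count each pause-related break type in one labeled utterance."""
    counts = Counter(ch for ch in text if ch in BREAK_MARKS)
    return {BREAK_MARKS[mark]: counts[mark] for mark in BREAK_MARKS}

break_counts = {rate: count_breaks(text) for rate, text in UTTERANCES.items()}
totals = {rate: sum(c.values()) for rate, c in break_counts.items()}
for rate in UTTERANCES:
    print(rate, break_counts[rate], "total:", totals[rate])
```

In this single example the number of pause-related breaks grows as the speaking rate slows, which matches the earlier point that speaking rate influences how often pauses occur.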
  54. 54. 54 Examples of Synthesized Speech. Audio examples: original, proposed, baseline; ordered from slower to faster
  55. 55. 55 Cross-Dialect and -Speaker Adaptation of SR-HPM. Chen-Yu Chiang, “A Study on Adaptation of Speaking Rate-Dependent Hierarchical Prosodic Model for Chinese Dialect TTS,” in Proc. O-COCOSDA 2015, Shanghai, China, Oct. 2015. (Best Paper Award) Chen-Yu Chiang, Hsiu-Min Yu, Sin-Horng Chen, “On Cross-Dialect and -Speaker Adaptation of Speaking Rate-Dependent Hierarchical Prosodic Model for a Hakka Text-to-Speech System,” Speech Prosody 2016, accepted. I-Bin Liao, Chen-Yu Chiang, Sin-Horng Chen, “Structural Maximum a Posteriori Speaker Adaptation of Speaking Rate-Dependent Hierarchical Prosodic Model for,” accepted by ICASSP 2016
  56. 56. 56 Experimental Databases • Mandarin (for background model) – 1,478 utterances with 183,795 syllables – A wider SR range of 3.4-6.8 syl/sec • Min (for adaptation): – 21,143 syllables for adaptation and the test set of 2,488 syllables – Speaking rate: 4.5-6.8 syl/sec • Hakka (for adaptation): – 15,009 syllables for adaptation and test set of 3,711 syllables – Speaking rate: 3.8-5.1 syl/sec 2016/7/17
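  57. 57. 57 Results for Hakka. Synthesized text: 客家人在歷史項,輒常分人看做「人客」。從歷史個文獻資料來看,客家人經過幾下擺個大遷徙;見擺遷徙,就去到別人既經早就先到個所在;高不將先向山區安身,先定疊下來,正定定仔對外發展。台灣個客家人,大部分對廣東梅州、惠州,少部分對福建永定、詔安遷徙過來。 English gloss: “Historically, the Hakka people have often been regarded as ‘guests’. According to historical documents, the Hakka went through several great migrations; each time, they arrived at places where others had already settled, so they had no choice but to settle in the mountains first, establish themselves, and only then gradually expand outward. Most Hakka in Taiwan migrated from Meizhou and Huizhou in Guangdong, and a smaller number from Yongding and Zhao'an in Fujian.”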
  58. 58. 58 Results for Min. Synthesized text: 原來春枝迷著歌仔戲。呣敢siau想看規齣,有通看一te te仔就teh癮囉。自按呢日昇配合伊,調整送貨時間、路線,撥工載伊去看戲尾仔。彼khui仔是日昇一世人上樂暢的時陣,送貨的khang khoe,做著嘛加偌thiau iat咧。一下chiap去,連顧戲口的查某gin仔看in來,一句赫晏ouh? 就知愛放戲尾仔啦。春枝知影日昇愛食甜,不時偷me烏糖互伊,日昇上愛那看那chng,chng甲規嘴hoe sa sa。一擺戲齣做到娘子落難,沿途奔波討食,尾仔煞真正跪ti台仔頂teh即時台仔腳看戲的銀角仔四界tan去lih。
  59. 59. 59 Application to Prosody Coding. Chen-Yu Chiang, Jyh-Her Yang, Ming-Chieh Liu, Yih-Ru Wang, Yuan-Fu Liao and Sin-Horng Chen, “A New Model-based Mandarin-speech Coding System,” in Proc. Interspeech 2011, Florence, Italy, Aug. 2011, pp. 2561-2564. Chen-Yu Chiang, Yu-Ping Hung, Sin-Horng Chen, and Yih-Ru Wang, “A New Model-Based Prosody Coder for Mandarin Speech,” in Proc. of IIHMSP 2013, Beijing, China, Oct. 2013, pp. 60-63.
  60. 60. 60 System Overview 2016/7/17
  61. 61. 61 2016/7/17 Experimental Database • Treebank Corpus – Read speech – 425 utterances with 56,237 syllables uttered by a female professional announcer. – Average syllable duration = 0.19 sec – Associated texts - short paragraphs composed of several sentences selected from the Sinica Treebank Version 3.0. – Training set - 379 utterances with 52,192 syllables. – Test set - 46 utterances with 4,801 syllables.
  62. 62. 62 2016/7/17 Experimental Results
  63. 63. 63 Experimental Results 2016/7/17
  64. 64. 64 Experimental Results 2016/7/17
  65. 65. 65 Ongoing Tasks and Future Work • Transfer learning: – Take the SR-HPM for Mandarin as a base model (prior) to construct prosodic models for English • Voice bank – Prosody bank: modeling prosodies of various speakers, emotions, styles… – Voice font bank: modeling spectra of various speakers, emotions, styles…
  66. 66. 66 Acknowledgements • We would like to thank – Academia Sinica, Taiwan for providing the Tree-Bank text corpus – Dr. Chiu-yu TSENG (鄭秋豫博士) of Academia Sinica, Taiwan for providing the Sinica COSPRO Corpus and the on-line word segmentation system – Dr. Shu-Chuan TSENG (曾淑娟博士) of Academia Sinica, Taiwan for providing the Mandarin Conversational Dialogue Corpus (MCDC) – Prof. Ho-Hsien PAN (潘荷仙教授) of Phonetics Laboratory, Department of Foreign Languages and Literatures of National Chiao Tung University, Taiwan for her generous and helpful assistance in manually labeling our experimental
  67. 67. 67 2016/7/17 Thank You for Your Attention! Contact: cychiang@mail.ntpu.edu.tw http://cychiang.tw
