second_seminar

Prosody Modeling for Synthesis of
Storytelling Style Speech
Second Seminar
by
Parakrant Sarkar
Roll No: 12IT72P08
Under the Supervision of
Dr. K.Sreenivasa Rao
Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur
March 4, 2016

OUTLINE
1. Work Done till First Seminar
2. Work Done after First Seminar
2.1 Modeling of pauses using FFNN, SVM and ELM
2.2 Unsupervised Pause Position Prediction
2.3 Modeling of pauses based on Discourse modes
2.4 Duration Modeling for Storytelling Style Speech
3. Conclusion
4. Future Work
5. Publications
6. References

WORK DONE TILL FIRST
SEMINAR

LIST OF PROBLEMS ADDRESSED
1. Hindi Story Synthesis framework was proposed.
SSED: Story-specific Emotion Detection Module
SSPG: Story-specific Prosody Generation Module
SSPI: Story-specific Prosody Incorporation Module
2. A three-stage pause prediction model was proposed,
considering:
Position of the pause.
Duration of the pause.

PROPOSED PAUSE PREDICTION MODEL
[Figure: Proposed pause prediction model. Inputs: story text and story speech corpus. First stage (pause position prediction model): pause / non-pause decision. Second stage (pause duration prediction model): short / medium / long pause classification, followed by separate short, medium and long pause duration predictors.]
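The cascade in the figure can be sketched as below. This is only an illustration of the control flow, assuming three pluggable components; the function name `predict_pause`, the feature keys and the toy rules are hypothetical stand-ins, not the trained models from this work.

```python
from typing import Callable, Dict, Optional

Features = Dict[str, float]


def predict_pause(
    features: Features,
    is_pause: Callable[[Features], bool],
    pause_type: Callable[[Features], str],
    duration_models: Dict[str, Callable[[Features], float]],
) -> Optional[float]:
    """Return a pause duration in ms, or None when no pause is predicted."""
    # Pause / non-pause decision after the current word.
    if not is_pause(features):
        return None
    # Short / medium / long classification.
    kind = pause_type(features)
    # Duration regression with the predictor dedicated to that pause type.
    return duration_models[kind](features)


# Toy stand-ins (hypothetical rules and constants, not trained models).
toy_is_pause = lambda f: f["words_to_end"] == 0 or f["is_clause_end"] == 1
toy_pause_type = lambda f: "long" if f["words_to_end"] == 0 else "short"
toy_duration = {
    "short": lambda f: 90.0,
    "medium": lambda f: 200.0,
    "long": lambda f: 350.0,
}
```

Keeping one duration predictor per pause type means each regressor only has to cover a narrow range of durations.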

WORK DONE AFTER
FIRST SEMINAR

LIST OF PROBLEMS:
1. Modeling of pauses using FFNN, SVM and ELM.
2. Unsupervised Pause Position Prediction.
3. Modeling of pauses based on Discourse modes.
4. Duration Modeling for Storytelling Style Speech.

STORY SPEECH CORPUS
100 stories collected: Panchatantra and Akbar-Birbal.
Sentences per story: 20-25
Total words: 24,400
Duration of the speech corpus: 3 hours (approx.)

PREDICTION OF PAUSE
POSITION

LIST OF FEATURES
1. Positional features:
Position of the current word from the beginning and
ending of the utterance.
Total number of words in the utterance.
2. Structural features:
Total number of phones in the current word, previous two
and following two words.
Total number of syllables in the current word, previous two
words and following two words.
Total number of phones in the utterance.
3. Morphological features:
Part-of-Speech (POS) of current word, previous two words
and following two words.
4. Story-semantic features
Emotion associated with the current word.
Phonetic strength of current word.
Genre of the Story
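A minimal sketch of how the positional and structural features above might be assembled for one word. The function names, dictionary keys and zero padding at utterance boundaries are assumptions made for illustration; the morphological and story-semantic features are omitted.

```python
from typing import Dict, List


def context_window(values: List[int], i: int, width: int = 2, pad: int = 0) -> List[int]:
    """Values at positions i-width .. i+width, padded outside the utterance."""
    return [
        values[j] if 0 <= j < len(values) else pad
        for j in range(i - width, i + width + 1)
    ]


def positional_structural_features(
    phones_per_word: List[int], syllables_per_word: List[int], i: int
) -> Dict[str, object]:
    """Features for the word at index i of one utterance."""
    n_words = len(phones_per_word)
    return {
        "pos_from_start": i + 1,          # 1-based position from the beginning
        "pos_from_end": n_words - i,      # 1-based position from the end
        "n_words": n_words,
        "n_phones_utt": sum(phones_per_word),
        # Current word plus previous two and following two words.
        "phones_ctx": context_window(phones_per_word, i),
        "syllables_ctx": context_window(syllables_per_word, i),
    }
```

For example, `positional_structural_features([3, 5, 2, 4], [1, 2, 1, 2], 0)` describes the first word of a four-word utterance, with the two missing left neighbours padded with zeros.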

RESULTS FOR PREDICTING THE PAUSE POSITION
Table: Performance of data-driven models (CART, FFNN, SVM and
ELM) for pause position prediction.
Model  Class      Recall  Precision  F1
CART   Non-pause  0.89    0.94       0.91
CART   Pause      0.68    0.81       0.74
FFNN   Non-pause  0.90    0.94       0.92
FFNN   Pause      0.71    0.83       0.77
SVM    Non-pause  0.91    0.93       0.92
SVM    Pause      0.78    0.81       0.79
ELM    Non-pause  0.89    0.92       0.90
ELM    Pause      0.71    0.82       0.76
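The F1 values reported for pause position prediction are consistent with the usual harmonic mean of precision and recall. A small check, assuming that standard definition (the helper name `f1_score` is illustrative):

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# CART rows of the pause position table: (recall, precision).
cart_non_pause = round(f1_score(0.89, 0.94), 2)  # 0.91
cart_pause = round(f1_score(0.68, 0.81), 2)      # 0.74
```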

ACCURACY OF THE PAUSE POSITION PREDICTION
MODEL

Prediction of Pause
Duration

FEATURES USED FOR DETERMINING THE PAUSE
DURATION
1. Morphological features:
Terminal syllable of the current word, previous two and
following two words.
2. Structural features:
Position of the vowel in the terminal syllable.
Number of segments (i.e., consonants) before and after the
nucleus (i.e., vowel) in the terminal syllable.
3. Positional features:
Total number of phones in the terminal syllable of the
current word, previous two and following two words.

PERFORMANCE OF CART, FFNN, SVM AND ELM
MODELS FOR PREDICTING THE PAUSE DURATION
Model  x̄ (in ms)  ȳ (in ms)  µ (in ms)  σ (in ms)  γx,y
CART   241.60     247.21     107.12     155.87     0.67
FFNN   241.60     251.25     75.13      70.68      0.87
SVM    241.60     251.70     117.69     138.28     0.71
ELM    241.60     245.98     107.25     110.11     0.67
x̄: Average of actual pause duration values (shared by all models).
ȳ: Average of predicted pause duration values.
µ: Average prediction error.
σ: Standard deviation of the prediction error.
γx,y: Correlation coefficient.
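The objective measures can be computed as in the sketch below. It assumes that the prediction error is the absolute difference between actual and predicted duration, that σ is the population standard deviation of those errors, and that γx,y is the Pearson correlation; the slides do not spell out these definitions, so treat them as assumptions.

```python
from math import sqrt
from statistics import mean, pstdev
from typing import Dict, Sequence


def objective_measures(actual: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Mean actual/predicted duration, error statistics and correlation (ms)."""
    x_bar = mean(actual)
    y_bar = mean(predicted)
    # Assumed error definition: absolute deviation per pause.
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    cov = sum((a - x_bar) * (p - y_bar) for a, p in zip(actual, predicted))
    var_x = sum((a - x_bar) ** 2 for a in actual)
    var_y = sum((p - y_bar) ** 2 for p in predicted)
    return {
        "x_bar": x_bar,
        "y_bar": y_bar,
        "mu": mean(errors),
        "sigma": pstdev(errors),
        "gamma": cov / sqrt(var_x * var_y),
    }
```

The correlation is undefined when either series is constant, so a production version would need to guard against a zero denominator.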

SCATTER PLOTS OF CART, FFNN, SVM AND ELM
MODELS
[Figure: Scatter plots of actual pause duration (in ms) versus predicted pause duration (in ms) for (a) CART, (b) FFNN, (c) SVM and (d) ELM; both axes range from 0 to 500 ms.]

MULTI-STAGE PAUSE
DURATION PREDICTION

RESULTS OF THE CLASSIFICATION OF PAUSE BASED
ON LIMITED INTERVAL
Model  Pause type  Recall  Precision  F1
CART   long        0.73    0.62       0.67
CART   medium      0.50    0.52       0.51
CART   short       0.53    0.63       0.58
FFNN   long        0.72    0.65       0.68
FFNN   medium      0.54    0.57       0.55
FFNN   short       0.62    0.59       0.60
SVM    long        0.68    0.65       0.66
SVM    medium      0.51    0.58       0.54
SVM    short       0.61    0.58       0.59
ELM    long        0.72    0.58       0.64
ELM    medium      0.55    0.52       0.53
ELM    short       0.65    0.62       0.63
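Labelling pauses by limited intervals amounts to binning the measured duration. The slides do not state the interval boundaries, so the 150 ms and 250 ms thresholds below are hypothetical placeholders chosen only to make the sketch concrete.

```python
def pause_type(duration_ms: float, short_max: float = 150.0, medium_max: float = 250.0) -> str:
    """Bin a pause duration into short / medium / long.

    short_max and medium_max are hypothetical boundaries, not values
    taken from the corpus analysis.
    """
    if duration_ms < 0:
        raise ValueError("duration must be non-negative")
    if duration_ms <= short_max:
        return "short"
    if duration_ms <= medium_max:
        return "medium"
    return "long"
```

Labels produced this way serve as training targets for the short / medium / long classifier.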

ACCURACY OF CLASSIFYING A PAUSE INTO ONE OF THE
PAUSE TYPES: SHORT, MEDIUM AND LONG

RESULTS FOR PREDICTING PAUSE DURATION
Table: Performance of the multi-model framework using CART, FFNN,
SVM and ELM models based on objective measures
Model  Pause type  x̄ (in ms)  ȳ (in ms)  µ (in ms)  σ (in ms)  γx,y
CART   Long        347.96     372.92     78.46      57.31      0.73
CART   Medium      208.43     199.30     26.19      17.02      0.67
CART   Short       87.33      84.99      24.51      17.81      0.72
CART   Overall     181.16     184.18     39.23      40.17      0.70
FFNN   Long        350.96     342.70     62.59      46.87      0.67
FFNN   Medium      203.15     195.05     25.87      18.92      0.75
FFNN   Short       87.32      82.55      28.29      17.69      0.68
FFNN   Overall     148.98     143.13     33.70      28.39      0.73
SVM    Long        353.16     299.10     72.40      63.75      0.73
SVM    Medium      203.15     190.98     29.96      16.79      0.52
SVM    Short       90.70      77.41      22.96      17.88      0.63
SVM    Overall     189.49     164.88     38.50      42.85      0.65
ELM    Long        353.16     340.83     69.32      44.51      0.66
ELM    Medium      203.15     195.09     27.35      14.64      0.65
ELM    Short       90.38      80.91      22.34      16.24      0.81
ELM    Overall     190        180.02     36.85      34.19      0.71

SCATTER PLOTS OF MULTI MODEL FRAMEWORK
[Figure: Scatter plots for the multi-model framework: (e) CART, (f) FFNN, (g) SVM and (h) ELM.]

ANALYSIS OF PAUSE MODEL BASED ON SINGLE AND
MULTI MODEL FRAMEWORK
Figure: Average prediction error (in ms)

UNSUPERVISED PAUSE
POSITION PREDICTION
MODEL

UNSUPERVISED FEATURE EXTRACTION
METHODOLOGY
[Figure: Feature extraction method. Blocks shown: story speech corpus, unique words, most frequently occurring words, feature extraction (m × n matrix), SVD, dictionary (m × r matrix).]
m = 3579 unique words.
n = 300 most frequently occurring words.
r = 50, the reduced dimension of the co-occurrence matrix.
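The reduction from an m × n co-occurrence matrix to m × r word features can be sketched with NumPy's SVD. The slides give only m, n and r; the ±2-word context window, the raw counting scheme and the function names below are assumptions made for illustration.

```python
import numpy as np


def cooccurrence_matrix(sentences, vocab, context_words, window=2):
    """Count how often each vocab word occurs near each context word."""
    row = {w: i for i, w in enumerate(vocab)}
    col = {w: j for j, w in enumerate(context_words)}
    counts = np.zeros((len(vocab), len(context_words)))
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i and sent[j] in col:
                    counts[row[word], col[sent[j]]] += 1
    return counts


def reduce_with_svd(matrix, r):
    """Keep the r leading singular directions: returns an (m, r) array."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :r] * s[:r]


# Toy corpus (illustrative tokens only).
sentences = [["raja", "ne", "kaha"], ["raja", "ne", "suna"]]
vocab = sorted({w for s in sentences for w in s})
context_words = ["raja", "ne"]
counts = cooccurrence_matrix(sentences, vocab, context_words)
features = reduce_with_svd(counts, r=1)
```

With the dimensions on the slide, `counts` would be 3579 × 300 and `features` 3579 × 50, so each word gets a 50-dimensional vector that can be looked up in the dictionary.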


RESULTS
Table: Performance measures of unsupervised data-driven models:
CART, FFNN, SVM and ELM for pause position prediction
Model  Class      Recall  Precision  F1
CART   Non-pause  0.86    0.94       0.89
CART   Pause      0.79    0.58       0.66
FFNN   Non-pause  0.85    0.91       0.88
FFNN   Pause      0.82    0.62       0.70
SVM    Non-pause  0.81    0.88       0.84
SVM    Pause      0.77    0.68       0.72
ELM    Non-pause  0.84    0.90       0.86
ELM    Pause      0.82    0.63       0.71

MODELING OF PAUSES
BASED ON DISCOURSE
MODES

DISCOURSE MODE
Three discourse modes of a story are considered:
Descriptive: 547 sentences
Dialogue: 279 sentences
Narrative: 1134 sentences

STATISTICS OF THE PAUSES FOR VARIOUS MODES OF
DISCOURSE
Table: Statistics of the pauses for various modes of discourse based on
limited intervals.
Mode         Pause type    Mean (ms)  StdDev (ms)  % in original
Descriptive  Long pause    447.59     240.18       6.15
Descriptive  Medium pause  116.99     52.59        6.94
Descriptive  Short pause   128.68     59.73        6.96
Narrative    Long pause    413.00     179.89       13.14
Narrative    Medium pause  197.61     27.89        11.48
Narrative    Short pause   93.97      30.19        23.84
Dialogue     Long pause    492.28     211.00       4.85
Dialogue     Medium pause  196.44     27.22        2.44
Dialogue     Short pause   92.24      28.97        4.20

HISTOGRAM PLOTS
[Figure: Histograms of pause duration (in ms, 0 to 700) against frequency for (a) Descriptive mode, (b) Narrative mode and (c) Dialogue mode.]

PAUSE PREDICTION MODEL
[Figure: Classifying story text into three modes of discourse. The story text from the story-speech corpus is divided into descriptive, narrative and dialogue modes.]

[Figure: Proposed pause prediction model. First stage (pause position prediction model): pause / non-pause decision. Second stage (pause duration prediction model): short / medium / long pause classification, followed by separate short, medium and long pause duration predictors.]

ACCURACY OF FIRST STAGE OF PAUSE POSITION
PREDICTION MODEL
Table: Performance of CART model for predicting pause position
Mode         Class      Recall  Precision  F1
Descriptive  Non-pause  0.978   0.837      0.902
Descriptive  Pause      0.454   0.88       0.60
Dialogue     Non-pause  0.952   0.872      0.91
Dialogue     Pause      0.569   0.793      0.663
Narrative    Non-pause  0.953   0.856      0.902
Narrative    Pause      0.552   0.806      0.655

ACCURACY OF THE PAUSE POSITION PREDICTION
MODEL
Descriptive Mode: 72%
Dialogue Mode: 76.05%
Narrative Mode: 75.25%

ACCURACY OF SECOND STAGE OF PAUSE PREDICTION
MODEL
Table: Performance of CART for long, medium and short pause
classification
Mode         Pause type  Recall  Precision  F1
Descriptive  long        0.56    0.46       0.51
Descriptive  medium      0.48    0.39       0.43
Descriptive  short       0.56    0.47       0.50
Dialogue     long        0.50    0.72       0.59
Dialogue     medium      0.53    0.65       0.58
Dialogue     short       0.40    0.55       0.46
Narrative    long        0.37    0.48       0.42
Narrative    medium      0.53    0.46       0.49
Narrative    short       0.73    0.46       0.56

ACCURACY OF CLASSIFYING A PAUSE INTO ONE OF THE
PAUSE TYPES: SHORT, MEDIUM AND LONG
Descriptive Mode: 53%
Dialogue Mode: 48%
Narrative Mode: 54%

ACCURACY OF THIRD STAGE OF PAUSE PREDICTION
MODEL
Table: Performance of CART for pause duration prediction
Mode         Model        x̄ (in ms)  ȳ (in ms)  µ (in ms)  σ (in ms)  γx,y
Descriptive  CART long    468.24     486.93     90.79      90.92      0.58
Descriptive  CART medium  201.03     189.97     24.59      18.95      0.56
Descriptive  CART short   88.26      93.48      34.86      12.37      0.53
Descriptive  Overall      228.84     234.72     40.42      41.42      0.55
Narrative    CART long    402.87     397.99     74.90      77.26      0.69
Narrative    CART medium  198.09     198.87     10.60      10.79      0.66
Narrative    CART short   93.73      92.69      9.65       9.40       0.69
Narrative    Overall      244.98     242.71     44.30      37.04      0.68
Dialogue     CART long    472.47     463.31     60.96      61.93      0.60
Dialogue     CART medium  193.31     200.33     12.02      10.01      0.75
Dialogue     CART short   87.56      89.03      13.74      12.07      0.66
Dialogue     Overall      214.52     214.60     23.70      22.92      0.77

IDEAL CASE: ACCURACY OF THIRD STAGE OF PAUSE
PREDICTION MODEL
Table: Performance of CART for pause duration prediction
Mode         Model        x̄ (in ms)  ȳ (in ms)  µ (in ms)  σ (in ms)  γx,y
Descriptive  CART long    468.24     486.93     80.89      100.92     0.78
Descriptive  CART medium  201.03     189.97     14.39      12.95      0.76
Descriptive  CART short   88.26      93.48      13.76      7.37       0.73
Descriptive  Overall      228.84     234.72     34.48      37.24      0.75
Narrative    CART long    402.87     397.99     74.90      77.26      0.69
Narrative    CART medium  198.09     198.87     10.60      9.79       0.66
Narrative    CART short   93.73      92.69      9.65       7.14       0.69
Narrative    Overall      244.98     242.71     37.16      37.04      0.68
Dialogue     CART long    472.47     463.31     52.96      61.93      0.71
Dialogue     CART medium  193.31     200.33     7.02       7.01       0.85
Dialogue     CART short   87.56      89.03      9.74       10.07      0.76
Dialogue     Overall      214.52     214.60     20.41      22.92      0.77

ANALYSIS OF PAUSE PREDICTION MODEL IN
DISCOURSE MODE
Figure: Average Prediction Error (in ms)

DURATION MODELING
FOR STORYTELLING STYLE
SPEECH

FEATURES USED FOR TRAINING
1 Positional Features (Baseline):
Position of the current syllable from the beginning and
Position of a syllable in the word.
Position of the vowel in the syllable.
Syllable Identity: Segments of the syllable (consonants and
vowels) for current syllable
Syllable Identity of previous two syllables and following
two syllables.
2 Structural Features (Baseline):
Total number of words in the utterance.
Total number of phones in the utterance.
Total number of syllables in the utterance.
Total number of syllables in the current word, previous two
words and following two words.
Total number of phones in the current word, previous two
and following two words.

FEATURES USED FOR TRAINING CONTD..
2 Structural Features (Baseline Contd..):
Number of segments (i.e. consonants) before the nucleus
(i.e. vowel) in the syllable.
Number of segments after the nucleus (i.e. vowel) in the
syllable.
3 Story-specific Features
Emotion (sad, anger, happy, fear) of the current word in the
utterance.
Genre of the story (fable, legendary, folk-tales)
Whether the word is a content or functional word.
Whether the word is stressed or not.
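Categorical story-specific features need a numeric encoding before they can be appended to the baseline feature vector. A minimal one-hot sketch, assuming the emotion and genre inventories listed above; the function names and the all-zero encoding for a word with none of the four emotions are assumptions.

```python
from typing import List, Sequence

EMOTIONS = ("sad", "anger", "happy", "fear")
GENRES = ("fable", "legendary", "folk-tale")


def one_hot(value: str, categories: Sequence[str]) -> List[int]:
    """1 at the matching category, 0 elsewhere (all zeros if no match)."""
    return [1 if value == c else 0 for c in categories]


def story_specific_features(
    emotion: str, genre: str, is_content_word: bool, is_stressed: bool
) -> List[int]:
    """Concatenate emotion, genre, content-word and stress indicators."""
    return (
        one_hot(emotion, EMOTIONS)
        + one_hot(genre, GENRES)
        + [int(is_content_word), int(is_stressed)]
    )
```

For instance, a stressless content word tagged happy in a fable is encoded as `[0, 0, 1, 0, 1, 0, 0, 1, 0]`.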

RESULTS
Table: Performance of CART model for predicting the syllable
duration
Model           x̄ (in ms)  ȳ (in ms)  µ (in ms)  σ (in ms)  γx,y
Baseline        208.97     212        56.02      52.79      0.58
Story-specific  208.97     211.91     46.72      39.72      0.70
x̄: Average of actual syllable duration values.
ȳ: Average of predicted syllable duration values.
µ: Average prediction error.
σ: Standard deviation of the prediction error.
γx,y: Correlation coefficient.

PROPOSED METHOD
[Figure: A story is categorized by genre: fable, folk-tale or legendary.]

RESULTS CONTD..
Table: Accuracy of prediction by CART model based on story genre
Model      x̄ (in ms)  ȳ (in ms)  µ (in ms)  σ (in ms)  γx,y
Fable      209.97     208.89     38.20      31.70      0.80
Folk-tale  204.86     209.71     36.88      31.06      0.77
Legendary  212.58     209.57     37.81      39.51      0.83
Overall    209.13     209.39     37.63      34.09      0.80

SUMMARY AND CONCLUSION
1. Modeling of pauses using FFNN, SVM and ELM was
carried out.
2. Unsupervised pause position prediction was proposed.
3. Modeling of pauses based on discourse modes was studied.
4. Duration modeling for storytelling style speech was carried out.

FUTURE WORK
In future, the current work will be extended to include the
following:
1. Subjective listening tests need to be carried out for the
proposed pause prediction model.
2. Modeling of word prominence for story text.
3. Analysis and modeling of pitch for storytelling style
speech based on the three modes of discourse.
4. Analysis and modeling of intensity for storytelling style
speech.

Acknowledgments
The author would like to thank the Department of
Information Technology, Government of India, for funding
the project "Development of Text-to-Speech synthesis for Indian
Languages Phase II", Ref. no. 11(7)/2011HCC(TDIL). The author
would also like to thank all the DAC committee members, the
supervisor, and all seminar attendees.

DISSEMINATION OF RESEARCH
Conference:
1. Parakrant Sarkar, K. Sreenivasa Rao, “Analysis and Modeling Pauses for
Synthesis of Storytelling Speech based on Discourse modes”, in Proceedings of the
IEEE International Conference on Contemporary Computing (IC3 2015), JIIT Noida,
India, 11-13 August.
2. Parakrant Sarkar, K. Sreenivasa Rao, “Data-Driven Pause Prediction for
Synthesis of Storytelling Style Speech based on Discourse Modes”, in Proceedings
of the IEEE International Conference on Electronics, Computing and Communication
Technologies (CONECCT 2015), IIIT Bangalore, India, 10-11 July.
3. Parakrant Sarkar, K. Sreenivasa Rao, “Modeling Pauses for Synthesis of
Storytelling Style Speech Using Unsupervised Word Features”, Procedia Computer
Science, Volume 58, 2015, pages 42-49, 10-13 Aug 2015.
4. Parakrant Sarkar, K. Sreenivasa Rao, “Data-driven pause prediction for speech
synthesis in storytelling style speech”, in 2015 Twenty First National Conference on
Communications (NCC), pages 1-5, IIT Bombay, 27 Feb. - 1 Mar. 2015.
Journal:
1. Parakrant Sarkar, K. Sreenivasa Rao, “Modeling of pauses for Storytelling Style
Speech Synthesis”, Computer Speech and Language [Under Revision]

REFERENCES I
[1] P. Taylor and A. W. Black, “Assigning phrase breaks from part-of-speech
sequences,” Computer Speech & Language, vol. 12, no. 2, pp. 99–117, 1998.
[2] P. Zervas, M. Maragoudakis, N. Fakotakis, and G. Kokkinakis, “Bayesian
Induction of Intonational Phrase Breaks,” Eurospeech, 2003.
[3] K. Yoon, “A Prosodic Phrasing Model for a Korean Text-to-speech Synthesis
System,” Computer Speech & Language, vol. 20, no. 1, pp. 69–79, 2006.
[4] S. Kim, J. Lee, B. Kim, and G. G. Lee, “Incorporating second-order information into
two-step major phrase break prediction for Korean,” in INTERSPEECH 2006 -
ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, PA,
USA, September 17-21, 2006.
[5] A. Parlikar and A. W. Black, “A grammar based approach to style specific phrase
prediction,” in Interspeech, 2011, pp. 2149–2152.
[6] A. Vadapalli, P. Bhaskararao, and K. Prahallad, “Significance of word-terminal
syllables for prediction of phrase breaks in Text-to-Speech systems for Indian
Languages,” in 8th ISCA Speech Synthesis Workshop. Barcelona, Spain: ISCA,
August 31–September 2, 2013, pp. 189–194.
[7] N. S. Krishna and H. A. Murthy, “A New Prosodic Phrasing Model for Indian
Language Telugu,” in INTERSPEECH. ISCA, 2004.

REFERENCES II
[8] K. Ghosh and K. Sreenivasa Rao, “Data-Driven Phrase Break Prediction for Bengali
Text-to-Speech System,” in Contemporary Computing - 5th International Conference,
IC3 2012, Noida, India, August 6-8, 2012. Proceedings, ser. Communications in
Computer and Information Science. Springer Berlin Heidelberg, 2012, vol. 306,
pp. 118–129.
[9] A. W. Black and P. Taylor, “The Festival Speech Synthesis System: System
Documentation,” Human Communication Research Centre, University of
Edinburgh, Scotland, UK, Tech. Rep. HCRC/TR-83, 1997.

Thank You
