Work Done till First Seminar Work Done after First Seminar Conclusion Future Work Publications References
Prosody Modeling for Synthesis of
Storytelling Style Speech
Second Seminar
by
Parakrant Sarkar
Roll No: 12IT72P08
Under the Supervision of
Dr. K. Sreenivasa Rao
Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur
March 4, 2016
OUTLINE
1. Work Done till First Seminar
2. Work Done after First Seminar
2.1 Modeling of pauses using FFNN, SVM and ELM
2.2 Unsupervised Pause Position Prediction
2.3 Modeling of pauses based on Discourse modes
2.4 Duration Modeling for Storytelling Style Speech
3. Conclusion
4. Future Work
5. Publications
6. References
WORK DONE TILL FIRST
SEMINAR
LIST OF PROBLEMS ADDRESSED
1. A Hindi story synthesis framework was proposed, comprising:
SSED: Story-specific Emotion Detection Module
SSPG: Story-specific Prosody Generation Module
SSPI: Story-specific Prosody Incorporation Module
2. A three-stage pause prediction model was proposed, considering:
Position of the pause.
Duration of the pause.
PROPOSED PAUSE PREDICTION MODEL
Input: story text and story speech corpus.
Pause Position Prediction Model (first stage): pause / non-pause.
Pause Duration Prediction Model (second stage): short / medium / long pause,
followed by a separate duration predictor for each pause type (short, medium and long).
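The staged flow above can be sketched in a few lines of Python. This is a minimal illustration only, not the models used in this work: `is_pause`, `pause_type` and the per-type duration predictors are hypothetical callables standing in for trained models, and the feature name and toy values are placeholders.

```python
from typing import Callable, Dict, Mapping, Optional

Features = Mapping[str, float]


def predict_pause(
    features: Features,
    is_pause: Callable[[Features], bool],
    pause_type: Callable[[Features], str],
    duration_models: Dict[str, Callable[[Features], float]],
) -> Optional[float]:
    """Return a predicted pause duration in ms, or None when no pause is predicted."""
    # Stage 1: pause / non-pause decision after the current word.
    if not is_pause(features):
        return None
    # Stage 2: coarse pause type ("short", "medium" or "long").
    kind = pause_type(features)
    # Stage 3: duration predictor dedicated to that pause type.
    return duration_models[kind](features)


# Toy stand-ins for trained models (hypothetical rules and values, illustration only).
toy_duration_models = {
    "short": lambda f: 90.0,
    "medium": lambda f: 200.0,
    "long": lambda f: 400.0,
}
toy_is_pause = lambda f: f["pos_from_end"] == 1   # pause only after the last word
toy_pause_type = lambda f: "long"                 # always choose the long-pause predictor

print(predict_pause({"pos_from_end": 1}, toy_is_pause, toy_pause_type, toy_duration_models))
```

Keeping one duration predictor per pause type means each predictor only has to cover a narrow range of durations, which is the motivation for the multi-stage design.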
WORK DONE AFTER
FIRST SEMINAR
LIST OF PROBLEMS:
1. Modeling of pauses using FFNN, SVM and ELM.
2. Unsupervised Pause Position Prediction.
3. Modeling of pauses based on Discourse modes.
4. Duration Modeling for Storytelling Style Speech.
STORY SPEECH CORPUS
100 stories collected: Panchatantra and Akbar-Birbal.
# sentences per story: 20-25
# total words: 24400
Duration of the speech corpus: 3 hours (approx.)
PREDICTION OF PAUSE
POSITION
LIST OF FEATURES
1. Positional features:
Position of the current word from the beginning and
ending of the utterance.
Total number of words in the utterance.
2. Structural features:
Total number of phones in the current word, previous two
and following two words.
Total number of syllables in the current word, previous two
words and following two words.
Total number of phones in the utterance.
3. Morphological features:
Part-of-Speech (POS) of current word, previous two words
and following two words.
4. Story-semantic features:
Emotion associated with the current word.
Phonetic strength of current word.
Genre of the story.
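As a rough illustration of how the positional and context-window features listed above might be assembled for one word, here is a minimal sketch. The function names, the zero padding at utterance boundaries and the sample values are assumptions made for this example; the slides do not specify them.

```python
from typing import Dict, List


def positional_features(words: List[str], index: int) -> Dict[str, int]:
    """Positional features for the word at `index` within one utterance."""
    total = len(words)
    return {
        "pos_from_start": index + 1,    # 1-based position from the beginning
        "pos_from_end": total - index,  # 1-based position from the end
        "total_words": total,
    }


def context_window(values: List[int], index: int, width: int = 2, pad: int = 0) -> List[int]:
    """Values for the previous `width` words, the current word and the next `width` words.

    Positions that fall outside the utterance are filled with `pad` (an assumption).
    """
    return [
        values[i] if 0 <= i < len(values) else pad
        for i in range(index - width, index + width + 1)
    ]


words = ["w1", "w2", "w3", "w4", "w5"]   # placeholder tokens
phones_per_word = [3, 5, 2, 4, 6]        # hypothetical phone counts
print(positional_features(words, 1))
print(context_window(phones_per_word, 0))
```

The same `context_window` helper would serve for syllable counts or POS tag identifiers of the previous two and following two words.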
RESULTS FOR PREDICTING THE PAUSE POSITION
Table: Performance of data-driven models (CART, FFNN, SVM and ELM) for pause position prediction.
Model  Class      Recall  Precision  F1
CART   Non-pause  0.89    0.94       0.91
CART   Pause      0.68    0.81       0.74
FFNN   Non-pause  0.90    0.94       0.92
FFNN   Pause      0.71    0.83       0.77
SVM    Non-pause  0.91    0.93       0.92
SVM    Pause      0.78    0.81       0.79
ELM    Non-pause  0.89    0.92       0.90
ELM    Pause      0.71    0.82       0.76
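The F1 column is the harmonic mean of precision and recall. A short check, using the standard definition and the CART pause-class values from the table above:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# CART, pause class: precision 0.81, recall 0.68
print(round(f1_score(0.81, 0.68), 2))  # → 0.74
```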
ACCURACY OF THE PAUSE POSITION PREDICTION
MODEL
Prediction of Pause
Duration
FEATURES USED FOR DETERMINING THE PAUSE
DURATION
1. Morphological features:
Terminal syllable of the current word, previous two and
following two words.
2. Structural features:
Position of the vowel in the terminal syllable.
Number of segments (i.e., consonants) before and after the
nucleus (i.e., vowel) in the terminal syllable.
3. Positional features:
Total number of phones in the terminal syllable of the
current word, previous two and following two words.
Position of the current word from the beginning and
ending of the utterance.
PERFORMANCE OF CART, FFNN, SVM AND ELM MODELS FOR PREDICTING THE PAUSE DURATION
Model  x̄ (in ms)  ȳ (in ms)  µ (in ms)  σ (in ms)  γx,y
CART   241.60     247.21     107.12     155.87     0.67
FFNN   241.60     251.25     75.13      70.68      0.87
SVM    241.60     251.70     117.69     138.28     0.71
ELM    241.60     245.98     107.25     110.11     0.67
x̄: Average of actual pause duration values.
ȳ: Average of predicted pause duration values.
µ: Average prediction error.
σ: Standard deviation of the prediction error.
γx,y: Correlation coefficient.
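The objective measures defined above can be computed as in the sketch below. The slides do not give formulas, so this assumes that µ is the mean absolute error, σ is the standard deviation of the absolute errors, and γx,y is the Pearson correlation between actual and predicted durations. The sample durations are made up for the example.

```python
import math
from typing import Dict, Sequence


def objective_measures(actual: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Mean of actual/predicted values, mean and std of absolute error, and correlation."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(actual)
    mean_x = sum(actual) / n
    mean_y = sum(predicted) / n
    # Assumption: prediction error is the absolute deviation |actual - predicted|.
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    mu = sum(abs_errors) / n
    sigma = math.sqrt(sum((e - mu) ** 2 for e in abs_errors) / n)
    # Pearson correlation coefficient between actual and predicted values.
    cov = sum((a - mean_x) * (p - mean_y) for a, p in zip(actual, predicted))
    var_x = sum((a - mean_x) ** 2 for a in actual)
    var_y = sum((p - mean_y) ** 2 for p in predicted)
    gamma = cov / math.sqrt(var_x * var_y) if var_x > 0 and var_y > 0 else float("nan")
    return {"mean_actual": mean_x, "mean_predicted": mean_y,
            "mu": mu, "sigma": sigma, "gamma": gamma}


# Hypothetical durations in ms, for illustration only.
measures = objective_measures([100.0, 200.0, 300.0], [110.0, 190.0, 320.0])
print(measures)
```

A low µ together with a high γx,y indicates that predictions are both close to and well ordered with respect to the actual durations.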
SCATTER PLOTS OF CART, FFNN, SVM AND ELM MODELS
Figure: Scatter plots of predicted pause duration (in ms) against actual pause duration (in ms), both axes from 0 to 500 ms, for (a) CART, (b) FFNN, (c) SVM and (d) ELM.
MULTI-STAGE PAUSE
DURATION PREDICTION
RESULTS OF THE CLASSIFICATION OF PAUSE BASED ON LIMITED INTERVAL
Model  Pause type  Recall  Precision  F-1
CART   long        0.73    0.62       0.67
CART   medium      0.50    0.52       0.51
CART   short       0.53    0.63       0.58
FFNN   long        0.72    0.65       0.68
FFNN   medium      0.54    0.57       0.55
FFNN   short       0.62    0.59       0.60
SVM    long        0.68    0.65       0.66
SVM    medium      0.51    0.58       0.54
SVM    short       0.61    0.58       0.59
ELM    long        0.72    0.58       0.64
ELM    medium      0.55    0.52       0.53
ELM    short       0.65    0.62       0.63
ACCURACY OF CLASSIFYING A PAUSE INTO ONE OF THE
PAUSE TYPES: SHORT, MEDIUM AND LONG
RESULTS FOR PREDICTING PAUSE DURATION
Table: Performance of the multi-model framework using CART, FFNN, SVM and ELM models, based on objective measures.
Model  Pause type  x̄ (in ms)  ȳ (in ms)  µ (in ms)  σ (in ms)  γx,y
CART   Long        347.96     372.92     78.46      57.31      0.73
CART   Medium      208.43     199.30     26.19      17.02      0.67
CART   Short       87.33      84.99      24.51      17.81      0.72
CART   Overall     181.16     184.18     39.23      40.17      0.70
FFNN   Long        350.96     342.70     62.59      46.87      0.67
FFNN   Medium      203.15     195.05     25.87      18.92      0.75
FFNN   Short       87.32      82.55      28.29      17.69      0.68
FFNN   Overall     148.98     143.13     33.70      28.39      0.73
SVM    Long        353.16     299.10     72.40      63.75      0.73
SVM    Medium      203.15     190.98     29.96      16.79      0.52
SVM    Short       90.70      77.41      22.96      17.88      0.63
SVM    Overall     189.49     164.88     38.50      42.85      0.65
ELM    Long        353.16     340.83     69.32      44.51      0.66
ELM    Medium      203.15     195.09     27.35      14.64      0.65
ELM    Short       90.38      80.91      22.34      16.24      0.81
ELM    Overall     190        180.02     36.85      34.19      0.71
SCATTER PLOTS OF MULTI MODEL FRAMEWORK
Figure: Scatter plots of predicted pause duration (in ms) against actual pause duration (in ms) for the multi-model framework: (e) CART, (f) FFNN, (g) SVM and (h) ELM.
ANALYSIS OF PAUSE MODEL BASED ON SINGLE AND
MULTI MODEL FRAMEWORK
Figure: Average prediction error (in ms)
UNSUPERVISED PAUSE
POSITION PREDICTION
MODEL
UNSUPERVISED FEATURE EXTRACTION METHODOLOGY
Figure: Feature extraction method. From the story speech corpus, the unique words and the most frequently occurring words are collected; feature extraction builds an m × n co-occurrence matrix, which SVD reduces to an m × r dictionary.
m = 3579 unique words.
n = 300 most frequently occurring words.
r = 50, the reduced dimension of the co-occurrence matrix.
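A minimal sketch of this kind of feature extraction, using NumPy, is given below. It is an illustration under stated assumptions rather than the exact procedure used here: the context window of one word on each side, the raw-count weighting and the scaling of the left singular vectors by the singular values are all assumptions, and the toy corpus is made up.

```python
from typing import List

import numpy as np


def cooccurrence_matrix(sentences: List[List[str]], vocab: List[str],
                        context_words: List[str], window: int = 1) -> np.ndarray:
    """Count how often each vocabulary word appears near each frequent context word."""
    row = {w: i for i, w in enumerate(vocab)}          # m unique words
    col = {w: j for j, w in enumerate(context_words)}  # n most frequent words
    counts = np.zeros((len(vocab), len(context_words)))
    for sent in sentences:
        for i, word in enumerate(sent):
            if word not in row:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i and sent[j] in col:
                    counts[row[word], col[sent[j]]] += 1
    return counts


def reduce_with_svd(counts: np.ndarray, r: int) -> np.ndarray:
    """Project the m x n count matrix onto its top-r singular directions (m x r)."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :r] * s[:r]


# Toy corpus: 3 unique words, 2 context words, reduced to r = 1.
sentences = [["a", "b", "c"], ["b", "c", "a"]]
counts = cooccurrence_matrix(sentences, vocab=["a", "b", "c"], context_words=["a", "b"])
features = reduce_with_svd(counts, r=1)
print(counts)
print(features.shape)
```

With the sizes quoted above, `counts` would be 3579 × 300 and `reduce_with_svd(counts, 50)` would give one 50-dimensional feature vector per unique word.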
RESULTS
Table: Performance measures of unsupervised data-driven models (CART, FFNN, SVM and ELM) for pause position prediction.
Model  Class      Recall  Precision  F1
CART   Non-pause  0.86    0.94       0.89
CART   Pause      0.79    0.58       0.66
FFNN   Non-pause  0.85    0.91       0.88
FFNN   Pause      0.82    0.62       0.70
SVM    Non-pause  0.81    0.88       0.84
SVM    Pause      0.77    0.68       0.72
ELM    Non-pause  0.84    0.90       0.86
ELM    Pause      0.82    0.63       0.71
MODELING OF PAUSES
BASED ON DISCOURSE
MODES
DISCOURSE MODE
Three discourse modes of a story are considered:
Descriptive: 547 sentences
Dialogue: 279 sentences
Narrative: 1134 sentences
STATISTICS OF THE PAUSES FOR VARIOUS MODES OF DISCOURSE
Table: Statistics of the pauses for various modes of discourse, based on limited intervals.
Mode         Pause type  Mean (ms)  StdDev (ms)  % in original
Descriptive  Long        447.59     240.18       6.15
Descriptive  Medium      116.99     52.59        6.94
Descriptive  Short       128.68     59.73        6.96
Narrative    Long        413.00     179.89       13.14
Narrative    Medium      197.61     27.89        11.48
Narrative    Short       93.97      30.19        23.84
Dialogue     Long        492.28     211.00       4.85
Dialogue     Medium      196.44     27.22        2.44
Dialogue     Short       92.24      28.97        4.20
HISTOGRAM PLOTS
Figure: Histograms of pause duration (in ms, 0 to 700) against frequency for the (a) Descriptive, (b) Narrative and (c) Dialogue modes.
PAUSE PREDICTION MODEL
Figure: Classifying story text from the story-speech corpus into three modes of discourse: descriptive, dialogue and narrative.
Figure: Proposed pause prediction model. The Pause Position Prediction Model (first stage) decides pause / non-pause; the Pause Duration Prediction Model (second stage) assigns a short / medium / long pause type, and a separate duration predictor for each pause type predicts the pause duration.
ACCURACY OF FIRST STAGE OF PAUSE POSITION PREDICTION MODEL
Table: Performance of the CART model for predicting pause position.
Mode         Class      Recall  Precision  F-1 Score
Descriptive  Non-pause  0.978   0.837      0.902
Descriptive  Pause      0.454   0.88       0.60
Dialogue     Non-pause  0.952   0.872      0.91
Dialogue     Pause      0.569   0.793      0.663
Narrative    Non-pause  0.953   0.856      0.902
Narrative    Pause      0.552   0.806      0.655
ACCURACY OF THE PAUSE POSITION PREDICTION
MODEL
Descriptive Mode: 72%
Dialogue Mode: 76.05%
Narrative Mode: 75.25%
ACCURACY OF SECOND STAGE OF PAUSE PREDICTION MODEL
Table: Performance of CART for long, medium and short pause classification.
Mode         Pause type  Recall  Precision  F-1 Score
Descriptive  long        0.56    0.46       0.51
Descriptive  medium      0.48    0.39       0.43
Descriptive  short       0.56    0.47       0.50
Dialogue     long        0.50    0.72       0.59
Dialogue     medium      0.53    0.65       0.58
Dialogue     short       0.40    0.55       0.46
Narrative    long        0.37    0.48       0.42
Narrative    medium      0.53    0.46       0.49
Narrative    short       0.73    0.46       0.56
ACCURACY OF CLASSIFYING A PAUSE INTO ONE OF THE
PAUSE TYPES: SHORT, MEDIUM AND LONG
Descriptive Mode: 53%
Dialogue Mode: 48%
Narrative Mode: 54%
ACCURACY OF THIRD STAGE OF PAUSE PREDICTION MODEL
Table: Performance of CART for pause duration prediction.
Mode         Model        x̄ (in ms)  ȳ (in ms)  µ (in ms)  σ (in ms)  γx,y
Descriptive  CART long    468.24     486.93     90.79      90.92      0.58
Descriptive  CART medium  201.03     189.97     24.59      18.95      0.56
Descriptive  CART short   88.26      93.48      34.86      12.37      0.53
Descriptive  Overall      228.84     234.72     40.42      41.42      0.55
Narrative    CART long    402.87     397.99     74.90      77.26      0.69
Narrative    CART medium  198.09     198.87     10.60      10.79      0.66
Narrative    CART short   93.73      92.69      9.65       9.40       0.69
Narrative    Overall      244.98     242.71     44.30      37.04      0.68
Dialogue     CART long    472.47     463.31     60.96      61.93      0.60
Dialogue     CART medium  193.31     200.33     12.02      10.01      0.75
Dialogue     CART short   87.56      89.03      13.74      12.07      0.66
Dialogue     Overall      214.52     214.60     23.70      22.92      0.77
IDEAL CASE: ACCURACY OF THIRD STAGE OF PAUSE PREDICTION MODEL
Table: Performance of CART for pause duration prediction.
Mode         Model        x̄ (in ms)  ȳ (in ms)  µ (in ms)  σ (in ms)  γx,y
Descriptive  CART long    468.24     486.93     80.89      100.92     0.78
Descriptive  CART medium  201.03     189.97     14.39      12.95      0.76
Descriptive  CART short   88.26      93.48      13.76      7.37       0.73
Descriptive  Overall      228.84     234.72     34.48      37.24      0.75
Narrative    CART long    402.87     397.99     74.90      77.26      0.69
Narrative    CART medium  198.09     198.87     10.60      9.79       0.66
Narrative    CART short   93.73      92.69      9.65       7.14       0.69
Narrative    Overall      244.98     242.71     37.16      37.04      0.68
Dialogue     CART long    472.47     463.31     52.96      61.93      0.71
Dialogue     CART medium  193.31     200.33     7.02       7.01       0.85
Dialogue     CART short   87.56      89.03      9.74       10.07      0.76
Dialogue     Overall      214.52     214.60     20.41      22.92      0.77
ANALYSIS OF PAUSE PREDICTION MODEL IN
DISCOURSE MODE
Figure: Average Prediction Error (in ms)
DURATION MODELING
FOR STORYTELLING STYLE
SPEECH
FEATURES USED FOR TRAINING
1 Positional Features (Baseline):
Position of the current word from the beginning and
ending of the utterance.
Position of the current syllable from the beginning and
ending of the utterance.
Position of a syllable in the word.
Position of the vowel in the syllable.
Syllable Identity: Segments of the syllable (consonants and
vowels) for current syllable
Syllable Identity of previous two syllables and following
two syllables.
2 Structural Features (Baseline):
Total number of words in the utterance.
Total number of phones in the utterance.
Total number of syllables in the utterance.
Total number of syllables in the current word, previous two
words and following two words.
Total number of phones in the current word, previous two
words and following two words.
Number of segments (i.e. consonants) before the nucleus
(i.e. vowel) in the syllable.
Number of segments after the nucleus (i.e. vowel) in the
syllable.
3 Story-specific Features
Emotion (sad, anger, happy, fear) of the current word in the
utterance.
Genre of the story (fable, legendary, folk-tales)
Whether the word is a content or functional word.
Whether the word is stressed or not.
RESULTS
Table: Performance of the CART model for predicting the syllable duration.
Model           x̄ (in ms)  ȳ (in ms)  µ (in ms)  σ (in ms)  γx,y
Baseline        208.97     212        56.02      52.79      0.58
Story-specific  208.97     211.91     46.72      39.72      0.70
x̄: Average of actual syllable duration values.
ȳ: Average of predicted syllable duration values.
µ: Average prediction error.
σ: Standard deviation of the prediction error.
γx,y: Correlation coefficient.
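As a quick reading aid for the table above, the relative reduction in error obtained by adding the story-specific features can be computed directly from the reported values. The helper name is introduced only for this example.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Average prediction error µ: baseline 56.02 ms, story-specific 46.72 ms
print(round(relative_reduction(56.02, 46.72), 1))  # → 16.6
# Standard deviation σ: baseline 52.79 ms, story-specific 39.72 ms
print(round(relative_reduction(52.79, 39.72), 1))  # → 24.8
```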
PROPOSED METHOD
Figure: A story branching into three genres: fable, folk-tale and legendary.
RESULTS CONTD..
Table: Accuracy of prediction by the CART model based on story genre.
Genre      x̄ (in ms)  ȳ (in ms)  µ (in ms)  σ (in ms)  γx,y
Fable      209.97     208.89     38.20      31.70      0.80
Folk-tale  204.86     209.71     36.88      31.06      0.77
Legendary  212.58     209.57     37.81      39.51      0.83
Overall    209.13     209.39     37.63      34.09      0.80
SUMMARY AND CONCLUSION
1. Modeling of pauses using FFNN, SVM and ELM was carried out.
2. Unsupervised pause position prediction was proposed.
3. Modeling of pauses based on discourse modes was studied.
4. Duration modeling for storytelling style speech was carried out.
FUTURE WORK
In future, we will extend the current work to include the following:
1. Subjective listening tests for the proposed pause prediction model.
2. Modeling of word prominence for story text.
3. Analysis and modeling of pitch for storytelling style
speech based on the three modes of discourse.
4. Analysis and modeling of intensity for storytelling style
speech.
Acknowledgments
The author would like to thank the Department of
Information Technology, Government of India, for funding
the project "Development of Text-to-Speech synthesis for Indian
Languages Phase II", Ref. no. 11(7)/2011HCC(TDIL). The author
would also like to thank all the DAC committee members, the
supervisor, and all seminar attendees.
DISSEMINATION OF RESEARCH
Conference:
1. Parakrant Sarkar, K. Sreenivasa Rao, “Analysis and Modeling Pauses for
Synthesis of Storytelling Speech based on Discourse modes”, in Proceedings of the
IEEE International Conference on Contemporary Computing (IC3 2015), JIIT Noida,
India, 11-13 August 2015.
2. Parakrant Sarkar, K. Sreenivasa Rao, “Data-Driven Pause Prediction for
Synthesis of Storytelling Style Speech based on Discourse Modes”, in Proceedings
of the IEEE International Conference on Electronics, Computing and Communication
Technologies (CONECCT 2015), IIIT Bangalore, India, 10-11 July 2015.
3. Parakrant Sarkar, K. Sreenivasa Rao, “Modeling Pauses for Synthesis of
Storytelling Style Speech Using Unsupervised Word Features”, Procedia Computer
Science, Volume 58, 2015, pages 42-49, 10-13 Aug 2015.
4. P. Sarkar, K. S. Rao, “Data-driven pause prediction for speech synthesis in
storytelling style speech”, in 2015 Twenty First National Conference on
Communications (NCC), IIT Bombay, 27 Feb. - 1 Mar. 2015, pages 1-5.
Journal:
1. Parakrant Sarkar, K. Sreenivasa Rao, “Modeling of pauses for Storytelling Style
Speech Synthesis”, Computer Speech and Language [Under Revision]
Work Done till First Seminar Work Done after First Seminar Conclusion Future Work Publications References
REFERENCES I
[1] P. Taylor and A. W. Black, “Assigning phrase breaks from part-of-speech
sequences,” Computer Speech & Language, vol. 12, no. 2, pp. 99–117, 1998.
[2] P. Zervas, M. Maragoudakis, N. Fakotakis, and G. Kokkinakis, “Bayesian
Induction of Intonational Phrase Breaks,” Eurospeech, 2003.
[3] K. Yoon, “A Prosodic Phrasing Model for a Korean Text-to-speech Synthesis
System ,” Computer Speech & Language, vol. 20, no. 1, pp. 69 – 79, 2006.
[4] S. Kim, J. Lee, B. Kim, and G. G. Lee, “Incorporating second-order information into
two-step major phrase break prediction for korean,” in INTERSPEECH 2006 -
ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, PA,
USA, September 17-21, 2006, 2006.
[5] A. Parlikar and A. W. Black, “A grammar based approach to style specific phrase
prediction,” in Interspeech, 2011, pp. 2149–2152.
[6] A. Vadapalli, P. Bhaskararao, and K. Prahallad, “Significance of word-terminal
syllables for prediction of phrase breaks in Text-to-Speech systems for Indian
Languages,” in 8th ISCA Speech Synthesis Workshop. Barcelona, Spain: ISCA,
August 31– September 2 2013, pp. 189 – 194.
[7] N. S. Krishna and H. A. Murthy, “A New Prosodic Phrasing Model for Indian
Language Telugu,” in INTERSPEECH. ISCA, 2004.
Work Done till First Seminar Work Done after First Seminar Conclusion Future Work Publications References
REFERENCES II
[8] K. Ghosh and K. Sreenivasa Rao, “Data-Driven Phrase Break Prediction for Bengali
Text-to-Speech System,” in Contemporary Computing - 5th International Conference,
IC3 2012, Noida, India, August 6-8, 2012. Proceedings, ser. Communications in
Computer and Information Science. Springer Berlin Heidelberg, 2012, vol. 306,
pp. 118 – 129.
[9] A. W. Black and P. Taylor, “The Festival Speech Synthesis System: System
Documentation,” Human Communciation Research Centre, University of
Edinburgh, Scotland, UK, Tech. Rep. HCRC/TR-83, 1997.
Work Done till First Seminar Work Done after First Seminar Conclusion Future Work Publications References
Thank You

More Related Content

Similar to second_seminar

PhD-Thesis-ErhardRank
PhD-Thesis-ErhardRankPhD-Thesis-ErhardRank
PhD-Thesis-ErhardRankErhard Rank
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)inventionjournals
 
Sequence to sequence model speech recognition
Sequence to sequence model speech recognitionSequence to sequence model speech recognition
Sequence to sequence model speech recognitionAditya Kumar Khare
 
Voice biometric recognition
Voice biometric recognitionVoice biometric recognition
Voice biometric recognitionphyuhsan
 
PUNJABI SPEECH SYNTHESIS SYSTEM USING HTK
PUNJABI SPEECH SYNTHESIS SYSTEM USING HTKPUNJABI SPEECH SYNTHESIS SYSTEM USING HTK
PUNJABI SPEECH SYNTHESIS SYSTEM USING HTKijistjournal
 
PUNJABI SPEECH SYNTHESIS SYSTEM USING HTK
PUNJABI SPEECH SYNTHESIS SYSTEM USING HTKPUNJABI SPEECH SYNTHESIS SYSTEM USING HTK
PUNJABI SPEECH SYNTHESIS SYSTEM USING HTKijistjournal
 
IRJET- Pitch Detection Algorithms in Time Domain
IRJET- Pitch Detection Algorithms in Time DomainIRJET- Pitch Detection Algorithms in Time Domain
IRJET- Pitch Detection Algorithms in Time DomainIRJET Journal
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)ijceronline
 
International journal of signal and image processing issues vol 2015 - no 1...
International journal of signal and image processing issues   vol 2015 - no 1...International journal of signal and image processing issues   vol 2015 - no 1...
International journal of signal and image processing issues vol 2015 - no 1...sophiabelthome
 
3D Audio playback for single channel audio using visual cues
3D Audio playback for single channel audio using visual cues3D Audio playback for single channel audio using visual cues
3D Audio playback for single channel audio using visual cuesRamin Anushiravani
 
TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Beijing, Chengqing Zong, Casia...
TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Beijing, Chengqing Zong, Casia...TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Beijing, Chengqing Zong, Casia...
TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Beijing, Chengqing Zong, Casia...TAUS - The Language Data Network
 
AMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENT
AMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENTAMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENT
AMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENTNathan Mathis
 
Design and analysis of automotive muffler.pptx
Design and analysis of automotive muffler.pptxDesign and analysis of automotive muffler.pptx
Design and analysis of automotive muffler.pptxKRIPA SHNAKAR TIWARI
 

Similar to second_seminar (20)

PhD-Thesis-ErhardRank
PhD-Thesis-ErhardRankPhD-Thesis-ErhardRank
PhD-Thesis-ErhardRank
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
Sequence to sequence model speech recognition
Sequence to sequence model speech recognitionSequence to sequence model speech recognition
Sequence to sequence model speech recognition
 
Transformers
TransformersTransformers
Transformers
 
Voice biometric recognition
Voice biometric recognitionVoice biometric recognition
Voice biometric recognition
 
PUNJABI SPEECH SYNTHESIS SYSTEM USING HTK
PUNJABI SPEECH SYNTHESIS SYSTEM USING HTKPUNJABI SPEECH SYNTHESIS SYSTEM USING HTK
PUNJABI SPEECH SYNTHESIS SYSTEM USING HTK
 
PUNJABI SPEECH SYNTHESIS SYSTEM USING HTK
PUNJABI SPEECH SYNTHESIS SYSTEM USING HTKPUNJABI SPEECH SYNTHESIS SYSTEM USING HTK
PUNJABI SPEECH SYNTHESIS SYSTEM USING HTK
 
IRJET- Pitch Detection Algorithms in Time Domain
IRJET- Pitch Detection Algorithms in Time DomainIRJET- Pitch Detection Algorithms in Time Domain
IRJET- Pitch Detection Algorithms in Time Domain
 
MaryamNajafianPhDthesis
MaryamNajafianPhDthesisMaryamNajafianPhDthesis
MaryamNajafianPhDthesis
 
Automata
AutomataAutomata
Automata
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
International journal of signal and image processing issues vol 2015 - no 1...
International journal of signal and image processing issues   vol 2015 - no 1...International journal of signal and image processing issues   vol 2015 - no 1...
International journal of signal and image processing issues vol 2015 - no 1...
 
3D Audio playback for single channel audio using visual cues
3D Audio playback for single channel audio using visual cues3D Audio playback for single channel audio using visual cues
3D Audio playback for single channel audio using visual cues
 
TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Beijing, Chengqing Zong, Casia...
TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Beijing, Chengqing Zong, Casia...TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Beijing, Chengqing Zong, Casia...
TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Beijing, Chengqing Zong, Casia...
 
Digital Signal Processing.pdf
Digital Signal Processing.pdfDigital Signal Processing.pdf
Digital Signal Processing.pdf
 
AMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENT
AMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENTAMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENT
AMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENT
 
Design and analysis of automotive muffler.pptx
Design and analysis of automotive muffler.pptxDesign and analysis of automotive muffler.pptx
Design and analysis of automotive muffler.pptx
 
Mjfg now
Mjfg nowMjfg now
Mjfg now
 
D04812125
D04812125D04812125
D04812125
 
Choi's PHD Thesis
Choi's PHD ThesisChoi's PHD Thesis
Choi's PHD Thesis
 

second_seminar

  • 1. Work Done till First Seminar Work Done after First Seminar Conclusion Future Work Publications References Prosody Modeling for Synthesis of Storytelling Style Speech Second Seminar by Parakrant Sarkar Roll No: 12IT72P08 Under the Supervision of Dr. K.Sreenivasa Rao Department of Computer Science and Engineering Indian Institute of Technology Kharagpur March 4, 2016
  • 2. Work Done till First Seminar Work Done after First Seminar Conclusion Future Work Publications References OUTLINE 1. Work Done till First Seminar 2. Work Done after First Seminar 2.1 Modeling of pauses using FFNN, SVM and ELM 2.2 Unsupervised Pause Position Prediction 2.3 Modeling of pauses based on Discourse modes 2.4 Duration Modeling for Storytelling Style Speech 3. Conclusion 4. Future Work 5. Publications 6. References
  • 3. Work Done till First Seminar Work Done after First Seminar Conclusion Future Work Publications References WORK DONE TILL FIRST SEMINAR
  • 4. Work Done till First Seminar Work Done after First Seminar Conclusion Future Work Publications References LIST OF PROBLEMS ADDRESSED 1. Hindi Story Synthesis framework was proposed. SSED: Story-specific Emotion Detection Module SSPG: Story-specific Prosody Generation Module SSPI: Story-specific Prosody Incorporation Module 2. Three stage pause prediction model was proposed considering. Position of the pause. Duration of the pause.
  • 5. Work Done till First Seminar Work Done after First Seminar Conclusion Future Work Publications References PROPOSED PAUSE PREDICTION MODEL Pause/ Non-pause Short / Medium / Long Pause Short Pause Duration Predictor Medium Pause Duration Predictor Long Pause Duration Predictor Pause Position Prediction Model Pause Duration Prediction Model First Stage Second Stage Story TextStory speech corpus
  • 6. Work Done till First Seminar Work Done after First Seminar Conclusion Future Work Publications References WORK DONE AFTER FIRST SEMINAR
  • 7. Work Done till First Seminar Work Done after First Seminar Conclusion Future Work Publications References LIST OF PROBLEMS: 1. Modeling of pauses using FFNN, SVM and ELM. 2. Unsupervised Pause Position Prediction. 3. Modeling of pauses based on Discourse modes. 4. Duration Modeling for Storytelling Style Speech.
  • 8. Work Done till First Seminar Work Done after First Seminar Conclusion Future Work Publications References STORY SPEECH CORPUS 100 stories collected: Panchatantra and Akbar-Birbal. # sentences per story: 20-25 # total words: 24400 Duration of the speech corpus: 3 hours (approx.)
  • 9. Work Done till First Seminar Work Done after First Seminar Conclusion Future Work Publications References PREDICTION OF PAUSE POSITION
  • 10. Work Done till First Seminar Work Done after First Seminar Conclusion Future Work Publications References LIST OF FEATURES 1. Positional features: Position of the current word from the beginning and ending of the utterance. Total number of words in the utterance. 2. Structural features: Total number of phones in the current word, previous two and following two words. Total number of syllables in the current word, previous two words and following two words. Total number of phones in the utterance. 3. Morphological features: Part-of-Speech (POS) of current word, previous two words and following two words. 4. Story-semantic features Emotion associated with the current word. Phonetic strength of current word. Genre of the Story
  • 11. RESULTS FOR PREDICTING THE PAUSE POSITION
      Table: Performance of data-driven models (CART, FFNN, SVM and ELM) for pause position prediction.
      Model  Class      Recall  Precision  F1
      CART   Non-pause  0.89    0.94       0.91
      CART   Pause      0.68    0.81       0.74
      FFNN   Non-pause  0.90    0.94       0.92
      FFNN   Pause      0.71    0.83       0.77
      SVM    Non-pause  0.91    0.93       0.92
      SVM    Pause      0.78    0.81       0.79
      ELM    Non-pause  0.89    0.92       0.90
      ELM    Pause      0.71    0.82       0.76
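The F1 column is the harmonic mean of precision and recall. As a quick check against the table, a minimal sketch (the function name `f1_score` is chosen here for illustration and is not taken from the slides):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Pause class, SVM row of the table: recall 0.78, precision 0.81.
svm_pause_f1 = f1_score(precision=0.81, recall=0.78)
print(round(svm_pause_f1, 2))  # 0.79, matching the table

# Pause class, CART row of the table: recall 0.68, precision 0.81.
cart_pause_f1 = f1_score(precision=0.81, recall=0.68)
print(round(cart_pause_f1, 2))  # 0.74, matching the table
```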
  • 12. ACCURACY OF THE PAUSE POSITION PREDICTION MODEL
  • 13. PREDICTION OF PAUSE DURATION
  • 14. FEATURES USED FOR DETERMINING THE PAUSE DURATION
      1. Morphological features:
         Terminal syllable of the current word, previous two words and following two words.
      2. Structural features:
         Position of the vowel in the terminal syllable.
         Number of segments (i.e., consonants) before and after the nucleus (i.e., vowel) in the terminal syllable.
      3. Positional features:
         Total number of phones in the terminal syllable of the current word, previous two words and following two words.
         Position of the current word from the beginning and ending of the utterance.
  • 15. PERFORMANCE OF CART, FFNN, SVM AND ELM MODELS FOR PREDICTING THE PAUSE DURATION
      Model  x̄ (ms)  ȳ (ms)  µ (ms)  σ (ms)  γx,y
      CART   241.60  247.21  107.12  155.87  0.67
      FFNN           251.25  75.13   70.68   0.87
      SVM            251.70  117.69  138.28  0.71
      ELM            245.98  107.25  110.11  0.67
      x̄: Average of actual pause duration values.
      ȳ: Average of predicted pause duration values.
      µ: Average prediction error.
      σ: Standard deviation of the prediction error.
      γx,y: Correlation coefficient between actual and predicted values.
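The objective measures in this table can be computed from paired actual and predicted durations. The sketch below is illustrative only and rests on two assumptions that the slides do not state: the prediction error is taken as the absolute deviation |x_i − y_i|, and γx,y is taken as the Pearson correlation. The function name, dictionary keys and toy durations are likewise made up for the example.

```python
import numpy as np


def objective_measures(actual_ms, predicted_ms):
    """Compute the five table columns from paired durations in ms."""
    x = np.asarray(actual_ms, dtype=float)
    y = np.asarray(predicted_ms, dtype=float)
    abs_error = np.abs(x - y)  # assumed definition of the prediction error
    return {
        "mean_actual": x.mean(),                 # x-bar
        "mean_predicted": y.mean(),              # y-bar
        "mean_error": abs_error.mean(),          # mu
        "std_error": abs_error.std(),            # sigma (population form)
        "correlation": np.corrcoef(x, y)[0, 1],  # gamma, assumed Pearson
    }


# Toy durations in ms, not taken from the corpus.
measures = objective_measures([100, 200, 300, 400], [110, 190, 320, 380])
print(measures)
```

With real data, `actual_ms` would hold the pause durations measured in the speech corpus and `predicted_ms` the output of one model on the same test pauses.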
  • 16. SCATTER PLOTS OF CART, FFNN, SVM AND ELM MODELS. Figure: predicted pause duration (in ms) vs. actual pause duration (in ms) for (a) CART, (b) FFNN, (c) SVM, (d) ELM.
  • 17. MULTI-STAGE PAUSE DURATION PREDICTION
  • 18. RESULTS OF THE CLASSIFICATION OF PAUSE BASED ON LIMITED INTERVAL
      Model  Pause type  Recall  Precision  F1
      CART   long        0.73    0.62       0.67
      CART   medium      0.50    0.52       0.51
      CART   short       0.53    0.63       0.58
      FFNN   long        0.72    0.65       0.68
      FFNN   medium      0.54    0.57       0.55
      FFNN   short       0.62    0.59       0.60
      SVM    long        0.68    0.65       0.66
      SVM    medium      0.51    0.58       0.54
      SVM    short       0.61    0.58       0.59
      ELM    long        0.72    0.58       0.64
      ELM    medium      0.55    0.52       0.53
      ELM    short       0.65    0.62       0.63
  • 19. ACCURACY OF CLASSIFYING A PAUSE INTO ONE OF THE PAUSE TYPES: SHORT, MEDIUM AND LONG
  • 20. RESULTS FOR PREDICTING PAUSE DURATION
      Table: Performance of the multi-model framework using CART, FFNN, SVM and ELM models, based on objective measures.
      Model  Pause type  x̄ (ms)  ȳ (ms)  µ (ms)  σ (ms)  γx,y
      CART   Long        347.96  372.92  78.46   57.31   0.73
      CART   Medium      208.43  199.30  26.19   17.02   0.67
      CART   Short       87.33   84.99   24.51   17.81   0.72
      CART   Overall     181.16  184.18  39.23   40.17   0.70
      FFNN   Long        350.96  342.70  62.59   46.87   0.67
      FFNN   Medium      203.15  195.05  25.87   18.92   0.75
      FFNN   Short       87.32   82.55   28.29   17.69   0.68
      FFNN   Overall     148.98  143.13  33.70   28.39   0.73
      SVM    Long        353.16  299.10  72.40   63.75   0.73
      SVM    Medium      203.15  190.98  29.96   16.79   0.52
      SVM    Short       90.70   77.41   22.96   17.88   0.63
      SVM    Overall     189.49  164.88  38.50   42.85   0.65
      ELM    Long        353.16  340.83  69.32   44.51   0.66
      ELM    Medium      203.15  195.09  27.35   14.64   0.65
      ELM    Short       90.38   80.91   22.34   16.24   0.81
      ELM    Overall     190     180.02  36.85   34.19   0.71
  • 21. SCATTER PLOTS OF MULTI MODEL FRAMEWORK. Figure: predicted pause duration (in ms) vs. actual pause duration (in ms) for (e) CART, (f) FFNN, (g) SVM, (h) ELM.
  • 22. ANALYSIS OF PAUSE MODEL BASED ON SINGLE AND MULTI MODEL FRAMEWORK. Figure: Average prediction error (in ms)
  • 23. UNSUPERVISED PAUSE POSITION PREDICTION MODEL
  • 24. UNSUPERVISED FEATURES EXTRACTION METHODOLOGY
      Figure: Feature extraction method. From the story speech corpus, the unique words and the most frequently occurring words are extracted; feature extraction builds an m×n co-occurrence matrix, which SVD reduces to an m×r dictionary.
      m = 3579 unique words.
      n = 300 most frequently occurring words.
      r = 50, the reduced dimension of the co-occurrence matrix.
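The m×n to m×r reduction can be sketched with a truncated SVD. This is an illustration under stated assumptions, not the method as implemented for the slides: co-occurrence is counted here within the same sentence (the slides do not specify the context window), the reduced features are taken as the first r left singular vectors scaled by their singular values, and the function name `word_features` and the toy sentences are hypothetical.

```python
from collections import Counter

import numpy as np


def word_features(sentences, n_frequent, r):
    """Build an m x n co-occurrence matrix and reduce it to m x r via SVD."""
    counts = Counter(word for sentence in sentences for word in sentence)
    vocab = sorted(counts)  # the m unique words
    frequent = [word for word, _ in counts.most_common(n_frequent)]  # n columns
    row = {word: i for i, word in enumerate(vocab)}
    col = {word: j for j, word in enumerate(frequent)}

    cooc = np.zeros((len(vocab), len(frequent)))
    for sentence in sentences:
        for i, word in enumerate(sentence):
            for j, context in enumerate(sentence):
                # Assumed context: any other word in the same sentence.
                if i != j and context in col:
                    cooc[row[word], col[context]] += 1

    u, s, _ = np.linalg.svd(cooc, full_matrices=False)
    r = min(r, len(s))  # cannot keep more dimensions than singular values
    return vocab, u[:, :r] * s[:r]


toy_sentences = [
    ["the", "king", "asked", "birbal"],
    ["birbal", "answered", "the", "king"],
    ["the", "crow", "was", "thirsty"],
]
toy_vocab, toy_features = word_features(toy_sentences, n_frequent=3, r=2)
print(len(toy_vocab), toy_features.shape)
```

In the setting of the slide, the same call would use n_frequent=300 and r=50 over the 3579 unique words, giving one 50-dimensional feature vector per word.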
  • 25. UNSUPERVISED FEATURES EXTRACTION METHODOLOGY CONTD..
  • 26. RESULTS
      Table: Performance measures of unsupervised data-driven models (CART, FFNN, SVM and ELM) for pause position prediction.
      Model  Class      Recall  Precision  F1
      CART   Non-pause  0.86    0.94       0.89
      CART   Pause      0.79    0.58       0.66
      FFNN   Non-pause  0.85    0.91       0.88
      FFNN   Pause      0.82    0.62       0.70
      SVM    Non-pause  0.81    0.88       0.84
      SVM    Pause      0.77    0.68       0.72
      ELM    Non-pause  0.84    0.90       0.86
      ELM    Pause      0.82    0.63       0.71
  • 27. MODELING OF PAUSES BASED ON DISCOURSE MODES
  • 28. DISCOURSE MODE
      Three discourse modes of a story are considered:
      Descriptive: 547 sentences
      Dialogue: 279 sentences
      Narrative: 1134 sentences
  • 29. STATISTICS OF THE PAUSES FOR VARIOUS MODES OF DISCOURSE
      Table: Statistics of the pauses for various modes of discourse, based on limited intervals.
      Mode         Pause type    Mean (ms)  StdDev (ms)  % in original
      Descriptive  Long pause    447.59     240.18       6.15
      Descriptive  Medium pause  116.99     52.59        6.94
      Descriptive  Short pause   128.68     59.73        6.96
      Narrative    Long pause    413.00     179.89       13.14
      Narrative    Medium pause  197.61     27.89        11.48
      Narrative    Short pause   93.97      30.19        23.84
      Dialogue     Long pause    492.28     211.00       4.85
      Dialogue     Medium pause  196.44     27.22        2.44
      Dialogue     Short pause   92.24      28.97        4.20
  • 30. HISTOGRAM PLOTS. Figure: histogram, frequency vs. duration in ms, (a) Descriptive Mode.
  • 31. CONTD.. Figure: histogram, frequency vs. duration in ms, (b) Narrative Mode.
  • 32. CONTD.. Figure: histogram, frequency vs. duration in ms, (c) Dialogue Mode.
  • 33. PAUSE PREDICTION MODEL. Figure: Classifying story text from the story-speech corpus into three modes of discourse (descriptive, dialogue and narrative).
  • 34. Figure: Proposed pause prediction model. First stage: Pause Position Prediction Model (pause / non-pause). Second stage: Pause Duration Prediction Model (short / medium / long pause classification, followed by separate short, medium and long pause duration predictors).
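The control flow of the staged model in this figure can be sketched as a chain of three decisions per word boundary. The code below is a hypothetical skeleton: `predict_pauses`, its parameter names and the toy lambdas are placeholders standing in for the trained classifiers and regressors (CART, FFNN, SVM or ELM), none of which are reproduced here.

```python
from typing import Callable, Dict, List, Optional

Features = Dict[str, float]


def predict_pauses(
    words: List[Features],
    is_pause: Callable[[Features], bool],
    pause_type: Callable[[Features], str],
    duration_ms: Dict[str, Callable[[Features], float]],
) -> List[Optional[float]]:
    """Return a pause duration in ms after each word, or None for no pause."""
    results: List[Optional[float]] = []
    for feats in words:
        # Pause position prediction: pause vs. non-pause.
        if not is_pause(feats):
            results.append(None)
            continue
        # Pause type classification: "short", "medium" or "long".
        kind = pause_type(feats)
        # Duration prediction with the predictor dedicated to that type.
        results.append(duration_ms[kind](feats))
    return results


# Toy stand-ins for trained models; thresholds and durations are made up.
toy_words = [{"pos_from_end": 5.0}, {"pos_from_end": 1.0}]
toy_result = predict_pauses(
    toy_words,
    is_pause=lambda f: f["pos_from_end"] <= 1,
    pause_type=lambda f: "long",
    duration_ms={
        "short": lambda f: 90.0,
        "medium": lambda f: 200.0,
        "long": lambda f: 400.0,
    },
)
print(toy_result)  # [None, 400.0]
```

Keeping one duration predictor per pause type is what lets each regressor work over a limited interval of durations, which is the motivation the slides give for the multi-model framework.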
  • 35. ACCURACY OF FIRST STAGE OF PAUSE POSITION PREDICTION MODEL
      Table: Performance of the CART model for predicting pause position.
      Mode         Class      Recall  Precision  F1 score
      Descriptive  Non-pause  0.978   0.837      0.902
      Descriptive  Pause      0.454   0.88       0.60
      Dialogue     Non-pause  0.952   0.872      0.91
      Dialogue     Pause      0.569   0.793      0.663
      Narrative    Non-pause  0.953   0.856      0.902
      Narrative    Pause      0.552   0.806      0.655
  • 36. ACCURACY OF THE PAUSE POSITION PREDICTION MODEL
      Descriptive mode: 72%
      Dialogue mode: 76.05%
      Narrative mode: 75.25%
  • 37. ACCURACY OF SECOND STAGE OF PAUSE PREDICTION MODEL
      Table: Performance of CART for long, medium and short pause classification.
      Mode         Pause type  Recall  Precision  F1 score
      Descriptive  long        0.56    0.46       0.51
      Descriptive  medium      0.48    0.39       0.43
      Descriptive  short       0.56    0.47       0.50
      Dialogue     long        0.50    0.72       0.59
      Dialogue     medium      0.53    0.65       0.58
      Dialogue     short       0.40    0.55       0.46
      Narrative    long        0.37    0.48       0.42
      Narrative    medium      0.53    0.46       0.49
      Narrative    short       0.73    0.46       0.56
  • 38. ACCURACY OF CLASSIFYING A PAUSE INTO ONE OF THE PAUSE TYPES: SHORT, MEDIUM AND LONG
      Descriptive mode: 53%
      Dialogue mode: 48%
      Narrative mode: 54%
  • 39. ACCURACY OF THIRD STAGE OF PAUSE PREDICTION MODEL
      Table: Performance of CART for pause duration prediction.
      Mode         Model        x̄ (ms)  ȳ (ms)  µ (ms)  σ (ms)  γx,y
      Descriptive  CART long    468.24  486.93  90.79   90.92   0.58
      Descriptive  CART medium  201.03  189.97  24.59   18.95   0.56
      Descriptive  CART short   88.26   93.48   34.86   12.37   0.53
      Descriptive  Overall      228.84  234.72  40.42   41.42   0.55
      Narrative    CART long    402.87  397.99  74.90   77.26   0.69
      Narrative    CART medium  198.09  198.87  10.60   10.79   0.66
      Narrative    CART short   93.73   92.69   9.65    9.40    0.69
      Narrative    Overall      244.98  242.71  44.30   37.04   0.68
      Dialogue     CART long    472.47  463.31  60.96   61.93   0.60
      Dialogue     CART medium  193.31  200.33  12.02   10.01   0.75
      Dialogue     CART short   87.56   89.03   13.74   12.07   0.66
      Dialogue     Overall      214.52  214.60  23.70   22.92   0.77
  • 40. IDEAL CASE: ACCURACY OF THIRD STAGE OF PAUSE PREDICTION MODEL
      Table: Performance of CART for pause duration prediction.
      Mode         Model        x̄ (ms)  ȳ (ms)  µ (ms)  σ (ms)  γx,y
      Descriptive  CART long    468.24  486.93  80.89   100.92  0.78
      Descriptive  CART medium  201.03  189.97  14.39   12.95   0.76
      Descriptive  CART short   88.26   93.48   13.76   7.37    0.73
      Descriptive  Overall      228.84  234.72  34.48   37.24   0.75
      Narrative    CART long    402.87  397.99  74.90   77.26   0.69
      Narrative    CART medium  198.09  198.87  10.60   9.79    0.66
      Narrative    CART short   93.73   92.69   9.65    7.14    0.69
      Narrative    Overall      244.98  242.71  37.16   37.04   0.68
      Dialogue     CART long    472.47  463.31  52.96   61.93   0.71
      Dialogue     CART medium  193.31  200.33  7.02    7.01    0.85
      Dialogue     CART short   87.56   89.03   9.74    10.07   0.76
      Dialogue     Overall      214.52  214.60  20.41   22.92   0.77
  • 41. ANALYSIS OF PAUSE PREDICTION MODEL IN DISCOURSE MODE. Figure: Average prediction error (in ms)
  • 42. DURATION MODELING FOR STORYTELLING STYLE SPEECH
  • 43. FEATURES USED FOR TRAINING
      1. Positional Features (Baseline):
         Position of the current word from the beginning and ending of the utterance.
         Position of the current syllable from the beginning and ending of the utterance.
         Position of a syllable in the word.
         Position of the vowel in the syllable.
         Syllable identity: segments of the syllable (consonants and vowels) for the current syllable.
         Syllable identity of the previous two syllables and following two syllables.
      2. Structural Features (Baseline):
         Total number of words in the utterance.
         Total number of phones in the utterance.
         Total number of syllables in the utterance.
         Total number of syllables in the current word, previous two words and following two words.
         Total number of phones in the current word, previous two
  • 44. FEATURES USED FOR TRAINING CONTD..
      2. Structural Features (Baseline, contd.):
         Number of segments (i.e., consonants) before the nucleus (i.e., vowel) in the syllable.
         Number of segments after the nucleus (i.e., vowel) in the syllable.
      3. Story-specific Features:
         Emotion (sad, anger, happy, fear) of the current word in the utterance.
         Genre of the story (fable, legendary, folk-tale).
         Whether the word is a content or function word.
         Whether the word is stressed or not.
  • 45. RESULTS
      Table: Performance of the CART model for predicting the syllable duration.
      Model           x̄ (ms)  ȳ (ms)  µ (ms)  σ (ms)  γx,y
      Baseline        208.97  212     56.02   52.79   0.58
      Story-specific  208.97  211.91  46.72   39.72   0.70
      x̄: Average of actual syllable duration values.
      ȳ: Average of predicted syllable duration values.
      µ: Average prediction error.
      σ: Standard deviation of the prediction error.
      γx,y: Correlation coefficient between actual and predicted values.
  • 46. PROPOSED METHOD. Figure: Story, split by genre into Fable, Folk-tale and Legendary.
  • 47. RESULTS CONTD..
      Table: Accuracy of prediction by the CART model based on story genre.
      Model      x̄ (ms)  ȳ (ms)  µ (ms)  σ (ms)  γx,y
      Fable      209.97  208.89  38.20   31.70   0.80
      Folk-tale  204.86  209.71  36.88   31.06   0.77
      Legendary  212.58  209.57  37.81   39.51   0.83
      Overall    209.13  209.39  37.63   34.09   0.80
  • 48. SUMMARY AND CONCLUSION
      1. Modeling of pauses using FFNN, SVM and ELM is carried out.
      2. Unsupervised pause position prediction is proposed.
      3. Modeling of pauses based on discourse modes is studied.
      4. Duration modeling for storytelling style speech is carried out.
  • 49. FUTURE WORK
      In future, we will extend the current work to include the following:
      1. Subjective listening tests for the proposed pause prediction model.
      2. Modeling of word prominence for story text.
      3. Analysis and modeling of pitch for storytelling style speech based on the three modes of discourse.
      4. Analysis and modeling of intensity for storytelling style speech.
  • 50. Acknowledgments. The authors would like to thank the Department of Information Technology, Government of India, for funding the project, Development of Text-to-Speech synthesis for Indian Languages Phase II, Ref. no. 11(7)/2011HCC(TDIL). The authors would also like to thank all the DAC committee members, the supervisor, and all seminar attendees.
  • 51. DISSEMINATION OF RESEARCH
      Conference:
      1. Parakrant Sarkar, K. Sreenivasa Rao, “Analysis and Modeling Pauses for Synthesis of Storytelling Speech based on Discourse modes”, in Proceedings of the IEEE International Conference on Contemporary Computing (IC3 2015), JIIT Noida, India, 11-13 August.
      2. Parakrant Sarkar, K. Sreenivasa Rao, “Data-Driven Pause Prediction for Synthesis of Storytelling Style Speech based on Discourse Modes”, in Proceedings of the IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT 2015), IIIT Bangalore, India, 10-11 July.
      3. Parakrant Sarkar, K. Sreenivasa Rao, “Modeling Pauses for Synthesis of Storytelling Style Speech Using Unsupervised Word Features”, Procedia Computer Science, Volume 58, 2015, pages 42-49, 10-13 Aug 2015.
      4. P. Sarkar, K. S. Rao, “Data-driven pause prediction for speech synthesis in storytelling style speech”, in 2015 Twenty First National Conference on Communications (NCC), pages 1-5, 27 Feb. - 1 Mar., IIT Bombay, 2015.
      Journal:
      1. Parakrant Sarkar, K. Sreenivasa Rao, “Modeling of pauses for Storytelling Style Speech Synthesis”, Computer Speech and Language [Under Revision]
  • 54. Thank You