3. TTS – Unit Selection (U.S.)
[Diagram: Input text → NLP tasks → Unit Selection → Speech; Unit Selection draws on the Speech Database]
4. TTS – Unit Selection (U.S.)
● Input text is processed and split into tokens.
● The best possible candidates for each token
are then selected from the speech database.
● The desired target utterance is then created by
determining the best chain of these candidate
units and concatenating them together →
Viterbi algorithm (using join and target costs).
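The chain search described above can be sketched as a small dynamic program. This is a minimal illustration in Python, not MaryTTS code: the function name `viterbi_select`, the `(label, database_position)` unit tuples and both toy cost functions are assumptions made for the example. The target cost scores how well a candidate fits its position, and the join cost scores how well two neighbouring candidates concatenate.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

U = TypeVar("U")  # a candidate unit; its concrete type is left open


def viterbi_select(
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[int, U], float],
    join_cost: Callable[[U, U], float],
) -> Tuple[List[U], float]:
    """Return the cheapest chain of units (one per position) and its total cost."""
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every target position needs at least one candidate")

    # best[j]: cheapest cost of any chain ending in candidate j of the current position
    best = [target_cost(0, u) for u in candidates[0]]
    # back[i - 1][j]: index of the best predecessor of candidate j at position i
    back: List[List[int]] = []

    for i in range(1, len(candidates)):
        new_best, pointers = [], []
        for u in candidates[i]:
            # cheapest way to reach u from any candidate at the previous position
            cost, prev = min(
                (best[k] + join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            new_best.append(cost + target_cost(i, u))
            pointers.append(prev)
        best = new_best
        back.append(pointers)

    # start at the cheapest final candidate and follow the back-pointers
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)], total


# Toy data: each unit is (label, position in the recorded database).
candidates = [
    [("h", 10), ("h", 3)],
    [("e", 4), ("e", 20)],
    [("l", 21), ("l", 5)],
]
# Hypothetical costs: units recorded next to each other join for free,
# and units from later in the database happen to match their target better.
units, total = viterbi_select(
    candidates,
    target_cost=lambda i, u: 0.25 if u[1] >= 10 else 0.5,
    join_cost=lambda a, b: 0.0 if b[1] == a[1] + 1 else 1.0,
)
print(units, total)
```

In this toy case the chain of units that were adjacent in the database wins even though each unit has a worse target cost, because it avoids paying any join cost.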
5. MaryTTS implementation
● MaryTTS is a “modularised” system
● U.S. packages are currently embedded within
the marytts-runtime module.
6. MaryTTS implementation
● These packages are divided into the following groups:
- Base classes (marytts.unitselection)
- Data (marytts.unitselection.data)
- Selection & Cost Functions (marytts.unitselection.select)
- Viterbi (marytts.unitselection.select.viterbi)
- Weighting Functions (marytts.unitselection.weightingfunctions)
- Target Features (marytts.features)
- Voice Properties (marytts.server)
7. MaryTTS implementation
UML Schema of Unit Selection Packages in MaryTTS - woah!
*** Only the unit selection part is modelled; concatenation is excluded.
8. MaryTTS U.S. step-by-step:
● A request is made to the MaryTTS server to output
audio from some input using a unit selection
voice.
● The Synthesis module is called to process the
input data → calls voice.synthesize()
● In this case, the unit selection voice (an extension
of the Voice class) calls the unit selection
synthesizer.
9. MaryTTS U.S. step-by-step:
UnitSelectionSynthesizer.synthesize(tokens, UnitSelVoice):
→ Processes the tokens into audio by calling on the voice's
database, unit selector and concatenator.
– UnitSelectionVoice loads these objects by reading in
properties from the voice's .config file using MaryProperties
– The UnitDatabase class contains the target and join cost
functions, as well as a way to access the speech database
to retrieve target candidates etc.
– The unit selector selects the units
– The unit concatenator concatenates these units into a single
audio stream
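The division of labour between voice, database, selector and concatenator can be sketched as follows. This is a simplified Python illustration, not the MaryTTS classes themselves: every `Toy...` name is hypothetical, the selector simply takes the first candidate instead of running a cost-based search, and loading the objects from the voice's .config file is left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Unit:
    name: str
    samples: List[float]  # stand-in for the unit's audio data


class ToyDatabase:
    """Stand-in for the speech database: maps a token to its candidate units."""

    def __init__(self, units: Dict[str, List[Unit]]):
        self._units = units

    def candidates(self, token: str) -> List[Unit]:
        return self._units[token]


class ToySelector:
    """Stand-in for the unit selector; it holds a reference to the database."""

    def __init__(self, database: ToyDatabase):
        self.database = database

    def select_units(self, tokens: List[str]) -> List[Unit]:
        # Placeholder policy: first candidate per token, no cost-based search.
        return [self.database.candidates(t)[0] for t in tokens]


class ToyConcatenator:
    """Stand-in for the concatenator: joins unit audio into one stream."""

    def concatenate(self, units: List[Unit]) -> List[float]:
        audio: List[float] = []
        for unit in units:
            audio.extend(unit.samples)
        return audio


@dataclass
class ToyVoice:
    """Bundles the three collaborators that the synthesizer asks the voice for."""

    database: ToyDatabase
    selector: ToySelector
    concatenator: ToyConcatenator


def synthesize(tokens: List[str], voice: ToyVoice) -> List[float]:
    # Select units first, then concatenate them into a single audio stream.
    units = voice.selector.select_units(tokens)
    return voice.concatenator.concatenate(units)


database = ToyDatabase({
    "a": [Unit("a", [0.1, 0.2])],
    "b": [Unit("b", [0.3])],
})
voice = ToyVoice(database, ToySelector(database), ToyConcatenator())
audio = synthesize(["a", "b"], voice)
print(audio)
```

Keeping the synthesizer free of any selection or concatenation logic means each collaborator can be swapped per voice, which matches the idea of the voice supplying its own database, selector and concatenator.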
10. Unit Selector
● Contains a reference to the voice's database
● .selectUnits(tokens, voice):
- tokens converted to targets
- target feature vectors computed for each
target
- Viterbi algorithm applied to find best path
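The three steps of the selector can be sketched as a short data flow. This Python sketch is an assumption-laden illustration rather than the MaryTTS implementation: the `Target` class, the feature names `unit`, `prev` and `next`, and the function names are all hypothetical. The search is passed in as a function so the sketch stays focused on the flow; the stand-in used in the example only formats each target's features and is not a Viterbi search.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Target:
    """What we want to synthesize at one position, plus its feature vector."""

    name: str
    features: Dict[str, str] = field(default_factory=dict)


def tokens_to_targets(tokens: List[str]) -> List[Target]:
    # Step 1: one target per token (a real system may split tokens further).
    return [Target(token) for token in tokens]


def compute_features(targets: List[Target]) -> None:
    # Step 2: attach a small context feature vector to every target.
    # "_" marks a missing neighbour at the start or end of the utterance.
    for i, target in enumerate(targets):
        target.features = {
            "unit": target.name,
            "prev": targets[i - 1].name if i > 0 else "_",
            "next": targets[i + 1].name if i + 1 < len(targets) else "_",
        }


def select_units(
    tokens: List[str],
    search: Callable[[List[Target]], List[str]],
) -> List[str]:
    targets = tokens_to_targets(tokens)
    compute_features(targets)
    # Step 3: hand the feature-annotated targets to the best-path search.
    return search(targets)


def describe(targets: List[Target]) -> List[str]:
    # Stand-in for the search step: returns a context label per target.
    return [
        f"{t.features['prev']}-{t.features['unit']}+{t.features['next']}"
        for t in targets
    ]


selected = select_units(["a", "b", "c"], describe)
print(selected)
```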
11. How can we improve the system?
● Restructuring of codebase?