Announcing Amazon Polly - Turn Text into Lifelike Speech - December 2016 Monthly Webinar Series
Amazon Polly is a service that turns text into lifelike speech. Amazon Polly lets you create applications that can talk, enabling you to create entirely new categories of speech-enabled products. In this webinar, you’ll get an overview of how Polly uses advanced deep learning technologies to synthesize speech that sounds like a human voice. You’ll also learn how you can use Polly’s 47 lifelike voices and support for 24 languages to build speech-enabled applications that work in many different countries.
• Learn about the capabilities and features of Amazon Polly
• Learn about the benefits of Amazon Polly
• Learn about the different use cases
• Learn how to get started using Amazon Polly
• Learn how Polly speech audio can be distributed without restriction
• Get an overview of SSML
• Understand what is included in the AWS Free Tier and how to estimate usage costs
Why we built Polly
Apps using voice to communicate with end-users are becoming more common every day
Naturalness of generated speech is a key element of
Integration of speech varies across use cases
What is Polly
A service that converts text into lifelike speech
Offers 47 lifelike voices and 24 languages
Low latency responses enable developers to build real-time systems
Developers can store, replay and distribute generated speech
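As a rough illustration of storing generated speech for later replay, the sketch below wraps the `synthesize_speech` operation of the AWS SDK for Python (boto3). The helper name `synthesize_to_file`, the voice `Joanna`, and the output path are choices made for this example, not part of the presentation. The client is passed in as an argument so the function can be exercised without AWS credentials.

```python
def synthesize_to_file(polly_client, text, path, voice_id="Joanna"):
    """Synthesize `text` as MP3, store it at `path`, and return the byte count.

    `polly_client` is expected to behave like boto3.client("polly").
    """
    response = polly_client.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
    )
    # The response carries the audio as a streaming body.
    audio = response["AudioStream"].read()
    with open(path, "wb") as f:
        f.write(audio)
    return len(audio)


# Typical use (assumes boto3 is installed and credentials are configured):
#   import boto3
#   polly = boto3.client("polly")
#   synthesize_to_file(polly, "Hello from Polly.", "hello.mp3")
```

Once the file is written, it can be replayed or distributed like any other audio file.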
Polly – Quality
Natural sounding speech
A subjective measure of how close TTS output is to human speech.
Accurate text processing
Ability of the system to interpret common text formats such as abbreviations, numerical sequences, homographs etc.
Today in Las Vegas, NV it's 90°F.
"We live for the music", live from the Madison Square Garden.
Intelligible speech
A measure of how comprehensible speech is.
“Peter Piper picked a peck of pickled peppers.”
Polly – Language Portfolio
Polly features: SSML
Speech Synthesis Markup Language (SSML) is a W3C recommendation, an XML-based markup language for speech synthesis
My name is Kuklinski. It is spelled
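The markup from the slide is not preserved in this text, so the following is only a sketch of how the spelling example might be written. It assumes the standard SSML `say-as` element with `interpret-as="characters"`; the helper name `spell_out_ssml` is made up for this illustration.

```python
from xml.sax.saxutils import escape


def spell_out_ssml(sentence, word):
    """Return SSML that reads `sentence`, then spells `word` letter by letter."""
    return (
        "<speak>"
        f"{escape(sentence)} "
        f'<say-as interpret-as="characters">{escape(word)}</say-as>'
        "</speak>"
    )


ssml = spell_out_ssml("My name is Kuklinski. It is spelled", "Kuklinski")
print(ssml)

# With boto3, a string like this would be sent with TextType="ssml", e.g.
#   client.synthesize_speech(Text=ssml, TextType="ssml",
#                            OutputFormat="mp3", VoiceId="Joanna")
```

Escaping the input keeps characters such as `&` or `<` from breaking the XML.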
Polly features: Lexicons
Enables developers to customize the pronunciation of words or phrases
My daughter’s name is Kaja.
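A lexicon for the name above might look like the sketch below. Polly lexicons use the W3C Pronunciation Lexicon Specification (PLS) format. The alias "Kaya" is only a guess at the intended pronunciation, and the helper name `build_lexicon` and the lexicon name `names` are hypothetical.

```python
from xml.sax.saxutils import escape


def build_lexicon(grapheme, alias, lang="en-US"):
    """Build a one-entry PLS lexicon that speaks `grapheme` as `alias`."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<lexicon version="1.0"\n'
        '      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"\n'
        f'      alphabet="ipa" xml:lang="{lang}">\n'
        "  <lexeme>\n"
        f"    <grapheme>{escape(grapheme)}</grapheme>\n"
        f"    <alias>{escape(alias)}</alias>\n"
        "  </lexeme>\n"
        "</lexicon>\n"
    )


# "Kaya" is an assumed pronunciation, used here for illustration only.
pls = build_lexicon("Kaja", "Kaya")
print(pls)

# With boto3, the lexicon would be uploaded and then referenced by name:
#   client.put_lexicon(Name="names", Content=pls)
#   client.synthesize_speech(Text="My daughter's name is Kaja.",
#                            OutputFormat="mp3", VoiceId="Joanna",
#                            LexiconNames=["names"])
```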
Main Challenges of Text-to-Speech
Goal: Convert text into intelligible, accurate, and natural speech
• Homographs: words written identically that have different pronunciations
‘I live in Las Vegas’ vs ‘This presentation broadcasts live from Las Vegas’
• Text normalization: disambiguation of abbreviations, acronyms, units
‘St.’ expanded as ‘street’ or ‘saint’
• Conversion of text to phonemes (Grapheme-to-Phoneme) in languages with complex mapping such as English, e.g. tough,
• Foreign words (déjà vu), proper names (François Hollande), slang (ASAP, LOL) etc.
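To make the text normalization challenge concrete, here is a toy sketch built on regular expressions. It is not how Polly works internally. The rules, including the heuristic that ‘St.’ before a capitalized word means ‘Saint’, are assumptions chosen to fit the examples in this presentation, and numbers are left as digits rather than spelled out.

```python
import re


def expand_st(text):
    """Expand 'St.' as 'Saint' before a capitalized word, otherwise 'Street'."""
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    return re.sub(r"\bSt\.", "Street", text)


def normalize(text):
    """Apply a few hand-written normalization rules (illustrative only)."""
    text = re.sub(r">\s*(\d+)%", r"more than \1 percent", text)
    text = re.sub(r"(\d+)\s*°F", r"\1 degrees Fahrenheit", text)
    return expand_st(text)


print(normalize("Market grew by > 20%."))
print(normalize("Today in Las Vegas, NV it's 90°F."))
print(normalize("St. Peter lives on Main St."))
```

A rule like the ‘St.’ heuristic fails easily, for example when a sentence ends in ‘St.’ and the next one starts with a capital letter. That is one reason normalization is listed among the main challenges.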
Market grew by > 20%.
Market grew by more
ˈmɑɹ.kət ˈgɹu baɪ ˈmoʊɹ
Prosody contour
Unit selection and adaptation
Conversion of phoneme sequence to waveform
Database of recorded audio
Unit – diphone
Coverage of diphones and various features
e.g. Allophonic variation
• Pin vs Spin vs limping
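As a small illustration of the diphone unit, the sketch below splits a phoneme sequence into overlapping pairs. The `#` silence marker, the `a-b` naming, and the simplified phoneme symbols are conventions assumed for this example.

```python
def to_diphones(phonemes, boundary="#"):
    """Return the diphones of a phoneme list, padded with a silence marker."""
    padded = [boundary, *phonemes, boundary]
    # Each diphone spans from one phoneme into the next one.
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


print(to_diphones(["p", "ɪ", "n"]))       # pin
print(to_diphones(["s", "p", "ɪ", "n"]))  # spin
```

In this representation the "p" of "pin" appears in the unit `#-p`, while the "p" of "spin" appears in `s-p`. The two contexts therefore map to different recorded units, which is one way a diphone database can capture allophonic variation.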
Recording Data for TTS
Tons of text
Few weeks of
• Covers all combinations of diphones and significant features in a
an error occurred while searching for your route
because snaps weren't all so obedient anymore,
now we say apple again. and we say apple,
general electric soars today. information on general
quick breads, zucchini, holiday, crock pot, cake,
so are you still keeping tabs on your old team,
that weighs more than four tons, disrupts the
An apple a day, keeps …
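The presentation does not say how the recording script is chosen from the large text pool. One common way to approach a coverage goal like this is greedy set cover, sketched below. The function name `select_script` and the toy data are made up for illustration.

```python
def select_script(candidates):
    """Greedy set cover over diphones.

    `candidates` maps each sentence to the set of diphones it contains.
    Returns the chosen sentences in the order they were picked.
    """
    uncovered = set().union(*candidates.values())
    chosen = []
    while uncovered:
        # Pick the sentence that covers the most still-missing diphones.
        best = max(candidates, key=lambda s: len(candidates[s] & uncovered))
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen


toy = {
    "sentence A": {"#-p", "p-ɪ"},
    "sentence B": {"p-ɪ", "ɪ-n", "n-#"},
    "sentence C": {"n-#"},
}
print(select_script(toy))
```

Greedy selection does not guarantee the smallest possible script, but it keeps the number of sentences to record reasonably low while still covering every diphone in the pool.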