Announcing Amazon Polly - Turn Text into Lifelike Speech - December 2016 Monthly Webinar Series

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Rafal Kuklinski, Piotr Lewalski
Amazon Text-to-Speech
12/09/2016
Amazon Polly
A service that turns text into lifelike speech

What to Expect from the Session
• Introduction to Amazon Polly
• Features and functionalities
• Text-to-Speech: Under the Hood
• Getting started
• Workshop & Demo
• Pricing
• Q&A

Why we built Polly
• Apps using voice to communicate with end-users are becoming more common every day
• Naturalness of generated speech is a key element of user experience
• Integration of speech varies across use cases

What is Polly
• A service that converts text into lifelike speech
• Offers 47 lifelike voices across 24 languages
• Low-latency responses enable developers to build real-time systems
• Developers can store, replay and distribute generated speech

Polly – Quality
Natural sounding speech
A subjective measure of how close TTS output is to human speech.
Accurate text processing
Ability of the system to interpret common text formats such as abbreviations, numerical
sequences, homographs etc.
Today in Las Vegas, NV it's 90°F.
"We live for the music", live from the Madison Square Garden.
Highly intelligible
A measure of how comprehensible speech is.
"Peter Piper picked a peck of pickled peppers."

Polly – Language Portfolio
Americas:
• Brazilian Portuguese
• Canadian French
• English (US)
• Spanish (US)
A-PAC:
• Australian English
• Indian English
• Japanese
EMEA:
• Danish
• Dutch
• British English
• French
• German
• Icelandic
• Italian
• Norwegian
• Polish
• Portuguese
• Romanian
• Russian
• Spanish
• Swedish
• Turkish
• Welsh
• Welsh English

Polly features: SSML
Speech Synthesis Markup Language (SSML) is a W3C recommendation: an XML-based markup language for speech synthesis applications
<speak>
My name is Kuklinski. It is spelled
<prosody rate='x-slow'>
<say-as interpret-as="characters">Kuklinski</say-as>
</prosody>
</speak>
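The SSML snippet above can be generated and sent to Polly from code. The following is a minimal sketch, not the presenters' implementation: the helper names `spell_out_ssml` and `synthesize_ssml` are hypothetical, the "Joanna" voice is borrowed from the sample later in the deck, and the call assumes that boto3's `synthesize_speech` accepts `TextType="ssml"` and that AWS credentials are configured.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Named speaking rates defined for the SSML <prosody> element.
ALLOWED_RATES = {"x-slow", "slow", "medium", "fast", "x-fast"}


def spell_out_ssml(name: str, rate: str = "x-slow") -> str:
    """Build SSML that says a name and then spells it letter by letter."""
    if rate not in ALLOWED_RATES:
        raise ValueError(f"unsupported rate: {rate!r}")
    safe = escape(name)  # escape &, < and > so the markup stays well-formed
    return (
        "<speak>"
        f"My name is {safe}. It is spelled "
        f'<prosody rate="{rate}">'
        f'<say-as interpret-as="characters">{safe}</say-as>'
        "</prosody>"
        "</speak>"
    )


def synthesize_ssml(ssml: str, voice_id: str = "Joanna") -> bytes:
    """Send SSML to Polly and return MP3 bytes (needs AWS credentials)."""
    import boto3  # imported lazily so the SSML helper works without AWS access

    polly = boto3.client("polly")
    response = polly.synthesize_speech(
        Text=ssml, TextType="ssml", OutputFormat="mp3", VoiceId=voice_id)
    return response["AudioStream"].read()


ssml = spell_out_ssml("Kuklinski")
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
```

Checking the markup locally with an XML parser before calling the service catches escaping mistakes without spending a request.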

Polly features: Lexicons
Enables developers to customize the pronunciation of
words or phrases
My daughter’s name is Kaja.
<lexeme>
<grapheme>Kaja</grapheme>
<grapheme>kaja</grapheme>
<grapheme>KAJA</grapheme>
<phoneme>"kaI.@</phoneme>
</lexeme>
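A lexeme like the one above has to be wrapped in a complete lexicon document before it can be uploaded. The sketch below assumes the W3C Pronunciation Lexicon Specification (PLS) envelope, an `x-sampa` alphabet (the phoneme string on the slide looks like X-SAMPA), and an `en-US` language tag; the helper names and the lexicon name `kaja` are hypothetical. The upload step assumes boto3's `put_lexicon` call and configured AWS credentials.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(graphemes, phoneme, lang="en-US", alphabet="x-sampa"):
    """Wrap one lexeme (several spellings, one pronunciation) in a PLS document."""
    grapheme_xml = "".join(
        f"<grapheme>{escape(g)}</grapheme>" for g in graphemes)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<lexicon version="1.0" xmlns="{PLS_NS}" '
        f'alphabet="{alphabet}" xml:lang="{lang}">'
        f"<lexeme>{grapheme_xml}<phoneme>{escape(phoneme)}</phoneme></lexeme>"
        "</lexicon>"
    )


def upload_lexicon(name: str, content: str) -> None:
    """Store the lexicon in Polly under the given name (needs AWS credentials)."""
    import boto3  # imported lazily so build_lexicon works without AWS access

    boto3.client("polly").put_lexicon(Name=name, Content=content)


lexicon = build_lexicon(["Kaja", "kaja", "KAJA"], '"kaI.@')
ET.fromstring(lexicon.encode("utf-8"))  # raises ParseError if malformed
```

Once uploaded, for example with `upload_lexicon("kaja", lexicon)`, the lexicon would be applied by passing `LexiconNames=["kaja"]` to `synthesize_speech`.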

Text-to-Speech: Under the Hood

Main Challenges of Text-to-Speech
Goal: Convert text into intelligible, accurate, and natural speech
Challenges
• Homographs: words written identically but pronounced differently
"I live in Las Vegas" vs. "This presentation broadcasts live from Las Vegas"
• Text normalization: disambiguation of abbreviations, acronyms, units
'St.' expanded as 'street' or 'saint'
• Conversion of text to phonemes (Grapheme-to-Phoneme) in languages with complex mapping such as English, e.g. tough, through, though
• Foreign words (déjà vu), proper names (François Hollande), slang (ASAP, LOL), etc.

[Diagram: TTS pipeline]
TEXT: Market grew by > 20%.
WORDS: Market grew by more than twenty percent
PHONEMES: ˈmɑɹ.kət ˈgɹu baɪ ˈmoʊɹ ˈðæn ˈtwɛn.ti pɚ.ˈsɛnt
Stage labels: TEXT PROCESSING, PROSODY CONTOUR, UNIT SELECTION AND ADAPTATION (with a speech units inventory), PROSODY MODIFICATION, STREAMING
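The step from TEXT to WORDS in the example above ("> 20%" becoming "more than twenty percent") can be illustrated with a toy normalizer. This is only a sketch of the idea under narrow assumptions: it handles the `>` sign, the `%` sign, and whole numbers from 0 to 99, and it is not how Polly's text processing is implemented. The function names are hypothetical.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer between 0 and 99."""
    if not 0 <= n <= 99:
        raise ValueError("this toy example only covers 0 to 99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text: str) -> str:
    """Expand '>', percentages and small numbers into plain words."""
    text = re.sub(r">\s*", "more than ", text)
    # One- or two-digit percentages, e.g. "20%" -> "twenty percent".
    text = re.sub(r"\b(\d{1,2})\s*%",
                  lambda m: number_to_words(int(m.group(1))) + " percent",
                  text)
    # Any remaining standalone one- or two-digit numbers.
    text = re.sub(r"\b(\d{1,2})\b",
                  lambda m: number_to_words(int(m.group(1))), text)
    return text


print(normalize("Market grew by > 20%."))
# prints: Market grew by more than twenty percent.
```

A production system also has to resolve ambiguity from context (for instance whether "St." means "street" or "saint"), which fixed rewrite rules like these cannot do.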

Unit Selection
Conversion of phoneme sequence to waveform
Database of recorded audio
Unit – diphone
Coverage of diphones and various features
e.g. Allophonic variation
• Pin vs Spin vs limping
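To make the notion of a diphone unit concrete, the sketch below splits a phoneme sequence into diphones, that is, pairs of adjacent phonemes, padding the utterance with silence at both ends. It is a simplified illustration: the `sil` marker and the function name are assumptions, and real unit selection also scores candidate recordings for each diphone, which is not shown here.

```python
def to_diphones(phonemes, silence="sil"):
    """Return the diphone sequence for a list of phoneme symbols.

    Each diphone spans the transition from one phoneme to the next, so an
    utterance of n phonemes yields n + 1 diphones once silence is added
    at the start and the end.
    """
    padded = [silence, *phonemes, silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# "pin" and "spin" share the diphones p-ɪ and ɪ-n, yet the /p/ is
# realised differently in each word (allophonic variation), which is why
# the slide lists coverage of features in addition to diphones.
print(to_diphones(["p", "ɪ", "n"]))
print(to_diphones(["s", "p", "ɪ", "n"]))
```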

Recording Data for TTS
[Diagram: tons of text → automatic selection of texts → recording script → few weeks of recordings]
Recording script:
• Covers all combinations of diphones and significant features in a language
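One common way to pick a small set of sentences that still covers every diphone is a greedy set-cover heuristic. The slides do not say which selection method was actually used, so the sketch below is a stand-in technique under that assumption. It takes sentences that are already transcribed into phonemes, and all names are hypothetical.

```python
def diphone_set(phonemes):
    """Set of adjacent phoneme pairs occurring in one sentence."""
    return set(zip(phonemes, phonemes[1:]))


def select_script(candidates):
    """Greedily choose sentences until every diphone seen in the pool is covered.

    `candidates` maps sentence text to its phoneme list. At each step the
    sentence adding the most not-yet-covered diphones is selected.
    """
    coverage = {text: diphone_set(ph) for text, ph in candidates.items()}
    remaining = set().union(*coverage.values())
    chosen = []
    while remaining:
        best = max(coverage, key=lambda text: len(coverage[text] & remaining))
        chosen.append(best)
        remaining -= coverage[best]
    return chosen


pool = {
    "pin": ["p", "ɪ", "n"],
    "spin": ["s", "p", "ɪ", "n"],
    "in": ["ɪ", "n"],
}
print(select_script(pool))  # "spin" alone covers every diphone in this pool
```

Greedy selection does not guarantee the smallest possible script, but it is simple and tends to shrink a large text pool considerably.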

an error occurred while searching for your route
because snaps weren't all so obedient anymore,
now we say apple again. and we say apple,
general electric soars today. information on general
electric
quick breads, zucchini, holiday, crock pot, cake,
so are you still keeping tabs on your old team,
that weighs more than four tons, disrupts the
herring's swim
…
An apple a day, keeps …

Architecture
[Diagram: RSS Feed, Amazon CloudWatch, AWS Lambda, Amazon Polly, Amazon S3]
1. Trigger  2. Check  3. Content  4. Text  5. Audio  6. Audio
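The numbered flow above can be sketched as a single function that a scheduled AWS Lambda invocation might run. The mapping of steps to actions here is an interpretation of the diagram labels, not something the slides spell out: check which feed entries already have audio, fetch the text of new ones, synthesize it, and store the result. All names are hypothetical, and the AWS calls are passed in as plain callables so the control flow can be exercised without AWS access.

```python
from typing import Callable, Iterable


def process_feed(
    entries: Iterable[dict],
    already_stored: Callable[[str], bool],
    synthesize: Callable[[str], bytes],
    store: Callable[[str, bytes], None],
) -> list:
    """Create audio for every feed entry that has none yet.

    Each entry is assumed to be a dict with "id" and "text" keys.
    Returns the storage keys of the audio files created in this run.
    """
    created = []
    for entry in entries:
        key = f"{entry['id']}.mp3"
        if already_stored(key):              # skip entries handled earlier
            continue
        audio = synthesize(entry["text"])    # text in, audio bytes out
        store(key, audio)                    # persist the audio
        created.append(key)
    return created


# Dry run with in-memory stand-ins for Amazon Polly and Amazon S3.
fake_bucket = {"episode-1.mp3": b"existing audio"}
feed = [{"id": "episode-1", "text": "First post"},
        {"id": "episode-2", "text": "Second post"}]
new_keys = process_feed(
    feed,
    already_stored=lambda key: key in fake_bucket,
    synthesize=lambda text: text.encode("utf-8"),
    store=fake_bucket.__setitem__,
)
print(new_keys)  # ['episode-2.mp3']
```

In a real deployment `synthesize` would wrap a Polly `synthesize_speech` call and `store` an S3 `put_object` call.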

Workshop & Demo
https://github.com/awslabs/amazon-polly-sample

Source
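from contextlib import closing

from boto3 import Session, resource

# Create a Polly client from the default AWS session
polly = Session().client("polly")
# Request MP3 audio for the text, spoken by the "Joanna" voice
response = polly.synthesize_speech(
    Text="Sample content",
    OutputFormat="mp3", VoiceId="Joanna")
# Read the audio stream and store it in the "podcasts" S3 bucket
with closing(response["AudioStream"]) as stream:
    bucket = resource("s3").Bucket("podcasts")
    bucket.put_object(Key="output.mp3", Body=stream.read())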

Summary
[Diagram: RSS Feed, Amazon CloudWatch, AWS Lambda, Amazon Polly, Amazon S3]
1. Trigger  2. Check  3. Content  4. Text  5. Audio  6. Audio

Polly is cost-effective
• Pay-as-you-go
• $4 per 1M characters
• Free Tier: 5M characters per month for the first year
• You can store and reuse generated speech
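The pricing above translates into a simple monthly estimate. The sketch below uses only the two figures from the slide ($4 per 1M characters, 5M free characters per month during the first year) and assumes the free allowance is deducted from the monthly character count before billing. The function name is hypothetical, and current prices may differ.

```python
PRICE_PER_MILLION_CHARS = 4.00         # USD, from the slide
FREE_TIER_CHARS_PER_MONTH = 5_000_000  # first year only, from the slide


def monthly_cost(characters: int, in_free_tier: bool = True) -> float:
    """Estimated monthly charge in USD for the given number of characters."""
    allowance = FREE_TIER_CHARS_PER_MONTH if in_free_tier else 0
    billable = max(0, characters - allowance)
    return billable / 1_000_000 * PRICE_PER_MILLION_CHARS


print(monthly_cost(7_500_000))                      # 10.0 (2.5M billable)
print(monthly_cost(1_000_000, in_free_tier=False))  # 4.0
```

Because generated speech can be stored and replayed, text that is synthesized once does not incur the per-character charge again on later playback.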
