Announcing Amazon Polly - Turn Text into Lifelike Speech - December 2016 Monthly Webinar Series

Amazon Polly is a service that turns text into lifelike speech. Amazon Polly lets you create applications that can talk, enabling you to create entirely new categories of speech-enabled products. In this webinar, you’ll get an overview of how Polly uses advanced deep learning technologies to synthesize speech that sounds like a human voice. You’ll also learn how you can use Polly’s 47 lifelike voices and support for 24 languages to build speech-enabled applications that work in many different countries.

Learning Objectives:
• Learn about the capabilities and features of Amazon Polly
• Learn about the benefits of Amazon Polly
• Learn about the different use cases
• Learn how to get started using Amazon Polly
• Learn how Polly speech audio can be distributed without restriction
• Get an overview of SSML
• Understand what is included in the AWS Free Tier and how to estimate usage costs

  1. Amazon Polly: A service that turns text into lifelike speech. Rafal Kuklinski, Piotr Lewalski, Amazon Text-to-Speech, 12/09/2016. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. What to Expect from the Session
     • Introduction to Amazon Polly
     • Features and functionalities
     • Text-to-Speech: Under the Hood
     • Getting started
     • Workshop & Demo
     • Pricing
     • Q&A
  3. Introduction to Amazon Polly
  4. Why we built Polly
     • Apps using voice to communicate with end users are becoming more common every day
     • Naturalness of generated speech is a key element of user experience
     • Integration of speech varies across use cases
  5. What is Polly
     • A service that converts text into lifelike speech
     • Offers 47 lifelike voices and 24 languages
     • Low-latency responses enable developers to build real-time systems
     • Developers can store, replay and distribute generated speech
  6. Polly – Quality
     • Natural-sounding speech: a subjective measure of how close TTS output is to human speech.
     • Accurate text processing: the ability of the system to interpret common text formats such as abbreviations, numerical sequences, homographs, etc. Examples: Today in Las Vegas, NV it's 90°F. / "We live for the music", live from the Madison Square Garden.
     • Highly intelligible: a measure of how comprehensible speech is. Example: "Peter Piper picked a peck of pickled peppers."
  7. Polly – Language Portfolio
     • Americas: Brazilian Portuguese, Canadian French, English (US), Spanish (US)
     • A-PAC: Australian English, Indian English, Japanese
     • EMEA: Danish, Dutch, British English, French, German, Icelandic, Italian, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, Welsh, Welsh English
  8. Features and Functionality
  9. Polly features: SSML
     Speech Synthesis Markup Language (SSML) is a W3C recommendation, an XML-based markup language for speech synthesis applications.
     <speak>
       My name is Kuklinski. It is spelled
       <prosody rate='x-slow'>
         <say-as interpret-as="characters">Kuklinski</say-as>
       </prosody>
     </speak>
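As a rough illustration of slide 9, the sketch below builds the same SSML in Python and shows one way it might be sent to Polly. The helper names `build_spelling_ssml` and `synthesize_ssml` are hypothetical, and the client is assumed to be a `boto3.client("polly")` with valid credentials; `TextType="ssml"` is the `synthesize_speech` parameter that marks the input as SSML rather than plain text.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_spelling_ssml(name: str) -> str:
    """Return SSML that says a name, then spells it slowly (hypothetical helper)."""
    safe = escape(name)  # escape &, <, > so the name cannot break the markup
    return (
        "<speak>"
        f"My name is {safe}. It is spelled "
        '<prosody rate="x-slow">'
        f'<say-as interpret-as="characters">{safe}</say-as>'
        "</prosody>"
        "</speak>"
    )


def synthesize_ssml(polly_client, ssml: str, voice_id: str = "Joanna") -> bytes:
    """Send SSML to Polly and return the MP3 bytes.

    Assumption: polly_client is boto3.client("polly") with valid credentials.
    """
    response = polly_client.synthesize_speech(
        Text=ssml,
        TextType="ssml",  # mark the input as SSML, not plain text
        OutputFormat="mp3",
        VoiceId=voice_id,
    )
    return response["AudioStream"].read()


ssml = build_spelling_ssml("Kuklinski")
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed XML
print(ssml)
```

Checking well-formedness locally before calling the service is a design choice here, not something the slides prescribe; it simply catches unbalanced tags early.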
  10. Polly features: Lexicons
      Enables developers to customize the pronunciation of words or phrases.
      Example sentence: My daughter's name is Kaja.
      <lexeme>
        <grapheme>Kaja</grapheme>
        <grapheme>kaja</grapheme>
        <grapheme>KAJA</grapheme>
        <phoneme>"kaI.@</phoneme>
      </lexeme>
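The `<lexeme>` on slide 10 is a fragment; Polly lexicons are uploaded as complete W3C Pronunciation Lexicon Specification (PLS) documents. The sketch below wraps the slide's lexeme in a `<lexicon>` root. The helper names, the `en-US` language tag, and the lexicon name in the usage note are assumptions for illustration; the `x-sampa` alphabet is inferred from the notation of the phoneme string on the slide.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(graphemes, phoneme, lang="en-US", alphabet="x-sampa"):
    """Build a one-lexeme PLS document (hypothetical helper).

    lang and alphabet defaults are assumptions, not taken from the slides.
    """
    grapheme_xml = "".join(f"<grapheme>{escape(g)}</grapheme>" for g in graphemes)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<lexicon version="1.0" xmlns="{PLS_NS}" '
        f'alphabet="{alphabet}" xml:lang="{lang}">'
        f"<lexeme>{grapheme_xml}<phoneme>{escape(phoneme)}</phoneme></lexeme>"
        "</lexicon>"
    )


def upload_lexicon(polly_client, name, content):
    """Store the lexicon in Polly under the given name.

    Assumption: polly_client is boto3.client("polly") with valid credentials.
    """
    polly_client.put_lexicon(Name=name, Content=content)


lexicon = build_lexicon(["Kaja", "kaja", "KAJA"], '"kaI.@')
ET.fromstring(lexicon.encode("utf-8"))  # raises ParseError if not well-formed
print(lexicon)
```

Once uploaded, for example with `upload_lexicon(polly, "names", lexicon)`, a lexicon can be applied to a request through the `LexiconNames` parameter of `synthesize_speech`.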
  11. Text-to-Speech: Under the Hood
  12. Main Challenges of Text-to-Speech
      Goal: Convert text into intelligible, accurate, and natural speech
      Challenges:
      • Homographs: words written identically that have different pronunciations, e.g. "I live in Las Vegas" vs "This presentation broadcasts live from Las Vegas"
      • Text normalization: disambiguation of abbreviations, acronyms, and units, e.g. "St." expanded as "street" or "saint"
      • Conversion of text to phonemes (grapheme-to-phoneme) in languages with complex mapping such as English, e.g. tough, through, though
      • Foreign words (déjà vu), proper names (François Hollande), slang (ASAP, LOL), etc.
  13. [Diagram: text-to-speech pipeline]
      • Text: "Market grew by > 20%."
      • Words: "Market grew by more than twenty percent"
      • Phonemes: ˈmɑɹ.kət ˈgɹu baɪ ˈmoʊɹ ˈðæn ˈtwɛn.ti pɚ.ˈsɛnt
      • Stage labels: text processing, prosody contour, unit selection and adaptation, prosody modification, streaming
      • Resource label: speech units inventory
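To make the text-processing step on slide 13 concrete, here is a deliberately tiny normalizer that expands the slide's example, "> 20%", into words. It is a toy under stated assumptions (only a ">" before a digit and whole percentages from 0 to 99 are handled) and says nothing about how Polly implements normalization; the function names are hypothetical.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 99."""
    if not 0 <= n <= 99:
        raise ValueError("toy example only covers 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text: str) -> str:
    """Expand '>' before a number and one- or two-digit percentages."""
    # "> 20%" -> "more than 20%"
    text = re.sub(r">\s*(?=\d)", "more than ", text)
    # "20%" -> "twenty percent"; longer numbers such as "120%" are left as-is
    return re.sub(
        r"\b(\d{1,2})%",
        lambda m: f"{number_to_words(int(m.group(1)))} percent",
        text,
    )


print(normalize("Market grew by > 20%."))
# -> Market grew by more than twenty percent.
```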
  14. Unit Selection
      • Conversion of a phoneme sequence to a waveform
      • Database of recorded audio
      • Unit: diphone
      • Coverage of diphones and various features, e.g. allophonic variation: pin vs spin vs limping
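A diphone runs from the middle of one phone to the middle of the next, so a phoneme sequence maps onto overlapping pairs. The sketch below shows only that mapping; the `sil` boundary marker and the `a-b` label format are assumptions for illustration, and a real unit-selection system would also score candidate recordings for each diphone.

```python
def to_diphones(phonemes, boundary="sil"):
    """Return the diphone labels for a phoneme sequence, padded with silence."""
    if not phonemes:
        return []
    padded = [boundary, *phonemes, boundary]
    # pair each phone with its successor: [a, b, c] -> a-b, b-c
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


pin = to_diphones(["p", "ɪ", "n"])
spin = to_diphones(["s", "p", "ɪ", "n"])
print(pin)   # -> ['sil-p', 'p-ɪ', 'ɪ-n', 'n-sil']
print(spin)  # -> ['sil-s', 's-p', 'p-ɪ', 'ɪ-n', 'n-sil']
```

Both words contain the label `p-ɪ`, yet the /p/ in "pin" is aspirated and the one in "spin" is not. That is the allophonic variation the slide refers to: the inventory needs coverage of such features, not just of the diphone labels.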
  15. Recording Data for TTS
      • Diagram labels: tons of text; automatic selection of texts; recording script; few weeks of recordings
      • Recording script: covers all combinations of diphones and significant features in a language
  16. an error occurred while searching for your route because snaps weren't all so obedient anymore, now we say apple again. and we say apple, general electric soars today. information on general electric quick breads, zucchini, holiday, crock pot, cake, so are you still keeping tabs on your old team, that weighs more than four tons, disrupts the herring's swim … An apple a day, keeps …
  17. Getting started
  18. Get started
  19. Voicing your blog
  20. AWS Blog
  21. Architecture [diagram]
      • Components: RSS Feed, Amazon Polly, Amazon CloudWatch, Amazon S3, AWS Lambda
      • Numbered steps: 1. Trigger, 2. Check, 3. Content, 4. Text, 5. Audio, 6. Audio
  22. Workshop & Demo
  23. Source
      from boto3 import Session, resource
      from contextlib import closing
      polly = Session().client("polly")
      response = polly.synthesize_speech(
          Text="Sample content",
          OutputFormat="mp3",
          VoiceId="Joanna")
      with closing(response["AudioStream"]) as stream:
          bucket = resource("s3").Bucket("podcasts")
          bucket.put_object(Key="output.mp3",
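The snippet on slide 23 is cut off in the transcript. A minimal runnable sketch of the same idea follows, assuming the missing argument passes the audio bytes as `Body`; that completion, the function name `synthesize_to_s3`, and its parameters are assumptions rather than the presenters' code. The Polly client and S3 bucket are passed in, so the function can be exercised without AWS access.

```python
from contextlib import closing


def synthesize_to_s3(polly_client, s3_bucket, text, key, voice_id="Joanna"):
    """Synthesize text with Polly and store the MP3 in an S3 bucket.

    Assumptions: polly_client is boto3.client("polly") and s3_bucket is
    boto3.resource("s3").Bucket(...), both with valid credentials.
    """
    response = polly_client.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
    )
    # AudioStream is a streaming body; closing() releases the connection
    with closing(response["AudioStream"]) as stream:
        # Assumed completion of the truncated call: upload the bytes as Body
        s3_bucket.put_object(Key=key, Body=stream.read())
    return key
```

With real AWS resources, a call might look like `synthesize_to_s3(Session().client("polly"), resource("s3").Bucket("podcasts"), "Sample content", "output.mp3")`, reusing the names from the slide.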
  24. Summary [diagram]
      • Components: RSS Feed, Amazon Polly, Amazon CloudWatch, Amazon S3, AWS Lambda
      • Numbered steps: 1. Trigger, 2. Check, 3. Content, 4. Text, 5. Audio, 6. Audio
  25. Other use cases
  26. Polly is cost-effective
      • Pay-as-you-go
      • $4 for 1M characters
      • Free Tier of 5M characters/month for the first year
      • You can store and reuse generated speech
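The figures on slide 26 allow a back-of-the-envelope cost estimate. The sketch below assumes the Free Tier allowance is simply subtracted from the month's character count before billing; the slides do not spell out the billing rules, so treat it as an approximation, and note that the function name is hypothetical.

```python
PRICE_PER_MILLION_USD = 4.00           # from the slide: $4 for 1M characters
FREE_TIER_CHARS_PER_MONTH = 5_000_000  # from the slide: 5M characters/month, first year


def estimate_monthly_cost(characters: int, in_free_tier: bool = True) -> float:
    """Estimate one month's Polly cost in USD for a given character count."""
    if characters < 0:
        raise ValueError("characters must be non-negative")
    free = FREE_TIER_CHARS_PER_MONTH if in_free_tier else 0
    billable = max(0, characters - free)  # assumption: allowance deducted first
    return billable * PRICE_PER_MILLION_USD / 1_000_000


print(estimate_monthly_cost(7_500_000))                      # -> 10.0
print(estimate_monthly_cost(1_000_000, in_free_tier=False))  # -> 4.0
```

Because generated speech can be stored and reused, text that is synthesized once and replayed from storage does not count again toward the monthly total.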
  27. Thank you!