Voice-first devices are a massively growing market. Amazon Echo and Google Home are the first to create an open ecosystem and offer basic integration possibilities. The AI software that delivers this experience is available as APIs and can be used to build sophisticated custom solutions. The key to success is speech-to-text quality. This talk compares different APIs and shares and demonstrates best practices for using speech recognition APIs.
2. Voice First Footprint
In 2017 there will be 33 million devices
● The Voice 2017 Report - VoiceLabs analysis combined with research from CIRP, KPCB and InfoScout
3. Voice adoption
The ‘Voice First’ era has already started
● Alexa in 4% of US households (end of 2016)
● Siri handles over 2bn commands a week
● 20% of Google searches on Android handsets are input by voice
Pictured: Alexa, Google Home, Ding Dong
4. Voice Devices
Creating an open ecosystem
● Amazon Echo: Skills and Alexa Voice Service
● Google Home: Google Assistant Actions
5. Speech Recognition API
Developing for Amazon Alexa
● Limited understanding
Amazon Echo is built for predefined options (e.g. no custom notes). The session ends after 8 sec.
● The predefined wake word defines the customer experience
Only 4 wake words are available, and one must be used in every conversation.
● No notifications and no presence
You can't alert the user to an event. You cannot react to events such as the user arriving home (e.g. to say "welcome home").
● No audio / no identification
Anybody can use Alexa (guests, etc.) and access all information.
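To illustrate the session limit above, here is a minimal sketch of a custom skill response that asks Alexa to keep the session open and supplies a reprompt. The field names follow the Alexa Skills Kit JSON response format as I understand it and should be checked against the current reference. The helper name `build_response` and the example texts are hypothetical.

```python
import json


def build_response(speech: str, reprompt: str, end_session: bool = False) -> dict:
    """Build a custom-skill response body (hypothetical helper)."""

    def plain(text: str) -> dict:
        return {"type": "PlainText", "text": text}

    response = {"outputSpeech": plain(speech), "shouldEndSession": end_session}
    if not end_session:
        # The reprompt is spoken if the user stays silent while the session is open.
        response["reprompt"] = {"outputSpeech": plain(reprompt)}
    return {"version": "1.0", "response": response}


if __name__ == "__main__":
    body = build_response("Which room?", "Please name a room.")
    print(json.dumps(body, indent=2))
```

Keeping the session open only extends the dialogue by one more listening window. It does not lift the restriction to predefined options.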
6. Technology Stack
Components enabling Voice User Interfaces
● Applications: implemented use cases leveraging the hardware and AI software.
● AI Software: software that interprets speech, enables conversations and provides a natural voice.
● Hardware: devices the consumer interacts with, like Amazon Echo or Google Home.
8. Speech Recognition API
Real-time speech-to-text APIs
Feature | Google (4) | IBM (3) | Microsoft (2)
Status | Beta | Beta/Production | Preview
Language support (1) | 43 (89) | 8 (14) | 6 (7)
Cost/min | 0,024 € (0,006 / 15 sec) | 0,02 € | 0,06 € (1000 calls of 15 sec for 4 $)
Speaker detection | No | English (8 kHz) | No
Audio formats | FLAC, Linear16, MULAW, AMR, AMR_WB | FLAC, PCM, WAV, OGG, MULAW | PCM single channel, Siren, SirenSR
Noise friendly | Yes | Unknown | Unknown
Word hints | Yes | No | No
1) Language support: number of languages supported (number including dialects)
2) Microsoft: https://www.microsoft.com/cognitive-services/en-us/speech-api
3) IBM: http://www.ibm.com/watson/developercloud/speech-to-text.html
4) Google: https://cloud.google.com/speech/
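As a rough illustration of the cost figures above, the sketch below compares per-minute pricing with billing in 15-second blocks. The prices are copied from the comparison. The rule that every started 15-second block is charged in full is an assumption made for illustration, as are the function names. Actual billing rules should be checked with each provider.

```python
import math

# Per-minute prices as listed in the comparison (units as given on the slide).
PRICE_PER_MIN = {"google": 0.024, "ibm": 0.02, "microsoft": 0.06}


def cost_by_duration(provider: str, seconds: float) -> float:
    """Cost if the provider bills exactly by duration (assumption)."""
    return PRICE_PER_MIN[provider] * seconds / 60.0


def cost_in_blocks(seconds: float, block_price: float = 0.006,
                   block_seconds: float = 15.0) -> float:
    """Cost if each started block is billed in full (assumption)."""
    blocks = math.ceil(seconds / block_seconds)
    return blocks * block_price


if __name__ == "__main__":
    for secs in (10, 20, 60):
        print(secs, cost_by_duration("google", secs), cost_in_blocks(secs))
```

Under these assumptions, a 20-second request costs 0.008 when billed by exact duration but 0.012 when billed in started 15-second blocks, so many short requests can cost more than the per-minute figure suggests.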
9. Speech Recognition API
Best practices
● High audio capture quality
Use a lossless codec. Capture audio at 16,000 Hz or higher. Use the native sample rate.
● No additional noise
The APIs already include noise reduction, so applying your own noise reduction on top can reduce quality. Echo and background noise have a huge impact on speech recognition quality.
● User education
Educate users to stay close to the microphone.
● One speaker per stream
In multi-speaker settings, try to separate the audio streams, as the current APIs are built for dictation.
● Provide context
Context matters a lot. Provide word hints to help the system detect words correctly.
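Some of these practices can be checked in code before audio is sent. The sketch below validates a WAV recording (sample rate, mono, 16-bit) with Python's standard `wave` module and builds a request body that carries word hints. The request field names (`sampleRateHertz`, `speechContexts`, `phrases`) follow the Google Cloud Speech REST API as I understand it and may differ between API versions. The helper names `check_wav` and `build_request` are hypothetical, and no network call is made.

```python
import base64
import io
import wave


def check_wav(data: bytes, min_rate: int = 16000) -> list:
    """Return warnings for a WAV byte string (an empty list means it looks fine)."""
    warnings = []
    with wave.open(io.BytesIO(data), "rb") as wav:
        if wav.getframerate() < min_rate:
            warnings.append("sample rate below %d Hz" % min_rate)
        if wav.getnchannels() != 1:
            warnings.append("more than one channel; use one speaker per stream")
        if wav.getsampwidth() != 2:
            warnings.append("samples are not 16-bit PCM")
    return warnings


def build_request(pcm: bytes, rate: int, hints: list,
                  language: str = "en-US") -> dict:
    """Build a recognition request body with word hints (field names assumed)."""
    return {
        "config": {
            "encoding": "LINEAR16",          # uncompressed 16-bit PCM
            "sampleRateHertz": rate,         # native rate of the recording
            "languageCode": language,
            "speechContexts": [{"phrases": hints}],  # word hints
        },
        "audio": {"content": base64.b64encode(pcm).decode("ascii")},
    }
```

A caller would run `check_wav` on the recording first, then pass the raw PCM frames and the recording's own sample rate to `build_request`, rather than resampling.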
10. Problem
Real life - Voice is in the early days
● Speech-to-text quality
● Speaker recognition
● Language mixing
● Punctuation
12. We are building a voice-first company and are looking for support
- Technical Research
- Deep Learning & NLP Scientist
- Software Engineers
Christian Rebernik
Contact: christian@6voices.com