SPEECH
RECOGNITION
TECHNOLOGY
Presented by:
Nicole Bralic | Sergio Rumantir | Louis Fong |
Niharika Kohli | Aamir Sheriff
Agenda
1. Origins / history of speech recognition
2. How it works – the technical aspects
3. Issues and concerns
4. Latest trends and future opportunities
5. Activity
ORIGINS / HISTORY
OF SPEECH
RECOGNITION
Introduction
Speech Recognition
What is the first thought that comes to mind?
Origins
When was the first Speech Recognition
Software developed?
a) 1950 b) 1960 c) 1970 d) 1980
Origins
Answer: 1950s
First appearance - could only understand digits
Origins
1960s
Understood 16 words spoken in English
Origins
1970s
Understood 1011 words
Origins
1980s
Understood thousands of words, but still slow
Origins
1990s
First comprehensive software
Cost = $9,000
Origins
2000s
Built into Mac OS X and Windows Vista
Origins
2010s
Apple introduces Siri
HOW IT WORKS
The technical aspects
Types of Users
Small Vocabulary / Many Users
Large Vocabulary / Few Users
How it works
Speech Recognition Models
Today's speech recognition systems rely on sophisticated
statistical models. These models use probability and
mathematical functions to determine the most likely outcome.
The two models that dominate the field today are the Hidden Markov
Model (HMM) and neural networks.
The “Hidden Markov” Model
1. Each phoneme is like a link in a
chain, and the completed chain is a
word.
2. The chain branches off in different
directions as the program attempts to
match the digital sound with the
phoneme that's most likely to come
next.
3. The program assigns a probability
score to each phoneme, based on its
built-in dictionary and user training.
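The chain idea above can be sketched as a toy hidden Markov model. This is only an illustration under stated assumptions: the phonemes for "cat", the acoustic symbols ("burst", "vowel", "stop") and every probability below are made up for the example, whereas a real recognizer would estimate them from its dictionary and from training data. The function finds the most likely chain of phonemes for a sequence of observed sounds.

```python
import math

# Hypothetical toy model: three phoneme states for the word "cat" (k -> ae -> t).
STATES = ["k", "ae", "t"]

# Probability of starting in each phoneme (the chain always begins at "k").
START = {"k": 1.0, "ae": 0.0, "t": 0.0}

# Transition probabilities: stay on the current link or move to the next one.
TRANS = {
    "k": {"k": 0.5, "ae": 0.5, "t": 0.0},
    "ae": {"k": 0.0, "ae": 0.6, "t": 0.4},
    "t": {"k": 0.0, "ae": 0.0, "t": 1.0},
}

# Emission probabilities: how likely each phoneme is to produce each
# (made-up) acoustic symbol.
EMIT = {
    "k": {"burst": 0.7, "vowel": 0.1, "stop": 0.2},
    "ae": {"burst": 0.1, "vowel": 0.8, "stop": 0.1},
    "t": {"burst": 0.2, "vowel": 0.1, "stop": 0.7},
}


def safe_log(p):
    """Log of a probability, mapping zero to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations):
    """Return (most likely phoneme path, its log probability)."""
    # best[s] holds the best score and path that end in state s so far.
    best = {
        s: (safe_log(START[s]) + safe_log(EMIT[s][observations[0]]), [s])
        for s in STATES
    }
    for obs in observations[1:]:
        new_best = {}
        for s in STATES:
            # Pick the best previous state to arrive from.
            score, path = max(
                (best[prev][0] + safe_log(TRANS[prev][s]), best[prev][1])
                for prev in STATES
            )
            new_best[s] = (score + safe_log(EMIT[s][obs]), path + [s])
        best = new_best
    score, path = max(best.values())
    return path, score


path, score = viterbi(["burst", "vowel", "vowel", "stop"])
print(path, math.exp(score))
```

Working in log probabilities avoids numerical underflow when many small probabilities are multiplied over a long utterance.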
Neural Network
There are four basic steps to performing
recognition:
1. Digitize the speech that we want to
recognize.
2. Compute features that represent the
spectral-domain content of the speech.
3. Use a neural network (also called an ANN,
multi-layer perceptron, or MLP) to classify
a set of these features into phonetic-based
categories at each frame.
4. Use a Viterbi search to match the
neural-network output scores to the target
words, in order to determine the word that
was most likely uttered.
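The four steps above can be sketched end to end on synthetic data. Everything here is an assumption made for illustration: two pure tones stand in for phonetic categories ("low" and "high"), the two-word lexicon is hypothetical, and the network weights are set by hand rather than trained. It is a minimal sketch of the pipeline's shape, not a working recognizer.

```python
import math

SAMPLE_RATE = 8000            # samples per second (assumed)
FRAME_SIZE = 80               # 10 ms frames (assumed)
CATEGORIES = ["low", "high"]  # stand-ins for phonetic categories
LEXICON = {"low-high": ["low", "high"], "high-low": ["high", "low"]}

# Hand-set weights for illustration only; a real system learns these.
W1 = [[4.0, -4.0], [-4.0, 4.0]]
W2 = [[2.0, 0.0], [0.0, 2.0]]


def digitize(segments):
    """Step 1: turn (frequency_hz, duration_s) tones into integer samples."""
    samples = []
    for freq, duration in segments:
        for _ in range(round(duration * SAMPLE_RATE)):
            t = len(samples) / SAMPLE_RATE
            samples.append(round(16000 * math.sin(2 * math.pi * freq * t)))
    return samples


def spectral_features(frame):
    """Step 2: fraction of spectral energy below and above 1 kHz (naive DFT)."""
    n = len(frame)
    low = high = 0.0
    for k in range(1, n // 2):
        angle = 2 * math.pi * k / n
        re = sum(x * math.cos(angle * t) for t, x in enumerate(frame))
        im = sum(x * math.sin(angle * t) for t, x in enumerate(frame))
        if k * SAMPLE_RATE / n < 1000:
            low += re * re + im * im
        else:
            high += re * re + im * im
    total = low + high
    return [low / total, high / total]


def classify(features):
    """Step 3: tiny MLP (tanh hidden layer, softmax output) per frame."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features))) for row in W1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W2]
    exps = [math.exp(v) for v in logits]
    return {c: e / sum(exps) for c, e in zip(CATEGORIES, exps)}


def viterbi_score(frame_probs, phonemes):
    """Step 4: best log score for frames passing through phonemes in order."""
    neg_inf = float("-inf")
    scores = [neg_inf] * len(phonemes)
    scores[0] = math.log(frame_probs[0][phonemes[0]])
    for probs in frame_probs[1:]:
        new_scores = [neg_inf] * len(phonemes)
        for i, phoneme in enumerate(phonemes):
            # Each frame either stays on a phoneme or advances to the next.
            advance = scores[i - 1] if i > 0 else neg_inf
            new_scores[i] = max(scores[i], advance) + math.log(probs[phoneme])
        scores = new_scores
    return scores[-1]


def recognize(samples):
    """Run steps 2 to 4 and return the best-matching word in the lexicon."""
    frames = [
        samples[i:i + FRAME_SIZE]
        for i in range(0, len(samples) - FRAME_SIZE + 1, FRAME_SIZE)
    ]
    frame_probs = [classify(spectral_features(f)) for f in frames]
    return max(LEXICON, key=lambda w: viterbi_score(frame_probs, LEXICON[w]))


print(recognize(digitize([(300, 0.1), (2000, 0.1)])))
```

The Viterbi step only compares whole words from the lexicon, which is why a small vocabulary is easier to recognize reliably than a large one.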
Overall Process
ISSUES
Poll Time!
Why / Why not?
Issue: Accuracy & Performance
How accurate was the
performance of Siri?
What caused this lack
of accuracy?
Issue: Accuracy & Performance
Why were the accuracy and performance of Siri
low in the previous video?
● Background noise
● Overlapped speech
● Speaker’s accent
● Syntactic error
● iPhone 4S and Siri weren’t advanced enough
Issue: Accuracy & Performance
● Technological improvement
● More vocabulary, lower accuracy
● Can perfectly recognize “one” to “nine”, but as the library grows,
some words become easy to confuse
● Speaker-dependent vs. speaker-independent
● Isolated, discontinuous, continuous (natural) speech
● Read vs. spontaneous speech
● Cannot understand a sentence whose syntax is badly broken
● Adverse conditions: noise, distortion
Issue: Accuracy & Performance
Accuracy
enhanced!
Issue: Privacy
Google Chrome: Your PC becomes an open
microphone
Wearable technology in the workplace: Should
not be used to monitor employees
Facebook music and TV recognition: Is it
really turned off?
Issue: Control
As technology advances quickly, is government
legislation good enough to control the proper
usage of speech recognition software?
Is it even possible to control?
LATEST TRENDS
AND FUTURE
OPPORTUNITIES
Nuance Communications
Computer software technology corporation
Market leader in speech and imaging applications
o server & embedded speech recognition
o telephone call steering systems
o automated telephone directory services
o medical transcription software & systems
o optical character recognition
o desktop imaging
Dragon NaturallySpeaking
• World’s best-selling speech recognition software
• For home, student, power and professional users
• Essential for people with visual impairments
Using Python to Code by Voice
• The beginning: Tavis Rudd
developed “Emacs pinky”, a repetitive strain injury (RSI)
• Months of coding in Python and
Emacs
• Dragon NaturallySpeaking voice
recognition software on Microsoft
Windows
• Over 2,000 of his own custom commands
• The code has been released for download
Dragon Drive
Nuance integrates its technology with cloud and on-board
vehicle capabilities to enable distraction-free driving
through Dragon Drive voice commands. Over 90 million
cars are currently equipped with Nuance Dragon Drive.
Dragon Medical
Dragon Medical provides clinical
documentation solutions for over 300,000
physicians. This portfolio captures the
physician narrative to document care in the
EHR – anywhere, any time and on any device.
Real-time Skype Translator
Microsoft will release the first beta of real-time
Skype Translator for Windows 8 before the end of
2014.
Microsoft is currently implementing near real-time
voice translation between multiple languages in a Skype
call.
Functional instant translation is currently available
from English to German and Chinese.
Future Trends
Voice recognition market to reach
US$2.5 billion in revenue by 2015
Typed passwords aren’t going to
work in the future
Class Discussion
• What do you think the future of speech
recognition technology will look like?
• What are some other uses of this
technology?
• Do you think the benefits outweigh the
issues?
CLASS ACTIVITY
VOLUNTEERS?
THANK YOU!
QUESTIONS?


Editor's Notes

  • #9 1011 words - approximately the vocabulary of a three-year-old
  • #10 Still requires the speaker to pause after every single word - inefficient
  • #11 Still requires the speaker to pause after every single word - inefficient
  • #12 Still requires the speaker to pause after every single word - inefficient DID YOU KNOW?
  • #13 Still requires the speaker to pause after every single word - inefficient
  • #15 Small vocabulary / many users: Ideal for automated telephone answering. The system can identify variations in accent and speech patterns and understand them most of the time. Usage is limited to a small number of predetermined commands and inputs.
    Large vocabulary / limited users: These systems work best in a business environment where a small number of users work with the program. The system must be trained on its small number of primary users. The accuracy rate falls dramatically with any other user.
  • #23 (http://www.polleverywhere.com/)
  • #32 Dragon Naturally Speaking – Demo Video
  • #34 Using Python to code by voice - demo
  • #35 Allowed him to code more efficiently by voice than by hand
  • #36 VoicePod – Video Demo
  • #38 Pros: provides the convenience of smartphone connectivity while keeping eyes on the road and hands on the wheel; mitigates distracted driving, which decreases the number of accidents; ease of search and navigation; ability to manage multiple suppliers; access to content by voice alone. Cons: there is still a screen to navigate when driving; costly.
  • #40 Benefits: Physicians can spend more time with patients because the documentation process is more efficient. Doctors can see more patients. The narrative is stored by the clinic for future reference (environmental benefit: reduced use of paper).
  • #42 Pros: An advantage for businesses. In a conference call with a business contact who speaks a foreign language, Skype will help lower the language barrier and bridge geographic and language boundaries, giving businesses more opportunities. In the future it is to be implemented on Android devices. Cons: There will be some errors involving slang and language structure, and possibly in voice recognition due to microphone quality and background noise. It will most likely not be a free service.
  • #43 http://www.biometricupdate.com/201405/voice-recognition-market-to-reach-us2-5-billion-in-revenue-by-2015 http://www.gizmodo.com.au/2014/05/the-guy-who-invented-computer-passwords-thinks-theyre-a-nightmare/ The man who invented the computer password 50 years ago has suggested that passwords are no longer a secure measure of privacy (opinion: could “Passvoice” be the future?). Using voice recognition as a password is not a new approach, but it is not well developed or widely adopted. Accuracy is one of the major issues.