Acoustic Speech Recognition Techniques
Audio Signal → Recognized Text
Sonu Kumar Mishra
BE Comp 2015-16
• Introduction
• Historical Survey
• Motivation
• General Model of ASR
• Feature Extraction
• Hidden Markov Model
• Existing Systems
• Developing ASR Systems
• Revolution in ASR
[Figure: Human and Computer, each with Speech and Text, linked through Meaning. Input: Understanding. Output: Generation.]
• The first speech recognition system, Audrey, was
developed at Bell Laboratories in 1952. It could
recognise digits spoken by a single speaker.
• In the 1970s, Carnegie Mellon introduced the HARPY
system, which could recognise 1011 words and handle
different pronunciations.
• In the 1980s, new systems based on the Hidden Markov
Model (HMM) were introduced. The HMM is a statistical
approach and proved more robust than earlier
techniques.
• Language is the fundamental mode of
communication. Communicating effectively with
machines in natural language remains a
challenge.
• ASR systems can be used to control machines,
find content online, and help generate
content.
• Most speech recognition systems and
content (about 80%) are available for only 10
major languages. Hence, there is a need to
extend these systems to local languages.
• Being able to interact with machines in our own
local language would be a great
achievement for the modern era.
[Flowchart: Speech Input → Analog to digital → Feature Extraction (generate speech fingerprint) → Compare and select words → Compare and select (sentence-level match) → Pick most probable word → Output]
[Supporting blocks: Training algorithm (probability matrices updated), Word fingerprint template, Sentence fingerprint template, HMM]
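The "compare and select" step can be illustrated with a toy sketch. This is not the HMM-based matching of the flowchart: it swaps in a plain Euclidean distance against stored templates, and the template values, the `WORD_TEMPLATES` name, and the scoring rule are all assumptions made for illustration.

```python
import math

# Hypothetical word templates: each word maps to a stored feature "fingerprint".
# The numbers are made up for illustration only.
WORD_TEMPLATES = {
    "yes": [0.9, 0.1, 0.3],
    "no": [0.2, 0.8, 0.5],
    "stop": [0.4, 0.4, 0.9],
}


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def word_scores(fingerprint, templates):
    """Turn distances into normalized scores that sum to 1 (closer = higher)."""
    weights = {w: math.exp(-distance(fingerprint, t)) for w, t in templates.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


def recognize(fingerprint, templates=WORD_TEMPLATES):
    """Pick the word whose template best matches the input fingerprint."""
    scores = word_scores(fingerprint, templates)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(recognize([0.85, 0.15, 0.25]))
```

A real system would compare sequences of feature vectors rather than one fixed-length vector, which is where the HMM comes in.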
“Dividing sound waves, extracting phonemes, and representing them using some parameters”
LPCC: low resource; high popularity; easy implementation; single speaker; single language; below 300 words
MFCC: moderate resource; high popularity; easy-to-moderate implementation; multi-speaker; multi-language; moderate vocabulary
RASTA-PLP: high resource; low popularity; moderate-to-hard implementation; multi-speaker; multi-language; large vocabulary
Outdated techniques: Power Spectral Analysis (FFT), First Order Derivative (DELTA), Energy Normalization
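The three steps named above can be sketched in a few lines of standard-library Python. This is a minimal illustration under assumptions, not a production front end: the power spectrum uses a direct DFT instead of a fast FFT, the delta is a plain frame-to-frame difference, and energy normalization is taken to mean subtracting the peak log energy (one common variant). All function names are hypothetical.

```python
import cmath
import math


def power_spectrum(frame):
    """Power spectrum of one frame via a direct DFT (an FFT gives the same values faster)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


def delta(features):
    """First-order derivative: difference between consecutive frame values."""
    return [features[i] - features[i - 1] for i in range(1, len(features))]


def normalize_energy(log_energies):
    """Energy normalization (assumed variant): shift log energies so the peak is 0."""
    peak = max(log_energies)
    return [e - peak for e in log_energies]


if __name__ == "__main__":
    # One cycle of a sine wave over 8 samples: the energy should sit in bin 1.
    frame = [math.sin(2 * math.pi * t / 8) for t in range(8)]
    spec = power_spectrum(frame)
    print(spec.index(max(spec)))
    print(delta([1.0, 3.0, 6.0]))
    print(normalize_energy([2.0, 5.0, 3.0]))
```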
DNN: multiple users, multiple languages, large vocabulary, abundant resources
Phonetic Dictionary (TTS Synthesizer)
[Diagram: hidden (latent) data Z1, Z2, ..., Zn forming a “Markov chain”; observed data X1, X2, ..., Xn]
Why HMM?
• A simple model for sequential and temporal data.
• Handles real-world applications.
• Works on the principle:
New State = f(old state, noise)
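The principle "New State = f(old state, noise)" can be made concrete with a small sampling sketch. The two states, their transition probabilities, and the function names are assumptions chosen for illustration; the noise is simply a random number in [0, 1).

```python
import random

# Hypothetical two-state chain; the probabilities are illustrative only.
TRANSITIONS = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.4, "speech": 0.6},
}


def next_state(old_state, noise):
    """New state = f(old state, noise), where noise is a number in [0, 1)."""
    cumulative = 0.0
    state = old_state
    for state, prob in TRANSITIONS[old_state].items():
        cumulative += prob
        if noise < cumulative:
            return state
    return state  # guard against floating-point rounding


def sample_chain(start, steps, seed=0):
    """Generate a state sequence; each step depends only on the previous state."""
    rng = random.Random(seed)
    states = [start]
    for _ in range(steps):
        states.append(next_state(states[-1], rng.random()))
    return states


if __name__ == "__main__":
    print(sample_chain("silence", 5))
```

Because the next state depends only on the current one, the whole history never needs to be stored, which is what keeps the model simple for sequential data.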
HMM parameters: Initial Probability, Transition Probability, Observation Probability
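Given these three sets of probabilities, the most likely hidden-state sequence for a series of observations can be found with the Viterbi algorithm. The sketch below is a toy: the states ("silence", "speech"), the observations ("low", "high" energy), and every number are assumptions chosen for illustration.

```python
# Toy HMM; all names and values below are illustrative assumptions.
STATES = ["silence", "speech"]
INITIAL = {"silence": 0.6, "speech": 0.4}
TRANSITION = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.4, "speech": 0.6},
}
EMISSION = {
    "silence": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.2, "high": 0.8},
}


def viterbi(observations, states, initial, transition, emission):
    """Return the most probable hidden-state sequence for the observations."""
    # best[t][s] = probability of the best path that ends in state s at time t
    best = [{s: initial[s] * emission[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, prob = max(
                ((p, best[t - 1][p] * transition[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = prob * emission[s][observations[t]]
            back[t][s] = prev
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path


if __name__ == "__main__":
    print(viterbi(["low", "high", "high"], STATES, INITIAL, TRANSITION, EMISSION))
    # → ['silence', 'speech', 'speech']
```

Real recognizers work with log probabilities to avoid numeric underflow on long sequences; plain products are kept here for readability.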
Applications:
• Speech recognition.
• Facial expression recognition.
• Handwriting recognition.
• Bioinformatics: analyzing biological data.
• Large systems like Siri, Google Voice and
Cortana are based on neural networks.
• High-performance processors.
• AI algorithms.
• Parallel processing.
In order to build from scratch: Time, Money, Scientists, Computing Power, Engineers
Using built-in support:
Kaldi is a toolkit for speech recognition written in C++ and
licensed under the Apache License v2.0. Kaldi is intended for
use by speech recognition researchers.
CMU Sphinx is an open source toolkit for speech recognition, which includes
a recognizer library written in C and an adjustable, modifiable
recognizer written in Java.
Acoustic model, language model, Input source, Dictionary
• Using ASR systems to interact
with the devices used in daily
life.
• ASR systems that work in local
languages.
• Developing neural-network-based
ASR systems that work in all
major languages.