Speech Recognition in
Konkani
Nilkanth Shet Shirodkar
What is Speech Recognition?
Also known as automatic speech recognition or computer speech recognition: the ability of a computer to understand spoken input and carry out the required task.
Where can it be used?
- System control/Controlling devices
- Commercial/Industrial applications
- Voice dialing
Recognition
[Block diagram: Voice Input -> Analog to Digital conversion -> Speech Engine (Acoustic Model + Language Model) -> Display]
Speech Recognition
1. Voice recording
2. Word boundary detection
3. Feature extraction
4. Recognition with the help of language models
Components of the recognition system
① Sound Recording and Word Detection Component
Takes input from an audio recorder, preferably a microphone, and identifies the words in the input signal. Word detection is usually done using the energy and the zero-crossing rate of the signal; a minimal sketch follows below. The output of this component is then sent to the feature extractor module.
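A minimal sketch (not the author's exact method) of computing per-frame energy and zero-crossing rate with NumPy; word boundaries would then be placed where these measures rise above thresholds estimated from the leading silence. Frame and hop sizes are assumptions for 16 kHz audio.

```python
import numpy as np

def frame_energy_zcr(signal, frame_len=400, hop=160):
    """Per-frame short-time energy and zero-crossing rate.

    Defaults assume 16 kHz audio: 25 ms frames with a 10 ms hop.
    """
    energies, zcrs = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len].astype(float)
        energies.append(np.sum(frame ** 2))                        # short-time energy
        zcrs.append(np.mean(np.abs(np.diff(np.sign(frame))) / 2))  # zero crossings per sample
    return np.array(energies), np.array(zcrs)

# Frames whose energy (and, for unvoiced sounds, ZCR) exceed thresholds
# estimated from leading silence are labelled as speech; the first and last
# such frames give the word boundaries.
```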
② Feature Extractor
This component is responsible for generating feature vectors for the audio signal passed to it by the word detection component. It computes MFCCs (Mel-Frequency Cepstral Coefficients), which are used later to identify the audio signal; a sketch follows below.
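A minimal sketch of MFCC extraction; librosa is used here as one common choice (the slide does not prescribe a specific toolkit), and "word.wav" is a placeholder file name.

```python
import librosa

# Load the detected word at 16 kHz and compute 13 MFCCs per frame.
signal, sr = librosa.load("word.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
print(mfcc.shape)   # (13, number_of_frames) -- the feature vectors
```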
• 3. Recognition System
– An HMM (Hidden Markov Model) based component that takes as input the feature vectors generated by the feature extractor and finds the best, most suitable match in the knowledge model (see the sketch after this list).
• 4. Knowledge Model
– The language dictionary that is used to identify the sound signal.
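As a rough illustration of steps 3 and 4, the sketch below picks the word whose model gives the highest log-likelihood for the extracted feature vectors. The names are hypothetical; `word_models` is assumed to map each dictionary word to a trained HMM exposing an hmmlearn-style `score()` method.

```python
def recognise(feats, word_models):
    """Return the word whose model best explains the MFCC frames.

    feats       : 2-D array of feature vectors, shape (n_frames, 13)
    word_models : dict mapping word -> trained HMM with a score(feats) method
                  (e.g. hmmlearn's GaussianHMM), i.e. the knowledge model
    """
    scores = {word: model.score(feats) for word, model in word_models.items()}
    return max(scores, key=scores.get)   # best (most suitable) match
```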
Speech Recognition System
Acoustic Model
• The features extracted from the input sound by the extraction module have to be compared against a predefined model to identify the spoken word. Two kinds of models are used:
• Word Model
• Phone Model
• Phone Model :- Only parts of words, called phones, are modelled instead of modelling the word as a whole. Instead of matching the sound against each whole word, the sound is matched against the phones and the word is recognised from its parts.
• Word Model :- The words are modelled as a whole. During recognition, the input sound is matched against each word present in the model and the best possible match is then considered to be the spoken word.
o Phone Set :- A phoneme is the basic, or smallest, unit of sound.
o Examples: aa, a, iy
o Dictionary
• A dictionary, also known as the pronunciation lexicon, specifies the pronunciation of each word as a linear sequence of phonemes, for example:
• the : dh ax
• on : aa n
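A minimal sketch of reading such a pronunciation lexicon from a CMU-style .dic file, where each line is "word phone1 phone2 ..."; the file name is a placeholder.

```python
def load_lexicon(path="konkani.dic"):
    """Read a pronunciation dictionary into {word: [phoneme, ...]}."""
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts:                          # skip blank lines
                lexicon[parts[0]] = parts[1:]  # word -> phoneme sequence
    return lexicon

# e.g. load_lexicon()["the"] would return ["dh", "ax"] for the entry above.
```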
Language Model
• Provides the speech recognition system with a fair idea of the context and the words that can occur in that context. It also indicates which words are possible in the language and the sequences in which these words may occur.
HMM for ASR
• Build an HMM for each phone (a training sketch follows after this list).
• Combine the phone models, based on the pronunciation model, to create word-level models.
• Combine the word-level models based on the language model.
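A minimal sketch of the first two steps, assuming the hmmlearn library and toy MFCC data; a real system would train on labelled Konkani recordings (e.g. via SphinxTrain) rather than random arrays.

```python
import numpy as np
from hmmlearn import hmm

# Toy training data (assumption): for each phone, a few MFCC sequences,
# each a 2-D array of shape (n_frames, 13).
rng = np.random.default_rng(0)
phone_data = {
    "aa": [rng.normal(size=(40, 13)), rng.normal(size=(35, 13))],
    "n":  [rng.normal(size=(30, 13)), rng.normal(size=(28, 13))],
}

# Step 1: train a small 3-state HMM for each phone.
phone_hmms = {}
for phone, sequences in phone_data.items():
    X = np.vstack(sequences)                # all frames stacked together
    lengths = [len(s) for s in sequences]   # per-sequence frame counts
    model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
    model.fit(X, lengths)                   # Baum-Welch training
    phone_hmms[phone] = model

# Step 2: a word-level model is the chain of its phone HMMs, following the
# pronunciation lexicon, e.g. "on" -> aa n.
word_model_on = [phone_hmms[p] for p in ["aa", "n"]]
```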
How Language Models work
• Hard to compute directly:
– P("And nothing but the truth")
• Decompose the probability:
– P("And nothing but the truth") = P("And") × P("nothing" | "And") × P("but" | "And nothing") × P("the" | "And nothing but") × P("truth" | "And nothing but the")
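A minimal sketch of this decomposition using a bigram approximation (each word conditioned only on the previous word) trained on a toy corpus; a real system would estimate these probabilities from a large Konkani text corpus.

```python
import math
from collections import Counter

# Toy corpus (assumption), used only to illustrate the counting.
corpus = "and nothing but the truth and nothing else".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def sentence_logprob(words):
    """Chain rule with a bigram approximation: P(w1) * prod P(wi | w_{i-1})."""
    logp = math.log(unigrams[words[0]] / len(corpus))            # P(w1)
    for prev, cur in zip(words, words[1:]):
        logp += math.log(bigrams[(prev, cur)] / unigrams[prev])  # P(cur | prev)
    return logp

print(sentence_logprob("and nothing but the truth".split()))
```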
CMUSphinx
Sphinx3 is the speech recognizer (decoder).
SphinxTrain is a set of tools for acoustic
modeling.
SphinxBase is a common set of libraries used by the CMU Sphinx tools.
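A minimal decoding sketch using the pocketsphinx Python bindings (the lightweight CMU Sphinx decoder; the classic pocketsphinx-python API is assumed here). The Konkani acoustic model, language model, dictionary, and audio file names are placeholders.

```python
from pocketsphinx import Decoder

# Placeholder paths: a Konkani acoustic model (SphinxTrain output),
# language model, and pronunciation dictionary are assumed to exist.
config = Decoder.default_config()
config.set_string('-hmm', 'konkani_acoustic_model')
config.set_string('-lm', 'konkani.lm')
config.set_string('-dict', 'konkani.dic')
decoder = Decoder(config)

with open('utterance.raw', 'rb') as f:          # 16 kHz, 16-bit mono PCM audio
    decoder.start_utt()
    decoder.process_raw(f.read(), False, True)  # feed the whole utterance
    decoder.end_utt()

hyp = decoder.hyp()
print(hyp.hypstr if hyp else "(no hypothesis)")  # best-matching word sequence
```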
Jasper
• Jasper is an open source platform for
developing voice-controlled applications.
• You use your voice to ask it for information
• Jasper runs on a Raspberry Pi
• Jasper can be configured to build your own personal assistant (a minimal module sketch follows below)
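A minimal sketch of a Jasper speech module, following Jasper's standard WORDS / isValid / handle interface; the keyword and the spoken reply are just illustrations.

```python
import re

# Keywords Jasper listens for to route an utterance to this module.
WORDS = ["HELLO"]

def isValid(text):
    """Jasper calls this to check whether this module should handle `text`."""
    return bool(re.search(r'\bhello\b', text, re.IGNORECASE))

def handle(text, mic, profile):
    """Respond to the user; mic.say() speaks the reply aloud."""
    name = profile.get('first_name', 'there')
    mic.say("Hello, " + name + ". I am your personal assistant.")
```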
Resources
• List of publications:
  http://cmusphinx.sourceforge.net/wiki/research/
  Speech Recognition With CMU Sphinx [blog by N. Shmyrev, Sphinx developer]
• Speech recognition seminars at the Leiden Institute of Advanced Computer Science (LIACS), Netherlands:
  http://www.liacs.nl/~erwin/speechrecognition.html
  http://www.liacs.nl/~erwin/SR2003/
  http://www.liacs.nl/~erwin/SR2005/
  http://www.liacs.nl/~erwin/SR2006/
  http://www.liacs.nl/~erwin/SR2009/