We assume that while speech is being given to our
system, there is no noise other than what comes
from the user.
In certain places we use a stored database that is
generated after training on the training sets.
11/14/2012 2YoGiV
To implement the above system we use three
subsystems:
1. ASR (Automatic Speech Recognition)
2. Dialogue Management
3. Spoken Language Generation
ASR is the first subsystem of the spoken dialogue
system (SDS). It takes voice as input, converts it
into grammatically correct text, and stores the
result in the system. Its main job is to turn the raw
voice signal (including noise) into recognized
speech that the next subsystem can use. This is our
main area of focus.
The dialogue management subsystem mainly manages
the output produced by ASR according to the identity
of the individual speaker, and stores it in the system
for use in the next subsystem.
The spoken language generation subsystem uses the
stored speech and generates spoken language
(English in our case).
In our case we are dealing with
ASR (Automatic Speech Recognition).
ASR takes voice as input and converts it into
understandable speech.
Questions arise:
● How can the system distinguish between different
speakers?
● How can the system distinguish between ambient
noise and someone speaking?
● How can the system derive meaning from what was
said?
To answer these questions, we start by describing
the most important part: “Speech”.
Some of the factors that must be kept in mind
when taking speech as input:
a) Biological Factors
b) Phonology
c) Frequency of Sounds
d) Timing
Biological factors:
1. The way our mouth moves to produce certain sounds
affects the features of the sound itself.
2. The structure of the mouth produces multiple
waves in certain patterns.
3. When we shape our mouths to make certain sounds,
say 't', we push out more air at once, making a
higher-frequency sound. So one thing we must take
care of is the frequency of speech, and along with
frequency we take amplitude and pitch into
consideration.
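As a rough illustration of reading frequency and amplitude from a sound, the sketch below runs a naive discrete Fourier transform over one short frame and reports the strongest component. It is only a sketch under stated assumptions: the function name `dominant_frequency`, the 8000 Hz sample rate, and the synthetic 440 Hz tone are all invented for the example and do not come from the slides.

```python
import cmath
import math

def dominant_frequency(samples, sample_rate):
    """Return (frequency_hz, amplitude) of the strongest component in one frame."""
    n = len(samples)
    best_bin, best_mag = 0, 0.0
    # Bin 0 is the constant offset; bins 1..n/2-1 are the distinct positive frequencies.
    for k in range(1, n // 2):
        coeff = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    frequency = best_bin * sample_rate / n
    # 2*|X_k|/n recovers the sine amplitude when the tone falls exactly on a bin.
    amplitude = 2 * best_mag / n
    return frequency, amplitude

# Synthetic stand-in for a voiced sound: a 440 Hz tone with amplitude 0.5 (assumed values).
sample_rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(400)]
freq, amp = dominant_frequency(tone, sample_rate)
```

A real recognizer would use a fast Fourier transform and many overlapping frames; the naive loop is kept here only because it needs no external library.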
● Phonology describes how we use sound to convey
meaning in a language.
● In English it describes the characteristics of sounds
such as vowels and consonants.
● A phoneme is the smallest segmental unit of sound in
a language. Each phoneme has features that
distinguish it from other phonemes, and phonemes
combine to represent words and sentences.
English has about 40-50 phonemes.
So we use phonemes to help remove noise from the
sound.
● Different vowels have different pitches; they are
similar to musical notes,
● for example 'i' being the highest and 'u' being the
lowest.
● Consonant phonemes have more waves, oscillating
off different parts of the mouth.
● So, based on their different frequencies, the system
can store words with different phonemes.
There is a lot of information in timing. Breaks between
words and breaks between one sentence and another
must all be considered in the speech in order to
distinguish between different words. According to
research, vowels last longer than consonants.
Considering the above factors, we have to:
● Translate from frequencies to a representation of a
phoneme.
● Discard useless information such as noise.
● Make sure the sentence created makes sense.
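The role of timing can be sketched with a simple energy threshold: frames whose energy stays below the threshold are treated as breaks, and each run of louder frames is treated as one spoken segment. This is a minimal sketch, not the method used by the system described here; the function names, the frame size of 20 samples, the threshold of 0.01, and the synthetic signal are all assumptions made for illustration.

```python
def frame_energies(samples, frame_size):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_size]) / frame_size
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]

def speech_segments(energies, threshold):
    """Return (start_frame, end_frame) pairs, end exclusive, for runs above the threshold."""
    segments, start = [], None
    for i, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = i                      # a loud run begins
        elif energy <= threshold and start is not None:
            segments.append((start, i))    # a break ends the run
            start = None
    if start is not None:
        segments.append((start, len(energies)))
    return segments

# Synthetic signal: silence, a loud burst, a pause, then a quieter burst.
signal = [0.0] * 40 + [0.8, -0.8] * 30 + [0.0] * 40 + [0.6, -0.6] * 20
energies = frame_energies(signal, frame_size=20)
segments = speech_segments(energies, threshold=0.01)
```

With these made-up values the two bursts come out as two separate segments, which is the kind of boundary information that breaks between words provide.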
For the above problems we use two models and one
database:
● Acoustic Model
● Dictionary
● Language Model
The acoustic model is based on all the features of a
sound wave:
● Frequency
● Pitch
● Amplitude
● Time information
● The Acoustic Model is the statistical mapping from
the units of speech to all the features of speech.
● It statistically converts speech sounds to phonemes
and then to words.
● It carries information about the phonology of the
language. It can learn from a training set.
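One very simplified way to picture a statistical mapping learned from a training set is to fit one Gaussian per phoneme over a single feature and then pick the phoneme under which a new observation is most likely. The sketch below does exactly that and nothing more. The labels `IY` and `UW`, the frequency-like feature values, and the function names are invented for the example; a real acoustic model uses many features per frame and a far richer statistical model.

```python
import math
from collections import defaultdict

def train(examples):
    """Estimate the mean and variance of the feature for each phoneme label."""
    grouped = defaultdict(list)
    for label, value in examples:
        grouped[label].append(value)
    model = {}
    for label, values in grouped.items():
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        model[label] = (mean, max(variance, 1e-6))  # floor avoids division by zero
    return model

def log_likelihood(value, mean, variance):
    """Log of the Gaussian density of value under the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * variance) + (value - mean) ** 2 / variance)

def classify(model, value):
    """Return the phoneme label under which the observed value is most likely."""
    return max(model, key=lambda label: log_likelihood(value, *model[label]))

# Made-up training set: (phoneme label, one frequency-like feature in Hz).
training_set = [("IY", 2200.0), ("IY", 2300.0), ("IY", 2250.0),
                ("UW", 850.0), ("UW", 900.0), ("UW", 950.0)]
model = train(training_set)
guess = classify(model, 2100.0)
```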
The dictionary lists each word broken into the
phoneme sounds it is typically made of.
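A dictionary of this kind can be pictured as a lookup table from words to phoneme sequences. The sketch below is a toy example only: the three entries, the phoneme labels, and the helper names `phonemes_for` and `words_for` are assumptions chosen for illustration, not the contents of any real pronunciation dictionary.

```python
# Toy pronunciation table: word -> the phonemes it is typically made of (illustrative entries).
PRONUNCIATIONS = {
    "cat": ["K", "AE", "T"],
    "act": ["AE", "K", "T"],
    "tack": ["T", "AE", "K"],
}

def phonemes_for(word):
    """Return the phoneme sequence for a word, or None if the word is not listed."""
    return PRONUNCIATIONS.get(word.lower())

def words_for(phonemes):
    """Return every listed word whose pronunciation equals the phoneme sequence."""
    target = list(phonemes)
    return [word for word, entry in PRONUNCIATIONS.items() if entry == target]
```

For instance, `words_for(["T", "AE", "K"])` picks out "tack" even though "cat" and "act" are built from the same three phonemes, because the order of the phonemes matters.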
● The language model provides word-level structure for a language.
● It uses formal grammar rules to build sentences, since we use context to
place a particular word at a particular place.
To implement this context matching in systems we use probability: we
calculate the probability of the next word from the words that came
before it.
The probability of a word is based on the last N-1 words.
P(Y) = ∑ P(Y | X) P(X)
(sum over all X)
X = the words already present in the sentence (the context).
Y = the word sequence being observed.
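For the simplest case, N = 2, the probability of the next word depends only on the one word before it and can be estimated by counting word pairs in training text. The sketch below shows that counting approach under stated assumptions: the three training sentences, the `<s>` start marker, and the function names are invented for the example, and no smoothing is applied for unseen pairs.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()  # <s> marks the sentence start
        for previous, following in zip(words, words[1:]):
            counts[previous][following] += 1
    return counts

def next_word_probability(counts, previous, following):
    """Estimate P(following | previous) as a relative frequency (0.0 if previous is unseen)."""
    total = sum(counts[previous].values())
    return counts[previous][following] / total if total else 0.0

# Tiny made-up training set.
corpus = ["I like speech", "I like music", "I like speech recognition"]
counts = train_bigrams(corpus)
p_speech = next_word_probability(counts, "like", "speech")
p_music = next_word_probability(counts, "like", "music")
```

Here "speech" follows "like" in two of the three sentences and "music" in one, so the model prefers "speech" after "like". Larger N works the same way but conditions on more preceding words.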
B. Tech, Computer Science
JIIT, Noida

Automatic Speech Recognition
