How Speech Recognition Works
By: Taqi Shah
taqi.shajee@gmail.com
  1. How Speech Recognition Works
     By: Taqi Shah (taqi.shajee@gmail.com)
  2. Speech Recognition
     • Speech recognition is one of the most active research topics today. In fact, many full-fledged speech recognition applications are being implemented in the West to increase work efficiency.
     • Speech recognition uses several techniques to "recognize" the human voice. It functions as a pipeline that converts digital audio signals coming from the sound card into recognized speech.
     • These signals pass through several stages, where various mathematical and statistical methods are applied to figure out what is actually being said.
  3. How It Works
     • The voice input to the microphone goes to the sound card. The output from the sound card (digital audio) is processed using the Fast Fourier Transform (FFT), then refined using Hidden Markov Models (HMMs) and other techniques.
     • A built-in database is used to analyze what has been spoken. At the final stage, results are fed back into the database so that the system can adapt. The final recognized output then goes back to the CPU.
  4. Sounds Simple
     • The user gives a voice command over the microphone, which is passed to the sound card in your system. This analog signal is sampled 16,000 times a second and converted into digital form using a technique called Pulse Code Modulation (PCM). The resulting digital waveform is a stream of amplitudes that looks like a wavy line when plotted.
     • The speech recognition software can't figure out anything from this stream directly; it first has to translate it into something it can easily recognize. So it converts the signal into a set of discrete frequency bands using a technique called the windowed Fast Fourier Transform (FFT).
     • For this, the audio signal is divided into short frames, one every 1/100th of a second, and each frame is converted into its component frequencies. The incoming stream is now a set of discrete frequency bands, in a form that can be used by the speech recognizer.
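The two steps on this slide (PCM sampling, then a windowed transform on each 1/100th-second frame) can be sketched roughly as follows. This is a minimal illustration, not how any particular recognizer does it: the function names, the 16-bit depth, the Hann window, and the use of non-overlapping frames are all assumptions made for the example, and a plain DFT stands in for the FFT to keep the code free of external libraries (an FFT computes the same result faster).

```python
import cmath
import math

SAMPLE_RATE = 16_000               # samples per second, as on the slide
FRAME_STEP = SAMPLE_RATE // 100    # one frame per 1/100th of a second = 160 samples


def pcm_encode(signal, bits=16):
    """Quantize floats in [-1.0, 1.0] to signed integers (linear PCM).

    The 16-bit depth is an assumption; the slide only gives the sample rate.
    """
    max_level = 2 ** (bits - 1) - 1
    return [round(max(-1.0, min(1.0, s)) * max_level) for s in signal]


def hann(n, size):
    """Hann window weight for sample n of a frame with `size` samples."""
    return 0.5 - 0.5 * math.cos(2 * math.pi * n / (size - 1))


def frame_spectrum(frame):
    """Magnitude spectrum of one windowed frame, computed with a plain DFT."""
    size = len(frame)
    windowed = [x * hann(n, size) for n, x in enumerate(frame)]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * n / size)
                for n, x in enumerate(windowed)))
        for k in range(size // 2)
    ]


def dominant_frequencies(samples):
    """Return the strongest frequency (in Hz) of each 10 ms frame."""
    peaks = []
    for start in range(0, len(samples) - FRAME_STEP + 1, FRAME_STEP):
        spectrum = frame_spectrum(samples[start:start + FRAME_STEP])
        k = max(range(len(spectrum)), key=spectrum.__getitem__)
        peaks.append(k * SAMPLE_RATE / FRAME_STEP)
    return peaks


# Usage: a 1 kHz test tone lasting 30 ms yields three frames.
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(480)]
print(dominant_frequencies(pcm_encode(tone)))
```

A real front end would keep the whole spectrum of each frame (or features derived from it) rather than a single peak; the peak is used here only to make the output easy to read.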
  5. Sounds Simple
     • The next stage involves recognizing these bands of frequencies. For this, the speech recognition software has a database containing thousands of frequency patterns for "phonemes", as the basic sounds of speech are called.
     • A phoneme is the smallest unit of speech in a language or dialect. Each phoneme is distinct from the others, so that if one phoneme replaces another in a word, the word takes on a different meaning. For example, if the "b" in "bat" were replaced by the phoneme "r", the word would change to "rat".
     • The phoneme database is used to match the audio frequency bands that were sampled. So, for example, if the incoming frequency pattern sounds like a "t", the software will try to match it to the corresponding phoneme in the database. Each phoneme is tagged with a feature number, which is then assigned to the incoming signal.
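The matching step described on this slide might look something like the sketch below. Everything in it is made up for illustration: the three-band templates, the feature numbers, and the choice of Euclidean distance as the similarity measure are assumptions, and a real database would hold far more entries with richer features.

```python
import math

# Hypothetical phoneme database: feature number -> (phoneme, band energies).
PHONEME_DB = {
    1: ("t", [0.1, 0.2, 0.9]),
    2: ("s", [0.0, 0.6, 0.8]),
    3: ("a", [0.9, 0.5, 0.1]),
}


def match_phoneme(bands, database=PHONEME_DB):
    """Return (feature_number, phoneme) of the template closest to `bands`."""
    best = min(database, key=lambda number: math.dist(bands, database[number][1]))
    return best, database[best][0]


# Usage: an incoming frame that is close to, but not exactly, the "t" template.
print(match_phoneme([0.1, 0.25, 0.85]))
```

Picking the nearest template rather than demanding an exact match is what lets the lookup tolerate small differences between the incoming sound and the stored pattern.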
  6. Figuring Out The Right Sound
     • There can be so many variations in sound due to how words are spoken that it's almost impossible to exactly match an incoming sound to an entry in the database.
     • For example, the "t" in "stop" sounds different from the "t" in, say, "table". Not only that, but different people pronounce the same word differently. To make matters worse, the environment also adds its own share of noise.
     • Therefore, the software has to use complex techniques to approximate the incoming sound and figure out which phonemes are being used.
  7. Other Techniques
     • There are many other complexities involved in recognizing sound.
  8. Other Techniques
     • For example, the software has to be able to judge when one phoneme ends and the next one begins. For this, it uses Hidden Markov Models (HMMs), a statistical technique for modeling sequences. To figure out when speech starts and stops, a speech recognizer has silence phonemes, which are also assigned feature numbers.
     • There are also some phonemes that depend upon what comes before or after them. For example, consider two words, "see" and "saw". Here the vowels "ee" and "aw" blend into the phoneme "s", and you hear the vowels for a longer period than the "s". To solve this problem, speech recognition software uses tri-phones: phonemes modeled together with the phonemes surrounding them.
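One common way to use an HMM for finding phoneme boundaries is Viterbi decoding, which picks the most likely sequence of states for a run of observed feature numbers. The sketch below is a toy version under loud assumptions: the three states (a silence phoneme plus "s" and "ee"), all probabilities, and the mapping of feature numbers 0, 1, 2 to those states are invented for the example, and the slides do not say which decoding algorithm the software uses.

```python
import math

STATES = ["sil", "s", "ee"]          # "sil" is the silence phoneme

# All probabilities below are illustrative, not measured values.
START = {"sil": 0.8, "s": 0.1, "ee": 0.1}
TRANS = {
    "sil": {"sil": 0.6, "s": 0.3, "ee": 0.1},
    "s":   {"sil": 0.1, "s": 0.6, "ee": 0.3},
    "ee":  {"sil": 0.3, "s": 0.1, "ee": 0.6},
}
# EMIT[state][feature_number]: chance of seeing that feature number in the state.
EMIT = {
    "sil": {0: 0.8, 1: 0.1, 2: 0.1},
    "s":   {0: 0.1, 1: 0.8, 2: 0.1},
    "ee":  {0: 0.1, 1: 0.1, 2: 0.8},
}


def viterbi(observations):
    """Return the most likely state for each observed feature number."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
             for s in STATES}
    back = []                        # back[t][s] = best predecessor of s
    for obs in observations[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][obs])
            pointers[s] = best
        back.append(pointers)
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):  # walk the back-pointers to the start
        state = pointers[state]
        path.append(state)
    return path[::-1]


def boundaries(path):
    """Frame indices at which one phoneme ends and the next begins."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


# Usage: seven frames of feature numbers for silence, "s", "ee", silence.
frames = [0, 0, 1, 1, 2, 2, 0]
path = viterbi(frames)
print(path, boundaries(path))
```

In a tri-phone system the states would stand for phonemes in context (for instance an "s" followed by "ee") rather than bare phonemes, but the decoding idea stays the same.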
  9. Other Techniques
     • In another technique, called pruning, the software generates several hypotheses about what could have been spoken in a particular utterance. It then scores each hypothesis and takes the one with the highest score. The ones with lower scores are "pruned" out.
     • This is the essence of how speech recognition works, though there are many other complexities involved. The technology holds great promise for the future.
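Pruning as described on this slide can be sketched as a small beam search: at each step the candidate hypotheses are extended, scored, and all but the best few are dropped. The candidate words, their log scores, and the beam width of 2 are assumptions made up for this example; a real recognizer would derive scores from its acoustic and language models.

```python
import heapq


def prune(hypotheses, beam_width):
    """Keep only the `beam_width` highest-scoring (score, words) hypotheses."""
    return heapq.nlargest(beam_width, hypotheses, key=lambda h: h[0])


def beam_search(steps, beam_width=2):
    """Return the best (score, words) hypothesis.

    `steps` is a list of {word: log_score} dicts, one per position in the
    utterance. Scores are added because they are log-probabilities.
    """
    beam = [(0.0, [])]
    for candidates in steps:
        expanded = [
            (score + word_score, words + [word])
            for score, words in beam
            for word, word_score in candidates.items()
        ]
        beam = prune(expanded, beam_width)   # lower-scoring hypotheses are dropped
    return beam[0]


# Usage: two positions, each with two made-up candidates.
steps = [
    {"recognize": -0.2, "wreck a nice": -0.9},
    {"speech": -0.1, "beach": -1.5},
]
score, words = beam_search(steps)
print(words, round(score, 2))
```

Keeping a small beam instead of every hypothesis trades a little accuracy for a large saving in work, since the number of possible word sequences grows quickly with the length of the utterance.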
