Abstract
This paper describes a customized speech
recognition system that can recognize any regional
language. Speech recognition is the translation of
spoken words into text. Speech-to-text involves
capturing and digitizing the sound waves, converting
them to basic language units or phonemes,
constructing words from phonemes, and contextually
analyzing the words. Speech recognition software is
designed so that, with a microphone, it interprets
spoken words to carry out computer commands. Existing
speech recognition systems are capable of
recognizing only globally accepted languages such as
English, French, Spanish and German.
The proposed system recognizes voice
commands in any language. A command is
converted into phonemes with the help of Microsoft
SAPI, a speech application programming interface
developed by Microsoft. The software works by
identifying the sound patterns that the user produces
and associating each pattern with a particular
action in its custom grammar.
1. INTRODUCTION
Language is man’s most important means
of communication, and speech is its primary medium.
Speech research provides an international forum for
communication among researchers in the disciplines
that contribute to our understanding of its
production, perception, processing, learning and use.
Spoken interaction, both between human
interlocutors and between humans and machines, is
inescapably embedded in the laws and conditions of
communication, which comprise the encoding and
decoding of meaning as well as the mere
transmission of messages over an acoustic channel.
Speech recognition technology has tremendous
potential, as it is an integral part of future intelligent
devices, in which speech recognition and speech
synthesis are used as the basic means of
communicating with humans. This technology
transforms spoken words into alphanumeric text and
navigational commands that can be recognized by a
PC. It will simplify the Herculean task of typing and
may eventually eliminate the conventional keyboard.
The technology is also valuable in manufacturing and
control applications where the hands or eyes are
otherwise occupied, and it keeps disabled and elderly
people from being cut off from the internet and
information technology.
For years, speech recognition was the poster
child for technology that never lived up to its
promise. Only a few years ago, the products were
expensive, inaccurate and hard to use. Fast PCs and
software improvements mean that
speech recognition technology finally offers real
benefits. Recently there has been a large increase in
the number of recognition applications for use over
telephones, including automated dialling, operator
assistance and remote data access services such as
financial services, as well as voice dictation systems
like medical transcription applications.
2. EXISTING SYSTEM
To understand how speech recognition works, it is
desirable to have some knowledge of speech and of
which of its features are used in the recognition
process. In the human brain, thoughts are constructed
into sentences, and nerves control the shape of the
vocal tract to produce the desired sounds. The sounds
that come out are made up of phonemes, which are the
building blocks of speech. Each phoneme has a unique
fundamental frequency and hence unique formant
frequencies, and it is these features that enable the
identification of each phoneme at the recognition
stage.
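The idea that each phoneme carries a characteristic frequency can be illustrated with a minimal pitch-estimation sketch. This is illustrative only: real recognizers use far richer features than a single fundamental frequency, and the synthetic tone below simply stands in for one voiced frame.

```python
import math

def fundamental_frequency(samples, sample_rate):
    """Estimate the fundamental frequency of a voiced frame via
    autocorrelation: the lag with the strongest self-similarity
    corresponds to one pitch period."""
    n = len(samples)
    best_lag, best_score = 0, 0.0
    # Search lags covering roughly 50 Hz to 400 Hz (typical voice range).
    for lag in range(sample_rate // 400, sample_rate // 50):
        score = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        if score > best_score:
            best_score, best_lag = score, lag
    return sample_rate / best_lag if best_lag else 0.0

# A synthetic 200 Hz tone sampled at 8 kHz stands in for a voiced phoneme.
rate = 8000
tone = [math.sin(2 * math.pi * 200 * t / rate) for t in range(1024)]
print(round(fundamental_frequency(tone, rate)))  # 200
```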
The system has two primary components. The
first, called the acoustic model, removes
noise and unneeded information such as changes
in volume. Then, using mathematical calculations,
it reduces the data to a spectrum of frequencies,
analyzes the data, and converts the words into a
digital representation of phonemes. At this point the
second major component of speech recognition
software, the language model, kicks in. The
language model analyzes the content of the speech: it
compares the combinations of phonemes to the
words in its digital dictionary, a huge database of
the most common words in the English language.
The language model quickly decides on the words
and displays them on the screen.
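The acoustic-model / language-model split described above can be pictured as a toy two-stage pipeline. The frame names, phoneme labels and dictionary entries here are hypothetical, not a real phonetic alphabet or recognizer output:

```python
# Toy two-stage pipeline mirroring the acoustic-model / language-model
# split: frames -> phonemes -> dictionary word.

def acoustic_model(audio_frames):
    """Stand-in for the acoustic model: map each cleaned-up
    frame to its most likely phoneme label."""
    frame_to_phoneme = {"f1": "h", "f2": "eh", "f3": "l", "f4": "ow"}
    return [frame_to_phoneme[f] for f in audio_frames]

def language_model(phonemes, dictionary):
    """Stand-in for the language model: look the phoneme
    sequence up in a digital dictionary of known words."""
    key = "-".join(phonemes)
    return dictionary.get(key, "<unknown>")

dictionary = {"h-eh-l-ow": "hello", "w-er-l-d": "world"}
phonemes = acoustic_model(["f1", "f2", "f3", "f4"])
print(language_model(phonemes, dictionary))  # hello
```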
Sophisticated voice recognition software
offers features that allow it to learn
the speech patterns of its users. This is usually done by
creating voice files during the installation process.
Because individuals pronounce specific words
differently, the more information the speech
recognition software has about a particular user’s
speech, the better it can recognize what the user
is saying at any given time, and the fewer mistakes
it makes in translating speech or executing
commands.
Limitations: Speech recognition software begins
with a database of pre-programmed sound
patterns. However, actual user speech varies: a
user’s pronunciation of a given word can change,
the microphone gathering the sound may be of poor
quality, and ambient noise may be present; all of
these can alter the sound pattern for a particular
word. Speech recognition software works best once it
has gathered data about each user’s speech patterns,
which means that the software needs an introductory
learning curve. Moreover, speech recognition is
available only in English, French, Spanish, German,
Japanese, Simplified Chinese and Traditional
Chinese.
3. MICROSOFT SAPI
The Speech Application Programming
Interface, or SAPI, is an application programming
interface developed by Microsoft to allow the use
of speech recognition and speech synthesis within
Windows applications. SAPI has been shipped
either as part of a Speech Software
Development Kit (SDK) or as part of the Windows
operating system itself. SAPI is designed
so that a software developer can use speech
recognition and synthesis through a standard set of
interfaces, accessible from a variety of
programming languages. In addition, it is possible
for a third-party company to produce its own
Speech Recognition and Text-To-Speech engines
or adapt existing engines to work with SAPI. In
principle, as long as these engines conform to the
defined interfaces, they can be used instead of the
Microsoft-supplied engines.
The Speech API is a freely redistributable
component which can be shipped with any
Windows application that wishes to use speech
technology. Broadly, the Speech API can be viewed
as an interface or piece of middleware which sits
between applications and speech engines. In SAPI
versions 1 to 4, applications could communicate
directly with engines. The API included an
abstract interface definition which applications
and engines conformed to. Applications could also
use simplified higher-level objects rather than
directly call methods on the engines. In SAPI 5
however, applications and engines do not
communicate with each other directly. Instead,
each talks to a runtime component. There is an API
implemented by this component which
applications use, and another set of interfaces for
engines. Typically, in SAPI 5, applications issue calls
through the API (for example to load a recognition
grammar; start recognition; or provide text to be
synthesized). The runtime component interprets
these commands and processes them, where
necessary calling on the engine interface (for
example, the loading of a grammar from a file is
done in the runtime, but then the grammar data is
passed to the recognition engine to actually use in
recognition). The recognition and synthesis
engines also generate events while processing (for
example, to indicate an utterance has been
recognized or to indicate word boundaries in the
synthesized speech). These pass in the reverse
direction, from the engines, through the runtime
component, and on to an event sink in the
application. In addition to the actual API definition
and runtime component, other components, such
as API definition files, a Control Panel applet and
redistributable components, are also shipped with
all versions of SAPI to make a complete Speech
Software Development Kit.
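The SAPI 5 mediation pattern described above, where the application and engine never talk directly and events flow back through a runtime component to an event sink, can be sketched conceptually. Every class and method name here is illustrative; this is not the real SAPI interface:

```python
# Conceptual sketch of the SAPI 5 architecture: calls flow from the
# application through a runtime to the engine; events flow back the
# other way to an event sink registered by the application.

class RecognitionEngine:
    """Engine side: matches an utterance against the grammar data
    handed over by the runtime."""
    def recognize(self, grammar, utterance):
        return utterance if utterance in grammar else None

class SpeechRuntime:
    """Runtime component sitting between application and engine."""
    def __init__(self, engine):
        self.engine = engine
        self.grammar = set()
        self.event_sink = None  # callback registered by the application

    def load_grammar(self, words):
        # Grammar loading happens in the runtime; the parsed data is
        # then passed to the engine at recognition time.
        self.grammar = set(words)

    def hear(self, utterance):
        result = self.engine.recognize(self.grammar, utterance)
        if result and self.event_sink:
            # Events pass from the engine, through the runtime,
            # on to the application's event sink.
            self.event_sink(result)

runtime = SpeechRuntime(RecognitionEngine())
runtime.load_grammar(["open", "close"])
runtime.event_sink = lambda word: print("recognized:", word)
runtime.hear("open")  # prints: recognized: open
```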
4. PROPOSED SYSTEM
The proposed variant of such a speech
recognition system focuses on regional languages
as well. This system can be trained according to the
regional language used by the speaker. A new
grammar, which the regional language follows,
can be created to handle the voice commands. The
prepared grammar is basically a database of
words in any language, along with a specific task
assigned to each word.
A training stage is carried out during the initial
installation process so that the user can build up a
grammar for any language used with the system. The
system emphasises teaching the computer how
to respond to the voice patterns and perform
the required task. The user can specify actions for
some custom words. Then the system converts
each specified word into its corresponding phonemes
using Microsoft SAPI, the Speech Application
Programming Interface, and stores them in the
database along with their action, forming a custom
grammar.
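The training stage can be pictured as building a small database mapping each word's phoneme sequence to an action. This is a sketch: the phoneme conversion is a placeholder for what SAPI would provide, and the regional-language words and actions are hypothetical examples:

```python
def to_phonemes(word):
    """Placeholder for the SAPI phoneme conversion: here we simply
    split the word into letters; SAPI would return real phonemes."""
    return list(word)

def train(custom_grammar, word, action):
    """Store the phoneme sequence for a word together with the
    action the user assigned to it."""
    key = tuple(to_phonemes(word))
    custom_grammar[key] = action

# Hypothetical regional-language command words and their actions.
grammar = {}
train(grammar, "thura", "open file")
train(grammar, "adakku", "close file")
print(len(grammar))  # 2
```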
Hereafter, whenever the microphone converts
a spoken word into an analog signal, Microsoft
SAPI converts it into a form the computer can
process. Simultaneously, the custom grammar is
loaded into SAPI. The digitized data is compared
with the stored data formats present in the
database, and the associated action is identified as
the ultimate result of the word used. This action is
performed in response to the specified word,
thereby making the system capable of recognizing
any language.
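The recognition stage then reduces to a lookup in the stored grammar, dispatching the associated action on a match. Again a sketch: the phoneme sequences and action names below are hypothetical, not SAPI output:

```python
# Hypothetical stored grammar: phoneme sequence -> action name.
custom_grammar = {("t", "h", "u", "r", "a"): "open_file",
                  ("a", "d", "a", "k", "k", "u"): "close_file"}

log = []
actions = {"open_file": lambda: log.append("file opened"),
           "close_file": lambda: log.append("file closed")}

def recognize(phonemes):
    """Compare the digitized phoneme sequence against the stored
    grammar and perform the associated action on a match."""
    action = custom_grammar.get(tuple(phonemes))
    if action:
        actions[action]()
        return action
    return None

print(recognize(["t", "h", "u", "r", "a"]))  # open_file
print(log)  # ['file opened']
```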
5. LIMITATIONS
Speech recognition applications are
different from other kinds of computer
applications. Speech recognition opens up a world
of possibilities for developers, including telephony
applications, but it also faces some
challenges. Rather than pressing buttons or
interacting with the computer screen, users must
speak to the computer. This means there will be a
level of uncertainty associated with their input, as
the software mostly returns probabilities rather than
certainties. The most obvious weakness is the
possibility of misrecognition. No matter how much
effort and care is put into the development of the
software, there is always room for the
misrecognition of user input.
6. CONCLUSION
While speech recognition software
has certainly come a long way in the last 50 years,
there are people who would say the technology is
still not quite ‘there’. The programs, no matter
how good they are, at times fail to produce the
matching phoneme. Conditions like having a
common cold, or working while a noise source is
nearby, can cause the software to produce errors.
Computers and software are still too vague and
easily confused compared with the human brain.
Concepts like the difference between “read” and
“red” are hard for a simple speech recognition
program to differentiate, since understanding
words within their grammatical context is a very high
brain function.
Author
Thejus Joby – S6 Computer Science and
Engineering student at St. Joseph’s College of
Engineering and Technology, Palai.