speech_recognition

Speech recognition of an unknown person

Mutawaqqil Billah
Independent Researcher,
B.Sc in Computer Science and Mathematics,
Ramapo College of New Jersey, USA
Address : 906/2, East Shewrapara, Mirpur,
Dhaka, Bangladesh
Phone: 8801912479175
Email : mutawaqqil02@yahoo.com

Speech recognition is a challenging task. It is not difficult if we can train a computer with one person's
voice data and then recognize that same voice. But that is not the normal situation. In a normal situation
we talk to many people whom we do not know, and we can still understand what they are saying. We need
to bring a computer or robot up to this level, so that it can understand anybody's voice: male or female,
young or old. For example, if we want to convert all the voice data of a mobile company to text so that it
takes less space to store, we will need to understand the voice of any random person. Or we may want to
use a voice-enabled computer in a public place, for example a subway or cybercafé, where there are
different users all the time and the program needs to understand all of them.

To recognize speech, we need to train a classifier that understands the speech. We need to find the
patterns in the speech data and train the classifier with them. For example, if we want to train the
computer with a single person's voice, we first need to collect his speech data. We need to know how
he pronounces different letters, his tone in different situations and other information. A person's voice
does not remain the same all the time. Many people's voices change when they are sick, when they wake
up early in the morning or when they are tired. We need to collect all these data and train the computer
with them so that it can understand his voice whenever he speaks. We need to break the sound file data
down into small patterns and train the classifier with these small patterns.
HMMs, neural networks, SVMs, k-nearest neighbors and other classifiers can work with small data
variations. That means they work well only if the training data are not vastly diversified. In speech
recognition, the data are much more diverse than in other scenarios. For example, if we want to
recognize a short phrase like "hello world", men and women, people of different age groups and people
from different locations will pronounce it differently because of their local accents and other
differences. Background noise is also a big factor. People can talk from inside a building or from
outside, where there may be many noises, and the background noise will be different in each situation.
So, if we train a single classifier to recognize this phrase with so many variations, the classifier will
be overloaded and may not work properly. We can use a smart trick to resolve this issue.
Training one classifier with people of all ages, genders, environments and locations is a difficult task,
and it might not work properly. We can resolve this issue by using binary classifiers before the main
classifier to remove these huge variations. For example, suppose we want to convert phone conversations
to text. Mobile companies collect a huge amount of phone voice data every day and it is very difficult to
store. They might have to delete some of the data after some months. In that case, they might want to
store the voice data as text files, which take little space on a hard disk, so that they can keep the data
for many years on ordinary hard disks of normal capacity.

To accomplish that, we need to use some binary classifiers before the main classifier. Our aim is to
reduce the load on the main classifier, so that it becomes easy for it to recognize people's voices, by
reducing the variation of the data significantly. Each binary classifier divides the variation roughly in
two, and eventually this leads us to a position where the job of the main classifier is very easy. For
example, the first question we might ask is: is it a human voice or machine noise? This is answered by a
binary classifier trained with human voice and machine noise. As its only function is to answer whether
it is a human voice or not, we can train it well with only two types of data. If we know that it is a
human voice, the next question is: is it the voice of a man or a woman? A binary classifier trained with
both men's and women's voices can recognize which type of voice it is. If the answer is male, we can ask
which age group the data belongs to: is it a young person's voice or an old one? If the answer is female,
we ask the same question in the next step. After that we can ask which location it comes from, the north
part of the country or the south. After passing through all these classifiers, we arrive at a very small
variation of data, which is left for the main classifier. For example, we will know that the data belongs
to a woman from the 0-20 age group from the north part of the country. We can much more easily train a
classifier with this small data variation to recognize the words and convert them to text.
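The routing described above can be sketched in a few lines of Python. This is only a minimal
illustration, not a working recognizer: the feature names (voicing, pitch_hz, age_score,
accent_score) and all thresholds are invented for the example, and each small function stands in
for a binary classifier that would really be trained on data.

```python
from typing import Callable, Dict, Optional, Tuple

# A sample is a dict of already-extracted features (hypothetical names).
Sample = Dict[str, float]
BinaryClassifier = Callable[[Sample], bool]

# Toy stand-ins for trained binary classifiers; thresholds are made up.
def is_human(s: Sample) -> bool:
    return s["voicing"] > 0.5

def is_male(s: Sample) -> bool:
    return s["pitch_hz"] < 165.0

def is_young(s: Sample) -> bool:
    return s["age_score"] < 0.5

def is_north(s: Sample) -> bool:
    return s["accent_score"] > 0.5

def route(s: Sample) -> Optional[Tuple[str, str, str]]:
    """Ask one yes/no question at a time; return the group, or None for noise."""
    if not is_human(s):
        return None
    gender = "male" if is_male(s) else "female"
    age = "young" if is_young(s) else "old"
    region = "north" if is_north(s) else "south"
    return (gender, age, region)

# One main classifier per group; strings stand in for trained models.
MAIN_CLASSIFIERS = {
    (g, a, r): f"main classifier for {g}, {a}, {r}"
    for g in ("male", "female")
    for a in ("young", "old")
    for r in ("north", "south")
}

def recognize(s: Sample) -> str:
    """Route the sample through the cascade and pick its main classifier."""
    group = route(s)
    if group is None:
        return "noise: discarded"
    return MAIN_CLASSIFIERS[group]
```

Three binary questions after the human/noise check give 2 x 2 x 2 = 8 groups, so each main
classifier only has to handle one eighth of the original variety of speakers.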

Binary classifiers are trained with two types of training data. For example, if we want to distinguish
between male and female images, we have to train the classifier with male and female images. We have
to find the common female feature points which are not present in male images and vice versa. We have to
find the gradient points where the difference is very high. Basically, we have to find the difference
between these two types. The classifier only needs to answer one question: is it male or female? Using a
proper training set, we can make it very strong so that the chance of making mistakes becomes very small.
This is the flow of data:
Speech data -> noise or human voice -> male or female -> young or old -> main classifier for that
category

[Figure: tree of binary classifiers. Speech data is split into human voice or noise. Human voice is
split into male or female, each of those into young or old, and each of those into north or south,
ending in main classifiers 1 to 8.]

1: training data will have data for male, young, north part of the country
2: male, young, south
3: male, old, north
4: male, old, south
5: female, young, north
6: female, young, south
7: female, old, north
8: female, old, south

This is just an example. We can introduce as many binary classifiers as we want based on the training
data we have and the task we want to accomplish. If we have large amount of training data, then we can
use more binary classifiers in front of main classifiers in proper way. We can even try different ordering
of binary classifiers to examine which order of binary classifiers produce better result than others. We
can use genetic algorithm or least square method to accomplish that.
This concept of binary classifier in front of main classifiers could be used for other search related area
also. For example, when we are searching for a specific data from huge database, like 100 million data.

Say we want to find a face in a database of 100 million people's faces. We can use many binary
classifiers before the main classifiers, so that the search becomes much easier and less calculation is
required.
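A rough sketch of why this saves calculation: if every record is filed under the answers to a few
binary questions, a query only has to be compared with the records in its own group. The field names
(gender, age, region) and the synthetic records below are assumptions made for the illustration; in
practice the answers would come from the binary classifiers.

```python
from collections import defaultdict

def build_index(records, key_fn):
    """Group records into buckets keyed by the binary-classifier answers."""
    index = defaultdict(list)
    for record in records:
        index[key_fn(record)].append(record)
    return index

def search(index, key, matches):
    """Scan only the bucket for `key`; return (hits, number of comparisons)."""
    bucket = index.get(key, [])
    hits = [record for record in bucket if matches(record)]
    return hits, len(bucket)

def group_of(record):
    """The three binary answers that decide which bucket a record goes to."""
    return (record["gender"], record["age"], record["region"])

# Synthetic database: 8000 records spread evenly over the 8 groups.
records = [
    {
        "id": i,
        "gender": "male" if i & 1 else "female",
        "age": "young" if i & 2 else "old",
        "region": "north" if i & 4 else "south",
    }
    for i in range(8000)
]

index = build_index(records, group_of)
target = records[4242]
hits, comparisons = search(index, group_of(target), lambda r: r["id"] == 4242)
```

Here the search touches 1000 records instead of 8000. With more binary questions the buckets become
smaller still, as long as the questions split the data fairly evenly.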

Every type will have its own classifier, trained with a very specific group of people with little data
variation. For example, young females from the north part of the country will have their own dataset and
a separate classifier. Similarly, elderly males from the south will have their own dataset and classifier.
We can use binary classifiers in this way not only for speech recognition, but also in other areas of
artificial intelligence. Whenever we see that our classifiers are loaded with too much data and are not
working properly, or that we need many calculations to get them working, binary classifiers are worth a
try. If we place these classifiers before the main classifier and separate the training data to be more
specific, our main task will be easier than before. Pattern recognition will be easier in this way. Using
the binary classifiers, we lead the data to the appropriate classifier with less variation.

In a binary classifier used in computer vision, we have two types of images, positive and negative. We
need to train with both the positive and the negative images. Binary classifiers could be used to
distinguish male and female, Chinese and American, black and white people, left and right hand, left and
right leg, all left and right body parts, male voice and female voice, and many other cases as well. Let
us say we are doing male vs. female. Collect around 1000 images of male faces and 1000 images of female
faces. Convert each image to 8 by 8 blocks. Take one male image, compare it with all female images and
save the locations where we see a large difference in values between the two images. Do this for all
male images and find the points which are common to all the difference lists, or at least close. Use the
close ones when we do not have enough common points. Give each point a range of values for male and for
female. Say, at point (112,204) a value of 0-100 means male and 200-255 means female. When given any
image to recognize, convert it to 8 by 8 blocks, find its values at the common points and decide whether
it is male, female or could not be distinguished.
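The procedure above could look roughly like the following sketch. It assumes images are small grids of
gray values stored as nested lists, uses the observed minimum and maximum at each point as the value
range, and decides by counting votes; the function names and the min_gap parameter are hypothetical
and not part of the original description.

```python
def discriminative_points(males, females, min_gap):
    """Points where every male/female pair differs by at least min_gap.

    Equivalent to building one difference list per male image and
    keeping only the points common to all lists.
    """
    rows, cols = len(males[0]), len(males[0][0])
    points = []
    for r in range(rows):
        for c in range(cols):
            if all(abs(m[r][c] - f[r][c]) >= min_gap
                   for m in males for f in females):
                points.append((r, c))
    return points

def value_ranges(images, points):
    """Observed (min, max) value at each selected point for one class."""
    return {
        (r, c): (min(img[r][c] for img in images),
                 max(img[r][c] for img in images))
        for (r, c) in points
    }

def classify(image, male_ranges, female_ranges):
    """Vote over the selected points; ties mean we cannot distinguish."""
    male_votes = female_votes = 0
    for (r, c), (lo, hi) in male_ranges.items():
        value = image[r][c]
        if lo <= value <= hi:
            male_votes += 1
        f_lo, f_hi = female_ranges[(r, c)]
        if f_lo <= value <= f_hi:
            female_votes += 1
    if male_votes > female_votes:
        return "male"
    if female_votes > male_votes:
        return "female"
    return "undecided"
```

A real system would probably widen the ranges a little beyond the observed minimum and maximum, since
new images rarely fall exactly inside the values seen during training.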

We can also associate some random weights with each data item, compute L (the value of the summation
function) and examine which set of weights gives a very close L for all training examples, then select
those weights. Use a radial basis function as the kernel if the data are not linearly separable. We can
also use this idea in the above case; just do not include the points where both classes have the same
value. The above method is strictly a binary classifier, but the latter can be used to recognize shapes,
as in text recognition. Create one class for each character (upper case letters, lower case letters,
digits). Train all the files of the same object into one class and get some weights, so that each class
has its own weight list. Given any new image, compare it to all classes and see which class it is
closest to.
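The text does not say exactly how the weights are computed, so the sketch below swaps in a simple
stand-in: the "weight list" of each class is taken to be the average of its training images, and a
new image is assigned to the class whose average is closest. This is a nearest-mean classifier, used
here only to illustrate the "one weight list per class, compare to all classes" idea; the 3 by 3
glyphs are invented toy data.

```python
def mean_vector(examples):
    """Average the flattened training images of one class."""
    n = len(examples)
    return [sum(column) / n for column in zip(*examples)]

def train(labelled_examples):
    """Build one weight list (here: a mean image) per class label."""
    return {label: mean_vector(examples)
            for label, examples in labelled_examples.items()}

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(templates, image):
    """Compare the image with every class and return the closest one."""
    return min(templates,
               key=lambda label: squared_distance(templates[label], image))

# Toy 3 by 3 glyphs, flattened row by row (1 = ink, 0 = background).
training = {
    "I": [[0, 1, 0, 0, 1, 0, 0, 1, 0]],
    "L": [[1, 0, 0, 1, 0, 0, 1, 1, 1]],
    "T": [[1, 1, 1, 0, 1, 0, 0, 1, 0]],
}
templates = train(training)

# An "L" with one pixel missing is still closest to the "L" class.
noisy_l = [1, 0, 0, 1, 0, 0, 1, 1, 0]
label = predict(templates, noisy_l)
```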

Another idea is to find a, b and c such that ax + by > c for positive examples and ax + by < c for
negative examples, to recognize male or female voice. We can use an SVM first to decide between human
voice and other sound. In the next step, if it is a human voice, we can classify it as male or female.
It is like recursive recognition. In my opinion, much of human pattern recognition is done this way:
from top to bottom, layered, done recursively. Unfold one mystery at a time. Binary classifiers are easy
to train and can be made accurate.
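One simple way to find such a, b and c is the perceptron rule, sketched below. It is used here instead
of an SVM only because it fits in a few lines; it finds some separating line when one exists, not the
maximum-margin line an SVM would give. The two features x and y and the sample points are made up for
the illustration.

```python
def fit_line(samples, epochs=100):
    """Find a, b, c with a*x + b*y > c for label +1 and < c for label -1.

    Perceptron rule: whenever a sample is on the wrong side (or exactly
    on the line), nudge the line towards classifying it correctly.
    """
    a = b = c = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x, y), label in samples:
            if label * (a * x + b * y - c) <= 0:
                a += label * x
                b += label * y
                c -= label
                mistakes += 1
        if mistakes == 0:  # every sample is on the correct side
            break
    return a, b, c

def is_positive(a, b, c, x, y):
    return a * x + b * y > c

# Made-up, linearly separable data: +1 = positive class, -1 = negative.
samples = [
    ((2, 3), 1), ((3, 3), 1), ((3, 4), 1),
    ((0, 0), -1), ((1, 0), -1), ((0, 1), -1),
]
a, b, c = fit_line(samples)
```

If the two classes cannot be separated by a straight line, this loop simply stops after the given
number of epochs without a perfect answer, which is the case where a kernel such as the radial basis
function mentioned earlier becomes useful.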

This method will work when we need to recognize an unknown person's voice. There are numerous areas of
artificial intelligence where it could be used. Naturally, we need a computer or robot to understand
what people are saying if we want to use it in public places. Placing binary classifiers in front of the
main classifiers will make pattern recognition much easier.
