speech_tree

Speech Tree

Mutawaqqil Billah
Independent Researcher,
B.Sc in Computer Science and Mathematics,
Ramapo College of New Jersey, USA
Address : 906/2, East Shewrapara, Mirpur,
Dhaka, Bangladesh
Phone: 8801912479175
Email : mutawaqqil02@yahoo.com

Parsing sound file data is a very difficult task. When several people speak at the same time, we first
need to isolate the data generated by one specific person’s voice and then pass it to a classifier to
understand what that person is saying. It is also difficult to find the starting and ending points of a
word, especially when background noise is present. Background noise is common and natural wherever
people speak and we want to recognize the spoken words. I am proposing a new type of tree structure,
different from conventional tree structures (1), with which we can recognize spoken words without
parsing the data, and which is intended to work even when several people are talking at the same
time. The tree finds a word wherever it occurs in the number string (the voice data). While using the
tree, we do not have to think about where a word starts or where it ends.
Breaking of sound data into blocks:
If we break a sound wave down into small packets of data, we get its value for a particular instant or
time interval. We want to break it into small portions, each giving the data for a very small time
interval. It is like examining every point of the sound wave plotted on graph paper with a small time
unit, for example milliseconds, on the x axis. If there is background noise and several people are
speaking, all of this arrives as a single wave whose value in each time interval is the accumulated
effect of all the sound waves. Each small portion therefore holds the combined effect of every voice
and noise present at that time. For example, suppose that in a very small time interval we get 2 for one
person’s voice, 3 for another person’s voice and 4 for noise. We will then get 9, the sum of all three,
for that small time interval.
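
A minimal sketch of this addition, assuming each sound source is already available as a list of
per-interval values. The names mix, person_a, person_b and noise are hypothetical and used only for
illustration; only the first interval (2, 3 and 4) comes from the example above.

```python
def mix(*sources):
    """Add equal-length sample sequences position by position."""
    return [sum(values) for values in zip(*sources)]


# Hypothetical per-interval values; the first interval is the 2, 3 and 4 from the text.
person_a = [2, 1, 0]
person_b = [3, 0, 1]
noise = [4, 1, 1]

mixed = mix(person_a, person_b, noise)
print(mixed)  # [9, 2, 2]
```

The recognition device only ever sees the mixed list, never the three separate sources.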

Sound file data is a string of numbers, and we can pick out each number. Say
1234511111111123561111112547 is the sound file data; then we will examine each number in the string.
If the sound file has too much data for small words, we can take bundles of consecutive numbers
instead of feeding each number into the tree. For example, suppose the word “hello” has 1000 numbers
as data. Then, instead of examining each number, we can combine consecutive numbers into bundles.
For example, we can reduce every 50 consecutive numbers to one number. We can take their mean and
standard deviation and add the two to get a single number that represents the bundle. We can also try
other mathematical formulas for that.
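
The bundling step can be sketched as follows. This is only an illustration under stated assumptions:
the function name bundle is hypothetical, the population standard deviation is used, and the result is
rounded to a whole number so that it can serve as a branch number. The text fixes none of these
choices.

```python
import statistics


def bundle(samples, width=50):
    """Collapse each run of `width` consecutive samples into one number.

    Each bundle is represented by mean + standard deviation, rounded to an
    integer (the rounding is an assumption made for this sketch).
    """
    bundled = []
    for start in range(0, len(samples), width):
        block = samples[start:start + width]
        mean = statistics.fmean(block)
        spread = statistics.pstdev(block)  # population standard deviation
        bundled.append(round(mean + spread))
    return bundled


# A word with 1000 samples becomes 20 numbers when the bundle width is 50.
word_samples = [4] * 1000
print(len(bundle(word_samples, width=50)))  # 20
```

Doubling the width to 100 would halve the number of values, and therefore the number of tree levels
the word occupies.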
Because the numbers are written one after another, we need a way to distinguish single-digit numbers
from double-digit numbers. We can insert a 0 in front of values less than 10, or instead put a 0 in front
of values greater than 9. Either way will do. If values less than 10 occur more frequently, then mark
the values greater than 9 instead.

Structure:
Say that, at each level of the tree, we are working with bundle values up to 20. The right limit has to
be found by examining the sound file data; for this example let us stay with 20. The tree will then have
20 branches at every level. A word whose string has a specific number at a level goes into the branch
with that number. Say the word “hello” has 1 as its first number, so it goes into the first branch of level
one. Say “world” has 3 as its first number, so it goes into the 3rd branch of level one. Their added value
is 4, so they are also considered as a combo word when the string value is 4. We add the values
because the data comes through a sound wave, and we get the combined effect of both waves when
two people talk at the same time. Whatever sounds are present, they all arrive as a single sound wave
for a specific time interval. So each word and its combinations with other words are both considered in
the tree. If we break the sound file data down into small time intervals, such as milliseconds, we get
the combined effect of all available sound for each interval. By looking at one small interval alone we
cannot understand the spoken words or recognize the sound. We have to consider data over a
significant span of time, but we examine the small intervals to draw a conclusion about the longer
span. To draw a conclusion about 1 second of sound file data, we break it down into milliseconds or
even smaller intervals and examine all of them to find the words spoken in that time. In my opinion,
our brain also uses the same procedure to recognize sound data.

It is not like a normal tree, where an upper-level node has child nodes. In this tree, every level has a
fixed number of nodes, which is 20 in this example, and no node at any level is a child of any other
node. Upper and lower nodes are connected only when that is necessary to insert a word; otherwise no
connection exists. A node is not necessarily connected to every node in the level above or to every
node in the level below. For example, if inserting all the words in the training set never requires a
connection from the 3rd node of the 2nd level to the 5th node of the 3rd level, then the connection list
of the 3rd node of the 2nd level will not contain the 5th node of the 3rd level. In the same way, the
first node in the first level is connected to the 2nd node in the 2nd level only if that is necessary to
insert a word. Connections are created only while inserting a word.
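
One possible way to hold such a structure in memory is sketched below. It is a minimal illustration,
not a definitive implementation: the class name SpeechTree, its methods and the dictionary of
connection lists are all assumptions made for this example.

```python
class SpeechTree:
    """Fixed-width levels; links between adjacent levels exist only once created."""

    def __init__(self, levels, branches=20):
        self.levels = levels
        self.branches = branches
        # links[(level, node)] is the connection list of that node:
        # the set of nodes in the next level it has been connected to.
        self.links = {}

    def connect(self, level, node, next_node):
        """Connect `node` of `level` to `next_node` of the level below it."""
        if not 1 <= level < self.levels:
            raise ValueError("level must have another level below it")
        if not (1 <= node <= self.branches and 1 <= next_node <= self.branches):
            raise ValueError("node numbers must lie between 1 and branches")
        self.links.setdefault((level, node), set()).add(next_node)

    def is_connected(self, level, node, next_node):
        return next_node in self.links.get((level, node), set())


tree = SpeechTree(levels=5)
tree.connect(1, 1, 2)  # first node of level 1 to the 2nd node of level 2
print(tree.is_connected(1, 1, 2))  # True
print(tree.is_connected(2, 3, 5))  # False: no inserted word needed this link
```

Storing only the links that were actually created keeps the memory proportional to the training words
rather than to 20 x 20 possible links per pair of levels.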
Say that for the word “hello” we get 12345 as the number string and for “world” we get 14567. Then
their added values will be 2, 6, 8, 10 and 12, written together as 2681012. This is what happens to the
sound wave if both words are spoken at the same time. The effects of the words acting in the same
time interval are added together and received by the sound recognition device.
In the sound number string, a word can start anywhere; it could be at the beginning or at any random
position. Since we break the sound file data into parts, a word could be at any position within a part.
That is why we have to examine a long string of numbers in a tree with a good number of levels. We
can always limit the number of levels in the tree by increasing the bundle (block) width. If we make
the bundle width 100 instead of 50, we will have fewer levels in the tree. We have to make this
decision after examining the sound file data. We can also try different bundle widths and different
numbers of levels in the tree and choose the best performer. We will use a smaller width where we
need a more precise result or need to remove conflicts in the result.

Insert:
For each word in the training set, we have to get the number string for that word. We should use the
same time interval as we will use for real-time data. We also have to keep a couple of trees with
different block sizes or time intervals, which will help in situations where we find multiple word
matches for a string. For example, we could have block sizes of 50, 100 and 150. Say we have found
multiple matches using blocks of size 150; then we can try blocks of size 100 to be more specific. If
that also does not give an accurate result, we can try blocks of size 50.
After adding all the words starting from the first level, note down the maximum length of any word.
We will use that while searching for words at different levels. Say a word’s string is 12345. In the first
level it will take the first branch, in the 2nd level the 2nd branch, in the 3rd level the 3rd branch, in
the 4th level the 4th branch and in the 5th level the 5th branch. We also store the information that
this pattern belongs to this word only. For a given word, we can record the same word in different
tones, voices and pitches and store those in the training set. If we have too many numbers for each
word, we will have many levels in the tree. To prevent that, we can bundle consecutive numbers into
one number using their mean and standard deviation.

Every level has branches numbered from 1 to 20. So, if 12345 is a word string, it is first added to the
tree starting from the first level. Then it is added again starting from the 2nd level. The 2nd level also
has a branch for 1, so this time the word is added with its first number in the 2nd level’s first branch.
Similarly, it is added from every level up to the last one from which the whole word still fits. Because
its length is 5, this word cannot fit in the tree if it starts in any of the last 4 levels, so it will not be
added starting from those levels.
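
The repeated insertion can be sketched as below, under stated assumptions: words are lists of branch
numbers, links maps (level, node) to its connection list, and ends maps (level, node) to the words that
end there. The function name insert_everywhere and both dictionaries are hypothetical names chosen
for this sketch.

```python
from collections import defaultdict


def insert_everywhere(links, ends, word, name, levels):
    """Insert `word` once for every starting level from which it still fits.

    Returns the number of times the word was inserted.
    """
    last_start = levels - len(word) + 1
    for start in range(1, last_start + 1):
        # Create only the connections this word needs between adjacent levels.
        for offset in range(len(word) - 1):
            links[(start + offset, word[offset])].add(word[offset + 1])
        # Remember that the word ends at this node.
        ends[(start + len(word) - 1, word[-1])].add(name)
    return max(last_start, 0)


links = defaultdict(set)
ends = defaultdict(set)
count = insert_everywhere(links, ends, [1, 2, 3, 4, 5], "hello", levels=8)
print(count)  # 4: the word starts in levels 1 to 4, never in the last 4 levels
```

With 8 levels and a word of length 5, the word ends at the 5th node of levels 5, 6, 7 and 8.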

Training:
First we have to insert all the words in the tree; after that we can do the calculations needed to find
words at different positions of the number string, including combined words. Then, for each node,
calculate which words could possibly end there, including combination words. Create the list of all
possible combo words, check whether each one is in the tree, and if it is a valid word, add that
information to the node.
To prepare the combo word list, we can use different methods. One is to try every possible word
string. For example, if we have 12345 as the string, then 11111, 11245 and 11211 could be combination
candidates for a word, and we have to try all other possible combinations as well. Another way is to
require that the added values of the combined words stay within the string values. For example, 11111
and 11245 could not be accepted together because both words have 1 in the first position, so as a
combination they would give 2 in the first position, while the string has only 1 there. We can apply
restrictions of this type once we find that wrong words are being recognized. But, in my opinion, it will
be difficult for a wrong combination to produce the same number string pattern, as the strings have
many different numbers in them.
To do this, we have to add all the words to the tree first. Then we can do the combo word calculation.
Say the string is 1234567 and we are at the 7th node of the 7th level. If any word has the string
1234567, store that information. The possible combo words include 1111111, 1124567, 1114567,
1112567, 1113567 and so on. To reduce work while creating this list, we can check whether each
substring is already in the tree. When we have 111 as a substring, check whether it is in the tree; if it is
found, go on to create the rest of the string, otherwise discard it and all the remaining combinations
starting with 111. To check whether 11 is in the tree, we can check the connection list of the first node
in the first level. If the first node of the first level has a connection to the first node in the second
level, then 11 is available in the tree. If the input string is 5555, we try values up to 5 in each position,
as long as the resulting string is available in the tree: 1111, 1211, 1221, 1222 and so on, basically all
the combinations that stay at or below 5 in each position. We do not try any string as a combo word if
it is not available in the tree. As we have already added the words to the tree, a string that is not
available in the tree is not a valid word.
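
The pruned enumeration described above can be sketched as follows. To stay short, this sketch makes
two simplifying assumptions that are not part of the method: every pattern starts at the first level,
and the tree is stood in for by a plain set of all word prefixes. The names build_prefixes and
candidates are hypothetical.

```python
def build_prefixes(words):
    """Every prefix of every training word, standing in for the tree's links."""
    return {word[:n] for word in words for n in range(1, len(word) + 1)}


def candidates(observed, prefixes):
    """List the strings that stay at or below `observed` in each position.

    A branch is abandoned as soon as its prefix is not available, so no
    string starting with an unavailable prefix is ever created.
    """
    results = []

    def extend(prefix):
        if len(prefix) == len(observed):
            results.append(prefix)
            return
        limit = observed[len(prefix)]
        for value in range(1, limit + 1):
            longer = prefix + (value,)
            if longer in prefixes:  # otherwise discard this whole branch
                extend(longer)

    extend(())
    return results


words = [(1, 2, 3, 4), (1, 1, 1, 1), (5, 5, 5, 5), (1, 2, 6, 1)]
found = candidates((5, 5, 5, 5), build_prefixes(words))
print(found)  # [(1, 1, 1, 1), (1, 2, 3, 4), (5, 5, 5, 5)]
```

The word (1, 2, 6, 1) is never produced because 6 exceeds the observed value 5 in the third position.
Whether a returned string is a complete word still has to be checked against the node's word list.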

Any word can start at any level, so add each word starting at every level. Any combination word (the
combined effect of two words) can also start at any level. Say 12345 is the string for a word. Add it
starting from the first level, add it again starting from the 2nd level, and keep doing this until the
string no longer fits in the tree. If, say, we only reach 4 in the last level and 5 is left out, the word is
not added there. While checking for combo words, also check the ones starting at lower levels, not
only those starting from the first level. When we want to recognize a word, it will be easy because the
list is ready for each node: which words end there and which combo words end there. We have to
check which words end at a specific node and which of them can be accepted for the given string. As
long as we get a single word match after examining all possible ways, including the substrings of the
given string, we do not have to do any further processing for that string.

It will be a huge amount of work, but we can reduce it by ignoring strings which are not in the tree.
When we have a substring, check it in the tree first; if it is found, continue, otherwise discard all the
possible strings that use that substring. For example, when we have 1, check it in the tree, and if it is
available, continue. Next is 11; if it is available, next is 111; if that is available, continue, otherwise
discard all strings starting with 111. Then try 112; if it is available, continue, otherwise discard all
strings starting with 112.

Store the maximum string length of any word. Say it is 10. Then we do not need to check any string
longer than that. Say we are in the 15th level. We will check strings starting from the 6th level onward
for a word match.
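
A minimal sketch of this window, using letters as stand-ins for numbers. The function name
strings_ending_at is hypothetical; levels are counted from 1 as in the text.

```python
def strings_ending_at(data, level, max_len):
    """Return the strings that end at `level` and are at most `max_len` long."""
    start = max(0, level - max_len)  # index of the first level worth checking
    window = data[start:level]
    return [window[i:] for i in range(len(window))]


to_check = strings_ending_at("abcdefghijklmno", level=15, max_len=10)
print(to_check[0])    # fghijklmno, which starts at the 6th level
print(to_check[-1])   # o
print(len(to_check))  # 10
```

Anything starting before the 6th level would be longer than the longest known word, so it is skipped.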

At any level, we have to look for words starting from the first level, unless that exceeds the word
length limit. Any combination within the given string is considered for a word match. Say the string is
1234567. Then 234567, 34567, 4567, 567 and 67 are strings of different lengths to check. We have 2 in
the second level. It could be the effect of two words which each have 1 at this level, or it could be 1
for a word and 1 for some noise. For any number greater than 1 at any level, we have to try all the
possible combinations. For example, if we have 5 at a specific level, we have to try 1, 2, 3, 4 and 5 and
see whether we get a word match. Any such combination is also considered as a possible combo word
match. For example, if the string is 12345, then 11111, 11245, 12145 and 12245 are a small portion of
the possible match list. Basically, any combination that stays within the string values is considered for
a word match. If the number is 5, then 1, 2, 3, 4 and 5 are all candidates, because we got 5 from the
sound wave: it could be the effect of 1 and 4, or 2 and 3, or 4 and 1, or it could be 1 for the word and 4
for noise or other sound. Basically, if we get 5, then any word with a value less than or equal to 5 at
that position is a candidate.

For any node, try all the possible ways to reach that node using the given string, and from those find
out which ones are words (all words are already inserted in the tree). For example, if we are at the 4th
node of level 5, try every possible way to reach this node; 12344, 12354 and 52314 are some of the
possible choices. While creating a string to match, also check whether it is available as a word in the
tree. This way we can avoid unnecessary work. For example, while creating 12344, after adding 1,
check whether that is available as a word or as the start of a word. Then do the same for 12. If there is
no connection from 1 to 2, we can discard this string and all strings starting with 12. Then, when we
search the string for words and reach a node, we will have the list of all possible words which end at
that particular node. For example, when we come to the 4th node of the 5th level, we will have a list
of words which can possibly end here. We then have to find which of them are available in the given
string. For example, if 12344 is in the list and the given string is 55555, then 12344 is a candidate. But if
the given string is 11555, then 12344 is not a candidate, because its second number, 2, is greater than
the 1 in the string. We also have to keep track of the remaining values when we want a more precise
result or when we are getting wrong matches. The remaining values should also match a word or a
known sound. We can discard some as real noise. We might find the word for the remainder at the
same node or a couple of nodes below, as we do not know when that word finishes, and we have to
keep that in mind. Once all or most of the values match some word or noise, then it is a match. This is
only needed for more precise recognition or for a place where many people are talking or much noise
is present.
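
The candidate test and the remaining values can be sketched as follows. The function names fits and
remainder are hypothetical, and strings are given as lists of numbers of equal length.

```python
def fits(word, observed):
    """A word is a candidate if it never exceeds the observed value in any position."""
    return len(word) == len(observed) and all(w <= o for w, o in zip(word, observed))


def remainder(word, observed):
    """What is left in each position once the word's values are taken out."""
    return [o - w for w, o in zip(word, observed)]


word = [1, 2, 3, 4, 4]
print(fits(word, [5, 5, 5, 5, 5]))       # True
print(fits(word, [1, 1, 5, 5, 5]))       # False: 2 is greater than the observed 1
print(remainder(word, [5, 5, 5, 5, 5]))  # [4, 3, 2, 1, 1]
```

For a more precise result, the remainder [4, 3, 2, 1, 1] would itself have to match another word, a
known sound, or be discarded as noise.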

We can record for how many words one node is connected to another node. That will help to find
common or popular words quickly. For example, we record how many times the 3rd node of the first
level is connected with the 4th node of the 2nd level. We keep a counter for this, and every time we
make the connection while inserting training words, we increase its value. When searching for words
in the number string, we choose the most popular connection first and try to match a word. If none is
found, we choose the less popular connections in turn. For example, suppose that while searching for
words we are at the 3rd node of the 2nd level, and according to the number string we have the option
to go to the 2nd, 3rd or 4th node of the 3rd level. While inserting training words from the 3rd node of
the 2nd level, we made 10 connections to the 2nd, 12 connections to the 3rd and 15 connections to the
4th node of the 3rd level. In that case, we first try the 4th node of the 3rd level, then the 3rd and
finally the 2nd node. This reduces our search time, because we try the popular nodes or patterns first.
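
A minimal sketch of the connection counters, assuming they are kept in a Counter keyed by
(level, node, next node). The function names record_connection and by_popularity are hypothetical.

```python
from collections import Counter


def record_connection(counts, level, node, next_node):
    """Called each time a training word uses this connection."""
    counts[(level, node, next_node)] += 1


def by_popularity(counts, level, node, options):
    """Order the allowed next nodes from most to least used connection."""
    return sorted(options, key=lambda nxt: counts[(level, node, nxt)], reverse=True)


counts = Counter()
# From the 3rd node of the 2nd level: 10, 12 and 15 connections, as in the example.
for next_node, times in ((2, 10), (3, 12), (4, 15)):
    for _ in range(times):
        record_connection(counts, 2, 3, next_node)

order = by_popularity(counts, 2, 3, [2, 3, 4])
print(order)  # [4, 3, 2]
```

Connections that were never recorded count as zero and are therefore tried last.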

Search:
When checking whether a word is in the tree, start from the first level, because all words are added to
the tree starting from the first level. As we go to the next levels, we keep track of the maximum word
length. For example, say the maximum word length is 10, we are at the 15th level and the string is
abcdefghijklmno. We take the string starting from the 6th level, which is fghijklmno. Then we try
ghijklmno, hijklmno, ijklmno, jklmno, klmno, lmno, mno, no and o. For each number which is greater
than 1, we also try all the lower numbers. For example, if the number is 5 at a specific level, we try
1, 2, 3, 4 and 5 as the number for that level. For example, if the string is 333, we try 111, 112, 113,
121, 122, 123, 131, 132, 133, 211, 212, 213, 221, 222, 223, 231, 232, 233, 311, 312, 313, 321, 322, 323,
331, 332 and 333, and also the substrings with fewer numbers.
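
To see how quickly the list of lower-number patterns grows, the unrestricted enumeration can be
sketched with the standard library. The function name lower_patterns is hypothetical, and this sketch
deliberately does not consult the tree.

```python
from itertools import product
from math import prod


def lower_patterns(observed):
    """Every pattern that uses, in each position, a value from 1 up to the observed one."""
    return list(product(*(range(1, value + 1) for value in observed)))


patterns = lower_patterns((3, 3, 3))
print(len(patterns))    # 27
print(patterns[0])      # (1, 1, 1)
print(patterns[-1])     # (3, 3, 3)
print(prod((5, 5, 5, 5)))  # 625 patterns for the string 5555
```

The count is the product of the observed values, so it grows very fast with the string length and with
larger values.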
Every time we create a pattern to match, we check whether it is available in the tree; if it is not
available, we stop that pattern and all the possible patterns that start with it. Say we have found that
11 is not available; then we do not create that pattern and we discard all the patterns which start with
11. Each time we add a number to the pattern, we check it first and continue only if it is available. We
do not try any string which is not available in the tree. We try the lower numbers because, if we have
3 in the string for a specific instant of sound wave data, it could have been generated by the combined
effect of two or more sounds. For example, if we get 5 as the value for one instant, then 2 might be the
effect of one person’s voice, 2 of another person’s and 1 of some other noise, which together give 5.
At any level, the number itself is considered directly. For example, if we have 5 in the first level, any
word starting with 5 is considered. Any two or more words whose values add up to 5 are also
considered. To keep it simple, let us stay with combinations of two words only. Any two words whose
first numbers add up to 5 are considered, for example two words whose first numbers are 1 and 4, or
2 and 3.
Any word can start at any level. For example, “hello” has the string 12345 and “world” has the string
23144, but the 2nd word starts from the 3rd level. When two people are talking, they do not
necessarily pronounce each word at the same time. The moment one person starts a word has no
relation to the moment the other person does, so the alignment is random. In the third level, “hello” is
in the 3rd branch as a single word and also in the 5th branch as a combo word with “world”, because
3 + 2 = 5.

When a word is found, its combo word continues to be matched, and the single values of the
remaining word are considered for the remaining levels. For example, “hello” will be found at the 5th
level. But by the sixth level we have only considered 2314 of “world”. So any word can start at any
level and stop at any level.

We have to keep processing all the time. We can reduce the work by ignoring simple data, for example
111111, which is data with little variation and represents a simple sound rather than conversation.
Noise will be a problem, and we have to reduce it as much as possible. If no match is found after a
certain number of levels, discard some numbers (at least the first number) and start the search again.

When passing each level, all the words that start with that level’s number are considered, and so are
all the words whose combination adds up to that number. Say for the 5th level we have 8. We already
have the words that started in the first level or any upper level, along with the combo words that
started in any upper level. In addition to those, all the words that start with 8 are now added to the
consideration list, as are all the words whose first numbers add up to 8. For words that started in
upper levels, any number up to 8 is considered a match, as we got 8 in the sound wave for this
position. It could be 1 for the word and 7 for noise, or 2 for one person’s word, 3 for another person’s
word and 3 for noise. Basically, all the numbers up to 8 are candidates for the word string at this
position. This has an advantage and a disadvantage. The advantage is that we do not have to worry
about noise or the effect of other words. The disadvantage is that we might find a wrong word. To
prevent that, we can restrict the matching: we can allow only valid combinations of two words, or
leave some room for noise. Say we search for single words or word combinations up to 6 and leave 2
for noise.

At each level, we have to check whether a match for any word has been found. Say we are in the 5th
level and have the data 12345. Then we have to see whether any word’s string is 12345. If we find one,
we have found a match. If a word is a substring of another word, we have a problem. We can solve
that by keeping the string as a candidate for further matching; for example, another word may have
the string 1234567. This is up to the implementation of the tree: if we see that such longer words are
possible, we keep the string, otherwise, once a word is found, that string is removed from the
consideration list. Basically, we have to prepare a possible word list for each node, but accept only the
words which are in the given string.

The string 12345 is considered from the first level and again from the 2nd level; in fact, from any level.
Basically, whenever we have 12345 in the data, this word is considered. At any level, if we have 1,
this word is added to the candidate list. If the string is 123451122331234512121212345, then this word
will be found three times.
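
The three occurrences can be checked with a short sketch. It assumes exact matches only (no combo
words or noise) and single-digit values, so that the data can be handled as a plain text string; the
function name find_all is hypothetical.

```python
def find_all(data, word):
    """Return the 1-based levels at which `word` starts inside `data`."""
    positions = []
    start = data.find(word)
    while start != -1:
        positions.append(start + 1)          # levels are counted from 1
        start = data.find(word, start + 1)   # also allows overlapping matches
    return positions


starts = find_all("123451122331234512121212345", "12345")
print(starts)       # [1, 12, 23]
print(len(starts))  # 3
```

The word is found starting at levels 1, 12 and 23, which is why it has to be available from every level
of the tree.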
Each word is added multiple times: it is added to the tree starting from each level. Say the word 12345
is added starting from the 1st level, then again starting from the 2nd level, and so on. Say we have 100
levels. The word will be added 96 times. It will not be added starting from the 97th level, as it could
not finish by the 100th level. Basically, all the words start in the first level and finish according to their
length. They are added again starting from the 2nd level and finish according to their length. Each
word is added up to the last starting level from which its last number still falls at or before the last
level. Each node has a list of the words which end there.

Many words may end at a specific node, but we only have to consider the ones within the given string.
For example, say we are at node number 7 of the 7th level. Many words will end here, for example
1234567, 2134567 and many others. If our string is 123456789, we will find 1234567 as a word and will
not include 2134567, because it is not in the given string even though it finishes at this node. We can
keep a sorted list, or even a tree, if many words finish at a node. So a node could contain a small tree
of its own. This is not necessary unless many words end at the node.

If we have too many words in the training set, we can divide it smartly into a couple of parts and
create a separate tree for each one. If we have trouble loading all the trees on the same machine, we
can use a distributed system of many machines and load one tree, with one portion of the training
data, on each.

If we are using a single tree and abcdefghijklmnopqrst is the number string, this is how it enters the
tree. Say the maximum word length is 3 and we have 10 levels in the tree. First a enters the first level,
then b the 2nd level, then c the 3rd, and so on. When abcdefghij has entered the tree, j is in the last
level. We then keep the last two numbers, ij, and take 8 new numbers from the string, which gives
ijklmnopqr. We have to keep the last 2 numbers because we might have cut in the middle of a word. If
the last three numbers were a word, it would have been found in the previous search. But if the last 2
numbers and the next number form a word, we want to find that too. That is why we carry over the
last 2 numbers.
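
This carrying over of numbers can be sketched as follows, with letters standing in for numbers. The
function name overlapping_chunks is hypothetical; the overlap is taken to be the maximum word length
minus one, which is how the 2 carried-over numbers arise when the maximum word length is 3.

```python
def overlapping_chunks(data, levels, max_word_len):
    """Split `data` into tree-sized portions that share max_word_len - 1 numbers."""
    overlap = max_word_len - 1
    step = levels - overlap
    if step <= 0:
        raise ValueError("levels must be greater than max_word_len - 1")
    chunks = []
    start = 0
    while True:
        chunks.append(data[start:start + levels])
        if start + levels >= len(data):
            break
        start += step
    return chunks


portions = overlapping_chunks("abcdefghijklmnopqrst", levels=10, max_word_len=3)
print(portions)  # ['abcdefghij', 'ijklmnopqr', 'qrst']
```

Because neighbouring portions share 2 numbers, a word of length 3 that straddles a cut is still
contained whole in one of the portions.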

So, if we add all the words at every level, it is a big advantage, as many of the calculations are already
done at training time. While searching for a word, we still have to do some calculation with different
combinations of numbers, because we have to search for the word string in the given string.
Sometimes we get the word string directly from the given string. Sometimes we get it by using the
combo word technique, which will happen in the real world, as many sounds come to our ears at the
same time and our brain searches for words among them.

We can give one portion of the sound data to one tree and the next portion to another tree, and
similarly use as many trees as we can or need. We have to break the data smartly because a word can
start anywhere. If we do not break it smartly, we might lose some words. For example, say the
maximum word length is 3 and for each portion we take 10 numbers. The first 10 go to tree number 1,
and then we take 2 numbers from the previous portion and 8 new numbers. Say
abcdefghijklmnopqrstuvwxyz is the number string. We take the first 10 numbers, abcdefghij, for tree 1,
ijklmnopqr for the 2nd tree, qrstuvwxyz for the 3rd tree and so on. If we implement hard rules for
combo words, then we have to carry over as many numbers from the previous portion as the rule
requires.

Multiple match problem:
If we have multiple matches for a word, we can reduce the block (bundle) length and create another
tree. We can create trees with different block lengths, for example 50, 25 or 10. If we find multiple
words for the same number string and have trouble choosing one, we can try the more precise tree.
For example, say we are checking the number string with block length 50 and multiple matches are
found. Then we can use the tree built with block length 25 to recheck the data and try to find a more
accurate result.
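
The fallback from coarse to fine trees can be sketched as below. Everything here is a stand-in for
illustration: recognise, fake_match and the tree labels are hypothetical, and the matcher is passed in
as a function because the real matching is described elsewhere in the text. How to treat a tree that
returns no match at all is not specified, so this sketch simply moves on to the next finer tree.

```python
def recognise(samples, trees, match):
    """Try trees from the largest block length to the smallest.

    `trees` maps block length to a tree; `match(tree, samples)` returns the
    set of matching words. Stops at the first tree giving exactly one match.
    """
    for block_length in sorted(trees, reverse=True):
        found = match(trees[block_length], samples)
        if len(found) == 1:
            return next(iter(found)), block_length
    return None, None


# Stand-in trees and matcher, only to show the control flow.
trees = {50: "tree-50", 25: "tree-25", 10: "tree-10"}
fake_results = {"tree-50": {"hello", "help"}, "tree-25": {"hello"}, "tree-10": {"hello"}}


def fake_match(tree, samples):
    return fake_results[tree]


word, block_length = recognise([], trees, fake_match)
print(word, block_length)  # hello 25
```

The finer trees are consulted only when the coarser one is ambiguous, so most strings are resolved
with the cheapest tree.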
We can also use tone- or voice-specific data to find out which match is more likely. Say we are in the
north of the country, two people are having a conversation, and we find a word match whose voice
data comes from people of the south; its probability of being accepted as the word will be lower,
especially when people from the north are having the conversation. So, in the middle of a
conversation, the voices of other people get a lower probability. This is only an example; we can take
similar kinds of issues into consideration.

When a word is found using inner values of the number string, the remainder should also match some
other word or some other known sound. We can keep a collection of common sounds found in
everyday life for this. This is only necessary when we have found multiple words using inner values of
the number string. We then choose the word which gives the more accurate or appropriate result.
This reduces the chance of choosing a wrong word when multiple matches are found for a string. The
remainder could be part of a word or of a sound. Basically, we have to choose the match which
produces the better result or has the better chance of producing it.

Some techniques to reduce search time:
Use small trees:
We can use small trees to do the task quickly. For example, we can put all the words that start with 1
in a separate tree whose first level has a single node with value 1 and whose remaining levels each
have 20 branches. Similarly, we use another tree for words starting with 2, and so on. We can also
keep separate trees for different word lengths: for example, one tree for words starting with 1 and of
length 3, and another for words starting with 1 and of length 4. Similarly, we keep separate trees for
the other combinations of starting number and length. We can also use these trees along with the big
tree. Whenever we have a string with a specific length and starting number, we use the appropriate
tree to process that string.

Simple version: for a simple version, we can add each word only once, starting from the first level. In
that case, while searching we give the tree a string and, when we are done with it, remove some
numbers from the front and give the new string to the tree again.

Summary of tree structure:
All levels have a fixed number of branches.
Every word is added multiple times to the tree. Each word is added starting at each level, as long as
the whole word fits in the tree.
For each node or branch, we have to prepare a list of words which could end there, or be a possible
match, as a single word or a combination word. We have to prepare that at training time. Our brain
probably does the same thing while we sleep.
If multiple matches are found for a particular string, we use a more precise tree with fewer numbers in
a bundle or packet.
We can follow popular links or connections first while searching for a word to reduce time.
A connection from an upper-level node or branch to a lower-level node is created only if it is required
by a word being inserted in the tree.
Try to recognize everything in the string, including noise. That will help to recognize words accurately.
So train with common noises as well.
Sound file data is broken down into small packets of consecutive data, and we create one single value
for each packet using a mathematical formula.

While searching for a word, when we reach a particular node, we have to use this node’s previously
prepared word list and find whether any word extracted from the number string matches one of those.

References:
1. http://en.wikipedia.org/wiki/Tree_structure
