- The document proposes a new type of tree structure to recognize spoken words from sound files containing multiple speakers, without needing to parse the sound data.
- The tree would insert sound file data as numbers and examine each number/bundle of numbers to find words. Each level of the tree would have up to 20 branches corresponding to numbers 0-19.
- Words from a training set would be inserted into the tree by their number sequences. The tree could then be used to find words within long number strings, even if they start in any position.
Parsing sound file data is a very difficult task. When multiple persons speak at the same time, we
first need to isolate the data generated by one specific person’s voice and then pass it to a classifier
to understand what that person is saying. That is a very difficult task. It is also difficult to find the
starting and ending points of a word, especially when background noise is present. Background noise
is common and natural in the ordinary situations where people speak and we want to recognize the
spoken words. I am proposing a new type of tree structure, different from conventional tree
structures (1), with which we can understand spoken words without parsing the data and which will
work when multiple persons are talking at the same time. The tree will find a word wherever it is in
the number string (voice data). While using the tree, we do not have to think about where a word
starts and where it ends.
Breaking of sound data into blocks:
If we break a sound wave down into small packets of data, each packet gives the wave’s value for a
particular instant or time interval. We want to break it into small portions where each portion gives
the data for a very small time interval. It is like examining every point of the sound wave plotted on
graph paper with a small time interval on the x axis, for example with the millisecond as the x-axis
unit. If we have some background noise and multiple persons speaking, all of this arrives as a single
wave, which is the accumulated effect of all the sound waves in each time interval. Each small
interval’s data therefore carries the combined effect of every voice and noise present at that time.
For example, in a very small time interval we get 2 for one person’s voice, 3 for another person’s
voice and 4 for noise. So we will get 9, the sum of all three, for that small time interval.
Sound file data is a string of numbers, and we can examine each number. Say
1234511111111123561111112547 is the sound file data; then we will examine each number in the
string. If the sound file has too much data for small words, then we can take bundles of consecutive
numbers instead of inserting each number into the tree. For example, say the word “hello” has 1000
numbers as data. Then, instead of examining each number, we can combine consecutive numbers into
bundles. For example, we can bundle every 50 consecutive numbers into one number. We can take
their mean and standard deviation and add these two to get a single number to represent the bundle.
We can also try other mathematical formulas for that.
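The bundling step above can be sketched as follows. This is only an illustration under assumptions the text does not fix: the population standard deviation is used, the result is rounded to the nearest integer, and the function name `bundle` and the sample data are made up.

```python
from statistics import mean, pstdev

def bundle(samples, width=50):
    """Collapse each run of `width` consecutive numbers into one number."""
    bundled = []
    for start in range(0, len(samples), width):
        chunk = samples[start:start + width]
        # Mean plus standard deviation, as suggested in the text.
        # Rounding and the population form of the deviation are assumptions.
        bundled.append(round(mean(chunk) + pstdev(chunk)))
    return bundled

# 1000 made-up numbers standing in for the data of one word.
samples = [1, 2, 3, 4, 5] * 200
levels = bundle(samples, width=50)
print(len(levels))  # 1000 numbers become 20 bundle values
```

With a bundle width of 100 instead of 50, the same 1000 numbers would give 10 values, so the tree would need half as many levels.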
We can insert a 0 in front of values less than 10 in the string, or we can put a 0 in front of values
more than 9. Either way will do. If numbers less than 10 occur more frequently, then mark the
numbers more than 9 instead. This is just to distinguish single-digit numbers from double-digit
numbers.
Structure:
Say that, for each level of the tree, we are working with numbers up to 20 as the combined bundle
number. Obviously, we will get the right limit by examining the sound file data; for this example let
us stay with 20. So the tree will have 20 branches in every level. A word that has a specific number
in its string for a level will be in the branch with that number. Say the word “hello” has 1 as its first
number, so it will be in the 1st branch of level one. Say “world” has 3 as its first number, so it will be
in the 3rd branch of level one. But their added value is 4, so they will also be considered as a combo
word if the string value is 4. We add these values because the data comes through a sound wave, and
we get the combined effect of both waves when two persons are talking at the same time. Whatever
sounds are present, they all arrive as a single sound wave for a specific time interval. So each word
and its combos with other words are also considered in the tree. If we break down the sound file
data into small time intervals, like milliseconds, we will get the combined effect of all the sound
present in each interval. By looking at the data of one small time interval alone, we cannot
understand the spoken words or recognize the sound. We have to consider data over a significant
amount of time, but we examine the small intervals to draw a conclusion about the large interval. To
draw a conclusion about 1 second of sound file data, we will break it down into milliseconds or even
smaller time intervals and examine all of them to find the words spoken in that time. In my opinion,
our brain also uses the same procedure to recognize sound data.
It is not like a normal tree, where an upper-level node has child nodes. In this tree, every level has a
fixed number of nodes or branches, which is 20 in this example. No node in any level is a child of any
other node. Upper and lower nodes are connected only when that is necessary for inserting a word.
A node is not necessarily connected with every upper node or with every lower node; if no inserted
word needs a connection, that connection does not exist. For example, say that to insert all the
words in the training set we do not need a connection from the 3rd node of the 2nd level to the 5th
node of the 3rd level. Then the 5th node of the 3rd level will not appear in the connection list of the
3rd node of the 2nd level. In the other direction, the first node in the first level may be connected
with the 2nd node in the 2nd level, if that is necessary to insert a word. Connections are created
only while inserting a word.
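One possible way to picture this structure in code is shown below. It is a minimal sketch, not the definitive design: the class name `LevelTree`, the branch numbering 0 to 19, the per-link counter and the dictionary of complete word strings are all assumptions made for illustration.

```python
from collections import Counter

class LevelTree:
    """Fixed number of branches per level; links exist only if an insert needed them."""

    def __init__(self, branches=20):
        self.branches = branches
        # (level, node, next_node) -> how many insertions used this link
        self.links = Counter()
        # complete number string -> word (hypothetical bookkeeping)
        self.words = {}

    def insert(self, word, numbers):
        if any(not 0 <= n < self.branches for n in numbers):
            raise ValueError("number outside the branch range")
        # Connect each node to the node used on the next level.
        pairs = zip(numbers, numbers[1:])
        for level, (node, next_node) in enumerate(pairs, start=1):
            self.links[(level, node, next_node)] += 1
        self.words[tuple(numbers)] = word

    def connected(self, level, node, next_node):
        # Counter returns 0 for links that were never created.
        return self.links[(level, node, next_node)] > 0

tree = LevelTree()
tree.insert("hello", [1, 2, 3, 4, 5])
tree.insert("world", [1, 4, 5, 6, 7])
print(tree.connected(1, 1, 2))  # link made while inserting "hello"
print(tree.connected(2, 3, 5))  # no inserted word needed this link
```

The counter per link is kept so that frequently used connections can be distinguished from rare ones during a search.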
Say for the word “hello” we get 12345 as the number string and for the word “world” we get 14567.
Then their added value will be 2, 6, 8, 10, 12 (written without separators, 2681012). This is what will
happen with the sound wave if both words are spoken at the same time. The effects of the words
acting on the same time interval get added and are received by the sound recognition device.
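The addition can be checked with a few lines of code. The sketch below also applies the zero-in-front convention described earlier, because the combined string contains both single-digit and double-digit values. The function names `superpose` and `encode` are illustrative, and both words are assumed to start at the same position and have the same length.

```python
def superpose(a, b):
    """Add two number strings position by position (same start, same length)."""
    return [x + y for x, y in zip(a, b)]

def encode(values):
    """Write every value with two digits so single and double digits stay distinct."""
    return "".join(f"{v:02d}" for v in values)

hello = [1, 2, 3, 4, 5]
world = [1, 4, 5, 6, 7]
combined = superpose(hello, world)
print(combined)          # [2, 6, 8, 10, 12]
print(encode(combined))  # 0206081012
```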
In the sound number string, a word can start anywhere; it could be at the beginning or at any
random position. Since we will break down the sound file data, a word could be at any position in
each small part. That is why we have to examine a long string of numbers in a tree with a good
number of levels. We can always restrict the levels in the tree by increasing the bundle or block
width. If we make the bundle width 100 instead of 50, then we will have fewer levels in the tree.
This decision we have to make after examining
example, when we are searching for words, we are currently at the 3rd node of the 2nd level.
According to the number string, we have the option to go to the 2nd, 3rd and 4th nodes of the 3rd
level. But while inserting training words, from the 3rd node of the 2nd level we have made 10
connections to the 2nd node, 12 connections to the 3rd node and 15 connections to the 4th node of
the 3rd level. In that case, we will try the 4th node of the 3rd level first, then the 3rd and finally the
2nd node. This will reduce our search time. Using this technique, we try popular nodes or patterns
first.
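The ordering in this example can be sketched as a simple sort by connection count. The function name `order_by_popularity` and the dictionary of counts are illustrative assumptions; the counts are the ones from the example.

```python
def order_by_popularity(options, link_counts):
    """Return the candidate next nodes, most frequently connected first."""
    return sorted(options, key=lambda node: link_counts.get(node, 0), reverse=True)

# Connections made from the 3rd node of the 2nd level while inserting training words.
link_counts = {2: 10, 3: 12, 4: 15}
search_order = order_by_popularity([2, 3, 4], link_counts)
print(search_order)  # [4, 3, 2]
```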
Search:
While searching for whether a word is in the tree or not, start checking from the first level, because
all words are added to the tree starting from the first level. As we go to each next level, we keep
track of the max length of a word. For example, say the max length of a word is 10, we are in the
15th level and the string is abcdefghijklmno. We will take the string starting from the 6th level,
which is fghijklmno. Then we will try ghijklmno, hijklmno, ijklmno, jklmno, klmno, lmno, mno, no
and o. And for each number which is more than 1, we will try all lower numbers. For example, if the
number is 5 for a specific level, then we will try 1, 2, 3, 4 and 5 as the number for that level. For
example, say the string is 333. Then we will try 111, 112, 113, 121, 122, 123, 131, 132, 133, 211,
212, 213, 221, 222, 223, 231, 232, 233, 311, 312, 313, 321, 322, 323, 331, 332, 333 and also
substrings with fewer numbers.
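Both enumerations in this paragraph can be reproduced with a short sketch. The function names `suffixes` and `candidate_patterns` are illustrative, and the code only lists the candidates; it does not yet check them against the tree.

```python
from itertools import product

def suffixes(levels, max_word_length):
    """Keep the last max_word_length items, then list every suffix of that window."""
    window = levels[-max_word_length:]
    return [window[i:] for i in range(len(window))]

def candidate_patterns(observed):
    """For each position try every number from 1 up to the observed number."""
    ranges = [range(1, value + 1) for value in observed]
    return [list(p) for p in product(*ranges)]

starts = suffixes("abcdefghijklmno", 10)
print(starts[0], starts[1], starts[-1])  # fghijklmno ghijklmno o

patterns = candidate_patterns([3, 3, 3])
print(len(patterns))  # 27
```

The number of candidates is the product of the observed numbers, so it grows quickly with the length of the string; this is why unavailable patterns need to be cut off early.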
Every time we create a pattern to match, we will check whether it is available in the tree or not. If it
is not available, we will stop that pattern and all the possible patterns starting from it. Say we have
checked that 11 is not available; then we will not create that pattern and will also discard all the
patterns which start with 11. Once we add a number to the list, we will check it first and continue
only if it is available. We will not try any string which is not available in the tree. We try the lower
numbers because, if we have 3 in the string for a specific instant of sound wave data, it could have
been generated by the combined effect of two or more sounds. For example, if we get 5 as one
instant’s value, then 2 might be the effect of one person’s voice, 2 of another person’s and 1 of
some other noise; all together we got 5.
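The pruning rule can be sketched as a depth-first search that extends a pattern only while it is still the beginning of some stored word. Using a set of prefixes in place of the tree is an assumption made to keep the example short, and the two stored words are made up.

```python
def build_prefixes(word_strings):
    """Collect every leading part of every stored word string."""
    prefixes = set()
    for numbers in word_strings:
        for end in range(1, len(numbers) + 1):
            prefixes.add(tuple(numbers[:end]))
    return prefixes

def pruned_patterns(observed, prefixes):
    """List the patterns at or below the observed numbers that are available."""
    found = []

    def extend(pattern, level):
        if level == len(observed):
            return
        for value in range(1, observed[level] + 1):
            candidate = pattern + (value,)
            # Unavailable pattern: drop it and everything that starts with it.
            if candidate in prefixes:
                found.append(candidate)
                extend(candidate, level + 1)

    extend((), 0)
    return found

prefixes = build_prefixes([[1, 2, 3], [2, 1, 3]])
found = pruned_patterns([3, 3, 3], prefixes)
print(found)
```

Here 11 is not available, so none of 111, 112 and 113 is ever generated; only 6 of the 39 possible patterns of length 1 to 3 are visited.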
For any level, the direct number will be considered. For example, if we have 5 for the first level, any
word starting with 5 will be considered. Also, any two or more words whose first numbers add up to
5 will be considered. To make it simple, let us stay with two-word combinations only. Say two words
whose first numbers are 1 and 4, or 2 and 3, will also be considered.
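The two-word case can be listed mechanically. The sketch below assumes every word number is at least 1 and treats the pair (1, 4) as the same as (4, 1); the function name `first_number_pairs` is illustrative.

```python
def first_number_pairs(observed):
    """Pairs of first numbers (each at least 1) that add up to the observed number."""
    return [(a, observed - a) for a in range(1, observed // 2 + 1)]

pairs = first_number_pairs(5)
print(pairs)  # [(1, 4), (2, 3)]
```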
Any word can start from any level. For example, “hello” has the string 12345 and “world” has the
string 23144, but the 2nd word starts from the 3rd level. When two people are talking, they do not
necessarily pronounce each word at the same time. The moment one person starts a word has no
relation to the other person’s timing. So it is random data. In the third level, the word “hello” is in
the 3rd branch as a single word and also in the 5th branch as a combo word with “world”.
When any word is found, its combo word will continue to be matched, and its single value will be
considered for the remaining levels. For example, “hello” will be found at the 5th level. But by the
sixth level, we have considered “world” only up to 2314. So any word can start from any level and
stop at any level.
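The “hello” and “world” example with different starting levels can be worked out as follows. This is a sketch under the assumption that levels are counted from 1 and that positions where nobody speaks contribute 0; the function name `superpose_at` is illustrative.

```python
def superpose_at(streams):
    """streams: list of (start_level, numbers), with levels counted from 1."""
    length = max(start - 1 + len(numbers) for start, numbers in streams)
    combined = [0] * length
    for start, numbers in streams:
        for i, n in enumerate(numbers):
            combined[start - 1 + i] += n
    return combined

hello = (1, [1, 2, 3, 4, 5])   # starts at the 1st level
world = (3, [2, 3, 1, 4, 4])   # starts at the 3rd level
per_level = superpose_at([hello, world])
print(per_level)  # [1, 2, 5, 7, 6, 4, 4]
```

The third value is 5, which is why “hello” sits in the 3rd branch as a single word and in the 5th branch as a combo word at that level, and “world” is still unfinished when “hello” ends at the 5th level.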
We have to keep processing all the time. We can reduce that by ignoring simple data, for example
111111, which is data with little variation: a simple sound, not conversation. Noise will be a
problem. We have to reduce noise as much as possible. If no match is found after a certain number
of levels, discard some numbers (at least the first number) and start the search again.
When passing each level, all the words that start with the passing level’s number will be considered,
and also all the words whose combination adds up to that number. Say for the 5th level we have 8.
We already have the words that started from the first level or any upper level, along with the
combo words starting in any upper level. In addition to those, all the words that start with 8 will
now be added to the consideration list. Also, all the words whose first numbers add up to 8 will be
added to the list. For words starting in upper levels, we will take any number up to 8 as a match,
since we got 8 in the sound wave for this position. It could be 1 for the word and 7 for noise, or 2 for
one person’s word, 3 for another person’s word and 3 for noise. Basically, all the numbers up to 8
could be candidates for the word string at this position. This has an advantage and also a
disadvantage. The advantage is that we do not have to worry about noise and the effect of other
words. The disadvantage is that we might find a wrong word. To prevent that we can restrict it. We
can allow only valid combinations of two words. Or we can leave some room for noise: say, we
search word combinations or single words up to 6 and leave 2 for noise.
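The restriction with room left for noise can be written as a small helper. The function name `allowed_values` and the parameter `noise_reserve` are hypothetical names for the idea in the last sentence.

```python
def allowed_values(observed, noise_reserve=0):
    """Numbers a word may have at a position, after setting aside room for noise."""
    top = max(observed - noise_reserve, 0)
    return list(range(1, top + 1))

print(allowed_values(8))                   # every number up to 8
print(allowed_values(8, noise_reserve=2))  # only up to 6, with 2 left for noise
```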
In each level, we have to check whether a match for any word is found. Say we are in the 5th level
and we have the data 12345. Then we have to see if any word’s string is 12345. If we find one, then
we have found a match. If one word is a substring of another word, then we have a problem. We can
solve that by continuing the string as a candidate for a match. For example, another word has the
string 1234567. It is up to the implementation of the tree. If we see this is possible, then we can add
this; otherwise, once a word is found, that string is out of the consideration list. Basically, we have
to prepare a possible word list for each node, but accept only the words which are in the given
string.
The string 12345 will be considered from the first level and again from the 2nd level; actually, from
any level. Basically, whenever we have 12345 in the data, this word will be considered. In any level,
if we have 1, then this word is added to the list. The string could be 123451122331234512121212345.
Then this word will be found three times.
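The count of three can be verified with an exact substring search. This sketch only covers the simplest case, single-digit numbers with no noise or second speaker added, and the function name `find_all` is illustrative.

```python
def find_all(data, word):
    """Return every position where the word's number string starts in the data."""
    positions = []
    start = data.find(word)
    while start != -1:
        positions.append(start)
        # Continue one position later so overlapping matches are not missed.
        start = data.find(word, start + 1)
    return positions

positions = find_all("123451122331234512121212345", "12345")
print(positions)  # [0, 11, 22]
```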
One word will be added multiple times. It will be added to the tree starting from each level. Say the
word 12345 will be added starting from the 1st level, and then again starting from the 2nd level.
Say, we have 100
is 3 and for each portion we are taking 10 numbers. So the first 10 will go to tree number 1, and then
we have to take 2 numbers from the previous portion and 8 new numbers. Say
abcdefghijklmnopqrstuvwxyz is the number string. So we will take the first 10 numbers, which is
abcdefghij, for tree 1, ijklmnopqr for the 2nd tree, qrstuvwxyz for the 3rd tree and so on. If we
implement hard rules for combo words, then we have to take as many numbers from the previous
take as the rule requires.
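The overlapping portions in this example can be produced as follows. The function name `overlapping_windows` and its parameters are illustrative; the width of 10 and the overlap of 2 are the values from the example.

```python
def overlapping_windows(data, width=10, overlap=2):
    """Cut data into portions of `width`, each reusing `overlap` items of the previous one."""
    if overlap >= width:
        raise ValueError("overlap must be smaller than width")
    step = width - overlap
    # Stop before a portion that would contain nothing but the reused items.
    return [data[i:i + width] for i in range(0, len(data) - overlap, step)]

portions = overlapping_windows("abcdefghijklmnopqrstuvwxyz", width=10, overlap=2)
print(portions)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```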
Multiple match problem:
If we have multiple matches for a word, we can reduce the block length or bundle length and create
another tree. We can create trees with different block lengths, for example 50, 25 or 10. If we find
multiple words for the same number string and have a problem choosing one, we can try the more
accurate or elaborate tree. For example, say we are checking a number string with block length 50
and multiple matches are found. Then we can switch to block length 25 and use that tree to recheck
the data and try to find a more accurate result.
We can also use tone or voice-specific data and find out which match is more likely. Say we are in
the north part of the country, two persons are having a conversation, and we find a word match
which could be from people of the south; its probability of being accepted as a word will be lower,
especially when people from the north are having the conversation. So, in the middle of a
conversation, the voices of other people get less probability. This is an example. We can take similar
kinds of issues into consideration.
When a word is found using inner values of the number string, the remainder should also match
some other word or some other known sound. We can also keep some common sounds which are
found in our everyday life. This is only necessary when we have found multiple words using inner
values of the number string. We will choose the word which gives the more accurate or appropriate
result. This will reduce the chance of choosing a wrong word when multiple matches are found for a
string. The remainder could be part of a word or a sound. Basically, we have to choose the one
which produces a better result or which has the chance to produce a better result.
Some techniques to reduce search time:
Use small trees:
We can use small trees to do the task quickly. For example, we can put all the words that start with
1 in a separate tree, whose first level has a single node with value 1 and whose remaining levels
have 20 branches. Similarly, another tree for words starting with 2 and so on. We can also keep
separate trees for different word lengths. For example, one separate tree for words starting with 1
and of length 3, and another tree for words starting with 1 and of length 4. Similarly, separate trees
for different word lengths starting with different numbers. We can also use these trees along with
the big tree. Whenever we