Machine Learning White Paper on Speech to Text Conversion

White Paper on Machine Learning
(Speech to Text Conversion using Android Platform)
Group 10
Apurva Mittal (20141009)
Ketan Gyanchandani (20141028)
Riya Giri (20141058)
Sanjeev Kumar (20141063)
Saurabh Ojha (20141064)
Vikash Kumar (20141072)

Introduction:
Machine learning is a type of artificial intelligence (AI) that provides computers with the
ability to learn without being explicitly programmed. Machine learning focuses on the
development of computer programs that can teach themselves to grow and change when
exposed to new data. The process of machine learning is similar to that of data mining. Both
systems search through data to look for patterns. However, instead of extracting data for
human comprehension, machine learning uses that data to improve the program's own
understanding. Machine learning programs detect patterns in data and adjust program actions
accordingly. For example, Facebook's News Feed changes according to the user's personal
interactions with other users. If a user frequently tags a friend in photos, writes on his wall or
"likes" his links, the News Feed will show more of that friend's activity in the user's News
Feed due to presumed closeness.
Essentially, it is a method of teaching computers to make and improve predictions or
behaviours based on some data. What is this "data"? Well, that depends entirely on the
problem. It could be readings from a robot's sensors as it learns to walk, or the correct output
of a program for certain input. Another way to think about machine learning is that it is
"pattern recognition" - the act of teaching a program to react to or recognize patterns.
Speech has not been used much in the field of electronics and computers due to the
complexity and variety of speech signals and sounds. However, with modern processes,
algorithms, and methods we can process speech signals easily and recognize the text. Speech
recognition (SR) is the translation of spoken words into text. It is also known as "automatic
speech recognition" (ASR), "computer speech recognition", or just "speech to text" (STT).
Background:
For the past several decades, designers have processed speech for a wide variety of
applications ranging from mobile communications to automatic reading machines.

Speech recognition is usually performed in middleware, and the results are passed on to the
user applications.
Speech recognition on the Android platform is done over the Internet by connecting to Google's
servers. Speech recognition for the Voice SMS application uses algorithms based on hidden
Markov models (HMMs) and an N-gram language model, a combination that has been among the
most successful and flexible approaches to speech recognition. The application is adapted to
input messages in English.
A hidden Markov model (HMM) is a statistical Markov model in which the system being
modelled is assumed to be a Markov process with unobserved (hidden) states.
Markov Model:
In simple Markov models the state is directly visible or known to the observer, and therefore
the state transition probabilities are the only parameters.
Consider a four-state Markov model of the weather. Suppose that once a day (e.g. in the
morning) the weather is observed to be in one of the following states, with the state
transition probabilities shown in the figure.
• State 1: cloudy
• State 2: sunny
• State 3: rainy
• State 4: windy

Fig. Markov Model for weather
From the above figure, the probability of any weather pattern over a period of time can be
computed, because the initial state probabilities and the state transition probabilities are
known. For example, the probability of observing the sequence "sunny, rainy, sunny, windy,
cloudy, cloudy", i.e. the state sequence 2, 3, 2, 4, 1, 1, is given by

$$P(O \mid A, \pi) = \pi_2 \, a_{23} \, a_{32} \, a_{24} \, a_{41} \, a_{11}$$

where

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{pmatrix}$$

is the matrix of state transition probabilities and

$$\pi = (\pi_1, \pi_2, \pi_3, \pi_4)$$

is the vector of initial state probabilities.
Thus, in a simple Markov model, the probability of a sequence of events can be computed
directly from the initial state probabilities and the transition probabilities.
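The calculation above can be sketched in a few lines of Python. This is only an illustration: the figure with the actual transition probabilities is not reproduced here, so the values of PI and A below are assumed placeholders, and the function name sequence_probability is hypothetical. The example uses the state sequence 2, 3, 2, 4, 1, 1 that corresponds to the factors of the equation.

```python
# States are indexed as in the text: 1 = cloudy, 2 = sunny, 3 = rainy, 4 = windy.
STATES = ["cloudy", "sunny", "rainy", "windy"]

# Assumed values: placeholders chosen only so that PI and each row of A sum to 1.
PI = [0.25, 0.25, 0.25, 0.25]
A = [
    [0.4, 0.3, 0.2, 0.1],  # from cloudy
    [0.2, 0.5, 0.1, 0.2],  # from sunny
    [0.3, 0.2, 0.4, 0.1],  # from rainy
    [0.3, 0.3, 0.2, 0.2],  # from windy
]


def sequence_probability(sequence, pi, a, states=STATES):
    """Probability of a fully observed state sequence in a Markov chain."""
    idx = [states.index(s) for s in sequence]
    prob = pi[idx[0]]                    # initial state probability
    for prev, cur in zip(idx, idx[1:]):  # one transition factor per step
        prob *= a[prev][cur]
    return prob


weather = ["sunny", "rainy", "sunny", "windy", "cloudy", "cloudy"]
p = sequence_probability(weather, PI, A)
print(p)
```

With real transition probabilities in place of the placeholders, the same function gives the probability of any weather sequence.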
Hidden Markov Model:
The Markov model used in speech-to-text conversion is known as a hidden Markov model. The
model is called hidden because the state is not directly visible. Although the state itself
is not directly visible, the output, which depends on the state, is visible. Each state has a
probability distribution over the possible outputs. Therefore the sequence of outputs
generated by an HMM gives some information about the sequence of states. The adjective
"hidden" refers to the state sequence through which the model passes, not to the
parameters of the model; even if the model parameters are known exactly, the model is
still "hidden".
Hidden Markov models have applications in temporal pattern recognition such as speech,
handwriting, gesture recognition, part-of-speech tagging, musical score following, and
bioinformatics.
The HMM can be better understood by looking at the Urn & Ball Model.
Urn & Ball Model: Hidden Markov Model
Let's consider that there are N large glass urns in a room. Each urn contains a known number
of coloured balls. Suppose that the set of N urns contains balls of 6 colours (R = red,
O = orange, K = black, G = green, B = blue, P = purple).
The person in the room chooses an urn in that room and randomly draws a ball from that urn.
He then puts the ball on a conveyor belt, where the observer can observe the sequence of the
balls but not the sequence of urns from which they were drawn. The person in the room has
some procedure to choose urns; the choice of the urn for the n-th ball depends only upon a
random number and the choice of the urn for the (n − 1)-th ball. The choice of urn does not
directly depend on the urns chosen before the single previous urn; therefore, this is called a
Markov process. The Markov process itself cannot be observed; only the sequence of
labelled balls can be observed, so this process is called a hidden Markov process.
Fig: Urn & Ball Model
Although the observer does not know the sequence in which the urns were chosen, he
knows the probability of drawing a ball of each colour from each urn. The observer can
therefore calculate the probability of an observed sequence of balls by combining these
probabilities over all the possible sequences of urn choices.
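Summing over every possible sequence of urns is commonly done with the forward algorithm. The sketch below is a minimal illustration under stated assumptions: the two-urn, two-colour parameters PI, A and B are hypothetical, as is the function name forward_probability.

```python
def forward_probability(observations, pi, a, b):
    """Return P(observations) for an HMM using the forward algorithm.

    pi[i]   : probability that the first ball is drawn from urn i
    a[i][j] : probability of moving from urn i to urn j
    b[i][o] : probability of drawing colour o from urn i
    """
    n = len(pi)
    # alpha[i] = P(observations so far, current urn = i)
    alpha = [pi[i] * b[i][observations[0]] for i in range(n)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * a[i][j] for i in range(n)) * b[j][obs]
            for j in range(n)
        ]
    return sum(alpha)


# Hypothetical parameters: two urns, two colours.
PI = [0.5, 0.5]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [{"red": 0.9, "blue": 0.1}, {"red": 0.2, "blue": 0.8}]

p_red_blue = forward_probability(["red", "blue"], PI, A, B)
print(p_red_blue)
```

The observer never sees which urn was used; the algorithm only needs the ball colours and the model parameters.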
N-gram Language Model:
As we may recall being told by our high school grammar teacher, not every random
combination of words forms a grammatically acceptable sentence:
• Colourless green ideas sleep furiously
• Furiously sleep ideas green colourless
• Ideas furiously colourless sleep green
The sentence Colourless green ideas sleep furiously (made famous by the linguist Noam
Chomsky), for instance, is grammatically perfectly acceptable, but of course entirely
nonsensical. If you compare this sentence to the other two sentences, this grammaticality
becomes evident. The sentence Furiously sleep ideas green colourless is grammatically
unacceptable, and so is Ideas furiously colourless sleep green: these sentences do not play by
the rules of the English language. In other words, the fact that languages have rules
constrains the way in which words can be combined into an acceptable sentence.
Language plays by rules, whereas computers work with rules. Inferring a set of rules gives us
the language model: a model that describes how a language, say English, works and behaves.
The rules by which a language plays are very complex, and no full set of rules to describe a
language has ever been proposed. There are simpler ways to obtain a language model, namely
by exploiting the observation that words do not combine in a random order. That is, we can
learn a lot from a word and its neighbours. Language models that exploit the ordering of
words are called n-gram language models, in which n represents any integer greater than
zero.
N-gram models can be imagined as placing a small window over a sentence or a text, in
which only n words are visible at the same time. The simplest n-gram model is therefore a so-
called unigram model. This is a model in which we only look at one word at a time. The
sentence Colourless green ideas sleep furiously, for instance, contains five unigrams:
“colourless”, “green”, “ideas”, “sleep”, and “furiously”. Of course, this is not very
informative, as these are just the words that form the sentence. In fact, N-grams start to
become interesting when n is two (a bigram) or greater.
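The sliding-window picture can be illustrated with a short Python sketch. The function name ngrams is an assumption made for this example and is not taken from any particular library.

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) seen through a window of n tokens."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Slide the window one token at a time; too-short inputs give an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "colourless green ideas sleep furiously".split()
unigrams = ngrams(sentence, 1)  # five windows of one word each
bigrams = ngrams(sentence, 2)   # four windows of two adjacent words
print(bigrams)
```

Making the window size a parameter means the same function serves for unigrams, bigrams, trigrams and so on.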
The idea of a bigram extends to n-grams of any specified length: rather than always taking
two adjacent elements, the number of elements in the window becomes a parameter. When used
for language modelling, independence assumptions are made so that each word depends only on
the last n-1 words. This Markov model is used as an approximation of the true underlying
language. The assumption is important because it massively simplifies the problem of
learning the language model from data. In addition, because of the open nature of language,
it is common to group words unknown to the language model together. In a simple n-gram
language model, the probability of a word, conditioned on some number of previous words (one
word in a bigram model, two words in a trigram model, etc.), can be described as following a
categorical distribution (often imprecisely called a "multinomial distribution"). The same
idea applies at the level of letters, where an n-gram model predicts the likelihood of the
next letter. From training data, one can derive the probability distribution for the next
letter given a history of size n, e.g. a = 0.4, b = 0.0004, c = 0, where the probabilities of
all possible "next letters" sum to 1.0.
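As a rough illustration of how such a next-letter distribution can be derived from training data, the sketch below counts which character follows each history and normalises the counts. The function names, the history_size parameter and the toy training string are assumptions made for this example only.

```python
from collections import Counter, defaultdict


def train_char_model(text, history_size):
    """Count which character follows each history of `history_size` characters."""
    counts = defaultdict(Counter)
    for i in range(len(text) - history_size):
        history = text[i:i + history_size]
        next_char = text[i + history_size]
        counts[history][next_char] += 1
    return counts


def next_char_distribution(counts, history):
    """Relative-frequency estimate of P(next character | history)."""
    total = sum(counts[history].values())
    return {ch: c / total for ch, c in counts[history].items()}


model = train_char_model("abracadabra", 1)
dist = next_char_distribution(model, "a")
print(dist)
```

Characters never seen after a given history simply receive no entry here; practical language models usually smooth these zero counts.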
Problems: the following problems are faced in today's world
Hands-free computing:
Today's generation prefers speaking to writing. This may be due to many reasons: scarcity of
time, the efficiency of multitasking, and hassle-free division of tasks. People have many
things to do and very little time. They need an interface that they can connect to and
interact with to make their daily tasks easier. With the help of this application, anyone can
provide voice input, and the speech is immediately converted into text without any extra
effort.
Education and daily life:
Today's tech-savvy youngsters want to get their hands on anything and everything. They are
involved in so many things that they cannot bear spending time writing their projects and
assignments. Children today are loaded with so many activities that their health is degrading
day by day. This application will help them do their assignments: they only have to dictate
their lines, and the application will provide them with the written documents. Moreover, they
need to learn new languages so that they can connect with the outside world, and to work on
their pronunciation skills too. With its multiple-language option, this application can help
people learn different languages without the help of a tutor and without travelling to the
region where the language is spoken.
In day-to-day life, where text messages are an integral part of our lives, this application
will help everyone compose text messages on the go while doing another task. For example,
anyone can simply speak an urgent message to be sent while driving.
Blindness and education:
Some people are unable to write, either because of blindness (complete or partial) or for
other reasons. Students who are physically disabled or suffer from repetitive strain injury
or other injuries to the upper extremities have to worry about handwriting, typing, or
working with a scribe on school assignments. They all need an interface that listens to
them, does their task, and helps them connect to the outside world instantly.
For people who cannot read or write, it is very difficult to use the texting applications on
phones. This application will help them in this kind of situation by providing a suitable
platform.
Solution:
Speech recognition will be very helpful to such people. They will be able to take notes of
anything and everything and send messages across distances on the go. Students who are blind
or have very low vision can benefit from using the technology to convey words and then hear
the application recite them, as well as use a computer by commanding it with their voice
instead of having to look at the screen and keyboard. Speech recognition can also be useful
for learning a second language: it can teach proper pronunciation, in addition to helping a
person develop fluency in their speaking skills. Today's generation prefers speaking to
writing, and for such tech-savvy youngsters speech to text is a convenient way to promote
learning and sharing of information. It will help them take notes on the go.
ANDROID Platform as a way out of this problem:
Android is a software environment for mobile devices that includes an operating system,
middleware and key applications.
The Android operating system (OS) architecture is divided into 5 layers. The application
layer of Android OS is visible to the end user and consists of user applications. It
includes the basic applications which come with the operating system and the applications
which the user subsequently installs. All applications are written in the Java programming
language. The framework is an extensible set of software components used by all applications
in the operating system. The next layer represents the libraries, written in the C and C++
programming languages, which the OS accesses via the framework. The Dalvik Virtual Machine
(DVM) forms the main part of the runtime environment. The virtual machine is used to
run the core libraries written in the Java programming language.
Android Architecture
Unlike Java's virtual machine, which is stack-based, the DVM is register-based
and is intended for mobile devices. The last architecture layer of the Android operating
system is the kernel, based on Linux, which serves as a hardware abstraction layer. The main
reasons for its use are memory and process management, its security model, its network stack
and its continuing development. There are four basic components used in the construction of
applications: activity, intent, service and content provider. An activity is the main element
of every application; put simply, it is a window that users see on their mobile device. An
application can have one or more activities. The main activity is the one that is used at
startup. The transition between activities is carried out by the running activity launching
a new activity. Each activity, as a separate component, is implemented by inheriting from the
Activity class. During the execution of an application, activities are added to a stack, and
the currently running activity is on the top of the stack.
An intent is a message used to start activities, services, or broadcast receivers. An intent
can contain the name of the component to run, the action to be executed, the address of the
stored data needed to run the component, and the component type. A
service is a component that runs in the background to perform long-running operations or to
perform work for remote processes. One service can be bound to by multiple applications, and
the service runs until the connections with all applications are closed. A content provider
manages a shared set of application data. Data can be stored in the file system, a SQLite
database, on the web, or in any other persistent storage location which the application can
access [1]. Through the content provider, other applications can query or even modify the
data (if the content provider allows it).
Speech Recognition:
Speech recognition for this application is done on Google's servers, using HMM and n-gram
models. The system can be divided into several blocks: feature extraction, an acoustic model
database built from the training data, a dictionary, a language model and the speech
recognition algorithm.
The input audio waveform from a microphone is converted into a sequence of fixed-size
acoustic vectors $Y_{1:T} = y_1, \ldots, y_T$ in a process called feature extraction. The
decoder then attempts to find the sequence of words $w_{1:L} = w_1, \ldots, w_L$ which is most
likely to have generated $Y$, i.e. the decoder tries to find

$$\hat{w} = \arg\max_{w} \{ P(w \mid Y) \}$$

However, since $P(w \mid Y)$ is difficult to model directly, Bayes' rule is used to transform
it into the equivalent problem of finding

$$\hat{w} = \arg\max_{w} \{ p(Y \mid w) \, P(w) \}$$

The likelihood $p(Y \mid w)$ is determined by an acoustic model and the prior $P(w)$ is
determined by a language model.
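The decision rule can be illustrated with a toy Python sketch that scores a handful of candidate transcriptions. Everything in it is assumed for illustration: the candidate sentences, the probability values and the function name decode are hypothetical, and a real decoder searches a far larger space of word sequences.

```python
import math


def decode(candidates, acoustic_loglik, lm_logprob):
    """Pick the candidate w that maximises log p(Y|w) + log P(w)."""
    return max(candidates, key=lambda w: acoustic_loglik[w] + lm_logprob[w])


# Hypothetical scores for two acoustically similar hypotheses.
candidates = ["recognise speech", "wreck a nice beach"]
acoustic_loglik = {
    "recognise speech": math.log(0.30),    # assumed p(Y|w)
    "wreck a nice beach": math.log(0.35),
}
lm_logprob = {
    "recognise speech": math.log(0.010),   # assumed P(w)
    "wreck a nice beach": math.log(0.001),
}

best = decode(candidates, acoustic_loglik, lm_logprob)
print(best)
```

Working with sums of log-probabilities rather than products avoids numerical underflow on long utterances. In this toy case the acoustic model slightly prefers the second hypothesis, but the language model prior tips the decision towards the first.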
The basic unit of sound represented by the acoustic model is the phone. For example, the
word “bat” is composed of three phones /b/ /ae/ /t/. About 40 such phones are required for
English.
For any given w, the corresponding acoustic model is synthesized by concatenating phone
models to make words as defined by a pronunciation dictionary. The parameters of these
phone models are estimated from training data consisting of speech waveforms and their
orthographic transcriptions. The language model is typically an N-gram model in which the
probability of each word is conditioned only on its N-1 predecessors. The N-gram parameters
are estimated by counting N-tuples in appropriate text corpora (collections of text). The
decoder operates by searching through all possible word sequences, using pruning to remove
unlikely hypotheses and thereby keep the search tractable. When the end of the utterance is
reached, the most likely word sequence is output. Alternatively, modern decoders can generate
lattices containing a compact representation of the most likely hypotheses.
MAIN PARTS OF THE PROJECT:
A. Voice Recognition Activity class: Voice Recognition Activity is the startup activity,
defined as the launcher in the AndroidManifest.xml file.
This is where most of the initialization takes place, including the code that
programmatically interacts with widgets in the user interface. During initialization
there is also a check whether the mobile phone on which the application is installed
supports speech recognition. If the device does not have any of the Google
applications which integrate speech recognition, the rest of the Voice SMS application
is disabled and the message "Recognizer not present" is shown on the screen. The
recognition process is carried out through one of Google's speech recognition
applications. If a recognition activity is present, the user can start speech
recognition by pressing the button, which calls
startActivityForResult(Intent intent, int requestCode). The application uses
startActivityForResult() to broadcast an intent that requests voice recognition,
including an extra parameter that specifies one of two language models.
Enables search after clicking image button

Processes and gives text output
B. SMS class: this class acts as the interface for the SMS-sending activity. The recognized
text is entered in the message field and displayed on the screen. When the Send SMS button is
clicked, the application checks whether both the message and the recipient's number have been
entered before sending the message. When the cursor is placed in the recipient-number field,
the visibility attribute of the contacts button changes from its default value, gone, to
visible. Pressing this button lets the user pick a number from the contacts. After the
desired contact is selected, the message can be sent.

Interface for sending SMS
C. XML files: the application consists of two different interfaces. The screen shown when
the user launches the application is defined in voice_recognition.xml. Its linear layout
arranges widgets one below another. Width and height are set with the fill_parent attribute,
which makes them equal to those of the parent (in this case the screen). The second
interface, defined in the sms.xml file, is displayed when the user chooses one of the offered
messages. AndroidManifest.xml describes the application so that it can be installed and
launched on the mobile device.
Economic feasibility:
As far as the economic feasibility of this project is concerned, it would be very
cost-efficient for any company to incorporate this application into its products. The
following features of the application show its economic feasibility:
• The operating system used for development is free of cost, and so is the Eclipse IDE used
as the environment for application development.
• Free use and adaptation of the operating system by manufacturers of mobile devices.
• Equality of basic core applications and additional applications in access to resources.
• Optimized use of memory and automatic control of the applications which are being
executed.
• Quick and easy development of applications using development tools and a rich
collection of software libraries.
• High quality of audiovisual content: it is possible to use vector graphics and most
audio and video formats.
• Ability to test applications on most computing platforms, including Windows and Linux,
thus saving time and money.
Conclusion:
A speech recognizer's effectiveness depends on its synthesizing rate and pronunciation
quality. STT software generally uses only one type of language model. Using only one type of
algorithm does serve the purpose of converting speech to text, but it often results in
mediocre quality and a low synthesizing rate. Our application attempts to improve on this by
combining two models, namely the hidden Markov model and the n-gram model, which are among
the most effective algorithms for STT software.
Future developments:
The existing speech-to-text conversion software available online converts speech entered in
a particular language into text in that same language only. Moreover, the few applications
that do offer translation between multiple languages often create a mess for their users by
mixing up words and not giving the desired output.
With the development of the software and hardware capabilities of mobile devices, there is an
increased need for device-specific content, which has resulted in market changes. We look
forward to developing this software further so that speech can be entered in multiple
languages and converted effectively into text in multiple languages, which could create a
foundation for everyday use of this technology worldwide. The user will be able to choose
the language he wishes to speak, which will be matched against the existing dictionary
database, and will then be asked to choose the language into which the input should be
converted. The speech synthesizer will convert the input into the required language and
display the desired output. We will focus on the various languages spoken in India, making
the application one of its kind.
