Machine Translation for Low Resource Indian Languages Compared

Machine Translation for low resource
Indian Languages
Trushita Prashant Redij
Supervisor: Prof. Amir Esmaeily
Dublin Business School
This dissertation is submitted in partial fulfillment for degree of
Master of Science in Data Analytics.
May 2020

Copyright @ Trushita Redij, 2020
All Rights Reserved

Declaration
I hereby certify that the embodied thesis report submitted for examination for the award of
Master of Science in Data Analytics, is solely my own work and contains references and
acknowledgements for research done by the researchers and technical scholars.
The thesis comply by the regulations for postgraduate study by research of Dublin
Business School and has not been submitted in whole or in part for another award in any
other university.
The thesis work conforms by the ethics, principles and guidelines of applied research
stated by Dublin Business School.
Trushita Prashant Redij
May 2020

Acknowledgements
Motivation, guidance and determination has played a vital role in completion of this project
report on Machine Translation for Low Resource Indian Languages.
Foremost, I am grateful to God almighty for giving me strength and optimism to complete
this project at this difficult times.
I would like to express special gratitude and thanks to my project guide Prof. Amir
Sajad Esmaily for his expertise, feedback and guidance.
Lastly, my thanks and appreciation goes to my family and friends who encouraged,
supported and helped me with best of their abilities.

Abstract
Natural Language Processing predominantly comprises of various advent techniques and
methods which assist the computers to process natural languages. NLP based applications
like summarization, recommender systems, classification, machine translation systems, etc
have reflected the significant role of Artificial Intelligence in modern times. A tremendous
amount of data is available on the internet which majorly represents the English Language
thereby challenging the machine translation for other low resource languages.
Indian Languages are consecrated, concise, and syntactical rich and provide tremendous
scope to experiment using various methods of Machine Translation. The majority of the
work done on Indian languages has implemented rule-based and language-specific models
thereby assuring space for new experiments and development.
In this work, we present approaches to build an automatic translator system for the
Marathi language. We have proposed to build a statistically based Machine translation
model using the Moses toolkit and Deep Neural Network-based model using OpenNMT.
The training data used for this project comprises a parallel corpora of Bible in Marathi and
English.
Also, the research progressively depicts the evolution of the Machine translation system
and its various application. It highlights the process of data preprocessing, implementation,
testing, and evaluation.
Furthermore, to evaluate the performance of our models we used the BLEU metric
wherein we could analyze the performance of the two models. The performance of the Deep
Neural Network model was more accurate than the Statistical Machine Translation Model.
The thesis helped us conclude that Neural Network has emerged as a strong competitor
challenging the dominance of the primitive and popular SMT based approaches.

Table of contents
List of figures xiii
List of tables xv
1 Introduction 1
1.1 What is Natural language processing ? . . . . . . . . . . . . . . . . . . . . 2
1.2 What is Machine Translation? . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Application of Machine Translation . . . . . . . . . . . . . . . . . 5
1.2.2 Machine Translation System Architectures . . . . . . . . . . . . . 6
1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Key Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Background and Related Work 11
2.1 Rule Based Machine Translation . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Direct Machine Translation . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Transfer Based Machine Translation . . . . . . . . . . . . . . . . . . . . . 13
2.4 Interlingual Based Machine Translation . . . . . . . . . . . . . . . . . . . 13
2.5 Example Based Machine Translation . . . . . . . . . . . . . . . . . . . . . 14
2.6 Statistical Based Machine Translation . . . . . . . . . . . . . . . . . . . . 15
2.6.1 Word Based Statistical Machine Translation . . . . . . . . . . . . . 16
2.6.2 Syntax Based Statistical Machine Translation . . . . . . . . . . . . 16
2.6.3 Phrase Based Statistical Machine Translation . . . . . . . . . . . . 16
2.7 Deep Neural Machine Translation . . . . . . . . . . . . . . . . . . . . . . 18
2.7.1 Feed Forward Network . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7.2 Recurrent Neural Network . . . . . . . . . . . . . . . . . . . . . . 21
2.7.3 Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . 22
2.7.4 Transformer Model . . . . . . . . . . . . . . . . . . . . . . . . . . 23

x Table of contents
2.7.5 Transformer Architecture . . . . . . . . . . . . . . . . . . . . . . . 24
2.7.6 Open NMT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 Methodology 29
3.1 Building SMT Model using Moses . . . . . . . . . . . . . . . . . . . . . . 29
3.1.1 Moses - An open source SMT toolkit . . . . . . . . . . . . . . . . 29
3.2 Build Neural Machine Translation Model using OpenNMT . . . . . . . . . 32
3.2.1 Transformer Architecture . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2 Encoder and Decoder Input . . . . . . . . . . . . . . . . . . . . . 32
3.2.3 Self Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.4 Multi Head Attention . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.5 Masked Multi Head Attention . . . . . . . . . . . . . . . . . . . . 34
4 Objectives and Requirements 35
4.1 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2 Software Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5 Building Statistical Machine Translation model 37
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 Baseline System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.3 Corpus Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.4 Language Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.5 Training the Translation System . . . . . . . . . . . . . . . . . . . . . . . 39
5.6 Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.7 Binarising Phrase and Reordering Tables . . . . . . . . . . . . . . . . . . 40
5.8 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.9 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6 Building Deep Neural Machine Translation Model 43
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.2 Setup of Required Modules . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.3 Corpus Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.4 Pre-Processing Text Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.5 Training the Translator model : . . . . . . . . . . . . . . . . . . . . . . . . 45
6.6 Translate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.7 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

Table of contents xi
6.8 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
7 Evaluation and Analysis 49
7.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.1.1 Bilingual Evaluation Understudy Score . . . . . . . . . . . . . . . 49
7.2 Analysis of SMT and NMT models . . . . . . . . . . . . . . . . . . . . . . 50
7.2.1 Using Data and Implementing Model . . . . . . . . . . . . . . . . 50
7.2.2 Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
7.2.3 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
8 Conclusion and Future Work 51
References 53

List of figures
1.1 Machine Translation for Languages . . . . . . . . . . . . . . . . . . . . . 1
1.2 Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Natural Language Processing Levels
Image Source: NLPhackers.io . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 The Vauquois Triangle
Image Source: researchgate . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 History of Machine Translation, Image Source: medium.com . . . . . . . . 11
2.2 Direct Machine Translation
Image Source: medium.com . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Statistical Machine Translation . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Phrase Based Statistical Machine Translation
Image Source: wordpress . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5 Neural Machine Translation
Image Source: altoross.com . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6 Recurrent Neural Network
Image Source: sdl.com . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.7 Transformer Architecture
Image Source: medium.com . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1 Statistical Machine Translation using Moses . . . . . . . . . . . . . . . . . 30
3.2 Neural Machine Translation Process . . . . . . . . . . . . . . . . . . . . . 33

List of tables
5.1 BLEU Score for SMT Model . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.1 BLEU Score for NMT Model . . . . . . . . . . . . . . . . . . . . . . . . . 47

Chapter 1
Introduction
One of the most preponderant and challenging task for the computer since its evolution and
development is the automatic translation of texts for different languages. Human languages
are diverse and define distinct syntax and semantics thereby imposing challenges for Artificial
Intelligence to automate translation. Machine translation is a process of automatically
converting text from one language to another by using a software program [1].
Traditionally, machine translation was based on the rule-based system which was used
for interpretation by storing and manipulating knowledge and information [2].
In 1990s rule-based system were replaced by statistical methods were in bilingual or
parallel text corpora are used to derive parameters for the model [3].
Going further, deep neural network models dawned on the new era of automatic translation
called neural machine translation.
Fig. 1.1 Machine Translation for Languages
Machine translations comprises of input data made of a sequence of symbols of source
language which is parsed through a computer program to derive output sequence of the target
language.
The fundamental drawbacks of classical machine translation are the framing of rules and
exceptions, sequential nature, learning long-range dependencies in the network.

2 Introduction
However, for this research, we have implemented statistical machine translation using
the Moses tool and neural machine translation using the transformer model for low resource
Indian Language called Marathi.
1.1 What is Natural language processing ?
Natural language processing is a subfield of artificial intelligence that vitally focuses on
interactions between the human language and computer language. It’s a field which sits at
the intersection of computer science, artificial intelligence, and linguistics [4]. Languages are
spoken or written by humans to communicate like English, Hindi, Marathi, French, Japanese,
Chinese, etc are examples of natural language.
Predominantly, language is based on two fundamental aspects called symbols and rules,
symbols represent information that needs to be conveyed and rules define the manipulation
of symbols.
Fig. 1.2 Natural Language Processing
The primary aim of language processing is to interpret the language by understanding
its semantics and syntax thereby implementing it to develop applications like chatbots,
summarizer, auto tag, named entity recognition, sentiment analysis, online shopping, smart
devices like Cortana, Siri, etc. There are various methods to translate the sentences from one
language to another.

1.1 What is Natural language processing ? 3
However, human languages are complex and are based on unique syntax and seman-
tics thereby imposing a challenge in the field of Artificial Intelligence to process natural
languages.
Natural Language Understanding
• Natural Language Understanding task is to understand, interpret and reason a natu-
ral language which is at the input side. It deals with Machine reading comprehension
which is applied in automated reasoning, categorizing text, machine translation, ques-
tion answering, activating voice and content analysis [5].
Natural Language Generation
• It’s a process that deals with transformation of structured data into natural language.
It is used for automatic content generation for example chatbot, content for mobile or
web application. For Natural Language Generation the system makes decisions for
putting a concept into words thus the ideas the system wants to portray are known
precisely.
Formal Language
Formal language is made of symbols, alphabets and strings or word.
• Symbol is character or an abstract entity which has no meaning of itself. For e.g letters,
digits and special characters.
• Alphabet is finite set of symbols and is denoted using sigma. For e.g B = (0,1) B is an
alphabet of two symbols 0 and 1.
• String is finite sequence of symbols from an alphabet. For e.g 0110 and 111 are strings
from alphabet B above.
• Language is a finite set of strings from an alphabet.

4 Introduction
Linguistics and Language processing
Linguistics is a science of language processing which comprises of sounds, word formation,
sentence structure, meaning and understanding.
Fig. 1.3 Natural Language Processing Levels
Image Source: NLPhackers.io
There are five levels for processing natural language.
1. Morphological and Lexical analysis: Morphology depicts the identification, analysis,
and description of the structure of words into morphemes. Morphemes are the smallest
meaningful unit in the grammar of a language that has semantic meaning. For e.g
the word ’unbreakable’ has 3 morphemes, ’un’, ’break’, ’able’ [6]. There are various
types of Morphemes such as free, bound, inflectional, derivational, root, and null
morpheme. Syntax of the language comprises the set of rules that define the structure
of the language. It’s represented using a parse tree or by a list.

1.2 What is Machine Translation? 5
2. Lexical Analysis divides the text into paragraphs, sentences and words taking into
consideration the morphological and syntactical structure of the language.
3. Syntactical Analysis This step analyzes the words and transforms them to find their
relation with each other. It converts a flat input sentence into a hierarchical structure
that corresponds to the units of meaning in the sentence. It comprises of two main
components called grammar and parser. Grammar declares the syntactical represen-
tation and legal structure of the language. Parser compares the grammar against the
input sentences to produce a parsed structure called Parse Tree.
4. Semantic Analysis This step determines the absolute meaning from a context and
determines the possible meaning for a sentence in context. The structures derived from
syntactic analysis are assigned meaning and mapped to the objects in task domain. For
e.g the sentence ’colourless red ideas’ will be rejected as colorless red does have any
meaning [6].
5. Discourse Processing The meaning of an individual sentence may depend on the
previous sentence or the sentence preceding it. For e.g the word ’it’ in the sentence "
you wanted it " depends on the prior discourse content [6].
6. Pragmatic Analysis This step deals with knowledge that is beyond the context of the
word. Pragmatics analysis derives the various aspects of language that require real
world knowledge by focusing on actual meaning of the sentences. For e.g "Please,
place my order?" should be interpreted as a request [6].
1.2 What is Machine Translation?
Machine translation, normally known as MT, can be characterized as "interpretation from
one natural language to another dialect utilizing modernized frameworks and with or without
human help [7].
1.2.1 Application of Machine Translation
• MT is inconceivably quick.
• It can convert into numerous dialects without a moment’s delay which definitely
decreases the measure of labor required.

6 Introduction
• Actualizing MT into your localization procedure can do the hard work for interpreters
and save their time, permitting them to concentrate on the more multifaceted parts of
translation.
• MT innovation is growing quickly and is continually progressing towards creating
more excellent interpretations and diminishing the requirement for post-altering.
1.2.2 Machine Translation System Architectures
In the etymological design there are three essential methodologies being utilized for creating
MT frameworks that contrast in their unpredictability and advancement. These approaches
are represented in the diagram below:
Fig. 1.4 The Vauquois Triangle
Image Source: researchgate
In direct translation, interpretation is immediate from the source content to the objective
content. The vocabularies of source language are examined varying for the goals of source
language ambiguities, for the right identification of target language articulations as well with
respect to the determination of word order.
In the transfer approach, translation is finished through three phases: the principal stage
comprises in changing over source text into an intermediate representation, for the most part
parse trees; the subsequent stage convert these representations into proportionate ones in the
objective language; and the third phase generates the target text.
The interlingua approach is the most appropriate methodology for multilingual frame-
works. It has two phases: analysis and generation. In the analysis stage, a sentence in the
source language is broke down and its semantic context is separated and represented in the
interlingua form.

1.3 Motivation 7
An interlingua is a completely new dialect that is free of any source or target language
and is intended to be utilized as a delegate internal portrayal of the source content. The
investigation stage is trailed by the target sentence generation [7].
1.3 Motivation
“The world is one big data problem.”
- Andrew McAfee
As we take stock on technical advances over the most recent years, there is one factor common
among them all, information or data. The exponential development of information accessible
to comprehend, help individuals and associations are guiding us to a time that endeavors to
supplant human insight-based decisions with information are driven and measurably upheld
choices.
Natural Language Processing, an ordinarily utilized strategy while endeavoring to com-
prehend and pick up bits of knowledge from information have to a great extent stayed
sidelines for a while. Presently, with the significant advent in technical capacities and the
gigantic measure of information accessible, this technology has all the earmarks of being
promising.
The primary challenges in NLP are Machine Translation. A legitimate solution for the
issue suggests that machines should be capable enough to interpret the patterns of language
and differentiate the structure of the language. The advent of various statistical and deep
neural-based approaches has contributed to addressing the syntactic and semantics issues
in language translation. Also, this architecture predominantly focuses on high resource
languages.
There are a couple of dialects, for example, English, Spanish, Chinese, French, German,
and so forth which get a great deal of consideration from individuals who look into NLP.
Because of these numerous assets like POS taggers, Treebanks, Senti-WordNets, and so
on are accessible in those dialects. The NLP techniques created for these dialects can’t be
utilized for low resourced dialects as they are made to fit too unequivocally, on an enormous
dataset with different highlights. Utilizing these on little datasets would prompt exceptionally
lackluster performance.
Consequently, there is a gigantic need to work on low resource languages. Research into
language-autonomous NLP strategies that are fitting in low-resource settings is frantically
required as such methods can be applied to some low-resource dialects without a moment’s
delay.

8 Introduction
Marathi is one such low resourced language and my native language. Other than this,
there are numerous issues in natural language which we come across while translating the
languages using different available approaches. The above reasons assisted us to set an
objective to introduce and implement methods that are suitable for low resource languages
and can be stretched out to any language.
1.4 Key Contributions
The thesis contributes towards progression in the undertaking of Machine Translation in
Indian dialects. The examination, principally, centers around the Marathi Language.
As expressed already, research in these languages is constrained because of the inaccessi-
bility of annotated resources.
To investigate the parsing, semantic, and syntactic angles while interpreting it from
Marathi to English we proposed two methodologies.
• Statistical Machine Translation using Moses toolkit.
• Neural Machine Translation using OpenNMT.
The major contribution of the thesis is a cross-lingual phrase based translation learning and
transformer model using attention mechanism..
1.5 Thesis Overview
• Chapter 1 contains an introduction and the motivation for the thesis. It briefly de-
scribes the evolution of natural language processing, levels of natural language pro-
cessing Machine translation, architectures, and a computational point of view. This
chapter also highlights the key contributions of the thesis.
• Chapter 2 reveals insight into the earlier research in the field of machine translation.
It briefly explains the state of art models like Rule-Based System, Example-Based
Translation, Statistical Machine Translation, and the recent Deep Neural Network
Based Translation.
• Chapter 3 deals with the stepwise description of the processes and methods used in
the machine translation of Marathi language to English. This chapter describes the
detailed process of using Moses’ tool for statistical machine translation and OpenNMT
for deep neural machine translation.

1.5 Thesis Overview 9
• Chapter 4 highlights the objectives of this research work.
• Chapter 5 showcases our work in developing the Statistical Model for Machine
Translation. We work with parallel corpora from 2 languages: Marathi Bible and
English Bible. We successfully build a statistical model using the Moses toolkit.
• Chapter 6 portrays our work in building the Deep Neural Network Model for Machine
Translation. We were successful to build a Deep Neural Model using the OpenNMT
toolkit.
• Chapter 7 describes the evaluation metric called BLEU. It highlights the performance
results and portrays the BLEU scores for the models built.
• Chapter 8 concludes the thesis and addresses the future scope of research on machine
translation for low resource languages.

Chapter 2
Background and Related Work
Machine translation has evolved over the years and has occupied significant importance in
the field of artificial intelligence. It describes a range of computer-based activities which
involve translation system [8].
The earliest use of machine translation dates back to the period after the second world
war wherein the early computer was used for encoding the secret messages. In the 1980s
there was a drastic change and evolution of the field wherein it paved a new dimension for
the application of machine translation in artificial intelligence [9].
This chapter briefs about the sixty years of history, research, and development in ma-
chine translation. It also highlights the obstacles and drawbacks of implementing different
approaches for machine translation.
Fig. 2.1 History of Machine Translation, Image Source: medium.com

12 Background and Related Work
2.1 Rule Based Machine Translation
The early 70s embarks the start of Machine Translation, wherein the translation was made
based on a set of predefined rules.
It comprises of two important aspects:
• Bilingual dictionary for each language pair.
• Set of linguistic rules.
The translation quality can be improved by adding user-defined rules and dictionaries into
the translation process by overriding the default settings. The text is parsed by the software
and a transitional representation is created which generates text in the target language. Rule-
based procedure is proposed to disentangle the complex sentences dependent on connectives
like relative pronouns, organizing and subjecting combination [10].
It’s based on a large set of lexicons, predefined rules, syntactic and semantic information
of both the source and target language [11].
RBMT system is efficient and reliable to generate the translation but is dependent on a
huge set of rules which consume a lot of time. Also, redefining and updating the system
knowledge is a bit tedious task.
Although, RBMT are productive enough for a company to gain quality translation it takes
huge initial investment to maintain the quality and increase it incrementally.
2.2 Direct Machine Translation
This is a simplest approach to machine translation wherein the words in the source are
replaced by corresponding words in the target language. The translation in this approach is
bilingual and unidirectional with no intermediary representation [12]. It follows bottom up
approach wherein the transfer is made at word level.
Fig. 2.2 Direct Machine Translation
Image Source: medium.com

2.3 Transfer Based Machine Translation 13
It is specific for a language pair considering the words as the translation unit. It relies
less on the syntactic or semantic analysis wherein grammatical adjustments are made to do
word by word translation.
Also, this approach is an easy and feasible way to translate any language pair perhaps the
results obtained are poor as it does not consider the grammar or analyzes the meaning of the
sentence parsed for translation due to linguistic and computational naivety [12].
2.3 Transfer Based Machine Translation
In this approach an intermediate representation is created after the text is parsed from the
source sentence. It comprises of three steps:
• Analysis
• Transfer
• Generation
The first step analyzes the input text and converts into abstract form, the second step converts
the abstract text into intermediate representation oriented to the target language and finally,
the third step generates the target text using the morphological analyzer.
The intermediate representation is specific to the source and target language respectively.
The results obtained with this approach were fairly satisfactorily with 90 percent accuracy
region [12]. Although, this approach was based on simplified grammar rules these rules
needed to be applied at every step for analysis of source language, transfer of source to target,
and generation of the target language.
This resulted in verbatim translation and exhausted linguists which in turn increased the
work, making it complicated to reuse the modules and maintain the simplicity of the modules
[12].
2.4 Interlingual Based Machine Translation
This approach is also based on intermediate representation with the source language being
translated into inter-lingual language wherein the representation is language independent.
Finally, the target language is generated from the interlingual representation. This approach
is very advantageous for generating multiple target languages from a source. KANT is only
an operational commercial inter-lingual machine translation system which is designed to

translate technical English into other languages. This approach is beneficial for multilingual
translation systems.
However, it is a very complex task to create a universal interlingua which extracts the
original meaning of the source language and retains it in the generated target [12].
Dave et al, [13] in their research work contemplate the language dissimilarity among
English and Hindi and its suggestion to machine interpretation between these dialects utilizing
the Universal Networking Language (UNL). The portrayal works at the degree of single
sentences and characterizes a semantic net-like structure in which nodes are word ideas and
curves are semantic relations between these ideas.
2.5 Example Based Machine Translation
Example based Machine Translation was primarily developed to overcome the drawbacks
of Rule-Based Machine Translation when translating between languages having different
structures for e.g English and Japanese [14]. This approach retrieves similar examples in the
form of pairs of source phrases, sentences, texts or translation from a database of example to
translate into a new input [15].
Bilingual corpus with parallel text constitutes as the main knowledge of the Example-
Based Machine Translation system. The system input comprises a set of sentences from the
source language and corresponding mapping of translations for each sentence in the target
language. These examples are the base to translate similar sentences from source language to
target language.
There are four steps in Example Based Machine Translation:
• Example acquisition
• Example base and management
• Example application
• Synthesis
The translation of Example based machine translation is predominantly based on the analogy
wherein example translations are used to train the models by encoding principle of analogical
translation [14].
Example-Based Machine Translation is beneficial for machine translation as it does not
require manually derived rules. Perhaps it requires pre-trained translation models to analyze
the sentences. Also, it requires high computational efficiency for large databases.

2.6 Statistical Based Machine Translation 15
2.6 Statistical Based Machine Translation
The IBM research center was introduced to a machine translation system in the early 1970s
which knew very little about rules and linguistics. This system dealt with the analysis of two
languages and tried to recognize the patterns [16].
Fig. 2.3 Statistical Machine Translation
Statistical models that are derived by analyzing bilingual text corpora form the basis of
Statistical Machine Translation. Bayes Theorem was the base for building Statistical model
wherein the system takes the view of the most probable sentence which matches the source
sentence to be translated [16].
The advantage of Statistical Machine Translation is that its the most accurate method that
was introduced and overcame the drawbacks of the traditional rule-based system. There is
no need for predefined rules thus supervision by the linguists is not needed thereby saving
efforts and time.

2.6.1 Word Based Statistical Machine Translation
The initial models were based on words as an atomic unit that may be translated, dropped,
and reordered. The preliminary step of machine translation is aligning the words in sentence
pairs. This approach uses both translation model and language model thereby ensuring good
output [17].
The first word-based models split-ted the sentence into words and translated one word
into multiple count stats. The model memorizes the usual place the word takes at the output
sentence and shuffles it for more natural sound.
Although word-based systems embarked on a new revolution in the field of machine
translation it couldn’t deal with exceptions like gender and homonym. This approach became
redundant and was further replaced by a phrase-based system.
2.6.2 Syntax Based Statistical Machine Translation
Syntax analysis deals with the subject, predicate and other parts of the sentence to build a
tree. Unlike, phrase-based machine translation which translates single words or strings of
words this approach translates the syntactic units.
For e.g, the Data-Oriented Processing based machine translation and synchronous context-
free grammars include Syntax Based Statistical Machine Translation [18].
This approach has demonstrated improved translated results perhaps its speed is consid-
ered slow as compared to other approaches.
2.6.3 Phrase Based Statistical Machine Translation
This approach is built on the principles of word-based translation which comprises statistics,
reordering, and lexical hacks. It split the text into atomic phrases.
The advantages of phrase-based models are that none com-positioned phrases can be
handled using many to many translations. We can use local content in translation. It can be
used for a larger data set that needs to be translated. It is a standard model used by google
translate.
Phrase-based models were based on N-grams which are nothing but a continuous sequence
of words. As a result, the machine was able to process these sequences of words thereby
improving accuracy.

2.6 Statistical Based Machine Translation 17
Fig. 2.4 Phrase Based Statistical Machine Translation
Image Source: wordpress

This approach provided options to choose the bilingual texts for learning. The word-based
translation ignored or excluded the free translation thereby making it critical to exactly match
the sources.
However, phrase-based translation overcame this by learning from literary or free transla-
tion. Phrase-based translation gained considerable importance starting 2006 to 2016. It was
used in the working of various online translators like Google Translate, Bing and Yandex
[19].
2.7 Deep Neural Machine Translation
This approach has pioneered a new era for machine translation wherein it uses a large neural
network to predict a likelihood of a sequence of words thereby creating a single integrated
sentence model [20]. The early 1990s embarks on the appearance of speech recognition
applications based on deep learning.
In 2014, first scientific paper based on neural networks in machine translation was
published which was later followed by developments in the following years which included
an application to image capturing, subword-NMT, Zero-Shot NMT, Zero-Resource NMT,
Fully Character-NMT, Large vocabulary NMT, Multi-source NMT, character-dec NMT, etc
[21].
Fig. 2.5 Neural Machine Translation
Image Source: altoross.com

2.7 Deep Neural Machine Translation 19
The fundamental benefit of this approach is it trains a single system directly on the
source and target text thereby no more pipeline of the specialized system is required to be
used in statistical machine learning. Also, neural machine translation systems are called an
end-to-end system as they are based on only one model for the translation.
The learning occurs in two phases.
• The first phase in Deep Neural Network consists of applying a nonlinear transformation
of the input and create a statistical model as output.
• The second phase improves the model with a mathematical method termed as deriva-
tive.
The above two steps are repeated several times until the desired accuracy is obtained.
The repetition of this two-phase is termed as an iteration. Various architectures such as deep
neural networks, recurrent neural networks, deep belief networks have played a significant
role in the fields such as computer vision, audio recognition, social network filtering, speech
recognition, machine translation, drug design and bioinformatics where outstanding results
were obtained [21].
The main objective of a neural network is that they receive a set of inputs, complex
calculations are performed on them, and an output is generated which eventually addresses
real-world problems like classification, supervised learning and reinforcement learning.
The gradient descent method is used to optimize the network and minimize the loss
function. The most important step in deep learning models is training the data set. Also,
Backpropagation is the main algorithm used to train the models.
In Deep Neural Network architecture compositional models are generated wherein the
object is expressed as the layered composition of primitives. The extra layers facilitate the
composition of features that belong to lower layers thereby modeling complex data with
fewer units. Also, the deep architectures are based on many variants of a few basic approach
which are successful in specific domains. Deep Neural Networks are predominantly fed
forward networks wherein the data flow from the input layer to the output layer without
looping back. On the other hand for Recurrent Neural Networks, the data flows in multiple
directions which are applicable in language modeling. They have considerably enhanced the
state-of-the-art Neural Machine Translation as they are capable to model complex functions
and capture complex linguistic structures.
However, Neural Machine Translation systems with deep architecture suffer from severe
gradient diffusion in their encoder or decoder due to the non-linear recurrent activations
thereby making it difficult to optimize [21]. To address it, the solution is to use an attention

mechanism wherein the model learns to place attention on the input sequence while each
word of the output sequence is decoded.
The recurrent neural network encoder-decoder architecture with attention has played a
significant role to address problems for machine translation. Also, it is used by the Google
Neural Machine Translation system or GNMT for Google translate service.
However, despite being efficient the neural machine translation systems have few draw-
backs when scaled with large vocabularies and consume a lot of time for training the models.
Neural Machine Translation systems are proven to be computationally expensive for
training and translation. Also, most systems have difficulty with exceptions and rare words.
These issues have hindered the deployment and use of this approach to retrieve accurate
results.
Going further, Google’s Neural Machine Translation system has attempted to address
many of these issues. The models are based on deep Long short-term memory (LSTM)
network, with 8 encoder and 8 decoder layers using attention and residual connections. This
approach has helped in improving parallelism thereby decreasing training time. The attention
mechanism of Google NMT connects the bottom layer of the decoder to the top layer of the
encoder.
Finally, to increase the translation speed, they use low-precision arithmetic for compu-
tations. They deal with rare words, by dividing words into a limited set of common units
called word piece for both input and output thereby providing a good balance between the
flexibility of "character" delimited models and the efficiency of "word"-delimited models.
Also, they have a beam search technique which includes a length-normalization procedure
and uses a coverage penalty, which generates an output sentence which most likely covers all
the words in the source sentence [22].

2.7.1 Feed Forward Network
The feed-forward neural network is a primitive type of artificial neural network which is
based on a simple design. The feed-forward neural network has an input layer, hidden layers,
and an output layer. Information always travels in a uniform direction from input to output
layer without forming a loop or cycle. [23]. Supervised learning is used to feed the input
examples to the network and transformed it into a labeled output.
In a feed-forward network, training is done on labeled images until the errors are reduced
while categorizing. Going further, the network uses these trained models to categorize data it
has never seen.
This trained feed-forward network can be exposed to any random collection of pho-
tographs, it will classify all images separately considering each image it is exposed to as an
individual input without perceiving the past input.
2.7.2 Recurrent Neural Network
Recurrent networks, on the other hand, take as their input not just from the current input
example they see, but also what they have perceived previously in time. Recurrent Neural
Network is a multi-layered neural network wherein the information is stored in context nodes
thereby allowing it to learn sequences of input data and generate output sequence. In simple
words, connections between nodes are based on loops [24].
For example, consider an input sentence "where is the . ... .... ... .? wherein we predict the
next word.
The RNN neurons will get a sign that points to the beginning of the sentence. The system
gets "Where" as info and produces a vector of the number. This vector is taken care of back
to the neuron to give a memory to the system. This stage causes the system to recollect the
word "Where" and it is in the first position. The system will comparatively continue to the
following words. It takes "is" and "the" and the condition of the neurons is refreshed after
getting each word. The neural system will give a likelihood to every English word that can be
utilized to finish the sentence. An all-around prepared recurrent neural network most likely
allots a high likelihood to "bistro," "drink," "burger," and so on.
Normal employments of Recurrent Neural Networks:
• Help securities brokers to produce systematic reports.
• Recognize variations from the norm in the agreement of fiscal summary.
• Recognize false Visa exchange.

• Give a subtitle to pictures.
• Power chatbots.
The standard employments of RNN happen when the professionals are working with
time-series information or sequences (e.g. sound chronicles or content).
2.7.3 Convolutional Neural Network
Convolutional Neural Network is a multi-layered neural system with a novel engineering
intended to extricate progressively complex highlights of the information at each layer
to decide the yield. This approach is generally utilized when there is an unstructured
informational index (e.g., pictures) and the specialists need to separate data from it.
For example, if the assignment is to foresee a picture inscription. The network gets a picture
of supposing a cat, this picture, in scientific term, is an assortment of the pixel. By and large,
one layer for the grey-scale picture and three layers for a shading picture.
During the element learning (i.e.shrouded layers), the system will distinguish novel
highlights, for example, the tail of the cat, the ear, and so forth.
At the point when the system completely figured out how to perceive an image, it can
give a likelihood to each picture it knows. The mark with the most elevated likelihood will
turn into the expectation of the system.

2.7.4 Transformer Model
RNN based models are difficult to parallelize and can experience issues learning long-extend
conditions inside the info and yield arrangements
The Transformer models all of these conditions utilizing attention mechanisms.
Fig. 2.6 Recurrent Neural Network
Image Source: sdl.com
Rather than utilizing one range of attention, the Transformer utilizes various "heads".
Moreover, the Transformer utilizes layer normalization and residual connection which make
advancement simpler. Attention can’t use the input position. To settle this, the Transformer
utilizes explicit position encodings which are added to the input embeddings. [25].
The attention mechanism in the Transformer is deciphered as a method of figuring the
pertinence of a lot of values(information) based on certain keys and inquiries. Essentially,
the attention mechanism is utilized as a path for the model to concentrate on important data
dependent on what it is as of now handling.
Generally, the attention weights were the significance of the encoder hidden state (values)
in preparing the decoder state and were determined depending on the encoder hidden states
(keys) and the decoder shrouded state (query).
As should be obvious, a single attention head has an exceptionally straightforward
structure: it applies a one of a kind direct change to its input queries, keys, and values,
registers the attention score between each query and key, at that point utilizes it to weight the
qualities and summarize them. The Multi-Head Attention square just applies various squares
in equal, connects their yields, at that point applies one single linear transformation [26].

Scaled Dot Product Attention
Concerning the attention mechanism, the transformer utilizes a specific type of attention
called the "Scaled Dot-Product Attention" which is figured by the accompanying equation:
The essential attention system is a dot product between the query and the key. The size of
the speck item will, in general, develop with the dimensionality of the query and key vectors,
however, so the Transformer rescales the dot product to keep it from detonating into gigantic
qualities [26].
2.7.5 Transformer Architecture
The Transformer despite everything utilizes the fundamental encoder-decoder structure of
customary neural machine translation frameworks. The left-hand side is the encoder and
the right-hand side is the decoder. The underlying contributions to the encoder are the
embeddings of the input, and the underlying contributions to the decoder are the embeddings
of the yields up to that point [26].
Encoder
The encoder is made out of a pile of N = 6 indistinguishable layers. Each layer has two sub-
layers. The first is a multi-head self-attention component, and the second is a straightforward,
position astute completely associated feed-forward system.
We utilize a lingering association around each of the two sub-layers, trailed by layer
normalization. That is, the yield of each sub-layer is LayerNorm(x + Sublayer(x)), where
Sublayer(x) is an actualized function [26].
To encourage these residual connections, all sub-layers in the model along with embed-
ding layers, produce yields of dimension dmodel = 512 [26].

Fig. 2.7 Transformer Architecture
Image Source: medium.com

Decoder
The decoder is additionally made out of a stack of N = 6 indistinguishable layers. Notwith-
standing the two sub-layers in each encoder layer, the decoder embeds a third sub-layer,
which performs multi-head attention over the yield of the encoder stack. Like the encoder, we
utilize residual connections around every one of the sub-layers, trailed by layer normalization.
We likewise adjust the self-attention sub-layer in the decoder stack to keep positions
from taking care of resulting positions. This concealing joined with the truth that the output
embedding is counterbalanced by one position, guarantees that the forecasts for the position
i can rely just upon the known outputs at positions less than i [26].
Positional Encodings
Since our model contains no repeat and no convolution, all together for the model to utilize
the request of the grouping, we should infuse some data about the relative position of the
tokens in the sequence.
To this end, we include "positional encodings" to the input embeddings at the bottoms
of the encoder and decoder stacks. The positional encodings have a similar measurement
dmodel as the embeddings, with the goal that the two can be added. There are numerous
decisions of positional encodings, learned, and fixed.
In this work, we use sine and cosine elements of various frequencies:
where pos is the position and i is the measurement. That is, each element of the positional
encoding compares to a sinusoid.
We picked this capacity since we speculated it would permit the model to handily figure
out how to go to by relative positions, since for any fixed counterbalance k, P Epos+k can be
spoken to as a linear function of P Epos [26].

2.7.6 Open NMT
OpenNMT is a nonexclusive profound learning system, for the most part, had some exper-
tise in sequence-to-sequence models covering an assortment of assignments, for example,
machine translation, image to text, summarization, and speech recognition. The structure
has additionally been stretched out for other non-grouping sequence-to-sequence tasks like
language modeling and sequence tagging.
The toolkit organizes proficiency, seclusion, and extensibility with the objective of sup-
porting neural machine translation investigation into model designs, highlight portrayals,
and source modalities while keeping up serious execution and reasonable training require-
ments. The toolbox comprises of modeling and interpretation support, just as point by point
academic documentation about the hidden strategies [27].
OpenNMT was structured to achieve following three goals:
• Prioritize first training and test productivity.
• Keep up model measured quality and coherence.
• Support research extensibility.
Application of Open NMT
• Summarization
The models are prepared precisely like NMT models. In any case, the nature of the
preparation information is unique: source corpus are full length report or articles, and
target are summaries.
• Image to text
Im2Text, created by Yuntian Deng from the Harvard NLP group, is actualizing a
nonexclusive picture to-content application on OpenNMT libraries for visual markup
decompilation. The fundamental alteration to the vanilla OpenNMT is an encoder
presenting CNN layers in mix with RNN.
• Speech recognition
While OpenNMT isn’t fundamentally targetting speech recognition applications, its
capacity to help input vectors and pyramidal RNN makes conceivable start to finish
probes speech to text applications as portrayed for example in Listen, Attend and Spell.

• Sequence tagging
A sequence tagger is accessible in OpenNMT. It has the equivalent encoder engineering
as a sequence-to-sequence model however needn’t bother with a decoder since each
information is matched up with a yield. A sequence tagger simply needs an encoder
and generation layer. Sequence tagging can be utilized for any comment undertakings,
for example, speech tagging.
– To prepare a sequence tagger we need to preprocess the equal information with
source and target sequence having a similar length (you can utilize the - checkp-
length alternative).
– Train the model with - model-type seqtagger.
– Utilize the model with tag.lua
• Language modelling
A language model is fundamentally the same as an sequence tagger. The primary
contrast is that the yield "tag" for every token is the accompanying word in source
sentence.
– Preprocess the information with data-type monotext.
– Train the model with model-type lm.
– Utilize the model with lm.lua.

Chapter 3
Methodology
3.1 Building SMT Model using Moses
3.1.1 Moses - An open source SMT toolkit
In the year 2005, Moses toolkit was developed by the Edinburgh MT group to train statistical
models of text translation from a source language to a target language. Going further, this
tool decodes the source language text thereby producing automatic text in the target language
[28].
Parallel corpora containing source and target language text is required to train the model.
Also, it uses concurrences of words and segments to infer translation correspondences
between the two languages of interest.
Moses is described as an open-source toolkit for statistical machine translation whose
novel contributions are to support linguistically motivating factors, integration confusion
network decoding and providing efficient data formats for translation models which allows
the processing of large data with limited hardware.
Also, the toolkit includes a wide variety of tools for training, tuning and applying the
system to many translation tasks and finally evaluating the resulting translations using BLEU
score [28].
The Training Pipeline
It comprises of a collection of tools which take the raw data as input and generate a machine
translation model. There are various stages involved which are implemented as a pipeline
and are controlled by the Moses experiment management system.

30 Methodology
Also, Moses is compatible with the use of different types of external tools in the training
pipeline. The initial step involves preparing data by cleaning it by using heuristics to remove
misaligned and long sentence pairs.
Going further, GIZA++ is used to word-align parallel sentences which are used to
extract phrase-based translation or hierarchical rules. Moses uses external tools to develop a
language model that is built using the monolingual data in the target language and is used by
the decoder to ensure accurate output. The penultimate step is tuning wherein the statistical
models are weighted against each other to generate the best translation [28].
Fig. 3.1 Statistical Machine Translation using Moses

3.1 Building SMT Model using Moses 31
Decoder
The decoder is an application based on C++ wherein a trained machine translation model
and a source sentence is given as input thereby translating the source sentence into the
target language. Also, the decoder finds the highest scoring sentence in the target language
which corresponds to a given source sentence. the decoder can also reveal the ranked list of
translated candidates and provide information about its decision.
The decoder is written in a modular fashion and allows the user to vary the decoding
process in various ways, such as:
• Input: This is generally a plain sentence or it can be annotated with xml-like elements,
a structure like a lattice or confusion network.
• Translation model: This is based on phrase-phrase rules, or hierarchical rules and can
undergo binarised compilation for swift loading. Additional features which ensures
the reliability by indicating the source of the phrase pairs can also be added.
• Decoding algorithm: Moses implements several different strategies for decoding,
such as stack-based, cube-pruning, chart parsing etc to ease the search.
• Language model: Language model toolkits like SRILM, KenLM, IRSTLM, RandLM
are supported by moses.

32 Methodology
3.2 Build Neural Machine Translation Model using Open-
NMT
3.2.1 Transformer Architecture
The Transformer has a heap of 6 Encoder and 6 Decoder, dissimilar to Seq2Seq; the Encoder
contains two sub-layers: multi-head self-attention layer and a completely associated feed-
forward system.
The Decoder contains three sub-layers, a multi-head self-attention layer, an extra layer
that performs multi-head self-attention over encoder yields, and a completely associated
feed-forward network.
3.2.2 Encoder and Decoder Input
All input and output tokens to Encoder/Decoder are changed over to vectors utilizing learned
embeddings. These input embeddings are then passed to Positional Encoding.
Positional Encoding
The Transformer’s architecture doesn’t contain any repeat or convolution and henceforth has
no thought of word request. All the words of the input are taken care of by the system with
no exceptional request or position as they all stream at the same time through the Encoder
and decoder stack. To comprehend the significance of a sentence, it is basic to comprehend
the position and the request for words.
Positional encoding is added to the model to infuses the data about the absolute position-
ing of the words in the sentence. Also, it has a similar measurement as input embedding with
the goal that the two can be added.
3.2.3 Self Attention
A self-attention layer associates all positions with a constant number of successively executed
tasks and henceforth are quicker than repetitive layers.
An Attention function in a Transformer is depicted as mapping a query and a set of
key and value pair to the output. Query, key, and value are vectors. Attention weights are
determined to utilize Scaled Dot-Product Attention for each word in the sentence. The last
score is the weighted entirety of the values.

3.2 Build Neural Machine Translation Model using OpenNMT 33
Fig. 3.2 Neural Machine Translation Process
1. Dot Product
Take the dot product of the query and key for each word in the sentence. Dot Product
decides the amount to concentrate on different words in the input sentence.
2. Scale
Scale the Dot-Product by partitioning by the square base of the component of the key
vector. Dimension is 64; subsequently we partition the Dot-Product by 8.
3. Apply softmax
Softmax standardizes the scaled worth. Subsequent to applying Softmax, all the
qualities are certain and mean 1.
4. Calculate the weighted sum of the values
Dot-Product is applied between the normalized score and the value vector and then the
sum is calculated. The above steps are repeated for all words in the sentence.

34 Methodology
3.2.4 Multi Head Attention
Rather than utilizing a single attention function where the attention can be overwhelmed by
the actual word itself, transformers utilize numerous attention heads. Every attention head
has a linear transformation applied to a similar input representation.
The Transformer utilizes eight diverse attention heads, which are processed in parallel.
With eight distinctive attention heads, we have eight unique arrangements of the query, key,
and value, and furthermore eight arrangements of Encoder and Decoder every one of these
sets is initialized randomly.
3.2.5 Masked Multi Head Attention
The Decoder has veiled multi-head attention where it covers or hinders the decoder inputs
from the future steps. During training, the multi-head attention of the Decoder conceals the
future decoder inputs.
For the machine translation task to decipher a sentence, "I appreciate nature" from English
to Hindi utilizing the Transformer, the Decoder will consider all the sources of input words
"I, appreciate, nature" to anticipate the primary word.
Residual connections
These are "skip connections" that permit angles to move through the network without going
through the non-linear activation function. Residual connection assists with abstaining from
disappearing or detonating gradient issues.
For residual connections to work, the yield of each sub-layer in the model ought to be the
equivalent. All sub-layers in the Transformer, produce a yield of measurement 512.
Layer Normalization
It normalizes the inputs over every feature and is autonomous of different models. Layer
normalization lessens the preparation time in feed-forward neural systems. In Layer normal-
ization, we process mean and difference from the entirety of the added inputs to the neurons
in a layer on a single training case.

Chapter 4
Objectives and Requirements
4.1 Goals
Following are the primary goals of this venture:
• To get a comprehension of Statistical Machine Translation and Deep Neural Networks,
and how they are utilized to complete interpretation between languages.
• Build a machine translation framework by creating and preparing a profound learning
Statistical Machine Translation model and Transformer model. Also, perceive how it
performs on a machine with standard handling power.
• Play around with the hyper-parameters, preparing information size, number of prepar-
ing steps, and think about the exactness of the outcomes for various combinations.
4.2 Software Setup
The undertaking depended intensely on the equipment and programming support given by
the PC. The PC utilized for running the undertaking had the accompanying details:
• Running Ubuntu 16.04 with the most recent drivers and bundles to give development
environment support to the execution.
• NVIDIA GTX 970 GPU and 6GB RAM to give equipment support to the implementa-
tion.
• Access to the web to download freely accessible open-source Marathi Bible files for
training and cross-validation process.
• Python version 3.6.

36 Objectives and Requirements
4.3 Dataset
The dataset for testing and training for the Marathi language was procured from:
http://opus.nlpl.eu/bible-uedin-v1.php
The dataset is a paralllel corpora of Marathi to English language. It has 60876 sentence
pairs and 2.70M words.

Chapter 5
Building Statistical Machine Translation
model
5.1 Introduction
Moses is one of the most utilized Statistical Machine Translation frameworks. It is a finished
framework with a worked in decoder that can be utilized with a few alignment algorithms.
Moses is the SMT framework that we have used to prepare Marathi to English translation
model. In the following segment, we depict the steps followed to create, train, and test the
model utilizing Moses.
5.2 Baseline System
After effectively introducing Moses and another required programming (Giza++, Boost, and
so on). We utilized it to prepare Marathi to English translation model by utilizing the Marathi
Bible as Marathi corpus and the King James Version(KJV) Bible as English corpus.
Following is a passage of the initial three refrains of the New Testament in both the
Marathi and the English version.

38 Building Statistical Machine Translation model
As should be obvious from the above passage of an equal corpus, we need to have two
records two proportionate texts in two dialects: the target language and the source language.
The content in those two documents needs to compare line by line. Line 100 in the target
language document ought to be the interpretation of line 100 in the record containing the
source language.
For this project, as we attempted to translate a Marathi to English translation framework,
we began with two distinct documents, one containing the Marathi Bible and the other one
containing the King James Version English Bible. When utilizing Moses, the initial phase in
preparing a translation model called "Corpus Preparation".
5.3 Corpus Preparation
Corpus Preparation comprises of three stages: tokenization, truecasing, and cleaning. During
tokenization, spaces are included between all words and accentuation to make sure that
various types of a similar word are considered one.
In the subsequent stage, Moses utilizes a truecasing content, otherwise called the truecaser,
to compute the frequencies proportions of how frequently a specific word is lower-cased
contrasted with when it is capitalized.
This is significant as, without this progression, it would be practically unimaginable for
the translation framework to figure if the words toward the start of a sentence are promoted
on the grounds that they are typically promoted (legitimate names) or in the event that they
are promoted in light of the fact that they are the toward the start of particular sentences.
The last yet significant step is the cleaning step. In this progression, a sentence pair is
expelled from the training data if one of its sentences has a character count more prominent
than a set sum, or if the proportion of the character count of its sentences is not relative to the
determined or set proportion for the training data.
The constraining character count is set by the structure of the dialects that are being
managed with or the quality/size of the parallel corpora being utilized.

5.4 Language Model Training 39
5.4 Language Model Training
Command to Build Language Model
In this progression, we utilized Moses worked in the KenLM 3-gram model tool to build a
target language model dependent on the corpus. For this case, as we have Marathi to English
translation, English is the target language, along these lines we utilized the Marathi bible
corpus document created by the truecaser.
Now, there is no need of utilizing the output created after the cleaning phase of the Corpus
Preparation process as a language model just relies upon the structure of the target language
being used, English for this situation, and not to its proportionate interpretation to the source
language, Marathi.
Subsequently, there is no compelling reason to mull over the impacts of the sentence
character count and the restricting ratio used to channelize data in the cleaning phase of the
Corpus Preparation process. In the wake of building the English language model, we utilized
the Moses binarizing script to transform the record containing the English language model
into a twofold form that heaps quicker.
At this progression, we can utilize it to get the likelihood that any input sentence is a
piece of the English as per the language model that we developed using data exclusively
from the Marathi Bible.
5.5 Training the Translation System
Command to train the SMT model:
Since we have prepared our target language model, the time has come to begin the
preparation of the translation framework. For this progression, we utilized Moses’s default
word-arrangement instrument called Giza++. Subsequent to running the orders for this

40 Building Statistical Machine Translation model
progression, Moses produced a Moses.ini setup document that can be utilized to make an
interpretation of any Marathi sentence to English.
There are two principle issues that must be taken a gander at. The first one is that
translation sets aside a long effort; to fix this we have to binarise the expression and reorder
the tables. The subsequent one is that weights in our model arrangement record are not
balanced, i.e.: they are reliant to the Bible information we utilized in preparing the model.
In the following subsection, we tune the model to make it progressively adjusted and less
reliant on the data used to prepare it.
5.6 Tuning
Command to Tune the SMT model:
5.7 Binarising Phrase and Reordering Tables
Command to Binarise and reorder tables:
When the tuning procedure is finished, it is encouraged to binarise the expression and
reorder the tables in your translation model by utilizing the Moses’ tools.
5.8 Testing
Since we have finished the essential strides of building, preparing, and tuning an interpretation
model utilizing Moses, we can utilize it to do some basic translations. To do this, we simply
run the terminal command, and that way you can decipher a document containing sentences
in Marathi to English. The sentences in the information record must be in a similar format as
that of training and tuning stages. Following is the Marathi input file that we used to test our
translation model followed by the generated English file.

5.9 Results and Analysis 41
5.9 Results and Analysis
The previous sections depict the results predicted by the translation model given an input file
containing Marathi Corpus and the output file is generated containing data in English. We
trained the model on large parallel corpora of Marathi and English Bible which had 60776
sentence pairs and 2.7M words. The translation obtained is noticeable accurate and can be
termed as successful.
The BLEU score obtained for the SMT model is 27.17 which is moderate and satisfactory.
BP ratio hyp-len ref-len BLEU
0.728 0.759 44678 58852 27.17
Table 5.1 BLEU Score for SMT Model

Chapter 6
Building Deep Neural Machine
Translation Model
6.1 Introduction
OpenNMT is a finished library for preparing and conveying neural machine translation
models. The framework is a replacement to seq2seq-attn created at Harvard and has been
revised for simplicity of productivity, lucidness, and generalizability. It incorporates vanilla
NMT models alongside support for attention, gating, stacking, input feeding, regularization,
beam search.
The fundamental framework is actualized in the Lua/Torch scientific structure and can be
effectively be expanded utilizing Torch’s inner standard neural network segments.
6.2 Setup of Required Modules
The main bundle required for preparing your custom translation framework is basically
pyTorch, in which the Open-NMT models have been actualized [29].
The priliminary step, is to clone the OpenNMT-py repsitory :

44 Building Deep Neural Machine Translation Model
6.3 Corpus Preparation
The dataset includes an equal corpus of source and target language records containing one
sentence for every line with the end goal that every token is isolated by a space. We used
equal corpora of Marathi and English sentences put away in isolated records.
6.4 Pre-Processing Text Data
To pre process the training data, validation data and extract features to generate vocabulary
files we used the following command: The data comprises of parallel source and target data
which contain one sentence per line wherein the tokens are separated by a space.
Following files are generated after running the preprocessing :

6.5 Training the Translator model : 45
6.5 Training the Translator model :
The command for training is really easy to use. It takes as input, a data file and a save file.
This will run the default model, which comprises of a 2-layer LSTM with 500 shrouded units
on both the encoder/decoder.

46 Building Deep Neural Machine Translation Model
6.6 Translate
The following command is executed to play out a surmising step on unseen content in the
Source language (Marathi) and produce comparing interpretations which are predicted.
A translated output is generated and the predictions are stored into pred.txt file.
6.7 Testing
Since we have completed the basic steps of pre-processing, training, and translating an
NMT based model using the OpenNMT toolkit, we can use it to do some fundamental
interpretations. To do this, we just run the terminal order and that way you can unravel an
archive containing sentences in Marathi to English. The sentences in the input file must be in
a comparable arrangement as that of preparing and tuning stages.
Following is the Marathi input document that we used to test our interpretation model
followed by the produced English translation.

6.8 Results and Analysis 47
6.8 Results and Analysis
The above section portrays the translation results in the generated output file and it is capable
of translating the sentence pairs for the corresponding input sentence in Marathi Corpora.
The BLEU score obtained for NMT is 43.74 which helps us conclude that the NMT model
has performed better than the SMT model.
BP ratio hyp-len ref-len BLEU
0.953 0.954 44481 46631 43.74
Table 6.1 BLEU Score for NMT Model

Chapter 7
Evaluation and Analysis
7.1 Evaluation
Human assessments of machine translation are broad however costly. Also, they take a
long time to complete and include human work that can not be reused. [30] Papineni et
al proposed a technique for programmed machine interpretation assessment that is fast,
economical, and language-autonomous, that connects profoundly with human assessment,
and that has minimal minor expense per run. We present this strategy as a mechanized
understudy to talented human appointed authorities which substitutes for them when there is
a requirement for speedy or regular assessments.
7.1.1 Bilingual Evaluation Understudy Score
The Bilingual Evaluation Understudy Score, or BLEU Score, alludes to an assessment
metric to assess Machine Translation Systems by contrasting a created sentence with a
reference sentence. An ideal match in this correlation brings about a BLEU score of 1.0 ,
while a total confound brings about a BLEU score of 0.0. The BLEU score is an all-around
adjusted measurement for assessing interpretation models as it is autonomous of language,
easy to decipher, and has a high relationship with the manual assessment.
The BLEU score is created after including n-grams in the candidate translation coor-
dinating with the n-grams in the reference content. Word order isn’t considered in this
comparison.

50 Evaluation and Analysis
7.2 Analysis of SMT and NMT models
To think about the two Machine Translation(MT) models examined in this paper: SMT
(Statistical Machine Translation) and NMT (Neural Machine Translation), it is imperative
to see how the two models are executed, what sort of crude information they require, and
what sort of results to expect when utilizing these two MT models. In expansion to that, it is
critical to contemplate the measure of exertion it would take to improve or scale every one of
the two models.
7.2.1 Using Data and Implementing Model
The principle distinction among SMT and NMT is the kind of information utilized in their
executions. The Moses SMT model that we executed use equal corpora (translated sentences
pairs) from the two dialects as essential input information. On the other hand, the NMT
model that we actualized utilizing OpenNMT can be prepared legitimately on Marathi and
English content without any pipeline of specialized frameworks utilized in SMT.
7.2.2 Efficiency
SMT is information-driven, requiring just a corpus of models with both source and target
language content. In contrary, neural machine interpretation frameworks are supposed to be
an end to end systems as just one model is required for the interpretation.
7.2.3 Accuracy
The results obtained after implementing the NMT model depicted higher accuracy than the
SMT model. Thus, given a set of large parallel corpus data, the NMT transformer model
produces more reliable output. The BLEU score for the NMT model was 43.74 and for SMT
model was 27.17 justifies the above statement.

Chapter 8
Conclusion and Future Work
Our investigation uncovers that an out-of-the-box NMT framework, prepared on a parallel
corpus of Marathi to English text, accomplishes a lot higher interpretation quality than a
custom-fitted SMT framework. These outcomes are really astonishing given that Marathi
presents huge numbers of the known difficulties that NMT right now battles with (information
shortage, long sentences, and rich morphology).
In future trials, we would like to explore strategies for fitting NMT to a specific domain
and language pair. A potential road of research to investigate is the consideration of linguistic
features in NMT.
Finally, it will be significant in the future to include human assessment for our exam-
inations to guarantee that the MT frameworks intended for open organization use will be
streamlined to improve the undertaking of a human interpreter, and won’t just be tuned to
programmed measurements.

References
[1] Wikipedia contributors. Machine translation — Wikipedia, The Free Encyclopedia.
[Online; accessed 4-May-2020 ]. 2020. URL: https://en.wikipedia.org/w/index.php?
title=Machine_translation&oldid=953518509.
[2] Wikipedia contributors. Rule-based system — Wikipedia, The Free Encyclopedia.
[Online; accessed 4-May-2020 ]. 2020. URL: https://en.wikipedia.org/w/index.php?
title=Rule-based_system&oldid=948096750.
[3] Wikipedia contributors. Statistical machine translation — Wikipedia, The Free Ency-
clopedia. [Online; accessed 4-May-2020 ]. 2020. URL: https://en.wikipedia.org/w/
index.php?title=Statistical_machine_translation&oldid=950991925.
[4] Wikipedia contributors. Natural language processing — Wikipedia, The Free Encyclo-
pedia. [Online; accessed 4-May-2020]. 2020. URL: https://en.wikipedia.org/w/index.
php?title=Natural_language_processing&oldid=954334473.
[5] Wikipedia contributors. Natural-language understanding — Wikipedia, The Free
Encyclopedia. [Online; accessed 7-May-2020]. 2020. URL: https://en.wikipedia.org/
w/index.php?title=Natural-language_understanding&oldid=954266182.
[6] Elizabeth D Liddy. “Natural language processing”. In: (2001).
[7] Mohamed Amine Chéragui. “Theoretical overview of machine translation”. In: Pro-
ceedings ICWIT (2012), p. 160.
[8] John hutchins. “Machine translation: A concise history”. In: Computer aided transla-
tion: Theory and practice 13.29-70 (2007), p. 11.
[9] Jonathan Slocum. “A survey of machine translation: its history, current status, and
future prospects”. In: Computational linguistics 11.1 (1985), pp. 1–17.
[10] C Poornima et al. “Rule based sentence simplification for english to tamil machine
translation system”. In: International Journal of Computer Applications 25.8 (2011),
pp. 38–42.
[11] W3Techs. Usage Statistics of Content Languages for Websites. Last accessed 16
September 2017. 2017. URL: https://www.freecodecamp.org/news/a-history-of-
machine-translation-from-the-cold-war-to-deep-learning-f1d335ce8b5/.
[12] MD Okpor. “Machine translation approaches: issues and challenges”. In: International
Journal of Computer Science Issues (IJCSI) 11.5 (2014), p. 159.
[13] Shachi Dave, Jignashu Parikh, and Pushpak Bhattacharyya. “Interlingua-based English–
Hindi machine translation and language divergence”. In: Machine Translation 16.4
(2001), pp. 251–304.

54 References
[14] John Hutchins. “Towards a definition of example-based machine translation”. In:
Machine Translation Summit X, Second Workshop on Example-Based Machine Trans-
lation. 2005, pp. 63–70.
[15] Eiichiro Sumita and Hitoshi Iida. “Experiments and prospects of example-based
machine translation”. In: Proceedings of the 29th annual meeting on Association for
Computational Linguistics. Association for Computational Linguistics. 1991, pp. 185–
192.
[16] Adam Lopez. “Statistical machine translation”. In: ACM Computing Surveys (CSUR)
40.3 (2008), pp. 1–49.
[17] Philipp Koehn. Statistical machine translation. Cambridge University Press, 2009.
[18] Eugene Charniak, Kevin Knight, and Kenji Yamada. “Syntax-based language models
for statistical machine translation”. In: Proceedings of MT Summit IX. Citeseer. 2003,
pp. 40–46.
[19] Philipp Koehn, Franz Josef Och, and Daniel Marcu. “Statistical phrase-based transla-
tion”. In: Proceedings of the 2003 Conference of the North American Chapter of the
Association for Computational Linguistics on Human Language Technology-Volume 1.
Association for Computational Linguistics. 2003, pp. 48–54.
[20] John Kelleher. “Fundamentals of machine learning for neural machine translation”. In:
(2016).
[21] Fahimeh Ghasemi et al. “Deep neural network in QSAR studies using deep belief
network”. In: Applied Soft Computing 62 (2018), pp. 251–258.
[22] Yonghui Wu et al. Google’s Neural Machine Translation System: Bridging the Gap
between Human and Machine Translation. 2016. arXiv: 1609.08144 [cs.CL].
[23] Terrence L Fine. Feedforward neural network methodology. Springer Science &
Business Media, 2006.
[24] Larry R Medsker and LC Jain. “Recurrent neural networks”. In: Design and Applica-
tions 5 (2001).
[25] Martin Popel and Ondˇrej Bojar. “Training tips for the transformer model”. In: The
Prague Bulletin of Mathematical Linguistics 110.1 (2018), pp. 43–70.
[26] Ashish Vaswani et al. “Attention is all you need”. In: Advances in neural information
processing systems. 2017, pp. 5998–6008.
[27] Guillaume Klein et al. “Opennmt: Open-source toolkit for neural machine translation”.
In: arXiv preprint arXiv:1701.02810 (2017).
[28] Philipp Koehn et al. “Moses: Open source toolkit for statistical machine translation”.
In: Proceedings of the 45th annual meeting of the association for computational
linguistics companion volume proceedings of the demo and poster sessions. 2007,
pp. 177–180.
[29] Guillaume Klein et al. “OpenNMT: Open-Source Toolkit for Neural Machine Trans-
lation”. In: Proc. ACL. 2017. DOI: 10.18653/v1/P17-4012. URL: https://doi.org/10.
18653/v1/P17-4012.
[30] Kishore Papineni et al. “BLEU: a method for automatic evaluation of machine transla-
tion”. In: Proceedings of the 40th annual meeting on association for computational
linguistics. Association for Computational Linguistics. 2002, pp. 311–318.

Machine Translation for Low Resource Indian Languages Compared

Recommended

Recommended

More Related Content

What's hot

What's hot (18)

Similar to Machine Translation for Low Resource Indian Languages Compared

Similar to Machine Translation for Low Resource Indian Languages Compared (20)

Recently uploaded

Recently uploaded (20)

Machine Translation for Low Resource Indian Languages Compared