Session on Machine Translation, BATU, 19 March 2016

English To Hindi Statistical Machine
Translation System
Presented By
Nakul Sharma
Assistant Professor (IT)
SAE
Pune

Agenda

Introduction

Machine Translation Approaches

My Experiences

Design and Implementation of Statistical Machine
Translation System.

Conclusion

Q/A Session

Practical Sessions for demonstrating the SMT concepts

Introduction

Machine Translation employs machines to
convert one natural language into another.

Statistical Machine Translation (SMT) uses
statistical methods (Bayes' theorem,
probability, etc.) to carry out this conversion.
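
The role of Bayes' theorem can be written out with the standard noisy-channel formulation of SMT. In this sketch, S is the source (English) sentence and T is a candidate target (Hindi) sentence; the hat notation for the chosen translation is introduced here only for illustration.

```latex
\hat{T} = \arg\max_{T} P(T \mid S)
        = \arg\max_{T} \frac{P(S \mid T)\, P(T)}{P(S)}
        = \arg\max_{T} P(S \mid T)\, P(T)
```

P(S) can be dropped because it is the same for every candidate T. The remaining two factors correspond to the translation model, P(S | T), and the language model, P(T); the search for the maximizing T is the job of the decoder.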

Machine Translation Approaches

Direct MT

Rule Based MT

Corpus Based MT

Knowledge Based MT

Why Only Linux for SMT?
1. There are no licensing issues.
2. The software is available for free.
3. The software can be used directly for research.

Basic Prerequisites
Linux commands
File structure
Installation of software on Linux
Knowledge of computational linguistics and
machine translation tasks

My Experiences with Linux (Ubuntu)
Installing Software

Installing Software
• sudo apt-get install <Name_of_Software>, provided the
package is available in the repository.
• Downloading the binaries, reading the
README files, and completing the installation by hand.
• Using Synaptic Package Manager (for certain
dependencies). Ubuntu releases after version 10 do not
include Synaptic by default, so install it separately.
• Using Ubuntu Software Center.

My Experiences with SMT
Software

Statistical Machine Translation relies heavily on
the following:
• Development of a high-quality parallel corpus
• Algorithms applied in the LM/TM/Decoder

Transition from Windows

Linux is quite different.

Errors are sometimes difficult to understand
(especially for Windows users).

Errors can occur during installation, or they can
occur while training/running the software.

General Issues with Open Source
Software

It sometimes gives unexpected results.

It is quite dynamic in nature.

Documentation is often vague, and when the
software/OS version gets updated it is difficult to
find out what needs to be done.

It is sometimes a challenge to install/run the
software, as one gets lost in the
dependencies.

SMT Software for Windows?

Well, not so easy, but some generic solutions do
exist:
• Using VMware to install and run an Ubuntu system

Updating the system is slow (unless you are using a
high-configuration machine)

• Moses for Windows 7 (online support is available for
running Moses on Windows 7)

• Cygwin for Windows

Latest Updates
A lot has changed since 2011 (the time of the ME thesis).
Moses has undergone a revamp of its major
functionalities.
SRILM is now available in version 1.7.
GIZA++ has not changed much.

Effect of the Changes on
Implementation
The older methods may not work exactly as before.
For example, if the directory structure has changed,
the same command may not run
properly.
However, the generic steps remain the same. These
change only after a major version is released.

Design and Implementation of
Statistical Machine
Translation System
LM
TM
Decoder

SMT System Involves
1. Downloading the software
2. Installing the software
3. Preparing the corpus
4. Training the software
5. Testing the software
6. Developing applications using the software
7. Deploying the applications

Language Model
SRI's LM toolkit (SRILM)
Predicts the probability of a target sentence
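
To make the idea of a language model concrete, here is a toy trigram model in Python. It is only a sketch: add-one smoothing is assumed for brevity (SRILM uses more refined discounting and backoff), and the function names and the romanized Hindi sentences are invented for illustration.

```python
import math
from collections import Counter


def train_trigram_counts(sentences):
    """Collect trigram and history (bigram) counts from tokenized text."""
    tri, hist, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        for i in range(2, len(tokens)):
            history = (tokens[i - 2], tokens[i - 1])
            hist[history] += 1
            tri[history + (tokens[i],)] += 1
    return tri, hist, vocab


def sentence_logprob(sentence, tri, hist, vocab):
    """Log-probability of a sentence under an add-one smoothed trigram model."""
    tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    logprob = 0.0
    for i in range(2, len(tokens)):
        history = (tokens[i - 2], tokens[i - 1])
        numerator = tri[history + (tokens[i],)] + 1
        denominator = hist[history] + len(vocab)
        logprob += math.log(numerator / denominator)
    return logprob


# Toy target-side corpus (romanized Hindi, illustrative only).
corpus = ["yah ek kitab hai", "yah ek ghar hai"]
tri, hist, vocab = train_trigram_counts(corpus)
fluent = sentence_logprob("yah ek kitab hai", tri, hist, vocab)
scrambled = sentence_logprob("hai kitab ek yah", tri, hist, vocab)
```

A word order seen in training receives a higher score than a scrambled one, which is the property the decoder relies on when choosing fluent output.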

Translation Model
The Translation Model (TM) computes the
probability of a source sentence ‘S’ given a
target sentence ‘T’, i.e. the conditional probability P(S|T).
GIZA++ is used for the TM.
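
GIZA++ trains word-alignment models of the IBM family. As a rough illustration of how P(S|T) can be estimated at the word level, the sketch below runs EM for IBM Model 1 on a tiny corpus. It is a simplification and not GIZA++ itself: there is no NULL word, the function name is hypothetical, and the romanized Hindi data are made up.

```python
from collections import defaultdict


def ibm_model1(pairs, iterations=10):
    """Estimate t[(s, w)], approximating P(source word s | target word w).

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    """
    src_vocab = {s for src, _ in pairs for s in src}
    uniform = 1.0 / len(src_vocab)
    # Start from a uniform table over word pairs that co-occur in a sentence pair.
    t = {(s, w): uniform for src, tgt in pairs for s in src for w in tgt}

    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each source word over the target words of its sentence.
        for src, tgt in pairs:
            for s in src:
                norm = sum(t[(s, w)] for w in tgt)
                for w in tgt:
                    fraction = t[(s, w)] / norm
                    count[(s, w)] += fraction
                    total[w] += fraction
        # M-step: renormalize the expected counts per target word.
        for (s, w), c in count.items():
            t[(s, w)] = c / total[w]
    return t


# Toy English (source) / romanized Hindi (target) corpus, illustrative only.
toy_pairs = [
    ("red house".split(), "lal ghar".split()),
    ("red book".split(), "lal kitab".split()),
    ("big book".split(), "badi kitab".split()),
]
table = ibm_model1(toy_pairs)
```

After a few iterations, `table[("house", "ghar")]` grows larger than `table[("red", "ghar")]`, because "red" is better explained by "lal" in the other sentence pair.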

Decoder
The decoder searches for the target sentence with
the highest probability for a given source sentence.
Moses is used as the decoder.
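
The decoder's job can be sketched as picking, from a set of candidate translations, the one with the best combined TM and LM score. This is only a toy: Moses performs a search over phrase segmentations rather than scoring a fixed list, and the `decode` function and all scores below are hypothetical.

```python
import math


def decode(source, candidates, tm_logprob, lm_logprob):
    """Return the candidate T that maximizes log P(S|T) + log P(T)."""
    return max(
        candidates,
        key=lambda target: tm_logprob(source, target) + lm_logprob(target),
    )


# Hypothetical toy scores, invented for illustration only.
TOY_TM = {"yah kitab hai": math.log(0.20), "kitab yah hai": math.log(0.20)}
TOY_LM = {"yah kitab hai": math.log(0.10), "kitab yah hai": math.log(0.01)}

best = decode(
    "this is a book",
    list(TOY_TM),
    tm_logprob=lambda source, target: TOY_TM[target],
    lm_logprob=lambda target: TOY_LM[target],
)
print(best)
```

Both candidates have the same translation-model score here, so the language model breaks the tie in favor of the more fluent word order.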

Preparation of Data
Development of parallel corpus text with the
following properties:
1. One sentence per line.
2. All sentences of the parallel corpus need to be
lowercased.
3. Try to include simple sentences instead of long,
complicated sentences, at least initially.

Preparation of Data
1. Tokenizing the corpus
2. Filtering out long sentences
3. Lowercasing the data
All of the above is done using the training scripts available in the
Moses folder.
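
The three steps above can be sketched in Python as follows. This is not the Moses tooling, only a minimal imitation of what it does: the punctuation set, the `max_len` threshold of 40 tokens, and the function names are assumptions made for illustration.

```python
import re

# Punctuation to split off as separate tokens; \u0964 is the Devanagari danda.
PUNCT_RE = re.compile(r'([.,!?;:"()\u0964])')


def tokenize(line):
    """Separate punctuation from words and normalize whitespace."""
    return " ".join(PUNCT_RE.sub(r" \1 ", line).split())


def prepare(pairs, max_len=40):
    """Tokenize, drop empty or overly long pairs, then lowercase."""
    cleaned = []
    for english, hindi in pairs:
        en_tok, hi_tok = tokenize(english), tokenize(hindi)
        n_en, n_hi = len(en_tok.split()), len(hi_tok.split())
        if not (1 <= n_en <= max_len and 1 <= n_hi <= max_len):
            continue  # filter out empty or long sentence pairs
        # Lowercasing has no effect on Devanagari text, only on the English side.
        cleaned.append((en_tok.lower(), hi_tok.lower()))
    return cleaned


sample = [
    ("This is a Book.", "यह एक किताब है।"),
    ("word " * 50, "शब्द"),  # too long on the English side, gets dropped
]
result = prepare(sample)
```

Both sides of a pair are kept or dropped together so that the two files of the parallel corpus stay aligned line by line.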

Language Model (the command will
change according to the software
being used)
./ngram-count -order 3 -text corpus_new4.lowercased.hi -lm hindi.lm -write count.cnt
(As used in the ME thesis; for the
latest options, see the LM documentation.)

Translation Model
GIZA++ Training as given in Moses Manual

Moses
./train-factored-phrase-model.perl \
  -scripts-root-dir /home/nakul/moses/mosesdecoder/trunk/scripts/training/moses-scripts/scripts-20110405-1055/ \
  -root-dir . --corpus corpus_new5.loweredcased -f en -e hi \
  -lm 0:3:/home/nakul/moses/mosesdecoder/trunk/scripts/training/moses-scripts/scripts-20110405-1055/training/hindi_lm5.lm \
  >& training_new5.out &

A Few Names of Stalwarts in SMT
Philipp Koehn
Hieu Hoang (Moses-specific)
Christopher D. Manning

Acknowledgements
Prof. Dr. Prateek Bhatia (ME Thesis Guide)
Mother and Father

References

My own experience in developing an SMT system

Prateek Bhatia, Nakul Sharma, “English to Hindi
Statistical Machine Translation System”, ME
Thesis, Thapar University, Patiala, Punjab.

Software (Moses, GIZA++, SRILM).

Thank You
Any questions or comments?
Please feel free to drop me an email at
nakul777@gmail.com.
