3. Introduction
Machine Translation uses computers to convert
text from one natural language to another.
Statistical Machine Translation (SMT) uses
statistical methods (Bayes' theorem, probability,
etc.) to carry out this conversion.
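The role of Bayes' theorem can be written out as the standard noisy-channel formulation of SMT. The symbols S (source sentence) and T (target sentence) follow the Translation Model slide; \hat{T}, the best translation, is notation introduced here for illustration.

```latex
\hat{T} = \arg\max_{T} P(T \mid S)
        = \arg\max_{T} \frac{P(S \mid T)\, P(T)}{P(S)}
        = \arg\max_{T} P(S \mid T)\, P(T)
```

P(S | T) comes from the translation model and P(T) from the language model. P(S) is constant for a given input sentence, so it drops out of the maximization.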
7. My Experiences with Linux (Ubuntu)
Installing Software
sudo apt-get install <Name_of_Software> provided it
is available in the repository...
Downloading the binaries, then reading the
README files and completing the installation.
Using Synaptic Package Manager (for certain
dependencies). Ubuntu releases after version 10 do not
include Synaptic by default, so install it separately.
Using Ubuntu Software Center
8. My Experiences With SMT
Software
Statistical Machine Translation relies heavily on
the following:
Development of a high-quality parallel corpus
Algorithms applied in the Language Model (LM),
Translation Model (TM), and decoder
9. Transition from Windows
Linux is quite different.
Error messages are sometimes difficult to understand
(especially for Windows users).
Errors can occur during installation or while
training/running the software.
10. General Issues with Open Source
Software
Sometimes gives unexpected results.
Quite dynamic in nature.
Documentation is often vague, and when the
software or OS version is updated it is difficult to
work out what needs to be done.
Installing or running the software is sometimes a
challenge because it is easy to get lost in the
dependencies.
11. SMT Software for Windows?
Not so easy, but some generic solutions do
exist:
Using VMware to install and run an Ubuntu system
Updating the system is slow (unless you are using a
high-configuration machine)
Moses for Win 7 (online support for running Moses for
Windows 7)
Cygwin for Windows.
12. Latest Updates
A lot has changed since 2011 (the time of the ME thesis).
Moses has undergone a revamp of its major
functionality
SRILM is now available in version 1.7
GIZA++ has not changed much
13. Effect of the Changes in
Implementation
The older methods may no longer work exactly as described.
For example, if the directory structure has changed,
the same command may not run properly.
However, the generic steps remain the same; they
change only when a major version is released.
16. SMT System Involves
1. Downloading the software
2. Installing the software
3. Preparing the corpus
4. Training the software
5. Testing the software
6. Developing applications using the software
7. Deploying the applications
18. Translation Model
The Translation Model (TM) computes the
conditional probability P(S|T) of a source
sentence ‘S’ given a target sentence ‘T’.
GIZA++ is used for the TM
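To make the role of P(S|T) concrete, here is a minimal sketch of how a decoder could combine translation model and language model scores to pick a translation. The probability tables, the example phrases, and the function name best_translation are all invented for illustration; real TM values would come from training with GIZA++/Moses, not from a hand-written dictionary.

```python
import math

# Hypothetical toy scores, invented purely for illustration.
# tm[(source, target)] = P(S | T), as a translation model would supply.
tm = {
    ("the house", "ghar"): 0.6,
    ("the house", "makaan"): 0.3,
}
# lm[target] = P(T), as a language model would supply.
lm = {"ghar": 0.5, "makaan": 0.2}


def best_translation(source, candidates, tm, lm):
    """Return the candidate T that maximizes log P(S|T) + log P(T)."""
    def score(target):
        return math.log(tm[(source, target)]) + math.log(lm[target])
    return max(candidates, key=score)


print(best_translation("the house", ["ghar", "makaan"], tm, lm))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which is why the scores are summed rather than multiplied.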
20. Preparation of Data
Develop a parallel corpus text with the
following properties:
1. One sentence per line.
2. All sentences of the parallel corpus need to be
lowercased.
3. Try to include simple sentences instead of long,
complicated sentences, at least initially.
21. Preparation of Data
1. Tokenizing the corpus
2. Filtering out long sentences
3. Lowercasing the data
All of the above are done using the training scripts
available in the Moses folder
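The three preparation steps can be sketched in a few lines of Python. This is only an illustration of what the steps do, not the Moses scripts themselves: the tokenizer is deliberately crude, and the names tokenize, prepare_corpus, and max_len as well as the sample sentences are assumptions made for this example.

```python
import re


def tokenize(line):
    """Very rough tokenizer: separate common punctuation from words."""
    spaced = re.sub(r'([.,!?;:()"])', r" \1 ", line)
    return spaced.split()


def prepare_corpus(src_lines, tgt_lines, max_len=40):
    """Tokenize, drop empty or over-long sentence pairs, and lowercase."""
    cleaned = []
    for src, tgt in zip(src_lines, tgt_lines):
        src_tok, tgt_tok = tokenize(src), tokenize(tgt)
        # A pair is kept only if both sides are non-empty and short enough,
        # so the two files stay aligned line by line.
        if not src_tok or not tgt_tok:
            continue
        if len(src_tok) > max_len or len(tgt_tok) > max_len:
            continue
        cleaned.append((" ".join(src_tok).lower(), " ".join(tgt_tok).lower()))
    return cleaned


pairs = prepare_corpus(
    ["This is a house.", "Hello, world!"],
    ["yah ek ghar hai.", "namaste, duniya!"],
    max_len=5,
)
for src, tgt in pairs:
    print(src, "|||", tgt)
```

Filtering is applied to the pair rather than to each side on its own, because removing a sentence from only one file would break the one-sentence-per-line alignment of the parallel corpus.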
22. Language Model (the command will
change according to the software
being used)
./ngram-count -order 3 -text corpus_new4.lowercased.hi -lm hindi.lm -write count.cnt
(According to the ME thesis; for the latest
options, see the LM documentation.)
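To show what an order-3 count file contains conceptually, here is a small Python sketch that counts all n-grams up to trigrams. It is a simplified illustration and not SRILM's implementation: it does no smoothing or probability estimation, and the function name count_ngrams and the sample sentences are assumptions made for this example.

```python
from collections import Counter


def count_ngrams(lines, order=3):
    """Count all n-grams up to `order`, padding sentences with <s> and </s>."""
    counts = Counter()
    for line in lines:
        tokens = ["<s>"] + line.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


counts = count_ngrams(["yah ek ghar hai", "yah ek kitab hai"])
print(counts[("yah", "ek")])
print(counts[("<s>", "yah", "ek")])
```

A language model toolkit turns counts like these into smoothed probabilities, which is the part that differs from one LM package to another.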
27. References
My own experience in developing an SMT system
Prateek Bhatia, Nakul Sharma, “English to Hindi
Statistical Machine Translation System”, ME
Thesis, Thapar University, Patiala, Punjab.
Software (Moses, GIZA++, SRILM).
28. Thank You
Any questions or comments?
Please feel free to drop me an email at
nakul777@gmail.com.