Translationusing moses1

A SEMINAR REPORT
ON
English-Assamese Statistical
Machine Translation using Moses
by
KALYANEE KANCHAN BARUAAH

Contents
Introduction
Literature Review
Implementation
Problems and proposed solutions
Results and Evaluation
 Conclusion and future work
 References

Introduction
 Natural Language Processing
 Machine Translation
Need for Machine Translation
Problems in MT
Approaches to machine translation
Direct-based MT
Rule-based MT
Corpus-based MT
Knowledge-based MT
SOME EXISTING MT SYSTEMS
Statistical Machine Translation

Introduction
NLP (Natural Language Processing)
– Deals with understanding and developing computational
theories of human language. Such theories allow us to
understand the structure of language and to build computer
software that can process it.
– Plays a major role in human-machine communication as well as
human-human communication.
– Machine Translation (MT) is a sub-field of computational
linguistics that investigates the use of computer software to
translate text or speech from one natural language to another.

Machine Translation
Machine translation is the application of computer and language
sciences to the development of systems answering practical needs.
– Need for Machine Translation
– Needed to translate literary works from any language into
native languages.
– Most of the information available is in English, which is understood
by only 3% of the population.
– Makes rich sources of literature available to people
across the world.
– Problems in Machine Translation
– Translation is not straightforward
– Automation of translation is not easy
– Idioms
– Ambiguity

Machine Translation
MT Approaches

MT Approaches
 Direct MT
The most basic form of MT. It translates the individual words in a
sentence from one language to another using a two-way
dictionary, and makes use of very simple grammar rules.
– Little analysis of the source language
– No parsing
– Reliance on a large two-way dictionary
 Rule-Based Machine Translation
RBMT (also known as "Knowledge-Based Machine Translation" or the
"Classical Approach" to MT) is a general term for machine
translation systems based on linguistic information about the
source and target languages. This information is retrieved mainly
from (bilingual) dictionaries and grammars covering the main
semantic, morphological, and syntactic regularities of each
language.

MT Approaches
• Interlingua-based Machine Translation
• Fig 2.3: Multilingual MT system with Interlingua approach
– Interlingua machine translation converts the source text into a
universal intermediate language created for MT, from which it can
be translated into more than one target language.
– Can be used in applications like information retrieval.
– More practical when several languages are involved, since the
source text only needs to be translated into the interlingua once.

MT Approaches
Transfer-based MT
• Transfer-based translation has the same idea as interlingua, i.e.
to make a translation it is necessary to have an intermediate
representation that captures the "meaning" of the original sentence in
order to generate the correct translation. In interlingua-based MT this
intermediate representation must be independent of the languages in
question, whereas in transfer-based MT it has some dependence on
the language pair involved [9].
Knowledge-based MT
• Semantic-based approaches to language analysis have been introduced
by AI researchers. These approaches require a large knowledge base
that includes both ontological and lexical knowledge [6].

MT Approaches
Corpus-Based Machine Translation
Classified into statistical and example-based machine translation.
• Example-based MT
– Example-based systems use previous translation examples to
generate translations for a given input. When an input
sentence is presented to the system, it retrieves a similar source
sentence from the example base, together with its translation.
• Statistical Machine Translation
– SMT requires less human effort to undertake translation. SMT is
a machine translation paradigm where translations are generated
on the basis of statistical models.
Example-based MT vs. statistical MT
– Example-based MT: uses a variety of linguistic resources, such as
dictionaries and thesauri, to translate text.
– Statistical MT: uses purely statistical methods for aligning
words and generating text.

SOME EXISTING MT SYSTEMS
• Google Translate
• Systran
• Bing Translator
• Babel Fish

 Statistical Machine Translation consists of a Language Model (LM), a
Translation Model (TM) and the decoder.
 The purpose of the language model is to encourage fluent output,
and the purpose of the translation model is to encourage similarity
between input and output. The decoder searches for the target-language
text with the highest probability under these models.
 SMT is based on ideas from Information Theory, in particular
Shannon's noisy-channel model. The purpose of this model is to
recover a message that has been transmitted through a communication
channel and is therefore prone to errors due to the channel's quality.

Parallel corpus C = a collection of text chunks and their
translations (a byproduct of human translation).
Given a source sentence f, select the target sentence e:
\hat{e} = \arg\max_{e \in E(f)} p(e \mid f) = \arg\max_{e \in E(f)} p(e) \, p(f \mid e)
E(f) = the set of hypothesized translations of f.
p(f|e) has to account for divergences between the two languages due to –
– Word order
– Morphology
– Syntactic relations
– Idiomatic ways of expression
– Sparse datasets
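
The selection rule above can be illustrated with a small Python sketch. The candidate names and all probability values are invented for illustration only; a real decoder such as Moses searches a far larger hypothesis space and combines more model scores.

```python
import math

# Hypothetical scores for three candidate translations e of one source sentence f.
# "lm" stands for p(e) and "tm" for p(f|e); the numbers are invented.
candidates = {
    "candidate A": {"lm": 0.020, "tm": 0.10},
    "candidate B": {"lm": 0.005, "tm": 0.30},
    "candidate C": {"lm": 0.001, "tm": 0.40},
}

def noisy_channel_score(scores):
    """Return log p(e) + log p(f|e), the log of the product being maximised."""
    return math.log(scores["lm"]) + math.log(scores["tm"])

def decode(cands):
    """Pick the candidate e with the highest p(e) * p(f|e)."""
    return max(cands, key=lambda e: noisy_channel_score(cands[e]))

best = decode(candidates)
print(best)
```

Candidate C has the best translation-model score, but candidate A wins because the product of both scores is what is maximised.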

SMT-Language Model
• A language model gives the probability of a sentence. The
probability is computed using an n-gram model. The language model can
be viewed as computing the probability of each word given
all of the words that precede it in the sentence.
• Using the chain rule, the probability of a sentence P(S) is
decomposed into a product of conditional probabilities of its words:
P(S) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
• An n-gram model simplifies the task by approximating the
probability of a word given all the previous words with its
probability given only the previous n-1 words.
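
As a minimal sketch of the n-gram idea, the following Python code estimates an unsmoothed bigram model (n = 2) from a toy corpus and multiplies the conditional probabilities along a sentence. The corpus and function names are illustrative assumptions; toolkits such as IRSTLM additionally apply smoothing so that unseen n-grams do not receive zero probability.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams, padding each sentence with <s> and </s>."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])                  # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # (previous word, word) pairs
    return contexts, bigrams

def sentence_probability(sentence, contexts, bigrams):
    """Chain rule with a bigram approximation: product of P(word | previous word)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        if contexts[prev] == 0:
            return 0.0  # unseen context; no smoothing in this sketch
        prob *= bigrams[(prev, word)] / contexts[prev]
    return prob

# Toy training corpus (illustrative only).
corpus = ["udaipur is a famous city", "jaipur is a city"]
contexts, bigrams = train_bigram(corpus)
print(sentence_probability("udaipur is a city", contexts, bigrams))
```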

SMT-Translation Model
Example sentence pair:
Udaipur is a famous city
উদয়পুৰ এখন বিখ্যাত চহৰ
The Translation Model is used to compute the
conditional probability P(T|S) of a target sentence T given a source sentence S.
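
One simple way to obtain such conditional probabilities is relative-frequency estimation over aligned word pairs. The sketch below assumes the alignments are already given; the aligned pairs and their counts are invented for illustration, whereas in the actual pipeline GIZA++ produces the word alignments and Moses scores whole phrases.

```python
from collections import Counter, defaultdict

def estimate_translation_table(aligned_pairs):
    """Relative-frequency estimate p(t|s) = count(s, t) / count(s)."""
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(s for s, _ in aligned_pairs)
    table = defaultdict(dict)
    for (s, t), count in pair_counts.items():
        table[s][t] = count / source_counts[s]
    return table

# Hypothetical aligned (English, Assamese) word pairs; counts are invented.
aligned_pairs = (
    [("city", "চহৰ")] * 3
    + [("city", "নগৰ")]
    + [("famous", "বিখ্যাত")] * 2
)

table = estimate_translation_table(aligned_pairs)
print(table["city"])
```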

Implementation
 Install Moses and the packages it depends on
• Install GIZA++
• Install IRSTLM
Training
Tuning
Generate output (decoding)

TRAINING THE MOSES DECODER
• Prepare data
• Run GIZA++
• Align words
• Get lexical translation table
• Extract phrases
• Score phrases
• Build lexicalized reordering model
• Build generation models.
• Create configuration file

PREPARING DATA
Tokenising - inserting spaces between words and
punctuation.
Truecasing - setting the first word of each sentence
to its most probable case.
Cleaning - removing empty lines, redundant
spaces, and lines that are too short or too long.
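
The tokenising and cleaning steps can be sketched in Python as follows. This is only an illustration of what the steps do, not the Moses scripts themselves; the punctuation set and the length limits are assumptions. Truecasing is left out because it relies on a model trained on the corpus.

```python
import re

# Punctuation to separate from words; the Assamese danda (।) is included.
PUNCTUATION = re.compile(r"([.,!?;:()\"।])")

def tokenise(line):
    """Insert spaces around punctuation, then collapse redundant whitespace."""
    spaced = PUNCTUATION.sub(r" \1 ", line)
    return " ".join(spaced.split())

def clean(pairs, min_len=1, max_len=80):
    """Keep sentence pairs whose token counts lie in [min_len, max_len] on both sides."""
    kept = []
    for source, target in pairs:
        src_tok, tgt_tok = tokenise(source), tokenise(target)
        n_src, n_tgt = len(src_tok.split()), len(tgt_tok.split())
        if min_len <= n_src <= max_len and min_len <= n_tgt <= max_len:
            kept.append((src_tok, tgt_tok))
    return kept

pairs = [
    ("Udaipur is a famous city.", "উদয়পুৰ এখন বিখ্যাত চহৰ।"),
    ("", "চহৰ"),  # empty source line: removed by cleaning
]
print(clean(pairs))
```

Both sides of a pair are dropped together so that the two files of the parallel corpus stay aligned line by line.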

Sample of Parallel Corpus
(English side: eng-ass1.en; Assamese side: eng-ass1.as)
1. English: Shopping in Udaipur is always a delightful experience and it displays excellent handicrafts and works developed by local traders.
   Assamese: উদয়পুৰত িজাৰ কৰাট া এক আনন্দদায়ক অবিজ্ঞতা আৰু ইটয় স্থানীয় িযৱসায়ীসকলৰ হস্তকলা আৰু কামৰ উত্তম বনদৰ্শন দাবি ধটৰ।
2. English: September to March is the best season to visit Udaipur.
   Assamese: চচটেম্বৰৰ পৰা মাচশ লল উদয়পুৰ ভ্ৰমণৰ উপযুক্ত সময়।
3. English: The Shilpagram is designed on the concept of a village with little emphasis on the modern concept.
   Assamese: এখন গাাঁ ৱৰ প িূ বমত বৰ্ল্পগ্ৰাম অংবকত কৰা হহটে যʼত আধুবনক ধাৰণাৰ ওপৰত গুৰুত্ব বদয়া চহাৱা নাই।
4. English: A part of the City Palace is now converted into a museum that displays some of the best forms of art and culture.
   Assamese: বচটি চপটলচৰ এ া অংৰ্ এবতয়া এ া সংগ্ৰাহালয়লল ৰূপান্তবৰত কৰা হহটে আৰু ইয়াত বকেুমান উন্নত কলা আৰু সংস্কৃ বতৰ প্ৰদৰ্শন কৰা হহটে।

Sample output
(English sentences as input, with the corresponding output in Assamese)
1. Input: Kanak Vrindavan is a popular picnic spot in Jaipur
   Output: কনক িৃন্দািন হহটে জয়পুৰৰ এখন জনবপ্ৰয় িনটিাজ স্থান
2. Input: City Palace is a synthesis of Mughal and Rajasthani architecture
   Output: বচটি চপটলচ চমাগল আৰু ৰাজস্থানী স্থাপতয বিদযাৰ সংবমশ্ৰণ।
3. Input: Jama Masjid is the largest mosque in India
   Output: জামা মেবজদ িাৰতৰ বিতৰত আ াইতলক ডািৰ মেবজদ।
4. Input: A part of the City Palace is now converted into a museum
   Output: বচটি চপটলচৰ এ া অংৰ্ এবতয়া এ া সংগ্ৰাহালয়লল ৰূপান্তবৰত কৰা হহটে

Fig: Block diagram showing SMT using Moses

Results and evaluation
– The output of the experiment was evaluated using
BLEU (Bilingual Evaluation Understudy).
– For Assamese-English translation using the same parallel
corpus, a BLEU score of 5.72 is obtained. Both scores are very low,
possibly because a very small data set was used.
• BLEU scores are not commensurate even between different corpora in
the same translation direction. BLEU is really only comparable for
different systems or system variants on the exact same data.
Source/Target        BLEU score
English-Assamese     4.71
Assamese-English     5.72
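
To make the metric concrete, here is a minimal Python sketch of BLEU for one candidate sentence against one reference: the geometric mean of modified n-gram precisions (n = 1 to 4) multiplied by a brevity penalty. It is a simplified, unsmoothed sentence-level version written for illustration; the slides do not say which scoring script produced the reported figures, and corpus-level scorers aggregate counts over all sentences instead.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference, on a 0-100 scale, without smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # any zero precision makes the geometric mean zero
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return 100 * brevity_penalty * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on a mat"))
```

Because a single missing 4-gram match drives the unsmoothed score to zero, short or rough translations are penalised heavily, which is one reason scores on a small corpus tend to be low.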

Problems and proposed solution
Problems…
– We have used a limited amount of English-Assamese parallel
corpus. The translation model is therefore less accurate, since
accuracy increases with the amount of training data
(parallel corpus).
– Translation quality is poor for OOV (out-of-vocabulary) words,
because Moses either leaves OOV words untranslated or drops them.
We are trying to implement transliteration for these OOV words.
OOV words are words that are not present in the training corpus,
such as some proper nouns.

Problems and proposed solution
Solution..
Transliteration
• কু-মা-ৰ -> (ku-ma-r) ৰাজ কু মা ৰ ->(Raj-ku-ma-r)
• Example: English-Assamese Transliteration
For example, to translate the sentence 'panaji is a city', we
used the following command to incorporate transliteration
into translation:
echo 'panaji is a city' | ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini |
./space.pl | ~/mymoses/bin/moses -f ~/work1/train/model/moses.ini | ./join.pl
This gives us the output:
• “পানাবজলহটেএ াচহৰ”
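
The contents of space.pl and join.pl are not shown in the slides. A plausible reading, offered here only as an assumption, is that the first script splits words into space-separated characters so that a second, character-level Moses model can transliterate them, and the second script joins the characters again. The Python helpers below are hypothetical stand-ins for that idea, with `str.upper` used as a placeholder for the transliteration model.

```python
def space_out(word):
    """Split a word into space-separated characters: 'panaji' -> 'p a n a j i'."""
    return " ".join(word)

def join_up(text):
    """Remove the spaces again so the transliterated characters form one word."""
    return "".join(text.split())

def transliterate(word, model):
    """Run a character-level model (any callable on strings) between the two helpers."""
    return join_up(model(space_out(word)))

# str.upper is only a placeholder for the second (transliteration) model.
print(space_out("panaji"))
print(transliterate("panaji", str.upper))
```

Splitting by Unicode code point, as `space_out` does, would separate Assamese vowel signs from their consonants; the slide's own example splits at the syllable level (ku-ma-r), so a real implementation would need a syllable-aware splitter.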

Conclusion and Future Work
• The work can be extended in the following directions.
– The system can be deployed in a web-based portal to translate the content of
a web page from English to Assamese.
– We will try to enlarge the corpus to improve training and translation quality.
– We will try to develop our own translation system instead of using
the Moses MT system.
– Since most Indian languages follow SOV order and are relatively rich in
morphology, the methodology presented should be applicable to
English to Indian language SMT in general. Since morphological and
parsing tools are not widely available for Indian languages, an
approach like this, which minimizes the use of such tools for the target
language, would be quite handy.

Conclusion and Future Work
• We will try to obtain more corpora from different domains so
that a wider vocabulary is covered. Since BLEU is not well
suited to rough translations, other evaluation
techniques are also needed.
• We should try incorporating shallow syntactic
information (POS tags) into our discriminative model to boost
translation performance.

References
– Antony P. J., "Machine Translation Approaches and Survey for Indian Languages".
– G. Singh and G. Singh Lehal, "A Punjabi to Hindi Machine Translation System", Coling 2008: Companion
volume - Posters and demonstrations, Manchester, August 2008.
– F. J. Och, "GIZA++: Training of statistical translation models", [Online]. Available:
http://fjoch.com/GIZA++.html
– Moses Manual.
– "Natural language processing", Wikipedia, the free encyclopedia.
– D. D. Rao, "Machine Translation: A Gentle Introduction", Resonance, July 1998.
– "Statistical machine translation", [Online]. Available:
http://en.wikipedia.org/wiki/Statistical_machine_translation
– S. K. Dwivedi and P. P. Sukadeve, "Machine Translation System Indian Perspectives",
Journal of Computer Science, Vol. 6, No. 10, pp. 1082-1087, May 2010.
– "Machine Translation", [Online]. Available:
http://www.ida.liu.se/~729G11/projekt/studentpapper-10/maria-hedblom.pdf
– "Machine Translation", [Online]. Available:
http://faculty.ksu.edu.sa/homiedan/Publications/Machine%20Translation.pdf
