3. Introduction
• Natural Language Processing
• Machine Translation
• Need for Machine Translation
• Problems in MT
• Approaches to machine translation
• Direct MT
• Rule-based MT
• Corpus-based MT
• Knowledge-based MT
• Some existing MT systems
• Statistical Machine Translation
4. Introduction
NLP (Natural Language Processing)
- Deals with understanding and developing computational
  theories of human language. Such theories allow us to
  understand the structure of language and build computer
  software that can process language.
- Plays a major role in human-machine communication as well as
  human-human communication.
- Machine Translation (MT) is a sub-field of computational
  linguistics that investigates the use of computer software to
  translate text or speech from one natural language to another.
5. Machine Translation
Machine translation is the application of computer and language
sciences to the development of systems that answer practical needs.
• Need for Machine Translation
  - MT is needed to translate literary works from any language into
    native languages.
  - Most of the information available is in English, which is understood
    by only 3% of the population.
  - MT makes rich sources of literature available to people
    across the world.
• Problems in Machine Translation
  - Translation is not straightforward
  - Automation of translation is not easy
  - Idioms
  - Ambiguity
7. MT Approaches
• Direct MT
  - The most basic form of MT. It translates the individual words in a
    sentence from one language to another using a two-way
    dictionary, and it uses only very simple grammar rules.
  - Little analysis of the source language
  - No parsing
  - Reliance on a large two-way dictionary
• Rule-Based Machine Translation
  - RBMT (also known as "Knowledge-Based Machine Translation" or the
    "Classical Approach" of MT) is a general term for machine
    translation systems based on linguistic information about the
    source and target languages. This information is retrieved mainly
    from (bilingual) dictionaries and grammars covering the main
    semantic, morphological, and syntactic regularities of each
    language.
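The word-by-word idea behind Direct MT can be sketched as below. This is only an illustration: the toy English-Spanish dictionary and the function name `direct_translate` are assumptions, not part of the system described in these slides.

```python
# Hypothetical toy dictionary (English -> Spanish); a real direct MT
# system relies on a large bilingual dictionary.
TOY_DICTIONARY = {
    "the": "el",
    "cat": "gato",
    "drinks": "bebe",
    "milk": "leche",
}


def direct_translate(sentence, dictionary):
    """Translate word by word; unknown words are copied through unchanged."""
    words = sentence.lower().split()
    return " ".join(dictionary.get(word, word) for word in words)


print(direct_translate("The cat drinks milk", TOY_DICTIONARY))
```

Because there is no parsing and no reordering, the output keeps the source word order, which is the main weakness of the direct approach.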
8. MT Approaches
• Interlingua-based Machine Translation
• Fig 2.3: Multilingual MT system with Interlingua approach
  - Interlingua MT converts the source text into a universal
    intermediate language, created specifically for MT, from which it
    can be translated into more than one target language.
  - Can be used in applications like information retrieval.
  - More practical when several languages are involved, since the text
    only needs to be analysed once, from the source language.
9. MT Approaches
Transfer-based MT
• Transfer-based translation shares the core idea of interlingua: to
  make a translation, it is necessary to have an intermediate
  representation that captures the "meaning" of the original sentence in
  order to generate the correct translation. In interlingua-based MT this
  intermediate representation must be independent of the languages in
  question, whereas in transfer-based MT it has some dependence on
  the language pair involved [9].
Knowledge-based MT
• Semantic-based approaches to language analysis have been introduced
  by AI researchers. These approaches require a large knowledge base
  that includes both ontological and lexical knowledge [6].
10. MT Approaches
Corpus-Based Machine Translation
Corpus-based MT is classified into statistical and example-based machine translation.
• Example-based MT
  - Example-based systems use previous translation examples to
    generate translations for a given input. When an input
    sentence is presented to the system, it retrieves a similar source
    sentence from the example base together with its translation.
• Statistical Machine Translation
  - SMT requires less human effort to undertake translation. SMT is
    a machine translation paradigm in which translations are generated
    on the basis of statistical models.
Example-based MT vs. statistical MT
• Example-based MT: uses a variety of linguistic resources, such as
  dictionaries and thesauri, to translate text.
• Statistical MT: uses purely statistical methods to align words and
  generate text.
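The retrieval step of example-based MT (finding the stored source sentence closest to the input) can be sketched as follows. The example base, the English-Spanish toy data, and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions made for illustration; real systems use richer matching and then adapt the retrieved translation.

```python
from difflib import SequenceMatcher

# Hypothetical example base of (source, translation) pairs.
EXAMPLE_BASE = [
    ("good morning", "buenos días"),
    ("where is the station", "dónde está la estación"),
    ("thank you very much", "muchas gracias"),
]


def retrieve_example(input_sentence, example_base):
    """Return (similarity, source, translation) for the closest stored source."""
    input_tokens = input_sentence.lower().split()

    def similarity(pair):
        # Token-level similarity between the input and a stored source sentence.
        return SequenceMatcher(None, input_tokens, pair[0].split()).ratio()

    best = max(example_base, key=similarity)
    return similarity(best), best[0], best[1]


score, source, translation = retrieve_example("where is the hotel", EXAMPLE_BASE)
print(score, source, translation)
```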
11. SOME EXISTING MT SYSTEMS
• Google Translate
• Systran
• Bing Translator
• Babel Fish
12. Statistical Machine Translation
• Statistical Machine Translation consists of a Language Model (LM),
  a Translation Model (TM) and a decoder.
• The purpose of the language model is to encourage fluent output,
  and the purpose of the translation model is to encourage similarity
  between input and output. The decoder searches for the target-language
  text with the highest probability.
• SMT is based on ideas used in Information Theory, in particular
  Shannon's noisy-channel model. The purpose of this model is to
  recover a message that has been transmitted through a communication
  channel and may therefore contain errors caused by the channel's quality.
13. Statistical Machine Translation
Parallel corpus C = a collection of text chunks and their
translations (a byproduct of human translation).
Given a source sentence f, select the target sentence e:
$$\operatorname*{argmax}_{e \in E(f)} \; p(e \mid f) \;=\; \operatorname*{argmax}_{e \in E(f)} \; p(e)\, p(f \mid e)$$
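E(f) = set of hypothesized translations of f.
(By Bayes' rule, p(e|f) = p(e) p(f|e) / p(f); p(f) is constant for a given f, so it drops out of the maximization.)
p(f|e) must account for divergences between the two languages due to:
- Word order
- Morphology
- Syntactic relations
- Idiomatic ways of expression
- Sparse data sets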
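The selection rule on this slide can be sketched in a few lines. The probabilities below are invented toy values for a hypothetical French source sentence f = "la maison bleue", and the candidate set is listed by hand; a real decoder such as Moses searches a very large hypothesis space instead of enumerating it.

```python
import math

# Hypothetical toy scores; a real system estimates these from corpora.
LANGUAGE_MODEL = {          # p(e): how fluent each candidate is
    "the blue house": 0.05,
    "the house blue": 0.005,
    "blue the house": 0.001,
}
TRANSLATION_MODEL = {       # p(f | e) for f = "la maison bleue"
    "the blue house": 0.2,
    "the house blue": 0.3,
    "blue the house": 0.2,
}


def decode(candidates, language_model, translation_model):
    """Return the candidate e that maximizes log p(e) + log p(f | e)."""
    def score(e):
        return math.log(language_model[e]) + math.log(translation_model[e])

    return max(candidates, key=score)


best = decode(list(LANGUAGE_MODEL), LANGUAGE_MODEL, TRANSLATION_MODEL)
print(best)
```

In this toy setting the translation model slightly prefers the literal word order, but the language model pushes the decision towards the fluent candidate.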
15. SMT-Language Model
• A language model gives the probability of a sentence. The
  probability is computed using an n-gram model. A language model can
  be viewed as computing the probability of a single word given
  all of the words that precede it in the sentence.
• The sentence probability is decomposed into a product of conditional
  probabilities.
• Using the chain rule, the sentence probability P(S) is broken
  down into the conditional probabilities of the individual words,
  P(w_i | w_1 ... w_{i-1}).
• An n-gram model simplifies the task by conditioning each word only
  on the previous n-1 words instead of on all the previous words.
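As a rough illustration of the chain rule with an n-gram approximation, the sketch below trains a bigram model (n = 2) with plain relative-frequency estimates and no smoothing. The three training sentences and the function names are assumptions; toolkits such as IRSTLM add smoothing and back-off, which this sketch omits.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count contexts and bigrams, with sentence boundary markers."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts


def sentence_probability(sentence, context_counts, bigram_counts):
    """P(S) as the product of P(w_i | w_{i-1}) over the sentence."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    probability = 1.0
    for previous, current in zip(tokens, tokens[1:]):
        if context_counts[previous] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        probability *= bigram_counts[(previous, current)] / context_counts[previous]
    return probability


corpus = [
    "Udaipur is a famous city",
    "Jaipur is a famous city",
    "Udaipur is a city",
]
contexts, bigrams = train_bigram_model(corpus)
print(sentence_probability("Udaipur is a city", contexts, bigrams))
```

Any sentence containing an unseen bigram gets probability zero here, which is why practical language models are smoothed.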
17. SMT-Translation Model
Udaipur is a famous city
উদয়পুৰ এখন বিখ্যাত চহৰ
The Translation Model computes the conditional probability P(T|S)
of a target sentence T given a source sentence S.
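One classical way to estimate word-level translation probabilities from a parallel corpus is IBM Model 1 trained with EM, one of the models implemented in GIZA++. The sketch below is a simplified version (no NULL word, uniform initialization) run on a hypothetical three-sentence German-English toy corpus; it is not the model configuration used in the experiments described here.

```python
from collections import defaultdict


def train_ibm_model1(sentence_pairs, iterations=20):
    """Estimate t(target_word | source_word) with EM (simplified IBM Model 1)."""
    pairs = [(source.split(), target.split()) for source, target in sentence_pairs]
    target_vocab = {tw for _, tgt in pairs for tw in tgt}
    # Uniform start for every co-occurring (target word, source word) pair.
    t_prob = {(tw, sw): 1.0 / len(target_vocab)
              for src, tgt in pairs for tw in tgt for sw in src}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t_prob[(tw, sw)] for sw in src)
                for sw in src:
                    delta = t_prob[(tw, sw)] / norm
                    count[(tw, sw)] += delta
                    total[sw] += delta
        # M-step: renormalize the counts into probabilities.
        for (tw, sw), c in count.items():
            t_prob[(tw, sw)] = c / total[sw]
    return t_prob


toy_corpus = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
]
t_prob = train_ibm_model1(toy_corpus)
print(round(t_prob[("the", "das")], 3), round(t_prob[("book", "buch")], 3))
```

Phrase-based systems such as Moses build on these word alignments to extract and score phrase pairs.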
18. Implementation
• Install Moses and the packages it requires
  - Install GIZA++
  - Install IRSTLM
• Training
• Tuning
• Generate output (decoding)
19. TRAINING THE MOSES DECODER
• Prepare data
• Run GIZA++
• Align words
• Get lexical translation table
• Extract phrases
• Score phrases
• Build lexicalized reordering model
• Build generation models
• Create configuration file
20. PREPARING DATA
• Tokenising - inserting spaces between words and
  punctuation.
• Truecasing - converting the first word of each sentence
  to its most probable casing.
• Cleaning - removing empty lines, redundant
  spaces, and lines that are too short or too long.
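The three preparation steps can be illustrated with the simplified Python sketch below. Moses ships its own scripts for these steps; the functions here are hypothetical stand-ins that only mimic the basic idea (for example, the tokenizer ignores abbreviations and the cleaner applies only a length filter).

```python
import re
from collections import Counter, defaultdict


def tokenize(line):
    """Insert spaces between words and common punctuation marks."""
    spaced = re.sub(r'([.,!?;:()"])', r" \1 ", line)
    return " ".join(spaced.split())


def train_truecaser(tokenized_lines):
    """Learn each word's most frequent casing from non-initial positions."""
    casing_counts = defaultdict(Counter)
    for line in tokenized_lines:
        for token in line.split()[1:]:
            casing_counts[token.lower()][token] += 1
    return {word: forms.most_common(1)[0][0] for word, forms in casing_counts.items()}


def truecase(line, model):
    """Recase only the first word of the sentence, if the model knows it."""
    tokens = line.split()
    if tokens:
        tokens[0] = model.get(tokens[0].lower(), tokens[0])
    return " ".join(tokens)


def clean_corpus(sentence_pairs, min_tokens=1, max_tokens=80):
    """Drop pairs with an empty, too short, or too long side; squeeze spaces."""
    cleaned = []
    for source, target in sentence_pairs:
        source, target = " ".join(source.split()), " ".join(target.split())
        lengths = (len(source.split()), len(target.split()))
        if all(min_tokens <= n <= max_tokens for n in lengths):
            cleaned.append((source, target))
    return cleaned


lines = [tokenize(s) for s in
         ["The city is old.", "We saw the palace in Udaipur.", "Udaipur is famous."]]
model = train_truecaser(lines)
print([truecase(line, model) for line in lines])
```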
21. Sample of Parallel Corpus
eng-ass1.en (English) | eng-ass1.as (Assamese)
• English: Shopping in Udaipur is always a delightful experience and it displays excellent handicrafts and works developed by local traders.
  Assamese: [Assamese translation of the sentence above]
• English: September to March is the best season to visit Udaipur.
  Assamese: [Assamese translation of the sentence above]
• English: The Shilpagram is designed on the concept of a village with little emphasis on the modern concept.
  Assamese: [Assamese translation of the sentence above]
• English: A part of the City Palace is now converted into a museum that displays some of the best forms of art and culture.
  Assamese: [Assamese translation of the sentence above]
22. Sample output
English sentences as input | Corresponding output in Assamese
• Input: Kanak Vrindavan is a popular picnic spot in Jaipur
  Output: [Assamese output for the sentence above]
• Input: City Palace is a synthesis of Mughal and Rajasthani architecture
  Output: [Assamese output for the sentence above]
• Input: Jama Masjid is the largest mosque in India
  Output: [Assamese output for the sentence above]
• Input: A part of the City Palace is now converted into a museum
  Output: [Assamese output for the sentence above]
24. Results and evaluation
- The output of the experiment was evaluated using
  BLEU (Bilingual Evaluation Understudy).
- For Assamese-English translation with the same parallel
  corpus, a BLEU score of 5.72 is obtained. This is very low,
  possibly because the data set used is very small.
• BLEU scores are not comparable even between different corpora in
  the same translation direction. BLEU is really only comparable for
  different systems or system variants on exactly the same data.
Source/Target       BLEU Score
English-Assamese    4.71
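For orientation, the sketch below shows how a corpus-level BLEU score can be computed: clipped n-gram precisions for n = 1 to 4, their geometric mean, and a brevity penalty, scaled to 0-100. It assumes a single reference per sentence, whitespace tokenization, and no smoothing, so it will not exactly reproduce the scores reported above or those of the Moses evaluation script.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), one reference per hypothesis, unsmoothed."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hypothesis, reference in zip(hypotheses, references):
        hyp, ref = hypothesis.split(), reference.split()
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngram_counts(hyp, n), ngram_counts(ref, n)
            # Clip each n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


print(corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"]))
```

With very small test sets a single missing 4-gram match drives the unsmoothed score to zero, which is one reason scores on small corpora are hard to interpret.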
25. Problems and proposed solution
Problems
- We used only a limited amount of English-Assamese parallel
  corpus. The translation model is therefore less effective,
  since quality improves as more training data (parallel corpus)
  becomes available.
- It is difficult to get good translations for OOV
  (out-of-vocabulary) words, because the Moses tool either leaves
  OOV words untranslated or drops them. We are trying to
  implement transliteration for these OOV words. OOV words are
  words that are not present in the training corpus, such as
  some proper nouns.
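Detecting OOV words before translation can be sketched as a simple vocabulary lookup. The training sentences and the function name below are hypothetical; in practice the vocabulary would be collected from the source side of the full training corpus.

```python
def find_oov_words(sentence, training_sentences):
    """Return the words of `sentence` that never occur in the training data."""
    vocabulary = {word for line in training_sentences for word in line.lower().split()}
    return [word for word in sentence.lower().split() if word not in vocabulary]


training_source_side = [
    "Udaipur is a famous city",
    "September to March is the best season to visit Udaipur",
]
print(find_oov_words("panaji is a city", training_source_side))
```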
26. Problems and proposed solution
Solution
Transliteration
• কু-মা-ৰ -> (ku-ma-r); ৰাজ কু মা ৰ -> (Raj-ku-ma-r)
• Example: English-Assamese transliteration
  For example, to translate the sentence "panaji is a city", we
  used the following command to incorporate transliteration
  into translation:
  echo 'panaji is a city'| ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini |
  ./space.pl | ~/mymoses/bin/moses -f ~/work1/train/model/moses.ini | ./join.pl
  This gives the output:
  [Assamese output for "panaji is a city"]
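The scripts space.pl and join.pl are not shown in these slides. Assuming they split the first decoder's output into characters for a character-level transliteration model and then merge the characters back into words, Python equivalents might look like the sketch below. The function names and the "_" word-boundary marker are assumptions.

```python
def space_characters(text):
    """Separate the characters of each word; mark word boundaries with '_'."""
    return " _ ".join(" ".join(word) for word in text.split())


def join_characters(spaced_text):
    """Undo space_characters: merge characters and restore word boundaries."""
    return "".join(" " if token == "_" else token for token in spaced_text.split())


spaced = space_characters("panaji is a city")
print(spaced)
print(join_characters(spaced))
```

Note that this splits on code points, whereas the transliteration example on this slide works with syllable-like units (ku-ma-r), so a real implementation for Assamese script would need a script-aware segmentation.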
27. Conclusion and Future Work
• The work can be extended in the following directions.
  - The system can be deployed in a web-based portal to translate the content
    of a web page from English to Assamese.
  - We will try to enlarge the corpus to improve training and translation quality.
  - We will try to develop the translation system on our own instead of using
    the Moses MT system.
  - Since most Indian languages follow SOV order and are relatively rich in
    morphology, the methodology presented should be applicable to
    English to Indian language SMT in general. Since morphological and
    parsing tools are not widely available for Indian languages, an
    approach like this, which minimizes the use of such tools for the target
    language, would be quite handy.
28. Conclusion and Future Work
• We will try to obtain more corpora from different domains so
  that they cover a wider range of vocabulary. Since BLEU is not
  well suited to rough translations, other evaluation
  techniques are also needed.
• We should try incorporating shallow syntactic
  information (POS tags) in our discriminative model to boost
  translation performance.
29. References
[1] Antony P. J., "Machine Translation Approaches and Survey for Indian Languages".
[2] G. Singh and G. Singh Lehal, "A Punjabi to Hindi Machine Translation System", Coling 2008: Companion
    volume - Posters and demonstrations, Manchester, August 2008.
[3] F. J. Och, "GIZA++: Training of statistical translation models", [Online]. Available:
    http://fjoch.com/GIZA++.html
[4] Moses Manual.
[5] "Natural language processing", Wikipedia, the free encyclopedia.
[6] D. D. Rao, "Machine Translation: A Gentle Introduction", Resonance, July 1998.
[7] "Statistical machine translation", [Online]. Available:
    http://en.wikipedia.org/wiki/Statistical_machine_translation
[8] S. K. Dwivedi and P. P. Sukhadeve, "Machine Translation System in Indian Perspectives",
    Journal of Computer Science, Vol. 6, No. 10, pp. 1082-1087, May 2010.
[9] "Machine Translation", [Online]. Available:
    http://www.ida.liu.se/~729G11/projekt/studentpapper-10/maria-hedblom.pdf
[10] "Machine Translation", [Online]. Available:
    http://faculty.ksu.edu.sa/homiedan/Publications/Machine%20Translation.pdf