This presentation studies the effect of morphological segmentation and de-segmentation on the quality of machine translation, with a focus on English-Konkani translation.
2. RECAP
Concept Definition
Overview of existing tools such as Mantra, Anuvaadak,
MAT, etc.
Machine Translation for E-GOV - Baltic Case
Machine Translation approaches
Classification of Government Documents
Proposed an idea of Structured Translation.
3. ISSUES IN MACHINE TRANSLATION
Disambiguation (WSD)
Non-Standard Speech
Named Entities
Translate
Transliterate
4. IMPROVING TRANSLATION QUALITY
Translation from Multi-parallel sources
Morphological Segmentation
Morphological De-Segmentation
Ontologies in MT
5. MORPHEMES
Smallest unit of language which has meaning.
Any word form can be expressed as a combination
of morphemes.
affect+ion+ate
dinner+s
eat+ing
king+'s
open+mind+ed+ness.
6. MORPHOLOGICAL SEGMENTATION
Morphological segmentation transforms the
sentence by segmenting relevant morphemes,
which are then handled as regular tokens during
alignment and translation.
The aim is to reduce data sparsity and to improve
correspondence with the source language (usually
English).
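A minimal sketch of segmentation as a preprocessing step. The suffix list, the "+" marker on split-off suffixes and the function names are illustrative assumptions, not taken from any of the systems discussed here; a real pipeline would use a morphological analyser or an unsupervised tool such as Morfessor.

```python
# Toy suffix-stripping segmenter. SUFFIXES and the "+" marker are
# illustrative assumptions, not a real morphological analyser.
SUFFIXES = ["ness", "ing", "ed", "s"]


def segment_word(word, min_stem=3):
    """Peel known suffixes off the end of a word, longest match first."""
    suffixes = []
    changed = True
    while changed:
        changed = False
        for suf in sorted(SUFFIXES, key=len, reverse=True):
            # Keep at least min_stem characters so "king" is not split into "k +ing".
            if word.endswith(suf) and len(word) - len(suf) >= min_stem:
                suffixes.insert(0, "+" + suf)
                word = word[: -len(suf)]
                changed = True
                break
    return [word] + suffixes


def segment_sentence(sentence):
    """Replace every word by its morpheme tokens; the result is a flat token list."""
    tokens = []
    for word in sentence.split():
        tokens.extend(segment_word(word))
    return tokens


print(segment_sentence("eating dinners"))
```

After this step the aligner and decoder see "+ing" and "+s" as ordinary tokens, so a rare word form is broken into pieces that occur far more often in the training data.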
7. MORPHOLOGICAL DE-SEGMENTATION
De-segmentation is the process of converting
segmented words into their original surface form.
For many segmentations, especially unsupervised
ones, this amounts to simple concatenation.
Badr et al. (2008) proposed two schemes:
table-based and
rule-based
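The two schemes can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation of Badr et al.: the "+" suffix marker, the toy English spelling rules and all function names are hypothetical. The table maps a morpheme sequence seen in training to its surface word; the rules repair the boundary for unseen sequences; anything left over is plain concatenation.

```python
import re


def group_words(tokens):
    """Group a morpheme token stream into words; '+x' attaches to the previous token."""
    words = []
    for tok in tokens:
        if tok.startswith("+") and words:
            words[-1].append(tok[1:])
        else:
            words.append([tok.lstrip("+") or tok])
    return words


def build_table(pairs):
    """Table-based scheme: pairs of (surface_word, [morphemes]) seen in training data."""
    return {tuple(morphemes): surface for surface, morphemes in pairs}


# Rule-based scheme: toy spelling rules applied at the final morpheme boundary.
RULES = [
    (re.compile(r"y\+s$"), "ies"),    # city +s   -> cities
    (re.compile(r"e\+ing$"), "ing"),  # make +ing -> making
]


def apply_rules(morphemes):
    joined = "+".join(morphemes)
    for pattern, replacement in RULES:
        joined = pattern.sub(replacement, joined)
    return joined.replace("+", "")  # remaining boundaries: simple concatenation


def desegment(tokens, table=None):
    """Look each word up in the table first, then fall back to the rules."""
    out = []
    for morphemes in group_words(tokens):
        key = tuple(morphemes)
        if table and key in table:
            out.append(table[key])
        else:
            out.append(apply_rules(morphemes))
    return " ".join(out)


table = build_table([("running", ["run", "ing"])])
print(desegment(["city", "+s", "run", "+ing", "eat", "+ing"], table))
```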
8. SIMPLE SYNTACTIC AND MORPHOLOGICAL
PROCESSING CAN HELP ENGLISH-HINDI
STATISTICAL MACHINE TRANSLATION
By Ananthakrishnan Ramanathan, Pushpak
Bhattacharyya, Jayprasad Hegde, Ritesh M. Shah,
Sasikumar M.
ACL 2014
Re-ordering (3.8)
Transliteration (4.8)
Using suffixes of Hindi (Morfessor 2.0)
BLEU from 12.10 to 15.88
9. STATISTICAL MACHINE TRANSLATION INTO A
MORPHOLOGICALLY COMPLEX LANGUAGE
Kemal Oflazer
CICLing 2008 (Conference on Intelligent Text
Processing & Computational Linguistics)
Phrase-based SMT from English into Turkish
Improved BLEU score by 7.10 points i.e. from
19.77 to 26.87
Moses toolkit
SRILM language modelling toolkit.
10. APPROACH
Representing both English and Turkish at the
morpheme-level but with some selective
morpheme-grouping on the Turkish side of the
training data
Re-ranking the n-best morpheme-sequence outputs
of the decoder with a word-based language model
“Repairing” translated words with incorrect
morphological structure and words which are
out-of-vocabulary relative to the training and the
language model corpus
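The re-ranking step can be illustrated with a small sketch. It assumes a "+" marker on suffix morphemes, uses an add-one smoothed unigram model as a stand-in for a real word-based n-gram language model (for example one built with SRILM), and combines scores with a single hypothetical weight; none of these details come from the paper.

```python
import math
from collections import Counter


def join_morphemes(tokens):
    """Concatenate '+'-marked suffix tokens onto the preceding token."""
    words = []
    for tok in tokens:
        if tok.startswith("+") and words:
            words[-1] += tok[1:]
        else:
            words.append(tok)
    return words


class UnigramWordLM:
    """Add-one smoothed unigram model over surface words (stand-in for a real LM)."""

    def __init__(self, corpus_words):
        self.counts = Counter(corpus_words)
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen words

    def logprob(self, words):
        return sum(
            math.log((self.counts[w] + 1) / (self.total + self.vocab))
            for w in words
        )


def rerank(nbest, lm, lm_weight=1.0):
    """nbest: list of (morpheme_tokens, decoder_score). Returns (score, text), best first."""
    scored = []
    for tokens, decoder_score in nbest:
        words = join_morphemes(tokens)
        total = decoder_score + lm_weight * lm.logprob(words)
        scored.append((total, " ".join(words)))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


lm = UnigramWordLM("the houses are big the houses".split())
nbest = [
    (["the", "house", "+ses"], -1.0),  # decoder's favourite, but not a real word
    (["the", "house", "+s"], -1.2),
]
print(rerank(nbest, lm)[0][1])
```

The word-level model penalises morpheme sequences that concatenate into words it has never seen, which is something a morpheme-level model cannot check.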
11. LATTICE DE-SEGMENTATION FOR STATISTICAL
MACHINE TRANSLATION
By Mohammad Salameh, Colin Cherry, Grzegorz
Kondrak
Published in Proceedings of the 52nd Annual
Meeting of the Association for Computational
Linguistics 2014
English-to-Arabic and English-to-Finnish translation
12. APPROACH:
Baseline (Without Segmentation)
1-best De-segmentation: Segmentation at Training
& De-segmentation after decoding
N-best De-segmentation: De-segments, augments
and re-ranks the decoder’s 1000-best list.
Lattice De-segmentation: Covers an exponential number of
hypotheses
The search graph of a phrase-based decoder can be
interpreted as a lattice.
De-segmenting Transducer
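To see why a lattice covers far more hypotheses than a 1000-best list, the sketch below counts the paths through a toy lattice. The dictionary representation and the helper names are assumptions made for illustration; this is not the search-graph format of any particular decoder.

```python
from functools import lru_cache


def count_paths(lattice, start, end):
    """Count distinct start-to-end paths in an acyclic lattice.

    lattice maps a node to a list of (label, next_node) edges.
    """

    @lru_cache(maxsize=None)
    def paths_from(node):
        if node == end:
            return 1
        return sum(paths_from(nxt) for _label, nxt in lattice.get(node, []))

    return paths_from(start)


def toy_lattice(n_positions, alternatives=("a", "b")):
    """Chain of n positions, each offering the same alternative edge labels."""
    return {i: [(lab, i + 1) for lab in alternatives] for i in range(n_positions)}


lattice = toy_lattice(20)
n_edges = sum(len(edges) for edges in lattice.values())
print(n_edges, count_paths(lattice, 0, 20))
```

With two alternatives at each of 20 positions, 40 edges encode 2**20 distinct hypotheses, so processing the lattice directly reaches candidates that an n-best list of practical size would never contain.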
15. COMBINING MORPHEME-BASED MACHINE
TRANSLATION WITH POST-PROCESSING
MORPHEME PREDICTION
By Ann Clifton and Anoop Sarkar
Proceedings of the 49th Annual Meeting of the
Association for Computational Linguistics 2011
English-Finnish
BLEU improves from 14.82 to 15.09.
16. APPROACH
Idea of segmented translation where they explicitly
allow phrase pairs that can end with a dangling
morpheme, which can connect with other
morphemes as part of the translation process
Use of a fully segmented translation model in
combination with a post-processing morpheme
prediction system, using unsupervised morphology
induction.
17. Baselines:
Word Based
Factored (Unsupervised)
Segmented Models (Supervised)
Segmentation using Morfessor (Unsupervised)
18. PROPOSED WORK
To study and experiment with the effect of Morphological
Segmentation & De-segmentation on Phrase-Based
Statistical Machine Translation
Before evaluation
Before decoding
Before phrase extraction
Implement on English to Konkani and Hindi to
Konkani translation systems.
Evaluate with BLEU and METEOR
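For reference, a simplified BLEU computation is sketched below. It assumes a single reference per sentence, uniform weights over 1- to 4-grams and no smoothing, so its numbers will not necessarily match those of standard scoring scripts; the function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) with one reference per sentence, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


hyp = ["a b c d e".split()]
ref = ["a b c d f".split()]
print(round(corpus_bleu(hyp, ref), 2))
```

When comparing a segmented system against a word-based baseline, both outputs should be scored as surface words, which is why de-segmentation has to happen before evaluation at the latest.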
19. CURRENT STATUS
Became familiar with the basics of Moses
Developed a baseline system as suggested on the
Moses website, using their corpus
Developed a basic English-Hindi translation system
using parallel data available online, with a BLEU score
of only 5.31.
20. NEXT..
Convert the parallel data, which is not in
Unicode format, into text files.
Align the data.
Identify the Named Entities.
Morphological Segmentation for Konkani.
Test the improvement in BLEU score.
21. REFERENCES
“The IIT Bombay SMT System for ICON 2014 Tools
Contest” by Anoop Kunchukuttan, Ratish Puduppully,
Rajen Chatterjee, Abhijit Mishra, Pushpak
Bhattacharyya.
“Statistical Machine Translation into a Morphologically
Complex Language” by Kemal Oflazer in CICLing 2008.
“Lattice De-segmentation for Statistical Machine
Translation” by Mohammad Salameh, Colin Cherry,
Grzegorz Kondrak in ACL 2014.
“Combining Morpheme-based Machine Translation with
Post-processing Morpheme Prediction” by Ann Clifton
and Anoop Sarkar in ACL 2011.