EFFECT OF MORPHOLOGICAL
SEGMENTATION & DE-SEGMENTATION ON
MACHINE TRANSLATION
Sunayana R. Gawde
14109, M.Tech Part II
RECAP
 “Simple Syntactic and Morphological Processing
Can Help English-Hindi Statistical Machine
Translation” by Ramanathan et al.; IJCNLP 2008.
 “Statistical Machine Translation into a
Morphologically Complex Language” by Oflazer et
al.; CICLing 2008.
 “Combining Morpheme-based Machine Translation
with Post-processing Morpheme Prediction” by
Sarkar et al.; ACL 2011
MORPHOLOGICAL DE-SEGMENTATION
 De-segmentation is the process of converting
segmented words into their original surface form.
 Concatenation, Rules or Table look-up
 Segmentation: a sparsity-reduction technique
 eat+ing
 dinner+s
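The three de-segmentation strategies named above can be sketched as follows. This is a minimal illustration, not the method of any of the cited papers: the `+` boundary marker follows the examples on this slide, while the function names and the toy training pairs are assumptions made up for the example. Table look-up is tried first and plain concatenation is the fallback.

```python
from collections import Counter, defaultdict

def build_table(pairs):
    """Build a de-segmentation table from (segmented, surface) pairs.

    For each segmented form, keep the surface form seen most often.
    """
    counts = defaultdict(Counter)
    for segmented, surface in pairs:
        counts[segmented][surface] += 1
    return {seg: c.most_common(1)[0][0] for seg, c in counts.items()}

def desegment(token, table):
    """Table look-up first; fall back to simple concatenation."""
    if token in table:
        return table[token]
    # Concatenation: drop the '+' morpheme boundary marker.
    return token.replace("+", "")

# Toy (hypothetical) training pairs where concatenation alone would be wrong.
table = build_table([("make+ing", "making"), ("city+s", "cities")])

print(desegment("make+ing", table))  # handled by the table
print(desegment("eat+ing", table))   # handled by concatenation
```

The table covers forms whose spelling changes at the morpheme boundary; unseen forms still get a surface form through concatenation.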
LATTICE DE-SEGMENTATION FOR STATISTICAL
MACHINE TRANSLATION
 By Mohammad Salameh, Colin Cherry, Grzegorz
Kondrak
 Published in Proceedings of the 52nd Annual
Meeting of the Association for Computational
Linguistics 2014
 English-to-Arabic and English-to-Finnish translation
LATTICE
 A word lattice G = (V, E) is a directed acyclic graph
that can formally be viewed as a weighted finite-state automaton
(FSA).
 Exactly one node has no outgoing edges; it is
called the ‘end node’.
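A small sketch of the definition above, assuming a toy lattice whose edge labels and weights are invented for illustration. It stores the lattice as an adjacency list, finds the node with no outgoing edges, and enumerates every path with the product of its edge weights.

```python
from collections import defaultdict

# Toy lattice edges: (from_node, to_node, label, weight). Values are made up.
edges = [
    (0, 1, "eat", 0.6),
    (1, 2, "+ing", 0.9),
    (0, 2, "eating", 0.4),
    (2, 3, "dinner", 1.0),
]

def build(edges):
    """Return the node set and an adjacency list of outgoing edges."""
    out = defaultdict(list)
    nodes = set()
    for u, v, label, w in edges:
        out[u].append((v, label, w))
        nodes.update((u, v))
    return nodes, out

def end_nodes(nodes, out):
    """Nodes with no outgoing edges; a well-formed lattice has exactly one."""
    return [n for n in sorted(nodes) if not out.get(n)]

def paths(out, node, end):
    """Yield (labels, score) for every path from node to end."""
    if node == end:
        yield [], 1.0
        return
    for nxt, label, w in out[node]:
        for labels, score in paths(out, nxt, end):
            yield [label] + labels, w * score

nodes, out = build(edges)
end = end_nodes(nodes, out)[0]
all_paths = list(paths(out, 0, end))
for labels, score in all_paths:
    print(" ".join(labels), round(score, 2))
```

Full enumeration is only feasible for toy inputs; the number of paths can grow exponentially with lattice size, which is why decoders work on the graph directly.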
EXAMPLES:
GENERALIZING WORD LATTICE TRANSLATION
 By Christopher Dyer, Smaranda Muresan, Philip
Resnik
 In Proceedings of the Association for Computational
Linguistics, 2008
 Chinese to English and Arabic to English
translation.
THE CHART-REPRESENTATION OF THE GRAPH
WORD LATTICE DECODING
 Two classes of translation models for lattice
translation:
 Finite-state transducer (FST) based phrase-based
models.
 Synchronous CFG (SCFG) based hierarchical phrase-based models.
LATTICE TRANSLATION WITH FST BASED
PHRASE BASED MODELS
 Phrase based models
 Splitting the sentence and creating phrases
 Choosing the path from lattice
 Moses phrase-based decoder to translate word
lattices
 Left to right parsing of Lattice
SYNCHRONOUS CONTEXT FREE GRAMMAR
 Source-Target synchronous rules
 Parse the input using source language grammar
 Simultaneously build a tree on target language
EFFECT OF WORD LATTICES
 Improvement in BLEU score
 Decrease in OOV words
 Poor Coverage of Named Entities
LATTICE DE-SEGMENTATION FOR STATISTICAL
MACHINE TRANSLATION
 By Mohammad Salameh, Colin Cherry, Grzegorz
Kondrak
 Published in Proceedings of the 52nd Annual
Meeting of the Association for Computational
Linguistics 2014
 English-to-Arabic and English-to-Finnish translation
GOAL
 De-segment the decoder’s output lattice
 Gain access to a compact, de-segmented view of a
large portion of the translation search space
 Morphemes De-segmenting Transducer De-
segmented words
 Lattice Specific Table Finite State Transducer
APPROACH:
 Baseline (Without Segmentation)
 1-best De-segmentation: Segmentation at training
and de-segmentation after decoding
 N-best De-segmentation: De-segments, augments
and re-ranks the decoder’s 1000-best list.
 Lattice De-segmentation: Covers an exponential number of
hypotheses
 The search graph of a phrase-based decoder can be
interpreted as a lattice.
 De-segmenting Transducer
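The N-best variant listed above can be sketched as follows. This is only an illustration under stated assumptions: suffix morphemes are marked with a leading `+`, de-segmentation is plain concatenation, and `toy_feature` is a hypothetical stand-in for the extra features (such as an unsegmented language model score) used to augment hypotheses. The scores are invented.

```python
def desegment_sentence(segmented):
    """Attach '+'-marked suffix tokens to the previous word."""
    words = []
    for tok in segmented.split():
        if tok.startswith("+") and words:
            words[-1] += tok[1:]
        else:
            words.append(tok)
    return " ".join(words)

def rerank(nbest, extra_feature, weight=1.0):
    """De-segment each hypothesis, add a feature score, merge, and sort.

    Hypotheses that de-segment to the same string are merged,
    keeping the best total score.
    """
    best = {}
    for segmented, decoder_score in nbest:
        surface = desegment_sentence(segmented)
        total = decoder_score + weight * extra_feature(surface)
        if surface not in best or total > best[surface]:
            best[surface] = total
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

def toy_feature(surface):
    """Hypothetical feature: a small penalty per surface word."""
    return -0.1 * len(surface.split())

# Toy n-best list of (segmented hypothesis, decoder log-score).
nbest = [
    ("he is eat +ing dinner +s", -2.0),
    ("he eat +ing dinner", -2.3),
    ("he is eating dinner +s", -2.5),
]

ranked = rerank(nbest, toy_feature)
for surface, score in ranked:
    print(round(score, 2), surface)
```

The first and third hypotheses collapse to the same surface sentence, which shows why an n-best list covers fewer distinct de-segmented outputs than its size suggests.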
ENGLISH TO ARABIC
 Table based De-Segmentor
ENGLISH TO FINNISH
 Simple concatenation
PROPOSED APPROACH TO IMPROVE
TRANSLATION QUALITY
 Translation from Multi-parallel sources
 English, Hindi, Konkani & Marathi
 Morphological Segmentation- to reduce data
sparsity
 Morfessor / Morph Analyser
 Morphological De-Segmentation
 Named Entity Tagger
 Cognates
PROPOSED WORK
 To study and experiment with the effect of Morphological
Segmentation & De-segmentation on Phrase-Based
Statistical Machine Translation
 Before evaluation
 Before decoding
 Before phrase extraction
 Implement on English to Konkani and Hindi to
Konkani translation systems.
 Evaluate with BLEU and METEOR
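For the evaluation step, output must be de-segmented before scoring so that BLEU is computed on surface words. The sketch below is a simplified corpus-level BLEU written for illustration: it assumes a single reference per sentence, whitespace tokenization, and no smoothing, so its numbers will not necessarily match those of a standard evaluation script.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU: clipped n-gram precision and brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its reference count.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # no smoothing: any zero precision gives BLEU 0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)

score = corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"])
print(round(100 * score, 2))
```

Scoring segmented output directly would count morphemes instead of words and make scores incomparable with the unsegmented baseline.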
CURRENT STATUS
 Got familiar with basics of Moses
 Developed a Baseline System as suggested on
Moses website with their corpus
 Developed a basic English-Hindi translation system
using parallel data available online, with a BLEU score
of only 5.31.
 Hindi to Konkani Translation system for 3k
sentences of ILCI with BLEU score of 27.3
NEXT..
 Get the parallel data, which is not in Unicode
format, into text files.
 Align the data.
 Identify the Named Entities.
 Morphological Segmentation for Konkani.
 Morphological De-segmentation for Konkani
 Test the improvement in BLEU score.
REFERENCES
 “Lattice De-segmentation for Statistical Machine
Translation” by Mohammad Salameh, Colin Cherry,
Grzegorz Kondrak in ACL 2014.
 “Generalizing Word Lattice Translation” by
Christopher Dyer, Smaranda Muresan, Philip
Resnik in ACL 2008.
THANK YOU

Effect of morphological segmentation & de-segmentation on machine translation Part2
