This document summarizes a paper that explores relearning a rule-based machine translation (RBMT) system with statistical methods. It compares the original SYSTRAN RBMT system, a relearnt statistical model of SYSTRAN called SYSTRAN Relearnt, and a baseline statistical model called SYSTRAN Relearnt-0. The statistical models are trained without human-translated parallel corpora, using SYSTRAN translations as the target side instead. Evaluation shows SYSTRAN Relearnt achieves 5 BLEU points more than the baseline by using a real English language model and tuning set. An error analysis of 100 sentences counts common error types such as missing words, extra words, and translation choice, in order to discriminate between the statistical nature of the systems and the effect of their training data.
Summary of Can we relearn an RBMT system?
Hiroshi Matsumoto
Nagaoka University of Technology EEI Dept.
March 5, 2013
Outline
1 About this paper
2 Introduction
3 Systems
4 Models
5 Results
About this paper:
Title: Can we relearn an RBMT system?
Authors: Loïc Dugast, Jean Senellart and Philipp Koehn
Booktitle: Proceedings of the Third Workshop on Statistical
Machine Translation
Pages: 175-178
Year: 2008
Organization: Association for Computational Linguistics
Introduction
Two major research approaches:
1 Rule-based Systems
Manually written rules associated with bilingual dictionaries
2 Statistical Machine Translation
Statistical framework based on large amounts of monolingual
and parallel corpora
Aims of this research:
finding efficient combination setups
discriminating strengths/weaknesses of rule-based and
statistical systems
Systems
SYSTRAN:
a pure rule-based system
SYSTRAN Relearnt:
a statistical model of the rule-based engine
Relearnt uses a real English language model
SYSTRAN Relearnt-0:
a plain statistical model of SYSTRAN
Moses: the open-source phrase-based SMT toolkit used to train the relearnt models
Models
Training w/o human ref. translation
Problem
Statistical models rely on parallel corpora, which are not
always available.
Existing workarounds include domain adaptation and statistical
post-editing.
Here, the authors propose a new solution.
Submitted system:
The source-language (SL) side of a parallel corpus was
translated with the rule-based engine to produce the target
side of the training data
The LM was trained on the real target-language (TL) data
Non-submitted system:
Both corpora were built from newspaper text
The SL corpus was translated by the rule-based system to produce
the parallel training data, while the TL corpus was used to train
a LM
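The relearning setup above can be sketched in a few lines of Python. This is a minimal sketch, not the paper's pipeline: `rbmt_translate` is a hypothetical toy stand-in for the SYSTRAN engine, here just a word-for-word dictionary lookup.

```python
# Sketch of building synthetic parallel training data from an RBMT engine.
# rbmt_translate is a hypothetical stand-in for the SYSTRAN rule-based system.

def rbmt_translate(sentence: str) -> str:
    """Toy word-for-word 'rule-based' translation (French -> English)."""
    toy_dictionary = {"le": "the", "chat": "cat", "dort": "sleeps"}
    return " ".join(toy_dictionary.get(w, w) for w in sentence.split())

def build_synthetic_corpus(source_sentences):
    """Pair each source sentence with its rule-based translation.

    The resulting (source, target) pairs serve as the parallel training
    data for the statistical model; no human reference is needed.
    """
    return [(src, rbmt_translate(src)) for src in source_sentences]

parallel = build_synthetic_corpus(["le chat dort"])
print(parallel)  # [('le chat dort', 'the cat sleeps')]
```

The statistical engine then trains on these pairs exactly as it would on a human-translated corpus, while the language model can still come from real target-language text.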
Results
Results #1
Baseline vs. Relearnt-0:
The Relearnt-0 model scores slightly below the rule-based original
Relearnt vs. Relearnt-0:
Relearnt gains 5 BLEU points over Relearnt-0 by using a real
English language model and tuning set
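To make the "5 BLEU points" figure concrete, here is a minimal sentence-level BLEU implementation in plain Python (single reference, uniform n-gram weights). It is an illustrative sketch of the metric, not the corpus-level, smoothed BLEU used in the paper's evaluation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Minimal sentence-level BLEU (single reference, uniform weights)."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # unsmoothed: any zero precision zeroes the score
        precisions.append(overlap / total)
    # Brevity penalty punishes hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(round(100 * bleu("the cat sleeps on the mat",
                       "the cat sleeps on the mat"), 1))  # 100.0
```

A perfect match scores 100; each missed n-gram or length mismatch pulls the score down, so a 5-point gap between two systems on the same test set is a substantial difference.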
Results #2
To discriminate between the statistical nature of a translation
system and the fact that it was trained on the relevant domain,
the authors defined 11 error types and
counted occurrences over 100 randomly picked sentences
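The counting step amounts to tallying per-sentence error annotations over a fixed taxonomy. A minimal sketch, assuming hand-made annotations (the type names and sample data below are hypothetical; the slides show only four of the paper's 11 types):

```python
from collections import Counter

# Hypothetical error taxonomy; the paper defines 11 types in total.
ERROR_TYPES = ["missing word", "extra word", "unknown word", "translation choice"]

# Hypothetical annotations: each analysed sentence lists the error
# types a human judge found in its translation.
annotations = [
    ["missing word"],
    ["extra word", "translation choice"],
    [],  # a sentence with no errors
    ["translation choice"],
]

# Flatten the per-sentence lists and count occurrences of each type.
counts = Counter(err for sent in annotations for err in sent)
for err_type in ERROR_TYPES:
    print(f"{err_type}: {counts[err_type]}")
```

Comparing these counts across the rule-based, relearnt, and baseline outputs is what lets the authors attribute each error profile to the system's nature rather than its training domain.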
Missing words
A typical statistical error, but no evidence was found here
Extra words
Rule-based systems tend to produce something extra
Unknown words
Words absent from the rule-based system's dictionaries
Translation choice
A strength of statistical systems
Results #3