SMART Final Review Meeting
Confidence Estimation
Work by: Lucia Specia, Craig Saunders, Marco Turchi, Zhuoran Wang, John Shawe-Taylor, Nello Cristianini
November 2009
Contributors
Lucia Specia, Craig Saunders (XEROX)
Marco Turchi, Nello Cristianini (UoB)
Zhuoran Wang, John Shawe-Taylor (UCL)
Addressed Problem
We address the problem of predicting the quality of a translation, assigning it to a class from a finite set. This is done without using a human reference translation (so it is quality estimation, not evaluation). We only use features that are available to the translation software.
Addressed Problem
One target scenario is that of professional translators post-editing MT segments. In that scenario, the simplest and possibly most effective form of a segment-level quality estimate is a binary “good” or “bad” score, where translations judged as “bad” are not suggested for post-editing.
Addressed Problem
We propose a score that is estimated with a machine learning technique from a collection of information sources and from translations annotated with 1-4 quality scores. A regression algorithm produces a continuous score, which is then thresholded into the classes to filter out “bad” translations. Differently from previous work, we can define this threshold dynamically.
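As an illustration of this two-stage design (regression followed by thresholding), here is a minimal sketch in Python; the function name and the example threshold value are our own assumptions, not taken from the slides.

```python
import numpy as np

def filter_bad_translations(scores, threshold):
    """Keep only segments whose predicted 1-4 quality score
    passes a (possibly dynamically chosen) threshold."""
    scores = np.asarray(scores)
    return scores >= threshold

# Example: predicted continuous quality scores for five MT segments.
predicted = [1.4, 3.2, 2.7, 3.9, 1.9]

# A dynamic threshold could be tuned on held-out data; here we
# simply pick 2.5 ("post-editing quicker than retranslation").
keep = filter_bad_translations(predicted, threshold=2.5)
print(keep)  # [False  True  True  True False]
```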
Data
To train the machine learning system, we use two types of data: (a) translations automatically annotated with NIST scores, and (b) translations produced by different MT systems and for multiple language pairs, manually annotated with different types of scores. We use the English-Spanish parallel corpus provided by WMT-08 (Callison-Burch et al., 2008) and translate 4K Europarl sentences from the development and test sets, also provided by WMT-08.
Data
For each system, translations are manually annotated by professional translators with 1-4 quality scores, which they commonly use to indicate the quality of translations with respect to the need for post-editing:
1. requires complete retranslation
2. post-editing quicker than retranslation
3. little post-editing needed
4. fit for purpose
Features
We used a total of 84 features of two different types:
Black-box: features that do not depend on any aspect of the translation process, i.e. features that can be extracted from any MT system given only the input (source) and translation (target) sentences, and possibly monolingual or parallel corpora.
Glass-box: internal properties of the translation process (in general only available for open-source systems).
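For concreteness, here is a minimal sketch of a few typical black-box features (sentence lengths, length ratio, punctuation count). These particular features are illustrative assumptions on our part, not necessarily among the 84 used in the study.

```python
import string

def black_box_features(source, target):
    """Compute a few simple black-box features from the source and
    target sentences alone (no access to the MT system internals)."""
    src_tokens = source.split()
    tgt_tokens = target.split()
    return {
        "src_length": len(src_tokens),
        "tgt_length": len(tgt_tokens),
        "length_ratio": len(tgt_tokens) / max(len(src_tokens), 1),
        "tgt_punct": sum(ch in string.punctuation for ch in target),
    }

print(black_box_features("the house is small .", "la casa es pequeña ."))
```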
Machine Learning Methods
Since we used a large number of correlated features, we resorted to Partial Least Squares (PLS), a regression method traditionally used in chemometrics, where this situation is common. PLS projects the original data onto a space of latent variables (or “components”); the prediction then reduces to an ordinary multiple regression problem in that space.
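A minimal sketch of PLS regression using scikit-learn's PLSRegression (our choice of library; the slides do not name an implementation), with synthetic stand-ins for the 84 features and the 1-4 quality scores:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 84))     # 500 segments, 84 features
y = rng.uniform(1, 4, size=500)    # stand-in 1-4 quality scores

# Project onto a small number of latent components, then regress.
pls = PLSRegression(n_components=10)
pls.fit(X, y)

y_pred = pls.predict(X[:5]).ravel()  # continuous quality estimates
print(y_pred)
```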
We take advantage of a property of PLS, namely the ordering of the features according to their relevance, to select subsets of discriminative features.
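One simple way to realise such a ranking (an assumption on our part; the slides do not specify the exact criterion) is to order features by the magnitude of their PLS regression coefficients and refit on the top-k, continuing the sketch above:

```python
# Continuing the PLS example: rank features by |coefficient|.
ranking = np.argsort(-np.abs(pls.coef_.ravel()))
top_k = ranking[:20]               # keep the 20 most relevant features

# Refit the model using only the selected feature subset.
pls_small = PLSRegression(n_components=10).fit(X[:, top_k], y)
```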
Results
The figures for the subsets of features consistently outperform those for using all features and are also more stable (lower standard deviations). Using only the selected features, predictions deviate on average ~0.618-0.68 from the true score. We cannot compare to any previous study, but we are making this data available (to ensure future comparability).
In all experiments, feature selection improves performance. The correlation between our predictor and human evaluation is higher than the correlation between BLEU/METEOR/NIST and TER and human evaluation.
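A minimal sketch of how such a correlation comparison can be computed, using Pearson's r from scipy; the score values below are placeholders of ours, not the study's data:

```python
from scipy.stats import pearsonr

human = [4, 2, 3, 1, 3, 4, 2]                    # placeholder human 1-4 scores
ours  = [3.8, 2.1, 2.9, 1.4, 3.2, 3.6, 2.3]      # placeholder predictor output
bleu  = [0.4, 0.3, 0.2, 0.1, 0.5, 0.3, 0.2]      # placeholder BLEU scores

print("predictor vs human:", pearsonr(human, ours)[0])
print("BLEU vs human:     ", pearsonr(human, bleu)[0])
```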
Comparing with the State of the Art
The problem is a standard one, and many methods have been tried. This study makes use of a machine learning algorithm, Partial Least Squares. Results are very positive, but there is no other specific study with which we can compare (we will make ourselves comparable by releasing the data).
User Trials and Demos
It works well in practice: we have deployed it (best results in a CAT user evaluation) and on a Xerox platform (used by real customers).
Dissemination
Publications:
Lucia Specia, Marco Turchi, Nicola Cancedda, Marc Dymetman and Nello Cristianini: Estimating the Sentence-Level Quality of Machine Translation Systems. Conference of the European Association for Machine Translation, Barcelona, Spain, 2009.
Lucia Specia, Marco Turchi, Zhuoran Wang, John Shawe-Taylor and Craig Saunders: Improving the Confidence of Machine Translation Quality Estimates. Machine Translation Summit XII, Ottawa, Canada, 2009.
Videolectures