
11. Manuel Leiva & Juanjo Arevalillo (Hermes): Evaluation of Machine Translation


  1. Machine translation evaluation - Hermes Traducciones y Servicios Lingüísticos
  2. MT at Hermes:  Pure RBMT engines with pre- and post-processing macros.  Texts from technical domains.  The applied-technology department has been working on MT engines for over a year.  Over 250,000 words post-edited with internal engines in the last year.  Average new-word count for projects post-edited with internal engines: 9,000 words.
  3. Our purpose with MT evals: Automated metrics might help us:  predict PE time and productivity gains;  negotiate reasonable discounts;  evaluate the quality of engines;  measure the performance of the applied-technology department;  not depend on human-reported data.
  4. What we hoped to find:  We hoped some metric would correlate with the productivity-gain data provided by post-editors.  We gathered BLEU, F-Measure, METEOR and TER values.  Ideally, we would end up relying on automated metrics rather than on time and productivity measurements reported by post-editors.
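To illustrate the kind of automated metric gathered, here is a minimal sketch of a unigram F-Measure between a raw MT hypothesis and a post-edited reference. This is a simplified stand-in, not the exact tooling the deck used (production evaluations would rely on standard BLEU/METEOR/TER implementations):

```python
from collections import Counter

def unigram_f_measure(hypothesis: str, reference: str) -> float:
    """Harmonic mean of clipped unigram precision and recall (simplified F-Measure)."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())  # unigram matches, clipped per word
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Toy example (hypothetical sentences, not project data)
score = unigram_f_measure("the cat sat on the mat", "the cat is on the mat")
```

A score of 1.0 means the hypothesis and reference share exactly the same bag of words; real metrics like BLEU additionally weight longer n-gram matches.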
  5-6. What we hoped to find: [scatter plots of automated metric scores against productivity gain %, showing the hoped-for correlation trend]
  7-8. What we actually found: No correlation. [scatter plots of BLEU, F-Measure, TER and METEOR scores against productivity gain %, showing no visible correlation]
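The "no correlation" finding can be checked numerically with a Pearson correlation coefficient between metric scores and productivity gains. A minimal sketch, using hypothetical per-project numbers (the deck does not publish its raw data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-project values: BLEU score vs. productivity gain (%)
bleu = [42.0, 55.3, 61.8, 48.9, 70.2]
gain = [35.0, 20.0, 80.0, 95.0, 15.0]
r = pearson(bleu, gain)  # a value near 0 indicates no linear correlation
```

Values of r near +1 or -1 would support using the metric as a predictor of PE productivity; values near 0 match the scatter the slides describe.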
  9. Reasons for the variability:  Different CAT environments (Trados Studio, memoQ, Idiom, TagEditor, etc.).  Different engines (per domain, per client, etc.).  Different clients, different needs.  Different post-editors.  Or, if the same post-editor, different post-editing skills over time.  Different word volumes.  Specific productivity- or consistency-enhancement processing can affect metrics negatively.
  10. Productivity-enhancement example:  Source: Add events as described in Adding Events to a Model.  PE: Agregue los eventos como se describe en Adición de eventos a un modelo.  Raw 1: Agregue los eventos como se describe en la adición de los eventos a un modelo.  Raw 2: Agregue los eventos como se describe en Adding Events to a Model.  Scores: BLEU - Raw 1: 68.59, Raw 2: 53.33; TER - Raw 1: 17.65, Raw 2: 29.41. Metrics for Raw 1 are significantly better, but Raw 2 is faster to post-edit thanks to automatic terminology-insertion tools (such as Xbench).
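Why Raw 1 scores better can be seen with a simplified word-level TER: edit distance between hypothesis and the post-edited reference, divided by reference length. This sketch omits the block-shift moves of full TER, so the numbers will not match the slide's exactly, but the ranking does:

```python
def word_ter(hypothesis: str, reference: str) -> float:
    """Simplified TER: word-level Levenshtein distance over reference length.
    (Full TER also allows phrase shifts; this sketch omits them.)"""
    hyp, ref = hypothesis.split(), reference.split()
    # Standard dynamic-programming edit distance over word tokens
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)

pe = "Agregue los eventos como se describe en Adición de eventos a un modelo."
raw1 = "Agregue los eventos como se describe en la adición de los eventos a un modelo."
raw2 = "Agregue los eventos como se describe en Adding Events to a Model."
# raw1 needs fewer word edits than raw2, matching its better (lower) TER above
```

Lower TER means fewer edits on paper, yet the slide's point stands: the untranslated heading in Raw 2 is repaired automatically by terminology tooling, so edit-distance metrics undercount its real post-editing speed.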
  11. Human evaluation:  Adequacy: How much of the meaning expressed in the gold-standard translation or the source is also expressed in the target translation?  4. Everything  3. Most  2. Little  1. None  Fluency: To what extent is the target translation grammatically well formed, without spelling errors, and perceived as natural/intuitive language by a native speaker?  4. Flawless  3. Good  2. Disfluent  1. Incomprehensible. Source: TAUS MT evaluation guidelines https://evaluation.taus.net/resources/adequacy-fluency-guidelines
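The two TAUS scales above can be aggregated per engine or per project by simple averaging. A minimal sketch with hypothetical per-segment ratings (the evaluator data is invented for illustration):

```python
from statistics import mean

# TAUS-style 4-point scales (4 = best), as listed on the slide above
ADEQUACY = {4: "Everything", 3: "Most", 2: "Little", 1: "None"}
FLUENCY = {4: "Flawless", 3: "Good", 2: "Disfluent", 1: "Incomprehensible"}

# Hypothetical per-segment ratings collected from evaluators
ratings = [
    {"adequacy": 4, "fluency": 3},
    {"adequacy": 3, "fluency": 4},
    {"adequacy": 2, "fluency": 2},
]

avg_adequacy = mean(r["adequacy"] for r in ratings)
avg_fluency = mean(r["fluency"] for r in ratings)
```

Averaged adequacy/fluency scores give a human-judgment counterpart to the automated metrics, which is how the conclusions slide combines the two views.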
  12. Conclusions:  We combine automated metrics with the time/productivity data reported by post-editors for the final evaluation of internal MT performance.  Poor post-editing skills or any project-specific contingency can be counterbalanced by good automated metrics.  We look for qualitative information in automated metrics, not quantitative.  BLEU values of 65 and 70 for two different engines tell us both are good engines, not that one will render 5% better results than the other.
