Introduction, background: Newspapers in Catalan
[Bar chart: net circulation of newspapers in Catalan. Bar values: 79,239; 45,309; 31,762; 15,662; 6,779]
Source: Estudi General de Mitjans (EGM), 2012

Introduction, background: Results
Increase
• +4% of copies
• +7% of readers
Distribution
• 57% Spanish
• 43% Catalan

Introduction, background: Why a Catalan version?
• Celebration of LV’s 130th anniversary
• Normalization of the use of Catalan
• Investment to face the crisis
• Opportunity to consolidate LV’s hegemony

Customer goals
• To publish two language editions of the same newspaper daily (supplements incl.).
• Journalists should be able to write in either of the two languages.
• Neither quality nor distribution timeframes should be affected.

Customer requirements for MT
• Tailor-made system
• Complying with LV’s style guide
• Seamless integration into the journalists’ workflow
• Translation of Hermes XML and InDesign formats
• Reliability, high availability
• High performance

Ramp-up phase: Project set-up
Work areas
• MT linguistic improvement/tuning
• Post-editing preparation
• MT system set-up and integration
• MT lexicon training
Duration
• 8 months (+ 3 months)
Staff
• LV: 10-12 in-house journalists
• Lucy: 3 computational linguists / lexicographers, 1 software developer
• Incyta: 2 professional post-editors
Important! On-site support

Subphases
Duration: Phase 1, 2 mo; Phase 2, 3 mo; Phase 3, 3 mo; Phase 4, 3 mo
Linguistic improvement/tuning
• Language-type definition: marked in one phase
• Creation of a corpus of real texts: all four phases
• Analysis of the translation quality: all four phases
• Error reporting (lexicon and grammar errors): all four phases
• Linguistic implementation (lex and grammar): all four phases
• Pre and post-editing filters: all four phases
Post-editing preparation
• Gathering of MT post-editing guidelines: marked in one phase
• Evaluation of post-editing effort: marked in two phases
• Creation and training of the post-editing team: marked in one phase
Technical set-up
• System set-up and integration: marked in one phase
• Preparation of XML converters: marked in one phase
Maintenance
• Lexicon maintenance training: marked in one phase

[a] Linguistic tuning
• Language model
• Corpus
• Translation quality (TQ)
• Analysis and error-reporting
• Implementation
• Accomplished improvement data

Linguistic tuning
Catalan language model
• no exclusion
• compliant with standards
• innovative in terminology
• dynamic in syntactical structures
Corpus
• ES: 500,000 transl. units, 8,300,000 words
• CA: 250,000 transl. units, 3,000,000 words

Linguistic tuning: Translation quality
[Pie chart]
• Perfect: 74%
• Minimal post-editing: 24%
• Medium post-editing: 2%
Conclusions
• No specific domains (except Sports)
• Culture: proper names
• Opinion: idioms, plays on words
• Errors not repetitive
• % of style to be post-edited

Linguistic tuning
Analysis and error reporting
• Semi-automatic detection of missing words
• Terminology lists
• New and different translations, error reporting
Implementation
• Proper names [44.5% of the TUs]
• Idioms
• Alternatives

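The semi-automatic detection of missing words can be pictured as an automatic candidate list that a lexicographer then reviews. The sketch below is only an illustration of that idea, not the procedure used in the project: the function `missing_words`, the `min_count` threshold, the sample sentences and the toy lexicon are all hypothetical.

```python
import re
from collections import Counter

# Letters only, allowing internal apostrophe, middle dot (Catalan l·l) or hyphen.
TOKEN = re.compile(r"[^\W\d_]+(?:['·-][^\W\d_]+)*")

def missing_words(texts, lexicon, min_count=2):
    """Rank tokens that are absent from the lexicon by corpus frequency.

    Hypothetical helper: the output is a candidate list for manual review,
    which is the 'semi' part of semi-automatic detection.
    """
    counts = Counter()
    for text in texts:
        for token in TOKEN.findall(text.lower()):
            if token not in lexicon:
                counts[token] += 1
    return [(word, n) for word, n in counts.most_common() if n >= min_count]

# Toy data, invented for the example.
sample_texts = ["El Barça guanya la lliga", "El Barça torna a guanyar"]
sample_lexicon = {"el", "guanya", "la", "lliga", "torna", "a", "guanyar"}

candidates = missing_words(sample_texts, sample_lexicon)
print(candidates)
```

Ranking by frequency puts recurring gaps, such as proper names, at the top of the review list.
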
Linguistic tuning: Accomplished improvement data
• Work in figures
  40,000 lexicon entries (20,000 for each transl. direction)
  Around 440 grammar rules
  Around 7,200 words in the proper names files (each transl. dir)
• Non-measurable work
  Understanding of the MT system
  Understanding of the newspaper specificities
  Support in the style guide taking into account MT
• Improvement
  ES>CA: 41% diff => 35% better, 4% similar, 2% worse
  CA>ES: 36% diff => 32% better, 3% similar, 1% worse

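The improvement figures can also be read as shares of the output that changed. This is a minimal sketch of that arithmetic, assuming the better/similar/worse percentages are expressed over all evaluated text (the slide does not state the base); the function name `share_of_changed` is hypothetical.

```python
def share_of_changed(better, similar, worse):
    """Express each outcome as a fraction of the output that changed.

    Inputs are percentages of all evaluated text (an assumption).
    """
    changed = better + similar + worse
    return {
        "changed": changed,
        "better": better / changed,
        "similar": similar / changed,
        "worse": worse / changed,
    }

# Figures taken from the slide.
es_ca = share_of_changed(better=35, similar=4, worse=2)
ca_es = share_of_changed(better=32, similar=3, worse=1)

for name, result in (("ES>CA", es_ca), ("CA>ES", ca_es)):
    print(f"{name}: {result['changed']}% changed, "
          f"{result['better']:.0%} of the changes are better")
```

Under that assumption, roughly 85% of the ES>CA changes and 89% of the CA>ES changes were improvements.
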
Post-editing
• Metrics on translation volume
• Metrics on post-editing effort
• Specificities of the text
• Post-editors’ resources
• Post-editing workspace
• Error reporting process and tools
• Post-editing team and profile

Post-editing: metrics
File: LV_2010-10-27 (= 42,512 words)
• Total translation units: 2,474
• Lex/gram post-edition: 464 (18.79%)
• Style post-edition: 394 (15.96%)
Conclusions
• Different sections had different levels of post-editing
• What style corrections could be avoided?
• Post-editing speed: 1,000-1,500 words/h
• Daily volume: 75,000 words
• New post-editing team: 20 post-editors/12 editors

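The speed and volume figures give a rough idea of the daily post-editing load. The sketch below assumes that the whole daily volume is post-edited at the stated speed and that the work is shared evenly among the 20 post-editors; neither assumption comes from the slides, and the names are hypothetical.

```python
def person_hours(words, words_per_hour):
    """Hours one person would need to post-edit `words` at the given speed."""
    return words / words_per_hour

DAILY_WORDS = 75_000          # daily volume from the slide
SPEED_RANGE = (1_000, 1_500)  # words per hour, from the slide
POST_EDITORS = 20             # team size from the slide

slow = person_hours(DAILY_WORDS, SPEED_RANGE[0])
fast = person_hours(DAILY_WORDS, SPEED_RANGE[1])

# Even split across the team (an assumption).
per_editor = (fast / POST_EDITORS, slow / POST_EDITORS)

print(f"Total effort: {fast:.0f}-{slow:.0f} person-hours per day")
print(f"Per post-editor: {per_editor[0]:.2f}-{per_editor[1]:.2f} hours per day")
```
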
Post-editing: resources, workspace
Post-editors should have language proficiency skills BUT also
• Be trained on MT post-ed
• Have an integrated workspace
• Have resources at a click
Resources on Intranet portal
• Post-editing guide
• Bilingual style guide
• Classified frequent MT errors
• Links to all reference dictionaries
• Reference document for training
• MT portal for any journalist
Adapt CMS to new workflow
• New processing status
• New mark-ups

Post-editing: resources, workspace
La Vanguardia’s intranet: linguistic portal

Post-editing: error reporting, team
Error reporting
• Crucial for continuous improvement
• Not automated (yet)
• Provide better support to error reporting
Definition of post-editing profile and team
• Proficient in Catalan
• Journalist background

[c] System integration
During phase 1: pre-production
• Pre-production set-up and installation
• Hermes XML converter
• Changes in the LT engine to translate InDesign files
During phase 3: production
• Production installation
• Test (load, performance and stress)
• Performance 500-1,200 w/sec
• Definition of the final installation size

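To put the measured performance in perspective, the raw machine translation time for one day of content can be estimated. The sketch assumes a sustained throughput within the reported 500-1,200 w/sec range, the 75,000-word daily volume from the post-editing metrics, and no format conversion or queueing overhead; the function name is hypothetical.

```python
def translation_seconds(words, words_per_second):
    """Raw MT time for `words` at a sustained throughput (overhead ignored)."""
    return words / words_per_second

DAILY_WORDS = 75_000             # daily volume from the post-editing metrics
THROUGHPUT_RANGE = (500, 1_200)  # words per second, from the slide

worst_case = translation_seconds(DAILY_WORDS, THROUGHPUT_RANGE[0])
best_case = translation_seconds(DAILY_WORDS, THROUGHPUT_RANGE[1])

print(f"Raw MT time per day: {best_case:.1f}-{worst_case:.1f} seconds")
```

Under these assumptions the engine itself needs only a few minutes per day, so most of the production window is left for human review.
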
System integration
[Architecture diagram. Labels: Language portal, Hermes, InDesign, Web Service; installations: Production, Pre-production, Maintenance]
• Production: balanced high performance (HP) and high availability (HA) configuration
• System requirements: normal Windows Server -> low HW footprint (e.g. Dual Core/Quad 2.5-3 GHz, 2-4 GB RAM running Win Server 2003/2008)

Operation: production process
Staff
• 20 post-editors
• 12 editors
• 70,000 words/day + suppl.
Effort
• 30’ linguistic review
• 10’ journalistic review
Timeline
• Start 5 p.m.
• First edition 11.30 p.m.
• Second edition 2.30 a.m.

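The timeline implies a fixed working window before each edition. A minimal sketch of that calculation, assuming the edition times are closing deadlines and that the second edition falls on the following calendar day; the helper `window` is hypothetical.

```python
from datetime import datetime, timedelta

def window(start, end, fmt="%H:%M"):
    """Time from `start` to `end`, rolling past midnight when needed."""
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 <= t0:
        t1 += timedelta(days=1)  # the end time is on the next day
    return t1 - t0

first_edition = window("17:00", "23:30")
second_edition = window("17:00", "02:30")

print(f"Window until first edition: {first_edition}")
print(f"Window until second edition: {second_edition}")
```
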
Next goals
Success! Yes. Thanks to
• Close work and cooperation
• Three parties involved
• Time and effort investment
• Customisation
Next!
• How to reduce post-editing effort
• How to re-use post-edited text

Thank you for your attention
• Magí Camps, La Vanguardia, mcamps@lavanguardia.es, www.lavanguardia.es
• Blanca Vidal, Lucy Software Ibérica, email@example.com, www.lucysoftware.com
• Ignasi Navarro, Incyta, Ignasi_navarro@incyta.com, www.incyta.com