
TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Seattle, Language Processing Techniques for Statistical Machine Translation, Diego Bartolome, tauyou, 17 October 2012

This presentation is a part of the MosesCore project that encourages the development and usage of open source machine translation tools, notably the Moses statistical MT toolkit.

MosesCore is supported by the European Commission Grant Number 288487 under the 7th Framework Programme.

For the latest updates, follow us on Twitter - #MosesCore


  1. TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE
       Integration of Advanced Language Processing Techniques into Statistical Machine Translation
       11:10-11:30, Wednesday, 17 October
       Diego Bartolome
       Tauyou
  2. Language Processing Techniques for Statistical Machine Translation
       Contact: Diego Bartolome – dbc@tauyou.com
       C/ Les Planes 39, 1o 2a – 08201 Sabadell – Spain
       Tel. +34 93 711 29 96
  3. To start ...
  4. … you choose Moses ...
       Translation memories + linguistic assets
       Cleaning and training following tutorials
       BLEU score seems ok in training …
       but ... the results are awful!
  5. Why?
       Not enough data
       Unclean translation memories
       Misalignments
       Spelling and grammar errors
       Difficult language pairs
       Selection of wrong parameters
       Application of suboptimal techniques
       So many things … what can you do?
  7. Some steps
       Maximum exploitation of existing assets
       Source content optimization
       Data selection and cleaning
       Improvement of the models
       Linguistic processing
       ...
  8. Existing assets: increase TM leverage
       Translation memory sharing
         Clients, Partners, Competitors, EU, UN, TAUS
       Relevant on-line data retrieval
       Advanced TM techniques
         Sub-segment matching
         Parts of Speech replacement
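
Slide 8 names sub-segment matching without describing how it is done. As one possible reading, the sketch below finds the longest run of tokens that a new segment shares with any source segment in a translation memory. The function name, the whitespace tokenization, and the `min_tokens` threshold are assumptions made for illustration, not the presenter's method.

```python
from difflib import SequenceMatcher


def longest_shared_subsegment(new_segment, tm_sources, min_tokens=3):
    """Return (size, shared_text, tm_source) for the longest token run that
    new_segment shares with any TM source segment, or None if no run reaches
    min_tokens. Tokenization is naive lowercase whitespace splitting."""
    new_tokens = new_segment.lower().split()
    best = None
    for tm_source in tm_sources:
        tm_tokens = tm_source.lower().split()
        matcher = SequenceMatcher(None, new_tokens, tm_tokens, autojunk=False)
        match = matcher.find_longest_match(0, len(new_tokens), 0, len(tm_tokens))
        if match.size >= min_tokens and (best is None or match.size > best[0]):
            shared = " ".join(new_tokens[match.a:match.a + match.size])
            best = (match.size, shared, tm_source)
    return best
```

A real system would also need the word alignment of the matched TM entry in order to recover the translation of the shared run; that step is outside this sketch.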
  9. Source optimization (I): Pre-editing
       new doc → proposed doc + html report
         Spell check
         Grammar check
         Style check
         Terminology check
         Client checklist
  10. Source optimization (II): Summarization
       new doc → proposed doc + html report
         % to reduce
         Use translation memories
           Project
           Client
           All
  11. Summarization example
       http://www.translationautomation.com/press-releases/free-open-source-machine-translation-tutorial-is-made-available-by-taus
  12. Data selection and cleaning – a sample
       Clean translation memories
         Length, punctuation, terminology, repetitions …
         Segment splitting
       Optimize weight of most frequent n-grams in corpus
         Validate their translations
       Add out-of-domain data for irrelevant n-grams
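
The length, punctuation, and repetition checks listed on slide 12 can be sketched as a simple filter over (source, target) segment pairs. This is a minimal illustration under assumed thresholds: `max_len`, `max_ratio`, and the terminal-punctuation rule are hypothetical choices, and terminology checking is left out because it needs a client glossary.

```python
def clean_tm(pairs, max_len=80, max_ratio=3.0):
    """Keep (source, target) pairs that pass basic sanity checks.

    Drops pairs that are empty, too long, badly unbalanced in length,
    inconsistent in terminal punctuation, or exact repetitions.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0:
            continue  # one side is empty
        if n_src > max_len or n_tgt > max_len:
            continue  # segment too long to align reliably
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # length ratio suggests a misalignment
        if (src[-1] in ".?!") != (tgt[-1] in ".?!"):
            continue  # only one side ends with sentence punctuation
        key = (src.lower(), tgt.lower())
        if key in seen:
            continue  # repetition of an earlier pair
        seen.add(key)
        kept.append((src, tgt))
    return kept
```

The thresholds are deliberately loose; in practice they would be tuned per language pair, since typical length ratios differ between, say, English-Spanish and English-German.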
  13. Models optimization
       Filter the translation tables
         Remove the garbage + tune the weights if necessary
       Optimize language models
         Adapt them to the translation purpose
       Tune parameters correctly
         Tune set, test set, optimization parameters …
       Improve recasing
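
One way to "remove the garbage" from a translation table, as slide 13 puts it, is to drop low-probability entries and keep only the best few target options per source phrase. The sketch below assumes a text phrase table with `|||`-separated fields (source, target, scores) and assumes that the third score is the direct translation probability; check both against the table your Moses version produces. The function name and the default thresholds are hypothetical.

```python
from collections import defaultdict


def prune_phrase_table(lines, min_prob=0.01, top_n=2, score_index=2):
    """Keep, per source phrase, the top_n entries whose chosen score is at
    least min_prob. Lines are returned unchanged, grouped by source phrase."""
    by_source = defaultdict(list)
    for line in lines:
        fields = [field.strip() for field in line.split("|||")]
        if len(fields) < 3:
            continue  # malformed entry
        scores = [float(s) for s in fields[2].split()]
        if score_index >= len(scores):
            continue  # fewer scores than expected
        prob = scores[score_index]
        if prob >= min_prob:
            by_source[fields[0]].append((prob, line))
    kept = []
    for options in by_source.values():
        options.sort(key=lambda item: item[0], reverse=True)
        kept.extend(line for _, line in options[:top_n])
    return kept
```

Pruning of this kind shrinks the table and speeds up decoding, but the effect on quality should be checked on a held-out test set before and after filtering.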
  14. Linguistic processing
       In the source and/or target language
         Grammar checking
         Entities detection
           proper nouns, alphanumeric words, numbers, ...
         Compound words splitting
         Sentence reordering
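
For the numbers and alphanumeric words mentioned on slide 14, a common approach is to mask them with placeholders before translation and restore them afterwards, so the engine cannot alter them. The sketch below is one hedged illustration: the regular expression, the `__ENTn__` placeholder format, and both function names are assumptions, and proper-noun detection is not covered because it needs a real named-entity recognizer.

```python
import re

# Any token containing at least one digit, optionally followed by
# decimal or thousands groups such as ".5" or ",000".
TOKEN_WITH_DIGIT = re.compile(r"\b\w*\d\w*(?:[.,]\d+)*\b")


def mask_entities(text):
    """Replace digit-bearing tokens with numbered placeholders.
    Returns the masked text and the list of original tokens."""
    entities = []

    def _replace(match):
        entities.append(match.group(0))
        return f"__ENT{len(entities) - 1}__"

    return TOKEN_WITH_DIGIT.sub(_replace, text), entities


def unmask_entities(text, entities):
    """Put the original tokens back in place of their placeholders."""
    for index, entity in enumerate(entities):
        text = text.replace(f"__ENT{index}__", entity)
    return text
```

The placeholders only help if the engine passes them through untouched, which usually means they must also appear in the training data or be handled as untranslatable tokens.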
  15. An example from
       Source
         XXX 335102 doses are calculated as a free acid of the sodium salt (NA).
         The potential toxicity of XXX 335102 was studied in a number of acute toxicity studies in mouse and rat and repeat dose toxicity studies of 8 and 32 weeks each in rat and monkeys.
         XXX 335102 was negative in a panel of in vivo and in vitro tests to assess mutagenicity and clastogenicity identifying no genotoxic risks for human subjects.
         An in vitro assay for phototoxic potential suggested that XXX 335102 is photoxic/photosensitive.
         In the 8-week studies in monkeys, increases in unconjugated bilirubin were noted at the doses tested (33, 88, 192 and 444mg/kg/day); the greatest increases occurring at Week 4 and declining or returning to control levels by Week 8.
       Reference
         Las dosis de XXX 335102 se calculan como la sal sódica sin ácido (AS).
         La toxicidad potencial de XXX 335102 se estudió en varios estudios de toxicidad aguda en ratones y ratas y en estudios de toxicidad con administración repetida de 8 y 32 semanas en ratas y monos.
         Se obtuvieron resultados negativos en un grupo de pruebas in vivo e in vitro para evaluar su mutagenia y clastogenia, sin identificarse riesgos genotóxicos para el ser humano.
         En un estudio in vitro de su potencial fototóxico se sugirió que XXX 335102 es fototóxico o fotosensibilizador.
         En los estudios de 8 semanas en monos se apreció el aumento de la bilirrubina no conjugada con las dosis estudiadas (33, 88, 192 y 444 mg/kg/día), produciéndose el mayor incremento en la semana 4 y disminuyendo o volviendo a los niveles de control en la semana 8.
  16. Generic engine
         XXX 335102 se calculan en forma de dosis de ácido libre del sodio sal (NA).
         La Toxicidad potencial de XXX 335102 fue estudiado en una serie de estudios de toxicidad aguda en ratón y rata y vuelva a dosis estudios de toxicidad, de 8 y de 32 semanas en rata y cada uno de los monos.
         XXX 335102 era negativo en un grupo de in vivo y pruebas in vitro para evaluar mutagenicidad y genotóxicas clastogenicity no identificar los riesgos para los participantes humanos.
         Un para fines de ensayo in vitro phototoxic potencial se sugirió que XXX 335102 photoxic/Photosensitive.
         En Los 8 -week estudios en los monos, aumentos en unconjugated bilirrubina salieron a las dosis analizada (33, 88, 192 y 444 mg/kg/día); los mayores incrementos habidos En la semana 4 y la reducción o devolver a nivel de control de 8 Por semana.
       Medical engine with improvements
         Las dosis XXX 335102 se calculan como ácido libre de la sal sódica (AS).
         La toxicidad potencial de XXX 335102 se estudió en varios estudios de toxicidad aguda en ratones y ratas y en estudios de toxicidad con administración repetida de 8 y 32 semanas en ratas y monos.
         XXX 335102 dio negativo en un grupo de pruebas in vivo e in vitro para evaluar su mutagenia y clastogenia, sin identificarse riesgos genotóxicos para el ser humano.
         En un estudio in vitro de su potencial fototóxico se sugirió que XXX 335102 es fototóxico o fotosensibilizador.
         En los estudios de 8 semanas en monos, el aumento de la bilirrubina no conjugada con las dosis estudiadas (33, 88, 192 y 444 mg/kg/día); el mayor incremento en la semana 4 y disminuyendo o volviendo a los niveles de control en la semana 8.
  17. Conclusions
       MT can be combined with other advanced techniques
       Creating and improving an engine requires time
         You can also be lucky at the first try!
       The optimum results require translators
         Implementation of the linguistic knowledge
         Continuous improvement