Build Your Own Statistical Machine Translation Engines
Ruben de la Fuente
2 Comments
  • http://www.casmacat.eu/index.php?n=UserGuide.HomePage
  • Is there any available software to create ready-to-use English>Arabic engines to use with CASMACAT?
Notes
  • Why? SMT is based on probability, calculated as the number of occurrences of a given token divided by the total number of tokens. Case and punctuation can disrupt the calculation.
  • To get good results with SMT, you need at least around 10,000 segments.
  • Using Olifant from the Okapi Framework
  • Clean data: remove empty sentences and sentences that are too long or too short
  • Build your own statistical engines

    1. Build Your Own Statistical Machine Translation Engines
       Ruben de la Fuente
    2. About Me
       • 4-year degree in translation
       • Worked as translator for 10+ years
       • Working full time in MT for the past year
    3. Agenda
       • Quick comparison with RbMT
       • Fundamentals of SMT
       • Requirements and preparation
       • Using DoMY
    4. Disclaimer
       • I’m not saying SMT is better
       • I’m not saying SMT is right for you
    5. Statistical Machine Translation
       Computer learns to translate through statistical analysis of alignment in bilingual corpora
    6. Rule-based Machine Translation
       User Dictionaries + Grammar and translation rules
    7. SMT: Pros and Cons
       Pros              Cons
       Quick to build    Unpredictable
       Cheap             Quick improvements not easy
       Fluent
    8. Features of an SMT system
       • Translation Model: table containing source and target phrases, together with a probability score (accuracy)
       • Language Model: list of sequences of n words in target language together with a probability score (fluency)
    9. Language and Translation Models
       • LM (fluency)
       • TM (accuracy)
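The two models on slides 8 and 9 can be illustrated with a toy example. This is a minimal sketch, not how Moses stores or estimates its models: the corpus, the phrase table entries, and the function names `train_bigram_lm` and `best_translation` are all made up for illustration, and the language model uses plain relative frequencies over bigrams (n = 2) with no smoothing.

```python
from collections import Counter


def train_bigram_lm(sentences):
    """Return a function giving P(word | previous word) by relative frequency."""
    contexts = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])              # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))   # adjacent token pairs

    def prob(prev, word):
        if contexts[prev] == 0:
            return 0.0                            # unseen context: no estimate
        return bigrams[(prev, word)] / contexts[prev]

    return prob


# Toy target-language corpus (invented for this example).
lm = train_bigram_lm(["haga clic en enviar", "haga clic en guardar"])

# Toy translation model: source phrase -> {target phrase: probability}.
# The probabilities are invented, not learned from data.
phrase_table = {"click": {"haga clic en": 0.7, "clic": 0.3}}


def best_translation(source_phrase):
    """Pick the target phrase with the highest translation-model score."""
    options = phrase_table[source_phrase]
    return max(options, key=options.get)


print(lm("en", "enviar"))          # fluency side: how likely "enviar" is after "en"
print(best_translation("click"))   # accuracy side: most probable target phrase
```

A real decoder combines both scores when ranking candidate translations; the sketch keeps them separate only to show what each table contains.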
    10. Tokenization and recasing
        Breaking up text into meaningful units (tokens)
        Lowercase all words
        File  > file
        file? > file ?
        file. > file .
        File! > file !
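The normalization on slide 10 can be sketched in a few lines. This is an illustrative approximation using a regular expression, not the tokenizer that ships with Moses; the function names `tokenize` and `relative_frequency` are assumptions introduced here. It also shows why the step matters: token probabilities are relative frequencies, so case and attached punctuation would split one word into several distinct tokens.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split punctuation marks off as separate tokens."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def relative_frequency(tokens):
    """Number of occurrences of each token divided by the total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {token: count / total for token, count in counts.items()}


text = "File file. file?"

# Without normalization, "File", "file." and "file?" count as three different tokens.
print(relative_frequency(text.split()))

# With normalization, all three occurrences are counted as the same token "file".
print(relative_frequency(tokenize(text)))
```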
    11. Requirements: Computing
        • 4 GB RAM PC needed
        • Ubuntu 10.04 64-bit OS
        • Virtual Machine OK
    12. Requirements: size
        MS Translator Hub recommends at least 10k segments
        I have gotten good results with 100-200k segments
        Roughly a corpus of over 1 million words
    13. Publicly Available Corpora
        • Opus (ECB, EMA, OpenOffice)
        • Acquis Communautaire
        • Europarl
        • Hansard
        • Multilingual websites: Bitextor
    14. Bitextor is Cunning
        www.mywebsite.com/en/overview.html
        www.mywebsite.com/es/overview.html
        <title>My source text</title>
        <title>My target text</title>
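The idea on slide 14 is that pages whose URLs differ only in the language code are likely translations of each other. The sketch below illustrates that idea only; it is not Bitextor's actual implementation, and the function name `pair_pages` and the assumption that the language code appears as a two-letter path segment are mine.

```python
import re
from collections import defaultdict


def pair_pages(urls, source_lang="en", target_lang="es"):
    """Pair URLs that are identical except for a two-letter language path segment."""
    by_template = defaultdict(dict)
    for url in urls:
        match = re.search(r"/([a-z]{2})/", url)
        if match is None:
            continue  # no language segment found, skip this URL
        # Replace the language segment with a placeholder to get a shared key.
        template = url[:match.start()] + "/{lang}/" + url[match.end():]
        by_template[template][match.group(1)] = url
    return [
        (pages[source_lang], pages[target_lang])
        for pages in by_template.values()
        if source_lang in pages and target_lang in pages
    ]


urls = [
    "www.mywebsite.com/en/overview.html",
    "www.mywebsite.com/es/overview.html",
    "www.mywebsite.com/en/contact.html",   # no Spanish counterpart, so not paired
]
print(pair_pages(urls))
```

Once two pages are paired, matching elements such as the two `<title>` tags can be treated as candidate source and target segments.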
    15. Requirements: relevance
        Data needs to be in-domain
    16. Requirements: quality
        Garbage in, garbage out
        Diagnose your TMs with automated QA checks (e.g. glossary adherence, length)
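The two QA checks named on slide 16 can be approximated as follows. This is a rough sketch of the kind of check a tool such as CheckMate automates, not a reproduction of it; the function names, the character-based length ratio, and the `max_ratio` threshold of 2.0 are assumptions chosen for illustration.

```python
def length_ratio_ok(source, target, max_ratio=2.0):
    """Flag segment pairs where one side is much longer than the other."""
    source_len, target_len = len(source), len(target)
    if source_len == 0 or target_len == 0:
        return False  # an empty side is always suspicious
    return max(source_len, target_len) / min(source_len, target_len) <= max_ratio


def glossary_violations(source, target, glossary):
    """Return glossary entries whose source term is present but whose target term is missing."""
    source_lower, target_lower = source.lower(), target.lower()
    return [
        (source_term, target_term)
        for source_term, target_term in glossary.items()
        if source_term.lower() in source_lower
        and target_term.lower() not in target_lower
    ]


glossary = {"Send": "Enviar"}  # hypothetical one-entry glossary

print(length_ratio_ok("Click Send", "Haga clic en Enviar"))
print(glossary_violations("Click Send", "Haga clic en Mandar", glossary))
```

Substring matching is deliberately naive here; a production check would match on token boundaries and handle inflected forms.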
    17. CheckMate: General
    18. CheckMate: Length
    19. CheckMate: Terminology
    20. Remove Repetitions
    21. Remove Markup
        Markup brings noise to the learning process
        Click <strong>Send</strong>
        Haga clic en <strong>Enviar</strong>
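Slides 20 and 21 describe two cleanup steps that are easy to sketch. The code below is a minimal illustration, assuming the corpus is held in memory as a list of (source, target) tuples; the function names are mine, and the tag pattern is a simple regular expression rather than a full markup parser.

```python
import re

TAG = re.compile(r"<[^>]+>")  # anything between angle brackets


def strip_markup(segment):
    """Remove inline tags and collapse any leftover runs of whitespace."""
    without_tags = TAG.sub("", segment)
    return re.sub(r"\s+", " ", without_tags).strip()


def remove_repetitions(pairs):
    """Drop repeated (source, target) pairs, keeping the first occurrence in order."""
    seen = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


pairs = [
    ("Click <strong>Send</strong>", "Haga clic en <strong>Enviar</strong>"),
    ("Click <strong>Send</strong>", "Haga clic en <strong>Enviar</strong>"),
]
cleaned = remove_repetitions(
    [(strip_markup(source), strip_markup(target)) for source, target in pairs]
)
print(cleaned)
```

Stripping markup before removing repetitions catches pairs that differ only in their tags.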
    22. Do-Moses-Yourself (DoMY)
        Moses: state-of-the-art, extensively used open source SMT toolkit
        DoMY: extension of Moses making installation and configuration easier
    23. Online SMT Portals
        Portals: letsmt.eu, smartmate.co
        Cons: NDA-compliance, Availability, Speed
    24. DoMY (Basics)
        Graphs: import-tmx, clean-LM/TM, build LM/TM, train, translate.
        Ini files: configuration (language pairs, paths for input and output).
        Folder structure: always include superdomain, domain and subdomain
    25. Folder structure
        corpus    graphs
    26. Run from terminal
        Edit ini    Command line
    27. Running from GUI
    28. Graphs
        Graph        Function                                         Input            Output
        Import-tmx   Extract data from tmx files                      Raw              Corpora/sa
        Clean-tm     Clean data                                       Corpora/sa       Corpora/ready
        Build-lm     Prepares training set for LM                     Corpora/ready    builds
        Build-tm     Prepares training set for TM                     Corpora/ready    builds
        Train        Trains MT engine                                 Builds           engines
        Translate    Translates input files and produces tmx output   Translations/in  Translations/out
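The cleaning step in the table (Clean-tm), together with the slide note about removing empty, too long, and too short sentences, can be sketched as a simple filter. This is not DoMY's code; the function name `clean_corpus` and the token limits are illustrative assumptions, and the corpus is assumed to be a list of (source, target) tuples.

```python
def clean_corpus(pairs, min_tokens=1, max_tokens=80):
    """Keep only pairs where both sides have between min_tokens and max_tokens tokens."""
    kept = []
    for source, target in pairs:
        source_len = len(source.split())
        target_len = len(target.split())
        # An empty or whitespace-only side has 0 tokens and fails the lower bound.
        if (min_tokens <= source_len <= max_tokens
                and min_tokens <= target_len <= max_tokens):
            kept.append((source, target))
    return kept


corpus = [
    ("Click Send", "Haga clic en Enviar"),
    ("", "Segmento sin original"),        # empty source: dropped
    ("word " * 200, "palabra " * 200),    # 200 tokens per side: dropped
]
print(clean_corpus(corpus))
```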
    29. Tips for settings
        LM: 7-gram
        TM: 9-gram
        Aligner: Berkeley for distant languages
    30. Troubleshoot
        Error message in terminal
        Log file in graph folder
        DoMT QA
    31. Is Your Engine Good?
        A set is excluded from training to be used for evaluation (598 segments)
        From 0.5 BLEU points, engine is likely to perform well
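The evaluation on slide 31 has two parts: holding out a test set and scoring the engine's output against reference translations with BLEU. The sketch below is a simplified, unsmoothed, single-reference BLEU on a 0 to 1 scale, written for illustration; it is not the scorer DoMY or Moses uses, and the function names and the random split are assumptions. Only the test-set size of 598 comes from the slide.

```python
import math
import random
from collections import Counter


def split_corpus(pairs, test_size=598, seed=0):
    """Shuffle the corpus and hold out test_size pairs for evaluation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]   # (training, test)


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Geometric mean of clipped 1..max_n-gram precisions times a brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hypothesis, reference in zip(hypotheses, references):
        hyp, ref = hypothesis.split(), reference.split()
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Clip each n-gram count by how often it appears in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0   # without smoothing, any zero precision gives a score of 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


reference = ["haga clic en enviar para continuar"]
print(corpus_bleu(["haga clic en enviar para continuar"], reference))  # identical output
print(corpus_bleu(["haga clic en enviar para seguir"], reference))     # one word differs
```

The held-out pairs must never appear in the training data, otherwise the score overstates how the engine performs on new text.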
    32. Keep Improving
        Retrain the engine periodically as more translation corpora become available
        Gather feedback on what needs to be improved
    33. Statistical PE
        • Keep a corpus of raw vs. PE
        • Treat them as separate language pairs
        • Run them through DoMY
        • Create raw vs. PE engine
        • 2 engines: source > target, raw > PE
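The two-engine setup on slide 33 amounts to chaining one engine after the other: the first translates source into raw target, the second "translates" raw target into post-edited target. The sketch below only illustrates that composition; the dictionaries stand in for trained engines, and `make_pipeline` and the sample segments are hypothetical.

```python
def make_pipeline(translate, post_edit):
    """Chain a source > target engine with a raw > PE engine."""
    def run(segment):
        return post_edit(translate(segment))
    return run


# Stand-ins for trained engines: lookups that return the input unchanged when unknown.
raw_engine = {"Click Send": "Haga click en Enviar"}
pe_engine = {"Haga click en Enviar": "Haga clic en Enviar"}

pipeline = make_pipeline(
    lambda segment: raw_engine.get(segment, segment),   # source > target
    lambda segment: pe_engine.get(segment, segment),    # raw > PE
)
print(pipeline("Click Send"))
```

The training data for the second engine is the corpus of raw output paired with its post-edited version, treated as if it were a language pair.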
    34. Questions?
        Speak now…
        Or reach me at:
        www.facebook.com/xlation
        www.wordbonds.es
        @rubendelafuente
        http://www.linkedin.com/in/rubendelafuente
