9. Corpora Filters
• Unique parallel sentence filter – removes duplicate source-target sentence pairs
• Equal source-target filter – removes sentence pairs in which the source sentence and the target
sentence are identical
• Multiple sources - one target and multiple targets - one source filters – remove repeating
sentence pairs in which the same source sentence is aligned to multiple different target sentences,
or multiple different source sentences are aligned to the same target sentence
• Non-alphabetical filters – remove sentence pairs in which either the source side or the target
side consists of more than 50% non-alphabetic characters, as well as pairs in which one side
contains significantly more (a ratio of at least 1:3) non-alphabetic characters than the other
• Repeating token filter – removes sentences containing repeated words or repeated phrases, a
pattern that often occurs in back-translated data
• Correct language filter – uses language identification software to estimate the language of each
sentence and removes any sentence whose identified language differs from the one specified
• Moses Scripts and Subword NMT – calls Moses scripts for tokenising, cleaning, and truecasing,
and Subword NMT for splitting tokens into subword units
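Several of the filters above can be sketched in a few lines of Python. This is a minimal illustration, not the actual filtering toolkit: the corpus is assumed to be a list of (source, target) string pairs, and the thresholds (50% non-alphabetic, 1:3 ratio, maximum token run of 2) follow the description above but may differ from the original implementation.

```python
def unique_pairs(pairs):
    """Unique parallel sentence filter: drop duplicate (source, target) pairs."""
    seen = set()
    out = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            out.append(pair)
    return out

def drop_equal(pairs):
    """Equal source-target filter: drop pairs whose two sides are identical."""
    return [(s, t) for s, t in pairs if s != t]

def alpha_ratio(sentence):
    """Fraction of non-whitespace characters that are alphabetic."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)

def non_alpha_filter(pairs, min_alpha=0.5, max_ratio=3.0):
    """Non-alphabetical filters: drop pairs where either side is over 50%
    non-alphabetic, or where one side's non-alphabetic share is at least
    three times the other's."""
    out = []
    for s, t in pairs:
        rs, rt = alpha_ratio(s), alpha_ratio(t)
        if rs < min_alpha or rt < min_alpha:
            continue
        ns, nt = 1.0 - rs, 1.0 - rt
        if ns > 0 and nt > 0 and max(ns, nt) / min(ns, nt) >= max_ratio:
            continue
        out.append((s, t))
    return out

def no_repeats(pairs, max_run=2):
    """Repeating token filter: drop pairs where either side repeats the same
    token more than `max_run` times in a row (common in back-translations)."""
    def has_run(sentence):
        tokens = sentence.split()
        run = 1
        for a, b in zip(tokens, tokens[1:]):
            run = run + 1 if a == b else 1
            if run > max_run:
                return True
        return False
    return [(s, t) for s, t in pairs if not has_run(s) and not has_run(t)]
```

The correct language filter is omitted here because it relies on an external language identification tool; in practice any off-the-shelf language identifier can be plugged into the same pair-filtering pattern.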
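The Moses and Subword NMT step might look roughly as follows for an English-German corpus. This is a hedged sketch: the Moses scripts path, language codes, and file names are placeholders, and the original setup may chain the tools differently.

```shell
# Placeholder path to a checkout of the Moses scripts repository
MOSES=~/mosesdecoder/scripts

# Tokenise both sides of the corpus
$MOSES/tokenizer/tokenizer.perl -l en < corpus.en > corpus.tok.en
$MOSES/tokenizer/tokenizer.perl -l de < corpus.de > corpus.tok.de

# Drop empty lines and pairs outside the 1-128 token length range
$MOSES/training/clean-corpus-n.perl corpus.tok en de corpus.clean 1 128

# Train and apply a truecasing model (shown for the source side only)
$MOSES/recaser/train-truecaser.perl --model tc.en --corpus corpus.clean.en
$MOSES/recaser/truecase.perl --model tc.en < corpus.clean.en > corpus.tc.en

# Learn a joint BPE model on both sides and apply it, so that the
# subword vocabulary is shared between source and target
cat corpus.tc.en corpus.tc.de | subword-nmt learn-bpe -s 50000 -o bpe.codes
subword-nmt apply-bpe -c bpe.codes < corpus.tc.en > corpus.bpe.en
subword-nmt apply-bpe -c bpe.codes < corpus.tc.de > corpus.bpe.de
```

Note that `learn-bpe -s` sets the number of merge operations, which is related to but not identical to the final vocabulary size of 50 000 tokens mentioned below.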
11. Experiments – Training NMT
Transformer architecture models trained with Sockeye
• 6 encoder and decoder layers
• 8 transformer attention heads per layer
• word embeddings and hidden layers of size 512
• dropout of 0.2
• shared subword unit vocabulary of 50 000 tokens
• maximum sentence length of 128 symbols
• a batch size of 3072 words
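The hyperparameters above translate into a Sockeye training invocation along these lines. Flag names follow Sockeye 1.x and may differ in other versions; corpus and output paths are placeholders, and the exact dropout flags used in the experiments are not specified in the text (attention and activation dropout may also have been set).

```shell
python -m sockeye.train \
  --source corpus.bpe.en --target corpus.bpe.de \
  --validation-source dev.bpe.en --validation-target dev.bpe.de \
  --encoder transformer --decoder transformer \
  --num-layers 6:6 \
  --transformer-attention-heads 8 \
  --transformer-model-size 512 \
  --num-embed 512 \
  --transformer-dropout-prepost 0.2 \
  --max-seq-len 128 \
  --batch-type word --batch-size 3072 \
  --shared-vocab \
  --output model
```

`--shared-vocab` gives the shared subword vocabulary, and `--batch-type word` makes the batch size of 3072 count words rather than sentences.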