SlideShare a Scribd company logo
1 of 19
The Impact of Corpora Quality
on Neural Machine Translation
Matīss Rikters
Tilde
{firstname.lastname}@tilde.lv
September 28, 2018
The Story
• WMT 2017
• Weak results
• Why?
Tilde MT
https://tilde.com/products-and-services/machine-translation
The Story
• WMT 2017
• Weak results
• Why?
• WMT 2018!
Problems in Corpora
Problems in Corpora
Corpora Filters
• Unique parallel sentence filter – removes duplicate source-target sentence pairs
• Equal source-target filter – removes sentences that are identical in the source side and the target
side of the corpus
• Multiple sources - one target and multiple targets - one source filters – remove repeating
sentence pairs where the same source sentence is aligned to multiple different target sentences
and multiple source sentences aligned to the same target sentence
• Non-alphabetical filters – remove sentences that contain over 50% non-alphabetical symbols on
either the source side or the target and sentence pairs that have significantly more (at least 1:3)
non-alphabetical symbols in the source side than in the target side (or vice versa).
• Repeating token filter – removes sentences that exhibit repeated words or repeated phrases that
may occur in back-translated data
• Correct language filter – uses language identification software to estimate the language of each
sentence and removes any sentence that has a different identified language from the one
specified
• Moses Scripts and Subword NMT – calls Moses scripts for tokenising, cleaning, truecasing, and
Subword NMT for splitting into subword units
Experiments - Filtering
Experiments – Training NMT
Transformer architecture models trained with Sockeye
• 6 encoder and decoder layers
• 8 transformer attention heads per layer
• word embeddings and hidden layers of size 512
• dropout of 0.2
• shared subword unit vocabulary of 50 000 tokens
• maximum sentence length of 128 symbols
• a batch size of 3072 words
Results
WMT 2018
http://matrix.statmt.org/matrix/systems_list/1884
WMT 2018
http://matrix.statmt.org/matrix/systems_list/1885
WMT 2018
http://matrix.statmt.org/matrix/systems_list/1894
WMT 2018
http://matrix.statmt.org/matrix/systems_list/1895
WMT 2018
https://groups.google.com/forum/#!topic/wmt-tasks/YeJgziR8Tso
Conclusion
Clean your data before training models!
Use this - https://github.com/M4t1ss/parallel-corpora-tools
”Neural Network Modelling for Inflected Natural Languages”
No. 1.1.1.1/16/A/215.

More Related Content

Similar to The Impact of Corpora Qulality on Neural Machine Translation

What is machine translation
What is machine translationWhat is machine translation
What is machine translationStephen Peacock
 
5. manuel arcedillo & juanjo arevalillo (hermes) translation memories
5. manuel arcedillo & juanjo arevalillo (hermes) translation memories5. manuel arcedillo & juanjo arevalillo (hermes) translation memories
5. manuel arcedillo & juanjo arevalillo (hermes) translation memoriesRIILP
 
Integration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationIntegration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationChamani Shiranthika
 
walkmod - JUG talk
walkmod - JUG talkwalkmod - JUG talk
walkmod - JUG talkwalkmod
 
Real-time DirectTranslation System for Sinhala and Tamil Languages.
Real-time DirectTranslation System for Sinhala and Tamil Languages.Real-time DirectTranslation System for Sinhala and Tamil Languages.
Real-time DirectTranslation System for Sinhala and Tamil Languages.Sheeyam Shellvacumar
 
Breaking the language barrier: how do we quickly add multilanguage support in...
Breaking the language barrier: how do we quickly add multilanguage support in...Breaking the language barrier: how do we quickly add multilanguage support in...
Breaking the language barrier: how do we quickly add multilanguage support in...Jaya Mathew
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Lucidworks
 
mt_cat_presentations CAT TRANSLATION PPT
mt_cat_presentations CAT TRANSLATION PPTmt_cat_presentations CAT TRANSLATION PPT
mt_cat_presentations CAT TRANSLATION PPTRamdan43
 
Machine Learning for Software Maintainability
Machine Learning for Software MaintainabilityMachine Learning for Software Maintainability
Machine Learning for Software MaintainabilityValerio Maggio
 
Experiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine TranslationExperiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine Translationkhyati gupta
 
Experiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine TranslationExperiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine Translationkhyati gupta
 
Tools-Driven Content Curation & Engine Training ATMA 2014
Tools-Driven Content Curation & Engine Training ATMA 2014Tools-Driven Content Curation & Engine Training ATMA 2014
Tools-Driven Content Curation & Engine Training ATMA 2014Welocalize
 
EAMT Workshop 2015 - KantanMT
EAMT Workshop 2015 - KantanMTEAMT Workshop 2015 - KantanMT
EAMT Workshop 2015 - KantanMTkantanmt
 
Dice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkDice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkSimon Hughes
 
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015RIILP
 
Compier Design_Unit I.ppt
Compier Design_Unit I.pptCompier Design_Unit I.ppt
Compier Design_Unit I.pptsivaganesh293
 
Compier Design_Unit I.ppt
Compier Design_Unit I.pptCompier Design_Unit I.ppt
Compier Design_Unit I.pptsivaganesh293
 

Similar to The Impact of Corpora Qulality on Neural Machine Translation (20)

What is machine translation
What is machine translationWhat is machine translation
What is machine translation
 
5. manuel arcedillo & juanjo arevalillo (hermes) translation memories
5. manuel arcedillo & juanjo arevalillo (hermes) translation memories5. manuel arcedillo & juanjo arevalillo (hermes) translation memories
5. manuel arcedillo & juanjo arevalillo (hermes) translation memories
 
Integration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationIntegration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translation
 
walkmod - JUG talk
walkmod - JUG talkwalkmod - JUG talk
walkmod - JUG talk
 
Searching for the Best Machine Translation Combination
Searching for the Best Machine Translation CombinationSearching for the Best Machine Translation Combination
Searching for the Best Machine Translation Combination
 
Real-time DirectTranslation System for Sinhala and Tamil Languages.
Real-time DirectTranslation System for Sinhala and Tamil Languages.Real-time DirectTranslation System for Sinhala and Tamil Languages.
Real-time DirectTranslation System for Sinhala and Tamil Languages.
 
Breaking the language barrier: how do we quickly add multilanguage support in...
Breaking the language barrier: how do we quickly add multilanguage support in...Breaking the language barrier: how do we quickly add multilanguage support in...
Breaking the language barrier: how do we quickly add multilanguage support in...
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
 
mt_cat_presentations CAT TRANSLATION PPT
mt_cat_presentations CAT TRANSLATION PPTmt_cat_presentations CAT TRANSLATION PPT
mt_cat_presentations CAT TRANSLATION PPT
 
Machine Learning for Software Maintainability
Machine Learning for Software MaintainabilityMachine Learning for Software Maintainability
Machine Learning for Software Maintainability
 
Experiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine TranslationExperiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine Translation
 
Experiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine TranslationExperiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine Translation
 
project present
project presentproject present
project present
 
Tools-Driven Content Curation & Engine Training ATMA 2014
Tools-Driven Content Curation & Engine Training ATMA 2014Tools-Driven Content Curation & Engine Training ATMA 2014
Tools-Driven Content Curation & Engine Training ATMA 2014
 
Apex code (Salesforce)
Apex code (Salesforce)Apex code (Salesforce)
Apex code (Salesforce)
 
EAMT Workshop 2015 - KantanMT
EAMT Workshop 2015 - KantanMTEAMT Workshop 2015 - KantanMT
EAMT Workshop 2015 - KantanMT
 
Dice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkDice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank Talk
 
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
 
Compier Design_Unit I.ppt
Compier Design_Unit I.pptCompier Design_Unit I.ppt
Compier Design_Unit I.ppt
 
Compier Design_Unit I.ppt
Compier Design_Unit I.pptCompier Design_Unit I.ppt
Compier Design_Unit I.ppt
 

More from Matīss ‎‎‎‎‎‎‎  

Hybrid Machine Translation by Combining Multiple Machine Translation Systems
Hybrid Machine Translation by Combining Multiple Machine Translation SystemsHybrid Machine Translation by Combining Multiple Machine Translation Systems
Hybrid Machine Translation by Combining Multiple Machine Translation SystemsMatīss ‎‎‎‎‎‎‎  
 
Effective online learning implementation for statistical machine translation
Effective online learning implementation for statistical machine translationEffective online learning implementation for statistical machine translation
Effective online learning implementation for statistical machine translationMatīss ‎‎‎‎‎‎‎  
 
Hybrid machine translation by combining multiple machine translation systems
Hybrid machine translation by combining multiple machine translation systemsHybrid machine translation by combining multiple machine translation systems
Hybrid machine translation by combining multiple machine translation systemsMatīss ‎‎‎‎‎‎‎  
 

More from Matīss ‎‎‎‎‎‎‎   (20)

日本のお風呂
日本のお風呂日本のお風呂
日本のお風呂
 
Thrifty Food Tweets on a Rainy Day
Thrifty Food Tweets on a Rainy DayThrifty Food Tweets on a Rainy Day
Thrifty Food Tweets on a Rainy Day
 
私の趣味
私の趣味私の趣味
私の趣味
 
How Masterly Are People at Playing with Their Vocabulary?
How Masterly Are People at Playing with Their Vocabulary?How Masterly Are People at Playing with Their Vocabulary?
How Masterly Are People at Playing with Their Vocabulary?
 
私の町リガ
私の町リガ私の町リガ
私の町リガ
 
大学への交通手段
大学への交通手段大学への交通手段
大学への交通手段
 
小学生に 携帯電話
小学生に 携帯電話小学生に 携帯電話
小学生に 携帯電話
 
Tracing multisensory food experience on twitter
Tracing multisensory food experience on twitterTracing multisensory food experience on twitter
Tracing multisensory food experience on twitter
 
ラトビア大学
ラトビア大学ラトビア大学
ラトビア大学
 
私の趣味
私の趣味私の趣味
私の趣味
 
富士山りょこう
富士山りょこう富士山りょこう
富士山りょこう
 
Tips and Tools for NMT
Tips and Tools for NMTTips and Tools for NMT
Tips and Tools for NMT
 
Hybrid Machine Translation by Combining Multiple Machine Translation Systems
Hybrid Machine Translation by Combining Multiple Machine Translation SystemsHybrid Machine Translation by Combining Multiple Machine Translation Systems
Hybrid Machine Translation by Combining Multiple Machine Translation Systems
 
Advancing Estonian Machine Translation
Advancing Estonian Machine TranslationAdvancing Estonian Machine Translation
Advancing Estonian Machine Translation
 
Debugging neural machine translations
Debugging neural machine translationsDebugging neural machine translations
Debugging neural machine translations
 
Effective online learning implementation for statistical machine translation
Effective online learning implementation for statistical machine translationEffective online learning implementation for statistical machine translation
Effective online learning implementation for statistical machine translation
 
Neirontulkojumu atkļūdošana
Neirontulkojumu atkļūdošanaNeirontulkojumu atkļūdošana
Neirontulkojumu atkļūdošana
 
Hybrid machine translation by combining multiple machine translation systems
Hybrid machine translation by combining multiple machine translation systemsHybrid machine translation by combining multiple machine translation systems
Hybrid machine translation by combining multiple machine translation systems
 
Paying attention to MWEs in NMT
Paying attention to MWEs in NMTPaying attention to MWEs in NMT
Paying attention to MWEs in NMT
 
CoLing 2016
CoLing 2016CoLing 2016
CoLing 2016
 

Recently uploaded

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 

Recently uploaded (20)

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

The Impact of Corpora Qulality on Neural Machine Translation