TAUS - Portland, October 24, 2016
MMT
Machine Translation in Numbers
Team
Problems with current Open Source MT?
● 22-year-old idea (Brown, Della Pietra - 1994)
● 10-year-old implementation (Moses, JHU workshop 2006)
Problems with current Open Source
Need re-training to learn from new data
Problems with current Open Source
Does not adapt to context
Today, you often get to the absurd:
More data = Lower Quality
Welcome to MMT
● Incremental: Learns corrections in seconds.
● Adapts to context as you use it.
● No more initial training needed, like our old TMs :)
● Comes with data. Lots of data.
One more thing...
It is Free and Open Source
How does it work?
Context Analyzer
Retrieves best matching TMs
based on context similarity
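A minimal sketch of how context-based TM retrieval could look, assuming a plain bag-of-words cosine similarity. The slides do not say which similarity measure MMT uses, and the function names and TM samples below are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a stand-in for a real tokenizer."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_tms(context, tms, top_k=2):
    """Return (tm_name, score) pairs for the TMs most similar to the context."""
    ctx = bag_of_words(context)
    scored = [(name, cosine(ctx, bag_of_words(text))) for name, text in tms.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical TMs, each reduced to a sample of its source-side text.
tms = {
    "payments": "send money to your account pay with your card refund payment",
    "careers": "update your profile add a job connect with recruiters",
    "legal": "the parties agree to the terms of this agreement",
}
ranking = rank_tms("how do I get a refund for a card payment", tms)
print(ranking)
```

The scores returned here could then serve as the per-TM weights used when sampling phrases.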
Indexed instead of Training
● Suffix array indexed with TMs
● Phrase table is built on the fly
by sampling from the SA
● Phrases of TMs with highest
weights sampled first
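The indexing idea above can be sketched as follows. This is an illustration under stated assumptions, not MMT's implementation: the suffix array is built naively by sorting, the sampler returns whole sentence pairs, and the word-alignment step that would extract target phrases is omitted. All names and the sample segments are hypothetical.

```python
def build_index(segments):
    """segments: list of (tm_name, source_sentence, target_sentence)."""
    corpus = []  # flat token list with a sentinel after every segment
    owner = []   # owner[i] = index of the segment that holds corpus[i]
    for seg_id, (_, source, _) in enumerate(segments):
        for token in source.split():
            corpus.append(token)
            owner.append(seg_id)
        corpus.append("\x00")  # sentinel so a phrase never spans two segments
        owner.append(seg_id)
    # Naive construction: sort suffix start positions by the suffix itself.
    suffix_array = sorted(range(len(corpus)), key=lambda i: corpus[i:])
    return corpus, owner, suffix_array


def find_occurrences(phrase, corpus, suffix_array):
    """Binary-search the suffix array for every position where phrase starts."""
    n = len(phrase)

    def prefix(i):
        start = suffix_array[i]
        return corpus[start:start + n]

    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose prefix is >= phrase
        mid = (lo + hi) // 2
        if prefix(mid) < phrase:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(suffix_array)
    while lo < hi:  # first suffix whose prefix is > phrase
        mid = (lo + hi) // 2
        if prefix(mid) <= phrase:
            lo = mid + 1
        else:
            hi = mid
    return suffix_array[first:lo]


def sample_pairs(phrase, index, segments, tm_weights, limit=2):
    """Return up to `limit` sentence pairs containing phrase, highest-weight TMs first."""
    corpus, owner, suffix_array = index
    hits = find_occurrences(phrase.split(), corpus, suffix_array)
    seg_ids = sorted({owner[pos] for pos in hits},
                     key=lambda s: (-tm_weights.get(segments[s][0], 0.0), s))
    return [segments[s] for s in seg_ids[:limit]]


# Hypothetical data: three segments from two TMs, with context-derived weights.
segments = [
    ("generic", "the bank of the river", "la riva del fiume"),
    ("payments", "transfer to your bank account", "trasferisci sul tuo conto bancario"),
    ("payments", "your bank may charge a fee", "la tua banca potrebbe addebitare una commissione"),
]
tm_weights = {"payments": 0.8, "generic": 0.1}
index = build_index(segments)
sampled = sample_pairs("bank", index, segments, tm_weights, limit=2)
print(sampled)
```

Because nothing is precomputed beyond the index, adding a new segment only means extending the index, which is what makes the approach incremental.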
Adaptive Language Model
Why is this different from Matecat or Lilt?
Learns for all users not just one
Uses context
Quality - Using the TAUS Data Cloud
MS Translator Hub - commercial adaptive engine by Microsoft
[Chart; test sets: Paypal 1, Paypal 2, Paypal 3, Linkedin 1]
MS Translator Hub vs ModernMT
ModernMT - our adaptive and incremental solution
[Chart; test sets: Paypal 1, Paypal 2, Paypal 3, Linkedin 1]
Initial Setup
(100M parallel words, 1B monolingual, $1 / hour AWS)
System       Setup time   AWS cost
MMT          3 hours      $3
Moses        30 hours     $30
Neural MT    300 hours    $300
Translation speed
(100M parallel words, 1B monolingual, $1 / hour AWS)
System       Speed
MMT          855 w/s
Moses        455 w/s
Neural MT    409 w/s
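To put the setup and speed figures side by side, here is a small worked calculation. It assumes the figures map to MMT, Moses and Neural MT in the order listed on the slides, that translation runs on the same $1 / hour instance, and a hypothetical job size of 10M words.

```python
AWS_RATE_USD_PER_HOUR = 1.0  # rate quoted on the slides

# Figures from the slides, matched to systems by listing order (an assumption).
SETUP_HOURS = {"MMT": 3, "Moses": 30, "Neural MT": 300}
SPEED_WORDS_PER_SECOND = {"MMT": 855, "Moses": 455, "Neural MT": 409}


def setup_cost(system):
    """Setup cost in USD: setup hours times the hourly rate."""
    return SETUP_HOURS[system] * AWS_RATE_USD_PER_HOUR


def hours_to_translate(system, words):
    """Wall-clock hours needed to translate `words` at the quoted speed."""
    return words / SPEED_WORDS_PER_SECOND[system] / 3600


def total_cost(system, words):
    """Setup cost plus translation time billed at the same hourly rate."""
    return setup_cost(system) + hours_to_translate(system, words) * AWS_RATE_USD_PER_HOUR


JOB_WORDS = 10_000_000  # hypothetical job size
for system in SETUP_HOURS:
    print(f"{system:10s} setup ${setup_cost(system):6.2f}  "
          f"translate {hours_to_translate(system, JOB_WORDS):5.2f} h  "
          f"total ${total_cost(system, JOB_WORDS):7.2f}")
```

At these rates the setup time dominates the bill, which is why skipping the initial training matters more than the difference in translation speed.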
Marco, stop talking. It's Jaap time.
TAUS Data Cloud
● Largest industry-shared repository of translation data
● A neutral and secure repository platform for
○ Sharing/pooling translation data based on a reciprocity model
○ Searching domain-specific or general data
○ Leveraging Translation Data
● Solid legal framework established by 45 founding members
● Addresses the shortage of available in-domain parallel data from the
industry
● September 2016: 72B+ words in the repository
● 10M to 100M words per ModernMT language pair
Collecting from the Web - Hard!
● The Web is large - even the so-called Surface or Indexable Web
● The Web is messy
● The Web is constantly in flux
● Not many organizations crawl the entire indexable web
○ Google - about 49B web pages in index (Source: www.worldwidewebsize.com)
○ Microsoft - about 20B web pages in index (Source: www.worldwidewebsize.com)
● Other crawls are focused crawls on a subset with certain criteria/goals
● Still hard for the same reasons
Common Crawl Comes to the Rescue
● Commoncrawl.org
○ “CommonCrawl is a 501(c)(3) non-profit organization dedicated to providing a copy of
the internet to internet researchers, companies and individuals at no cost for the
purpose of research and analysis.”
● On average 1.5B unique URLs per crawl
● A very good resource for sourcing bilingual and monolingual data for machine
translation purposes
○ Prototype developed by academic developers in 2012/2013 showed potential to mine
parallel corpora with millions of source words
Common Crawl Comes to the Rescue
● Implemented data collection pipeline based on prototype techniques
● Collecting monolingual and bilingual data
● Open sourced at https://github.com/ModernMT/DataCollection
● We are making the indices of parallel pages we discover available
○ Saves running half of the data collection pipeline
○ Each user still has to download their own data
● Avoids potential copyright issues
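As an illustration of what an index of parallel pages might contain, the sketch below pairs URLs that differ only in a language-code path segment. The slides do not describe how the pipeline matches pages, so this heuristic is an assumption, and the function names and example.com URLs are hypothetical.

```python
from collections import defaultdict


def language_key(url, languages):
    """Mask the first path segment that is a language code.

    Returns (masked_url, language) or None if no such segment exists.
    """
    parts = url.split("/")
    for i, part in enumerate(parts):
        if part in languages:
            masked = parts[:i] + ["*"] + parts[i + 1:]
            return "/".join(masked), part
    return None


def candidate_pairs(urls, source_lang, target_lang):
    """Return sorted (source_url, target_url) pairs sharing the same masked URL."""
    by_key = defaultdict(dict)
    for url in urls:
        keyed = language_key(url, {source_lang, target_lang})
        if keyed is not None:
            key, lang = keyed
            by_key[key][lang] = url
    return sorted(
        (langs[source_lang], langs[target_lang])
        for langs in by_key.values()
        if source_lang in langs and target_lang in langs
    )


# Hypothetical crawl listing.
urls = [
    "http://example.com/en/help/refunds",
    "http://example.com/it/help/refunds",
    "http://example.com/en/about",
    "http://example.com/blog/2016/10",
]
pairs = candidate_pairs(urls, "en", "it")
print(pairs)
```

An index in this form holds only URLs, so each user fetches the page content themselves, in line with the copyright point above.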
Parallel Data Stats
Monolingual Data
What’s next
● Release 0.14 - Next Week
○ Planned for AMTA next week. 45 languages supported, adding incremental learning.
● Baseline engines and data - 3 months
○ Finish the crawling and legal activity to release the data for the baseline engines.
● Neural MT - 12 months
○ Engineering effort to make it cost-effective, incremental and context aware.
Included by default in MMT
How to contribute
● Do you want to use MMT? Provide Feedback (it is on GitHub).
● Do you want your engineers to contribute to the project?
● Do you want to add your data to the TAUS Data Cloud and help
share baseline engines?
Thank you
https://github.com/ModernMT/MMT
