Presentation by Marco Trombetti (Translated) and Jaap van der Meer (TAUS), followed by a discussion about effective ways of using data.
Effective MT customization is not so straightforward. While it has been shown that pooling data, even with competitors in the same industry domain, has a positive effect on MT quality, combining data from different sources and selecting the right set of data to build well-performing MT systems is often still done more like alchemy than science. For many language combinations and domains, little or no data is available. Even if nominally sufficient data is available, adding data that is not appropriate or not of high enough quality leads to diminishing returns. In the ModernMT EU project we faced the same challenges and built a large data repository combining data from the industry-leading translation data sharing platforms TAUS Data Cloud and Translated’s MyMemory, public data, and data sourced from the open web repository Common Crawl. ModernMT uses context-aware data selection to choose data from this repository, combining it with data optionally submitted by the user to create domain-adapted MT systems on the fly. In this session we present lessons we learned from using data from TAUS, MyMemory, Oracle, PayPal and LinkedIn for context-aware and adaptive MT.
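As a rough illustration of what context-aware data selection means in practice, the sketch below weights each translation memory (TM) by its similarity to the text being translated. It is a minimal sketch only: the bag-of-words cosine scoring, the toy TM contents and the function names are assumptions made for illustration, not ModernMT's actual selection algorithm.

```python
# Minimal sketch of context-aware data selection (illustrative assumptions only).
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy repository: a few TMs, each represented here by its source-side text.
tms = {
    "legal":   "the parties agree that this contract shall be governed by law",
    "it":      "click the settings icon to configure your account preferences",
    "medical": "the patient was administered the prescribed dose twice daily",
}

def context_weights(input_text: str) -> dict:
    """Weight every TM by its similarity to the document being translated."""
    query = Counter(input_text.lower().split())
    scores = {name: cosine(query, Counter(text.split())) for name, text in tms.items()}
    total = sum(scores.values()) or 1.0
    return {name: score / total for name, score in scores.items()}

# The highest-weighted TMs contribute most data to the on-the-fly adapted engine.
print(context_weights("open the settings to change your account"))
```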
6. Problems with current open-source MT
● Does not adapt to context
● Today you often end up with the absurd result: more data = lower quality
7. Welcome to MMT
● Incremental: Learns corrections in seconds.
● Adapts to context as you use it.
● No more initial training needed, just like our old TMs :)
● Comes with data. Lots of data.
11. Indexed instead of Training
● TMs are indexed with a suffix array (SA)
● The phrase table is built on the fly by sampling from the SA
● Phrases from the TMs with the highest weights are sampled first (see the sketch below)
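To make the "indexed instead of trained" idea concrete, here is a minimal sketch of on-the-fly phrase sampling from a suffix-array index. The toy corpus, the sample size and the uniform sampling are assumptions for illustration; a real system additionally biases the sample towards the highest-weighted TMs and extracts the aligned target phrases from the sampled positions.

```python
# Minimal sketch of building phrase-table entries on the fly from a suffix array.
# Requires Python 3.10+ for the key= argument of bisect; all data here is toy data.
import bisect
import random

corpus = "the cat sat on the mat the cat ate".split()  # source side of the indexed TMs

# Suffix array: all suffix start positions, sorted by the suffix they start.
suffix_array = sorted(range(len(corpus)), key=lambda i: corpus[i:])

def find_occurrences(phrase):
    """Binary-search the suffix array for suffixes that start with `phrase`."""
    key = lambda i: corpus[i:i + len(phrase)]
    lo = bisect.bisect_left(suffix_array, phrase, key=key)
    hi = bisect.bisect_right(suffix_array, phrase, key=key)
    return [suffix_array[i] for i in range(lo, hi)]

SAMPLE_SIZE = 2  # a real system samples far more occurrences

occurrences = find_occurrences("the cat".split())
# Instead of extracting every match at training time, sample a few at query time;
# phrase-table entries are then built only from the sampled positions.
sample = random.sample(occurrences, min(SAMPLE_SIZE, len(occurrences)))
print(sample)
```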
20. TAUS Data Cloud
● Largest industry-shared repository of translation data
● A neutral and secure repository platform for
○ Sharing/pooling translation data based on a reciprocity model
○ Searching domain-specific or general data
○ Leveraging Translation Data
● Solid legal framework established by 45 founding members
● Addresses the shortage of available in-domain parallel data from the industry
● September 2016: 72B+ words in the repository
● 10M to 100M words per ModernMT language pair
22. Collecting from the Web - Hard!
● The Web is large - even the so-called Surface or Indexable Web
● The Web is messy
● The Web is constantly in flux
● Not many organizations crawl the entire indexable web
○ Google - about 49B web pages in index (Source: www.worldwidewebsize.com)
○ Microsoft - about 20B web pages in index (Source: www.worldwidewebsize.com)
● Other crawls are focused crawls, covering only a subset of the web selected by specific criteria/goals
● Still hard for the same reasons
23. Common Crawl to the Rescue
● Commoncrawl.org
○ “CommonCrawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.”
● On average 1.5B unique URLs per crawl
● A very good resource for sourcing bilingual and monolingual data for machine translation purposes (see the sketch below)
○ A prototype developed by academic developers in 2012/2013 showed the potential to mine parallel corpora with millions of source words
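As an illustration of sourcing monolingual data from Common Crawl, the sketch below streams one WET file (Common Crawl's extracted plain text), runs language identification and keeps documents in the target language. It uses the third-party warcio and langid packages; the file name and target language are assumptions, and this is only the first step of what a production pipeline would do.

```python
# Minimal sketch: collect monolingual text in one language from a Common Crawl WET file.
# Assumptions: a locally downloaded WET file and the `warcio` + `langid` packages.
import langid
from warcio.archiveiterator import ArchiveIterator

TARGET_LANG = "de"
WET_PATH = "CC-MAIN-example.warc.wet.gz"  # hypothetical local copy of one WET file

kept = []
with open(WET_PATH, "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":  # WET text records have type "conversion"
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="replace")
        lang, _score = langid.classify(text)
        if lang == TARGET_LANG:
            kept.append((url, text))

print(f"kept {len(kept)} {TARGET_LANG} documents")
```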
24. Common Crawl to the Rescue
● Implemented data collection pipeline based on prototype techniques
● Collecting monolingual and bilingual data
● Open sourced at https://github.com/ModernMT/DataCollection
● We are making the indices of parallel pages we discover available
○ Saves running half of the data collection pipeline
○ Each user still has to download their own data
● Avoids potential copyright issues
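A common way to discover candidate parallel pages is to pair URLs that differ only in a language marker (e.g. /en/ vs. /fr/). The sketch below shows that idea in its simplest form; the marker list, the regular expression and the toy URLs are assumptions, and the actual pipeline at https://github.com/ModernMT/DataCollection applies far more filtering before a page pair is accepted.

```python
# Minimal sketch of URL-based candidate pairing for parallel pages.
# The language-marker list and the example URLs are illustrative assumptions.
import re
from collections import defaultdict

LANG_MARKER = re.compile(r"(?<=[/=._-])(en|fr|de|it|es)(?=[/=._-]|$)", re.IGNORECASE)

def url_key(url: str) -> str:
    """Normalize a URL by replacing any language marker with a placeholder."""
    return LANG_MARKER.sub("*", url)

urls = [
    "http://example.com/en/products/widget",
    "http://example.com/fr/products/widget",
    "http://example.com/en/about",
]

# URLs that collapse to the same key are candidate translations of each other.
buckets = defaultdict(list)
for u in urls:
    buckets[url_key(u)].append(u)

candidate_pairs = [group for group in buckets.values() if len(group) > 1]
print(candidate_pairs)
# -> [['http://example.com/en/products/widget', 'http://example.com/fr/products/widget']]
```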
27. What’s next
● Release 0.14 - Next Week
○ Planned for AMTA next week. 45 languages supported, adding incremental learning.
● Baseline engines and data - 3 months
○ Finish the crawling and legal work needed to release the data for the baseline engines.
● Neural MT - 12 months
○ Engineering effort to make it cost-effective, incremental and context-aware; included by default in MMT.
28. How to contribute
● Do you want to use MMT? Provide Feedback (it is on GitHub).
● Do you want your engineers to contribute to the project?
● Do you want to add your data to the TAUS Data Cloud and help share baseline engines?