Presentation by Marco Trombetti (Translated) and Jaap van der Meer (TAUS), followed by a discussion about effective ways of using data.
Effective MT customization is not so straightforward. While it has been shown that pooling data, even with competitors in the same industry domain, has a positive effect on MT quality, combining data from different sources and selecting the right set of data to build well-performing MT systems is often still done more like alchemy than science. For many language combinations and domains, little or no data is available. Even if nominally sufficient data is available, adding data that is not appropriate or not of high enough quality leads to diminishing returns. In the ModernMT EU project we faced the same challenges and built a large data repository combining data from the industry-leading translation data sharing platforms TAUS Data Cloud and Translated’s MyMemory, public data, and data sourced from the open web repository Common Crawl. ModernMT uses context-aware data selection to choose data from this repository, combining it with data optionally submitted by the user to create domain-adapted MT systems on the fly. In this session we present lessons we learned from using data from TAUS, MyMemory, Oracle, PayPal and LinkedIn for context-aware and adaptive MT.
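As a rough illustration of what context-aware data selection means in practice, the sketch below weights each translation memory (TM) by its similarity to the text being translated. It is a minimal sketch only: the bag-of-words cosine scoring, the toy TM contents and the function names are assumptions made for illustration, not ModernMT's actual selection algorithm.

```python
# Minimal sketch of context-aware data selection (illustrative assumptions only).
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy repository: a few TMs, each represented here by its source-side text.
tms = {
    "legal":   "the parties agree that this contract shall be governed by law",
    "it":      "click the settings icon to configure your account preferences",
    "medical": "the patient was administered the prescribed dose twice daily",
}

def context_weights(input_text: str) -> dict:
    """Weight every TM by its similarity to the document being translated."""
    query = Counter(input_text.lower().split())
    scores = {name: cosine(query, Counter(text.split())) for name, text in tms.items()}
    total = sum(scores.values()) or 1.0
    return {name: score / total for name, score in scores.items()}

# The highest-weighted TMs contribute most data to the on-the-fly adapted engine.
print(context_weights("open the settings to change your account"))
```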
6. Problems with current open-source MT
● Does not adapt to context
● Today you often end up with the absurd result: more data = lower quality
7. Welcome to MMT
● Incremental: Learns corrections in seconds.
● Adapts to context as you use it.
● No more initial training needed, just like our old TMs :)
● Comes with data. Lots of data.
11. Indexed instead of Training
● TMs are indexed with a suffix array (SA)
● The phrase table is built on the fly by sampling from the SA
● Phrases from the TMs with the highest weights are sampled first (see the sketch below)
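To make the "indexed instead of trained" idea concrete, here is a minimal sketch of on-the-fly phrase sampling from a suffix-array index. The toy corpus, the sample size and the uniform sampling are assumptions for illustration; a real system additionally biases the sample towards the highest-weighted TMs and extracts the aligned target phrases from the sampled positions.

```python
# Minimal sketch of building phrase-table entries on the fly from a suffix array.
# Requires Python 3.10+ for the key= argument of bisect; all data here is toy data.
import bisect
import random

corpus = "the cat sat on the mat the cat ate".split()  # source side of the indexed TMs

# Suffix array: all suffix start positions, sorted by the suffix they start.
suffix_array = sorted(range(len(corpus)), key=lambda i: corpus[i:])

def find_occurrences(phrase):
    """Binary-search the suffix array for suffixes that start with `phrase`."""
    key = lambda i: corpus[i:i + len(phrase)]
    lo = bisect.bisect_left(suffix_array, phrase, key=key)
    hi = bisect.bisect_right(suffix_array, phrase, key=key)
    return [suffix_array[i] for i in range(lo, hi)]

SAMPLE_SIZE = 2  # a real system samples far more occurrences

occurrences = find_occurrences("the cat".split())
# Instead of extracting every match at training time, sample a few at query time;
# phrase-table entries are then built only from the sampled positions.
sample = random.sample(occurrences, min(SAMPLE_SIZE, len(occurrences)))
print(sample)
```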
20. TAUS Data Cloud
● Largest industry-shared repository of translation data
● A neutral and secure repository platform for
○ Sharing/pooling translation data based on a reciprocity model
○ Searching domain-specific or general data
○ Leveraging Translation Data
● Solid legal framework established by 45 founding members
● Addresses the shortage of available in-domain parallel data from the industry
● September 2016: 72B+ words in the repository
● 10M to 100M words per ModernMT language pair
22. Collecting from the Web - Hard!
● The Web is large - even the so-called Surface or Indexable Web
● The Web is messy
● The Web is constantly in flux
● Not many organizations crawl the entire indexable web
○ Google - about 49B web pages in index (Source: www.worldwidewebsize.com)
○ Microsoft - about 20B web pages in index (Source: www.worldwidewebsize.com)
● Other crawls are focused crawls, covering only a subset of the web selected by specific criteria/goals
● Still hard for the same reasons
23. Common Crawl to the Rescue
● Commoncrawl.org
○ “CommonCrawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.”
● On average 1.5B unique URLs per crawl
● A very good resource for sourcing bilingual and monolingual data for machine translation purposes (see the sketch below)
○ A prototype developed by academic developers in 2012/2013 showed the potential to mine parallel corpora with millions of source words
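As an illustration of sourcing monolingual data from Common Crawl, the sketch below streams one WET file (Common Crawl's extracted plain text), runs language identification and keeps documents in the target language. It uses the third-party warcio and langid packages; the file name and target language are assumptions, and this is only the first step of what a production pipeline would do.

```python
# Minimal sketch: collect monolingual text in one language from a Common Crawl WET file.
# Assumptions: a locally downloaded WET file and the `warcio` + `langid` packages.
import langid
from warcio.archiveiterator import ArchiveIterator

TARGET_LANG = "de"
WET_PATH = "CC-MAIN-example.warc.wet.gz"  # hypothetical local copy of one WET file

kept = []
with open(WET_PATH, "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":  # WET text records have type "conversion"
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="replace")
        lang, _score = langid.classify(text)
        if lang == TARGET_LANG:
            kept.append((url, text))

print(f"kept {len(kept)} {TARGET_LANG} documents")
```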
24. Common Crawl to the Rescue
● Implemented data collection pipeline based on prototype techniques
● Collecting monolingual and bilingual data
● Open sourced at https://github.com/ModernMT/DataCollection
● We are making the indices of parallel pages we discover available
○ Saves running half of the data collection pipeline
○ Each user still has to download their own data
● Avoids potential copyright issues
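A common way to discover candidate parallel pages is to pair URLs that differ only in a language marker (e.g. /en/ vs. /fr/). The sketch below shows that idea in its simplest form; the marker list, the regular expression and the toy URLs are assumptions, and the actual pipeline at https://github.com/ModernMT/DataCollection applies far more filtering before a page pair is accepted.

```python
# Minimal sketch of URL-based candidate pairing for parallel pages.
# The language-marker list and the example URLs are illustrative assumptions.
import re
from collections import defaultdict

LANG_MARKER = re.compile(r"(?<=[/=._-])(en|fr|de|it|es)(?=[/=._-]|$)", re.IGNORECASE)

def url_key(url: str) -> str:
    """Normalize a URL by replacing any language marker with a placeholder."""
    return LANG_MARKER.sub("*", url)

urls = [
    "http://example.com/en/products/widget",
    "http://example.com/fr/products/widget",
    "http://example.com/en/about",
]

# URLs that collapse to the same key are candidate translations of each other.
buckets = defaultdict(list)
for u in urls:
    buckets[url_key(u)].append(u)

candidate_pairs = [group for group in buckets.values() if len(group) > 1]
print(candidate_pairs)
# -> [['http://example.com/en/products/widget', 'http://example.com/fr/products/widget']]
```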
27. What’s next
● Release 0.14 - Next Week
○ Planned for AMTA next week. 45 languages supported, adding incremental learning.
● Baseline engines and data - 3 months
○ Finish the crawling and legal work needed to release the data for the baseline engines.
● Neural MT - 12 months
○ Engineering effort to make it cost-effective, incremental and context-aware; included by default in MMT.
28. How to contribute
● Do you want to use MMT? Provide Feedback (it is on GitHub).
● Do you want your engineers to contribute to the project?
● Do you want to add your data to the TAUS Data Cloud and help share baseline engines?