Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficient translation - Kenneth Heafield

Dumpster diving for parallel corpora with efficient translation
paracrawl.eu browser.mt
Kenneth Heafield, University of Edinburgh
neural.mt

Problem ParaCrawl Browser Translation Conclusion

Do not scratch the
protected relic.

Need more data!
Photographing Beijing
tourist signs doesn’t scale.

Bureaucrats translate.
Harvest their data!

The chair broke.
Le présidente a éclaté.

Project
mine web for translations
for free: paracrawl.eu

Projects
Bergamot
Firefox translation extension
client-side
in progress: browser.mt

Projects
Part 1
Bergamot
Part 2
client-side
data
fast translation

ParaCrawl: crawl the web for parallel corpora
All 26 EU + EEA official languages
+3 Spanish co-official languages
4–1,178 Million words per language
510,482 Websites
1+ Petabyte of compressed web pages

Parallel Corpus Size
Language            Words
French      1,178,317,233
German        929,818,868
Spanish       897,891,704
Italian       533,512,632
Portuguese    299,634,135
Dutch         233,087,345
Russian       157,061,045
Polish        145,802,939
Swedish       138,264,978
Czech         117,385,158
Danish        106,565,546
Hungarian     104,292,635
Greek          88,669,279
Finnish        66,385,933
Romanian       62,189,306
Bulgarian      55,725,444
Slovak         45,636,383
Croatian       43,464,197
Slovenian      31,855,427
Estonian       30,858,140
Lithuanian     27,214,054
Latvian        23,656,140
Irish          21,909,039
Maltese         4,252,814
Words on English side, after filtering

Improving Quality
ParaCrawl BLEU Gain
From      To        Release 1  Release 4
English   Finnish   +0.0       +1.2
Finnish   English   +2.5       +4.6
English   Latvian   +0.7       +1.9
Latvian   English   +0.9       +2.5
English   Romanian  +0.6       +1.3
Romanian  English   +2.4       +4.0
English   Czech     -1.4       -0.1
Czech     English   +0.6       +1.1
English   German    -3.2       +1.2
German    English   -1.0       +3.1
Gains relative to WMT data without ParaCrawl.

[Pipeline diagram. Boxes: CommonCrawl; Targeted Crawls; Text Extraction; Language Detection; Identify Multilingual Sites; Document and Sentence Alignment; Cleaning; Evaluation. Arrow label: Target.]

Site Crawling
95% of translations we find are not in CommonCrawl.
Because CommonCrawl is too shallow.
→ We directly crawl multilingual sites.
→ Use the Internet Archive.

Learn what pages to crawl/links to follow?
URL: domain, language code, etc.
Link context: text, XPath
Bandit learning problem
Reward: pages in both languages are found
Ongoing work by Hieu Hoang.
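
The slide frames link selection as a bandit problem but does not give an algorithm. As a minimal sketch, assuming an epsilon-greedy policy (my choice, not necessarily the one used in the project), each "arm" is a kind of link and the reward is 1 when following it finds pages in both languages. The arm names and success probabilities below are made up for illustration.

```python
import random

class EpsilonGreedyCrawler:
    """Chooses which kind of link to follow next; each kind is one bandit arm."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {arm: 0 for arm in arms}
        self.mean_reward = {arm: 0.0 for arm in arms}

    def choose(self):
        # Explore a random arm with probability epsilon, otherwise exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.pulls))
        return max(self.pulls, key=lambda arm: self.mean_reward[arm])

    def update(self, arm, reward):
        # Incremental mean of the rewards observed for this arm.
        self.pulls[arm] += 1
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / self.pulls[arm]

# Hypothetical link types with made-up chances of leading to a translated page.
SUCCESS_RATE = {"url_has_language_code": 0.6, "link_text_names_language": 0.4, "other": 0.05}

crawler = EpsilonGreedyCrawler(SUCCESS_RATE)
world = random.Random(1)
STEPS = 2000
for _ in range(STEPS):
    arm = crawler.choose()
    reward = 1.0 if world.random() < SUCCESS_RATE[arm] else 0.0
    crawler.update(arm, reward)
print(crawler.pulls)
```

A real crawler would use features of the URL and link context (text, XPath) rather than three fixed arms; the sketch only shows the explore/exploit loop and the reward signal.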

Not Translated: wordpress.com
Blog hosting site
⇒ multilingual, but few translations.
We blacklist large untranslated sites.

Language classification
Say you’re looking for isiXhosa translations:
English Do you have pets?
isiXhosa Unazo izilwanaya zasekhaya?
isiXhosa occurs 0.000008x as often as English on the web.
This is lower than the error rate of language classification.
⇒ Most of the “isiXhosa” was actually baseball statistics.
⇒ Sometimes we need to build language models to filter.
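
The base-rate argument can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the 0.000008 ratio comes from the slide, while the classifier's 99% recall and 0.1% false-positive rate are illustrative numbers I picked, and all other text is treated as one pool the size of English.

```python
def precision(prior_ratio, recall, false_positive_rate):
    """Fraction of text labelled as the rare language that really is in it.

    prior_ratio: amount of rare-language text per unit of other text.
    """
    true_positives = prior_ratio * recall
    false_positives = 1.0 * false_positive_rate
    return true_positives / (true_positives + false_positives)

ratio = 0.000008  # from the slide
# Assumed classifier quality, for illustration only.
p = precision(ratio, recall=0.99, false_positive_rate=0.001)
print(f"{p:.2%} of the text labelled isiXhosa is really isiXhosa")
```

Even a classifier that is wrong only once in a thousand times produces output that is overwhelmingly false positives, which is why a second filter such as a language model is needed.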

Matching
We have text. How do we find translations?
Language codes in URLs [Resnik and Smith, 2003]
Translate to English, match [Uszkoreit et al, 2010]
Neural network vectors [Schwenk, 2018]

Matching
Translate everything to English.
⇒ Need translation system (can use dictionary)
⇒ Need fast translation
Match pages by tf-idf in (translated) English.
Then match sentences with n-gram overlap.
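
The two matching steps above can be sketched in plain Python. This is an illustration, not the ParaCrawl implementation: the smoothed idf formula, cosine similarity, and Jaccard overlap of bigrams are my assumptions about reasonable defaults, and the toy pages stand in for text already translated to English.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def ngram_overlap(s, t, n=2):
    """Jaccard overlap of word n-grams between two token lists."""
    def grams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    a, b = grams(s), grams(t)
    return len(a & b) / len(a | b) if a | b else 0.0

# Toy pages: original English, and foreign pages after translation to English.
english = ["opening hours of the museum".split(), "contact the bank by telephone".split()]
translated = ["call the bank by phone".split(), "the museum opening times".split()]

vectors = tfidf_vectors(english + translated)
en_vecs, tr_vecs = vectors[:len(english)], vectors[len(english):]
# For each English page, pick the most similar translated page.
matches = [max(range(len(tr_vecs)), key=lambda j: cosine(e, tr_vecs[j])) for e in en_vecs]
print(matches)
```

Once pages are paired, `ngram_overlap` can score candidate sentence pairs inside each page pair in the same translated-English space.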

Boilerplate: santander.co.uk
“Santander UK plc. Registered Office: 2 Triton Square, Regent’s Place,
London, NW1 3AN, United Kingdom. Registered Number 2294747.
Registered in England and Wales. www.santander.co.uk. Telephone 0800
389 7000. Calls may be recorded or monitored. Authorised by the
Prudential Regulation Authority and regulated by the Financial Conduct
Authority and the Prudential Regulation Authority. Our Financial
Services Register number is 106054. You can check this on the Financial
Services Register by visiting the FCA’s website www.fca.org.uk/register.
Santander and the flame logo are registered trademarks.”
⇒ Match pages on boilerplate.
⇒ Learn to translate boilerplate really well.
We use boilerpipe which tries to throw it out.

Templates: booking.com
“Solo travelers in particular like the location – they rated it 9.5 for
a one-person stay.”
“Les voyageurs individuels apprécient particulièrement
l’emplacement de cet établissement. Ils lui donnent la note de 9,5
pour un séjour en solo.”
“Solo travelers in particular like the location – they rated it 8.9 for
a one-person stay.”
“Les voyageurs individuels apprécient particulièrement
l’emplacement de cet établissement. Ils lui donnent la note de 8,9
pour un séjour en solo.”
A corpus of repetitive sentences is less useful.
⇒ Diversity cleaning.
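
The slide does not say how diversity cleaning is done. One simple reading, sketched here as an assumption, is to mask numbers and keep only one sentence pair per resulting template. The function names are hypothetical.

```python
import re

def template_key(sentence):
    """Lowercase and replace every number (e.g. 9.5 or 8,9) with a placeholder."""
    return re.sub(r"\d+(?:[.,]\d+)*", "<num>", sentence.lower())

def diversity_filter(pairs):
    """Keep the first sentence pair seen for each (source, target) template."""
    seen, kept = set(), []
    for src, tgt in pairs:
        key = (template_key(src), template_key(tgt))
        if key not in seen:
            seen.add(key)
            kept.append((src, tgt))
    return kept

pairs = [
    ("They rated it 9.5 for a one-person stay.", "Ils lui donnent la note de 9,5 pour un séjour en solo."),
    ("They rated it 8.9 for a one-person stay.", "Ils lui donnent la note de 8,9 pour un séjour en solo."),
    ("Breakfast is included.", "Le petit-déjeuner est compris."),
]
kept = diversity_filter(pairs)
print(len(kept))
```

Masking only digits is deliberately crude; templates that vary in names or dates would need a fuzzier key.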

Noise
Paid people to judge English–German sentences:
Okay 23%
Misaligned sentences 41%
Third language 3%
Both English 10%
Both German 10%
Untranslated sentences 4%
Short segments (≤2 tokens) 1%
Short segments (3–5 tokens) 5%
Non-linguistic characters 2%
[Koehn et al, 2018]

Cleaning
Supervised classifier trained on 50k good, 50k bad sentences
Handwritten patterns
Character-based language model
Test set attempts to have consistent cut-off across languages
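
To illustrate the character-based language model item, here is a minimal sketch of a character trigram model with add-alpha smoothing. The model order, the smoothing, and the tiny training text are all my assumptions; the slide does not specify what the project uses.

```python
import math
from collections import Counter

class CharTrigramLM:
    """Character trigram model with add-alpha smoothing."""

    def __init__(self, text, alpha=1.0):
        padded = "  " + text  # two spaces of left context
        self.alpha = alpha
        self.vocab_size = len(set(text)) + 1  # +1 slot for unseen characters
        self.trigrams = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
        self.contexts = Counter(padded[i:i + 2] for i in range(len(padded) - 2))

    def avg_logprob(self, sentence):
        """Average log probability per character; higher means more in-language."""
        padded = "  " + sentence
        total = 0.0
        for i in range(len(padded) - 2):
            context, trigram = padded[i:i + 2], padded[i:i + 3]
            numerator = self.trigrams[trigram] + self.alpha
            denominator = self.contexts[context] + self.alpha * self.vocab_size
            total += math.log(numerator / denominator)
        return total / max(1, len(sentence))

# Toy training text; a real filter would train on a large clean corpus.
lm = CharTrigramLM("the quick brown fox jumps over the lazy dog and the cat sat on the mat with the other cats")
in_language = lm.avg_logprob("the cat sat on the mat")
statistics = lm.avg_logprob("3 4 7 .312 .405 12 9")
print(in_language, statistics)
```

Sentences whose score falls below a threshold would be dropped; choosing that threshold consistently across languages is the cut-off problem the slide mentions.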

Shared Task on Corpus Filtering
Common techniques from 2018 Conference on MT:
Aggressive language model filtering
Score from translation systems, both directions
Remove near-duplicates on source and target (not translated)
Partially implemented

Copyright
Remember: 510,482 websites.
Crawls follow robots.txt
Crawler leaves contact information.
A few sites have asked to be removed and we have.
Under GDPR, people have the right to correct information.
We hope they do!
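
For the "Crawls follow robots.txt" point, the check a polite crawler makes before fetching a page can be sketched with Python's standard `urllib.robotparser`. The robots.txt content, user-agent string, and URLs are invented for the example.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt for an example site.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(allowed(ROBOTS, "example-crawler", "https://example.com/en/index.html"))
print(allowed(ROBOTS, "example-crawler", "https://example.com/private/data.html"))
```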

Company that sells corpora spreads copyright fear:
The first word of copyright is copy.

So I found them selling crawled corpora:
They took it down.

Summary
There’s training data for some languages.
Search engines have been mining the web for years.
Time for large open data.

Bergamot: Browser-based Machine Translation
browser.mt
This project has received funding from the European Union’s Horizon 2020 research and innovation
programme under grant agreement No 825303.

Motivation
Statoil (Norwegian state oil company) employment information and
contracts leaked on Translate.com –Norsk Rikskringkasting, 2017
Don’t trade your privacy for Google Translate.

Client-side neural machine translation as a Firefox extension:
Local processing ⇒ private.

Project Goals and Outline
Broad use as a Firefox extension + open platform
Fast on a desktop
Trustworthy
Support web forms
Domain adaptation

We’re Making a Public Product
⇒ User Experience Work Package

Speed on Desktops
CPU version of Marian toolkit developed with Microsoft and Intel.

Speed Contest
[Figure: BLEU on newstest2014 (y-axis, 18.0 to 28.0) against million translated source tokens per USD (x-axis, 0 to 140). Series: 2018 GPU systems, 2018 CPU systems, 2019 GPU systems, 2019 CPU systems. Labeled points: 2018 others GPU, 2018 others CPU, 2018 Marian GPU, 2018 Marian CPU, 2019 Marian CPU, 2019 Marian GPU.]

Some of the Optimizations
Tune model size,
1. Teacher-student
2. Greedy search
3. Simplify model structure
4. Integer arithmetic

Teacher-student
Option 1: Train a model directly.
Option 2: Teacher-student (Kim and Rush, 2016)
Teacher: Large high-quality translation model.
Teacher translates source-language sentences.
Student: model learns on output created by teacher.
Model                    GPU    BLEU
1xTeacher, beam size 8   109.7  28.1
4xTeacher, beam size 8   410.8  29.0
1xStudent, beam size 4    52.0  28.4
Even models with the same size improve slightly.
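
The data flow of teacher-student training can be sketched as follows. The teacher here is a toy word-for-word lookup that I made up so the example runs; in practice it would be the large translation model, and the student would then be trained on the resulting pairs instead of on human references.

```python
def build_student_corpus(source_sentences, teacher_translate):
    """Pair each source sentence with the teacher's translation of it."""
    return [(src, teacher_translate(src)) for src in source_sentences]

# Stand-in teacher, purely for illustration: a word-for-word lexicon lookup.
TOY_LEXICON = {"das": "the", "haus": "house", "ist": "is", "klein": "small"}

def toy_teacher(sentence):
    return " ".join(TOY_LEXICON.get(word, word) for word in sentence.lower().split())

corpus = build_student_corpus(["Das Haus ist klein"], toy_teacher)
print(corpus)
```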

Greedy Search
Normally: keep competing translations and take the highest probability.
Beam size is the number of competing translations.
Model GPU BLEU
Computing probabilities is expensive because we need to normalize.
Greedy can just pick the highest number without normalizing.
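
Why greedy search can skip normalization: softmax is monotonic, so the largest raw score is also the largest probability. A small self-contained check, with made-up scores:

```python
import math

def softmax(logits):
    """Normalize raw scores into probabilities (the expensive step)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

logits = [1.2, -0.3, 3.4, 0.7]  # made-up scores for four candidate words
greedy_choice = argmax(logits)               # no normalization needed
normalized_choice = argmax(softmax(logits))  # same answer, more work
print(greedy_choice, normalized_choice)
```

Beam search compares scores across hypotheses with different histories, which is why it needs the normalized probabilities and greedy search does not.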

Simplify model structure
A transformer model generates sentences from left to right.
Each step consults all previous steps. → O(n^2)
Zhang et al (2018): just average previous steps.
Update average on the fly → O(n).
Model                 GPU   BLEU
Baseline transformer  12.8  27.6
Averaged transformer   7.2  27.6
Further work: simplified simple recurrent unit.
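
The "update average on the fly" step is the standard running-mean recurrence: each new step costs constant work instead of revisiting all earlier steps. A minimal sketch on plain lists (the real model averages hidden-state vectors and applies further layers, which are omitted here):

```python
def running_averages(vectors):
    """Return, for each step t, the mean of vectors[0..t], in O(1) work per step."""
    average = [0.0] * len(vectors[0])
    history = []
    for t, vector in enumerate(vectors, start=1):
        # new_mean = old_mean + (x - old_mean) / t
        average = [a + (x - a) / t for a, x in zip(average, vector)]
        history.append(average)
    return history

print(running_averages([[2.0], [4.0], [9.0]]))
```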

Integer Arithmetic
Why Integers
Benchmarks: Memory bandwidth is limiting factor
⇒ Compress model.
More at once: P40 does 47 TOPS int8, 12 TOPS float.
Can do int8 with no quality loss [Quinn et al, 2018]
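
As a rough picture of what "compress model" to int8 means, here is a sketch of symmetric linear quantization with one scale per tensor. The scheme (scale by the maximum absolute value, clip to ±127) is a common default that I am assuming; the slide does not state which scheme was used.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] plus one float scale."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid a zero scale
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.0, 1.0, 0.333]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale)
```

Each weight now takes one byte instead of four, and the rounding error per weight is at most half a quantization step.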

Fast 8-bit matrix multiplication
_mm512_maddubs_epi16 aka vpmaddubsw
The only 512-bit wide multiply of 8-bit integers on Intel.
Multiply signed by unsigned integers, then sum adjacent pairs into 16-bit.
Why signed * unsigned?!
New 8-bit VNNI instruction is also signed * unsigned.
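
A scalar emulation can make the instruction's arithmetic concrete. This sketch models the pair-wise behaviour on short lists rather than 512-bit registers; the saturation to the signed 16-bit range is part of the instruction's documented behaviour, though the slide does not mention it.

```python
def maddubs(unsigned_bytes, signed_bytes):
    """Emulate vpmaddubsw: unsigned * signed bytes, adjacent pairs summed into int16."""
    assert len(unsigned_bytes) == len(signed_bytes) and len(unsigned_bytes) % 2 == 0
    assert all(0 <= u <= 255 for u in unsigned_bytes)
    assert all(-128 <= s <= 127 for s in signed_bytes)
    result = []
    for i in range(0, len(unsigned_bytes), 2):
        total = (unsigned_bytes[i] * signed_bytes[i]
                 + unsigned_bytes[i + 1] * signed_bytes[i + 1])
        result.append(max(-32768, min(32767, total)))  # saturate to int16
    return result

print(maddubs([1, 2, 255, 255], [3, -4, 127, 127]))
```

The second pair shows the catch: 255 * 127 * 2 does not fit in 16 bits, so large values saturate.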

Working Around signed * unsigned
Skew
Add 128 to one of the arguments.
A ∗ B = A ∗ (128J + B) − A ∗ 128J
where 128J is a matrix full of 128.
Efficient if A is constant.
Normalize sign
Manually manipulate sign bits in the multiply.
⇒ Extra instructions in hot loop.
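
The skew identity A ∗ B = A ∗ (128J + B) − A ∗ 128J can be verified on small matrices. In this sketch B holds signed 8-bit values, so 128J + B is in the unsigned range 0..255, and each entry of A ∗ 128J is just 128 times a row sum of A, which is why it can be precomputed when A is constant. Plain Python lists stand in for the vector registers.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def skewed_matmul(a, b):
    """Compute A*B using only a signed-by-unsigned product plus a correction."""
    shifted = [[value + 128 for value in row] for row in b]  # now in 0..255
    assert all(0 <= value <= 255 for row in shifted for value in row)
    # Every column of A * 128J equals 128 * (row sum of A); precomputable if A is fixed.
    correction = [128 * sum(row) for row in a]
    product = matmul(a, shifted)
    return [[p - c for p in row] for row, c in zip(product, correction)]

A = [[1, -2], [3, 4]]
B = [[-128, 5], [127, -7]]
print(skewed_matmul(A, B))
print(matmul(A, B))
```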

4 bits?
Quantize log parameters (Miyashita et al, 2016).
Try quantizing a trained model.
3-bit  4-bit  5-bit  6-bit  7-bit  8-bit
0.72   28.92  35.08  35.60  35.69  35.67
5 bits is annoying to fit in registers
... so close to 4 bits!
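
To show what quantizing in the log domain looks like, here is a generic sketch: one bit for the sign and the remaining bits for a power-of-two exponent relative to the largest weight. It is not a reproduction of the method in Miyashita et al; the scaling and the handling of zero are my assumptions.

```python
import math

def log_quantize(values, bits=4):
    """Round each value to sign * scale * 2**exponent with a limited exponent range."""
    levels = 2 ** (bits - 1)  # one bit holds the sign
    scale = max(abs(v) for v in values) or 1.0
    quantized = []
    for v in values:
        if v == 0.0:
            exponent = -(levels - 1)  # no exact zero: use the smallest magnitude
        else:
            exponent = round(math.log2(abs(v) / scale))
            exponent = max(-(levels - 1), min(0, exponent))
        quantized.append(math.copysign(scale * 2.0 ** exponent, v))
    return quantized

print(log_quantize([1.0, 0.5, -0.26, 0.001], bits=4))
```

Log-spaced levels spend most of their resolution near zero, where most trained weights lie.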

Continued Training
First, train as normal with floats.
Then quantize parameters after every update.
Remember the rounding error so small changes can accumulate.
-0.19 BLEU with 4-bit quantization.
https://arxiv.org/abs/1909.06091 [Aji and Heafield, 2019]
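
One way to read "remember the rounding error" is as a residual carried from update to update. The sketch below uses a single scalar weight, a uniform grid with step 0.25, and made-up updates; it is an illustration of the idea, not the procedure from the paper.

```python
def quantize(value, step=0.25):
    """Round to the nearest multiple of step."""
    return round(value / step) * step

def train_with_error_feedback(weight, updates, step=0.25):
    residual = 0.0  # rounding error remembered from earlier updates
    for update in updates:
        full_precision = weight + residual + update
        weight = quantize(full_precision, step)
        residual = full_precision - weight
    return weight

def train_without_feedback(weight, updates, step=0.25):
    for update in updates:
        weight = quantize(weight + update, step)  # small updates are rounded away
    return weight

updates = [0.05] * 10  # each update is smaller than half a quantization step
print(train_with_error_feedback(0.0, updates))
print(train_without_feedback(0.0, updates))
```

Without the residual, ten small updates leave the weight at 0.0; with it, they add up to the expected 0.5.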

Decapitating Transformers
Default Transformer Model
Encoder 6-layers, self attention
Decoder 6-layers, self attention, encoder attention
8 heads/type/layer: 144 heads.
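
The 144 figure follows from the layer and attention-type counts listed above; spelled out as arithmetic:

```python
HEADS_PER_ATTENTION = 8

encoder_heads = 6 * 1 * HEADS_PER_ATTENTION  # 6 layers, self attention only
decoder_heads = 6 * 2 * HEADS_PER_ATTENTION  # 6 layers, self + encoder attention
total_heads = encoder_heads + decoder_heads
print(encoder_heads, decoder_heads, total_heads)
```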

144 Heads
Voita et al 2019: prune 50% after training.
Pruning before training doesn’t work.
PhD student Maxi Behnke: prune during training?

Lottery ticket hypothesis
Some parameters are luckily initialized
Bigger models have more entries
Even if most can be discarded.
(Frankle and Carbin, 2018)
Remove entire unlucky heads?

Head Pruning Results
Heads pruned  0%       56%     72%     83%
Size          672M     592M    568M    552M
Reduction     -        11.90%  15.48%  17.86%
Avg. time     107.58s  78.44s  70.50s  63.62s
Speed-up      -        1.37×   1.53×   1.69×
∆ BLEU        -        -0.07   -0.20   -0.93

Quality Estimation
https://www.haaretz.com/israel-news/palestinian-arrested-over-mistranslated-good-morning-facebook-post-1.5459427
Show quality estimates to the user in the browser:
User interface research
Quality estimation research

Old Danish Ticket: Klippekort
No longer in use
Can apply for a refund
... via a form
Public domain image from Wikipedia.

Danish Ticket Refund Form
Expects answers in Danish
So I traded mine for a beer with Dirk Hovy at EMNLP 2017

What if you don’t have Dirk Hovy?
Answer a Danish web form in Danish:
Be confident my answers are correct.
... Even though I don’t speak Danish.
⇒ Browser will prompt to rephrase when uncertain.
... And use all rephrasings to translate better.

We’re in the Browser
The browser knows your history (if you let it).
It knows what site you are on.
Adapt translations to the user and page.
Much less creepy when all processing is local.

Bergamot Summary
Privacy-preserving translation via local processing.
Coming as a Firefox extension.
Anybody want to help with Ukrainian?

Questions?
Hiring
PhD: https://edinburghnlp.inf.ed.ac.uk/cdt/
Job: contact kheafiel@inf.ed.ac.uk
Mozilla, to work on translation:
https://careers.mozilla.org/position/gh/1666741/