Dumpster diving for parallel
corpora with efficient translation
paracrawl.eu browser.mt
Kenneth Heafield, University of Edinburgh
neural.mt
Problem ParaCrawl Browser Translation Conclusion
2
Do not scratch the
protected relic.
Need more data!
Photographing Beijing
tourist signs doesn’t scale.
Bureaucrats translate.
Harvest their data!
The chair broke.
Le présidente a éclaté.
Projects
Part 1: ParaCrawl
mine web for translations
for free: paracrawl.eu
Part 2: Bergamot
Firefox translation extension
client-side
in progress: browser.mt
data
fast translation
ParaCrawl: crawl the web for parallel corpora
All 26 EU + EEA official languages
+3 Spanish co-official languages
4–1,178 million words per language
510,482 Websites
1+ Petabyte of compressed web pages
Parallel Corpus Size
Language      Words
French        1,178,317,233
German          929,818,868
Spanish         897,891,704
Italian         533,512,632
Portuguese      299,634,135
Dutch           233,087,345
Russian         157,061,045
Polish          145,802,939
Swedish         138,264,978
Czech           117,385,158
Danish          106,565,546
Hungarian       104,292,635
Greek            88,669,279
Finnish          66,385,933
Romanian         62,189,306
Bulgarian        55,725,444
Slovak           45,636,383
Croatian         43,464,197
Slovenian        31,855,427
Estonian         30,858,140
Lithuanian       27,214,054
Latvian          23,656,140
Irish            21,909,039
Maltese           4,252,814
Words on English side, after filtering
Improving Quality
ParaCrawl BLEU Gain
From      To        Release 1  Release 4
English   Finnish   +0.0       +1.2
Finnish   English   +2.5       +4.6
English   Latvian   +0.7       +1.9
Latvian   English   +0.9       +2.5
English   Romanian  +0.6       +1.3
Romanian  English   +2.4       +4.0
English   Czech     -1.4       -0.1
Czech     English   +0.6       +1.1
English   German    -3.2       +1.2
German    English   -1.0       +3.1
Gains relative to WMT data without ParaCrawl.
[Figure: ParaCrawl pipeline diagram with stages CommonCrawl, Identify Multilingual Sites, Targeted Crawls, Text Extraction, Language Detection, Document and Sentence Alignment, Cleaning, Evaluation.]
Site Crawling
95% of translations we find are not in CommonCrawl.
Because CommonCrawl is too shallow.
→ We directly crawl multilingual sites.
→ Use the Internet Archive.
Learn what pages to crawl/links to follow?
URL: domain, language code, etc.
Link context: text, XPath
Bandit learning problem
Reward: pages in both languages are found
Ongoing work by Hieu Hoang.
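A minimal sketch of the bandit framing above, not the project's crawler. It assumes an epsilon-greedy policy, and the arm names (link categories) and their success rates are made up for illustration. Reward is 1 when following a link finds pages in both languages.

```python
import random


class EpsilonGreedyCrawler:
    """Epsilon-greedy bandit over hypothetical link categories (arms)."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {arm: 0 for arm in arms}
        self.total_reward = {arm: 0.0 for arm in arms}

    def mean(self, arm):
        n = self.pulls[arm]
        return self.total_reward[arm] / n if n else 0.0

    def choose(self):
        # Try every arm once before trusting the estimates.
        untried = [a for a, n in self.pulls.items() if n == 0]
        if untried:
            return untried[0]
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.pulls))
        return max(self.pulls, key=self.mean)

    def update(self, arm, reward):
        self.pulls[arm] += 1
        self.total_reward[arm] += reward


def simulate(true_rates, steps=2000, seed=0):
    """Simulated crawl: true_rates are assumed, not measured, success rates."""
    crawler = EpsilonGreedyCrawler(list(true_rates), seed=seed)
    env = random.Random(seed + 1)
    for _ in range(steps):
        arm = crawler.choose()
        # Reward 1.0 stands for "pages in both languages were found".
        reward = 1.0 if env.random() < true_rates[arm] else 0.0
        crawler.update(arm, reward)
    return crawler
```

In a real crawler the arms would be built from the URL and link-context features listed above, which likely calls for a contextual bandit rather than this fixed set of arms.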
Not Translated: wordpress.com
Blog hosting site
⇒ multilingual, but few translations.
We blacklist large untranslated sites.
Language classification
Say you’re looking for isiXhosa translations:
English Do you have pets?
isiXhosa Unazo izilwanaya zasekhaya?
isiXhosa occurs 0.000008x as often as English on the web.
This is lower than the error rate of language classification.
⇒ Most of the “isiXhosa” was actually baseball statistics.
⇒ Sometimes we need to build language models to filter.
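The base-rate effect above can be checked with Bayes' rule. In this sketch the 0.000008 figure is treated as the prior probability that a page is isiXhosa, and the classifier's 99% recall and 0.1% false-positive rate are assumed values for illustration, not measurements.

```python
def precision(prior, true_positive_rate, false_positive_rate):
    """Fraction of pages labelled as the rare language that really are in it."""
    hits = prior * true_positive_rate
    false_alarms = (1.0 - prior) * false_positive_rate
    return hits / (hits + false_alarms)


# Assumed classifier quality: 99% recall, 0.1% false positives.
rare = precision(prior=0.000008, true_positive_rate=0.99, false_positive_rate=0.001)
print(f"precision for the rare language: {rare:.4f}")
```

Under these assumptions fewer than 1 in 100 pages labelled isiXhosa would actually be isiXhosa, which is why a second filter is needed.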
Matching
We have text. How do we find translations?
Language codes in URLs [Resnik and Smith, 2003]
Translate to English, match [Uszkoreit et al, 2010]
Neural network vectors [Schwenk, 2018]
Matching
Translate everything to English.
⇒ Need translation system (can use dictionary)
⇒ Need fast translation
Match pages by tf-idf in (translated) English.
Then match sentences with n-gram overlap.
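A minimal sketch of the two matching steps above, assuming the foreign pages have already been translated to English. Whitespace tokenization, smoothed idf, greedy best-match and Jaccard bigram overlap are simplifications chosen for illustration; the function names are hypothetical and this is not the ParaCrawl aligner.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tfidf_vectors(docs):
    """Sparse tf-idf vectors (dicts) with smoothed idf."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            w: c * (math.log((1 + n) / (1 + doc_freq[w])) + 1.0)
            for w, c in counts.items()
        })
    return vectors


def cosine(u, v):
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_pages(english_docs, translated_docs):
    """For each English page, index of the most similar translated page."""
    vectors = tfidf_vectors(english_docs + translated_docs)
    eng = vectors[:len(english_docs)]
    trans = vectors[len(english_docs):]
    return [max(range(len(trans)), key=lambda j: cosine(e, trans[j])) for e in eng]


def ngram_overlap(a, b, n=2):
    """Jaccard overlap of word n-grams between two sentences."""
    def ngrams(text):
        toks = tokenize(text)
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    x, y = ngrams(a), ngrams(b)
    return len(x & y) / len(x | y) if x | y else 0.0
```

The greedy match does not enforce one-to-one pairing; a production aligner would need that, plus a score threshold for pages with no translation.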
Boilerplate: santander.co.uk
“Santander UK plc. Registered Office: 2 Triton Square, Regent’s Place,
London, NW1 3AN, United Kingdom. Registered Number 2294747.
Registered in England and Wales. www.santander.co.uk. Telephone 0800
389 7000. Calls may be recorded or monitored. Authorised by the
Prudential Regulation Authority and regulated by the Financial Conduct
Authority and the Prudential Regulation Authority. Our Financial
Services Register number is 106054. You can check this on the Financial
Services Register by visiting the FCA’s website www.fca.org.uk/register.
Santander and the flame logo are registered trademarks.”
⇒ Match pages on boilerplate.
⇒ Learn to translate boilerplate really well.
We use boilerpipe, which tries to throw it out.
Templates: booking.com
“Solo travelers in particular like the location – they rated it 9.5 for
a one-person stay.”
“Les voyageurs individuels apprécient particulièrement
l’emplacement de cet établissement. Ils lui donnent la note de 9,5
pour un séjour en solo.”
“Solo travelers in particular like the location – they rated it 8.9 for
a one-person stay.”
“Les voyageurs individuels apprécient particulièrement
l’emplacement de cet établissement. Ils lui donnent la note de 8,9
pour un séjour en solo.”
Corpus of repetitive sentences is less useful.
⇒ Diversity cleaning.
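One way diversity cleaning could work, as a sketch: mask numbers on both sides and keep a single sentence pair per resulting template. The masking rule and function names are assumptions for illustration; the talk does not say how the project implements this step.

```python
import re

# Matches integers and decimals written with "." or "," (9.5, 9,5, 1,178).
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def template_key(sentence):
    """Lowercase the sentence and replace every number with a placeholder."""
    return NUMBER.sub("<num>", sentence.lower())


def drop_template_duplicates(pairs):
    """Keep the first sentence pair seen for each (source, target) template."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        key = (template_key(src), template_key(tgt))
        if key not in seen:
            seen.add(key)
            kept.append((src, tgt))
    return kept
```

With this rule the two booking.com pairs above collapse to one, because they differ only in the rating.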
Noise
Paid people to judge English–German sentences:
Okay                          23%
Misaligned sentences          41%
Third language                 3%
Both English                  10%
Both German                   10%
Untranslated sentences         4%
Short segments (≤2 tokens)     1%
Short segments (3–5 tokens)    5%
Non-linguistic characters      2%
[Koehn et al, 2018]
Cleaning
Supervised classifier trained on 50k good, 50k bad sentences
Handwritten patterns
Character-based language model
Test set attempts to have consistent cut-off across languages
Shared Task on Corpus Filtering
Common techniques from 2018 Conference on MT:
Aggressive language model filtering
Score from translation systems, both directions
Remove near-duplicates on source and target (not translated)
Partially implemented
Copyright
Remember: 510,482 websites.
Crawls follow robots.txt
Crawler leaves contact information.
A few sites have asked to be removed, and we have removed them.
Under GDPR, people have the right to correct information.
We hope they do!
Company that sells corpora spreads copyright fear:
The first word of
copyright is copy.
So I found them selling crawled corpora:
They took it down.
Summary
There’s training data for some languages.
Search engines have been mining the web for years.
Time for large open data.
Bergamot: Browser-based Machine Translation
browser.mt
This project has received funding from the European Union’s Horizon 2020 research and innovation
programme under grant agreement No 825303.
Motivation
Statoil (Norwegian state oil company) employment information and
contracts leaked on Translate.com –Norsk Rikskringkasting, 2017
Don’t trade your privacy for Google Translate.
Client-side neural machine translation as a Firefox extension:
Local processing ⇒ private.
Project Goals and Outline
Broad use as a Firefox extension + open platform
Fast on a desktop
Trustworthy
Support web forms
Domain adaptation
We’re Making a Public Product
⇒ User Experience Work Package
Speed on Desktops
CPU version of Marian toolkit developed with Microsoft and Intel.
Speed Contest
[Figure: scatter plot of BLEU on newstest2014 (y-axis, 18.0 to 28.0) against million translated source tokens per USD (x-axis, 0 to 140). Series: 2018 and 2019 GPU and CPU systems, with Marian's entries marked separately from others'.]
Some of the Optimizations
Tune model size,
1 Teacher-student
2 Greedy search
3 Simplify model structure
4 Integer arithmetic
Teacher-student
Option 1: Train a model directly.
Option 2: Teacher-student (Kim and Rush, 2016)
Teacher: Large high-quality translation model.
Teacher translates source-language sentences.
Student: model learns on output created by teacher.
Model                    GPU     BLEU
1xTeacher, beam size 8   109.7   28.1
4xTeacher, beam size 8   410.8   29.0
1xStudent, beam size 4    52.0   28.4
1xStudent, beam size 1    19.9   28.2
Even models with same size improve slightly.
Greedy Search
Normally: keep competing translations and take the highest probability.
Beam size is the number of competing translations.
Model                    GPU    BLEU
1xStudent, beam size 4   52.0   28.4
1xStudent, beam size 2   31.9   28.4
1xStudent, beam size 1   19.9   28.2
Computing probabilities is expensive because we need to normalize.
Greedy can just pick the highest number without normalizing.
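A small sketch of why greedy search can skip normalization: softmax is monotonic, so the largest raw score (logit) is also the largest probability. The logit values in the usage line are made up.

```python
import math


def softmax(logits):
    """Normalize scores into probabilities (the expensive step)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Index of the best word, taken straight from the raw scores."""
    return max(range(len(logits)), key=logits.__getitem__)


def pick_via_probabilities(logits):
    """Same choice, but paying for the softmax first."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


scores = [1.2, -0.3, 4.5, 0.0]  # hypothetical output-layer scores
assert greedy_pick(scores) == pick_via_probabilities(scores)
```

Beam search with more than one hypothesis has to compare scores across hypotheses, so it cannot drop the normalization the same way.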
Simplify model structure
A transformer model generates sentences from left to right.
Each step consults all previous steps. → O(n^2)
Zhang et al (2018): just average previous steps.
Update average on the fly → O(n).
Model                  GPU    BLEU
Baseline transformer   12.8   27.6
Averaged transformer    7.2   27.6
Further work: simplified simple recurrent unit.
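The on-the-fly average can be sketched as follows. Only the averaging step is shown; the published averaged model may include further layers around it, and the function names are hypothetical. The naive version recomputes each prefix average from scratch (quadratic in sentence length), while the running version updates it in constant time per step.

```python
def naive_prefix_averages(vectors):
    """Average of steps 1..t recomputed at every t: O(n^2) work overall."""
    out = []
    for t in range(len(vectors)):
        prefix = vectors[: t + 1]
        out.append([sum(col) / len(prefix) for col in zip(*prefix)])
    return out


def running_prefix_averages(vectors):
    """Same averages, updating the previous one in place: O(n) work overall."""
    out = []
    avg = [0.0] * len(vectors[0]) if vectors else []
    for t, v in enumerate(vectors, start=1):
        # new_avg = old_avg + (x - old_avg) / t
        avg = [a + (x - a) / t for a, x in zip(avg, v)]
        out.append(avg)
    return out
```

The decoder only needs to carry one vector (the current average) between steps instead of all previous states.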
Integer Arithmetic
Why Integers
Benchmarks: Memory bandwidth is limiting factor
⇒ Compress model.
More at once: P40 does 47 TOPS int8, 12 TOPS float.
Can do int8 with no quality loss [Quinn et al, 2018]
Fast 8-bit matrix multiplication
_mm512_maddubs_epi16 aka vpmaddubsw
The only 512-bit wide multiply of 8-bit integers on Intel.
Multiply signed by unsigned integers, then sum adjacent pairs into 16-bit.
Why signed * unsigned?!
New 8-bit VNNI instruction is also signed * unsigned.
Working Around signed * unsigned
Skew
Add 128 to one of the arguments.
A ∗ B = A ∗ (128J + B) − A ∗ 128J
where 128J is a matrix full of 128.
Efficient if A is constant.
Normalize sign
Manually manipulate sign bits in the multiply.
⇒ Extra instructions in hot loop.
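The skew identity can be checked with plain integers. This sketch assumes both A and B hold signed 8-bit values; adding 128 to B makes it unsigned (0 to 255), so the main product is signed * unsigned, and the correction term depends only on row sums of A. It ignores the 16-bit saturation of the real instruction, and the function names are hypothetical.

```python
def matmul(a, b):
    """Reference integer matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def skewed_matmul(a, b):
    """Compute A * B as A * (128J + B) - A * 128J."""
    # Shift B from signed [-128, 127] to unsigned [0, 255].
    b_unsigned = [[v + 128 for v in row] for row in b]
    assert all(0 <= v <= 255 for row in b_unsigned for v in row)
    product = matmul(a, b_unsigned)  # signed * unsigned
    # A * 128J: every entry in row i equals 128 * sum(A[i]).
    # This can be precomputed once when A is constant.
    correction = [128 * sum(row) for row in a]
    return [[p - correction[i] for p in row] for i, row in enumerate(product)]
```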
4 bits?
Quantize log parameters (Miyashita et al, 2016).
Try quantizing a trained model.
3-bit  4-bit  5-bit  6-bit  7-bit  8-bit
0.72   28.92  35.08  35.60  35.69  35.67
5 bits is annoying to fit in registers
... so close to 4 bits!
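A simplified sketch of quantizing in the log domain: each parameter becomes a sign times a power of two, with one bit for the sign and the remaining bits indexing the exponent. The choice of exponent range (anchored at the largest magnitude) and the rounding rule are assumptions for illustration, not necessarily the scheme used in the experiments above.

```python
import math


def log_quantize(values, bits=4):
    """Round each value to sign * 2^e with 2^(bits-1) allowed exponents."""
    levels = 2 ** (bits - 1)  # one bit is spent on the sign
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return [0.0 for _ in values]
    top = math.ceil(math.log2(max_abs))  # largest exponent needed
    bottom = top - levels + 1            # smallest representable exponent
    out = []
    for v in values:
        if v == 0.0:
            out.append(0.0)
            continue
        e = round(math.log2(abs(v)))     # nearest exponent in the log domain
        e = min(top, max(bottom, e))     # clamp to the representable range
        out.append(math.copysign(2.0 ** e, v))
    return out
```

For example, `log_quantize([0.5, -0.24, 0.03, 1.0], bits=4)` maps -0.24 to -0.25 and 0.03 to 0.03125; small values keep relative precision that a uniform grid would lose.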
Continued Training
First, train as normal with floats.
Then quantize parameters after every update.
Remember the rounding error so small changes can accumulate.
-0.19 BLEU with 4-bit quantization.
https://arxiv.org/abs/1909.06091 [Aji and Heafield, 2019]
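A toy sketch of why remembering the rounding error matters, using a single parameter on a uniform grid of step 0.25 (an assumed quantizer chosen to keep the arithmetic exact; the paper's quantizer differs). Each update is too small to move the quantized value by itself, but with the residual carried forward the updates accumulate.

```python
def quantize(x, step=0.25):
    """Round to the nearest multiple of step."""
    return round(x / step) * step


def train_without_feedback(w, updates, step=0.25):
    """Quantize after every update and forget what rounding threw away."""
    for u in updates:
        w = quantize(w + u, step)
    return w


def train_with_feedback(w, updates, step=0.25):
    """Quantize after every update, carrying the rounding error forward."""
    residual = 0.0
    for u in updates:
        target = w + residual + u   # what the parameter "should" be
        w = quantize(target, step)
        residual = target - w       # remembered for the next update
    return w


small_updates = [0.0625] * 8  # hypothetical gradient steps, sum = 0.5
print(train_without_feedback(0.0, small_updates))
print(train_with_feedback(0.0, small_updates))
```

Without feedback the parameter never leaves 0.0; with feedback it ends at 0.5, the sum of the updates.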
Decapitating Transformers
Default Transformer Model
Encoder: 6 layers, self-attention
Decoder: 6 layers, self-attention and encoder attention
8 heads per attention type per layer: (6 + 6 + 6) × 8 = 144 heads.
144 Heads
Voita et al 2019: prune 50% after training.
Pruning before training doesn’t work.
PhD student Maxi Behnke: prune during training?
Lottery ticket hypothesis
Some parameters are luckily initialized
Bigger models have more entries
Even if most can be discarded.
(Frankle and Carbin, 2018)
Remove entire unlucky heads?
Head Pruning Results
Heads pruned   0%        56%      72%      83%
Size           672M      592M     568M     552M
Reduction      —         11.90%   15.48%   17.86%
Avg. time      107.58s   78.44s   70.50s   63.62s
Speed-up       —         1.37×    1.53×    1.69×
∆ BLEU         —         -0.07    -0.20    -0.93
Quality Estimation
https://www.haaretz.com/israel-news/palestinian-arrested-over-mistranslated-good-morning-facebook-post-1.5459427
Show quality estimates to the user in the browser:
User interface research
Quality estimation research
Old Danish Ticket: Klippekort
No longer in use
Can apply for a refund
... via a form
Public domain image from Wikipedia.
Danish Ticket Refund Form
Expects answers in Danish
So I traded mine for a beer with Dirk Hovy at EMNLP 2017
What if you don’t have Dirk Hovy?
Answer a Danish web form in Danish:
Be confident my answers are correct.
... Even though I don’t speak Danish.
⇒ Browser will prompt to rephrase when uncertain.
... And use all rephrasings to translate better.
We’re in the Browser
The browser knows your history (if you let it).
It knows what site you are on.
Adapt translations to the user and page.
Much less creepy when all processing is local.
Bergamot Summary
Privacy-preserving translation via local processing.
Coming as a Firefox extension.
Anybody want to help with Ukrainian?
Questions?
Hiring
PhD: https://edinburghnlp.inf.ed.ac.uk/cdt/
Job: contact kheafiel@inf.ed.ac.uk
Mozilla, to work on translation:
https://careers.mozilla.org/position/gh/1666741/
Problem ParaCrawl Browser Translation Conclusion
70

More Related Content

Similar to Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficient translation - Kenneth Heafield

From V8 to Modern Compilers
From V8 to Modern CompilersFrom V8 to Modern Compilers
From V8 to Modern Compilers
Min-Yih Hsu
 
Has responsive had it's day? : Amplience Customer Day 2014
Has responsive had it's day? : Amplience Customer Day 2014Has responsive had it's day? : Amplience Customer Day 2014
Has responsive had it's day? : Amplience Customer Day 2014
Ben Seymour
 
DevOps Fest 2019. Gianluca Arbezzano. DevOps never sleeps. What we learned fr...
DevOps Fest 2019. Gianluca Arbezzano. DevOps never sleeps. What we learned fr...DevOps Fest 2019. Gianluca Arbezzano. DevOps never sleeps. What we learned fr...
DevOps Fest 2019. Gianluca Arbezzano. DevOps never sleeps. What we learned fr...
DevOps_Fest
 
Intro to JavaScript - LA - July
Intro to JavaScript - LA - JulyIntro to JavaScript - LA - July
Intro to JavaScript - LA - July
Thinkful
 
SearchLeeds 2018 - Craig Campbell - How to fix the most common technical SEO ...
SearchLeeds 2018 - Craig Campbell - How to fix the most common technical SEO ...SearchLeeds 2018 - Craig Campbell - How to fix the most common technical SEO ...
SearchLeeds 2018 - Craig Campbell - How to fix the most common technical SEO ...
Branded3
 
Craig Campbell Search Leeds, Most Common SEO Technical Issues
Craig Campbell Search Leeds, Most Common SEO Technical IssuesCraig Campbell Search Leeds, Most Common SEO Technical Issues
Craig Campbell Search Leeds, Most Common SEO Technical Issues
Craig Campbell
 
Life of a Request by Ana Oprea
Life of a Request by Ana OpreaLife of a Request by Ana Oprea
Life of a Request by Ana Oprea
Rails Girls MUC
 
Canary Analyze All the Things
Canary Analyze All the ThingsCanary Analyze All the Things
Canary Analyze All the Things
royrapoport
 
Natural born conversion killers - Conversion Jam
Natural born conversion killers - Conversion JamNatural born conversion killers - Conversion Jam
Natural born conversion killers - Conversion Jam
Craig Sullivan
 
How Tracking Companies Circumvent Ad Blockers Using WebSockets
How Tracking Companies Circumvent Ad Blockers Using WebSocketsHow Tracking Companies Circumvent Ad Blockers Using WebSockets
How Tracking Companies Circumvent Ad Blockers Using WebSockets
Sajjad "JJ" Arshad
 
6 site migration fails and how to avoid them - BrightonSEO September 2018 - J...
6 site migration fails and how to avoid them - BrightonSEO September 2018 - J...6 site migration fails and how to avoid them - BrightonSEO September 2018 - J...
6 site migration fails and how to avoid them - BrightonSEO September 2018 - J...
Oban International
 
Globant Week Cali - Entendiendo el desarrollo Front-end del mundo moderno.
Globant Week Cali - Entendiendo el desarrollo Front-end del mundo moderno.Globant Week Cali - Entendiendo el desarrollo Front-end del mundo moderno.
Globant Week Cali - Entendiendo el desarrollo Front-end del mundo moderno.
Globant
 
Do Try This At Home Ajax Bookmarking, Cross Site Scripting, And Other Web 2 ...
Do Try This At Home  Ajax Bookmarking, Cross Site Scripting, And Other Web 2 ...Do Try This At Home  Ajax Bookmarking, Cross Site Scripting, And Other Web 2 ...
Do Try This At Home Ajax Bookmarking, Cross Site Scripting, And Other Web 2 ...jward5519
 
PharoJS: Hijack the JavaScript Ecosystem
PharoJS: Hijack the JavaScript EcosystemPharoJS: Hijack the JavaScript Ecosystem
PharoJS: Hijack the JavaScript Ecosystem
ESUG
 
Build a Game with JavaScript - Pasadena July
Build a Game with JavaScript - Pasadena JulyBuild a Game with JavaScript - Pasadena July
Build a Game with JavaScript - Pasadena July
Thinkful
 
Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...
Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...
Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...
Flink Forward
 
5 Steps to Open Your Website to a Global Audience
5 Steps to Open Your Website to a Global Audience5 Steps to Open Your Website to a Global Audience
5 Steps to Open Your Website to a Global AudienceAcquia
 
Going Global 101: How to Manage Your Websites Worldwide Using Drupal
Going Global 101: How to Manage Your Websites Worldwide Using DrupalGoing Global 101: How to Manage Your Websites Worldwide Using Drupal
Going Global 101: How to Manage Your Websites Worldwide Using Drupal
Acquia
 
Performance is not an Option - gRPC and Cassandra
Performance is not an Option - gRPC and CassandraPerformance is not an Option - gRPC and Cassandra
Performance is not an Option - gRPC and Cassandra
Dave Bechberger
 

Similar to Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficient translation - Kenneth Heafield (20)

From V8 to Modern Compilers
From V8 to Modern CompilersFrom V8 to Modern Compilers
From V8 to Modern Compilers
 
Has responsive had it's day? : Amplience Customer Day 2014
Has responsive had it's day? : Amplience Customer Day 2014Has responsive had it's day? : Amplience Customer Day 2014
Has responsive had it's day? : Amplience Customer Day 2014
 
DevOps Fest 2019. Gianluca Arbezzano. DevOps never sleeps. What we learned fr...
DevOps Fest 2019. Gianluca Arbezzano. DevOps never sleeps. What we learned fr...DevOps Fest 2019. Gianluca Arbezzano. DevOps never sleeps. What we learned fr...
DevOps Fest 2019. Gianluca Arbezzano. DevOps never sleeps. What we learned fr...
 
Intro to JavaScript - LA - July
Intro to JavaScript - LA - JulyIntro to JavaScript - LA - July
Intro to JavaScript - LA - July
 
SearchLeeds 2018 - Craig Campbell - How to fix the most common technical SEO ...
SearchLeeds 2018 - Craig Campbell - How to fix the most common technical SEO ...SearchLeeds 2018 - Craig Campbell - How to fix the most common technical SEO ...
SearchLeeds 2018 - Craig Campbell - How to fix the most common technical SEO ...
 
Craig Campbell Search Leeds, Most Common SEO Technical Issues
Craig Campbell Search Leeds, Most Common SEO Technical IssuesCraig Campbell Search Leeds, Most Common SEO Technical Issues
Craig Campbell Search Leeds, Most Common SEO Technical Issues
 
Life of a Request by Ana Oprea
Life of a Request by Ana OpreaLife of a Request by Ana Oprea
Life of a Request by Ana Oprea
 
Canary Analyze All the Things
Canary Analyze All the ThingsCanary Analyze All the Things
Canary Analyze All the Things
 
Natural born conversion killers - Conversion Jam
Natural born conversion killers - Conversion JamNatural born conversion killers - Conversion Jam
Natural born conversion killers - Conversion Jam
 
How Tracking Companies Circumvent Ad Blockers Using WebSockets
How Tracking Companies Circumvent Ad Blockers Using WebSocketsHow Tracking Companies Circumvent Ad Blockers Using WebSockets
How Tracking Companies Circumvent Ad Blockers Using WebSockets
 
6 site migration fails and how to avoid them - BrightonSEO September 2018 - J...
6 site migration fails and how to avoid them - BrightonSEO September 2018 - J...6 site migration fails and how to avoid them - BrightonSEO September 2018 - J...
6 site migration fails and how to avoid them - BrightonSEO September 2018 - J...
 
Globant Week Cali - Entendiendo el desarrollo Front-end del mundo moderno.
Globant Week Cali - Entendiendo el desarrollo Front-end del mundo moderno.Globant Week Cali - Entendiendo el desarrollo Front-end del mundo moderno.
Globant Week Cali - Entendiendo el desarrollo Front-end del mundo moderno.
 
Do Try This At Home Ajax Bookmarking, Cross Site Scripting, And Other Web 2 ...
Do Try This At Home  Ajax Bookmarking, Cross Site Scripting, And Other Web 2 ...Do Try This At Home  Ajax Bookmarking, Cross Site Scripting, And Other Web 2 ...
Do Try This At Home Ajax Bookmarking, Cross Site Scripting, And Other Web 2 ...
 
PharoJS: Hijack the JavaScript Ecosystem
PharoJS: Hijack the JavaScript EcosystemPharoJS: Hijack the JavaScript Ecosystem
PharoJS: Hijack the JavaScript Ecosystem
 
Build a Game with JavaScript - Pasadena July
Build a Game with JavaScript - Pasadena JulyBuild a Game with JavaScript - Pasadena July
Build a Game with JavaScript - Pasadena July
 
Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...
Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...
Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...
 
5 Steps to Open Your Website to a Global Audience
5 Steps to Open Your Website to a Global Audience5 Steps to Open Your Website to a Global Audience
5 Steps to Open Your Website to a Global Audience
 
Going Global 101: How to Manage Your Websites Worldwide Using Drupal
Going Global 101: How to Manage Your Websites Worldwide Using DrupalGoing Global 101: How to Manage Your Websites Worldwide Using Drupal
Going Global 101: How to Manage Your Websites Worldwide Using Drupal
 
Performance is not an Option - gRPC and Cassandra
Performance is not an Option - gRPC and CassandraPerformance is not an Option - gRPC and Cassandra
Performance is not an Option - gRPC and Cassandra
 
Web Design
Web DesignWeb Design
Web Design
 

More from Grammarly

Vitalii Braslavskyi - Declarative engineering
Vitalii Braslavskyi - Declarative engineering Vitalii Braslavskyi - Declarative engineering
Vitalii Braslavskyi - Declarative engineering
Grammarly
 
Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...
Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...
Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...
Grammarly
 
Grammarly AI-NLP Club #8 - Arabic Natural Language Processing: Challenges and...
Grammarly AI-NLP Club #8 - Arabic Natural Language Processing: Challenges and...Grammarly AI-NLP Club #8 - Arabic Natural Language Processing: Challenges and...
Grammarly AI-NLP Club #8 - Arabic Natural Language Processing: Challenges and...
Grammarly
 
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
Grammarly
 
Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...
Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...
Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...
Grammarly
 
Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...
Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...
Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...
Grammarly
 
Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...
Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...
Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...
Grammarly
 
Grammarly Meetup: DevOps at Grammarly: Scaling 100x
Grammarly Meetup: DevOps at Grammarly: Scaling 100xGrammarly Meetup: DevOps at Grammarly: Scaling 100x
Grammarly Meetup: DevOps at Grammarly: Scaling 100x
Grammarly
 
Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...
Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...
Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...
Grammarly
 
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...
Grammarly
 
Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu...
Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu...Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu...
Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu...
Grammarly
 
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy GryshchukGrammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
Grammarly
 
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy GutsGrammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly
 
Natural Language Processing for biomedical text mining - Thierry Hamon
Natural Language Processing for biomedical text mining - Thierry HamonNatural Language Processing for biomedical text mining - Thierry Hamon
Natural Language Processing for biomedical text mining - Thierry Hamon
Grammarly
 

More from Grammarly (14)

Vitalii Braslavskyi - Declarative engineering
Vitalii Braslavskyi - Declarative engineering Vitalii Braslavskyi - Declarative engineering
Vitalii Braslavskyi - Declarative engineering
 
Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...
Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficient translation - Kenneth Heafield

  • 1. Dumpster diving for parallel corpora with efficient translation paracrawl.eu browser.mt Kenneth Heafield, University of Edinburgh neural.mt
  • 2. Problem ParaCrawl Browser Translation Conclusion 2
  • 3. Do not scratch the protected relic.
  • 5. Need more data! Photographing Beijing tourist signs doesn’t scale.
  • 6. Bureaucrats translate. Harvest their data!
  • 7. The chair broke. Le présidente a éclaté.
  • 8. Project: mine web for translations for free: paracrawl.eu
  • 10. Projects: mine web for translations for free: paracrawl.eu. Bergamot: Firefox translation extension, client-side, in progress: browser.mt.
  • 11. Projects: mine web for translations for free: paracrawl.eu. Bergamot: Firefox translation extension, client-side, in progress: browser.mt. Data.
  • 12. Projects: mine web for translations for free: paracrawl.eu. Bergamot: Firefox translation extension, client-side, in progress: browser.mt. Data. Fast translation.
  • 13. Projects. Part 1: mine web for translations for free: paracrawl.eu. Part 2: Bergamot: Firefox translation extension, client-side, in progress: browser.mt. Data. Fast translation.
  • 14. ParaCrawl: crawl the web for parallel corpora. All 26 EU + EEA official languages, +3 Spanish co-official languages. 4–1,178 million words per language. 510,482 websites. 1+ petabyte of compressed web pages.
  • 15. Parallel Corpus Size (words on English side, after filtering)
       Language      Words
       French        1,178,317,233
       German        929,818,868
       Spanish       897,891,704
       Italian       533,512,632
       Portuguese    299,634,135
       Dutch         233,087,345
       Russian       157,061,045
       Polish        145,802,939
       Swedish       138,264,978
       Czech         117,385,158
       Danish        106,565,546
       Hungarian     104,292,635
       Greek         88,669,279
       Finnish       66,385,933
       Romanian      62,189,306
       Bulgarian     55,725,444
       Slovak        45,636,383
       Croatian      43,464,197
       Slovenian     31,855,427
       Estonian      30,858,140
       Lithuanian    27,214,054
       Latvian       23,656,140
       Irish         21,909,039
       Maltese       4,252,814
  • 16. Improving Quality: ParaCrawl BLEU gain. Gains relative to WMT data without ParaCrawl.
       From        To          Release 1   Release 4
       English     Finnish     +0.0        +1.2
       Finnish     English     +2.5        +4.6
       English     Latvian     +0.7        +1.9
       Latvian     English     +0.9        +2.5
       English     Romanian    +0.6        +1.3
       Romanian    English     +2.4        +4.0
       English     Czech       -1.4        -0.1
       Czech       English     +0.6        +1.1
       English     German      -3.2        +1.2
       German      English     -1.0        +3.1
  • 17. Text Extraction CommonCrawl Targeted Crawls Language Detection Identify Multilingual Sites Target Document and Sentence Alignment Cleaning Evaluation
  • 18. Site Crawling: 95% of translations we find are not in CommonCrawl, because CommonCrawl is too shallow.
  • 19. Site Crawling: 95% of translations we find are not in CommonCrawl, because CommonCrawl is too shallow. → We directly crawl multilingual sites. → Use the Internet Archive.
  • 20. Learn what pages to crawl/links to follow? URL: domain, language code, etc. Link context: text, XPath. Bandit learning problem. Reward: pages in both languages are found. Ongoing work by Hieu Hoang.
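The bandit framing on slide 20 can be illustrated with a minimal epsilon-greedy sketch. This is not the project's crawler: the arm names (link types), the reward probabilities, and the `epsilon` value are all hypothetical, and a reward of 1 stands in for "following this link led to pages in both languages".

```python
import random

def epsilon_greedy(reward_prob, rounds=2000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy link selection.

    reward_prob maps a hypothetical link type (arm) to a made-up chance that
    following it finds a page in both languages (reward 1, otherwise 0).
    Returns (pull counts, mean observed reward) per arm.
    """
    rng = random.Random(seed)
    arms = list(reward_prob)
    pulls = {arm: 0 for arm in arms}
    mean = {arm: 0.0 for arm in arms}
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.choice(arms)            # explore a random link type
        else:
            arm = max(arms, key=mean.get)     # exploit the best estimate so far
        reward = 1.0 if rng.random() < reward_prob[arm] else 0.0
        pulls[arm] += 1
        mean[arm] += (reward - mean[arm]) / pulls[arm]   # incremental mean
    return pulls, mean

# Hypothetical arms: links whose URL carries a language code vs. any other link.
pulls, mean = epsilon_greedy({"url_has_lang_code": 0.6, "generic_link": 0.05})
```

With these made-up probabilities the crawler ends up spending most of its budget on the link type that pays off, while still sampling the other one occasionally.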
  • 21. Not Translated: wordpress.com. Blog hosting site ⇒ multilingual, but few translations. We blacklist large untranslated sites.
  • 22. Language classification: Say you’re looking for isiXhosa translations. English: Do you have pets? isiXhosa: Unazo izilwanaya zasekhaya?
  • 23. Language classification: Say you’re looking for isiXhosa translations. English: Do you have pets? isiXhosa: Unazo izilwanaya zasekhaya? isiXhosa occurs 0.000008x as often as English on the web. This is lower than the error rate in language classification. ⇒ Most of the “isiXhosa” was actually baseball statistics. ⇒ Sometimes we need to build language models to filter.
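A short worked calculation shows why a prior of 0.000008 swamps the classifier. Only the 0.000008 ratio comes from slide 23; the 99% recall and 1% false-positive rate are assumed values for illustration, and the web is simplified to just two languages.

```python
def precision(prior_ratio, recall, false_positive_rate):
    """Fraction of text labelled as the rare language that really is it.

    prior_ratio: how often the rare language occurs relative to the common one.
    """
    true_pos = prior_ratio * recall            # rare-language text, correctly labelled
    false_pos = 1.0 * false_positive_rate      # common-language text, mislabelled
    return true_pos / (true_pos + false_pos)

# 0.000008 is from the slide; recall and false-positive rate are assumptions.
p = precision(0.000008, recall=0.99, false_positive_rate=0.01)
print(f"{p:.4%} of the text labelled isiXhosa would really be isiXhosa")
```

Under these assumptions fewer than one in a thousand "isiXhosa" hits is genuine, which is consistent with the need for an extra language-model filter.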
  • 24. Matching: We have text. How do we find translations? Language codes in URLs [Resnik and Smith, 2003]. Translate to English, match [Uszkoreit et al, 2010]. Neural network vectors [Schwenk, 2018].
  • 26. Matching: Translate everything to English. ⇒ Need translation system (can use dictionary). ⇒ Need fast translation. Match pages by tf-idf in (translated) English. Then match sentences with n-gram overlap.
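The two matching steps on slide 26 can be sketched in a few lines. The slide names tf-idf and n-gram overlap only; the smoothed idf formula, cosine similarity, Jaccard overlap, and the toy pages below are assumptions made for this illustration, not the ParaCrawl implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    df = Counter(term for doc in docs for term in set(doc))
    n = len(docs)
    return [{t: c * math.log((1 + n) / (1 + df[t])) for t, c in Counter(doc).items()}
            for doc in docs]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def match_pages(english_docs, translated_docs):
    """For each English page, the index of the most similar translated page."""
    vecs = tfidf_vectors(english_docs + translated_docs)
    eng, tra = vecs[:len(english_docs)], vecs[len(english_docs):]
    return [max(range(len(tra)), key=lambda j: cosine(e, tra[j])) for e in eng]

def ngram_overlap(s, t, n=2):
    """Jaccard overlap of word n-grams between two token lists."""
    def grams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    a, b = grams(s), grams(t)
    return len(a & b) / len(a | b) if a | b else 0.0

# Toy data: `translated` pretends to be machine-translated foreign pages.
english = ["the bank opens at nine".split(),
           "solo travelers like the location".split()]
translated = ["solo travellers like this location".split(),
              "the bank opens at nine o'clock".split()]
matches = match_pages(english, translated)
overlap = ngram_overlap(english[0], translated[1])
```

Page matching runs first because it is cheap; the sentence-level overlap is then only computed within matched page pairs.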
  • 27. Boilerplate: santander.co.uk. “Santander UK plc. Registered Office: 2 Triton Square, Regent’s Place, London, NW1 3AN, United Kingdom. Registered Number 2294747. Registered in England and Wales. www.santander.co.uk. Telephone 0800 389 7000. Calls may be recorded or monitored. Authorised by the Prudential Regulation Authority and regulated by the Financial Conduct Authority and the Prudential Regulation Authority. Our Financial Services Register number is 106054. You can check this on the Financial Services Register by visiting the FCA’s website www.fca.org.uk/register. Santander and the flame logo are registered trademarks.” ⇒ Match pages on boilerplate. ⇒ Learn to translate boilerplate really well. We use boilerpipe, which tries to throw it out.
  • 28. Templates: booking.com. “Solo travelers in particular like the location – they rated it 9.5 for a one-person stay.” “Les voyageurs individuels apprécient particulièrement l’emplacement de cet établissement. Ils lui donnent la note de 9,5 pour un séjour en solo.” “Solo travelers in particular like the location – they rated it 8.9 for a one-person stay.” “Les voyageurs individuels apprécient particulièrement l’emplacement de cet établissement. Ils lui donnent la note de 8,9 pour un séjour en solo.” Corpus of repetitive sentences is less useful. ⇒ Diversity cleaning.
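Slide 28 does not say how diversity cleaning is done. One simple possibility, shown here purely as an assumption, is to mask numbers and keep a single sentence pair per resulting template; `template_key` and `dedupe_templates` are hypothetical names.

```python
import re

def template_key(sentence):
    """Lowercase and replace every number (e.g. 9.5 or 9,5) with a placeholder."""
    return re.sub(r"\d+(?:[.,]\d+)*", "<num>", sentence.lower())

def dedupe_templates(pairs):
    """Keep the first sentence pair seen for each (source, target) template."""
    seen, kept = set(), []
    for src, tgt in pairs:
        key = (template_key(src), template_key(tgt))
        if key not in seen:
            seen.add(key)
            kept.append((src, tgt))
    return kept

pairs = [
    ("Solo travelers in particular like the location – they rated it 9.5 for a one-person stay.",
     "Les voyageurs individuels apprécient particulièrement l’emplacement de cet établissement. "
     "Ils lui donnent la note de 9,5 pour un séjour en solo."),
    ("Solo travelers in particular like the location – they rated it 8.9 for a one-person stay.",
     "Les voyageurs individuels apprécient particulièrement l’emplacement de cet établissement. "
     "Ils lui donnent la note de 8,9 pour un séjour en solo."),
]
kept = dedupe_templates(pairs)
```

On the two booking.com pairs from the slide, only the first survives, since they differ in nothing but the rating.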
  • 29. Noise: Paid people to judge English–German sentences [Koehn et al, 2018]:
       Okay                             23%
       Misaligned sentences             41%
       Third language                    3%
       Both English                     10%
       Both German                      10%
       Untranslated sentences            4%
       Short segments (≤2 tokens)        1%
       Short segments (3–5 tokens)       5%
       Non-linguistic characters         2%
  • 30. Cleaning: Supervised classifier trained on 50k good, 50k bad sentences. Handwritten patterns. Character-based language model. Test set attempts to have consistent cut-off across languages.
  • 31. Shared Task on Corpus Filtering. Common techniques from 2018 Conference on MT: Aggressive language model filtering. Score from translation systems, both directions. Remove near-duplicates on source and target (not translated). Partially implemented.
  • 32. Copyright. Remember: 510,482 websites. Crawls follow robots.txt. Crawler leaves contact information. A few sites have asked to be removed and we have removed them. Under GDPR, people have the right to correct information. We hope they do!
  • 33. Company that sells corpora spreads copyright fear: The first word of copyright is copy.
  • 34. So I found them selling crawled corpora: They took it down.
  • 35. Summary: There’s training data for some languages. Search engines have been mining the web for years. Time for large open data.
  • 36. Bergamot: Browser-based Machine Translation browser.mt This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825303.
  • 37. Motivation Statoil (Norwegian state oil company) employment information and contracts leaked on Translate.com –Norsk Rikskringkasting, 2017
  • 38. Motivation Statoil (Norwegian state oil company) employment information and contracts leaked on Translate.com –Norsk Rikskringkasting, 2017 Don’t trade your privacy for Google Translate.
  • 39. Client-side neural machine translation as a Firefox extension: Local processing ⇒ private.
  • 40. Project Goals and Outline: Broad use as a Firefox extension + open platform. Fast on a desktop. Trustworthy. Support web forms. Domain adaptation.
  • 41. We’re Making a Public Product ⇒ User Experience Work Package
  • 45. Speed on Desktops: CPU version of Marian toolkit developed with Microsoft and Intel.
  • 46. Speed Contest [Figure: BLEU on newstest2014 (axis 18.0 to 28.0) against million translated source tokens per USD (axis 0 to 140), for 2018 and 2019 GPU and CPU systems. Labelled points: 2018 others GPU, 2018 others CPU, 2018 Marian GPU, 2018 Marian CPU, 2019 Marian CPU, 2019 Marian GPU.]
  • 47. Some of the Optimizations: Tune model size. 1. Teacher-student. 2. Greedy search. 3. Simplify model structure. 4. Integer arithmetic.
  • 48. Teacher-student. Option 1: Train a model directly. Option 2: Teacher-student (Kim and Rush, 2016). Teacher: Large high-quality translation model. Teacher translates source-language sentences. Student: model learns on output created by teacher. Even models with same size improve slightly.
       Model                       GPU      BLEU
       1xTeacher, beam size 8      109.7    28.1
       4xTeacher, beam size 8      410.8    29.0
       1xStudent, beam size 4       52.0    28.4
       1xStudent, beam size 1       19.9    28.2
  • 49. Greedy Search. Normally: keep competing translations and take the highest probability. Beam size is the number of competing translations. Computing probabilities is expensive because we need to normalize. Greedy can just pick the highest number without normalizing.
       Model                       GPU      BLEU
       1xStudent, beam size 4      52.0     28.4
       1xStudent, beam size 2      31.9     28.4
       1xStudent, beam size 1      19.9     28.2
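The last point on slide 49 follows from softmax being monotonic: the largest raw score is also the largest probability. A minimal sketch with made-up scores over a four-word vocabulary:

```python
import math

def softmax(logits):
    """Normalize raw scores into probabilities (the expensive step)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(scores):
    """Index of the highest score; works on raw scores, no normalization."""
    return max(range(len(scores)), key=scores.__getitem__)

logits = [1.2, -0.3, 4.1, 0.0]   # hypothetical output-layer scores
probs = softmax(logits)
same_choice = greedy_pick(logits) == greedy_pick(probs)
```

Beam search needs comparable probabilities across hypotheses, so it cannot skip the normalization; beam size 1 can.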
  • 50. Simplify model structure. A transformer model generates sentences from left to right. Each step consults all previous steps. → O(n²). Zhang et al (2018): just average previous steps. Update average on the fly → O(n). Further work: simplified simple recurrent unit.
       Model                   GPU     BLEU
       Baseline transformer    12.8    27.6
       Averaged transformer     7.2    27.6
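The complexity argument on slide 50 can be seen in a toy version that uses scalars in place of hidden-state vectors. This covers only the running-average idea; it is not the full averaged attention layer of Zhang et al, and both function names are made up for the sketch.

```python
def averages_naive(steps):
    """Recompute the mean of steps[0..i] from scratch at every position: O(n^2)."""
    return [sum(steps[: i + 1]) / (i + 1) for i in range(len(steps))]

def averages_running(steps):
    """Carry a running sum forward so each position costs O(1): O(n) overall."""
    out, total = [], 0.0
    for i, x in enumerate(steps, start=1):
        total += x
        out.append(total / i)
    return out

steps = [2.0, 4.0, 6.0, 8.0]
```

Both functions return the same averages; only the second avoids revisiting earlier steps, which is what makes decoding linear in sentence length.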
  • 51. Integer Arithmetic: Why Integers. Benchmarks: Memory bandwidth is limiting factor ⇒ Compress model. More at once: P40 does 47 TOPS int8, 12 TOPS float. Can do int8 with no quality loss [Quinn et al, 2018].
  • 52. Fast 8-bit matrix multiplication: _mm512_maddubs_epi16 aka vpmaddubsw. The only 512-bit wide multiply of 8-bit integers on Intel. Multiply signed by unsigned integers, then sum adjacent pairs into 16-bit. Why signed * unsigned?! New 8-bit VNNI instruction is also signed * unsigned.
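To make the behaviour of vpmaddubsw concrete, here is a plain-Python emulation of what slide 52 describes. It processes lists of any even length rather than a 512-bit register; the first argument is the unsigned operand and the second the signed one, as in the intrinsic, and the pairwise sums saturate to the signed 16-bit range.

```python
def maddubs(unsigned_bytes, signed_bytes):
    """Emulate vpmaddubsw on Python lists.

    unsigned_bytes: values in 0..255; signed_bytes: values in -128..127.
    Each adjacent pair of products is summed and saturated to signed 16-bit.
    """
    assert len(unsigned_bytes) == len(signed_bytes)
    assert len(unsigned_bytes) % 2 == 0
    out = []
    for i in range(0, len(unsigned_bytes), 2):
        total = (unsigned_bytes[i] * signed_bytes[i]
                 + unsigned_bytes[i + 1] * signed_bytes[i + 1])
        out.append(max(-32768, min(32767, total)))   # signed saturation
    return out

result = maddubs([1, 2, 3, 4], [10, -20, 30, -40])
```

The saturation matters in practice: two maximal products (255 * 127 each) already overflow 16 bits, so quantization ranges have to be chosen with that in mind.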
  • 53. Working Around signed * unsigned. Skew: Add 128 to one of the arguments. A ∗ B = A ∗ (128J + B) − A ∗ 128J, where 128J is a matrix full of 128. Efficient if A is constant. Normalize sign: Manually manipulate sign bits in the multiply. ⇒ Extra instructions in hot loop.
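The skew identity on slide 53 can be checked numerically. In this sketch both inputs are signed 8-bit; adding 128 to B makes it unsigned, and the correction term A ∗ 128J reduces to 128 times each row sum of A, which can be precomputed when A is constant. The function names are hypothetical.

```python
def matmul(a, b):
    """Plain integer matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def matmul_skewed(a, b):
    """Compute a @ b for signed int8 inputs via a signed * unsigned product."""
    shifted = [[x + 128 for x in row] for row in b]       # B + 128J, now 0..255
    assert all(0 <= x <= 255 for row in shifted for x in row)
    correction = [128 * sum(row) for row in a]            # every column of A * 128J
    product = matmul(a, shifted)                          # signed * unsigned
    return [[product[i][j] - correction[i] for j in range(len(b[0]))]
            for i in range(len(a))]

a = [[-128, 5, 127], [3, -7, 0]]
b = [[1, -128], [127, 0], [-5, 9]]
```

Because the correction depends only on A, it costs one subtraction per output element at run time instead of extra work inside the multiply loop.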
  • 54. 4 bits? Quantize log parameters (Miyashita et al, 2016). Try quantizing a trained model. 5 bits is annoying to fit in registers . . . so close to 4 bits!
       3-bit   4-bit   5-bit   6-bit   7-bit   8-bit
       0.72    28.92   35.08   35.60   35.69   35.67
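A rough sketch of quantizing in the log domain, in the spirit of slide 54: each weight is rounded to a signed power of two. The details here are assumptions for illustration (one sign bit, scaling by the largest magnitude, clamping small values and zeros to the smallest representable magnitude) and are not taken from the slides or from Miyashita et al.

```python
import math

def log_quantize(weights, bits=4):
    """Round each weight to sign * scale * 2**(-k), with k in 0 .. 2**(bits-1) - 1."""
    levels = 2 ** (bits - 1)                  # one bit is spent on the sign
    scale = max(abs(w) for w in weights)
    if scale == 0.0:
        return [0.0 for _ in weights]
    out = []
    for w in weights:
        if w == 0.0:
            k = levels - 1                    # assumption: zero maps to the smallest magnitude
        else:
            k = round(-math.log2(abs(w) / scale))
            k = min(max(k, 0), levels - 1)    # clamp to the representable exponents
        out.append(math.copysign(scale * 2.0 ** -k, w))
    return out

quantized = log_quantize([1.0, 0.5, -0.26, 0.001], bits=4)
```

Log spacing puts most of the levels near zero, which is where most trained weights sit; the clamp is where very small weights lose information.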
  • 55. Continued Training: First, train as normal with floats. Then quantize parameters after every update. Remember the rounding error so small changes can accumulate. -0.19 BLEU with 4-bit quantization. https://arxiv.org/abs/1909.06091 [Aji and Heafield, 2019]
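One way to read "remember the rounding error" on slide 55 is as an error-feedback loop: the part of each update lost to rounding is carried into the next step. The sketch below uses a single scalar weight and a uniform grid with step 0.25 as a stand-in for a real 4-bit format; both are assumptions, and the paper should be consulted for the actual method.

```python
def quantize(x, step=0.25):
    """Round to the nearest multiple of step (a stand-in for a real 4-bit grid)."""
    return round(x / step) * step

def train_with_error_feedback(weight, updates, step=0.25):
    residual = 0.0
    for u in updates:
        target = weight + residual + u     # value a float model would hold
        weight = quantize(target, step)
        residual = target - weight         # rounding error carried forward
    return weight

def train_without_feedback(weight, updates, step=0.25):
    for u in updates:
        weight = quantize(weight + u, step)   # small updates are rounded away
    return weight

updates = [0.0625] * 8                     # eight small updates, 0.5 in total
with_feedback = train_with_error_feedback(0.0, updates)
without_feedback = train_without_feedback(0.0, updates)
```

Without the residual, every update is smaller than half a grid step and the weight never moves; with it, the updates accumulate and the weight reaches 0.5.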
  • 56. Decapitating Transformers. Default Transformer Model. Encoder: 6 layers, self attention. Decoder: 6 layers, self attention, encoder attention. 8 heads/type/layer: 144 heads (6 × 8 in the encoder plus 6 × 2 × 8 in the decoder).
  • 57. 144 Heads. Voita et al 2019: prune 50% after training. Pruning before training doesn’t work.
  • 58. 144 Heads. Voita et al 2019: prune 50% after training. Pruning before training doesn’t work. PhD student Maxi Behnke: prune during training?
  • 59. Lottery ticket hypothesis: Some parameters are luckily initialized. Bigger models have more entries, even if most can be discarded (Frankle and Carbin, 2018). Remove entire unlucky heads?
  • 60. Head Pruning Results
       Heads pruned    0%         56%        72%        83%
       Size            672M       592M       568M       552M
       Reduction       n/a        11.90%     15.48%     17.86%
       Avg. time       107.58s    78.44s     70.50s     63.62s
       Speed-up        n/a        1.37×      1.53×      1.69×
       ∆ BLEU          n/a        -0.07      -0.20      -0.93
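The derived rows of the slide 60 table follow from the size and time rows. This short check recomputes them; the sizes are used as the bare numbers printed on the slide, since the unit behind "M" is not stated.

```python
baseline_size, baseline_time = 672, 107.58

# heads pruned (%) -> (size, average time in seconds), copied from the table
rows = {56: (592, 78.44), 72: (568, 70.50), 83: (552, 63.62)}

derived = {}
for pruned, (size, seconds) in rows.items():
    reduction = 100 * (1 - size / baseline_size)   # percent size reduction
    speedup = baseline_time / seconds              # relative to the unpruned model
    derived[pruned] = (round(reduction, 2), round(speedup, 2))
    print(f"{pruned}% heads pruned: size -{reduction:.2f}%, speed-up {speedup:.2f}x")
```

The size reduction is much smaller than the fraction of heads pruned, while the speed-up is comparatively large.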
  • 61. Quality Estimation: https://www.haaretz.com/israel-news/palestinian-arrested-over-mistranslated-good-morning-facebook-post-1.5459427 Show quality estimates to the user in the browser: user interface research, quality estimation research.
  • 62. Old Danish Ticket: Klippekort. No longer in use. Can apply for a refund . . . via a form. Public domain image from Wikipedia.
  • 63. Danish Ticket Refund Form: Expects answers in Danish.
  • 64. Danish Ticket Refund Form: Expects answers in Danish. So I traded mine for a beer with Dirk Hovy at EMNLP 2017.
  • 65. What if you don’t have Dirk Hovy? Answer a Danish web form in Danish: Be confident my answers are correct . . . even though I don’t speak Danish. ⇒ Browser will prompt to rephrase when uncertain.
  • 66. What if you don’t have Dirk Hovy? Answer a Danish web form in Danish: Be confident my answers are correct . . . even though I don’t speak Danish. ⇒ Browser will prompt to rephrase when uncertain . . . and use all rephrasings to translate better.
  • 67. We’re in the Browser: The browser knows your history (if you let it). It knows what site you are on. Adapt translations to the user and page.
  • 68. We’re in the Browser: The browser knows your history (if you let it). It knows what site you are on. Adapt translations to the user and page. Much less creepy when all processing is local.
  • 69. Bergamot Summary: Privacy-preserving translation via local processing. Coming as a Firefox extension. Anybody want to help with Ukrainian?
  • 70. Questions? Hiring. PhD: https://edinburghnlp.inf.ed.ac.uk/cdt/ Job: contact kheafiel@inf.ed.ac.uk Mozilla, to work on translation: https://careers.mozilla.org/position/gh/1666741/