Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficient translation - Kenneth Heafield

Speaker: Kenneth Heafield, Lecturer at the University of Edinburgh

Summary: The ParaCrawl project is mining a petabyte of the web for translations to release freely at https://paracrawl.eu/releases.html. But the web is a messy place, with a lot of data to sift through. To find translations, we translate everything into English or at least use a neural encoder. A related project makes machine translation inference more efficient by using optimizations ranging from assembly instructions to removal of bits of model architecture.

  1. 1. Dumpster diving for parallel corpora with efficient translation paracrawl.eu browser.mt Kenneth Heafield, University of Edinburgh neural.mt
  2. 2. Problem ParaCrawl Browser Translation Conclusion 2
  3. 3. Do not scratch the protected relic.
  5. 5. Need more data! Photographing Beijing tourist signs doesn’t scale.
  6. 6. Bureaucrats translate. Harvest their data!
  7. 7. The chair broke. Le présidente a éclaté.
  13. 13. Projects. Part 1: mine the web for translations, for free: paracrawl.eu. Part 2: Bergamot, a client-side Firefox translation extension, in progress: browser.mt. (Diagram labels: data, fast translation.)
  14. 14. ParaCrawl: crawl the web for parallel corpora. All 26 EU + EEA official languages, plus 3 Spanish co-official languages. 4–1,178 million words per language. 510,482 websites. 1+ petabyte of compressed web pages.
  15. 15. Parallel Corpus Size (words on English side, after filtering)
      Language      Words
      French        1,178,317,233
      German          929,818,868
      Spanish         897,891,704
      Italian         533,512,632
      Portuguese      299,634,135
      Dutch           233,087,345
      Russian         157,061,045
      Polish          145,802,939
      Swedish         138,264,978
      Czech           117,385,158
      Danish          106,565,546
      Hungarian       104,292,635
      Greek            88,669,279
      Finnish          66,385,933
      Romanian         62,189,306
      Bulgarian        55,725,444
      Slovak           45,636,383
      Croatian         43,464,197
      Slovenian        31,855,427
      Estonian         30,858,140
      Lithuanian       27,214,054
      Latvian          23,656,140
      Irish            21,909,039
      Maltese           4,252,814
  16. 16. Improving Quality: ParaCrawl BLEU gain, relative to WMT data without ParaCrawl
      From       To         Release 1   Release 4
      English    Finnish    +0.0        +1.2
      Finnish    English    +2.5        +4.6
      English    Latvian    +0.7        +1.9
      Latvian    English    +0.9        +2.5
      English    Romanian   +0.6        +1.3
      Romanian   English    +2.4        +4.0
      English    Czech      -1.4        -0.1
      Czech      English    +0.6        +1.1
      English    German     -3.2        +1.2
      German     English    -1.0        +3.1
  17. 17. Pipeline diagram, stages as listed on the slide: Text Extraction; CommonCrawl; Targeted Crawls; Language Detection; Identify Multilingual Sites; Target; Document and Sentence Alignment; Cleaning; Evaluation.
  19. 19. Site Crawling. 95% of translations we find are not in CommonCrawl, because CommonCrawl is too shallow. → We directly crawl multilingual sites. → Use the Internet Archive.
  20. 20. Learn what pages to crawl/links to follow? URL: domain, language code, etc. Link context: text, XPath. Bandit learning problem. Reward: pages in both languages are found. Ongoing work by Hieu Hoang.
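The bandit framing on this slide can be illustrated with a small epsilon-greedy simulation. This is only a sketch under stated assumptions, not the project's crawler: the link types, their success rates, and the function names are all hypothetical, and the reward is simulated rather than measured on real pages.

```python
import random

def epsilon_greedy_crawl(arms, follow_link, steps=2000, epsilon=0.1, seed=0):
    """Estimate, per link type, how often following it finds pages in both languages."""
    rng = random.Random(seed)
    pulls = {arm: 0 for arm in arms}
    value = {arm: 0.0 for arm in arms}   # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.choice(arms)                     # explore a random link type
        else:
            arm = max(arms, key=lambda a: value[a])    # exploit the best estimate
        reward = follow_link(arm, rng)                 # 1 if translations were found
        pulls[arm] += 1
        value[arm] += (reward - value[arm]) / pulls[arm]
    return value

# Hypothetical link types with made-up success rates, for simulation only.
TRUE_RATE = {"lang-code-in-url": 0.6, "language-switcher-link": 0.1, "other": 0.02}

def simulated_follow(arm, rng):
    return 1 if rng.random() < TRUE_RATE[arm] else 0

estimates = epsilon_greedy_crawl(list(TRUE_RATE), simulated_follow)
best = max(estimates, key=estimates.get)
```

In a real crawler the arms would be derived from the URL and link-context features named on the slide, which suggests a contextual bandit rather than the fixed-arm version sketched here.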
  21. 21. Not Translated: wordpress.com. Blog hosting site ⇒ multilingual, but few translations. We blacklist large untranslated sites.
  23. 23. Language classification. Say you’re looking for isiXhosa translations. English: Do you have pets? isiXhosa: Unazo izilwanaya zasekhaya? isiXhosa occurs 0.000008x as often as English on the web. This is lower than the error rate in language classification. ⇒ Most of the “isiXhosa” was actually baseball statistics. ⇒ Sometimes we need to build language models to filter.
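A short worked calculation shows why a rare language drowns in classifier errors. The 0.000008 ratio comes from the slide; the 0.1% false-positive rate and perfect recall are assumed figures chosen only to illustrate the effect.

```python
def precision(target_to_english_ratio, false_positive_rate, recall=1.0):
    """Fraction of text labelled as the rare language that really is that language."""
    true_hits = target_to_english_ratio * recall   # per unit of English text
    false_hits = false_positive_rate               # English mislabelled as the target
    return true_hits / (true_hits + false_hits)

# 0.000008 is from the slide; the 0.1% false-positive rate is an assumption.
p = precision(0.000008, 0.001)
```

Under these assumptions fewer than one in a hundred segments labelled isiXhosa would actually be isiXhosa, which is consistent with the slide's remark about baseball statistics.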
  25. 25. Matching. We have text. How do we find translations? Language codes in URLs [Resnik and Smith, 2003]. Translate to English, match [Uszkoreit et al, 2010]. Neural network vectors [Schwenk, 2018].
  26. 26. Matching. Translate everything to English. ⇒ Need translation system (can use dictionary). ⇒ Need fast translation. Match pages by tf-idf in (translated) English. Then match sentences with n-gram overlap.
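The two matching steps on this slide can be sketched in plain Python. This is a minimal illustration, not ParaCrawl's implementation: the smoothed idf formula, whitespace tokenization, Jaccard overlap, and the example pages are all assumptions.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one sparse tf-idf vector (dict) per whitespace-tokenized document."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(tok for toks in tokenized for tok in set(toks))
    n = len(docs)
    return [{tok: count * (math.log((1 + n) / (1 + df[tok])) + 1.0)
             for tok, count in Counter(toks).items()} for toks in tokenized]

def cosine(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def match_pages(english_pages, translated_pages):
    """For each English page, index of the most similar machine-translated page."""
    vecs = tfidf(english_pages + translated_pages)
    eng, tra = vecs[:len(english_pages)], vecs[len(english_pages):]
    return [max(range(len(tra)), key=lambda j: cosine(e, tra[j])) for e in eng]

def ngram_overlap(a, b, n=2):
    """Jaccard overlap of word n-grams, a simple stand-in for sentence matching."""
    def grams(s):
        words = s.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

# Made-up pages: "translated" stands for foreign pages after translation to English.
english = ["opening hours of the city museum", "train tickets and refund form"]
translated = ["refund form for train tickets", "the city museum opening hours today"]
matches = match_pages(english, translated)
```

Here `matches[i]` is the index of the translated page paired with `english[i]`; sentence pairs inside matched pages could then be scored with `ngram_overlap`.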
  27. 27. Boilerplate: santander.co.uk. “Santander UK plc. Registered Office: 2 Triton Square, Regent’s Place, London, NW1 3AN, United Kingdom. Registered Number 2294747. Registered in England and Wales. www.santander.co.uk. Telephone 0800 389 7000. Calls may be recorded or monitored. Authorised by the Prudential Regulation Authority and regulated by the Financial Conduct Authority and the Prudential Regulation Authority. Our Financial Services Register number is 106054. You can check this on the Financial Services Register by visiting the FCA’s website www.fca.org.uk/register. Santander and the flame logo are registered trademarks.” ⇒ Match pages on boilerplate. ⇒ Learn to translate boilerplate really well. We use boilerpipe, which tries to throw it out.
  28. 28. Templates: booking.com. “Solo travelers in particular like the location – they rated it 9.5 for a one-person stay.” “Les voyageurs individuels apprécient particulièrement l’emplacement de cet établissement. Ils lui donnent la note de 9,5 pour un séjour en solo.” “Solo travelers in particular like the location – they rated it 8.9 for a one-person stay.” “Les voyageurs individuels apprécient particulièrement l’emplacement de cet établissement. Ils lui donnent la note de 8,9 pour un séjour en solo.” Corpus of repetitive sentences is less useful. ⇒ Diversity cleaning.
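One simple way to approach diversity cleaning for templated sentences like these is to mask numbers and keep a single pair per resulting template. The slide does not say how the project does it, so the masking rule and function names below are assumptions.

```python
import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

def template_key(source, target):
    """Mask numbers so sentence pairs that differ only in digits share a key."""
    return (NUMBER.sub("<num>", source.lower()), NUMBER.sub("<num>", target.lower()))

def keep_one_per_template(pairs):
    seen, kept = set(), []
    for source, target in pairs:
        key = template_key(source, target)
        if key not in seen:
            seen.add(key)
            kept.append((source, target))
    return kept

pairs = [
    ("They rated it 9.5 for a one-person stay.",
     "Ils lui donnent la note de 9,5 pour un séjour en solo."),
    ("They rated it 8.9 for a one-person stay.",
     "Ils lui donnent la note de 8,9 pour un séjour en solo."),
    ("Do you have pets?", "Unazo izilwanaya zasekhaya?"),
]
kept = keep_one_per_template(pairs)
```

The two booking.com-style pairs collapse to one, while the unrelated pair survives.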
  29. 29. Noise. Paid people to judge English–German sentences [Koehn et al, 2018]:
      Okay                           23%
      Misaligned sentences           41%
      Third language                  3%
      Both English                   10%
      Both German                    10%
      Untranslated sentences          4%
      Short segments (≤2 tokens)      1%
      Short segments (3–5 tokens)     5%
      Non-linguistic characters       2%
  30. 30. Cleaning. Supervised classifier trained on 50k good, 50k bad sentences. Handwritten patterns. Character-based language model. Test set attempts to have consistent cut-off across languages.
  31. 31. Shared Task on Corpus Filtering. Common techniques from 2018 Conference on MT: Aggressive language model filtering. Score from translation systems, both directions. Remove near-duplicates on source and target (not translated). Partially implemented.
  32. 32. Copyright. Remember: 510,482 websites. Crawls follow robots.txt. Crawler leaves contact information. A few sites have asked to be removed and we have removed them. Under GDPR, people have the right to correct information. We hope they do!
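Honouring robots.txt can be sketched with Python's standard `urllib.robotparser`. The robots.txt content, user-agent string, and URLs below are made up for illustration and say nothing about how the ParaCrawl crawler itself is built.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt and user agent, for illustration only.
ROBOTS = """User-agent: *
Disallow: /private/
"""
AGENT = "example-crawler"
```

A crawler would call `allowed` before each request and skip any URL for which it returns False.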
  33. 33. Company that sells corpora spreads copyright fear: The first word of copyright is copy.
  34. 34. So I found them selling crawled corpora: They took it down.
  35. 35. Summary. There’s training data for some languages. Search engines have been mining the web for years. Time for large open data.
  36. 36. Bergamot: Browser-based Machine Translation browser.mt This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825303.
  38. 38. Motivation Statoil (Norwegian state oil company) employment information and contracts leaked on Translate.com –Norsk Rikskringkasting, 2017 Don’t trade your privacy for Google Translate.
  39. 39. Client-side neural machine translation as a Firefox extension: Local processing ⇒ private.
  40. 40. Project Goals and Outline. Broad use as a Firefox extension + open platform. Fast on a desktop. Trustworthy. Support web forms. Domain adaptation.
  41. 41. We’re Making a Public Product ⇒ User Experience Work Package.
  45. 45. Speed on Desktops. CPU version of Marian toolkit developed with Microsoft and Intel.
  46. 46. Speed Contest. [Figure: BLEU on newstest2014 (y-axis) against million translated source tokens per USD (x-axis), comparing 2018 and 2019 GPU and CPU systems from Marian and others.]
  47. 47. Some of the Optimizations. Tune model size; 1. Teacher-student; 2. Greedy search; 3. Simplify model structure; 4. Integer arithmetic.
  48. 48. Teacher-student. Option 1: Train a model directly. Option 2: Teacher-student (Kim and Rush, 2016). Teacher: large high-quality translation model. Teacher translates source-language sentences. Student: model learns on output created by teacher.
      Model                     GPU     BLEU
      1xTeacher, beam size 8    109.7   28.1
      4xTeacher, beam size 8    410.8   29.0
      1xStudent, beam size 4     52.0   28.4
      1xStudent, beam size 1     19.9   28.2
      Even models with same size improve slightly.
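The data flow of teacher-student training can be sketched in a few lines: the student's targets are the teacher's translations, not human references. The lookup-table "teacher" below is a toy stand-in for a large model and is purely hypothetical.

```python
def distill_corpus(source_sentences, teacher_translate):
    """Pair each source sentence with the teacher's translation of it."""
    return [(src, teacher_translate(src)) for src in source_sentences]

# Toy stand-in for a large teacher model: a lookup table, for illustration only.
TOY_TEACHER = {"Unazo izilwanaya zasekhaya?": "Do you have pets?"}

student_training_data = distill_corpus(list(TOY_TEACHER), TOY_TEACHER.get)
```

The student model would then be trained on `student_training_data` exactly as if it were an ordinary parallel corpus.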
  49. 49. Greedy Search. Normally: keep competing translations and take the highest probability. Beam size is the number of competing translations.
      Model                     GPU     BLEU
      1xStudent, beam size 4    52.0    28.4
      1xStudent, beam size 2    31.9    28.4
      1xStudent, beam size 1    19.9    28.2
      Computing probabilities is expensive because we need to normalize. Greedy can just pick the highest number without normalizing.
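The point about normalization can be checked directly: softmax is monotonic, so the highest raw score is also the highest probability. The scores below are made up.

```python
import math

def softmax(logits):
    peak = max(logits)                              # subtract max for stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

logits = [2.0, -1.0, 3.5, 0.25]   # made-up scores over a 4-word vocabulary
greedy_choice = argmax(logits)                 # no normalization needed
normalized_choice = argmax(softmax(logits))    # same winner, more work
```

Beam search compares scores across hypotheses, so it needs the normalized probabilities; greedy search only needs the winner at each step.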
  50. 50. Simplify model structure. A transformer model generates sentences from left to right. Each step consults all previous steps. → O(n^2). Zhang et al (2018): just average previous steps. Update average on the fly → O(n).
      Model                   GPU     BLEU
      Baseline transformer    12.8    27.6
      Averaged transformer     7.2    27.6
      Further work: simplified simple recurrent unit.
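The "update average on the fly" step is a running mean. The sketch below uses scalars in place of decoder state vectors and compares the incremental update against recomputing the mean at every position; it illustrates only the averaging arithmetic, not the full model of Zhang et al.

```python
def naive_averages(steps):
    """Recompute the mean of all steps so far at every position: O(n^2) total."""
    return [sum(steps[:t + 1]) / (t + 1) for t in range(len(steps))]

def running_averages(steps):
    """Update the mean incrementally: O(n) total."""
    averages, mean = [], 0.0
    for t, x in enumerate(steps, start=1):
        mean += (x - mean) / t
        averages.append(mean)
    return averages

steps = [0.5, -1.0, 2.0, 4.0]   # scalar stand-ins for decoder states
```

Both functions return the same sequence; only the second avoids revisiting earlier steps.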
  51. 51. Integer Arithmetic. Why integers? Benchmarks: memory bandwidth is the limiting factor ⇒ compress model. More at once: P40 does 47 TOPS int8, 12 TOPS float. Can do int8 with no quality loss [Quinn et al, 2018].
  52. 52. Fast 8-bit matrix multiplication. _mm512_maddubs_epi16, aka vpmaddubsw. The only 512-bit wide multiply of 8-bit integers on Intel. Multiply signed by unsigned integers, then sum adjacent pairs into 16-bit. Why signed * unsigned?! New 8-bit VNNI instruction is also signed * unsigned.
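The arithmetic of this instruction can be emulated in Python to make the signed * unsigned behaviour concrete. This models only the per-lane math (unsigned byte times signed byte, adjacent pairs summed with signed 16-bit saturation), not the 512-bit register layout; the function name is a local choice, not an API.

```python
def maddubs_epi16(unsigned_bytes, signed_bytes):
    """Emulate the per-lane arithmetic: u8 * s8 products, adjacent pairs summed
    into a signed 16-bit value with saturation."""
    assert len(unsigned_bytes) == len(signed_bytes) and len(unsigned_bytes) % 2 == 0
    out = []
    for i in range(0, len(unsigned_bytes), 2):
        total = (unsigned_bytes[i] * signed_bytes[i]
                 + unsigned_bytes[i + 1] * signed_bytes[i + 1])
        out.append(max(-32768, min(32767, total)))   # saturate to int16
    return out

lanes = maddubs_epi16([1, 2, 255, 255], [3, -4, 127, 127])
```

The second pair shows why saturation matters: 255 * 127 + 255 * 127 exceeds the 16-bit range and is clipped.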
  53. 53. Working Around signed * unsigned. Skew: add 128 to one of the arguments. A ∗ B = A ∗ (128J + B) − A ∗ 128J, where 128J is a matrix full of 128. Efficient if A is constant. Normalize sign: manually manipulate sign bits in the multiply. ⇒ Extra instructions in hot loop.
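The skew identity can be verified on small integer matrices. The sketch uses plain Python lists and made-up values; in a real kernel the correction term A ∗ 128J would be precomputed once when A is constant.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def skewed_matmul(a, b):
    """Compute A*B via A*(128J + B) - A*128J so the second factor is unsigned."""
    shifted = [[v + 128 for v in row] for row in b]    # values now in 0..255
    correction = [128 * sum(row) for row in a]         # every entry in row i of A*128J
    product = matmul(a, shifted)
    return [[v - correction[i] for v in row] for i, row in enumerate(product)]

A = [[-128, 5, 127], [3, -7, 0]]          # signed 8-bit values
B = [[-128, 127], [0, -1], [64, -64]]     # signed 8-bit values
```

Because every entry of 128J is 128, each entry in row i of A ∗ 128J equals 128 times the sum of row i of A, which is why a single number per row suffices.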
  54. 54. 4 bits? Quantize log parameters (Miyashita et al, 2016). Try quantizing a trained model.
      3-bit   4-bit   5-bit   6-bit   7-bit   8-bit
      0.72    28.92   35.08   35.60   35.69   35.67
      5 bits is annoying to fit in registers . . . so close to 4 bits!
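Quantizing in the log domain means storing a sign and a power-of-two exponent. The sketch below is a generic version of that idea, not the exact scheme of Miyashita et al. or of the talk: the exponent range, the rounding rule, and the handling of zero are all assumptions.

```python
import math

def log_quantize(w, bits=4, max_exp=0):
    """Round |w| to a power of two: one sign bit plus (bits - 1) exponent bits."""
    levels = 2 ** (bits - 1)              # number of representable exponents
    min_exp = max_exp - levels + 1
    if w == 0:
        exp = min_exp                     # assumption: zero maps to the smallest magnitude
    else:
        exp = max(min_exp, min(max_exp, round(math.log2(abs(w)))))
    return math.copysign(2.0 ** exp, w)

quantized = [log_quantize(w) for w in (0.3, -0.7, 5.0, 1e-9)]
```

With 4 bits and `max_exp=0` the representable magnitudes are 1, 1/2, ..., 1/128, so large weights are clipped to 1 and tiny ones to 1/128.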
  55. 55. Continued Training. First, train as normal with floats. Then quantize parameters after every update. Remember the rounding error so small changes can accumulate. -0.19 BLEU with 4-bit quantization. https://arxiv.org/abs/1909.06091 [Aji and Heafield, 2019]
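Why remembering the rounding error matters can be seen on a single weight. The sketch uses a toy uniform quantizer with step 0.25 in place of the 4-bit scheme, and made-up updates; it illustrates the error-carrying idea only, not the training procedure in the cited paper.

```python
def quantize(value, step=0.25):
    """Toy uniform quantizer standing in for the real low-bit quantizer."""
    return round(value / step) * step

def train(updates, remember_error):
    weight, residual = 0.0, 0.0
    for update in updates:
        target = weight + residual + update      # what the float weight would be
        weight = quantize(target)
        residual = target - weight if remember_error else 0.0
    return weight

updates = [0.01] * 30   # many small steps in the same direction
with_memory = train(updates, remember_error=True)
without_memory = train(updates, remember_error=False)
```

Without the residual, every small update is rounded away and the weight never moves; with it, the updates accumulate until they cross a quantization boundary.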
  56. 56. Decapitating Transformers. Default Transformer model. Encoder: 6 layers, self attention. Decoder: 6 layers, self attention, encoder attention. 8 heads/type/layer: 144 heads.
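The head count follows from the layer and attention-type counts on the slide, as this short calculation shows.

```python
HEADS_PER_ATTENTION = 8

encoder_heads = 6 * 1 * HEADS_PER_ATTENTION   # 6 layers, self attention only
decoder_heads = 6 * 2 * HEADS_PER_ATTENTION   # 6 layers, self + encoder attention
total_heads = encoder_heads + decoder_heads
```

That is 48 encoder heads plus 96 decoder heads.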
  58. 58. 144 Heads. Voita et al 2019: prune 50% after training. Pruning before training doesn’t work. PhD student Maxi Behnke: prune during training?
  59. 59. Lottery ticket hypothesis. Some parameters are luckily initialized. Bigger models have more entries, even if most can be discarded (Frankle and Carbin, 2018). Remove entire unlucky heads?
  60. 60. Head Pruning Results
      Heads pruned   0%        56%       72%       83%
      Size           672M      592M      568M      552M
      Reduction      -         11.90%    15.48%    17.86%
      Avg. time      107.58s   78.44s    70.50s    63.62s
      Speed-up       -         1.37×     1.53×     1.69×
      ∆ BLEU         -         -0.07     -0.20     -0.93
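The Reduction and Speed-up rows can be recomputed from the Size and Avg. time rows, which is a quick consistency check on the table. The variable names are local choices.

```python
sizes = [672, 592, 568, 552]                 # "Size" row, in the slide's M units
times_s = [107.58, 78.44, 70.50, 63.62]      # "Avg. time" row, in seconds

reductions = [round(100 * (sizes[0] - s) / sizes[0], 2) for s in sizes[1:]]
speedups = [round(times_s[0] / t, 2) for t in times_s[1:]]
```

Both derived rows match the slide, so the speed-up is measured against the unpruned model's 107.58 s.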
  61. 61. Quality Estimation. https://www.haaretz.com/israel-news/palestinian-arrested-over-mistranslated-good-morning-facebook-post-1.5459427 Show quality estimates to the user in the browser: User interface research. Quality estimation research.
  62. 62. Old Danish Ticket: Klippekort. No longer in use. Can apply for a refund . . . via a form. Public domain image from Wikipedia.
  64. 64. Danish Ticket Refund Form. Expects answers in Danish. So I traded mine for a beer with Dirk Hovy at EMNLP 2017.
  66. 66. What if you don’t have Dirk Hovy? Answer a Danish web form in Danish: Be confident my answers are correct . . . even though I don’t speak Danish. ⇒ Browser will prompt to rephrase when uncertain . . . and use all rephrasings to translate better.
  68. 68. We’re in the Browser. The browser knows your history (if you let it). It knows what site you are on. Adapt translations to the user and page. Much less creepy when all processing is local.
  69. 69. Bergamot Summary. Privacy-preserving translation via local processing. Coming as a Firefox extension. Anybody want to help with Ukrainian?
  70. 70. Questions? Hiring. PhD: https://edinburghnlp.inf.ed.ac.uk/cdt/ Job: contact kheafiel@inf.ed.ac.uk Mozilla, to work on translation: https://careers.mozilla.org/position/gh/1666741/
