Artificial Intelligence TM and Terminology Onboarding

We're building a tool that automatically scrapes a multilingual website and generates a translation memory from it. From there it extracts the terminology.

Published in: Technology

  1. AI TM and Terminology Onboarding. Rémy Blättler, Chief of the System @ Supertext
  2. Pipeline: Webspider (with PhantomJS) => structured text in multiple languages; NMT-based alignment => translation memory; term extraction => terminology database
  3. Webspider: PhantomJS (a headless WebKit browser) can spider JavaScript-based pages, which makes it a good fit for modern dynamic sites. Result: structured (HTML) multilingual data.
  4. Problems: speed; multitasking / scaling
  5. NMT alignment: 1. Translate, 2. Align. Example: "How can I create my own coupon template?" / "Wie kann ich eine eigene Gutscheinvorlage erstellen?" / "Wie kann ich eine eigene Coupon-Vorlage erstellen?"
  6. Problems: false positives; choosing the best matching algorithm (Gale-Church, Levenshtein, BLEU score)
  7. GENSIM word2vec: breakfast, cereal, dinner, lunch => cereal (the word that does not match); man => boy, woman => girl; most similar to Sweden: Norway 0.76, Finland 0.71, Estonia 0.54
  8. GENSIM Phrases: New York City; Terms and Conditions; Never Follow (Audi); Just do it (Nike)
  9. Term extraction: 1. Average occurrence of a term over all corpora; 2. Average occurrence of a term for this scan; 3. Same for the other language
  10. Term extraction 2: detect the specific phrase in the source and target: "Geänderte Segmentberichterstattung erhöht Aussagekraft." / "Improvements gained from changes in segment reporting."
  11. Extracted term pairs (German => English): Lindschulte-Gruppe => Lindschulte Group; Anleihen => bonds; BKW Energie AG => BKW Energie AG; Revisionsstelle => external auditors; Vizepräsident des Verwaltungsrats => Deputy Chair; staatlichen Fonds => state funds; Die beizulegenden Zeitwerte => The fair values; Stilllegungs- und Entsorgungsfonds => disposal funds; Gebäudetechnik => building technologies; Wertänderung => Value adjustment; Bewertungsverfahren => Level valuations
  12. Problems: speed (tests take multiple hours); insufficient data (more than 50k TM units helps); bad source data (HTML, JavaScript, etc.)
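
The translate-then-align idea from the NMT alignment slide can be sketched roughly as follows. This is a minimal illustration, not the tool described in the deck: the `translate` callable is a hypothetical stand-in for an NMT system (stubbed here with a lookup table), the assignment of the two German sentences to "machine translation" and "text found on the site" is an assumption, and the matching uses only normalized Levenshtein similarity, one of the three candidate metrics the slides mention. The `threshold` value is likewise an arbitrary assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def align(source_segments, target_segments, translate, threshold=0.6):
    """Greedy alignment: machine-translate each source segment, then pick
    the most similar unused target segment, if it clears the threshold."""
    pairs, used = [], set()
    for src in source_segments:
        mt = translate(src).lower()
        best_score, best_idx = -1.0, None
        for idx, tgt in enumerate(target_segments):
            if idx in used:
                continue
            score = similarity(mt, tgt.lower())
            if score > best_score:
                best_score, best_idx = score, idx
        if best_idx is not None and best_score >= threshold:
            used.add(best_idx)
            pairs.append((src, target_segments[best_idx], best_score))
    return pairs


# Hypothetical NMT output, hard-coded for the example.
FAKE_MT = {
    "How can I create my own coupon template?":
        "Wie kann ich eine eigene Coupon-Vorlage erstellen?",
    "Contact us": "Kontaktieren Sie uns",
}

english = ["How can I create my own coupon template?", "Contact us"]
german = ["Impressum", "Wie kann ich eine eigene Gutscheinvorlage erstellen?"]

pairs = align(english, german, translate=FAKE_MT.__getitem__)
for src, tgt, score in pairs:
    print(f"{score:.2f}  {src}  <->  {tgt}")
```

The threshold is what guards against the false positives listed among the problems: "Contact us" has no counterpart in the German list, so it is left unaligned rather than forced onto "Impressum".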

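One possible reading of the term extraction slide is a frequency comparison: a word is a term candidate when it occurs noticeably more often in the current scan than it does on average across the other corpora. The sketch below follows that reading for single words in one language; the function names, the `min_ratio` cutoff, and the smoothing constant are assumptions made for illustration, and the deck does not say how the two averages are actually combined. Per the slide, the same pass would then be run on the other language.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase word tokenizer; good enough for a sketch."""
    return re.findall(r"\w+", text.lower())


def relative_freq(tokens) -> dict:
    """Share of all tokens that each distinct token accounts for."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()} if total else {}


def candidate_terms(scan_text, background_texts, min_ratio=2.0, smoothing=1e-6):
    """Return words whose relative frequency in this scan is at least
    min_ratio times their average relative frequency over the background
    corpora, most distinctive first."""
    scan = relative_freq(tokenize(scan_text))

    # Average occurrence over all corpora: mean of per-corpus frequencies,
    # counting a corpus that lacks the word as zero.
    n_corpora = max(len(background_texts), 1)
    background = Counter()
    for text in background_texts:
        for tok, freq in relative_freq(tokenize(text)).items():
            background[tok] += freq / n_corpora

    scores = {tok: freq / (background[tok] + smoothing)
              for tok, freq in scan.items()}
    keep = [tok for tok, score in scores.items() if score >= min_ratio]
    return sorted(keep, key=lambda tok: -scores[tok])


scan = ("The fair values of the disposal funds changed. "
        "The disposal funds are state funds.")
other_corpora = [
    "The weather is nice and the food is good.",
    "The funds of the club are low and the team is sad.",
]

terms = candidate_terms(scan, other_corpora)
print(terms)
```

Common words such as "the" appear at similar rates everywhere and drop out, while scan-specific words such as "disposal" rise to the top. With so little background text, ordinary words like "changed" also pass the cutoff, which matches the point about insufficient data on the final problems slide.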