Multilingual Term Extraction as a Service from Acrolinx, CHAT2013

Presenters: Ben Gottesman and Michael Klemme (Acrolinx)

This presentation is part of the TaaS project, funded by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 296312.

Transcript

  • 1. Multilingual Term Extraction as a Service from Acrolinx. Ben Gottesman, Michael Klemme (Acrolinx). CHAT2013
  • 2. Definitions
    term extraction: automatically identifying potential terms in a document (corpus)
    multilingual term extraction: automatically identifying potential terms and their translations in a document and its translation (parallel corpus / translation memory)
    Example: The wizard begins creating the bootable image. / Der Assistent beginnt mit der Erstellung des bootfähigen Image.
    (… or, if the source-language terminology already exists, just identify translations)
  • 3. Synonyms
    Identify same-language synonyms via translations in common
    German: Die Spannungsversorgung für die Elektronik wird vom Speisegerät G526 sichergestellt.
    English: The voltage supply for the electronics is maintained by the power supply unit G526.
    German: Spannungsversorgung für interne Speisung (X3e)
    English: Power supply for internal supply (X3e)
    German: Unterspannung in der Stromversorgung
    English: Undervoltage in the power supply
    German terms: Spannungsversorgung, Stromversorgung
    English terms: voltage supply, power supply
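The idea on this slide, linking two source-language terms when they share a translation, can be sketched in a few lines of Python. This is only an illustration and not Acrolinx's implementation: the function name `synonym_candidates` is hypothetical, and the term pairs are read off the three example segments above, as an alignment step might produce them.

```python
from collections import defaultdict
from itertools import combinations

# (German term, English translation) pairs read off the slide's three example segments.
term_pairs = [
    ("Spannungsversorgung", "voltage supply"),
    ("Spannungsversorgung", "power supply"),
    ("Stromversorgung", "power supply"),
]


def synonym_candidates(pairs):
    """Return {(term_a, term_b): shared translations} for terms with a translation in common."""
    translations = defaultdict(set)
    for source, target in pairs:
        translations[source].add(target)

    result = {}
    # Compare every pair of source terms and keep those whose translation sets overlap.
    for a, b in combinations(sorted(translations), 2):
        shared = translations[a] & translations[b]
        if shared:
            result[(a, b)] = shared
    return result


print(synonym_candidates(term_pairs))
# {('Spannungsversorgung', 'Stromversorgung'): {'power supply'}}
```

The output is a list of candidates only; as the later slides show, a human still validates each proposed synonym link.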
  • 4. Outline • What is multilingual term extraction? • What is the workflow from customer perspective? – customer use case examples – show extraction results, demonstrate human validation • How does the extraction work? – how we identify candidates • source-language candidates • translation candidates – how we filter translation candidates – how we identify source-language synonyms • What is Acrolinx and how does MTE fit in?
  • 6. Workflow: Customer perspective 1. Customer provides translated documents 2. Acrolinx provides extracted multilingual term candidates to customer 3. Customer validates candidates 4. Validated results become (or are added to) customer’s term bank
  • 7. Customer use cases, past examples Use case 1 – de-<en,fr,es,it,pt> (mostly de-en) – ~142,000 bilingual segments; ~2,685,000 tokens (total) Use case 2 – de-<en,fr> (all data trilingual) – ~132,000 bilingual segments; ~1,259,000 tokens – data document-aligned, not segment-aligned, so extra step required Use case 3 – en-de – ~942,000 bilingual segments; ~25,000,000 tokens – extract translations of a given list of keywords – determine which keywords don’t occur in data
  • 8. Results • human validation in Excel. “Baugruppe” has been translated inconsistently into English in the past. Mark respective translations as preferred/deprecated to guide translators in the future.
  • 9. Results “Stromversorgung” and “Einspeisung” have translations in common. → automatically identified as possible synonyms, so same Cluster ID. To validate synonym link, edit Subcluster IDs to be the same. Mark respective variants as preferred/deprecated to guide authors.
  • 11. How does the extraction work? • Extract source-language term candidates from source-language text (unless source-language terminology exists) The wizard begins creating the bootable image. – linguistics-based • especially part-of-speech patterns – same functionality built into the core Acrolinx product
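To make "part-of-speech patterns" concrete, here is a minimal sketch that extracts noun-phrase candidates from an already-tagged sentence. The pattern used (a run of adjectives and nouns ending in a noun), the tag names, and the hand-assigned tags are assumptions for illustration; the slides do not state which patterns Acrolinx actually uses.

```python
# The slide's example sentence with hand-assigned part-of-speech tags (an assumption;
# a real system would get these from a tagger).
tagged_sentence = [
    ("The", "DET"), ("wizard", "NOUN"), ("begins", "VERB"), ("creating", "VERB"),
    ("the", "DET"), ("bootable", "ADJ"), ("image", "NOUN"), (".", "PUNCT"),
]


def extract_candidates(tagged):
    """Collect maximal runs of ADJ/NOUN tokens that end in a NOUN."""
    candidates = []
    run = []
    # A sentinel token guarantees that a run at the end of the sentence is flushed.
    for word, tag in tagged + [("", "END")]:
        if tag in ("ADJ", "NOUN"):
            run.append((word, tag))
            continue
        # The run has ended: drop trailing adjectives so the candidate ends in a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if run:
            candidates.append(" ".join(w for w, _ in run))
        run = []
    return candidates


print(extract_candidates(tagged_sentence))
# ['wizard', 'bootable image']
```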
  • 12. How does the extraction work? • Extract translation candidates of each source-language term candidate from target-language text The wizard begins creating the bootable image. Der Assistent beginnt mit der Erstellung des bootfähigen Image. – use statistical phrase-alignment technology – same used in statistical machine translation
  • 13. How does the extraction work? • Filter translation candidates translation candidates for “Eingangsspannung” (pink = filtered out) … based on: – confidence score calculated from translation probabilities • can adjust threshold to favour precision or recall – surface characteristics (closed-class words, punctuation) – term-candidacy of translation (if possible for language)
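The first two filter criteria on this slide (a confidence threshold and surface characteristics) can be sketched as below. Everything concrete here is an assumption for illustration: the candidate strings and scores are invented, the closed-class word list is a tiny stand-in, and `keep` and `filter_candidates` are hypothetical names. The third criterion (term-candidacy of the translation) is not modelled.

```python
import string

# Tiny stand-in list of English closed-class words (assumption).
CLOSED_CLASS = {"the", "a", "an", "of", "for", "in", "and", "is"}

# Invented translation candidates for "Eingangsspannung" with invented confidence scores.
candidates = [
    ("input voltage", 0.62),
    ("the input voltage", 0.21),
    ("input voltage ,", 0.08),
    ("voltage", 0.05),
]


def keep(candidate, score, threshold):
    """Return True if the candidate passes the confidence and surface filters."""
    tokens = candidate.split()
    if not tokens or score < threshold:
        return False
    # Surface filter 1: no token may consist only of punctuation.
    if any(all(ch in string.punctuation for ch in tok) for tok in tokens):
        return False
    # Surface filter 2: a term should not start or end with a closed-class word.
    if tokens[0].lower() in CLOSED_CLASS or tokens[-1].lower() in CLOSED_CLASS:
        return False
    return True


def filter_candidates(scored, threshold):
    return [cand for cand, score in scored if keep(cand, score, threshold)]


print(filter_candidates(candidates, threshold=0.1))   # ['input voltage']
print(filter_candidates(candidates, threshold=0.01))  # ['input voltage', 'voltage']
```

Lowering the threshold lets the low-scoring "voltage" through, which is the precision/recall trade-off the slide mentions.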
  • 14. How does the extraction work? • Identify synonyms (‘cluster’ candidates) cluster around “Stromwandler” (minimum link confidence threshold = 0.01) – link confidence based on the degree to which translations are shared – can adjust threshold to favour precision or recall of links
  • 15. How does the extraction work? • Identify synonyms (‘cluster’ candidates) cluster around “Stromwandler” (minimum link confidence threshold = 0.03) – link confidence based on the degree to which translations are shared – can adjust threshold to favour precision or recall of links
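The effect of the link-confidence threshold on clustering can be sketched as follows. The slides say only that link confidence is "based on the degree to which translations are shared"; using Jaccard overlap for it is an assumption, so the thresholds below are not on the same scale as the 0.01 and 0.03 shown on the slides. The translation sets are invented for illustration, and `link_confidence` and `cluster` are hypothetical names.

```python
from itertools import combinations

# Invented translation sets for a few German terms (illustration only).
translations = {
    "Stromversorgung": {"power supply", "power feed"},
    "Einspeisung": {"power supply", "infeed", "supply"},
    "Spannungsversorgung": {"power supply", "voltage supply"},
    "Stromwandler": {"current transformer"},
}


def link_confidence(a, b):
    """Jaccard overlap of two translation sets (assumed stand-in for the real score)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(term_translations, min_confidence):
    """Group terms into connected components of links with confidence >= min_confidence."""
    parent = {term: term for term in term_translations}

    def find(term):
        # Union-find lookup with path halving.
        while parent[term] != term:
            parent[term] = parent[parent[term]]
            term = parent[term]
        return term

    for a, b in combinations(term_translations, 2):
        if link_confidence(term_translations[a], term_translations[b]) >= min_confidence:
            parent[find(a)] = find(b)

    groups = {}
    for term in term_translations:
        groups.setdefault(find(term), []).append(term)
    return sorted(sorted(group) for group in groups.values())


print(cluster(translations, min_confidence=0.2))
# [['Einspeisung', 'Spannungsversorgung', 'Stromversorgung'], ['Stromwandler']]
print(cluster(translations, min_confidence=0.3))
# [['Einspeisung'], ['Spannungsversorgung', 'Stromversorgung'], ['Stromwandler']]
```

Raising the threshold removes the weaker links, so the cluster shrinks: fewer proposed synonym links, but a larger share of them likely to be correct.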
  • 17. What is Acrolinx? Acrolinx is Content Optimization Software. It helps authors make their text – more correct, – more consistent, – and more readable.
  • 18. What is Acrolinx? Acrolinx is Content Optimization Software. It helps authors make their text – more correct, – more consistent, – and more readable. Consistent use of terminology is an important factor in the readability of text. Acrolinx provides: – term extraction (monolingual, aka term harvesting) – terminology management – term checking Multilingual Term Extraction as a Service is a natural complement to the prior terminology functions.
  • 19. Acrolinx @ tekom Visit Acrolinx at tekom! → Hall 3, Stand 310
  • 21. Questions?