Chelo Vargas-Sierra


"Bilingual Terminology Extraction from TMX. A state-of-the-art overview." Presentation at Translating Europe Forum 2016. Focus on translation technology.

Published in: Technology

  1. 1. Bilingual Terminology Extraction from TMX: A State-of-the-Art Overview. Chelo Vargas-Sierra, PhD, University of Alicante, Spain
  2. 2. INDEX: Main points of this presentation. 1st point: Key words (overview of terms involved in the process). 2nd point: Terminology and extractors (terminology management; its timeline; BATE: approaches, state of the art). 3rd point: Evaluation (BATE under evaluation; measures for accuracy; quality in use model and tasks). 4th point: Results (precision & recall; parameters & questionnaire).
  3. 3. KEY WORDS: Terms involved in the process. Parallel corpus: TMX. Alignment levels: paragraph, sentence and word level. ATE & BATE. Precision/Recall: getting only terms and all terms. Gold standard: exhaustive, manually created bilingual glossary. Validation: term validation facility; which TCs (term candidates) are real terms? Usability: software used to achieve the user’s objectives with effectiveness, efficiency, and satisfaction. Quality in use model: ISO standard.
  4. 4. 2. Terminology & Extractors
  5. 5. IMPORTANCE OF TERMINOLOGY. Translators were the first professionals to be aware of term-related issues. IDENTIFY: identify and interpret the terminology in the source text adequately. RETRIEVE: retrieve and store terminological data. FIND: find and use proper documentation and information resources.
  9. 9. TERMINOLOGY MANAGEMENT. Time spent solving terminological problems in specialized translation: +40% (Arntz 1993, Walker 1993). Managing terminology (extracting, validating, importing, adding, editing, deleting, revising, updating, exporting, publishing) is a time-consuming process. Terminology work happens “backstage”, and customers or employers may not be fully aware of its benefits for QA. Return on Investment (ROI) on terminology management reported by some corporate studies (Childress, 2007; Popiolek, 2015): 90%.
  10. 10. TERMINOLOGY MANAGEMENT. General model for project terminology creation (Popiolek, 2015: 351). Extraction (monolingual extraction & validation): • List of terms extracted from the ST • List of terms to validate (accept or reject). Translation (importing & looking for equivalents): • List is added to a termbase • List is translated and additional data are added. Approval: • List approved by a person in charge of terminology • When the client has requested it, there is an additional step for client approval.
  11. 11. TIMELINE in terminology management with bilingual extraction. Preparation (TMX import): preparing the files and importing them into the BATE. Bilingual extraction: list of candidate term pairs extracted from the TMX.
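The TMX import step can be illustrated with a minimal sketch. It assumes a standard TMX file in which each `<tu>` holds one `<tuv>` per language with the text in `<seg>`; the sample content and the function name `read_segment_pairs` are hypothetical, and this is not how any of the tools discussed here implement their import.

```python
import xml.etree.ElementTree as ET

# xml:lang belongs to the reserved XML namespace, so ElementTree expands it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-unit translation memory, used only for illustration.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The term base stores validated terms.</seg></tuv>
      <tuv xml:lang="es"><seg>La base terminológica almacena términos validados.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en-GB"><seg>Import the translation memory.</seg></tuv>
      <tuv xml:lang="es-ES"><seg>Importe la memoria de traducción.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_segment_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs for the two requested languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain "lang" attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary language subtag ("en-GB" -> "en").
                # itertext() also collects text inside inline markup elements.
                segments[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


pairs = read_segment_pairs(SAMPLE_TMX, "en", "es")
for source, target in pairs:
    print(source, "|", target)
```

The resulting list of aligned segment pairs is the input a bilingual extractor works on; real TMX files are usually read from disk with `ET.parse` instead of a string.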
  12. 12. Validation (& data entry): list of term pairs to validate (accept or reject terms and suggested equivalents); term by term, additional data are added to a term base (Synchroterm). Export/Import: export bilingual terms and additional data in an available file format (.xls, .txt, .TBX, …); import the output file into a TDB system (to be integrated into an MT system).
  13. 13. Approval: person in charge of terminology or client. Finish: ready to use.
  14. 14. Bilingual Automatic Term Extractors: two approaches (Foo, 2012). EXTRACT-ALIGN. 1st step: monolingual terminology extraction in both languages. 2nd step: cross-linguistic matching using word alignment or co-occurrence statistics to find equivalents. Commercial systems fall under this approach.
  15. 15. Bilingual Automatic Term Extractors: two approaches (Foo, 2012). ALIGN-FILTER. 1st step: word alignment on the parallel texts. 2nd step: rank the aligned units statistically to select the most likely candidate pairs. Example: TExSIS (Macken et al., 2013).
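The second step of the extract-align approach (matching monolingual candidates through co-occurrence statistics) can be sketched as follows. The Dice coefficient is used here as one common co-occurrence measure; it is an assumption chosen for illustration, not the statistic any particular system in these slides is known to use. The segment pairs, the candidate lists, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical aligned segments (for example, read from a TMX file).
SEGMENT_PAIRS = [
    ("The term base stores terms.",
     "La base terminológica almacena términos."),
    ("Import the translation memory.",
     "Importe la memoria de traducción."),
    ("The translation memory feeds the term base.",
     "La memoria de traducción alimenta la base terminológica."),
]

# Hypothetical output of step 1: monolingual term candidates per language.
SRC_CANDIDATES = ["term base", "translation memory"]
TGT_CANDIDATES = ["base terminológica", "memoria de traducción"]


def dice_scores(segment_pairs, src_candidates, tgt_candidates):
    """Score candidate pairs by how often they occur in the same aligned segment."""
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_seg, tgt_seg in segment_pairs:
        src_low, tgt_low = src_seg.lower(), tgt_seg.lower()
        # Naive substring matching; a real extractor would tokenize and lemmatize.
        src_hits = [c for c in src_candidates if c in src_low]
        tgt_hits = [c for c in tgt_candidates if c in tgt_low]
        src_freq.update(src_hits)
        tgt_freq.update(tgt_hits)
        pair_freq.update((s, t) for s in src_hits for t in tgt_hits)
    # Dice coefficient: 2 * joint frequency / (source frequency + target frequency).
    return {
        (s, t): 2 * joint / (src_freq[s] + tgt_freq[t])
        for (s, t), joint in pair_freq.items()
    }


def best_equivalents(scores):
    """Keep the highest-scoring target candidate for each source candidate."""
    best = {}
    for (src, tgt), score in scores.items():
        if src not in best or score > best[src][1]:
            best[src] = (tgt, score)
    return best


scores = dice_scores(SEGMENT_PAIRS, SRC_CANDIDATES, TGT_CANDIDATES)
best = best_equivalents(scores)
for src, (tgt, score) in best.items():
    print(f"{src} -> {tgt} (Dice = {score:.2f})")
```

The ranked pairs are still only candidates: as the timeline slides describe, a person has to accept or reject each suggested equivalent during validation.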
  16. 16. Bilingual Automatic Term Extractors: Academic / In-house
       90s:
       - English-French, TERMIGHT (Dagan & Church, 1994)
       - English-French (Kupiec, 1993)
       - English-Dutch (Eijk, 1993)
       - English-French (Gaussier, 1995)
       - English and Swedish (Ahrenberg et al., 1998)
       2000-2009:
       - French-German (Blank, 2000)
       - Japanese-English, MNH (Nakagawa & Mori, 2003)
       - Spanish-Basque, Elexbi (Hernaiz et al., 2006), from a TMX
       - Spanish-German, Autoterm (Haller, 2008)
       - English-Spanish, Mutual Bilingual Term Extractor (Ha et al., 2008)
       - French-English, French-Italian and French-Dutch (Lefever et al., 2009)
       2010-2016:
       - French-Japanese (Morin et al., 2010, from ACABIT, Daille, 2003): not bilingual, but multilingual
       - Slovene and English, LUIZ (Vintar, 2010)
       - English and Swedish, ITools suite (Foo & Merkel, 2010)
       - English and German (Gojun et al., 2012)
       - English, French, German, Spanish, TTC TermSuite (Daille, 2012)
       - English-Spanish, TBXTools (Oliver & Vázquez, 2015) (under development)
       - Chinese, Czech, Dutch, English, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish: Sketch Engine (Baisa et al., 2015; Kovář et al., 2016)
  17. 17. Bilingual Automatic Term Extractors: Other BATE (free / commercial)
       MONOLINGUAL ATE: - TermExtractor (Shimohata et al., 2001) - MemoQ's built-in term extractor - Déjà Vu (Lexicon) - TermoStat Web: - Yate (IULA) - Okapi - TerMine: - TerminologyExtractor: - PRoMT - FiveFilters (web-based): extraction/ - Concordance programs: WordSmith Tools, AntConc (free), …
       BILINGUAL: - Xerox Terminology Suite (2001) - SDL Multiterm Extract - Synchroterm - CrossMining (Across) - MultiTrans Term Extractor - Similis™ (by Lingua et Machina™) - Anchovy (by Swordfish) - Araya Term Extractor - Analysis software: Sketch Engine (terminology extraction from TMX)
  18. 18. 3. Evaluation
  19. 19. BATE UNDER EVALUATION: Sketch Engine, Similis, Multiterm Extract, Synchroterm
  20. 20. MAIN FEATURES (comparison table). Tools: Multiterm Extract, SynchroTerm, Similis, SkE, Araya. Features compared: Import TMX; Extraction config.; Extraction scores; Validation facility; Term base indexation; Export to TBX (xls, txt…). Surviving cell entries: Trados TMX; Others; Others.
  22. 22. QUALITY IN USE MODEL: Characteristics (ISO/IEC 25010:2011). Effectiveness: accuracy and completeness with which the user achieves objectives. Efficiency: resources expended in relation to the accuracy and completeness. Satisfaction: degree to which user needs are satisfied when a software is used in a specified context of use. Freedom from risk: no risk for the security of users, software, context or the environment. Context coverage (including flexibility): degree to which the product understands the complete context of its usage.
  23. 23. 6 TASKS TO EVALUATE when performing bilingual extraction. CONFIGURATION: setting up the extraction project. TMX IMPORT: importing the source file. EXTRACTION: performing the extraction to get a bilingual list. VALIDATION: selecting the real terms. RECORD CREATION: creating and managing term entries. EXPORTATION: exporting the final result for later use in CAT systems.
  24. 24. 4. Results
  25. 25. PRECISION & RECALL IN % (bar chart; series: Sketch, MTE, Synchr, Similis), with the underlying counts:
       Tool      | Extracted: terms (A) | Extracted: no terms (C) | Non-extracted: terms (B) | Non-extracted: no terms (D) | Gold standard | TCs   | Precision | Recall
       Sketch    | 283 | 717   | 370 | - | 653 | 1,000 | 28.30 | 43.34
       MTE       | 97  | 813   | 556 | - | 653 | 910   | 10.66 | 14.85
       SynchroT. | 407 | 1,505 | 246 | - | 653 | 1,912 | 21.29 | 62.33
       Similis   | 337 | 405   | 316 | - | 653 | 742   | 45.42 | 51.61
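The precision and recall figures on this slide follow from the A, C and gold-standard counts: precision is A divided by all extracted candidates (A + C), and recall is A divided by the gold-standard size (A + B = 653 for every tool). A short sketch that recomputes them, with the function and variable names chosen here for illustration:

```python
# Size of the manually created gold-standard glossary (A + B on the slide).
GOLD_STANDARD_TERMS = 653

# Per tool: (A = extracted real terms, C = extracted non-terms), from the slide.
EXTRACTED_COUNTS = {
    "Sketch Engine": (283, 717),
    "Multiterm Extract": (97, 813),
    "SynchroTerm": (407, 1505),
    "Similis": (337, 405),
}


def precision_recall(true_terms, false_terms, gold_size):
    """Return (precision, recall) as percentages."""
    candidates = true_terms + false_terms      # all extracted term candidates
    precision = 100 * true_terms / candidates  # share of candidates that are terms
    recall = 100 * true_terms / gold_size      # share of gold-standard terms found
    return precision, recall


results = {
    tool: precision_recall(a, c, GOLD_STANDARD_TERMS)
    for tool, (a, c) in EXTRACTED_COUNTS.items()
}
for tool, (precision, recall) in results.items():
    print(f"{tool}: precision {precision:.2f}%, recall {recall:.2f}%")
```

Read this way, the table shows the usual trade-off: SynchroTerm proposes the most candidates and reaches the highest recall, while Similis proposes the fewest and reaches the highest precision.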
  26. 26. PARAMETERS: Characteristics and sub-characteristics to be measured (METRICS)
       EFFECTIVENESS: value between 0 (minimum) and 5 (maximum) = (EFE1+EFE2+EFE3)/3
       - EFE1. Degree of accuracy (precision of tasks & results): (P1+P7+P13+P19+P25+P31)/6
       - EFE2. Degree of completeness (tasks are accomplished and results are not missing): (P2+P8+P14+P20+P26+P32)/6
       - EFE3. Frequency of errors: (P3+P9+P15+P21+P27+P33)/6
       EFFICIENCY: value between 0 (minimum) and 5 (maximum) = (EFI2+EFI3+EFI4)/3
       - EFI1. Time spent in the accomplishment of the task: (TM1+TM2+TM3+TM4+TM5+TM6)
       - EFI2. Need to use additional sources (material, software, etc.) for the task: (P4+P10+P16+P22+P28+P34)/6
       - EFI3. Productivity (effort exerted by the user to carry out the task): (P5+P11+P17+P23+P29+P35)/6
       - EFI4. Need to consult the software Help to perform the task: (P6+P12+P18+P24+P30+P36)/6
       SATISFACTION: value between 0 (minimum) and 5 (maximum) = (P37+P38+P39)/3
       - SAT1. Usefulness
       - SAT2. Trust
       - SAT3. Pleasure
       CONTEXT COVERAGE: value between 0 (minimum) and 5 (maximum) = (P40+P41+P42)/3
       - COB1. Context of use
       - COB2. Flexibility
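The scoring scheme on this slide can be written out as a small calculation. The sketch assumes that questions P1 to P36 come in six blocks of six (one block per task), so the sub-characteristic at position k uses questions k, k+6, …, k+30, which is what the formulas list. The total is assumed to be the sum of the four characteristic scores, which is consistent with the TOTAL QIU values reported in the results. The sample answers and function names are hypothetical.

```python
from statistics import mean


def sub_characteristic(answers, position):
    """Average one question position (1-6) across the six evaluated tasks."""
    return mean(answers[6 * task + position] for task in range(6))


def quality_in_use(answers):
    """Compute the four characteristic scores (0-5 each) and their sum.

    `answers` maps question numbers 1-42 to scores between 0 and 5.
    """
    scores = {
        # EFE1-EFE3 use question positions 1-3 of each task block.
        "effectiveness": mean(sub_characteristic(answers, k) for k in (1, 2, 3)),
        # EFI2-EFI4 use positions 4-6; EFI1 (time spent) is measured separately.
        "efficiency": mean(sub_characteristic(answers, k) for k in (4, 5, 6)),
        "satisfaction": mean(answers[q] for q in (37, 38, 39)),
        "context_coverage": mean(answers[q] for q in (40, 41, 42)),
    }
    # Assumption: total QIU is the plain sum of the four characteristics.
    scores["total"] = sum(scores.values())
    return scores


# Hypothetical questionnaire: accuracy questions (P1, P7, ...) score 5,
# all other task questions score 2, satisfaction 4, context coverage 3.
sample_answers = {q: (5.0 if q % 6 == 1 else 2.0) for q in range(1, 37)}
sample_answers.update({37: 4.0, 38: 4.0, 39: 4.0, 40: 3.0, 41: 3.0, 42: 3.0})

sample_scores = quality_in_use(sample_answers)
for name, value in sample_scores.items():
    print(f"{name}: {value:.2f}")
```

Because each characteristic is capped at 5, the total under this assumption ranges from 0 to 20.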
  27. 27. 27 QUESTIONNAIRE 42 questions grouped by tasks
  28. 28. RESULTS FOR EXTRACTION & VALIDATION (bar chart; values in the order Extraction, Validation): Sketch 16, 13; MTE 14, 25; Synchr 20, 26; Similis 21, 24.
       FINAL RESULTS FOR QUALITY IN USE (bar chart):
       Tool    | Effectiveness | Efficiency | Satisfaction | Context coverage | Total QIU
       Sketch  | 3.33 | 3.00 | 4.00 | 3.50 | 13.83
       MTE     | 4.06 | 4.44 | 3.00 | 1.50 | 13.00
       Synchr  | 4.11 | 4.22 | 4.33 | 3.00 | 15.67
       Similis | 3.72 | 3.11 | 3.00 | 3.00 | 12.83
  29. 29. CONCLUSIONS • Managing terminology still takes a lot of time and effort, even in this increasingly computerized profession. • Research on automatic terminology extraction has been around for more than 20 years, and significant enhancements in bilingual extraction and the exploitation of bilingual corpora have been introduced. • I briefly described the BATE under evaluation and illustrated some results obtained for accuracy and with the QIU model. • The results make it clear that much more work has to be done before BATE can be considered of real help to translators and terminologists, mainly because of poor accuracy results.
  30. 30. Some references
       • Baisa, Vít, Barbora Ulipová, and Michal Cukr. 2015. “Bilingual Terminology Extraction in Sketch Engine.” In 9th Workshop on Recent Advances in Slavonic Natural Language Processing (RASLAN 2015), 61–67.
       • Childress, Mark D. 2007. “Terminology Work Saves More Time than It Cost.” Multilingual, no. April/May: 43–46.
       • Foo, Jody. 2012. Computational Terminology: Exploring Bilingual and Monolingual Term Extraction.
       • Foo, Jody, and Magnus Merkel. 2010. “Computer Aided Term Bank Creation and Standardization: Building Standardized Term Banks through Automated Term Extraction and Advanced Editing Tools.” In Terminology in Everyday Life, edited by Marcel Thelen and Frieda Steurs, 163–80. John Benjamins Publishing Company. doi:10.1075/tlrp.13.12foo.
       • Kovář, Vojtěch, Vít Baisa, and Miloš Jakubíček. 2016. “Sketch Engine for Bilingual Lexicography.” International Journal of Lexicography 29 (3): 339–52. doi:10.1093/ijl/ecw029.
       • Macken, Lieve, Els Lefever, and Veronique Hoste. 2013. “TExSIS: Bilingual Terminology Extraction from Parallel Corpora Using Chunk-Based Alignment.” Terminology 19 (1): 1–30. doi:10.1075/term.19.1.01mac.
       • Oliver, Antoni, and M. Vazquez. 2015. “TBXTools: A Free, Fast and Flexible Tool for Automatic Terminology Extraction.” In Proceedings of Recent Advances in Natural Language Processing, 473–79.
       • Popiolek, Monika. 2015. “Terminology Management within a Translation Quality Assurance Process.” In Handbook of Terminology (Volume 1), edited by Hendrik J. Kockaert and Frieda Steurs, 341–59. John Benjamins Publishing Company. doi:10.1075/hot.1.ter6.
       • Sauron, Véronique. 2002. “Tearing out the Terms: Evaluating Terms Extractors.” In Translating and the Computer 24: Proceedings from the Aslib Conference, 21-22 November 2002.
       • Vintar, Špela. 2010. “Bilingual Term Recognition Revisited: The Bag-of-Equivalents Term Alignment Approach and Its Evaluation.” Terminology 16 (2): 141–58. doi:10.1075/term.16.2.01vin.
  31. 31. Many thanks for your attention. Chelo Vargas-Sierra. University of Alicante, IULMA, Campus de San Vicente, Apdo. 99, 03080 Alicante. Phone & Fax: Direct Line: +34 965903438; Fax: +34 965903800. Social Media: @chelovargas