Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

Material presented at the TKE (Terminology and Knowledge Engineering) Conference 2010, Dublin, Ireland.
Download paper at http://hal.archives-ouvertes.fr/hal-00544403
Institutions: Laboratoire d'Informatique de Nantes Atlantique (LINA), Lingua et Machina.


  1. Dealing with Lexicon Acquired from Comparable Corpora: Post-edition and Exchange
     Estelle Delpech, Lingua et Machina
     Béatrice Daille, U. de Nantes - LINA
  2. Working with lexicon acquired from comparable corpora
     I. Terminology acquisition from comparable corpora: quick overview
     II. A tool for terminology post-edition
     III. Data exchange: a TBX variant for automatically acquired lexicons
     IV. Future work
  3. Part I: Terminology Acquisition from Comparable Corpora
  4. Terminology acquisition from comparable corpora
     • Comparable corpora: “Two corpora, respectively in two languages l1 and l2, are said to be ‘comparable’ if there exists a substantial part of the vocabulary of the corpus in language l1 whose translation can be found in the corpus in language l2.” (my translation of [Déjean and Gaussier, 2002])
     • Advantages:
       • Availability
       • Real usages
  5. Terminology acquisition from comparable corpora
     • Terminology extraction: a contextual analysis
       • Compare the contexts of source and target terms
       • If the contexts are similar, there is a good chance that the source and target terms are translations of each other, e.g.:
         mastectomy: reconstruction, prophylactic, treat, undergo, removal
         mastectomie: reconstruction, prophylactique, traiter, subir, ablation
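The context comparison described on this slide can be sketched as follows. This is a minimal illustration and not the authors' implementation: the seed dictionary, the raw co-occurrence counts, the cosine measure, and the distractor term "radiographie" are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical English -> French seed dictionary (assumption, for illustration only).
SEED_DICT = {
    "reconstruction": "reconstruction",
    "prophylactic": "prophylactique",
    "treat": "traiter",
    "undergo": "subir",
    "removal": "ablation",
}


def translate_context(context, seed_dict):
    """Map source context words into the target language, dropping unknown words."""
    translated = Counter()
    for word, count in context.items():
        if word in seed_dict:
            translated[seed_dict[word]] += count
    return translated


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(source_context, target_contexts, seed_dict):
    """Rank target terms by similarity of their context to the translated source context."""
    translated = translate_context(source_context, seed_dict)
    scored = [(term, cosine(translated, ctx)) for term, ctx in target_contexts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Context words taken from the slide; "radiographie" is an invented distractor.
mastectomy = Counter(["reconstruction", "prophylactic", "treat", "undergo", "removal"])
targets = {
    "mastectomie": Counter(["reconstruction", "prophylactique", "traiter", "subir", "ablation"]),
    "radiographie": Counter(["image", "subir", "rayon"]),
}
ranking = rank_candidates(mastectomy, targets, SEED_DICT)
```

The ranked list is a one-to-many alignment: every target term receives a score, and a human post-editor still has to pick the right candidate.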
  6. Terminology acquisition from comparable corpora
     • Outputs one-to-many alignments, e.g. mastectomy: ablation (0.92), mastectomie (0.89), opération (0.48)
     • Evaluation: precision on the top-N best alignments
     • Results: not as good as acquisition from parallel corpora!
       • Fung (1997): 30% accuracy on the top 20 candidates
       • Morin et al. (2004): the correct translation is usually the 34th candidate for complex terms
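Precision on the top-N candidates can be computed along these lines. The function name and data layout are hypothetical, and treating "mastectomie" as the reference translation is an assumption; only the three scored candidates come from the slide.

```python
def top_n_precision(ranked_candidates, reference, n):
    """Share of source terms with a reference translation among their n best candidates."""
    if not ranked_candidates:
        return 0.0
    hits = 0
    for source, candidates in ranked_candidates.items():
        # Sort by score, best first, and keep the n best target terms.
        best = sorted(candidates, key=lambda pair: pair[1], reverse=True)[:n]
        expected = reference.get(source, set())
        if any(term in expected for term, _ in best):
            hits += 1
    return hits / len(ranked_candidates)


# Scores from the slide; the reference translation is assumed for illustration.
alignments = {
    "mastectomy": [("ablation", 0.92), ("mastectomie", 0.89), ("opération", 0.48)],
}
reference = {"mastectomy": {"mastectomie"}}

precision_top1 = top_n_precision(alignments, reference, 1)
precision_top2 = top_n_precision(alignments, reference, 2)
```

With this single entry, the assumed reference translation is missed at N = 1 and found at N = 2, which is why results are reported for a whole list of candidates and not only for the best one.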
  7. Part II: A Tool for Post-edition
  8. A tool for post-edition
     • Existing tools:
       • ArayaTermExtractor (Waldhör 2006)
       • iView (Merkel and Foo, 2007)
       • Xerox Terminology Suite ®
     • Our needs:
       • Deal with one-to-many alignments
       • Non-aligned contexts
       • Allow non-binary annotation
       • Display useful information to help find the right candidate in the corpus
  9. “Useful” information
     → Knowledge that helps capture the in vivo behavior of terms
     → Text-driven, term-oriented approach
     • Useful information:
       • Variants
       • Collocations
       • Distributional neighbors
       • Contexts
     → To be harvested during the term extraction / alignment process
  10. Useful information: example
     • Mastectomy
       • risk reducing ~, simple ~
       • Tumorectomy, Lumpectomy, Oophorectomy
       • “...patient may choose to have risk-reducing bilateral mastectomy if they have a strong family history of breast cancer...”
     • Mastectomie
       • ~ préventive, ~ simple
       • Tumorectomie, Ablation, Opération
       • “...la mastectomie préventive pourrait supprimer la grande majorité du risque de développer un cancer...”
  11. Post-edition interface: http://80.82.238.151/Metricc/InterfaceValidation (user “test”, no password)
  12. Part III: Data Exchange: a TBX variant for automatically acquired lexicon
  13. Quick introduction to TBX (1)
     • TBX: Term Base eXchange
     • Open, XML-based standard for exchanging structured terminological data
     • Approved as an international standard by LISA and ISO (standard 30042)
     • Maps to TMF data model
     • Subset of MARTIF
     • Designed for various use cases
     • Customizable
  14. Quick introduction to TBX (2)
     • 2 components:
       • Structure: core structure based on the TMF metamodel
       • Content: formalism to express data categories and their constraints
     • [Figure, adapted from ISO 30042:2008, Fig. 4, p. 30. Labels: Form (core DTD/Schema); Content (default XCS, XCS1, ..., XCSn); default TBX, TBX variant 1, ..., TBX variant n]
  15. Quick introduction to TBX (3)
     • Form defined in DTD
     • Content defined in XCS
     • [Figure, taken from ISO 30042:2008, Fig. 1, p. 9. Labels: respPerson, responsibility, reliabilityCode, partOfSpeech, corpusTrace, termType, usageNote]
  16. TBX variant for lexicon acquired from comparable corpora
     • Default TBX data categories:
       • termType: entryTerm, variant
       • externalCrossReference, usageNote
       • partOfSpeech, frequency, reliabilityCode...
       • transactionType, responsibility
     • Customized data categories:
       • occurrences, occurrenceCount
       • relatedTerm
       • termDefinition, definitionRelevance
       • ntigReference
  17. TBX variant: a term entry
  18. TBX variant: 1-to-n alignments
  19. TBX variant: approved alignment
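The markup shown on the three TBX variant slides did not survive in this transcript. The following is only a rough sketch of how a term entry with 1-to-n candidate alignments might be serialized in a TBX-like structure. The element names (termEntry, langSet, ntig, termGrp, term, termNote) follow TBX naming as far as I know, but the output is not validated against any TBX schema, the "alignmentScore" type and the entry id are hypothetical, and none of this should be read as the authors' actual variant.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xml:lang attribute, as ElementTree expects it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def add_lang_set(entry, lang, terms, part_of_speech="noun"):
    """Add one langSet holding one ntig per (term, score) pair; score may be None."""
    lang_set = ET.SubElement(entry, "langSet", {XML_LANG: lang})
    for term, score in terms:
        ntig = ET.SubElement(lang_set, "ntig")
        term_grp = ET.SubElement(ntig, "termGrp")
        ET.SubElement(term_grp, "term").text = term
        ET.SubElement(term_grp, "termNote", {"type": "partOfSpeech"}).text = part_of_speech
        if score is not None:
            # Hypothetical data category for the automatically estimated score.
            ET.SubElement(ntig, "descrip", {"type": "alignmentScore"}).text = f"{score:.2f}"
    return lang_set


def build_entry(entry_id, source_term, candidates):
    """Build a termEntry with one source term and n scored candidate translations."""
    entry = ET.Element("termEntry", {"id": entry_id})
    add_lang_set(entry, "en", [(source_term, None)])
    add_lang_set(entry, "fr", candidates)
    return entry


entry = build_entry(
    "entry-mastectomy",  # hypothetical identifier
    "mastectomy",
    [("ablation", 0.92), ("mastectomie", 0.89), ("opération", 0.48)],
)
xml_text = ET.tostring(entry, encoding="unicode")
```

Keeping every candidate as a separate ntig inside the same entry mirrors the one-to-many output of the alignment step. It also shows the difficulty raised in the feedback slide: the term's own properties and its candidate alignments end up in the same structure.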
  20. Feedback on TBX
     • TBX is made for stable terminologies with little uncertainty about the status of translations, not for machine-generated lexicons of “candidate translations”:
       • Difficult to separate a term and its properties from its alignments
       • No data category specific to automatically estimated reliability
     • Difficult to make text-driven, term-oriented knowledge fit in a concept-oriented format:
       • No definition category that would apply to a single term and not to the whole concept
  21. Conclusion: Future work
  22. Future work
     • Integration of the prototype in Libellex:
       • TBX import / export
       • Edition of linguistic properties
     • User testing (ergonomics)
     • Evaluation of added value for translation
     • Explore new ways of:
       • aligning terms
       • selecting contexts
  23. References
     • Post-edition prototype online: http://80.82.238.151/Metricc/InterfaceValidation/ (user “test”, no password)
     • Metricc project: http://www.metricc.com/
     • Lingua et Machina: http://www.lingua-et-machina.com/
     • Comparable corpora: Déjean, H., Gaussier, É. (2002): “Une nouvelle approche à l'extraction de lexiques bilingues à partir de corpus comparables”, in Lexicometrica, Alignement Lexical dans les corpus multilingues, pp. 1-22.
     • ArayaTermExtractor: http://www.heartsome.de
     • Xerox Terminology Suite: http://www.temis.com/
     • iView: Nyström, M., Merkel, M., Ahrenberg, L., Zweigenbaum, P., Petersson, H. and Åhlfeldt, H. (2006): “Creating a medical English-Swedish dictionary using interactive word alignment”, in BMC Medical Informatics and Decision Making, 2006, pp. 6-35.
     • TMF: ISO 16642 - Terminological markup framework
     • TBX: ISO 30042 - Systems to manage terminology, knowledge and content -- TermBase eXchange (TBX)
     • Data categories: ISO 12620 - Terminology and other language and content resources - Specification of data categories and management of a Data Category Registry for language resources
