Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

Material presented at the TKE (Terminology and Knowledge Engineering) Conference 2010, Dublin, Ireland.
Download paper at http://hal.archives-ouvertes.fr/hal-00544403
Institutions: Laboratoire d'Informatique de Nantes Atlantique (LINA), Lingua et Machina.

Transcript of "Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange"

  1. Dealing with Lexicon Acquired from Comparable Corpora: Post-edition and Exchange. Estelle Delpech (Lingua et Machina), Béatrice Daille (U. de Nantes - LINA)
  2. Working with lexicons acquired from comparable corpora
     I. Terminology acquisition from comparable corpora: quick overview
     II. A tool for terminology post-edition
     III. Data exchange: a TBX variant for automatically acquired lexicons
     IV. Future work
  3. Part I. Terminology Acquisition from Comparable Corpora
  4. Terminology acquisition from comparable corpora
     • Comparable corpora: “Two corpora, respectively in two languages l1 and l2, are said to be ‘comparable’ if there exists a substantial part of the vocabulary of the corpus in language l1 whose translation can be found in the corpus in language l2.” (my translation of [Déjean and Gaussier, 2002])
     • Advantages: availability, real usages
  5. Terminology acquisition from comparable corpora
     • Terminology extraction: a contextual analysis
     • Compare the contexts of source and target terms
     • If the contexts are similar, there is a good chance that the source and target terms are translations of each other, e.g.:
       mastectomy: reconstruction, prophylactic, treat, undergo, removal
       mastectomie: reconstruction, prophylactique, traiter, subir, ablation
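The context comparison on slide 5 can be illustrated with a minimal Python sketch. The slide does not say how contexts are represented or compared, so the bag-of-words vectors, the cosine measure, the seed dictionary and the context of "opération" are all assumptions made for illustration; only the context words for "mastectomy" and "mastectomie" come from the slide.

```python
import math
from collections import Counter

# Hypothetical English -> French seed dictionary (not given in the slides).
SEED_DICTIONARY = {
    "reconstruction": "reconstruction",
    "prophylactic": "prophylactique",
    "treat": "traiter",
    "undergo": "subir",
    "removal": "ablation",
}


def translate_context(context, seed_dictionary):
    """Map source context words into the target language; unknown words are dropped."""
    translated = Counter()
    for word, count in context.items():
        if word in seed_dictionary:
            translated[seed_dictionary[word]] += count
    return translated


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(source_context, target_contexts, seed_dictionary):
    """Return (target term, similarity) pairs, most similar first."""
    translated = translate_context(source_context, seed_dictionary)
    scores = [(term, cosine(translated, ctx)) for term, ctx in target_contexts.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Context words for "mastectomy" and "mastectomie" are taken from slide 5.
source_context = Counter(["reconstruction", "prophylactic", "treat", "undergo", "removal"])
target_contexts = {
    "mastectomie": Counter(["reconstruction", "prophylactique", "traiter", "subir", "ablation"]),
    # Made-up context for a competing candidate, for illustration only.
    "opération": Counter(["subir", "chirurgien", "bloc", "ablation"]),
}

ranked = rank_candidates(source_context, target_contexts, SEED_DICTIONARY)
for term, score in ranked:
    print(f"mastectomy -> {term}: {score:.2f}")
```

Ranking every target term this way naturally yields several scored candidates per source term, which is the one-to-many output the next slide describes.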
  6. Terminology acquisition from comparable corpora
     • Outputs one-to-many alignments, e.g. for mastectomy: ablation (0.92), mastectomie (0.89), opération (0.48)
     • Evaluation: precision on the Top-N best alignments
     • Results: not as good as acquisition from parallel corpora!
       Fung (1997): 30% accuracy on the Top 20 candidates
       Morin et al. (2004): translation is usually the 34th for complex terms
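The Top-N evaluation mentioned on slide 6 can be sketched as follows. This is one common reading of "precision on the Top-N best alignments" (a source term counts as correct if a reference translation appears among its N best candidates); the slides do not define the measure precisely. The function name and the reference lexicon are hypothetical; the candidate scores are the ones shown on the slide.

```python
def precision_at_n(ranked_alignments, reference, n):
    """Share of source terms with a reference translation among their n best candidates."""
    if not ranked_alignments:
        return 0.0
    hits = 0
    for source, candidates in ranked_alignments.items():
        top_n = [target for target, _score in candidates[:n]]
        if any(target in reference.get(source, set()) for target in top_n):
            hits += 1
    return hits / len(ranked_alignments)


# One-to-many alignment from slide 6, already sorted by decreasing score.
ranked_alignments = {
    "mastectomy": [("ablation", 0.92), ("mastectomie", 0.89), ("opération", 0.48)],
}
# Hypothetical reference lexicon used for the evaluation.
reference = {"mastectomy": {"mastectomie"}}

print(precision_at_n(ranked_alignments, reference, 1))
print(precision_at_n(ranked_alignments, reference, 2))
```

With this single entry the correct translation is ranked second, so it is missed at N = 1 and found at N = 2, which is why the post-edition tool has to let the user inspect several candidates.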
  7. Part II. A Tool for Post-edition
  8. A tool for post-edition
     • Existing tools:
       ArayaTermExtractor (Waldhör 2006)
       iView (Merkel and Foo, 2007)
       Xerox Terminology Suite ®
     • Our needs:
       Deal with one-to-many alignments
       Non-aligned contexts
       Allow non-binary annotation
       Display useful information to help find the right candidate in the corpus
  9. “Useful” information
     → Knowledge that helps capture the in vivo behavior of terms
     → Text-driven, term-oriented approach
     • Useful information: variants, collocations, distributional neighbors, contexts
     → To be harvested during the term extraction / alignment process
  10. Useful information: example
     • Mastectomy
       risk-reducing ~, simple ~
       Tumorectomy, Lumpectomy, Oophorectomy
       “...patient may choose to have risk-reducing bilateral mastectomy if they have a strong family history of breast cancer...”
     • Mastectomie
       ~ préventive, ~ simple
       Tumorectomie, Ablation, Opération
       “...la mastectomie préventive pourrait supprimer la grande majorité du risque de développer un cancer...”
  11. Post-edition interface: http://80.82.238.151/Metricc/InterfaceValidation, user “test”, no password
  12. Part III. Data Exchange: a TBX variant for automatically acquired lexicon
  13. Quick introduction to TBX (1)
     • TBX: Term Base eXchange
     • Open, XML-based standard for exchanging structured terminological data
     • Approved as an international standard by LISA and ISO (norm 30042)
     • Maps to TMF data model
     • Subset of MARTIF
     • Designed for various use cases
     • Customizable
  14. Quick introduction to TBX (2)
     • 2 components:
       Structure: core structure based on TMF metamodel
       Content: formalism to express data-categories and their constraints
     [Figure: Form (Core DTD/Schema) and Content (Default XCS, XCS1, ..., XCSn) giving Default TBX, TBX variant 1, ..., TBX variant n. Adapted from ISO norm 30042:2008, Fig. 4, p.30]
  15. Quick introduction to TBX (3)
     • Form defined in DTD
     • Content defined in XCS
     [Figure: sample entry annotated with the data categories respPerson, responsibility, reliabilityCode, partOfSpeech, corpusTrace, termType, usageNote. Taken from ISO norm 30042:2008, Fig. 1, p.9]
  16. TBX variant for lexicon acquired from comparable corpora
     • Default TBX data-categories:
       termType: entryTerm, variant
       externalCrossReference, usageNote
       partOfSpeech, frequency, reliabilityCode...
       transactionType, responsibility
     • Customized data-categories:
       occurrences, occurrenceCount
       relatedTerm
       termDefinition, definitionRelevance
       ntigReference
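To give a rough idea of how the data categories on slide 16 could appear in an entry, here is a minimal Python sketch that builds a TBX-like term entry with the standard library. The nesting termEntry > langSet > ntig > termGrp > term follows the general TBX core structure, but the way occurrenceCount and ntigReference are attached (a termNote and a ref element) is an assumption, not the encoding used in the authors' variant. The ids and occurrence counts are made up, and the output is not validated against any DTD or XCS.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this namespace with the predefined "xml" prefix.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def add_term(lang_set, ntig_id, term, term_type, occurrence_count, aligned_with=None):
    """Append one term (ntig) to a langSet; element choices are illustrative assumptions."""
    ntig = ET.SubElement(lang_set, "ntig", {"id": ntig_id})
    term_grp = ET.SubElement(ntig, "termGrp")
    ET.SubElement(term_grp, "term").text = term
    ET.SubElement(term_grp, "termNote", {"type": "termType"}).text = term_type
    # Customized category from slide 16; its exact encoding is assumed here.
    ET.SubElement(term_grp, "termNote", {"type": "occurrenceCount"}).text = str(occurrence_count)
    if aligned_with is not None:
        # Hypothetical use of ntigReference to point at the aligned source term.
        ET.SubElement(ntig, "ref", {"type": "ntigReference", "target": aligned_with})
    return ntig


def build_entry(entry_id, source_term, candidates):
    """Build a 1-to-n alignment: one source term, several candidate translations."""
    entry = ET.Element("termEntry", {"id": entry_id})
    source_set = ET.SubElement(entry, "langSet", {XML_LANG: "en"})
    add_term(source_set, "ntig-en-1", source_term, "entryTerm", 12)  # made-up count
    target_set = ET.SubElement(entry, "langSet", {XML_LANG: "fr"})
    for index, (candidate, count) in enumerate(candidates, start=1):
        add_term(target_set, f"ntig-fr-{index}", candidate, "entryTerm", count,
                 aligned_with="ntig-en-1")
    return entry


# Candidate translations from slide 6; the occurrence counts are invented.
entry = build_entry("entry-1", "mastectomy",
                    [("ablation", 30), ("mastectomie", 25), ("opération", 40)])
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

Keeping each candidate in its own ntig with a back-reference is one way to express a 1-to-n alignment inside a single entry; an approved alignment would then amount to keeping only the validated candidate.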
  17. TBX variant: a term entry
  18. TBX variant: 1-to-n alignments
  19. TBX variant: approved alignment
  20. Feedback on TBX
     • TBX is made for stable terminologies with little uncertainty on the status of translations, not for machine-generated lexicons of “candidate translations”:
       difficult to separate a term and its properties from its alignments
       no data category specific to automatically estimated reliability
     • Difficult to make text-driven, term-oriented knowledge fit in a concept-oriented format:
       no definition category that would apply to a single term and not the whole concept
  21. Conclusion: Future work
  22. Future work
     • Integration of prototype in Libellex
     • TBX import / export
     • Edition of linguistic properties
     • User testing (ergonomics)
     • Evaluation of added value for translation
     • Explore new ways of aligning terms and selecting contexts
  23. References
     • Post-edition prototype online: http://80.82.238.151/Metricc/InterfaceValidation/ user “test”, no password
     • Metricc project: http://www.metricc.com/
     • Lingua et Machina: http://www.lingua-et-machina.com/
     • Comparable corpora: Déjean, H., Gaussier, É. (2002): “Une nouvelle approche à l'extraction de lexiques bilingues à partir de corpus comparables”, In Lexicometrica, Alignement Lexical dans les corpus multilingues, pp. 1-22.
     • ArayaTermExtractor: http://www.heartsome.de
     • Xerox Terminology Suite: http://www.temis.com/
     • iView: Nyström, M., Merkel, M., Ahrenberg, L., Zweigenbaum, P., Petersson, H. and Åhlfeldt, H. (2006): “Creating a medical English-Swedish dictionary using interactive word alignment”, In BMC Medical Informatics and Decision Making, 2006, pp. 6-35.
     • TMF: ISO 16642 - Terminological markup framework
     • TBX: ISO 30042 - Systems to manage terminology, knowledge and content: TermBase eXchange (TBX)
     • Data categories: ISO 12620 - Terminology and other language and content resources: Specification of data categories and management of a Data Category Registry for language resources