Dealing with Lexicon Acquired from
Comparable Corpora
Post-edition and Exchange
Estelle Delpech, Lingua et Machina
Béatrice Daille, U. de Nantes - LINA

1/23
Working with lexicon acquired from
comparable corpora
I. Terminology acquisition from comparable corpora : quick overview

II. A tool for terminology post-edition

III. Data exchange : a TBX variant for automatically acquired lexicons

IV. Future work

2/23
Part I
Terminology Acquisition from
Comparable Corpora

3/23
Terminology acquisition from
comparable corpora




Comparable corpora:
“Two corpora, respectively in two languages l1 and l2, are said
to be ‘comparable’ if there exists a substantial part of the
vocabulary of the corpus in language l1 whose translation can
be found in the corpus in language l2.”
(my translation of [Déjean and Gaussier, 2002])



Advantages :


Availability



Real usages

4/23
Terminology acquisition from
comparable corpora




Terminology extraction : a contextual analysis







Compare the contexts of source and target terms
If the contexts are similar, there is a good chance that the
source and target terms are translations of each
other, e.g. :
mastectomy : reconstruction, prophylactic, treat,
undergo, removal
mastectomie : reconstruction, prophylactique,
traiter, subir, ablation
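
The contextual comparison above can be sketched in a few lines. This is a minimal illustration, not the authors' system: it assumes context vectors are simple co-occurrence counts, that a seed bilingual dictionary (`SEED`, hypothetical, built here from the word pairs on this slide) is used to translate the source context, and that similarity is measured with the cosine. All counts and the second candidate's context are made up for the example.

```python
from collections import Counter
from math import sqrt

# Hypothetical seed dictionary, taken from the paired context words on the slide.
SEED = {
    "reconstruction": "reconstruction",
    "prophylactic": "prophylactique",
    "treat": "traiter",
    "undergo": "subir",
    "removal": "ablation",
}

def translate_context(context, seed):
    """Map a source-language context vector into the target language."""
    translated = Counter()
    for word, weight in context.items():
        if word in seed:  # words missing from the seed dictionary are dropped
            translated[seed[word]] += weight
    return translated

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_candidates(source_context, target_contexts, seed):
    """Return (score, target term) pairs, best candidate first."""
    translated = translate_context(source_context, seed)
    scored = [(cosine(translated, ctx), term) for term, ctx in target_contexts.items()]
    return sorted(scored, reverse=True)

# Illustrative co-occurrence counts (invented for this sketch).
source_context = Counter(
    {"reconstruction": 3, "prophylactic": 2, "treat": 1, "undergo": 2, "removal": 1}
)
target_contexts = {
    "mastectomie": Counter(
        {"reconstruction": 2, "prophylactique": 2, "traiter": 1, "subir": 1, "ablation": 1}
    ),
    "opération": Counter({"subir": 3, "chirurgien": 2, "ablation": 1}),
}

ranking = rank_candidates(source_context, target_contexts, SEED)
for score, term in ranking:
    print(f"{score:.2f} {term}")
```

The output is a ranked list of candidates per source term, which is why the method naturally produces one-to-many alignments.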
5/23
Terminology acquisition from
comparable corpora




Outputs one-to-many alignments, e.g. for “mastectomy” :

0.92 ablation
0.89 mastectomie
0.48 opération

Evaluation : precision on the Top N best alignments

Results

Not as good as acquisition from parallel corpora !
Fung (1997) : 30 % accuracy on the Top 20
candidates
Morin et al. (2004) : translation is usually the 34th for
complex terms
6/23
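
As a rough illustration of the Top N evaluation mentioned above, the sketch below uses one common definition: the share of source terms for which at least one reference translation appears among the N best candidates. The exact measure used in the cited papers may differ. The `mastectomy` ranking comes from this slide; the reference translations and the `lumpectomy` entry are hypothetical.

```python
def top_n_precision(ranked_candidates, reference, n):
    """Fraction of source terms with a correct translation in their top n candidates."""
    if not ranked_candidates:
        return 0.0
    hits = 0
    for source, candidates in ranked_candidates.items():
        expected = reference.get(source, set())
        if any(candidate in expected for candidate in candidates[:n]):
            hits += 1
    return hits / len(ranked_candidates)

# Candidates are listed best first.
ranked_candidates = {
    "mastectomy": ["ablation", "mastectomie", "opération"],    # ranking from the slide
    "lumpectomy": ["ablation", "opération", "tumorectomie"],   # hypothetical
}
# Hypothetical reference translations.
reference = {
    "mastectomy": {"mastectomie"},
    "lumpectomy": {"tumorectomie"},
}

for n in (1, 2, 3):
    print(f"Top{n}: {top_n_precision(ranked_candidates, reference, n):.2f}")
```

With this toy data the correct translation of "mastectomy" is only ranked second, so it counts as a hit at Top 2 but not at Top 1, which is the situation a post-edition tool has to handle.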
Part II
A Tool for Post-edition

7/23
A tool for post-edition


Existing Tools :



ArayaTermExtractor (Waldhör 2006)





iView (Merkel and Foo, 2007)
Xerox Terminology Suite ®

Our needs :


Deal with one-to-many alignments



Non-aligned contexts



Allow non-binary annotation



Display useful information to help find the right
candidate in the corpus
8/23
“Useful” information
→ Knowledge that helps capture the in vivo
behavior of terms
→ Text-driven, term-oriented approach


Useful information :


Variants



Collocations



Distributional neighbors



Contexts

→ To be harvested during the term extraction /
alignment process

9/23
Useful information : example
Mastectomy

Mastectomie

risk-reducing ~
simple ~

~ préventive
~ simple

Tumorectomy
Lumpectomy
Oophorectomy

Tumorectomie
Ablation
Opération

...patient may choose to have
risk-reducing bilateral
mastectomy if they have a
strong family history of breast
cancer...

...la mastectomie préventive
pourrait supprimer la grande
majorité du risque de
développer un cancer...
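
One way to hold this kind of information in a post-edition tool is a small record per term. The sketch below is an assumption about a possible data structure, not the prototype's actual model: the class `TermRecord` and its field names are hypothetical, and the assignment of the slide's examples to collocations, neighbors and contexts is my reading of the layout.

```python
from dataclasses import dataclass, field

@dataclass
class TermRecord:
    """Text-driven information attached to one term (hypothetical structure)."""
    term: str
    lang: str
    variants: list = field(default_factory=list)
    collocations: list = field(default_factory=list)  # "~" stands for the term
    neighbors: list = field(default_factory=list)     # distributional neighbors
    contexts: list = field(default_factory=list)      # corpus excerpts

    def expand_collocations(self):
        """Replace the "~" placeholder with the lower-cased term."""
        return [c.replace("~", self.term.lower()) for c in self.collocations]

english = TermRecord(
    term="Mastectomy",
    lang="en",
    collocations=["risk-reducing ~", "simple ~"],
    neighbors=["Tumorectomy", "Lumpectomy", "Oophorectomy"],
    contexts=["...patient may choose to have risk-reducing bilateral mastectomy..."],
)
french = TermRecord(
    term="Mastectomie",
    lang="fr",
    collocations=["~ préventive", "~ simple"],
    neighbors=["Tumorectomie", "Ablation", "Opération"],
    contexts=["...la mastectomie préventive pourrait supprimer la grande majorité du risque..."],
)

print(english.expand_collocations())
print(french.expand_collocations())
```

Keeping source and candidate records side by side lets the interface show the same kinds of evidence for both terms, so the user can judge a candidate without aligned contexts.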
10/23
Post-edition interface
http://80.82.238.151/Metricc/InterfaceValidation, user “test”, no password

11/23
Part III
Data Exchange :
a TBX variant for
automatically acquired
lexicon

12/23
Quick introduction to TBX (1)





TBX : Term Base eXchange
Open, XML-based standard for exchanging
structured terminological data
approved as an international standard by LISA
and ISO (ISO 30042)



Maps to TMF data model



Subset of MARTIF



Designed for various use cases



Customizable
13/23
Quick introduction to TBX (2)


2 components :




Structure : core structure based on TMF
metamodel
Content : formalism to express data-categories
and their constraints
[Figure : the core DTD/Schema (form) is combined with an XCS file (content) to give a TBX variant : default XCS gives default TBX, XCS1 gives TBX variant 1, ..., XCSn gives TBX variant n. Adapted from ISO 30042:2008, Fig. 4, p. 30]

14/23
Quick introduction to TBX (3)


Form defined in DTD

Content defined in XCS

[Figure : data categories shown include respPerson, responsibility, reliabilityCode, partOfSpeech, corpusTrace, termType, usageNote. Taken from ISO 30042:2008, Fig. 1, p. 9]

15/23
TBX variant for lexicon acquired from
comparable corpora


Default TBX data-categories


termType : entryTerm, variant



externalCrossReference, usageNote



partOfSpeech, frequency, reliabilityCode...



transactionType, responsibility

+ Customized data-categories :


occurrences, occurrenceCount



relatedTerm



termDefinition, definitionRelevance



ntigReference
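
To give an idea of what a term entry with one-to-many candidate alignments might look like, here is a minimal sketch that builds a TBX-like fragment with Python's standard `xml.etree.ElementTree`. It is not the authors' variant, which the slides only show as images. The element names `termEntry`, `langSet`, `tig`, `term` and `termNote` follow core TBX; the `alignmentScore` note type is hypothetical and introduced only for this illustration, and the `tig` element is used instead of `ntig` for brevity.

```python
import xml.etree.ElementTree as ET

# Clark-notation name that ElementTree serializes as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def add_term(entry, lang, term, notes):
    """Add a <tig> for `term` under the <langSet> of `lang`, creating it if needed."""
    lang_set = next(
        (ls for ls in entry.findall("langSet") if ls.get(XML_LANG) == lang), None
    )
    if lang_set is None:
        lang_set = ET.SubElement(entry, "langSet", {XML_LANG: lang})
    tig = ET.SubElement(lang_set, "tig")
    ET.SubElement(tig, "term").text = term
    for note_type, value in notes.items():
        ET.SubElement(tig, "termNote", {"type": note_type}).text = str(value)
    return tig

def build_entry(entry_id, source, candidates):
    """Build one <termEntry> holding a source term and its candidate translations."""
    entry = ET.Element("termEntry", {"id": entry_id})
    add_term(entry, *source)
    for candidate in candidates:
        add_term(entry, *candidate)
    return entry

entry = build_entry(
    "c1",
    ("en", "mastectomy", {"partOfSpeech": "noun"}),
    [
        # "alignmentScore" is a hypothetical data category, not part of default TBX.
        ("fr", "ablation", {"alignmentScore": 0.92}),
        ("fr", "mastectomie", {"alignmentScore": 0.89}),
        ("fr", "opération", {"alignmentScore": 0.48}),
    ],
)
tbx_fragment = ET.tostring(entry, encoding="unicode")
print(tbx_fragment)
```

Putting every candidate in the same concept-level entry is the simplest mapping onto TBX, but it also shows the tension noted later in the talk: the entry mixes a term's own properties with uncertain alignment information.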

16/23
TBX variant : A term entry

17/23
TBX variant : 1-to-n alignments

18/23
TBX variant : approved alignment

19/23
Feed-back on TBX
TBX is made for stable terminologies with little
uncertainty about the status of translations, not for
machine-generated lexicons of “candidate
translations” :



difficult to separate a term and its properties from its
alignments



no data category specific to automatically estimated
reliability





Difficult to make text-driven, term-oriented
knowledge fit in a concept-oriented format


no definition category that would apply to a single term
and not the whole concept
Conclusion
Future work

21/23
Future work


Integration of prototype in Libellex


TBX import / export



edition of linguistic properties



User testing (ergonomics)



Evaluation of added-value for translation



Explore new ways of :


aligning terms



selecting contexts
22/23
References


Post-edition prototype online : http://80.82.238.151/Metricc/InterfaceValidation/, user “test”, no password



Metricc project : http://www.metricc.com/



Lingua et Machina : http://www.lingua-et-machina.com/



Comparable corpora : Déjean, H., Gaussier, É. (2002) : “Une nouvelle approche à
l'extraction de lexiques bilingues à partir de corpus comparables”, In Lexicometrica,
Alignement Lexical dans les corpus multilingues, pp.1-22.



ArayaTermExtractor : http://www.heartsome.de



Xerox Terminology Suite : http://www.temis.com/









iView : Nyström, M., Merkel, M., Ahrenberg, L., Zweigenbaum, P., Petersson, H. and
Åhlfeldt, H. (2006) : “Creating a medical English-Swedish dictionary using interactive word
alignment”, In BMC Medical Informatics and Decision Making, 2006, pp. 6-35

TMF : ISO 16642 - Terminological markup framework

TBX : ISO 30042 - Systems to manage terminology, knowledge and content - TermBase
eXchange (TBX)

Data categories : ISO 12620 - Terminology and other language and content resources - Specification of data categories and management of a Data Category Registry for language
resources
