4th intensive summer school on Natural Language Processing
Bilingual Terminology Mining
Estelle Delpech
30th November, 2010
Published: material of the 4th Intensive Summer School and collaborative workshop on Natural Language Processing (NAIST Franco-Thai Workshop 2010), Bangkok, Thailand.

  1. 4th intensive summer school on Natural Language Processing
     Bilingual Terminology Mining
     Estelle Delpech
     30th November, 2010
  2. About me
     ● Estelle Delpech
     ● Research engineer at Lingua et Machina, France
       ● CAT tools provider
       ● ed(at)lingua-et-machina(dot)com
       ● www.lingua-et-machina.com
     ● Ph.D. candidate at LINA, France
       ● TALN team: specialises in NLP
       ● estelle.delpech(at)univ-nantes(dot)fr
  3. Presentation outline
     ● About terms, terminology, terminology mining
     ● Term Extraction
     ● Term Alignment
  4. Presentation outline
     ● About terms, terminology, terminology mining
     ● Term Extraction
     ● Term Alignment
  5. What is a term?
     ● Classical definition: "unequivocal expression of a concept within a technical domain"
     ● Traces back to 1930: Eugen Wüster's "General Theory of Terminology"
     ● Specialized language is / should be unambiguous
     [Figure: Ogden's semiotic triangle linking concept, term and referent]
  6. What is a term?
     ● Classical terminology was challenged in the 1990s by:
       ● sociolinguistics
       ● corpus-based linguistics
       ● computational terminology
     ● Observing terms in texts shows that:
       ● there is variation and polysemy
       ● concepts evolve over time
       ● there is no clear-cut border between specialized and general languages
  7. What is a term?
     ● The definition of "term" depends on the application / audience of the terminology
     ● Domain expert: a unit of knowledge
     ● Information retrieval: descriptors for indexing
     ● Translation: a word or phrase that
       ● is not part of general language
       ● translates differently in a particular domain
     ● A term can be:
       ● a noun, adjective or verb
       ● a noun phrase, verb phrase, etc.
  8. What is a terminology?
     ● A set of terms + terminological records
     ● A terminological record holds:
       ● part of speech
       ● frequency
       ● variants
       ● contexts
     ● Relations between terms / concepts:
       ● hypernymy: a cat is a sort of animal
       ● meronymy: the head is part of the body
     ● Bilingual terminology adds translation relations
  9. http://www.termiumplus.gc.ca/
  10. Where do you find terms?
      ● In specialized texts:
        ● research papers on breast cancer
        ● plane crash reports
      ● Corpus building: it is important to gather texts from a well-defined domain / theme
  11. Bilingual terminology mining (1)
      [Pipeline: specialized texts → term extraction (data mining) → terms → term alignment → bilingual terminology → terminology management software]
  12. Bilingual terminology mining (2)
      [Pipeline: specialized texts → synchronized term extraction and alignment → terms → bilingual terminology → terminology management software]
  13. Presentation outline
      ● About terms, terminology, terminology mining
      ● Term Extraction
      ● Term Alignment
  14. Term extraction: a semi-supervised process (L'Homme, 2004)
      ● The notion of term is "slippery": the same lexical unit may or may not be considered a term depending on:
        ● audience
        ● domain
        ● application
      ● Term extractors therefore extract candidate terms that:
        ● are frequent in texts of a given domain, e.g. "HER2 gene"
        ● look like terms (well-formed phrases), e.g. "human cell lines"
        ● are groups of words that frequently occur together, e.g. "to compile a program"
  15. Term extraction: a semi-supervised, lexico-semantic process
      [Pipeline: specialized texts → term extractor → candidate terms → manual selection → terms → terminology; candidate terms may also feed automatic indexing; terms map to concepts]
  16. Termhood clues (1): frequency (L'Homme, 2004)
      ● A term occurs frequently in specialized texts
        ● the higher, the better?
      ● Comparison with general language: does the term occur more frequently than expected in general language?
      ● Compute significance tests, e.g. chi-square (χ²)
  17. Termhood clues (2): form
      ● A term is a well-formed phrase
        ● "...HER2/neu oncogenes are members of..."
      ● Match morpho-syntactic patterns, e.g. NOUN + NOUN
      ● Many patterns exist:
        ● NOUN PREP DET NOUN: "alternation of the gene"
        ● NOUN PREP NOUN COORD ADJ NOUN: "susceptibility to breast and ovarian cancer"
        ● NOUN NOUN NOUN NOUN NOUN: "human breast cancer cell lines"
  18. Termhood clues (2): form
      ● Preprocessing:
        ● tokenization
        ● lemmatisation
        ● POS tagging
      ● Example: "... HER-2/neu oncogenes are members of ..."
        tokens: HER-2/neu | oncogenes | are | members | of
        POS:    NOUN | NOUN | VERB | NOUN | PREP
        lemmas: HER-2/neu | oncogene | be | member | of
  19. Identification of syntactic patterns
      ● Patterns are expressed as regular expressions / finite-state automata,
        e.g. an automaton matching NOUN (PREP? NOUN)?
      ● Matches:
        ● NOUN: "gene"
        ● NOUN NOUN: "HER2 gene"
        ● NOUN PREP NOUN: "member of family"
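The pattern identification above can be sketched as a regular expression over a string of POS tags. This is a minimal illustration, assuming a toy tagset and a hand-tagged example sentence rather than the output of a real tagger:

```python
import re

def extract_candidates(tagged):
    """tagged: list of (lemma, pos) pairs; returns lemma sequences matching
    the pattern NOUN (PREP? NOUN)? from the slide's automaton."""
    tags = " ".join(pos for _, pos in tagged)
    lemmas = [lemma for lemma, _ in tagged]
    candidates = []
    for m in re.finditer(r"NOUN(?: (?:PREP )?NOUN)?", tags):
        # map character offsets in the tag string back to token indices
        start = tags[:m.start()].count(" ")
        length = m.group().count(" ") + 1
        candidates.append(" ".join(lemmas[start:start + length]))
    return candidates

tagged = [("HER-2/neu", "NOUN"), ("oncogene", "NOUN"),
          ("be", "VERB"), ("member", "NOUN"), ("of", "PREP"),
          ("family", "NOUN")]
print(extract_candidates(tagged))  # → ['HER-2/neu oncogene', 'member of family']
```

Production extractors typically compile such patterns into finite-state automata rather than scanning a tag string, but the matching logic is the same.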
  20. Termhood clue (3): word association
      ● Significant cooccurrences are good clues for termhood:
        ● "... breast cancer ..."
        ● "... breast remains ..."
        ● "... alternative cancer ..."
      ● Must take into account:
        ● the number of times the two words cooccur
        ● the number of times word A occurs
        ● the number of times word B occurs
  21. A measure of cooccurrence significance (Church and Hanks, 1990; L'Homme, 2004)
      ● Mutual Information:
        MI(a, b) = log2( P(a, b) / (P(a) · P(b)) )
        P(a, b) = nbocc(a, b) / N
        P(a) = nbocc(a) / N
        N = total number of words in the corpus
      ● Example:
        "invasive carcinoma": 20 cooccurrences, invasive: 30, carcinoma: 20 → MI = 9.7
        "cancer means": 50 cooccurrences, cancer: 800, means: 800 → MI = 1.69
      ● Remarkable attraction between "invasive" and "carcinoma" despite the relatively low number of cooccurrences
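The MI formula above is straightforward to compute from raw counts. The slides do not give the corpus size N; with an assumed N = 25,000 the slide's value of 9.7 for "invasive carcinoma" is reproduced:

```python
from math import log2

def mutual_information(cooc_ab, occ_a, occ_b, n):
    """MI(a,b) = log2(P(a,b) / (P(a)P(b))) with probabilities estimated
    from raw counts; n is the total number of words in the corpus."""
    p_ab = cooc_ab / n
    p_a = occ_a / n
    p_b = occ_b / n
    return log2(p_ab / (p_a * p_b))

# "invasive carcinoma": 20 cooccurrences, invasive 30, carcinoma 20;
# N = 25000 is an assumed corpus size chosen for illustration
print(round(mutual_information(20, 30, 20, 25000), 1))  # → 9.7
```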
  22. Presentation outline
      ● About terms, terminology, terminology mining
      ● Term Extraction
      ● Term Alignment
  23. Presentation outline
      ● About terms, terminology, terminology mining
      ● Term Extraction
      ● Term Alignment
        ● in parallel corpora
        ● in comparable corpora
  24. Parallel and comparable corpora
      ● Parallel corpora
        ● source and target texts are translations
        ● reduce the search space little by little: first sentences, then terms
      ● Comparable corpora
        ● not translations, but very similar in topic
        ● contain a good proportion of term translations
        ● search space: all terms of the target corpus
  25. Sentence alignment (1)
      ● Gale and Church (1993)'s hypothesis:
        ● translated sentences have roughly the same length
        ● the probability P(S, T) that sentence S translates into T is based on the length difference
      ● Improvement: use a seed lexicon
        ● P(S, T) is then based on the number of words in common
  26. Sentence alignment (2) (Gale and Church, 1993)
      ● Compute probabilities for all pairs (S, T)
      ● Build a matrix where M(i, j) contains the probability that sentence i translates to sentence j:

               0     1     2    ...    n
         0    0.89  0.56  0.2   ...
         1    0.45  0.9   0.1   ...
         2    ...   0.23  0.9   0.3   ...
         ...  ...   ...   0.44  0.76  ...
         m    ...   ...   ...   ...   0.88
  27. Sentence alignment (2) (Gale and Church, 1993)
      ● Use dynamic programming to find the best "path" through the matrix, i.e. the best alignments
      [Same probability matrix as on the previous slide, with the best path highlighted]
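The dynamic-programming pass can be sketched as follows. This is a simplified toy in the spirit of Gale and Church (1993): it only considers 1-1 matches plus sentence skips with an assumed flat penalty, whereas the real algorithm also handles 1-2 and 2-1 merges and uses length-based costs:

```python
from math import log

def best_path(M, skip_penalty=-2.0):
    """Find the highest-scoring monotone path through a probability
    matrix M[i][j], maximizing the sum of log-probs of 1-1 matches."""
    m, n = len(M), len(M[0])
    best = [[float("-inf")] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i and j and best[i-1][j-1] + log(M[i-1][j-1]) > best[i][j]:
                best[i][j] = best[i-1][j-1] + log(M[i-1][j-1])
                back[i][j] = (i-1, j-1)
            if i and best[i-1][j] + skip_penalty > best[i][j]:
                best[i][j] = best[i-1][j] + skip_penalty
                back[i][j] = (i-1, j)
            if j and best[i][j-1] + skip_penalty > best[i][j]:
                best[i][j] = best[i][j-1] + skip_penalty
                back[i][j] = (i, j-1)
    # backtrack, collecting the 1-1 alignments on the best path
    pairs, cell = [], (m, n)
    while back[cell[0]][cell[1]] is not None:
        prev = back[cell[0]][cell[1]]
        if prev == (cell[0] - 1, cell[1] - 1):
            pairs.append(prev)
        cell = prev
    return list(reversed(pairs))

M = [[0.89, 0.56, 0.2],
     [0.45, 0.9, 0.1],
     [0.23, 0.3, 0.9]]
print(best_path(M))  # → [(0, 0), (1, 1), (2, 2)]
```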
  28. Sub-sentence alignment: AnyMalign (Lardilleux et al., 2010)
      ● AnyMalign is a sub-sentential aligner
        ● aligns words and groups of words for MT translation tables
      ● Aligned groups of words:
        ● behave more or less like statistical collocations
        ● it is possible to find term patterns in these groups of words
  29. AnyMalign (Lardilleux et al., 2010)
      ● The algorithm is based on "perfect alignments": words or groups of words that occur in exactly the same aligned sentences
      ● Example corpus:
        a d ↔ A D
        b ↔ B
        b ↔ C
        a e ↔ A D D
        Here a ↔ A is a perfect alignment ("a" and "A" occur in exactly the same sentence pairs)
  30. AnyMalign (Lardilleux et al., 2010)
      ● How to get more "perfect alignments"? With smaller corpora
      ● How to get smaller corpora? Randomly select subcorpora from your corpus:
        subcorpus 1 yields b ↔ B
        subcorpus 2 yields a ↔ A
  31. AnyMalign (Lardilleux et al., 2010)
      ● Complements of perfect alignments are likely to be good alignments too:
        corpus: a d ↔ A D, b ↔ B, b ↔ C, a e ↔ A D D
        perfect alignment: a ↔ A
        complements: d ↔ D, e ↔ D D
  32. AnyMalign (Lardilleux et al., 2010)
      ● Process: iteratively extract random samples of random size from your corpus
      ● Extract "perfect alignments" and their complements
      ● The same alignment can occur several times
      ● Count, for each alignment, the number of times it occurs
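The sampling-and-counting loop above can be sketched in a few lines. This is a much-simplified sketch of the AnyMalign idea: it only pairs single words with identical occurrence profiles in each random sample and tallies the pairs; the real system also extracts complements, handles groups of words, and weights by sample size:

```python
import random
from collections import Counter, defaultdict

def perfect_alignments(pairs):
    """Return (src_word, tgt_word) pairs whose occurrence profiles over
    the given aligned sentence pairs are identical."""
    profile = defaultdict(set)  # word -> set of sentence indices
    for idx, (src, tgt) in enumerate(pairs):
        for w in src.split():
            profile[("src", w)].add(idx)
        for w in tgt.split():
            profile[("tgt", w)].add(idx)
    found = []
    for (side_a, a), occ_a in profile.items():
        if side_a != "src":
            continue
        for (side_b, b), occ_b in profile.items():
            if side_b == "tgt" and occ_a == occ_b:
                found.append((a, b))
    return found

def anymalign_sketch(corpus, iterations=200, seed=0):
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(iterations):
        k = rng.randint(1, len(corpus))   # random sample size
        sample = rng.sample(corpus, k)    # random subcorpus
        counts.update(perfect_alignments(sample))
    return counts.most_common()

# the slide's toy corpus
corpus = [("a d", "A D"), ("b", "B"), ("b", "C"), ("a e", "A D D")]
for (src, tgt), c in anymalign_sketch(corpus)[:3]:
    print(src, "<->", tgt, c)
```

Genuinely good pairs such as a ↔ A are produced by many different samples and so accumulate high counts, while spurious pairs only appear in a few unlucky samples.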
  33. AnyMalign (Lardilleux et al., 2010)
      ● Output: alignments sorted by descending number of occurrences
      ● Alignment probability:
        P(S | T) = C(S, T) / C(T)
        where S = source group of words, T = target group of words,
        C(S, T) = number of times S was aligned with T,
        C(T) = number of times T appears in an alignment
  34. AnyMalign (Lardilleux et al., 2010)
      ● Advantages:
        ● can perform alignment with more than 2 languages at the same time
        ● with 1 language, it yields statistical collocations
        ● extracts and aligns non-contiguous sequences of words, e.g. "to give something up" ↔ "to let someone down"
        ● no a priori expectations on terms
          ● sometimes a term in the source language is not translated by a term
          ● terms = what you can align
  35. AnyMalign (Lardilleux et al., 2010)
      ● Word groups are not grammatical phrases: it aligns "that sample sentences and" or "exchange format fitted for the", but not "sample sentences" or "exchange format"
      ● Solutions:
        ● find term patterns
        ● use heuristics
        ● trim stop words
  36. Presentation outline
      ● About terms, terminology, terminology mining
      ● Term Extraction
      ● Term Alignment
        ● in parallel corpora
        ● in comparable corpora
  37. Advantages of comparable corpora
      ● More available:
        ● new languages
        ● new language pairs
        ● new topics / domains
      ● Less expensive to build
      ● More natural:
        ● the data was produced spontaneously
        ● no influence from a source text
  38. Contextual approach
      ● Based on distributional linguistics (Z. Harris): words with similar meanings appear in similar contexts
      ● If source and target words have similar contexts, they might be translations:
        ● compute contexts for each source and target word
        ● compare contexts
        ● find the most similar contexts
  39. Contextual approach
      ● Represent the context of a given word with a vector: head word + collocates
      ● The vector associates the "head" word with its most frequent collocates, plus some indication of the strength of association between head word and collocates
      [Figure: context vector linking "drink" to the collocates mouth, beer, water, glass]
  40. Building the context vector for "drink"
      ● Collocates: words occurring at a distance of up to n words from the head
      ● Concordance snippets:
        "... variety of reasons to drink plenty of water each day ..."
        "... simple as a glass of drinking water be the key to the ..."
        "... popular in Japan today to drink water from glass after waking ..."
      ● Counts:
        (drink, water) = 3
        (drink, glass) = 2
        (drink, Japan) = 1
        (drink, reason) = 1
        (drink, plenty) = 1
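Collocate counting within a window can be sketched as below. The toy sentences are illustrative; note that a real pipeline would lemmatize first, so "drinking" would also count as an occurrence of "drink", while this sketch only matches the exact form:

```python
from collections import Counter

def context_vector(head, sentences, window=3):
    """Count collocates of `head` within `window` words on either side."""
    vec = Counter()
    for sent in sentences:
        tokens = sent.lower().split()
        for i, tok in enumerate(tokens):
            if tok == head:
                lo = max(0, i - window)
                hi = min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        vec[tokens[j]] += 1
    return vec

sents = ["reasons to drink plenty of water each day",
         "a glass of drinking water",
         "to drink water from glass after waking"]
v = context_vector("drink", sents)
print(v["water"], v["glass"])  # → 2 1
```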
  41. Normalized cooccurrence frequency (Dunning, 1993)
      ● Normalization: use a measure like MI or the log-likelihood ratio to counteract the influence of high-frequency words
      ● Example (log-likelihood ratio): 1000 cooccurrences in the corpus, (drink, x) = 75, (water, y) = 75, (drink, water) = 25

                  water   ¬ water
        drink      25       50      75
        ¬ drink    50      875     925
                   75      925    1000
  42. Log-likelihood ratio (Dunning, 1993)
      ● Contingency table with cells a, b, c, d, row totals e, h, column totals f, g, and grand total N:

                  water   ¬ water
        drink      a        b       e
        ¬ drink    c        d       h
                   f        g       N

      ● log-likelihood ratio(water, drink) =
        a·log a + b·log b + c·log c + d·log d + N·log N
        − e·log e − f·log f − g·log g − h·log h
      ● log-likelihood ratio(drink, water) = 45.05
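The slide's formula translates directly into code. A minimal sketch using the stated counts ((drink, water) = 25, margins 75/75, N = 1000); note that the absolute value of the score depends on the log base and on table conventions, so it need not match the figure reported on the slide:

```python
from math import log

def llr(a, b, c, d):
    """Log-likelihood ratio exactly as written on the slide:
    cells a,b,c,d; row totals e,h; column totals f,g; total n."""
    e, h = a + b, c + d   # row totals
    f, g = a + c, b + d   # column totals
    n = a + b + c + d
    return (sum(x * log(x) for x in (a, b, c, d, n))
            - sum(x * log(x) for x in (e, f, g, h)))

# (drink, water) table from the example: a=25, b=50, c=50, d=875
print(round(llr(25, 50, 50, 875), 2))
```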
  43. Context vector comparison (Rapp, 1995; Fung, 1997)
      ● Compute context vectors for words in the source and target corpora
      ● How do we compare word contexts across different languages?
      [Figure: English context vector for "drink" (mouth, beer, water, glass) next to a Thai context vector with Thai collocates]
  44. Context vector comparison (Rapp, 1995; Fung, 1997)
      ● Use a seed lexicon to map collocates
      [Figure: a Thai-English seed lexicon maps the Thai collocates onto the English ones (mouth, beer, water, glass)]
  45. Context vector comparison (Rapp, 1995; Fung, 1997)
      ● Measuring the context similarity of words a and b = measuring the cosine of the angle between the vector of a and the vector of b:

        cos(a, b) = Σ_{c ∈ a ∪ b} w(c, a) · w(c, b)
                    / sqrt( Σ_c w(c, a)² · Σ_c w(c, b)² )

        where c ∈ x is a collocate in the vector of x and w(c, x) is the weight of association of collocate c with head x
      ● Select the top 1, 10 or 20 closest words as candidate translations
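The comparison step, translating the source vector's collocates through a seed lexicon and scoring target words by cosine similarity, can be sketched as follows. The tiny English-French lexicon and vectors are invented for illustration (the slides use Thai):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in set(u) | set(v))
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def translate_vector(vec, lexicon):
    """Map collocates via the seed lexicon; drop those it doesn't cover."""
    return {lexicon[w]: weight for w, weight in vec.items() if w in lexicon}

src_vec = {"water": 3, "glass": 2, "mouth": 1}   # context vector of "drink"
lexicon = {"water": "eau", "glass": "verre", "mouth": "bouche"}
tgt_vecs = {"boire": {"eau": 4, "verre": 2, "bouche": 1},
            "manger": {"bouche": 3, "assiette": 2}}

mapped = translate_vector(src_vec, lexicon)
ranking = sorted(tgt_vecs, key=lambda t: cosine(mapped, tgt_vecs[t]),
                 reverse=True)
print(ranking[0])  # → boire (the top candidate translation of "drink")
```

Real systems rank all words of the target corpus this way and keep the top 1, 10 or 20 as candidate translations.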
  46. Contextual approach: improvements
      ● Use syntactic collocates
      ● Improve the dictionary with cognates, transliterations, other dictionaries
      ● Give more weight to "anchor words":
        ● cognates, transliterations
        ● frequent, monosemous words
      ● Filter with part of speech
      ● Favor reciprocal translations
      [Figure: source words a, b, c, d mapped to target words a', b', c', d']
      References: Chiao and Zweigenbaum, 2002; Sadat et al., 2003; Gamallo and Campos, 2005; Koehn and Knight, 2002; Prochasson, 2010
  47. Variant to direct translation of the vector (Déjean and Gaussier, 2002)
      ● "Interlingual" translation: translate the n closest words instead of the context vector
      ● Seed lexicon: some mappings between source and target words
      [Figure: source and target spaces connected by the seed lexicon]
  48. Variant to direct translation of the vector (Déjean and Gaussier, 2002)
      ● To translate a term T: find its n closest words
      ● These closest words are in the seed lexicon
      [Figure: T's closest source words are located in the seed lexicon]
  49. Variant to direct translation of the vector (Déjean and Gaussier, 2002)
      ● Find the target term which is the closest to the n closest words
      [Figure: the lexicon entries point to candidate target terms]
  50. Variant to direct translation of the vector (Déjean and Gaussier, 2002)
      ● "Interlingual" approach: translate the closest words instead of the direct context
      [Figure: overview of the interlingual mapping from source to target]
  51. Adaptation to multi-word terms (Morin et al., 2004; Morin and Daille, 2009)
      ● Context vector of a multi-word term = union of the vectors of each word of the term
      [Figure: the vectors of "energy" (strong, beer, glass, ...) and "drink" (mouth, beer, water, glass, ...) merge into one vector for "energy drink"]
  52. Evaluation (Morin and Daille, 2010)
      ● Precision on the top N candidates, e.g. "50% on Top20" = the correct translation is among the 20 best candidates for 50% of the source terms
      ● Results:
        ● single-word units, big general-language corpus: 80%
        ● multi-word units, small specialized corpus: 60%
        ● multi-word terms, small specialized corpus: 42%
      ● big = hundreds of millions of words; small = 100 thousand to one million words
  53. Why is it so difficult?
      ● the translation might not be present
      ● the target term has not been extracted
      ● polysemous words: non-discriminative, fuzzy vectors
      ● low-frequency words: insignificant vectors
      ● the translation has a different usage in the target language
      ● big search space: all the words of the target corpus
      → it cannot be fully automatic → semi-supervised term alignment
  54. 4th Franco-Thai Workshop 2010, intensive summer school on Natural Language Processing
      Thank you
      ed(at)lingua-et-machina.com
