Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking
Material presented at the 24th International Conference on Computational Linguistics (COLING 2012), Mumbai, India.
Paper download at http://hal.archives-ouvertes.fr/hal-00743807.
Institutions: Laboratoire d'Informatique de Nantes Atlantique (LINA), Lingua et Machina, Gremuts.

Transcript

  • 1. Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking. Estelle Delpech¹, Béatrice Daille¹, Emmanuel Morin¹, Claire Lemaire²,³. ¹ LINA, Université de Nantes; ² GREMUTS, Université de Grenoble; ³ Lingua et Machina. COLING'12, 10/12/12, Mumbai, India
  • 2. Outline 1 Context 2 Translation method 3 Ranking method 4 Results of experiments 5 Future work
  • 6. Context: comparable corpora for Computer-Aided Translation
      Aim: provide domain-specific bilingual lexicons to translators when no parallel data is available
      ⇒ Comparable corpora: sets of texts in languages L1 and L2 that are not translations of each other but deal with the same subject matter, so translation pairs can still be extracted
  • 14. Motivations for compositional translation
      Usual context-based methods [Fung, 1997]:
        51% to 88% precision on the top 20 candidates with specialized corpora [Daille and Morin, 2005]
        ⇒ lexicons difficult for translators to use [Delpech, 2011]
      Compositional translation:
        81% to 94% precision on Top1 [Robitaille et al., 2006, Cartoni, 2009, Morin and Daille, 2009]
        More than 60% of terms in technical and scientific domains are morphologically complex [Namer and Baud, 2007]
        Outperforms context-based approaches for the translation of terms with compositional meaning [Morin and Daille, 2009]
  • 21. Compositional translation
      Compositionality: “the meaning of the whole is a function of the meaning of the parts” [Keenan and Faltz, 1985, 24-25]
      Input: "ab"
        Decompose → {a, b}
        Translate → {α, β}
        Reorder → {αβ, βα}
        Select → αβ
      Output: "αβ"
  • 25. Related work
      Applied to phrases, decomposed into words [Robitaille et al., 2006, Morin and Daille, 2009]
        rate of evaporation → taux d'évaporation
      Applied to words, decomposed into morphemes [Cartoni, 2009, Harastani et al., 2012]
        cardiology → cardiologie
        ricostruire → rebuild
      ⇒ No approach links bound morphemes to words:
        -cyto- → cellule 'cell'
        cytotoxic → toxique pour les cellules 'toxic to the cells'
  • 32. Selection and ranking methods
      Select translations that occur in target texts / Web [Morin and Daille, 2009]
      Select most frequent translation [Grefenstette, 1999]
      Compare contexts [Garera and Yarowsky, 2008]
      ML: binary classifier [Baldwin and Tanaka, 2004]
      ⇒ Combination of criteria
      ⇒ ML: learning-to-rank algorithms (IR)
  • 33. Outline 1 Context 2 Translation method 3 Ranking method 4 Results of experiments 5 Future work
  • 46. Translation process overview
      Input: "non-cytotoxic"
      Decompose: {non, cyto, toxic}
      Concatenate: {non, cyto, toxic}, {noncyto, toxic}, {non, cytotoxic}, {noncytotoxic}
      Translate: {non, cellule, toxique}, {non, cyto, toxique}, {non, cellule, toxicité}, {non, cyto, toxicité}
      Reorder: {non, toxique, cellule}, {non, cellule, toxique}, {cellule, toxique, non}
      Concatenate: {non, toxique, cellule}, {nontoxique, cellule}, {non, toxiquecellule}, {nontoxiquecellule}
      Match: {non, toxique, cellule}
      Output: "non toxique pour les cellules" 'non toxic to the cells'
  • 51. Decomposition
      non-cytotoxic → {non, cyto, toxic}
      Split source term into minimal components with heuristic rules:
        split on hyphens
        match substrings of the source term with:
          a list of morphemes
          a list of lexical items
        respect some length constraints on the substrings
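The decomposition heuristics on this slide can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `min_len=3` threshold, and the choice to prefer the split with the most components are assumptions made for illustration, and the morpheme and lexicon lists are toy data.

```python
def segment(chunk, known, min_len=3):
    """Return the finest split of `chunk` into known substrings, or None."""
    if not chunk:
        return []
    best = None
    for end in range(min_len, len(chunk) + 1):
        head = chunk[:end]
        if head not in known:
            continue
        tail = segment(chunk[end:], known, min_len)
        # Prefer the segmentation with the most (i.e. smallest) components.
        if tail is not None and (best is None or len(tail) + 1 > len(best)):
            best = [head] + tail
    return best


def decompose(term, morphemes, lexicon, min_len=3):
    """Split on hyphens, then match substrings against morphemes and lexical items."""
    known = set(morphemes) | set(lexicon)
    components = []
    for chunk in term.lower().split("-"):
        parts = segment(chunk, known, min_len)
        # Hypothetical fallback: keep an unanalysable chunk as a single component.
        components.extend(parts if parts is not None else [chunk])
    return components


print(decompose("non-cytotoxic", {"non", "cyto"}, {"toxic", "cytotoxic"}))
```

With these toy lists, "cytotoxic" is split into "cyto" and "toxic" even though it is itself a lexical item, because the sketch looks for minimal components.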
  • 54. Concatenation
      Generate all possible concatenations of the minimal components
      Increases the chances of matching the components with entries of the dictionaries
        {non, cyto, toxic} → {non, cyto, ∅}
        {non, cytotoxic} → {non, cytotoxique}
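One way to read "all possible concatenations" is: every way of merging adjacent components while keeping their order, which gives 2^(n-1) groupings for n components. The sketch below follows that reading; the function name is hypothetical.

```python
from itertools import product


def concatenations(components):
    """All ways of merging adjacent components, keeping their order."""
    if not components:
        return []
    results = []
    # One boolean per boundary: True means "glue this component to the previous one".
    for joins in product([False, True], repeat=len(components) - 1):
        merged = [components[0]]
        for comp, join in zip(components[1:], joins):
            if join:
                merged[-1] += comp
            else:
                merged.append(comp)
        results.append(merged)
    return results


for grouping in concatenations(["non", "cyto", "toxic"]):
    print(grouping)
```

The same function could serve the target-side concatenation step, where unmerged components remain separate words.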
  • 58. Translation with direct dictionary look-up
      Bilingual dictionary for lexical items: toxic → toxique
      Morpheme translation table for bound morphemes: allows bound-to-free morpheme translation equivalence
        -cyto- → -cyto-, cellule
        {-cyto-, toxic} → {-cyto-, toxique}, {cellule, toxique}
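The look-up step amounts to taking the Cartesian product of each component's translations. The following sketch assumes both resources are plain Python dictionaries mapping a source string to a list of target strings, and writes bound morphemes without their hyphens; these representations are illustrative assumptions, not the format used by the authors.

```python
from itertools import product


def translate(components, dictionary, morpheme_table):
    """Combine the translations of every component; empty if one has none."""
    options = []
    for comp in components:
        candidates = dictionary.get(comp, []) + morpheme_table.get(comp, [])
        if not candidates:
            return []  # one untranslatable component blocks this grouping
        options.append(candidates)
    return [list(combo) for combo in product(*options)]


dictionary = {"toxic": ["toxique"]}
morpheme_table = {"cyto": ["cyto", "cellule"]}
print(translate(["cyto", "toxic"], dictionary, morpheme_table))
```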
  • 62. Translation with variation
      Morphological lexicon: toxic → toxique → toxicité 'toxicity'
      Synonyms: toxic → toxique → vénéneux 'poisonous'
      {-cyto-, toxic} → {-cyto-, toxicité}, {-cyto-, vénéneux}, {cellule, toxicité}, {cellule, vénéneux}
  • 65. Reordering
      No translation patterns or reordering rules
      Permute the translated components:
        {cellule, toxique} → {cellule, toxique}, {toxique, cellule}
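Because no reordering rules are used, the reordering step can be sketched as plain enumeration of permutations. The function name is hypothetical; note that the number of orderings grows as n! with the number of components.

```python
from itertools import permutations


def reorder(components):
    """Every ordering of the translated components (n! of them)."""
    return [list(p) for p in permutations(components)]


print(reorder(["cellule", "toxique"]))
```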
  • 67. Concatenation
      Recreate target words by generating all possible concatenations of the components:
        {toxique, cellule} → {toxique cellule}, {toxiquecellule}
  • 71. Selection
      Match target words with the words of the target corpus
      Allow at most 3 stop words between two words
        {toxique cellule} → "toxique pour les cellules" 'toxic to the cells'
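The matching rule can be sketched as a scan over the target corpus that tolerates up to three stop words between consecutive candidate words. The sketch assumes the corpus is available as (surface form, lemma) pairs and that matching is done on lemmas, which would explain how "cellule" matches "cellules"; the slide does not say how this is handled, so treat that as an assumption, along with the function name and the toy stop-word list.

```python
def find_matches(candidate, corpus, stop_words, max_gap=3):
    """Return surface strings where `candidate` lemmas occur in order,
    separated by at most `max_gap` stop words."""
    matches = []
    for start in range(len(corpus)):
        pos, ok = start, True
        for k, word in enumerate(candidate):
            if k > 0:
                gap = 0
                # Skip up to max_gap stop words before the next candidate word.
                while (pos < len(corpus) and gap < max_gap
                       and corpus[pos][1] in stop_words
                       and corpus[pos][1] != word):
                    pos += 1
                    gap += 1
            if pos >= len(corpus) or corpus[pos][1] != word:
                ok = False
                break
            pos += 1
        if ok:
            matches.append(" ".join(surface for surface, _ in corpus[start:pos]))
    return matches


corpus = [("non", "non"), ("toxique", "toxique"), ("pour", "pour"),
          ("les", "le"), ("cellules", "cellule")]
print(find_matches(["toxique", "cellule"], corpus, {"pour", "le", "de"}))
```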
  • 72. Outline 1 Context 2 Translation method 3 Ranking method 4 Results of experiments 5 Future work
  • 74. Target term frequency
      Number of occurrences of the target term divided by the total number of occurrences in the target texts
      $\mathrm{Freq}(t) = \frac{\mathrm{occ}(t)}{N}$
  • 79. Context similarity measure
      Corresponds to context-based approaches
      Collect words co-occurring with the source and target terms in a window of 5 words
      Normalize co-occurrences with the log-likelihood ratio
      Compare contexts with weighted Jaccard
      $\mathrm{Cont}(s, t) = \frac{\sum_{w \in s \cap t} \min(c(s, w), c(t, w))}{\sum_{w \in s \cup t} \max(c(s, w), c(t, w))}$
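A small sketch of the weighted Jaccard comparison follows. It assumes each context is a dictionary from context word to a non-negative association weight, and that both vectors are already expressed in the same vocabulary (for example after translating the source context words); neither point is spelled out on the slide, and the weights are made up.

```python
def weighted_jaccard(ctx_s, ctx_t):
    """Sum of min weights over shared words divided by sum of max weights over all words."""
    shared = ctx_s.keys() & ctx_t.keys()
    union = ctx_s.keys() | ctx_t.keys()
    numerator = sum(min(ctx_s[w], ctx_t[w]) for w in shared)
    denominator = sum(max(ctx_s.get(w, 0.0), ctx_t.get(w, 0.0)) for w in union)
    return numerator / denominator if denominator else 0.0


ctx_source = {"tumeur": 2.0, "dose": 1.0}      # illustrative weights only
ctx_target = {"tumeur": 1.0, "patient": 3.0}
print(weighted_jaccard(ctx_source, ctx_target))
```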
  • 82. Part-of-speech translation probability
      Probability that a source term with part-of-speech A translates to a target term with part-of-speech B
      $\mathrm{Pos}(s, t) = P(\mathrm{pos}(t) \mid \mathrm{pos}(s)) = P(B \mid A)$
      Acquired from POS-tagged parallel corpora [Tiedemann, 2009] with word alignment software AnyMalign [Lardrilleux, 2008]
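One plausible way to obtain P(B|A) from word-aligned, POS-tagged data is a relative-frequency estimate over aligned tag pairs. The slide does not state the estimator, so the sketch below is an assumption, and the tag names and counts are toy data.

```python
from collections import Counter


def pos_translation_probs(aligned_pos_pairs):
    """Relative-frequency estimate of P(target POS | source POS)."""
    pair_counts = Counter(aligned_pos_pairs)
    source_counts = Counter(src for src, _ in aligned_pos_pairs)
    return {(src, tgt): n / source_counts[src]
            for (src, tgt), n in pair_counts.items()}


pairs = [("ADJ", "ADJ"), ("ADJ", "ADJ"), ("ADJ", "NOUN"), ("NOUN", "NOUN")]
probs = pos_translation_probs(pairs)
print(probs[("ADJ", "NOUN")])
```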
  • 86. Resources reliability score
      Some translation resources might give more reliable translations than others
        e.g. bilingual dictionary > synonyms
      Score = mean of the reliability of the resources used for translating the components
      $\mathrm{Reso}(t = \{c_1, \dots, c_n\}) = \frac{\sum_{i=1}^{n} \text{resource reliability}(c_i)}{n}$
      Tuned on training data
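The reliability score is a plain mean, as the sketch below shows. The resource names and reliability values are invented for illustration; in the talk these values are tuned on training data.

```python
def resource_reliability_score(resources_used, reliability):
    """Mean reliability of the resources that translated each component."""
    return sum(reliability[r] for r in resources_used) / len(resources_used)


# Hypothetical reliabilities, not values from the talk.
reliability = {"dictionary": 1.0, "morpheme_table": 0.9, "synonyms": 0.5}
print(resource_reliability_score(["dictionary", "synonyms"], reliability))
```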
  • 88. Combination
      Linear combination of the 4 criteria: Frequency, Context, Part-of-speech translation probability and Resources reliability
      $\mathrm{Combi}(t, s) = \mathrm{Freq}(t) + \mathrm{Cont}(s, t) + \mathrm{Pos}(s, t) + \mathrm{Reso}(t)$
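Ranking with the combined score can be sketched as an unweighted sum of the four criteria followed by a sort. The candidate records, field names, and feature values below are invented for illustration only.

```python
def combined_score(candidate):
    """Unweighted sum of the four criteria."""
    return (candidate["freq"] + candidate["cont"]
            + candidate["pos"] + candidate["reso"])


def rank(candidates):
    """Best candidate first."""
    return sorted(candidates, key=combined_score, reverse=True)


candidates = [
    {"term": "toxique pour les cellules", "freq": 0.02, "cont": 0.40, "pos": 0.30, "reso": 0.95},
    {"term": "toxicité cellule", "freq": 0.01, "cont": 0.10, "pos": 0.05, "reso": 0.70},
]
print([c["term"] for c in rank(candidates)])
```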
  • 95. Machine learning
      Learning-to-rank algorithms used in IR for ranking documents
      Tried 3 algorithms implemented in the RankLib software¹
        AdaRank [Li and Xu, 2007]
        Coordinate Ascent [Metzler and Croft, 2000]
        LambdaMART [Wu et al., 2010]
      Features: Freq, Cont, Pos, Reso
      ¹ http://people.cs.umass.edu/~vdang/ranklib.html
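RankLib reads training data in the LETOR / SVMlight-style text format, one line per candidate: a relevance label, a query id, numbered feature values, and an optional comment. The sketch below shows how the four features could be serialized, treating each source term as a query; the function name, the feature order, and the sample values are assumptions, not details from the talk.

```python
def to_ranklib_line(label, query_id, features, comment=""):
    """Format one candidate as '<label> qid:<id> 1:<v1> 2:<v2> ... # comment'."""
    feats = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
    line = f"{label} qid:{query_id} {feats}"
    return f"{line} # {comment}" if comment else line


# Assumed feature order: Freq, Cont, Pos, Reso; one qid per source term.
print(to_ranklib_line(1, 7, [0.2, 0.5, 0.9, 1.0], "non toxique pour les cellules"))
```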
  • 96. Outline 1 Context 2 Translation method 3 Ranking method 4 Results of experiments 5 Future work
  • 100. Corpora
      English → French, German
      breast cancer
      ≈ 400k words per language
  • 106. Lexicons
      Morpheme translation table (hand-crafted)
      General language dictionary (Xelda)
      Synonyms (Xelda)
      Domain-specific dictionary: cognates extracted from corpus [Hauer and Kondrak, 2011]
      Morphological families [Porter, 1980]
  • 114. Training and evaluation datasets
      EVALUATION: ≈ 100 source terms
        source terms in UMLS meta-thesaurus with translation(s) in target texts
      TRAINING: ≈ 600 source terms
        source terms for which a translation could be generated and whose translation(s) is in the target texts
        generated translations were scored manually
      ⇒ evaluation and training datasets are disjoint
      ⇒ source terms are morphologically complex words with no translation in dictionary
  • 115. Results for translation generation
                 # source terms   # at least 1 translation
      EN → FR    126              86 (68%)
      EN → DE    90               56 (62%)

                 # at least 1 translation   1 trans. in UMLS   1 trans. in UMLS or judged correct
      EN → FR    86                         68 (79%)           81 (94%)
      EN → DE    56                         40 (71%)           51 (91%)
  • 116. Results for translation ranking
    Table: Top1 translation in UMLS or judged correct
    Method          EN → FR   EN → DE   Average
    Random          .83       .80       .815
    Freq            .92       .84       .88
    Cont            .90       .82       .86
    Pos             .88       .91       .895
    Reso            .92       .82       .87
    Combination     .93       .89       .91
    ML AdaRank      .90       .84       .87
    ML CoordAsc     .93       .89       .91
    ML LambdaMart   .86       .88       .87
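The "Average" figures are plain means over the two language pairs, which a few lines of Python can confirm. This is a sketch under that assumption: the scores are copied from the slide, while the variable and function names are illustrative. It does not reproduce how the rankers themselves were trained or combined.

```python
# Top-1 scores copied from the slide as (EN-FR, EN-DE); names are illustrative.
TOP1 = {
    "Random": (0.83, 0.80),
    "Freq": (0.92, 0.84),
    "Cont": (0.90, 0.82),
    "Pos": (0.88, 0.91),
    "Reso": (0.92, 0.82),
    "Combination": (0.93, 0.89),
    "ML AdaRank": (0.90, 0.84),
    "ML CoordAsc": (0.93, 0.89),
    "ML LambdaMart": (0.86, 0.88),
}


def average(scores):
    """Unweighted mean over the language pairs."""
    return sum(scores) / len(scores)


AVERAGES = {method: round(average(s), 3) for method, s in TOP1.items()}
BEST = max(AVERAGES.values())
# Several methods can tie for the best average, so collect all of them.
BEST_METHODS = sorted(m for m, a in AVERAGES.items() if a == BEST)
```

On these numbers the hand-built combination and the coordinate-ascent ranker tie for the best average, both ahead of the random baseline.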
  • 121. Silence analysis
    Missing translation in resources (≈30%)
    Target term is not compositional (≈30%): breastfeeding → allaitement (FR), stillen (DE)
    Lexical divergence (≈20%): radiosensitivity → Strahlentoleranz (German uses Toleranz 'tolerance' where English has sensitivity)
    Additional elements (≈13%): postpartum → postpartalperiod
  • 124. Error analysis
    Problems in word reordering: self-examination → untersuchung selbst 'examination self'
    Wrong or inappropriate translations: in-patient → pas malade 'not ill'. The prefix in- is ambiguous: read as "inside" it gives 'inside patient'; read as "inverse" it gives 'not a patient', the reading behind the wrong output.
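Both error types are easy to reproduce in a toy setting. The sketch below assumes that candidates are built by translating each component and then trying every ordering of the translated parts; this is a simplification for illustration, not the authors' implementation. The lexicon entries are hypothetical (in particular "dans" for the "inside" sense of in-), chosen only to mirror the examples on the slide.

```python
from itertools import permutations, product

# Toy component lexicon. All entries are hypothetical and only mirror the slide's examples.
TOY_LEXICON = {
    "self": ["selbst"],
    "examination": ["untersuchung"],
    "in": ["dans", "pas"],  # "inside" sense vs. "inverse" (negation) sense
    "patient": ["malade"],
}


def candidate_translations(components, lexicon):
    """Translate each component, then emit every ordering of the translated parts."""
    options = [lexicon[c] for c in components]
    candidates = set()
    for choice in product(*options):       # one translation per component
        for order in permutations(choice):  # unconstrained reordering
            candidates.add(" ".join(order))
    return sorted(candidates)


REORDERING = candidate_translations(["self", "examination"], TOY_LEXICON)
AMBIGUITY = candidate_translations(["in", "patient"], TOY_LEXICON)
```

Unconstrained permutation yields "untersuchung selbst" next to "selbst untersuchung", and the ambiguous prefix yields "pas malade" among the candidates, so it falls to the ranking step to push such outputs down the list.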
  • 126. Impact of fertile translations
    Table: % of fertile translations
    Pair      Exact translations   Wrong translations
    EN → FR   21%                  50%
    EN → DE   10%                  80%
    German (Germanic language): tendency to agglutination, e.g. oestrogen-independent → Östrogen-unabhängige
    French (Romance language): creates phrases more easily, e.g. oestrogen-independent → indépendant des œstrogènes
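A fertility check can be sketched in a few lines. The sketch assumes that "fertile" means the target term has more whitespace-separated words than the source term, which is consistent with the examples in this deck. The function name and the tokenisation by whitespace are assumptions made for illustration.

```python
def is_fertile(source_term, target_term):
    """True when the target has more whitespace-separated words than the source.

    Hyphenated forms count as one word under this assumed tokenisation.
    """
    return len(target_term.split()) > len(source_term.split())


# Source/target pairs taken from the slides.
EXAMPLES = [
    ("oestrogen-independent", "Östrogen-unabhängige"),
    ("oestrogen-independent", "indépendant des œstrogènes"),
    ("cardiotoxicity", "toxicité cardiaque"),
    ("pathophysiological", "physiopathologique"),
]
FLAGS = [is_fertile(s, t) for s, t in EXAMPLES]
```

Under this definition the German rendering of oestrogen-independent stays non-fertile while the French one is fertile, matching the agglutination versus phrase-building contrast described on the slide.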
  • 128. Future work
    Improve quality of linguistic resources: morphological derivation rules instead of stemming; use of a thesaurus
    Try translation patterns on top of permutations
    Try learning morpheme translation equivalences from cognates, bilingual dictionaries, and out-of-domain parallel data
  • 129. Thank you for your attention. estelle.delpech@univ-nantes.fr, beatrice.daille@univ-nantes.fr, emmanuel.morin@univ-nantes.fr, cl@lingua-et-machina.com
  • 130. ADDITIONAL SLIDES
  • 131. Exact translations
    Non-fertile: pathophysiological → physiopathologique; overactive → überaktiv
    Fertile: cardiotoxicity → toxicité cardiaque 'cardiac toxicity'; mastectomy → ablation der brust 'ablation of the breast'
  • 132. Morphological variants
    Non-fertile: dosimetry → dosimétrique 'dosimetric'; radiosensitivity → strahlenempfindlich 'radiosensitive'
    Fertile: milk-producing → production de lait 'production of milk'; self-examination → selbst untersuchen 'self examine'
  • 133. Inexact but semantically related
    Non-fertile: oncogene → oncogénèse 'oncogenesis'; breakthrough → durchbrechen 'break'
    Fertile: chemoradiotherapy → chemotherapie oder strahlen 'chemotherapy or radiation'; treatable → pouvoir le traiter 'can treat it'
  • 134. Wrong translations
    Non-fertile: immunoscore → immunomarquer 'immunostain'; check-in → unkontrollieren 'uncontrolled'
    Fertile: bloodstream → fliessen mehr blut 'more blood flow'; risk-reducing → risque de réduire 'risk of reducing'
  • 135. References I
    Baldwin, T. and Tanaka, T. (2004). Translation by machine of complex nominals. In Proceedings of the ACL 2004 Workshop on Multiword expressions: Integrating Processing, pages 24–31, Barcelona, Spain.
    Bo, L. and Gaussier, E. (2010). Improving corpus comparability for bilingual lexicon extraction from comparable corpora. In 23ème International Conference on Computational Linguistics, pages 23–27, Beijing, China.
    Cartoni, B. (2009). Lexical morphology in machine translation: A feasibility study. In Proceedings of the 12th Conference of the European Chapter of the ACL, pages 130–138, Athens, Greece.
    Daille, B. and Morin, E. (2005). French-English terminology extraction from comparable corpora. In Proceedings, 2nd International Joint Conference on Natural Language Processing, volume 3651 of Lecture Notes in Computer Sciences, pages 707–718, Jeju Island, Korea. Springer.
    Delpech, E. (2011). Evaluation of terminologies acquired from comparable corpora: an application perspective. In Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011), volume 11 of NEALT Proceedings Series, pages 66–73, Riga, Latvia. Pedersen B.S., Nešpore G., Skadiņa I.
    Fung, P. (1997). Finding terminology translations from non-parallel corpora. Pages 192–202, Hong Kong.
    Garera, N. and Yarowsky, D. (2008). Translating compounds by learning component gloss translation via multiple languages. In Proceedings of the 3rd International Joint Conference on Natural Language Processing, volume 1, pages 403–410, Hyderabad, India.
  • 136. References II
    Grefenstette, G. (1999). The world wide web as a resource for example-based machine translation tasks. ASLIB'99 Translating and the computer, 21.
    Harastani, R., Daille, B., and Morin, E. (2012). Neoclassical compound alignments from comparable corpora. In Proceedings of the 13th International Conference on Computational Linguistics and Intelligent Text Processing, volume 2, pages 72–82, New Delhi, India.
    Hauer, B. and Kondrak, G. (2011). Clustering semantically equivalent words into cognate sets in multilingual lists. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 865–873, Chiang Mai, Thailand.
    Keenan, E. L. and Faltz, L. M. (1985). Boolean semantics for natural language. D. Reidel, Dordrecht, Holland.
    Lardilleux, A. (2008). A truly multilingual, high coverage, accurate, yet simple, sub-sentential alignment method.
    Li, H. and Xu, J. (2007). AdaRank: A boosting algorithm for information retrieval. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 391–398, Amsterdam, The Netherlands.
    Metzler, D. and Croft, W. B. (2000). Linear feature-based models for information retrieval. Information Retrieval, 10(3):257–274.
  • 137. References III
    Morin, E. and Daille, B. (2009). Compositionality and lexical alignment of multi-word terms. In Language Resources and Evaluation (LRE), volume 44 of Multiword expression: hard going or plain sailing, pages 79–95. P. Rayson, S. Piao, S. Sharoff, S. Evert, B. Villada Moirón, Springer Netherlands edition.
    Morin, E. and Daille, B. (2010). Compositionality and lexical alignment of multi-word terms. In Rayson, P., Piao, S., Sharoff, S., Evert, S., and B., V. M., editors, Language Resources and Evaluation (LRE), volume 44 of Multiword expression: hard going or plain sailing, pages 79–95. Springer Netherlands.
    Namer, F. and Baud, R. (2007). Defining and relating biomedical terms: Towards a cross-language morphosemantics-based system. International Journal of Medical Informatics, 76(2-3):226–33.
    Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3):130–137.
    Robitaille, X., Sasaki, X., Tonoike, M., Sato, S., and Utsuro, S. (2006). Compiling French-Japanese terminologies from the web. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 225–232, Trento, Italy.
    Tiedemann, J. (2009). News from OPUS - a collection of multilingual parallel corpora with tools and interfaces.
    Wu, Q., Burges, J. C., Svore, K., and Gao, J. (2010). Adapting boosting for information retrieval measures. Journal of Information Retrieval, 13(3):254–270.