Dictionary Alignment by Rewrite-based Entry Translation

Presentation at SLATE 2013 - http://slate-conf/

Dictionary Alignment by Rewrite-based Entry Translation

  1. Dictionary Alignment by Rewrite-based Entry Translation
     Alberto Simões (1), Xavier Gómez Guinovart (2)
     (1) Centro de Estudos Humanísticos, Universidade do Minho, Campus de Gualtar, Braga, Portugal, ambs@ilch.uminho.pt
     (2) Galician Language Technologies and Applications (TALG Group), Universidade de Vigo, Galiza, Spain, xgg@uvigo.es
     SLATE 2013
  2. Motivation
     • We have a running project, Dicionário-Aberto (DA), that lets users consult a Portuguese dictionary;
     • Dicionário-Aberto is also available in TEI and DB formats;
     • Within the GALNET project, a Galician Synonyms Dictionary (GSD) was converted from a WYSIWYG format to a rich TEI format;
     • Would it be possible to integrate the GSD into DA?
  3. Problem
     • Dicionário-Aberto has more than a hundred thousand entries!
     • The Galician Synonyms Dictionary is smaller, with some tens of thousands of entries.
     • Problem: how to align entries from both dictionaries?
     • The two languages are very close, which helps with concept alignment!
     • There are too many different words;
     • There is a reasonable set of false-friend words;
     • There isn't a free and big enough translation dictionary.
  5. Inspiration (part 1)
     In the first year, “s” will be used instead of the soft “c.” Sertainly, sivil servants will resieve this news with joy. Also, the hard “c” will be replaced with “k”. Not only will this klear up konfusion, but typewriters kan have one less letter. There will be growing publik emthusiasm in the sekond year, when the troublesome “ph” will be replaced by “f”. This will make words like “fotograf” 20 persent shorter. In the third year, publik akseptanse of the new spelling kan be expekted to reach the stage where more komplikated changes are possible. Governments will enkorage the removal of double letters, which have always ben a deterent to akurate speling. Also, al wil agre that the horible mes of silent “e”s in the languag is disgrasful, and they would go.
  6. Inspiration (part 2)
     By the fourth year, peopl wil be reseptiv to steps such as replasing “th” by “z” and “w” by “v”. During ze fifz year, ze unesesary “o” kan be dropd from vords kontaining “ou”, and similar changes vud of kors be aplid to ozer kombinations of leters. After zis fifz yer, ve vil hav a reli sensibl riten styl. Zer vil be no mor trubls or difikultis and evrivun vil find it ezi tu understand ech ozer. Ze drem vil finali kum tru!!
  7. Approach
     Define a translation function based on a set or sequence of text transformations (mainly substitutions) that convert (translate) Portuguese words into Galician words. The translation function is defined as
     $T(L_{gl}, w_{pt}) = w_{gl}$
     where
     • $L_{gl}$ is the target Galician lexicon, obtained from the words present in the Galician Synonyms Dictionary;
     • $w_{pt}$ is the Portuguese word being translated;
     • $w_{gl}$ is the Galician translation.
  9. Translation Function
     Substitutions can be simple:
     • ss > s: passo > paso
     • j > x: sujeito > suxeito, injectar > inxectar
     • z ([eiéíêî]) > c: bronze > bronce
     Substitutions can over-generate:
     • -ção > -ción, -zón: adivinhação > adiviñación, coração > corazón
     • -velmente > -belmente, -blemente: previsivelmente > previsibelmente, previsiblemente
     • rv > rv, rb: preservação > preservación, estorvar > estorbar
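A single substitution, including an over-generating one, could be sketched roughly as follows. The rule encoding (a regular expression plus a list of replacement strings) and the name `apply_rule` are assumptions made for illustration; the slides do not say how the rules were implemented.

```python
import re

def apply_rule(word, pattern, replacements):
    """Apply one substitution rule to `word`.

    Hypothetical encoding: `pattern` is a regular expression and
    `replacements` lists every alternative right-hand side, so an
    over-generating rule simply returns several candidates.
    """
    if not re.search(pattern, word):
        return [word]  # rule does not apply: keep the word unchanged
    candidates = []
    for replacement in replacements:
        candidate = re.sub(pattern, replacement, word)
        if candidate not in candidates:
            candidates.append(candidate)
    return candidates

# Simple rules produce exactly one candidate.
print(apply_rule("passo", r"ss", ["s"]))                 # ['paso']
print(apply_rule("bronze", r"z(?=[eiéíêî])", ["c"]))     # ['bronce']

# Over-generating rules produce several; only some are real Galician words.
print(apply_rule("coração", r"ção$", ["ción", "zón"]))   # ['coración', 'corazón']
print(apply_rule("estorvar", r"rv", ["rv", "rb"]))       # ['estorvar', 'estorbar']
```

The lookahead in `z(?=[eiéíêî])` is one way to read the bracketed context in the rule: the vowel conditions the change but is not itself rewritten.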
  11. Translation Function
     • A word without substitutions can be a valid translation;
     • Substitutions can be inter-dependent (for example, -ção > -ción should be applied before ç > z);
     • Substitutions are applied from more generic to more specific (unless there is interdependence);
     • Substitutions can generate more than one possible translation;
     • The function returns the first of the possible translations that exists in the target lexicon.
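Taken together, these properties suggest a pipeline along the following lines. This is only a sketch under stated assumptions: rules are applied cumulatively in list order, candidates keep their generation order, and `None` is returned when no candidate is in the lexicon (the slides do not say what happens in that case). The names `translate` and `RULES` and the toy lexicon are hypothetical.

```python
import re

# Hypothetical rule list: (regex, alternative replacements), in application order.
RULES = [
    (r"ss", ["s"]),
    (r"j", ["x"]),
    (r"ção$", ["ción", "zón"]),  # must run before the generic ç > z
    (r"ç", ["z"]),
    (r"nh", ["ñ"]),
]

def translate(lexicon, word, rules):
    """Sketch of T(L_gl, w_pt): first candidate found in the lexicon, else None."""
    candidates = [word]  # with no applicable rule, the word itself is the candidate
    for pattern, replacements in rules:
        expanded = []
        for candidate in candidates:
            if re.search(pattern, candidate):
                rewritten = [re.sub(pattern, r, candidate) for r in replacements]
            else:
                rewritten = [candidate]
            for new_word in rewritten:
                if new_word not in expanded:
                    expanded.append(new_word)
        candidates = expanded
    for candidate in candidates:
        if candidate in lexicon:  # the lexicon filters out over-generated forms
            return candidate
    return None

# Toy lexicon for illustration only.
lexicon = {"paso", "corazón", "adiviñación", "presunto"}
print(translate(lexicon, "passo", RULES))        # paso
print(translate(lexicon, "coração", RULES))      # corazón
print(translate(lexicon, "adivinhação", RULES))  # adiviñación
print(translate(lexicon, "presunto", RULES))     # presunto
```

Because "presunto" passes through unchanged and is found in the lexicon, the sketch also shows how an identical spelling is accepted as a translation regardless of meaning.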
  12. Translation Function
     Id.  Substitution
     ID   (none)
     A    ss > s
     B    j > x
     C    -ção > -ción, -zón
     D    ç > z
     E    nh > ñ
     F    -dizer > -dicir
     G    z ([eiéíêî]) > c
     H    lh > ll
     I    vr > br
     J    -agem > -axe
     K    g ([eiéíêî]) > x
     L    -ável > -ábel, -able
     M    -ível > -íbel, -ible
     N    -velmente > -belmente, -blemente
     O    -eio > -eo
     P    -ância > -ancia
     Q    -ência > -encia
     R    -aria > -ería, -aría
     S    -ário > -ario
     T    -óri[oa] > -ori[oa]
     U    -são > -sión, -són
     V    -rão > -rón, -rán
     W    -mão > -món, -mán
     X    -ião > -ión, -ián
     Y    -ício > -icio
     Z    -óide > -oide
     AA   -ídio > -idio
     AB   -ânico > -ánico
     AC   -édia > -edia
     AD   -cimento > -cemento
     AE   -m > -n
     AF   -crever > -cribir
     AG   -u > -u, -o
     AH   -var > -bar
     AI   im- > im-, inm-
     AJ   qua- > cua-, ca-
     AK   qua > cua
     AL   -xão > -xón, -xión
     AM   rv > rv, rb
     AN   -iver > -ivir
  13. Evaluation 1
     Given a small (about 9K pairs) hand-curated translation dictionary, compute Type I/II errors for the hypothesis $T(L_{gl}, w_{pt}) = w_{gl}$:

                                       Correct  Incorrect
     w_gl is a Galician word           TP       FP
     w_gl is not a Galician word       TN       FN

     • TP (True Positive): correct translation;
     • FP (False Positive): wrong translation, but the obtained word is present in the Galician lexicon;
     • TN (True Negative): correct translation, but the translation is not in the Galician lexicon (always 0);
     • FN (False Negative): wrong translation, and the obtained word is not in the Galician lexicon.
  14. Evaluation 1: Measures
     $\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$  (1)
     $\mathrm{precision} = \frac{TP}{TP + FP}$  (2)
     $\mathrm{recall} = \frac{TP}{TP + FN}$  (3)
     $F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$  (4)
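As a small worked illustration of measures (1) to (4), the sketch below computes them from the four counts. The function name `scores` and the counts in the usage example are hypothetical and are not taken from the evaluation; the sketch also assumes no denominator is zero.

```python
def scores(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion counts.

    Assumes every denominator is non-zero (no guard is included).
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical counts for illustration only; TN is 0 as in the setup above.
result = scores(tp=80, fp=2, tn=0, fn=18)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

With TN fixed at 0, accuracy reduces to TP / (TP + FP + FN), so it is always at or below recall.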
  15. Evaluation 1: Results
     Id.  Precision  Recall  F1      Accuracy  Correct  Δ
     ID   0.9954     0.5859  0.7376  0.5843    5390     5390
     A    0.9952     0.6038  0.7516  0.6020    5553     163
     B    0.9951     0.6158  0.7608  0.6139    5663     110
     C    0.9952     0.6567  0.7912  0.6546    6038     375
     D    0.9951     0.6687  0.7999  0.6665    6148     110
     E    0.9952     0.6782  0.8066  0.6760    6235     87
     F    0.9952     0.6786  0.8070  0.6764    6239     4
     G    0.9953     0.6838  0.8107  0.6816    6287     48
     H    0.9953     0.6927  0.8169  0.6905    6369     82
     I    0.9953     0.6934  0.8174  0.6911    6375     6
     J    0.9953     0.6964  0.8195  0.6942    6403     28
     K    0.9955     0.7210  0.8363  0.7187    6629     226
     L    0.9955     0.7256  0.8394  0.7232    6671     42
     M    0.9955     0.7284  0.8413  0.7260    6697     26
     N    0.9957     0.7482  0.8544  0.7458    6879     182
  16. Evaluation 1: Results
     Id.  Precision  Recall  F1      Accuracy  Correct  Δ
     O    0.9957     0.7496  0.8553  0.7472    6892     13
     P    0.9957     0.7515  0.8565  0.7490    6909     17
     Q    0.9957     0.7588  0.8612  0.7563    6976     67
     R    0.9957     0.7602  0.8621  0.7577    6989     13
     S    0.9958     0.7680  0.8672  0.7655    7061     72
     T    0.9958     0.7703  0.8686  0.7678    7082     21
     U    0.9958     0.7772  0.8731  0.7747    7146     64
     V    0.9958     0.7780  0.8735  0.7755    7153     7
     W    0.9958     0.7783  0.8737  0.7758    7156     3
     X    0.9958     0.7796  0.8746  0.7771    7168     12
     Y    0.9958     0.7806  0.8752  0.7781    7177     9
     Z    0.9958     0.7807  0.8753  0.7782    7178     1
     AA   0.9958     0.7813  0.8756  0.7787    7183     5
  17. Evaluation 1: Results
     Id.  Precision  Recall  F1      Accuracy  Correct  Δ
     AB   0.9958     0.7818  0.8759  0.7793    7188     5
     AC   0.9958     0.7822  0.8762  0.7797    7192     4
     AD   0.9959     0.7836  0.8770  0.7810    7204     12
     AE   0.9959     0.7855  0.8783  0.7830    7222     18
     AF   0.9959     0.7863  0.8787  0.7837    7229     7
     AG   0.9957     0.7876  0.8795  0.7849    7240     11
     AH   0.9957     0.7882  0.8799  0.7856    7246     6
     AI   0.9958     0.7903  0.8812  0.7876    7265     19
     AJ   0.9956     0.7928  0.8827  0.7900    7287     22
     AK   0.9956     0.7940  0.8834  0.7912    7298     11
     AL   0.9956     0.7947  0.8839  0.7920    7305     7
     AM   0.9956     0.7951  0.8842  0.7924    7309     4
     AN   0.9956     0.7955  0.8844  0.7927    7312     3
  18. Evaluation 2
     Triangulating a bigger dictionary for evaluation purposes:
     • PT–SP (12 340) ◦ SP–GL (7 581) ⇒ PT–GL (5 045 pairs), from the Apertium translation software;
     • PT–SP ◦ SP–EN (24 912) ◦ EN–GL (17 626) ⇒ PT–GL (6 644 pairs), with PT–SP and SP–EN from Apertium and EN–GL from CLUVI;
     • PT–EN (14 600) ◦ EN–GL ⇒ PT–GL (8 589 pairs), with PT–EN from a merchandising app and EN–GL from CLUVI.
     Adding the dictionaries together resulted in 14 492 pairs.
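Triangulation of this kind can be read as relation composition through a pivot language. The sketch below is an assumption about how such a composition might be done, not the authors' tooling: dictionaries are modelled as mappings from a source word to a set of translations, and the names `compose`, `merge` and `count_pairs`, as well as the toy entries, are hypothetical.

```python
def compose(first, second):
    """Compose two bilingual dictionaries through their shared pivot language."""
    result = {}
    for source, pivots in first.items():
        for pivot in pivots:
            for target in second.get(pivot, ()):
                result.setdefault(source, set()).add(target)
    return result

def merge(*dictionaries):
    """Union several dictionaries of the same language pair."""
    merged = {}
    for dictionary in dictionaries:
        for source, targets in dictionary.items():
            merged.setdefault(source, set()).update(targets)
    return merged

def count_pairs(dictionary):
    """Number of distinct (source, target) pairs."""
    return sum(len(targets) for targets in dictionary.values())

# Toy entries for illustration only.
pt_sp = {"cão": {"perro"}, "janela": {"ventana"}}
sp_gl = {"perro": {"can"}, "ventana": {"xanela", "fiestra"}}
pt_en = {"cão": {"dog"}}
en_gl = {"dog": {"can"}}

pt_gl = merge(compose(pt_sp, sp_gl), compose(pt_en, en_gl))
print(count_pairs(pt_gl))  # 3
```

Merging with set union removes pairs found by more than one route, which is consistent with the combined total (14 492) being smaller than the sum of the three triangulated dictionaries.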
  19. Evaluation 2: Results
     Id.  Precision  Recall  F1      Accuracy  Correct  Δ
     ID   0.9668     0.5022  0.6611  0.4937    7155     7155
     A    0.9664     0.5176  0.6741  0.5084    7368     213
     B    0.9663     0.5275  0.6824  0.5179    7506     138
     C    0.9668     0.5646  0.7129  0.5538    8026     520
     D    0.9661     0.5746  0.7206  0.5633    8163     137
     E    0.9658     0.5831  0.7272  0.5713    8279     116
     ...  ...        ...     ...     ...       ...      ...
     AH   0.9660     0.6819  0.7994  0.6659    9650     7
     AI   0.9661     0.6841  0.8010  0.6681    9682     32
     AJ   0.9660     0.6863  0.8025  0.6701    9711     29
     AK   0.9660     0.6873  0.8032  0.6711    9726     15
     AL   0.9661     0.6881  0.8037  0.6718    9736     10
     AM   0.9660     0.6884  0.8039  0.6721    9740     4
     AN   0.9660     0.6887  0.8041  0.6724    9744     4
  20. Dictionary Alignment: Results
                   Portuguese Words      Galician Words
     Substitution  Count   Percentage    Count   Percentage
     ID            12711   15.3502%      12711   33.7475%
     A             13082   15.7982%      13065   34.6874%
     B             13447   16.2390%      13421   35.6326%
     C             14348   17.3270%      14321   38.0220%
     D             14764   17.8294%      14728   39.1026%
     E             15174   18.3245%      15138   40.1912%
     ...           ...     ...           ...     ...
     AI            17712   21.3895%      17627   46.7994%
     AJ            17740   21.4233%      17648   46.8552%
     AK            17765   21.4535%      17673   46.9215%
     AL            17784   21.4764%      17693   46.9746%
     AM            17813   21.5115%      17718   47.0410%
     AN            17817   21.5163%      17722   47.0516%
     DIC           20084   24.2540%      19989   53.0705%
  21. Final Remarks
     • An approach to translate Portuguese words in a dictionary into Galician words using a set of string substitutions;
     • The approach is unable to translate all words;
     • A reasonable number of words in Dicionário-Aberto use pre-1930 orthography, which was not dealt with;
     • We deliberately ignored a relevant problem, false friends:
       • two words that share only a subset of their meanings. For instance, talho (PT) and tallo (GL) share the majority of their senses, but some senses are specific to Portuguese (for example, the place where meat is sold);
       • two words that have completely different meanings. An example is the word presunto (written the same way in the two languages), which means ham in Portuguese (a noun) but alleged in Galician (an adjective).
