Dictionary Alignment by Rewrite-based Entry Translation
1. Dictionary Alignment
by Rewrite-based Entry Translation
Alberto Simões¹ Xavier Gómez Guinovart²
¹Centro de Estudos Humanísticos, Universidade do Minho
Campus de Gualtar, Braga, Portugal
ambs@ilch.uminho.pt
²Galician Language Technologies and Applications (TALG Group)
Universidade de Vigo, Galiza, Spain
xgg@uvigo.es
SLATE 2013
Alberto Simões, Xavier Gómez Guinovart: Dictionary Alignment by Rewrite-based Entry Translation
2. Motivation
We have a running project, Dicionário-Aberto (DA), that lets
users consult a Portuguese dictionary;
Dicionário-Aberto is also available in TEI and DB formats;
Within the GALNET project, a Galician Synonyms Dictionary (GSD) was
converted from a WYSIWYG format to a rich TEI format;
Would it be possible to integrate the GSD into DA?
3. Problem
Dicionário-Aberto has more than a hundred thousand entries!
The Galician Synonyms Dictionary is smaller, with a few tens of
thousands of entries.
Problem: how to align entries from both dictionaries?
The two languages are very close;
That helps with concept alignment!
There are too many different words;
There is a fair number of false friends;
There isn't a free and big enough translation dictionary.
5. Inspiration (part 1)
In the first year, “s” will be used instead of the soft “c.” Sertainly,
sivil servants will resieve this news with joy. Also, the hard “c” will
be replaced with “k”. Not only will this klear up konfusion, but
typewriters kan have one less letter.
There will be growing publik enthusiasm in the sekond year, when
the troublesome “ph” will be replaced by “f”. This will make words
like “fotograf” 20 persent shorter.
In the third year, publik akseptanse of the new spelling kan be
expekted to reach the stage where more komplikated changes are
possible. Governments will enkorage the removal of double letters,
which have always ben a deterent to akurate speling. Also, al wil
agre that the horible mes of silent “e”s in the languag is disgrasful,
and they would go.
6. Inspiration (part 2)
By the fourth year, peopl wil be reseptiv to steps such as replasing
“th” by “z” and “w” by “v”.
During ze fifz year, ze unesesary “o” kan be dropd from vords
kontaining “ou”, and similar changes vud of kors be aplid to ozer
kombinations of leters.
After zis fifz yer, ve vil hav a reli sensibl riten styl. Zer vil be no
mor trubls or difikultis and evrivun vil find it ezi tu understand ech
ozer. Ze drem vil finali kum tru!!
7. Approach
Define a translation function based on a set or sequence of text
transformations (mainly substitutions) that convert (translate)
Portuguese words into Galician words.
The translation function is defined as
$T(L_{gl}, w_{pt}) = w_{gl}$
$L_{gl}$ is the target Galician lexicon, obtained from the words
present in the Galician Synonyms Dictionary;
$w_{pt}$ is the Portuguese word being translated;
$w_{gl}$ is the Galician translation.
9. Translation Function
Substitutions can be simple, as:
ss > s: passo > paso
j > x: sujeito > suxeito, injectar > inxectar
z ([eiéíêî]) > c: bronze > bronce
Substitutions can over-generate:
-ção > -ción, -zón:
adivinhação > adiviñación, coração > corazón
-velmente > -belmente, -blemente:
previsivelmente > previsibelmente, previsiblemente
rv > rv, rb:
preservação > preservación, estorvar > estorbar
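A minimal Python sketch of how such substitutions might be encoded, assuming each rule is a regular expression paired with one or more alternative replacements. The names `expand`, `SS`, `Z_SOFT`, `CAO` and `RV` are illustrative and not taken from the authors' implementation. A rule with several alternatives yields several candidates, which is where over-generation comes from.

```python
import re

# Each rule is (regex pattern, list of alternative replacements).
# Rule names and this encoding are assumptions made for illustration.
SS = (r"ss", ["s"])                   # simple rule: ss > s
Z_SOFT = (r"z(?=[eiéíêî])", ["c"])    # context rule: z before e/i > c
CAO = (r"ção$", ["ción", "zón"])      # over-generating suffix rule
RV = (r"rv", ["rv", "rb"])            # keeps rv or rewrites it to rb


def expand(word, rule):
    """Return every candidate produced by applying one rule to one word."""
    pattern, alternatives = rule
    if not re.search(pattern, word):
        return [word]                 # rule does not apply: word is unchanged
    return [re.sub(pattern, alt, word) for alt in alternatives]


print(expand("passo", SS))        # ['paso']
print(expand("bronze", Z_SOFT))   # ['bronce']
print(expand("coração", CAO))     # ['coración', 'corazón']
print(expand("estorvar", RV))     # ['estorvar', 'estorbar']
```

Only one of the two candidates for coração is a Galician word, so a later lookup in the target lexicon is what discards the spurious form.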
11. Translation Function
A word without substitutions can be a valid translation;
Substitutions can be inter-dependent;
(for example, -ção > -ción should be applied before ç > z)
Substitutions are applied from more generic to more specific;
(unless there is interdependence)
A substitution can generate more than one possible
translation;
Finally, the first candidate translation that exists in the
target lexicon is returned.
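The points above can be sketched as a small Python function. This is an illustration under stated assumptions, not the authors' code: `RULES` is a short excerpt of the rule set, `LEXICON` is a toy stand-in for the Galician lexicon, and trying the unchanged word last is a choice made here because the slides do not specify the candidate order.

```python
import re

# Ordered excerpt of the rules: (pattern, alternative replacements).
# The suffix rule for -ção must come before the generic ç > z rule,
# otherwise "coração" would become "corazão" and the suffix rule would miss it.
RULES = [
    (r"ss", ["s"]),
    (r"j", ["x"]),
    (r"ção$", ["ción", "zón"]),
    (r"ç", ["z"]),
    (r"nh", ["ñ"]),
]

# Toy target lexicon (hypothetical sample, not the real dictionary).
LEXICON = {"corazón", "adiviñación", "suxeito", "cabeza", "casa"}


def translate(lexicon, word):
    """Sketch of T(L_gl, w_pt): return a word from the lexicon, or None."""
    candidates = [word]
    for pattern, alternatives in RULES:
        expanded = []
        for cand in candidates:
            if re.search(pattern, cand):
                expanded.extend(re.sub(pattern, alt, cand) for alt in alternatives)
            else:
                expanded.append(cand)
        candidates = expanded
    if word not in candidates:
        candidates.append(word)   # assumption: the unchanged word is tried last
    for cand in candidates:
        if cand in lexicon:       # first candidate found in the lexicon wins
            return cand
    return None


for pt in ["coração", "adivinhação", "sujeito", "cabeça", "casa", "livro"]:
    print(pt, "->", translate(LEXICON, pt))
```

With this toy lexicon, livro has no candidate in the lexicon, so the function returns `None`, which corresponds to a word the approach cannot translate.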
12. Translation Function
Id.  Substitution
ID   (no substitution)
A    ss > s
B    j > x
C    -ção > -ción, -zón
D    ç > z
E    nh > ñ
F    -dizer > -dicir
G    z ([eiéíêî]) > c
H    lh > ll
I    vr > br
J    -agem > -axe
K    g ([eiéíêî]) > x
L    -ável > -ábel, -able
M    -ível > -íbel, -ible
N    -velmente > -belmente, -blemente
O    -eio > -eo
P    -ância > -ancia
Q    -ência > -encia
R    -aria > -ería, -aría
S    -ário > -ario
T    -óri[oa] > -ori[oa]
U    -são > -sión, -són
V    -rão > -rón, -rán
W    -mão > -món, -mán
X    -ião > -ión, -ián
Y    -ício > -icio
Z    -óide > -oide
AA   -ídio > -idio
AB   -ânico > -ánico
AC   -édia > -edia
AD   -cimento > -cemento
AE   -m > -n
AF   -crever > -cribir
AG   -u > -u, -o
AH   -var > -bar
AI   im- > im-, inm-
AJ   qua- > cua-, ca-
AK   qua > cua
AL   -xão > -xón, -xión
AM   rv > rv, rb
AN   -iver > -ivir
13. Evaluation 1
Given a small (about 9K pairs) hand-curated translation dictionary...
Compute Type I/II errors:
$T(L_{gl}, w_{pt}) = w_{gl}$          Correct   Incorrect
$w_{gl}$ is a Galician word           TP        FP
$w_{gl}$ is not a Galician word       TN        FN
TP: True Positives. Correct translation;
FP: False Positives. Wrong translation, but the obtained word is
present in the Galician lexicon;
TN: True Negatives. Correct translation, but the translation is not in
the Galician lexicon (always 0);
FN: False Negatives. Wrong translation, and the obtained word is
not in the Galician lexicon.
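As a sketch of how these four counts turn into the usual scores, the Python below labels each result and computes precision, recall, F1 and accuracy with the standard formulas. The function names and the toy results are hypothetical. Treating "no candidate found in the lexicon" (`None`) as an FN is an assumption about how the slide's definition is applied.

```python
from collections import Counter


def classify(predicted, gold_translations):
    """Label one result; `predicted` is None when no candidate was in the lexicon."""
    if predicted is None:
        return "FN"   # assumption: nothing usable was produced
    return "TP" if predicted in gold_translations else "FP"


def scores(tp, fp, fn, tn=0):
    """Standard definitions; TN defaults to 0, as stated on the slide."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy


# Toy results (hypothetical): (translation returned, accepted translations).
results = [
    ("corazón", {"corazón"}),
    ("paso", {"paso"}),
    ("presunto", {"xamón"}),   # false friend: a Galician word, wrong meaning
    (None, {"libro"}),         # no candidate was found in the lexicon
]
counts = Counter(classify(pred, gold) for pred, gold in results)
precision, recall, f1, accuracy = scores(counts["TP"], counts["FP"], counts["FN"])
print(counts, precision, recall, f1, accuracy)
```

Because TN is always 0, accuracy reduces to TP / (TP + FP + FN), so it can never exceed recall.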
15. Evaluation 1 — Results
Id. Precision Recall F1 Accuracy Correct ∆
ID 0.9954 0.5859 0.7376 0.5843 5390 5390
A 0.9952 0.6038 0.7516 0.6020 5553 163
B 0.9951 0.6158 0.7608 0.6139 5663 110
C 0.9952 0.6567 0.7912 0.6546 6038 375
D 0.9951 0.6687 0.7999 0.6665 6148 110
E 0.9952 0.6782 0.8066 0.6760 6235 87
F 0.9952 0.6786 0.8070 0.6764 6239 4
G 0.9953 0.6838 0.8107 0.6816 6287 48
H 0.9953 0.6927 0.8169 0.6905 6369 82
I 0.9953 0.6934 0.8174 0.6911 6375 6
J 0.9953 0.6964 0.8195 0.6942 6403 28
K 0.9955 0.7210 0.8363 0.7187 6629 226
L 0.9955 0.7256 0.8394 0.7232 6671 42
M 0.9955 0.7284 0.8413 0.7260 6697 26
N 0.9957 0.7482 0.8544 0.7458 6879 182
16. Evaluation 1 — Results
Id. Precision Recall F1 Accuracy Correct ∆
O 0.9957 0.7496 0.8553 0.7472 6892 13
P 0.9957 0.7515 0.8565 0.7490 6909 17
Q 0.9957 0.7588 0.8612 0.7563 6976 67
R 0.9957 0.7602 0.8621 0.7577 6989 13
S 0.9958 0.7680 0.8672 0.7655 7061 72
T 0.9958 0.7703 0.8686 0.7678 7082 21
U 0.9958 0.7772 0.8731 0.7747 7146 64
V 0.9958 0.7780 0.8735 0.7755 7153 7
W 0.9958 0.7783 0.8737 0.7758 7156 3
X 0.9958 0.7796 0.8746 0.7771 7168 12
Y 0.9958 0.7806 0.8752 0.7781 7177 9
Z 0.9958 0.7807 0.8753 0.7782 7178 1
AA 0.9958 0.7813 0.8756 0.7787 7183 5
17. Evaluation 1 — Results
Id. Precision Recall F1 Accuracy Correct ∆
AB 0.9958 0.7818 0.8759 0.7793 7188 5
AC 0.9958 0.7822 0.8762 0.7797 7192 4
AD 0.9959 0.7836 0.8770 0.7810 7204 12
AE 0.9959 0.7855 0.8783 0.7830 7222 18
AF 0.9959 0.7863 0.8787 0.7837 7229 7
AG 0.9957 0.7876 0.8795 0.7849 7240 11
AH 0.9957 0.7882 0.8799 0.7856 7246 6
AI 0.9958 0.7903 0.8812 0.7876 7265 19
AJ 0.9956 0.7928 0.8827 0.7900 7287 22
AK 0.9956 0.7940 0.8834 0.7912 7298 11
AL 0.9956 0.7947 0.8839 0.7920 7305 7
AM 0.9956 0.7951 0.8842 0.7924 7309 4
AN 0.9956 0.7955 0.8844 0.7927 7312 3
18. Evaluation 2
Triangulating a bigger dictionary for evaluation purposes:
PT–SP (12 340) ◦ SP–GL (7 581) ⇒ PT–GL (5 045 pairs)
from Apertium translation software
PT–SP ◦ SP–EN (24 912) ◦ EN–GL (17 626) ⇒ PT–GL (6 644)
PT–SP and SP–EN from Apertium, EN–GL from CLUVI
PT–EN (14 600) ◦ EN–GL ⇒ PT–GL (8 589 pairs)
PT–EN from a merchandising app, EN–GL from CLUVI
Merging the three dictionaries resulted in 14 492 pairs.
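The triangulation step can be sketched in Python as dictionary composition through a pivot language, followed by a union of the resulting pairs. The function names and the toy entries are hypothetical, and each dictionary is assumed to map a word to a set of translations. The real Apertium and CLUVI data are not reproduced here.

```python
from collections import defaultdict


def compose(first, second):
    """Compose A->B and B->C dictionaries into an A->C dictionary."""
    result = defaultdict(set)
    for source, pivots in first.items():
        for pivot in pivots:
            result[source].update(second.get(pivot, ()))
    # Drop source words whose pivots had no translation at all.
    return {source: targets for source, targets in result.items() if targets}


def to_pairs(dictionary):
    """Flatten a word -> set-of-translations mapping into (source, target) pairs."""
    return {(s, t) for s, targets in dictionary.items() for t in targets}


# Toy dictionaries (hypothetical entries, for illustration only).
pt_sp = {"coração": {"corazón"}, "passo": {"paso"}, "livro": {"libro"}}
sp_gl = {"corazón": {"corazón"}, "paso": {"paso"}}
pt_en = {"livro": {"book"}, "passo": {"step"}}
en_gl = {"book": {"libro"}, "step": {"paso"}}

via_sp = compose(pt_sp, sp_gl)    # PT–SP ◦ SP–GL
via_en = compose(pt_en, en_gl)    # PT–EN ◦ EN–GL
merged = to_pairs(via_sp) | to_pairs(via_en)
print(sorted(merged))
```

A set union keeps a pair found through several pivots only once, which is consistent with the merged total on the slide being smaller than the sum of the three triangulated dictionaries.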
19. Evaluation 2 – Results
Id. Precision Recall F1 Accuracy Correct ∆
ID 0.9668 0.5022 0.6611 0.4937 7155 7155
A 0.9664 0.5176 0.6741 0.5084 7368 213
B 0.9663 0.5275 0.6824 0.5179 7506 138
C 0.9668 0.5646 0.7129 0.5538 8026 520
D 0.9661 0.5746 0.7206 0.5633 8163 137
E 0.9658 0.5831 0.7272 0.5713 8279 116
...
AH 0.9660 0.6819 0.7994 0.6659 9650 7
AI 0.9661 0.6841 0.8010 0.6681 9682 32
AJ 0.9660 0.6863 0.8025 0.6701 9711 29
AK 0.9660 0.6873 0.8032 0.6711 9726 15
AL 0.9661 0.6881 0.8037 0.6718 9736 10
AM 0.9660 0.6884 0.8039 0.6721 9740 4
AN 0.9660 0.6887 0.8041 0.6724 9744 4
20. Dictionary Alignment — Results
Substitution | Portuguese Words: Count, Percentage | Galician Words: Count, Percentage
ID 12711 15.3502% 12711 33.7475%
A 13082 15.7982% 13065 34.6874%
B 13447 16.2390% 13421 35.6326%
C 14348 17.3270% 14321 38.0220%
D 14764 17.8294% 14728 39.1026%
E 15174 18.3245% 15138 40.1912%
...
AI 17712 21.3895% 17627 46.7994%
AJ 17740 21.4233% 17648 46.8552%
AK 17765 21.4535% 17673 46.9215%
AL 17784 21.4764% 17693 46.9746%
AM 17813 21.5115% 17718 47.0410%
AN 17817 21.5163% 17722 47.0516%
DIC 20084 24.2540% 19989 53.0705%
21. Final Remarks
An approach to translate Portuguese words in a dictionary
into Galician words using a set of string substitutions;
The approach is unable to translate all words;
A considerable number of words in Dicionário-Aberto use
pre-1930 orthography, which was not dealt with;
We deliberately ignored a relevant problem, false friends, of two kinds:
two words that share only a subset of their meanings. For instance,
talho (PT) and tallo (GL) share the majority of their senses,
but some senses are specific to Portuguese
(for example, the place where meat is sold);
two words that have completely different meanings. An example
would be the word presunto (written in the same way in the
two languages), which means ham in Portuguese (a noun) but
alleged in Galician (an adjective).