Dictionary Alignment by Rewrite-based Entry Translation

Alberto Simões (1), Xavier Gómez Guinovart (2)
(1) Centro de Estudos Humanísticos, Universidade do Minho
Campus de Gualtar, Braga, Portugal
ambs@ilch.uminho.pt
(2) Galician Language Technologies and Applications (TALG Group)
Universidade de Vigo, Galiza, Spain
xgg@uvigo.es
SLATE 2013
Alberto Simões, Xavier Gómez Guinovart: Dictionary Alignment by Rewrite-based Entry Translation
Motivation
We have a running project, Dicionário-Aberto (DA), that lets users consult a Portuguese dictionary;
Dicionário-Aberto is also available in TEI and DB formats;
Within the GALNET project, a Galician Synonyms Dictionary (GSD) was converted from a WYSIWYG format to a rich TEI format;
Would it be possible to integrate the GSD into DA?
Problem
Dicionário-Aberto has more than a hundred thousand entries!
The Galician Synonyms Dictionary is smaller, with some tens of thousands of entries.
Problem: how to align entries from both dictionaries?
The two languages are very close;
That helps with concept alignment!
There are too many different words;
There is a considerable set of false friends;
There isn't a free and sufficiently large translation dictionary.
Inspiration (part 1)
In the first year, “s” will be used instead of the soft “c.” Sertainly, sivil servants will resieve this news with joy. Also, the hard “c” will be replaced with “k”. Not only will this klear up konfusion, but typewriters kan have one less letter.
There will be growing publik enthusiasm in the sekond year, when the troublesome “ph” will be replaced by “f”. This will make words like “fotograf” 20 persent shorter.
In the third year, publik akseptanse of the new spelling kan be expekted to reach the stage where more komplikated changes are possible. Governments will enkorage the removal of double letters, which have always ben a deterent to akurate speling. Also, al wil agre that the horible mes of silent “e”s in the languag is disgrasful, and they would go.
Inspiration (part 2)
By the fourth year, peopl wil be reseptiv to steps such as replasing “th” by “z” and “w” by “v”.
During ze fifz year, ze unesesary “o” kan be dropd from vords kontaining “ou”, and similar changes vud of kors be aplid to ozer kombinations of leters.
After zis fifz yer, ve vil hav a reli sensibl riten styl. Zer vil be no mor trubls or difikultis and evrivun vil find it ezi tu understand ech ozer. Ze drem vil finali kum tru!!
Approach
Define a translation function based on a set or sequence of text transformations (mainly substitutions) that convert (translate) Portuguese words into Galician words.
The translation function is defined as
$T(L_{gl}, w_{pt}) = w_{gl}$
$L_{gl}$ is the target Galician lexicon, obtained from the words present in the Galician Synonyms Dictionary;
$w_{pt}$ is the Portuguese word being translated;
$w_{gl}$ is the Galician translation.
Translation Function
Substitutions can be simple, as:
ss > s (passo > paso)
j > x (sujeito > suxeito, injectar > inxectar)
z ([eiéíêî]) > c (bronze > bronce)
Substitutions can over-generate:
-ção > -ción, -zón (adivinhação > adiviñación, coração > corazón)
-velmente > -belmente, -blemente (previsivelmente > previsibelmente, previsiblemente)
rv > rv, rb (preservação > preservación, estorvar > estorbar)
Translation Function
A word without substitutions can be a valid translation;
Substitutions can be inter-dependent (for example, -ção > -ción should be applied before ç > z);
Substitutions are applied from more generic to more specific (unless there is interdependence);
Substitutions can generate more than one possible translation;
The first candidate translation that exists in the target lexicon is returned.
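The behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `RULES`, `TOY_LEXICON`, `candidates` and `translate` are hypothetical, the rule subset and the toy lexicon are chosen just to cover the examples on the slides, and the way candidates are ordered (rule alternatives in the order listed, unchanged word appended last) is a guess.

```python
import re

# Hypothetical subset of the substitution rules, in application order.
# Each rule is (regex, list of alternative replacements).
RULES = [
    (r"ss", ["s"]),               # A
    (r"j", ["x"]),                # B
    (r"ção$", ["ción", "zón"]),   # C: over-generates, must run before D
    (r"ç", ["z"]),                # D
    (r"nh", ["ñ"]),               # E
    (r"z(?=[eiéíêî])", ["c"]),    # G: z only when followed by a front vowel
    (r"rv", ["rv", "rb"]),        # AM: over-generates
]

# Toy Galician lexicon (assumption: the real one comes from the GSD).
TOY_LEXICON = {"paso", "suxeito", "adiviñación", "corazón",
               "bronce", "estorbar", "preservación", "casa"}


def candidates(word, rules=RULES):
    """Apply every rule in order, keeping all alternatives a rule produces."""
    forms = [word]
    for pattern, alternatives in rules:
        regex = re.compile(pattern)
        next_forms = []
        for form in forms:
            if regex.search(form):
                rewritten = [regex.sub(alt, form) for alt in alternatives]
            else:
                rewritten = [form]  # rule does not apply: keep the form as is
            for new_form in rewritten:
                if new_form not in next_forms:
                    next_forms.append(new_form)
        forms = next_forms
    if word not in forms:
        forms.append(word)  # the unchanged word may itself be a translation
    return forms


def translate(lexicon, word, rules=RULES):
    """Return the first candidate found in the target lexicon, else None."""
    for form in candidates(word, rules):
        if form in lexicon:
            return form
    return None


for pt_word in ["passo", "coração", "estorvar", "casa"]:
    print(pt_word, "->", translate(TOY_LEXICON, pt_word))
```

Filtering through the lexicon is what makes over-generation harmless in this sketch: `coração` yields both `coración` and `corazón`, and only the second survives the lookup.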
Translation Function
Id.  Substitution
ID   (identity: no substitution)
A    ss > s
B    j > x
C    -ção > -ción, -zón
D    ç > z
E    nh > ñ
F    -dizer > -dicir
G    z ([eiéíêî]) > c
H    lh > ll
I    vr > br
J    -agem > -axe
K    g ([eiéíêî]) > x
L    -ável > -ábel, -able
M    -ível > -íbel, -ible
N    -velmente > -belmente, -blemente
O    -eio > -eo
P    -ância > -ancia
Q    -ência > -encia
R    -aria > -ería, -aría
S    -ário > -ario
T    -óri[oa] > -ori[oa]
U    -são > -sión, -són
V    -rão > -rón, -rán
W    -mão > -món, -mán
X    -ião > -ión, -ián
Y    -ício > -icio
Z    -óide > -oide
AA   -ídio > -idio
AB   -ânico > -ánico
AC   -édia > -edia
AD   -cimento > -cemento
AE   -m > -n
AF   -crever > -cribir
AG   -u > -u, -o
AH   -var > -bar
AI   im- > im-, inm-
AJ   qua- > cua-, ca-
AK   qua > cua
AL   -xão > -xón, -xión
AM   rv > rv, rb
AN   -iver > -ivir
Evaluation 1
Given a small (about 9K pairs) hand-curated translation dictionary...
Compute the Type I/II error table for $T(L_{gl}, w_{pt}) = w_{gl}$:

                                    Correct   Incorrect
$w_{gl}$ is a Galician word           TP        FP
$w_{gl}$ is not a Galician word       TN        FN

TP (True Positive): correct translation;
FP (False Positive): wrong translation, but the obtained word is present in the Galician lexicon;
TN (True Negative): correct translation, but the translation is not in the Galician lexicon (always 0);
FN (False Negative): wrong translation, and the obtained word is not in the Galician lexicon.
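As a rough illustration of the table above, the four outcomes can be assigned mechanically once each Portuguese word has a candidate translation and a set of accepted (gold) translations. The sketch below is one possible reading, not the authors' evaluation script: `classify` and `tally` are hypothetical names, a missing candidate (`None`) is treated as "not a Galician word", and the sample words are toy data rather than entries from the evaluation dictionary.

```python
from collections import Counter


def classify(candidate, gold, lexicon):
    """Map one result to TP, FP, TN or FN, following the slide's table."""
    in_lexicon = candidate is not None and candidate in lexicon
    correct = candidate is not None and candidate in gold
    if in_lexicon:
        return "TP" if correct else "FP"
    return "TN" if correct else "FN"


def tally(results, lexicon):
    """Count outcomes over (candidate, gold_translations) pairs."""
    return Counter(classify(cand, gold, lexicon) for cand, gold in results)


# Toy data (assumption): three Portuguese words and what was obtained.
toy_lexicon = {"paso", "presunto", "fiestra"}
toy_results = [
    ("paso", {"paso"}),        # passo: correct and in the lexicon
    ("presunto", {"xamón"}),   # presunto: false friend, in the lexicon but wrong
    (None, {"fiestra"}),       # janela: no candidate found
]
print(tally(toy_results, toy_lexicon))
```

Because a translation function that only returns lexicon words can never produce a correct word outside the lexicon, the TN count stays at 0, as the slide notes.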
Evaluation 1 — Measures
$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
$$\text{precision} = \frac{TP}{TP + FP} \tag{2}$$
$$\text{recall} = \frac{TP}{TP + FN} \tag{3}$$
$$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{4}$$
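The four measures are straightforward to compute from the counts. In the sketch below, `metrics` is a hypothetical helper, and the example counts need a loud caveat: TP = 5390 is the "Correct" value of the ID row in the results, but FP = 25 and FN = 3810 are not reported on the slides. They are illustrative values back-solved so that the measures round to that row's figures.

```python
def metrics(tp, fp, tn, fn):
    """Compute equations (1) to (4) from the four outcome counts.

    Assumes tp > 0, so that none of the denominators is zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# FP and FN are hypothetical (not in the slides); TN is 0 by construction.
example = metrics(tp=5390, fp=25, tn=0, fn=3810)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

With TN fixed at 0, accuracy reduces to TP / (TP + FP + FN), which is why it tracks recall so closely in the results while precision stays almost flat.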
Evaluation 1 — Results
Id. Precision Recall F1 Accuracy Correct ∆
ID 0.9954 0.5859 0.7376 0.5843 5390 5390
A 0.9952 0.6038 0.7516 0.6020 5553 163
B 0.9951 0.6158 0.7608 0.6139 5663 110
C 0.9952 0.6567 0.7912 0.6546 6038 375
D 0.9951 0.6687 0.7999 0.6665 6148 110
E 0.9952 0.6782 0.8066 0.6760 6235 87
F 0.9952 0.6786 0.8070 0.6764 6239 4
G 0.9953 0.6838 0.8107 0.6816 6287 48
H 0.9953 0.6927 0.8169 0.6905 6369 82
I 0.9953 0.6934 0.8174 0.6911 6375 6
J 0.9953 0.6964 0.8195 0.6942 6403 28
K 0.9955 0.7210 0.8363 0.7187 6629 226
L 0.9955 0.7256 0.8394 0.7232 6671 42
M 0.9955 0.7284 0.8413 0.7260 6697 26
N 0.9957 0.7482 0.8544 0.7458 6879 182
O 0.9957 0.7496 0.8553 0.7472 6892 13
P 0.9957 0.7515 0.8565 0.7490 6909 17
Q 0.9957 0.7588 0.8612 0.7563 6976 67
R 0.9957 0.7602 0.8621 0.7577 6989 13
S 0.9958 0.7680 0.8672 0.7655 7061 72
T 0.9958 0.7703 0.8686 0.7678 7082 21
U 0.9958 0.7772 0.8731 0.7747 7146 64
V 0.9958 0.7780 0.8735 0.7755 7153 7
W 0.9958 0.7783 0.8737 0.7758 7156 3
X 0.9958 0.7796 0.8746 0.7771 7168 12
Y 0.9958 0.7806 0.8752 0.7781 7177 9
Z 0.9958 0.7807 0.8753 0.7782 7178 1
AA 0.9958 0.7813 0.8756 0.7787 7183 5
AB 0.9958 0.7818 0.8759 0.7793 7188 5
AC 0.9958 0.7822 0.8762 0.7797 7192 4
AD 0.9959 0.7836 0.8770 0.7810 7204 12
AE 0.9959 0.7855 0.8783 0.7830 7222 18
AF 0.9959 0.7863 0.8787 0.7837 7229 7
AG 0.9957 0.7876 0.8795 0.7849 7240 11
AH 0.9957 0.7882 0.8799 0.7856 7246 6
AI 0.9958 0.7903 0.8812 0.7876 7265 19
AJ 0.9956 0.7928 0.8827 0.7900 7287 22
AK 0.9956 0.7940 0.8834 0.7912 7298 11
AL 0.9956 0.7947 0.8839 0.7920 7305 7
AM 0.9956 0.7951 0.8842 0.7924 7309 4
AN 0.9956 0.7955 0.8844 0.7927 7312 3
Evaluation 2
Triangulating a bigger dictionary for evaluation purposes:
PT–SP (12 340) ◦ SP–GL (7 581) ⇒ PT–GL (5 045 pairs)
    from the Apertium translation software
PT–SP ◦ SP–EN (24 912) ◦ EN–GL (17 626) ⇒ PT–GL (6 644 pairs)
    PT–SP and SP–EN from Apertium, EN–GL from CLUVI
PT–EN (14 600) ◦ EN–GL ⇒ PT–GL (8 589 pairs)
    PT–EN from a merchandising app, EN–GL from CLUVI
Merging the three dictionaries resulted in 14 492 pairs.
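Triangulation here is composition of bilingual dictionaries through a pivot language. A minimal sketch of the idea follows, assuming each dictionary is a mapping from a source word to a set of target words; `compose`, `merge` and `count_pairs` are hypothetical names, and the entries are toy examples, not data from Apertium or CLUVI.

```python
from collections import defaultdict


def compose(first, second):
    """Chain two dictionaries: source -> pivot -> target."""
    result = defaultdict(set)
    for source, pivots in first.items():
        for pivot in pivots:
            for target in second.get(pivot, ()):
                result[source].add(target)
    return dict(result)


def merge(*dictionaries):
    """Union of several dictionaries that share source and target languages."""
    result = defaultdict(set)
    for dictionary in dictionaries:
        for source, targets in dictionary.items():
            result[source] |= targets
    return dict(result)


def count_pairs(dictionary):
    """Number of (source, target) pairs."""
    return sum(len(targets) for targets in dictionary.values())


# Toy entries (assumption): PT-SP, SP-GL, PT-EN and EN-GL fragments.
pt_sp = {"janela": {"ventana"}, "cão": {"perro"}, "livro": {"libro"}}
sp_gl = {"ventana": {"fiestra", "ventá"}, "perro": {"can"}}
pt_en = {"livro": {"book"}, "cão": {"dog"}}
en_gl = {"book": {"libro"}, "dog": {"can"}}

via_sp = compose(pt_sp, sp_gl)   # livro is lost: no SP-GL entry for libro
via_en = compose(pt_en, en_gl)
pt_gl = merge(via_sp, via_en)
print(count_pairs(via_sp), count_pairs(via_en), count_pairs(pt_gl))
```

The toy run shows the two effects visible in the figures above: a composed dictionary is smaller than its inputs because pivot words missing from the second dictionary drop out, and the merged dictionary is smaller than the sum of its parts because the routes overlap.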
Evaluation 2 – Results
Id. Precision Recall F1 Accuracy Correct ∆
ID 0.9668 0.5022 0.6611 0.4937 7155 7155
A 0.9664 0.5176 0.6741 0.5084 7368 213
B 0.9663 0.5275 0.6824 0.5179 7506 138
C 0.9668 0.5646 0.7129 0.5538 8026 520
D 0.9661 0.5746 0.7206 0.5633 8163 137
E 0.9658 0.5831 0.7272 0.5713 8279 116
...
AH 0.9660 0.6819 0.7994 0.6659 9650 7
AI 0.9661 0.6841 0.8010 0.6681 9682 32
AJ 0.9660 0.6863 0.8025 0.6701 9711 29
AK 0.9660 0.6873 0.8032 0.6711 9726 15
AL 0.9661 0.6881 0.8037 0.6718 9736 10
AM 0.9660 0.6884 0.8039 0.6721 9740 4
AN 0.9660 0.6887 0.8041 0.6724 9744 4
Dictionary Alignment — Results
Substitution  PT words (count)  PT words (%)  GL words (count)  GL words (%)
ID 12711 15.3502% 12711 33.7475%
A 13082 15.7982% 13065 34.6874%
B 13447 16.2390% 13421 35.6326%
C 14348 17.3270% 14321 38.0220%
D 14764 17.8294% 14728 39.1026%
E 15174 18.3245% 15138 40.1912%
...
AI 17712 21.3895% 17627 46.7994%
AJ 17740 21.4233% 17648 46.8552%
AK 17765 21.4535% 17673 46.9215%
AL 17784 21.4764% 17693 46.9746%
AM 17813 21.5115% 17718 47.0410%
AN 17817 21.5163% 17722 47.0516%
DIC 20084 24.2540% 19989 53.0705%
Final Remarks
An approach to translate Portuguese dictionary words into Galician words using a set of string substitutions;
The approach is unable to translate all words;
A considerable number of words in Dicionário-Aberto use pre-1930 orthography, which was not dealt with;
We deliberately ignored a relevant problem: false friends. These come in two kinds:
two words that share only a subset of their meanings. For instance, talho (PT) and tallo (GL) share the majority of their senses, but some are specific to Portuguese (for example, the place where meat is sold);
two words that have completely different meanings. An example is presunto (written the same way in both languages), which means ham in Portuguese (a noun) but alleged in Galician (an adjective).