Cross language alignments - challenges guidelines and gold sets

Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
technology
from seed
L2F - Spoken Language Systems Laboratory
Cross-Language Alignments:
Challenges, Guidelines and Gold Sets
Anabela Barreiro Luísa Coheur Tiago Luís
Ângela Costa Fernando Batista João Graça

Outline – Part 1
• Word alignment
• Basic concepts
• Applications
• State of the art
• Limitations
• Paraphrase alignment
• Multiword, meaning and translation unit alignment: importance
• Our task
• Alignment tool: CLUE-Aligner

Outline – Part 2
• General annotation guidelines
• Cross-linguistic major challenges to word alignment
• Annotation guidelines for multiword units and lexical and non-lexical
realization phenomena
• Pro-dropping
• Articles and zero articles
• Examples: continuous multiword units
• Examples: continuous and discontinuous support verb constructions
• Preposition-dependency (V, N and Adj)
• Active vs passive
• Choice of noun pre-modifiers
• Different PoS with same semantics (V vs process N)
• Noun adjuncts
• Coordination
• Anaphora: choice of co-referents
• Impersonal constructions
• Contractions
• Style: idiomatic vs non-idiomatic constructions
• Antonyms and negation constructions
• Romance languages double negation
• Singular vs plural
• Flexible/loose paraphrasing constructions
• Idiosyncrasies of each language

Outline – Part 3
• Our contribution
• Annotation process
• Preliminary results
• Discussion
• Future work

Word Alignment: Basic Concepts
• Objects representing the mapping between words (or expressions)
that are semantically equivalent in the source and target
sentences of a parallel corpus [Brown et al., 1990]
– Matrix of n × m entries, where n is a position in the source sentence and
m is a position in the target sentence. An entry a(n,m) in that matrix
specifies whether the word at position n in the source sentence is part of
a translation of the word at position m in the target sentence
• Task of word alignment - identifying translational equivalences
(= semantic correspondences) in the aligned sentence pairs of
a parallel text [Hearne & Way, 2011]
• Translational equivalences - graphically represented in a grid
by the intersection of single segments (individual words) or
blocks (semantico-syntactic units, phrases, expressions)

Word Alignment: Basic Concepts
• Sure alignment (S-alignment)
– Unambiguous and valid in all contexts
• EN system
• ES sistema
• FR système
• PT sistema
• Possible alignment (P-alignment)
– Ambiguous and invalid in some contexts
• EN be
• ES ser/estar/haber/existir
• FR être/avoir/exister
• PT ser/estar/haver/existir
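The matrix view and the sure/possible distinction can be sketched together in a few lines of Python. This is only an illustration, not the representation used by CLUE-Aligner: the sentence pair, the links, and the names `NONE`, `POSSIBLE`, `SURE` and `build_matrix` are assumptions made for the example.

```python
# Illustrative sketch: an alignment matrix for one sentence pair.
# The sentence pair and the links below are made up for the example.
NONE, POSSIBLE, SURE = 0, 1, 2

def build_matrix(source, target, links):
    """Return an n x m matrix; entry [i][j] holds the link type between
    source word i and target word j (NONE when they are not aligned)."""
    matrix = [[NONE] * len(target) for _ in source]
    for i, j, kind in links:
        matrix[i][j] = kind
    return matrix

source = ["the", "system", "is", "new"]
target = ["o", "sistema", "é", "novo"]
links = [
    (0, 0, SURE),
    (1, 1, SURE),      # system = sistema: valid in all contexts
    (2, 2, POSSIBLE),  # be = ser/estar/haver/existir: depends on context
    (3, 3, SURE),
]

matrix = build_matrix(source, target, links)
for word, row in zip(source, matrix):
    print(f"{word:>8}", row)
```

A block alignment could be stored in the same structure by marking every cell in the rectangle spanned by the two expressions.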

Word Alignment: Applications
• Statistical machine translation
– [Brown et al., 1990] – statistical machine translation
– [Och and Ney, 2004] – phrase-based machine translation
– [Galley et al., 2004] – syntax-based machine translation
• Projection of annotations
• Extraction of bilingual lexica
• Evaluation of machine translation systems

Word Alignment: State of the Art
• Workshops and evaluation tasks (multi-language)
– http://www.cse.unt.edu/~rada/wp/
– http://www.statmt.org/wpt05
– http://www.lpl.univ-aix.fr/projects/arcade
• Projects
– Blinker project (French-English)
http://nlp.cs.nyu.edu/blinker/
• Guidelines
[Melamed, 1998] [Och and Ney, 2000]
[Lambert et al., 2005] [Kruijff-Korbayová et al., 2006]
[Graça et al., 2004]

Word Alignment: Limitations
• Language does not operate on a word-for-word basis
• A large number of words cannot be dissociated from the units they belong to
– Multiword units
• [Gross and Senellart, 1998] – over 40% of one year of Le Monde text is MWU
• [Sag et al., 2002] – 50-70% of specialized lexica are MWU
• [Ramisch et al., 2010] – 56.7% of terms in the Genia corpus have 2+
words (not including general-purpose MWU, e.g., generic compounds,
lexical bundles, phrasal verbs, fixed expressions, which also occur in
domain-specific texts)
– Translation units
– Meaning units
– Paraphrases
• Segment and block alignment (sure and possible)

Example: Segment and Block Alignment (Sure and Possible)

Paraphrase Alignment
• Monolingual
– [Callison-Burch et al., 2006]
• Annotation guidelines for paraphrase alignment
• Paraphrases - sentences that convey the same meaning but are
worded differently
• Alignment of words, phrases, expressions, within the same language
• Bilingual = (non-literal) translation
– Need to account for paraphrases across languages

Multiword, Meaning and Translation Unit Alignment: Importance
• Publicly available manual word alignments are restricted
to a few language pairs
• Manual word alignments are a desired resource
– Evaluation of word alignment algorithms
– Training of supervised and semi-supervised algorithms
– Tuning of parameters for different types of model
• However, the “name”, “concept” and “techniques” of alignment need
to be more linguistically sophisticated to be more useful and
to help improve machine translation!
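One common way gold alignments with sure and possible links are used to evaluate alignment algorithms is the alignment error rate (AER) introduced by Och and Ney. The sketch below assumes links are stored as (source index, target index) pairs; the function name `alignment_scores` and the toy link sets are hypothetical.

```python
def alignment_scores(sure, possible, predicted):
    """Precision, recall and AER of predicted links against gold links.

    All arguments are sets of (source_index, target_index) pairs.
    """
    if not sure or not predicted:
        raise ValueError("sure and predicted link sets must be non-empty")
    possible = possible | sure       # every sure link also counts as possible
    a_s = len(predicted & sure)      # predicted links that are sure
    a_p = len(predicted & possible)  # predicted links that are at least possible
    precision = a_p / len(predicted)
    recall = a_s / len(sure)
    aer = 1 - (a_s + a_p) / (len(predicted) + len(sure))
    return precision, recall, aer

# Toy example: three sure links, one possible link, four predicted links.
gold_sure = {(0, 0), (1, 1), (3, 3)}
gold_possible = {(2, 2)}
predicted = {(0, 0), (1, 1), (2, 2), (3, 2)}

precision, recall, aer = alignment_scores(gold_sure, gold_possible, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} AER={aer:.3f}")
```

Under this measure a predicted link that matches only a possible gold link is not penalized in precision, while recall is computed over sure links alone.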

Our Task
• EuroParl corpus [Koehn, 2005]
• 6 gold alignment sets
– 400 sentence alignments per set (400 × 6 = 2,400)
• Languages: English, French, Portuguese and Spanish
– Language pairs: [en-es], [en-fr], [en-pt], [es-fr], [pt-es], [pt-fr]
• Guidelines for multi-language manual word annotations
(with inter-annotator agreement)
• Linguistically-informed (and linguistically-motivated) cross-
language multiword unit and paraphrase alignment
(translation unit alignment)
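The count of six gold sets follows from taking every unordered pair of the four languages. A small sketch of that arithmetic, with hypothetical variable names:

```python
from itertools import combinations

languages = ["en", "es", "fr", "pt"]
SENTENCES_PER_SET = 400

# Unordered language pairs; the slides write two of them as pt-es and pt-fr,
# which is the same pair in the opposite orientation.
pairs = list(combinations(languages, 2))
total_alignments = len(pairs) * SENTENCES_PER_SET

print(pairs)
print(len(pairs), total_alignments)
```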

CLUE-Aligner Alignment Tool
CLUE-Aligner = Cross-Language Unit Elicitation Aligner
• Helps reduce ambiguity in the alignment process
• Facilitates the alignment of translation units

Major Challenges (4 different classes)
• semantico-discursive
– emphatic linguistic constructions
• tautology
• pleonasm and repetition
• focus constructions
• lexical and semantico-syntactic
– multiword units
– compound verbs
– prepositional predicates

Major Challenges (4 different classes)
• morphological
– contracted forms
– lexical versus non-lexical realization
• articles and zero articles
• pro-dropping
– subject pronoun drop
– empty relative pronoun
• morpho-syntactic
– free noun adjuncts

General Annotation Guidelines
Linguistic phenomenon   No alignment   P-alignment
Incomplete or non-translation   X
Incorrect translation and typo   X*
Approximate correspondence (numeric)   X
Non-obligatory linguistic structure
– Pleonasm   X
– Repetition of words or expressions   X
– Redundancy or additional/extra information   X
Mismatching pronoun, determiner, verbs, etc.   X
Abbreviations versus full word   X
Punctuation mark
– Different but correct   X
– Incorrect / mismatch   X
– Missing   X
* If a multiword unit is incorrectly translated or contains a typo, none of its internal segments are aligned

Annotation Guidelines
Linguistic phenomenon   No alignment   Block-alignment (S-align, P-align)
Multiword unit
– continuous   X X
– discontinuous   X*
Lexical versus non-lexical realization
– article + N versus zero-article + N (EN Ø people = PT as pessoas)   X
– pro-drop + V versus pronoun + V (EN I went = PT Ø fui)   X
– empty relative pronoun versus realized relative pronoun (EN N that I met = N I met = PT que (eu) conheci)   X
– relative versus participial adjective (EN that was written = written = PT escrito)   X
* Some discontinuous multiword units are candidates for block-alignment (e.g., when the number of inserts is small or the multiword unit is “semi-frozen”)

Annotation Guidelines
Continuous multiword units   Block-S-alignment   Block-P-alignment
Support verb construction   X X
Compound   X X
Phrasal verb   X X
Named entity   X X
Date and time expression   X
Lexical bundle   X
Idiomatic expression   X
Domain term   X
French negation (ne pas)   X
English infinitive (to + V)   X X
[Barreiro, 2008] presents a detailed description and examples of the different types of multiword unit

Example: Continuous Support Verb Constructions (alignment)
ES aprueba plenamente
FR approuve pleinement

Example: Discontinuous Support Verb Constructions (no alignment)
ES para que acelere la directiva sobre pensiones complementarias
FR pour faire avancer la directive sur les pensions complémentaires

Cross-Linguistic Challenges
• Prepositional predicates
EN I too should like to congratulate [NE] on his excellent report
ES también yo quisiera felicitar a mi colega [NE] por su excelente informe
FR je voudrais féliciter moi aussi mon collègue [NE] pour son excellent
rapport
PT também eu gostaria de felicitar o meu colega [NE] pelo seu excelente
relatório
EN […] our Asian partners prefer to deal with questions which unite us
ES […] nuestros socios asiáticos prefieren dedicarse a las cuestiones que
nos unen
FR […] nos partenaires asiatiques préfèrent s’attacher à ce qui nous unit
PT […] os nossos parceiros asiáticos preferem centrar-se unicamente nas
questões comuns
Segment S-alignment
Impossible to annotate discontinuous preposition-dependency
Block P-alignment

• Prepositional verbs
agree with belong to forgive s/o for pay for stand for
aim at/for choose between hope for prepare for thank s/o for
allow for comment on insist on prevent s/o from think of/about
apologise for compare with interfere with/in provide s/o with volunteer to
apply for complain about joke about refer to wait for
approve of concentrate on laugh at rely on warn s/o about
argue with/about congratulate on lend s/th to s/o run for worry about
ask for consist of listen to smile at
attend to deal with long for succeed in
believe in decide on object to suffer from

• Prepositional nouns
attack on attitude towards in agreement on strike
cruelty towards comparison between on average in trouble
difficulty in/with decrease in on condition on behalf of
knowledge of disadvantage of delay in connection between
reason for increase in in doubt difference between/of
rise in preference for information about under guarantee
solution to reduction in need for in power
use of at risk protection from reaction to
in a hurry at stake report on result of
in practice in theory room for trouble with

• Prepositional adjectives
delighted at/about frightened of opposed to similar to
different from friendly with pleased with sorry for/about
dissatisfied with good at popular with suspicious of
doubtful about guilty of proud of sympathetic to(wards)
enthusiastic about incapable of puzzled by/about tired of
envious of interested in safe from typical of
excited about jealous of satisfied with unaware of
famous for keen on sensitive to(wards) used to
fed up with kind to serious about
fond of mad at/about sick of

• Noun Adjuncts
– Compounds
• EN European investment bank [Adj N N]
• PT banco europeu de investimento [N Adj [de N]]
– Free noun phrases (not compounds)
• EN presidency communication [N N]
• PT comunicação da presidência [N [de N]]
Block S-alignment
Segment S-alignment
Block-P-alignment
of [de N]

• Contractions
– two or more words with different parts-of-speech overlap, which
makes syntactic analysis and generation difficult
– in cross-language analysis, the contrast between languages that
have contractions and languages that do not have them, or do not
have them in the same contexts, presents additional difficulties
– The alignment of one segment that corresponds to a contracted form
in one language with the corresponding segments where elements
are not contracted in the other language of the parallel pair is
pragmatically motivated

Example: Contractions (block-P-alignment)
Interference with the support verb construction
EN to make a reference to
PT fazer uma referência a

Example: Contractions (block-P-alignment)
Interference with the support verb construction
ES hacer una referencia a
FR faire référence à

• Singular versus plural (related to determiner)
EN in every official language of the union
ES en todos los idiomas oficiales de la unión
FR dans toutes les langues officielles de l'union
PT em cada uma das línguas oficiais da união
• Active versus passive
EN before new member states are admitted
ES antes de la incorporación de nuevos miembros
FR avant l'admission de nouveaux membres
PT antes da entrada de novos membros
Block or segment P-alignment
Block-S-alignment if there is some fixedness (such as in this case)
Block P-alignment

• Coordination
EN which we will send to the council and Ø parliament
ES que enviaremos al consejo y al parlamento
FR qui sera envoyée au conseil et au parlement
PT que remeterá ao conselho e ao parlamento
• Style: idiomatic versus non-idiomatic
EN which began four years ago
ES que empezó hace cuatro años
FR qui a vu le jour il y a quatre ans
PT que se iniciou há quatro anos
No alignment
Block P-alignment

• Choice of noun pre-modifiers
EN we should use that public funding for those types of project which are
most difficult to finance through the private sector
ES deberíamos utilizar esa financiación pública para aquel tipo de proyectos
que tienen mayor dificultad para ser financiados por el sector privado
FR nous devrions recourir au financement public pour les projets que le
secteur privé boude
PT o financiamento público deveria ser utilizado para os projectos que
registam maiores dificuldades em serem financiados pelo sector privado
Block P-alignment
EN despite certain difficulties
PT apesar das dificuldades

• Anaphora - choice of co-referents (noun versus pronoun)
EN it is not acceptable that we assisted Korea during the Asean crisis by
means of IMF loans and suchlike, only for Korea still to be subsidising its
shipyards
ES no resulta procedente que hayamos ayudado a Corea en la crisis de la
Asean a través de préstamos del FMI, etc. y que Corea siga
subvencionando sus astilleros
FR il n’est pas acceptable que nous ayons aidé la Corée dans la crise de
l’Anase, avec des prêts du FMI, etc. et qu’elle continue à subventionner
ses chantiers navals
PT é inadmissível que, depois de termos ajudado a Coreia, através de
créditos do FMI, etc., na crise da Asean, este país continue a
subvencionar agora os seus estaleiros navais
Segment or block P-alignment

• Antonyms and negation constructions
EN the countries of Asia have not unfortunately been in favour of that
proposal
ES los países de Asia desgraciadamente no han sido favorables a dicha
propuesta
FR les pays d'Asie ont malheureusement rejeté cette proposition
PT os países da Ásia, infelizmente, não se mostraram favoráveis a esta
proposta
Block S-alignment together with adverb (insert in EN and FR)

• Flexible/loose paraphrasing constructions
EN and we shall vote against it
ES y merece nuestra condena
FR et dénonçons
PT e merece a nossa condenação
EN 1993 was a significant year
ES el año 1993 es una fecha notable
FR l’année 1993 est à marquer d’une pierre blanche
PT 1993 é uma data charneira
Block P-alignment

• Different parts-of-speech with same semantics (verbs versus
process nouns)
EN we must use all the financial instruments at our disposal to rapidly
develop the market
ES es preciso utilizar todos los instrumentos financieros disponibles para un
rápido desarrollo ulterior del mercado
FR il faut utiliser tous les instruments financiers disponibles pour
développer rapidement le marché
PT todos os instrumentos financeiros disponíveis deverão ser aplicados
para continuar a desenvolver rapidamente o mercado
Block S-alignment (with internal segment P-alignments)
EN and PT :
Segment S-alignment
No alignment of [continuar a]

• Impersonal constructions
(+ “impersonal” relative versus participial adjective)
EN we must fully support the demands that have been made
ES hay que apoyar plenamente las exigencias que se han formulado
FR il faut par conséquent appuyer les requêtes formulées
PT as reivindicações formuladas deverão ser plenamente apoiadas
Block P-alignment
Internal P-alignment
EN we must
ES hay que
FR il faut
Internal segment S-alignment - adverb and verb (EN, ES, FR)
Internal segment P-alignment - verb (PT)

• Romance languages double negation (+ coordination)
EN it is not, therefore, surprising that there is, in this context, no real
integration or genuine political dialogue
ES no es nada sorprendente, entonces, que en ese contexto, no haya ni
verdadera integración ni verdadero diálogo político
FR rien d’étonnant donc, qu'il n'y ait dans ce contexte, ni intégration
véritable, ni dialogue politique véritable
PT assim, não é de espantar que, nesse contexto, não exista verdadeira
integração nem verdadeiro diálogo político
Block P-alignment of the relative existential with adverbial (insert)
EN that there is, in this context, no
ES que en ese contexto, no haya
FR qu’il n’y ait dans ce contexte
PT que, nesse contexto, não exista
Segment P-alignment of negation
and negation connector
EN no – or
ES ni – ni
FR n’ – ni
PT Ø - nem

• Idiosyncrasies of languages
• Portuguese inflected infinitive (peculiar verb tense)
• English to+Infinitive
• French negation
• English apostrophe
• …
• Sociolinguistic differences

Our Contribution
• Tool CLUE-Aligner
• Annotated corpora
• Cross-language resources – gold collection
Publicly available on the META-NET website:
http://metanet4u.l2f.inesc-id.pt/
• Guidelines
– http://www.inesc-id.pt/ficheiros/publicacoes/8204.pdf

Annotation Process
• Annotation of 400 x 6 (2,400 sentence alignments) by a
linguist
• Alignment of a subset by a second linguist (25
sentences of the English-Portuguese language pair)
• Inter-annotator agreement

Preliminary Results
language   words   avg. words
en         11158   27.9
es         11664   29.2
fr         12464   31.2
pt         11649   29.1

pair    Sure   Possible   Total
en-pt   6684   418        7102
en-fr   7025   569        7594
en-es   7636   399        8035
es-fr   7477   767        8244
pt-es   7958   557        8515
pt-fr   7029   782        7811

pair    Sure   Possible   Total
en-pt   2588   602        3190
en-fr   3865   414        4279
en-es   3551   351        3902
es-fr   3516   495        4011
pt-es   3162   382        3544
pt-fr   3253   698        3951

Block (MWU) alignment   Segment (word) alignment
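The figures above can be cross-checked with a short script. This is a sketch under two assumptions: that "avg. words" means words per sentence over the 400 sentences of a set, and that each Total is Sure plus Possible. The text does not make clear which of the two alignment tables is the block table and which is the segment table, so they are simply named `first_table` and `second_table` here.

```python
SENTENCES_PER_SET = 400  # assumption: averages are words per sentence

words = {"en": 11158, "es": 11664, "fr": 12464, "pt": 11649}
reported_avg = {"en": 27.9, "es": 29.2, "fr": 31.2, "pt": 29.1}

# pair -> (sure, possible, total), copied from the slide
first_table = {
    "en-pt": (6684, 418, 7102), "en-fr": (7025, 569, 7594),
    "en-es": (7636, 399, 8035), "es-fr": (7477, 767, 8244),
    "pt-es": (7958, 557, 8515), "pt-fr": (7029, 782, 7811),
}
second_table = {
    "en-pt": (2588, 602, 3190), "en-fr": (3865, 414, 4279),
    "en-es": (3551, 351, 3902), "es-fr": (3516, 495, 4011),
    "pt-es": (3162, 382, 3544), "pt-fr": (3253, 698, 3951),
}

def totals_consistent(table):
    """True when Sure + Possible equals Total in every row."""
    return all(s + p == t for s, p, t in table.values())

def possible_share(table):
    """Fraction of possible alignments per language pair."""
    return {pair: p / t for pair, (s, p, t) in table.items()}

averages_ok = all(
    abs(words[lang] / SENTENCES_PER_SET - reported_avg[lang]) <= 0.05
    for lang in words
)

print(totals_consistent(first_table), totals_consistent(second_table), averages_ok)
for pair, share in possible_share(second_table).items():
    print(pair, f"{share:.1%}")
```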

Inter-annotator Agreement
• Statistical significance for kappa is rarely reported. However, a
number of magnitude guidelines have appeared in the literature.
– Landis & Koch (1977) consider
• kappas between .4 and .6 as moderate agreement
• kappas between .8 and 1 as almost perfect agreement
– Fleiss (1981) (equally arbitrary guidelines) characterizes
• kappas from .40 to .75 as fair to good
• kappas over .75 as excellent
• This set of guidelines is, however, by no means universally accepted
Cohen's kappa coefficient
Multi-word units (MWU) 0.541
Word alignments (WA) 0.984
Total 0.871
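For reference, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance. The sketch below computes it for two label sequences and maps a value onto the full Landis & Koch scale (of which the slide quotes two bands). The toy label sequences and the function names are assumptions for illustration, not the annotation data behind the reported figures.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1:  # both annotators used one and the same label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

def landis_koch(kappa):
    """Verbal label for a kappa value on the Landis & Koch (1977) scale."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"

# Toy annotations: S = sure, P = possible, N = no alignment.
annotator_1 = ["S", "S", "P", "N", "S", "N", "P", "S", "N", "S"]
annotator_2 = ["S", "S", "P", "N", "P", "N", "P", "S", "S", "S"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"{kappa:.3f}", landis_koch(kappa))

# The reported values, read against the same scale.
for name, value in [("MWU", 0.541), ("WA", 0.984), ("Total", 0.871)]:
    print(name, value, landis_koch(value))
```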

Discussion
• Difficulties in analyzing fluency, stylistics (including word order),
paraphrase, etc.
• Alignments do not always work bi-directionally (sometimes the source-
target direction for a language pair matters)
• Levels of alignment and ranking systems (n-grams, morphology,
semantico-syntactic level, phrase, paraphrase, etc.)
• Terminology imprecision is found in corpora (it leads to poor quality
machine translation)

Future Work
• Integration of lexica (multiword units, etc.) obtained via the use of local
grammars – use multiword units as ONE (1) segment of alignment,
whenever that is possible (contiguous, etc.)
• Pre-processing of contractions and post-processing of elements that
need to be contracted is important if applied to machine translation or
to create “more polished” lexica
• Evaluation of the current alignments in a statistical machine translation
system to see if translation quality improves
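The contraction pre- and post-processing step can be pictured with a naive lookup table for Portuguese. This is a minimal sketch, not the authors' pipeline: the table is deliberately small, the function names are hypothetical, and ambiguous forms (for example "nos", which can also be a pronoun) are left out because a real system would need context to resolve them.

```python
# Naive sketch: split Portuguese contractions before alignment and
# re-contract afterwards. The table is illustrative and far from complete.
PT_CONTRACTIONS = {
    "ao": ("a", "o"), "aos": ("a", "os"),
    "do": ("de", "o"), "da": ("de", "a"),
    "dos": ("de", "os"), "das": ("de", "as"),
    "na": ("em", "a"), "nas": ("em", "as"),
    "pelo": ("por", "o"), "pela": ("por", "a"),
}
PT_EXPANSIONS = {parts: form for form, parts in PT_CONTRACTIONS.items()}

def split_contractions(tokens):
    """Pre-processing: replace each known contraction by its two parts."""
    out = []
    for tok in tokens:
        out.extend(PT_CONTRACTIONS.get(tok.lower(), (tok,)))
    return out

def join_contractions(tokens):
    """Post-processing: contract adjacent preposition + article pairs."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PT_EXPANSIONS:
            out.append(PT_EXPANSIONS[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

sentence = "que remeterá ao conselho e ao parlamento".split()
expanded = split_contractions(sentence)
print(expanded)
print(join_contractions(expanded))
```

After splitting, each preposition and article is its own segment, so "ao" no longer has to be block-aligned with a two-word sequence in a language that does not contract in that context.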

Future Work
• Machine learning of recognition and alignment of multiword units
– based on segment alignments, i.e., individual words inside the
multiword unit
– based on multiword units of a parallel sentence in another language or
language pair alignment
• Use of local grammars that identify and process discontinuous
multiword units and other complex linguistic phenomena to combine
with word alignment techniques – how to combine?

Main Conclusion
• Bringing linguistics into SMT at the start is the first, inevitable place
where hybridization should be possible.
• We believe that it would be productive to convert the texts on both sides of
a translation pair into a common semantico-syntactic
representation before applying statistics to them. For this, each
language would need a parser capable of producing
homogeneous output.
• If this common representation were available, it would open up vast
possibilities for multilingual SMT.

Thank you!
