CLUE-Aligner: An Alignment Tool to Annotate Pairs of Paraphrastic and Translation Units

LREC - Portorož, May2th 2016
ANABELA BARREIRO
INESC-ID
FRANCISCO RAPOSO
INESC-ID / UTL
TIAGO LUÍS
VOICEINTERACTION

Introduction
Alignment
• Set of correspondences or relationships between linguistic
units that are semantico-syntactically related
– Paraphrases (found within the same language = monolingual)
• EN: to make a distinction between | EN: to distinguish between
– Translations (found in different languages = bilingual)
• EN: to keep it simple | PT: simplificar
Alignment task
• NLP task that consists of identifying translation or
paraphrastic relationships among linguistic units
(words, multiword units (MWU) or expressions) in sentence pairs that have been
identified as paraphrases or translations of each other

Sure and Possible Alignments
• Sure (S) alignments correspond to expressions/translations that
satisfy the criteria for optimum/full equivalence
• They are reciprocal: the expression can be translated
from the source to the target language and vice versa
• Optimum equivalence refers to the highest level of translation equivalence on
both linguistic and extra-linguistic levels (Bayar, 2007)
• venture capital markets | mercados de capital de risco (S)
• Possible (P) alignments correspond to expressions/translations
that satisfy the criteria for approximate equivalence
• They do not meet all of the requirements for absolute
equivalence, and they are not reciprocal with respect to the source/target
language
• began | a vu le jour (P)
(literally: has seen the day)
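The sure/possible distinction can be captured in a small data structure. The following is a minimal sketch, not CLUE-Aligner's actual storage format: the names `Confidence`, `UnitAlignment`, and `as_text` are hypothetical, and only the two example pairs come from the slide.

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    SURE = "S"      # optimum/full equivalence, reciprocal
    POSSIBLE = "P"  # approximate equivalence, not reciprocal


@dataclass(frozen=True)
class UnitAlignment:
    """One aligned pair of units (hypothetical structure, for illustration)."""
    source: str
    target: str
    confidence: Confidence

    def as_text(self) -> str:
        # Mirrors the "source | target (S)" notation used on the slide.
        return f"{self.source} | {self.target} ({self.confidence.value})"


alignments = [
    UnitAlignment("venture capital markets", "mercados de capital de risco", Confidence.SURE),
    UnitAlignment("began", "a vu le jour", Confidence.POSSIBLE),
]

# Keep only the pairs that satisfy full equivalence.
sure_only = [a for a in alignments if a.confidence is Confidence.SURE]

for a in alignments:
    print(a.as_text())
```

Filtering on the label, as in `sure_only`, is one way a downstream application could restrict itself to fully equivalent pairs.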

Background
• Supervised learning uses high-quality alignments, hand-made
by linguists (Blunsom & Cohn, 2006; Ambati et al., 2010)
– Supervised methods take into consideration context, syntax
and other grammatical and semantic information
• Guidelines for manual alignment:
– English–French - Blinker project (Melamed, 1998)
– Czech–English (Kruijff-Korbayová et al., 2006; Bojar &
Prokopová, 2006)
– Spanish–English (Lambert et al., 2005)
– Paraphrase alignment guidelines (Callison-Burch et al., 2008)

Current Shortcomings
1. Lack of multilingual datasets
– Publicly available alignments are mostly bilingual, with the
exception of 6 multilingual sets (Graça et al., 2008)
2. Lack of linguistically motivated alignment guidelines
– Previously proposed guidelines cover cross-linguistic
phenomena superficially, excluding important alignment
challenges presented by discontiguous MWU (DMWU) and
other non-adjacent linguistic phenomena or syntactic
discontinuity (e.g., extraposition, topicalization)
3. Lack of tools
– Existing tools handle DMWU and phrasal expressions inefficiently;
these are complex to align and require representation as
non-contiguous block alignments

Related Alignment Tools
– Alpaco - Blinker project (Rassier & Pedersen, 2003)
– ICA - Interactive Clue Aligner (Tiedemann, 2003; 2004; 2011)
*The "clue alignment approach" is based mainly on word-level alignment
clues. Our approach is based on manual alignments of cross-language MWU
and phrasal expressions, which allows representing semantically equivalent
non-adjacent structures, such as DMWU in translation and paraphrasing
– Yawat (Germann, 2008)
– SWIFT (Gilmanov et al., 2014)
– among others

CLUE-Aligner
CLUE* = Cross-Language Unit Elicitation
• Interactive web alignment tool inspired by Linear-B (Callison-Burch
& Bannard, 2004; Callison-Burch, 2007)
• Allows the block alignment of contiguous MWU and DMWU
• Uses a matrix visualization and a coloring scheme that help
distinguish between sure and possible alignments
• Allows storage of pairs of paraphrastic units, with the place
of insertions indicated by "[ ]"
– I urge [ ] to | Exorto [ ] a
– This feature is valuable in the construction of translation
rules or grammars and syntactic parsers that use those
paraphrastic pairs, for which precision is important
– It is also important in ML to help learn constituents
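The "[ ]" notation for stored pairs can be illustrated with a short sketch. It assumes that a discontiguous unit is given as a set of token indices within its sentence; the function `unit_with_gaps` and both example sentences are hypothetical, and only the resulting pair "I urge [ ] to | Exorto [ ] a" comes from the slide.

```python
def unit_with_gaps(tokens, indices, gap="[ ]"):
    """Join the selected tokens, writing `gap` wherever indices are skipped."""
    parts = []
    previous = None
    for i in sorted(indices):
        # A jump of more than one position marks an insertion point.
        if previous is not None and i > previous + 1:
            parts.append(gap)
        parts.append(tokens[i])
        previous = i
    return " ".join(parts)


# Hypothetical sentence pair, chosen to reproduce the slide's example unit.
source = "I urge the Commission to act".split()
target = "Exorto a Comissão a agir".split()

# Token positions of the discontiguous units on each side.
pair = f"{unit_with_gaps(source, {0, 1, 4})} | {unit_with_gaps(target, {0, 3})}"
print(pair)  # I urge [ ] to | Exorto [ ] a
```

Storing the gap explicitly keeps the position of the inserted material, which is what a rule or grammar built from the pair would need.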

CLUE-Aligner Interface
[Figures: alignment matrices showing single word alignments and block alignments, and discontiguous multiwords and insertions. Labels visible in the figures: "insertion"; "pre-processing of contracted forms"; "still | ainda".]
Black cells represent full/optimal semantic correspondence
Grey cells represent approximate semantic correspondence
Light orange cell groups represent unaligned P-insertions
Dark orange cell groups represent unaligned S-insertions
Light green cells / cell groups represent aligned P-insertions
Dark green cells / cell groups represent aligned S-insertions
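The matrix view can be approximated in plain text. This is a rough sketch, not the tool's rendering code: the function `render_matrix` and the symbols are hypothetical stand-ins for the black and grey cells, insertion colors are omitted, and labeling the paraphrase example as sure is an assumption made for illustration.

```python
def render_matrix(source, target, sure, possible):
    """Render an alignment grid: '#' = sure (black), '+' = possible (grey), '.' = empty."""
    rows = []
    for i, s_tok in enumerate(source):
        cells = []
        for j in range(len(target)):
            if (i, j) in sure:
                cells.append("#")
            elif (i, j) in possible:
                cells.append("+")
            else:
                cells.append(".")
        # Source token as a left-aligned row label, then one cell per target token.
        rows.append(s_tok.ljust(12) + " ".join(cells))
    return "\n".join(rows)


source = ["to", "make", "a", "distinction", "between"]
target = ["to", "distinguish", "between"]
# "make a distinction" is block-aligned with "distinguish" (labeled sure here by assumption).
sure = {(0, 0), (1, 1), (2, 1), (3, 1), (4, 2)}

grid = render_matrix(source, target, sure, set())
print(grid)
```

Rows are source tokens and columns are target tokens, so a block alignment shows up as a filled rectangle of cells.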

Alignment Guidelines
• Inspired by the Logos Model (Scott, 2003; Barreiro et al.,
2011), which relies on deep semantico-syntactic analysis to
translate contiguous MWU and DMWU, which are often mistranslated by MT
systems; the model has proven successful in commercial MT systems
• to draw a distinction between
• to bring [INSERTION] to a conclusion
• I would urge the European Commission to bring the process of
adopting the directive on additional pensions to a conclusion
• Supported by the Lexicon-Grammar theoretical framework
and transformational grammar (Gross, 1968; 1975)
• Aligning the translation pairs of units resulted in
a gold collection, made possible by CLUE-Aligner
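The relation between the unit "to bring [INSERTION] to a conclusion" and the example sentence can be sketched with a regular expression. This is a deliberately shallow stand-in: the Logos Model relies on deep semantico-syntactic analysis, not string patterns, and the helper `pattern_from_unit` is hypothetical.

```python
import re


def pattern_from_unit(unit, slot="[INSERTION]"):
    """Compile a unit such as 'to bring [INSERTION] to a conclusion' into a regex."""
    pieces = [re.escape(p.strip()) for p in unit.split(slot)]
    # Each slot must be filled by at least one character; matching is non-greedy.
    body = r"\s+(.+?)\s+".join(pieces)
    return re.compile(r"\b" + body + r"\b")


sentence = (
    "I would urge the European Commission to bring the process of "
    "adopting the directive on additional pensions to a conclusion"
)

match = pattern_from_unit("to bring [INSERTION] to a conclusion").search(sentence)
if match:
    # The captured group is the material inserted inside the discontiguous unit.
    print(match.group(1))
```

A surface pattern like this would over-match in real text (for example across clause boundaries), which is one reason the guidelines lean on linguistic analysis.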

CLUE-Aligner
• Allows visualization of automatic phrase alignments and can
be used for correcting inaccurate alignments
– Can load previously (and possibly automatically) generated
alignments (segments) for the parallel sentences
• Allows alignment of smaller individual words or MWU inside DMWU
• Useful in human and machine translation evaluation
• Future development plans include automatic alignment
– Alignments containing pairs of paraphrastic or translation
units can be used to train ML systems
• Developed under the scope of the eSPERTo project
https://esperto.l2f.inesc-id.pt/esperto/aligner/index.pl?
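Loading previously generated alignments could look like the sketch below. The slides do not say which input format CLUE-Aligner accepts; this assumes the "i-j" word-index notation that automatic word aligners commonly emit, and the functions `parse_links` and `links_to_pairs` are hypothetical.

```python
def parse_links(line):
    """Parse 'i-j' items (source index, target index), e.g. '0-0 1-0 2-1'."""
    links = set()
    for item in line.split():
        src, tgt = item.split("-")
        links.add((int(src), int(tgt)))
    return links


def links_to_pairs(source, target, links):
    """Turn index links into (source token, target token) pairs, in index order."""
    return [(source[i], target[j]) for i, j in sorted(links)]


source = "to keep it simple".split()
target = ["simplificar"]

# Assumed aligner output: every source token linked to the single target token.
links = parse_links("0-0 1-0 2-0 3-0")
pairs = links_to_pairs(source, target, links)
print(pairs)
```

Once loaded, links like these could be shown in the matrix and corrected by hand, which is the workflow the slide describes.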

Use of Paraphrastic Units in eSPERTo
the man who is American
the man from America
the man with American nationality
…
The American man
https://esperto.l2f.inesc-id.pt/esperto/esperto/demo.pl

Conclusions and Future Work
• Linguistically based alignments extracted from quality corpora:
– Contribute to increased precision and recall in SMT systems, with
subsequent improvement of translation quality
– Are a valuable asset for applications that require monolingual
paraphrases
• We moved forward by creating a tool that handles non-adjacent
structures, allowing the alignment of DMWU and
phrasal expressions to improve translation applications
• Planned improvements to CLUE-Aligner include:
– Feeding it with existing translation or paraphrastic knowledge
previously aligned or generated with a linguistic processing tool
– Enhancing it to automatically align and extract large
amounts of alignment pairs to be applied to paraphrasing and MT
case studies

Thank you!
Acknowledgements
This research work was supported by Fundação para a Ciência e Tecnologia (FCT), under project eSPERTo
EXPL/MHC-LIN/2260/2013, UID/CEC/50021/2013, and post-doctoral grant SFRH/BPD/91446/2012
