This paper describes a universal phrase tagset mapping between the French Treebank and English Penn Treebank using 9 phrase categories. It then applies this mapping to an unsupervised machine translation evaluation method that calculates similarity between the source and target sentences without reference translations. The method extracts phrase tags from the source and target, maps them to universal tags, and measures n-gram precision, recall, and position difference as similarity metrics. Evaluation on French-English data shows promising correlation with human judgments, though there is still room for improvement. The tagset and methods could facilitate future multilingual research.
A General Method Applicable to the Search for Anglicisms in Russian Social Network Texts (Ilia Karpov)
In the process of globalization, the number of English words in other languages has rapidly increased. In automatic speech recognition systems, spell-checking, tagging, and other software in the field of natural language processing, loan words are not easily recognized and should be handled separately. In this paper we present a corpus-based approach to the automatic detection of anglicisms in Russian social network texts. The proposed method is based on the idea of simultaneous script, phonetic, and semantic similarity between the original Latin word and its Cyrillic analogue. We used a set of transliteration, phonetic transcription, and morphological analysis methods to generate candidate hypotheses, and distributional semantic models to filter them. The resulting list of borrowings, gathered from approximately 20 million LiveJournal texts, shows good overlap with a manually collected dictionary. The proposed method is fully automated and can be applied to any domain-specific area.
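As a rough illustration of the approach described above (not the authors' implementation), the following Python sketch generates a Latin-script hypothesis for a Cyrillic word via a rule-based transliteration table and keeps the pair only if the two words are also close in a shared semantic space; the transliteration table and the `vectors` lookup are illustrative stand-ins.

# A minimal sketch of the anglicism-detection idea: transliterate a Cyrillic
# word into a Latin hypothesis, then apply a distributional-semantics filter.
# The transliteration table is deliberately tiny and illustrative.
TRANSLIT = {"к": "c", "о": "o", "м": "m", "п": "p", "ь": "", "ю": "u",
            "т": "t", "е": "e", "р": "r"}

def transliterate(cyrillic_word):
    return "".join(TRANSLIT.get(ch, ch) for ch in cyrillic_word.lower())

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def is_anglicism(cyrillic_word, english_lexicon, vectors, threshold=0.5):
    hypothesis = transliterate(cyrillic_word)  # e.g. "компьютер" -> "computer"
    if hypothesis not in english_lexicon:
        return False
    # Semantic filter: the borrowing and its source should be distributionally similar.
    return cosine(vectors[cyrillic_word], vectors[hypothesis]) >= threshold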
Full paper available at:
https://www.academia.edu/29834070/A_General_Method_Applicable_to_the_Search_for_Anglicisms_in_Russian_Social_Network_Texts
Parallel Corpora in (Machine) Translation: Goals, Issues and Methodologies (Antonio Toral)
Parallel corpora play a central role in current approaches to machine and computer-assisted translation and also in any corpus-based study that involves original text and its translation. This talk motivates the use of parallel data, as well as its desired properties. It then introduces practical methodologies to automatically acquire and prepare parallel data for the task at hand. Finally, it glances at the neighbouring field of Translation Studies to assert that translations can differ to a great extent depending on the strategy followed by the translator, which might lead to the translation being more or less appropriate for its use in corpus-based studies.
Formal and Computational Representations
The Semantics of First-Order Logic
Event Representations
Description Logics & the Web Ontology Language
Compositionality
Lambda calculus
Corpus-based approaches:
Latent Semantic Analysis
Topic models
Distributional Semantics
In this presentation we discuss several concepts, including word representations built with SVD as well as neural network-based techniques. We also cover core concepts such as cosine similarity and atomic versus distributed representations.
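Since the talk covers SVD-based word representations and cosine similarity, here is a minimal self-contained sketch of the idea with a toy co-occurrence matrix (the words and counts are invented for illustration):

import numpy as np

# LSA-style word vectors: factorize a word-by-context co-occurrence matrix
# with SVD and keep the top-k latent dimensions as dense representations.
words = ["king", "queen", "apple"]
cooc = np.array([[5.0, 1.0, 0.0],
                 [4.0, 1.0, 0.0],
                 [0.0, 0.0, 6.0]])

U, S, Vt = np.linalg.svd(cooc, full_matrices=False)
k = 2
vectors = U[:, :k] * S[:k]      # one dense vector per word

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vectors[0], vectors[1]))   # "king" vs "queen": close to 1
print(cosine(vectors[0], vectors[2]))   # "king" vs "apple": close to 0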
Language variety identification aims at labelling texts in a native language (e.g. Spanish, Portuguese, English) with its specific variation (e.g. Argentina, Chile, Mexico, Peru, Spain; Brazil, Portugal; UK, US). In this work we propose a low dimensionality representation (LDR) to address this task with five different varieties of Spanish: Argentina, Chile, Mexico, Peru and Spain. We compare our LDR method with common state-of-the-art representations and show an increase in accuracy of ∼35%. Furthermore, we compare LDR with two reference distributed representation models. Experimental results show competitive performance while dramatically reducing the dimensionality — and increasing the big data suitability — to only 6 features per variety. Additionally, we analyse the behaviour of the employed machine learning algorithms and the most discriminating features. Finally, we employ an alternative dataset to test the robustness of our low dimensionality representation with another set of similar languages.
Language variety identification is an author profiling subtask which aims to detect lexical and semantic variations in order to classify different varieties of the same language. In this work we approach the task by using distributed representations based on the investigations of Mikolov et al.
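To make the idea concrete, here is a much-simplified Python sketch of a low dimensionality representation: terms receive per-variety weights from the training corpus, and a document is reduced to a few statistics per variety (the real LDR uses six features per variety; this cut-down version uses two and is only meant to show the shape of the method):

from collections import Counter, defaultdict

# Per-variety term weights: relative frequency of each term within a variety.
def term_weights(train_docs):
    # train_docs: list of (tokens, variety) pairs
    counts = defaultdict(Counter)
    for tokens, variety in train_docs:
        counts[variety].update(tokens)
    weights = {}
    for variety, counter in counts.items():
        total = sum(counter.values())
        weights[variety] = {t: n / total for t, n in counter.items()}
    return weights

# Reduce a document to (mean, max) of its term weights for every variety,
# yielding a tiny fixed-size feature vector for any classifier.
def ldr_features(tokens, weights):
    features = []
    for variety in sorted(weights):
        w = [weights[variety].get(t, 0.0) for t in tokens] or [0.0]
        features += [sum(w) / len(w), max(w)]
    return features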
ACL-WMT2013. A Description of Tunable Machine Translation Evaluation Systems i... (Lifeng (Aaron) Han)
Proceedings of the ACL 2013 Eighth Workshop on Statistical Machine Translation (ACL-WMT 2013), 8-9 August 2013, Sofia, Bulgaria. Open tool: https://github.com/aaronlifenghan/aaron-project-lepor & https://github.com/aaronlifenghan/aaron-project-hlepor (ACM Digital Library, ACL Anthology)
Natural Language Processing with Python and Amharic Syntax Parse Tree (Daniel Adenew)
Natural Language Processing is an interrelated discipline that brings the capability of human-like communication to the computer world. The Amharic language has seen much improvement over time thanks to researchers at PhD and MSc level at AAU. Here, I have tried to study and build a limited-scope solution that performs syntax parsing for Amharic and draws syntax parse trees using Python.
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ... (Lifeng (Aaron) Han)
Presentation PPT in MT SUMMIT 2013.
Language-independent Model for Machine Translation Evaluation with Reinforced Factors
International Association for Machine Translation, 2013
Authors: Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yervant Ho, Yi Lu, Anson Xing, Samuel Zeng
Proceedings of the 14th biennial International Conference of Machine Translation Summit (MT Summit 2013). Nice, France. 2 - 6 September 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor (Machine Translation Archive)
LEPOR: An Augmented Machine Translation Evaluation Metric - Thesis PPT (Lifeng (Aaron) Han)
Machine translation (MT) has developed into one of the hottest research topics in the natural language processing (NLP) literature. One important issue in MT is how to evaluate an MT system reasonably and tell whether the translation system makes an improvement or not. The traditional manual judgment methods are expensive, time-consuming, unrepeatable, and sometimes suffer from low agreement. On the other hand, the popular automatic MT evaluation methods have some weaknesses. Firstly, they tend to perform well on language pairs with English as the target language, but weakly when English is the source. Secondly, some methods rely on many additional linguistic features to achieve good performance, which makes the metric difficult to replicate and apply to other language pairs. Thirdly, some popular metrics utilize incomprehensive factors, which results in low performance on some practical tasks.
In this thesis, to address the existing problems, we design novel MT evaluation methods and investigate their performance on different languages. Firstly, we design augmented factors to yield highly accurate evaluation. Secondly, we design a tunable evaluation model in which the weighting of factors can be optimized according to the characteristics of languages. Thirdly, in the enhanced version of our methods, we design concise linguistic features using POS to show that our methods can yield even higher performance when using some external linguistic resources. Finally, we report the practical performance of our metrics in the ACL-WMT workshop shared tasks, which shows that the proposed methods are robust across different languages.
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools (Lifeng (Aaron) Han)
Abstract of Aaron Han’s Presentation
The main topic of this presentation is the evaluation of machine translation. With the rapid development of machine translation (MT), MT evaluation becomes more and more important for telling whether systems are making progress. Traditional human judgments are very time-consuming and expensive. On the other hand, there are some weaknesses in the existing automatic MT evaluation metrics:
– perform well in certain language pairs but weak on others, which we call the language-bias problem;
– consider no linguistic information (leading to low correlation with human judgments) or too many linguistic features (difficult to replicate), which we call the extremism problem;
– design incomprehensive factors (e.g. precision only).
To address the existing problems, he has developed several automatic evaluation metrics:
– Design tunable parameters to address the language-bias problem;
– Use concise linguistic features for the linguistic extremism problem;
– Design augmented factors.
The experiments on ACL-WMT corpora show that the proposed metrics yield higher correlation with human judgments. The proposed metrics have been published at top international conferences, e.g. COLING and MT SUMMIT. The evaluation work is closely related to similarity measurement, so it can be further developed for other areas, such as information retrieval, question answering, and search.
A brief introduction to some of his other research will also be given, such as Chinese named entity recognition, word segmentation, and multilingual treebanks, which have been published in the Springer LNCS and LNAI series. Suggestions and comments are much appreciated, and opportunities for further cooperation are welcome.
TSD2013 PPT. Automatic Machine Translation Evaluation with Part-of-Speech Info... (Lifeng (Aaron) Han)
Publisher: Springer-Verlag Berlin Heidelberg, 2013
Authors: Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Yervant Ho
Proceedings of the 16th International Conference of Text, Speech and Dialogue (TSD 2013). Plzen, Czech Republic, September 2013. LNAI Vol. 8082, pp. 121-128. Volume Editors: I. Habernal and V. Matousek. Springer-Verlag Berlin Heidelberg 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor
PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks (Lifeng (Aaron) Han)
Many syntactic treebanks and parser toolkits have been developed in the past twenty years, including dependency structure parsers and phrase structure parsers. Phrase structure parsers usually utilize different phrase tagsets for different languages, which results in inconvenience when conducting multilingual research. This paper designs a refined universal phrase tagset that contains 9 commonly used phrase categories. Furthermore, the mapping covers 25 constituent treebanks and 21 languages. The experiments show that the universal phrase tagset can generally reduce the costs of the parsing models and even improve parsing accuracy.
MT SUMMIT13. Language-independent Model for Machine Translation Evaluation wit... (Lifeng (Aaron) Han)
Authors: Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yervant Ho, Yi Lu, Anson Xing, Samuel Zeng
Proceedings of the 14th biennial International Conference of Machine Translation Summit (MT Summit 2013) pp. 215-222. Nice, France. 2 - 6 September 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor (Machine Translation Archive)
Meta-evaluation of Machine Translation Evaluation Methods (Lifeng (Aaron) Han)
Cite: Lifeng Han. 2021. Meta-evaluation of machine translation evaluation methods. In Metrics2021 Tutorial Track/type: Workshop on Informetric and Scientometric Research (SIG-MET), ASIS&T. October 23–24.
Sentence-level translation quality estimation with cross-lingual transformers.
Please consider citing our paper
@InProceedings{transquest:2020,
author = {Ranasinghe, Tharindu and Orasan, Constantin and Mitkov, Ruslan},
title = {TransQuest: Translation Quality Estimation with Cross-lingual Transformers},
booktitle = {Proceedings of the 28th International Conference on Computational Linguistics},
year = {2020}
}
Machine translation (MT) is one of the earliest and most successful applications of natural language processing. Many MT services have been deployed via web and smartphone apps, enabling communication and information access across the globe by bypassing language barriers. However, MT is not yet a solved problem. MT services that cover the most languages cover only about a hundred; thousands more are currently unsupported. Even for the currently supported languages, the translation quality is far from perfect.
A key obstacle in our way to achieving usable MT models for any language is data imbalance. On the one hand, machine learning techniques perform subpar on rare categories, having only a few to no training examples. On the other hand, natural language datasets are inevitably imbalanced with a long tail of rare types. The rare types carry more information content, and hence correctly translating them is crucial. In addition to the rare word types, rare phenomena also manifest in other forms as rare languages and rare linguistic styles.
Our contributions towards advancing rare phenomena learning in MT are four-fold: (1) We show that MT models have much in common with classification models, especially regarding the data imbalance and frequency-based biases. We describe a way to reduce the imbalance severity during the model training. (2) We show that the currently used automatic evaluation metrics overlook the importance of rare words. We describe an interpretable evaluation metric that treats important words as important. (3) We propose methods to evaluate and improve translation robustness to rare linguistic styles such as partial translations and language alternations in inputs. (4) Lastly, we present a set of tools intended to advance MT research across a wider range of languages. Using these tools, we demonstrate 600 languages to English translation, thus supporting 500 more rare languages currently unsupported by others.
This lecture provides students with an introduction to natural language processing, with a specific focus on the basics of two applications: vector semantics and text classification.
(Lecture at the QUARTZ PhD Winter School, http://www.quartz-itn.eu/training/winter-school/, in Padua, Italy, on February 12, 2018)
Similar to Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation:
WMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester (Lifeng (Aaron) Han)
Pre-trained language models (PLMs) often take advantage of the monolingual and multilingual datasets that are freely available online to acquire general or mixed domain knowledge before deployment into specific tasks. Extra-large PLMs (xLPLMs) have been proposed very recently, claiming supreme performance over smaller-sized PLMs in tasks such as machine translation (MT). These xLPLMs include Meta-AI's wmt21-dense-24-wide-en-X (2021) and NLLB (2022). In this work, we examine whether xLPLMs are absolutely superior to smaller-sized PLMs when fine-tuned toward domain-specific MT. We use two in-domain datasets of different sizes: commercial automotive in-house data and clinical shared task data from the ClinSpEn-2022 challenge at WMT2022. We choose the popular Marian Helsinki as the smaller-sized PLM and two massive-sized Mega-Transformers from Meta-AI as xLPLMs.
Our experimental investigation shows that 1) on the smaller-sized in-domain commercial automotive data, the xLPLM wmt21-dense-24-wide-en-X indeed shows much better evaluation scores using SACREBLEU and hLEPOR metrics than the smaller-sized Marian, even though its score increase rate after fine-tuning is lower than Marian's; 2) on fine-tuning with the relatively larger, well-prepared clinical data, the xLPLM NLLB tends to lose its advantage over the smaller-sized Marian on two sub-tasks (clinical terms and ontology concepts) using the ClinSpEn-offered metrics METEOR, COMET, and ROUGE-L, and totally loses to Marian on Task-1 (clinical cases) on all official metrics including SACREBLEU and BLEU; 3) metrics do not always agree with each other on the same tasks using the same model outputs; 4) clinic-Marian ranked No. 2 on Task-1 (via SACREBLEU/BLEU) and Task-3 (via METEOR and ROUGE) among all submissions.
Measuring Uncertainty in Translation Quality Evaluation (TQE) (Lifeng (Aaron) Han)
From the point of view of both human translators (HT) and machine translation (MT) researchers, translation quality evaluation (TQE) is an essential task. Translation service providers (TSPs) have to deliver large volumes of translations that meet customer specifications under harsh constraints on required quality level, time-frames, and costs. MT researchers strive to make their models better, which also requires reliable quality evaluation. While automatic machine translation evaluation (MTE) metrics and quality estimation (QE) tools are widely available and easy to access, existing automated tools are not good enough, and human assessment from professional translators (HAP) is often chosen as the gold standard \cite{han-etal-2021-TQA}.
Human evaluations, however, are often accused of having low reliability and agreement. Is this caused by subjectivity, or are statistics at play? How can we avoid checking the entire text and be more efficient with TQE from cost and efficiency perspectives, and what is the optimal sample size of the translated text that reliably estimates the translation quality of the entire material? This work carries out such motivated research to correctly estimate the confidence intervals \cite{Brown_etal2001Interval} depending on the sample size of translated text, e.g. the number of words or sentences, that needs to be processed in the TQE workflow step for confident and reliable evaluation of overall translation quality.
The methodology we applied for this work is from Bernoulli Statistical Distribution Modeling (BSDM) and Monte Carlo Sampling Analysis (MCSA).
Reference: S Gladkoff, I Sorokina, L Han, A Alekseeva. 2022. Measuring Uncertainty in Translation Quality Evaluation (TQE). LREC2022. arXiv preprint arXiv:2111.07699
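A minimal sketch of the underlying statistics, assuming (as a simplification of the paper's Bernoulli modeling) that each sampled unit passes or fails quality control independently; the half-width of the normal-approximation confidence interval shrinks with the square root of the sample size, and a Monte Carlo run checks the nominal coverage:

import math
import random

# Half-width of the ~95% normal-approximation interval for a pass rate p_hat
# estimated from n sampled units.
def ci_half_width(p_hat, n, z=1.96):
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Monte Carlo check: how often does the interval cover a true quality of 0.9?
def coverage(true_p=0.9, n=400, trials=10_000):
    hits = 0
    for _ in range(trials):
        p_hat = sum(random.random() < true_p for _ in range(n)) / n
        if abs(p_hat - true_p) <= ci_half_width(p_hat, n):
            hits += 1
    return hits / trials

print(ci_half_width(0.9, 400))   # ~0.029: about +/-3% at 400 sampled units
print(coverage())                # close to the nominal 0.95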
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov... (Lifeng (Aaron) Han)
Starting in the 1950s, Machine Translation (MT) has been approached with different scientific solutions, from rule-based methods, example-based and statistical models (SMT), to hybrid models, and in very recent years neural models (NMT).
While NMT has achieved a huge quality improvement compared to conventional methodologies, by taking advantage of the huge amounts of parallel corpora available from the internet and recently developed computational power at an acceptable cost, it struggles to achieve real human parity in many domains and most language pairs, if not all of them.
Along the long road of MT research and development, quality evaluation metrics have played very important roles in MT advancement and evolution.
In this tutorial, we overview the traditional human judgement criteria, automatic evaluation metrics, unsupervised quality estimation models, as well as the meta-evaluation of the evaluation methods. Among these, we also cover very recent work in the MT evaluation (MTE) field that takes advantage of large pre-trained language models for automatic metric customisation towards the exact language pairs and domains deployed. In addition, we introduce statistical confidence estimation of the sample size needed for human evaluation in real practice simulation.
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession... (Lifeng (Aaron) Han)
Traditional automatic evaluation metrics for machine translation have been widely criticized by linguists due to their low accuracy, lack of transparency, focus on language mechanics rather than semantics, and low agreement with human quality evaluation. Human evaluations in the form of MQM-like scorecards have always been carried out in real industry settings by both clients and translation service providers (TSPs). However, traditional human translation quality evaluations are costly to perform, go into great linguistic detail, raise issues of inter-rater reliability (IRR), and are not designed to measure the quality of worse-than-premium translations.
In this work, we introduce HOPE, a task-oriented and human-centric evaluation framework for machine translation output based on professional post-editing annotations. It contains only a limited number of commonly occurring error types, and uses a scoring model with a geometric progression of error penalty points (EPPs) reflecting the error severity level of each translation unit.
The initial experimental work, carried out on English-Russian MT outputs of marketing content from a highly technical domain, reveals that our evaluation framework is quite effective in reflecting MT output quality regarding both overall system-level performance and segment-level transparency, and it increases the IRR for error type interpretation.
The approach has several key advantages, such as the ability to measure and compare less-than-perfect MT output from different systems, the ability to indicate human perception of quality, immediate estimation of the labor effort required to bring MT output to premium quality, low-cost and faster application, as well as higher IRR. Our experimental data is available at https://github.com/lHan87/HOPE
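A minimal sketch of the geometric-progression scoring idea, with an invented base and point budget (the official HOPE error taxonomy and constants are in the paper and repository):

# Penalty points grow geometrically with error severity:
# severity 0 -> 1 EPP, severity 1 -> 2 EPPs, severity 2 -> 4 EPPs, ...
def error_penalty_points(severity, base=2):
    return base ** severity

# A translation unit starts from a fixed point budget and loses EPPs
# for every error annotated by the professional post-editor.
def segment_score(error_severities, max_points=10):
    penalty = sum(error_penalty_points(s) for s in error_severities)
    return max(max_points - penalty, 0)

print(segment_score([]))       # 10: clean segment
print(segment_score([0, 1]))   # 7: one minor and one medium error
print(segment_score([2, 2]))   # 2: two severe errors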
Apply Chinese Radicals into Neural Machine Translation: Deeper Than Character Level (Lifeng (Aaron) Han)
LPRC 2018: Limerick Postgraduate Research Conference
Lifeng Han and Shaohui Kuang. 2018. Apply Chinese radicals into neural machine translation: Deeper than character level. ArXiv pre-print https://arxiv.org/abs/1805.01565v1
Chinese Character Decomposition for Neural MT with Multi-Word Expressions (Lifeng (Aaron) Han)
ADAPT seminar series. June 2021
Research papers @ NoDaLiDa2021 (the 23rd Nordic Conference on Computational Linguistics) & the COLING2020 MWE-LEX workshop.
Bonus takeaway: the AlphaMWE multilingual corpus with MWEs.
Build Moses on Ubuntu (64-bit) System in VirtualBox, recorded by Aaron (v2, longer) (Lifeng (Aaron) Han)
Build Moses Statistical Machine Translation system with Ubuntu
Tree-to-tree machine translation with the universal phrase tagset: https://github.com/aaronlifenghan/A-Universal-Phrase-Tagset
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with... (Lifeng (Aaron) Han)
ADAPT Centre & Detection of Verbal Multi-Word Expressions via Conditional Random Fields with Syntactic Dependency Features and Semantic Re-Ranking @ DLSS2017 Bilbao.
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ... (Lifeng (Aaron) Han)
In this work, we present the construction of multilingual parallel corpora with annotation of multiword expressions (MWEs). MWEs include verbal MWEs (vMWEs) defined in the PARSEME shared task that have a verb as the head of the studied terms. The annotated vMWEs are also bilingually and multilingually aligned manually. The languages covered include English, Chinese, Polish, and German. Our original English corpus is taken from the PARSEME shared task in 2018. We performed machine translation of this source corpus followed by human post-editing and annotation of target MWEs. Strict quality control was applied for error limitation, i.e., each MT output sentence first received manual post-editing and annotation, plus a second round of manual quality rechecking. One of our findings during corpora preparation is that accurate translation of MWEs presents challenges to MT systems. To facilitate further MT research, we present a categorisation of the error types encountered by MT systems in performing MWE-related translation. To acquire a broader view of MT issues, we selected four popular state-of-the-art MT models for comparison, namely: Microsoft Bing Translator, GoogleMT, Baidu Fanyi and DeepL MT. Because of the noise removal, translation post-editing and MWE annotation by human professionals, we believe our AlphaMWE dataset will be an asset for cross-lingual and multilingual research, such as MT and information extraction. Our multilingual corpora are available as open access at github.com/poethan/AlphaMWE
ADAPT Centre and My NLP Journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing (Lifeng (Aaron) Han)
Invited Presentation in NLP lab of Soochow University, about my NLP journey and ADAPT Centre. NLP part covers Machine Translation Evaluation, Quality Estimation, Multiword Expression Identification, Named Entity Recognition, Word Segmentation, Treebanks, Parsing.
A deep analysis of Multi-word Expression and Machine TranslationLifeng (Aaron) Han
A deep analysis of Multi-word Expression and Machine Translation. Faculty research open day. DCU, Dublin. 2019.
Including MWE identification, MT with radical, MTE.
Quality Estimation for Machine Translation Using the Joint Method of Evaluati... (Lifeng (Aaron) Han)
This is a short presentation for the poster of the WMT13 shared task. The paper introduces our participation in the WMT13 shared tasks on Quality Estimation for machine translation without using reference translations. We submitted results for Task 1.1 (sentence-level quality estimation), Task 1.2 (system selection) and Task 2 (word-level quality estimation). In Task 1.1, we used an enhanced version of the BLEU metric without reference translations to evaluate translation quality. In Task 1.2, we utilized a probabilistic model, Naive Bayes (NB), as the classification algorithm, with features borrowed from traditional evaluation metrics. In Task 2, to take contextual information into account, we employed a discriminative undirected probabilistic graphical model, the Conditional Random Field (CRF), in addition to the NB algorithm. Training experiments on past WMT corpora showed that the designed methods yielded promising results, especially the statistical models CRF and NB. The official results show that our CRF model achieved the highest F-score, 0.8297, in the binary classification of Task 2.
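As a toy illustration of the Task 1.2 setup (not the submitted system), one can feed metric-derived features to a Naive Bayes classifier; the two feature columns and their values below are invented for the example:

from sklearn.naive_bayes import GaussianNB

# Each row: features borrowed from evaluation metrics for one candidate
# translation, e.g. an n-gram precision score and a length ratio.
X_train = [[0.42, 0.95],
           [0.18, 0.70],
           [0.55, 1.02],
           [0.12, 0.60]]
y_train = [1, 0, 1, 0]   # 1 = preferred system output, 0 = not preferred

clf = GaussianNB().fit(X_train, y_train)
print(clf.predict([[0.50, 0.98]]))   # likely [1]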
Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation
1. 25th International Conference, GSCL 2013
Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Shuo Li, and Ling Zhu
September 25th-27th, 2013, Darmstadt, Germany
Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory
Department of Computer and Information Science
University of Macau
2. Background of language treebanks
Motivation
Designed phrase tagset mapping
Application in MT evaluation:
1. Manual evaluations
2. Traditional automatic MT evaluation methods
3. Designed unsupervised MT evaluation
4. Evaluating the evaluation method
5. Experiments
6. Open source code
Discussion
Further information
3. • To promote the development of syntactic analysis
• Many language treebanks have been developed
– English Penn Treebank (Marcus et al., 1993; Mitchell et al., 1994)
– German Negra Treebank (Skut et al., 1997)
– French Treebank (Abeillé et al., 2003)
– Chinese Sinica Treebank (Chen et al., 2003)
– Etc.
4. • Problems
– Different treebanks use their own syntactic tagsets
– The number of tags ranges from tens (e.g. English Penn Treebank) to hundreds (e.g. Chinese Sinica Treebank)
– Inconvenient when undertaking multilingual or cross-lingual research
5. • To bridge the gap between these treebanks and facilitate future research
– E.g. the unsupervised induction of syntactic structure
• Petrov et al. (2012) developed a universal POS tagset
• How about phrase-level tags?
• The disaccord problem in phrase-level tags remains unsolved
– Let's try to solve it
6. • Tentative design of the phrase tagset mapping
– On English Penn Treebank I, II & French Treebank
• 9 universal phrasal categories covering:
– 14 phrase tags in English Penn Treebank I
– 26 phrase tags in English Penn Treebank II
– 14 phrase tags in French Treebank
7. Table 1: phrase tagset mapping for French and English treebanks
8. • Universal phrasal categories: NP (noun phrase), VP (verb phrase), AJP (adjective phrase), AVP (adverbial phrase), PP (prepositional phrase), S (sub-sentence), CONJP (conjunction phrase), COP (coordinated phrase), X (other phrases or unknown)
• NP covering
– French tags: NP
– English tags: NP, NAC (the scope of certain prenominal modifiers within an NP), NX (within certain complex NPs to mark the head of the NP), WHNP (wh-noun phrase), QP (quantifier phrase)
9. • VP covering
– French tags: VN (verbal nucleus), VP (infinitives and nonfinite clauses)
– English tags: VP (verb phrase)
• AJP covering
– French tags: AP (adjectival phrase)
– English tags: ADJP (adjective phrase), WHADJP (wh-adjective phrase)
10. • AVP covering
– French tags: AdP (adverbial phrase)
– English tags: ADVP (adverb phrase), WHAVP (wh-adverb phrase), PRT (particle)
• PP covering
– French tags: PP
– English tags: PP, WHPP (wh-prepositional phrase)
11. • S covering
– French tags: SENT (sentence), S (finite clause)
– English tags: S (simple declarative clause), SBAR (clause introduced by a subordinating conjunction), SBARQ (direct question introduced by a wh-phrase), SINV (declarative sentence with subject-aux inversion), SQ (sub-constituent of SBARQ), PRN (parenthetical), FRAG (fragment), RRC (reduced relative clause)
• CONJP covering
– French tags: N/A
– English tags: CONJP
12. • COP covering
– French tags: COORD (coordinated phrase)
– English tags: UCP (coordinated phrases belonging to different categories)
• X covering
– French tags: unknown
– English tags: X (unknown or uncertain), INTJ (interjection), LST (list marker)
• (The complete mapping is collected in the code sketch below)
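The mapping on slides 8-12 can be collected into a single lookup table; a sketch in Python, with the French and English tag inventories merged into one dictionary (tags that happen to share a name, such as NP, PP and S, map identically from both treebanks):

# Universal phrase tagset mapping from slides 8-12: English Penn Treebank
# and French Treebank phrase tags to the 9 universal categories.
UNIVERSAL_TAG = {
    # NP
    "NP": "NP", "NAC": "NP", "NX": "NP", "WHNP": "NP", "QP": "NP",
    # VP
    "VN": "VP", "VP": "VP",
    # AJP
    "AP": "AJP", "ADJP": "AJP", "WHADJP": "AJP",
    # AVP
    "AdP": "AVP", "ADVP": "AVP", "WHAVP": "AVP", "PRT": "AVP",
    # PP
    "PP": "PP", "WHPP": "PP",
    # S
    "SENT": "S", "S": "S", "SBAR": "S", "SBARQ": "S", "SINV": "S",
    "SQ": "S", "PRN": "S", "FRAG": "S", "RRC": "S",
    # CONJP, COP, X
    "CONJP": "CONJP", "COORD": "COP", "UCP": "COP",
    "X": "X", "INTJ": "X", "LST": "X",
}

def to_universal(tags):
    return [UNIVERSAL_TAG.get(t, "X") for t in tags]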
14. • Rapid development of machine translation
– MT began as early as the 1950s (Weaver, 1955)
– Big progress since the 1990s due to the development of computers (storage capacity and computational power) and enlarged bilingual corpora (Marino et al., 2006)
• Difficulties of MT evaluation
– Language variability results in no single correct translation
– Natural languages are highly ambiguous and different languages do not always express the same content in the same way (Arnold, 2003)
15. • Traditional manual evaluation criteria:
– Intelligibility (measuring how understandable the sentence is)
– Fidelity (measuring how much information the translated sentence retains compared to the original), by the Automatic Language Processing Advisory Committee (ALPAC) around 1966 (Carroll, 1966)
– Adequacy (similar to fidelity), fluency (whether the sentence is well-formed and fluent) and comprehension (improved intelligibility), by the Defense Advanced Research Projects Agency (DARPA) of the US (White et al., 1994)
16. • Problems of manual evaluations:
– Time-consuming
– Expensive
– Unrepeatable
– Low agreement (Callison-Burch et al., 2011)
17. • Measuring the similarity of the automatic translation and the reference translation
– Automatic translation (or hypothesis translation, target translation): by the automatic MT system
– Reference translation: by professional translators
– Source language and source document: not used
• Traditional automatic evaluation:
– BLEU: n-gram precision (Papineni et al., 2002)
– TER: edit distance (Snover et al., 2006)
– METEOR: precision and recall (Banerjee and Lavie, 2005)
18. • Problems in supervised MT evaluation
– Reference translations are expensive
– Reference translations are not available in some cases
• Could we get rid of the reference translation?
– Unsupervised MT evaluation method
– Extract information from the source and target languages
– How to use the designed universal phrase tagset?
19. • Assume that the translated sentence should have a similar set of phrase categories to the source sentence.
– This design is inspired by the synonymous relation between the source and target sentences.
• Two sentences that have similar sets of phrases may talk about different things.
– However, this evaluation approach is not designed for general circumstances
– Assume that the target sentences are indeed the translated sentences of the source document
20. • First, we parse the source and target languages respectively
• Then we extract the phrase set from the source and target sentences
• Third, we convert the phrases into the developed universal phrase categories
• Last, we measure the similarity of the source and target language on the universal phrase sequences (see the sketch below)
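A minimal sketch of these four steps in Python, assuming the phrase-tag sequences from the parsers are already available and using the UNIVERSAL_TAG table sketched earlier (the matching below is simplified and unclipped; the actual HPPR factors are defined on the following slides):

def ngrams(seq, n):
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

# Convert both sides to universal tags, then compare their n-gram sets.
def ngram_precision_recall(src_tags, tgt_tags, n=2):
    src = ngrams(to_universal(src_tags), n)
    tgt = ngrams(to_universal(tgt_tags), n)
    matched = sum(1 for g in tgt if g in src)
    precision = matched / len(tgt) if tgt else 0.0
    recall = matched / len(src) if src else 0.0
    return precision, recall

# French source tags vs English hypothesis tags: VN and VP both map to VP.
print(ngram_precision_recall(["NP", "VN", "PP"], ["NP", "VP", "PP"]))  # (1.0, 1.0)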
22. The level of extracted phrase tags: just the level immediately above the POS tags, bottom-up
Figure 2: converting the extracted phrases into universal phrase tags
23. • What is the similarity metric we employed?
• Designed similarity metric: HPPR
– N1-gram position order difference penalty
– Weighted N2-gram precision
– Weighted N3-gram recall
– Weighted geometric mean in n-gram precision & recall
– Weighted harmonic mean to combine sub-factors
– The parameters are tunable according to different language pairs
24. • $\mathrm{HPPR} = Har(w_{Ps}\,N_1\mathrm{PsDif},\ w_{Pr}\,N_2\mathrm{Pre},\ w_{Rc}\,N_3\mathrm{Rec})$
• $\mathrm{HPPR} = \dfrac{w_{Ps}+w_{Pr}+w_{Rc}}{\frac{w_{Ps}}{N_1\mathrm{PsDif}}+\frac{w_{Pr}}{N_2\mathrm{Pre}}+\frac{w_{Rc}}{N_3\mathrm{Rec}}}$
• $N_1\mathrm{PsDif}$, $N_2\mathrm{Pre}$, and $N_3\mathrm{Rec}$ are the corpus-level scores of the sub-factors position difference penalty, precision and recall.
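The weighted harmonic mean on this slide translates directly into code; a sketch assuming all three sub-scores are non-zero:

# HPPR as the weighted harmonic mean of the three sub-factor scores,
# with tunable weights w_ps, w_pr, w_rc.
def hppr(ps_dif, precision, recall, w_ps=1.0, w_pr=1.0, w_rc=1.0):
    return (w_ps + w_pr + w_rc) / (
        w_ps / ps_dif + w_pr / precision + w_rc / recall)

print(hppr(0.8, 0.6, 0.7))   # combined score, between the three sub-scores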
25. • The sentence-level $N_1\mathrm{PsDif}$ score:
• $N_1\mathrm{PsDif} = \exp(-N_1\mathrm{PD})$
• $N_1\mathrm{PD} = \frac{1}{\mathrm{Length}_{hyp}} \sum_i |PD_i|$
• $PD_i = |\mathrm{PsN}_{hyp} - \mathrm{MatchPsN}_{src}|$
• $\mathrm{PsN}_{hyp}$ and $\mathrm{MatchPsN}_{src}$ are the position numbers of the matching tag in the hypothesis and source sentence respectively. When there is no match for the tag: $PD_i = |\mathrm{PsN}_{hyp} - 0|$
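A sentence-level sketch of this computation, assuming a simple leftmost-match alignment between hypothesis and source tag sequences (the paper's matching strategy may differ in detail):

import math

def position_difference_penalty(hyp_tags, src_tags):
    diffs = []
    for i, tag in enumerate(hyp_tags, start=1):
        if tag in src_tags:
            match = src_tags.index(tag) + 1   # 1-based position of the match
        else:
            match = 0                          # no match: PD_i = |PsN_hyp - 0|
        diffs.append(abs(i - match))
    n1pd = sum(diffs) / len(hyp_tags)          # normalize by hypothesis length
    return math.exp(-n1pd)                     # N1PsDif score in (0, 1]

print(position_difference_penalty(["NP", "VP", "PP"], ["NP", "VP", "PP"]))  # 1.0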
30. • How reliable is the automatic metric?
• Evaluation criteria for evaluation metrics:
– Human judgments are currently the gold standard to approach
– Correlation with human judgments (Callison-Burch et al., 2011, 2012)
• Spearman rank correlation coefficient $r_s$:
– $r_s(XY) = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}$
– Two rank sequences $X = \{x_1, \dots, x_n\}$, $Y = \{y_1, \dots, y_n\}$
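The formula in code, applied to two rank lists (the example ranks are invented):

# Spearman rank correlation between two rank sequences of equal length n.
def spearman(x_ranks, y_ranks):
    n = len(x_ranks)
    d2 = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Ranks of five MT systems by the metric vs by human judges.
print(spearman([1, 2, 3, 4, 5], [2, 1, 3, 4, 5]))   # 0.9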
31. • Corpus from WMT
– Workshop on Statistical Machine Translation
– SIGMT, ACL's special interest group on machine translation
• Training data (WMT11), to tune the parameters
– 3,003 sentences per document
– 18 automatic French-to-English MT systems
• Testing data (WMT12)
– 3,003 sentences per document
– 15 automatic French-to-English MT systems
32. • Training, tuning the parameters
– N1, N2 and N3 are tuned to 2, 3 and 3, because 4-gram chunk matching usually results in a 0 score.
– Tuned values of the factor weights are shown in the table
Table 2: tuned parameter values
33. • Comparisons with:
– BLEU, which measures the closeness of the hypothesis and reference translations via n-gram precision
– TER, which measures the edit distance of the hypothesis to the reference translations
34. Table 3: training (development) scores on WMT11 corpus
Table 4: testing scores on WMT12 corpus
35. Table 5: interpretation of correlation scores (Cohen, 1988)
The experimental results on the development and testing corpora show that HPPR, without using reference translations, has yielded promising correlation scores (0.63 and 0.59 respectively).
There is still potential to improve the performance of all three metrics, even though correlation scores higher than 0.5 are already considered strong correlation, as shown in Table 5.
36. • Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation
– Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Shuo Li, and Ling Zhu. GSCL 2013, Darmstadt, Germany. LNCS Vol. 8105, pp. 119-131. Volume Editors: Iryna Gurevych, Chris Biemann and Torsten Zesch.
• Open source tool for the phrase tagset mapping and HPPR similarity measuring algorithms:
https://github.com/aaronlifenghan/aaron-project-hppr
37. • To facilitate future research in the multilingual or cross-lingual literature, this paper designs a phrase tagset mapping between the French Treebank and the English Penn Treebank using 9 phrase categories.
• One of the potential applications of the designed universal phrase tagset is shown in the unsupervised MT evaluation task in the experiment section.
38. • There are still some limitations in this work to be addressed in the future.
– The designed universal phrase categories may not be able to cover all the phrase tags of other language treebanks, so this tagset could be expanded when necessary.
– The designed HPPR formula contains the n-gram factors of position difference, precision and recall, which may not be sufficient or suitable for some other language pairs, so different measuring factors should be added or switched when facing new tasks.
39. • The designed models are closely related to similarity measurement. We have employed them in MT evaluation; these works may be further developed for other areas:
– information retrieval
– question answering
– search
– text analysis
– etc.
40. • Ongoing and further work:
– The combination of translation and evaluation, tuning the translation model using evaluation metrics
– Evaluation models from the perspective of semantics
– Further exploration of unsupervised evaluation models, extracting other features from the source and target languages
• Aaron's open source tools: https://github.com/aaronlifenghan
• Aaron's network home: http://www.linkedin.com/in/aaronhan
41. GSCL 2013, Darmstadt, Germany
Aaron L.-F. Han
Email: hanlifengaaron AT gmail DOT com
Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory
Department of Computer and Information Science
University of Macau