Using Parallel Propbanks to Enhance Word-alignments
Jinho Choi, Martha Palmer, Nianwen Xue
Institute of Cognitive Science, University of Colorado at Boulder

This short paper describes the use of the linguistic annotation available in parallel PropBanks (Chinese and English) for the enhancement of automatically derived word alignments. Specifically, we suggest ways to refine and expand word alignments for verb-predicates by using predicate-argument structures. Evaluations demonstrate improved alignment accuracies that vary by corpus type.


Background

PropBank
- A corpus annotated with verbal propositions and their arguments.
- Adds semantic information (semantic roles) to phrase structures.
- e.g., John opened the door with his foot

Word-alignments
- Parallel sentences: sentences s and t are called parallel if t is a translation of s.
- Word alignment: given parallel sentences, align words that are semantically close.
- GIZA++: a statistical machine translation toolkit used to train word-alignment models.

[Figures: Phrase Structure, PropBank Annotations, System Overview]

Motivation

Issues with GIZA++-generated word-alignments:
- It is hard to verify whether the alignments are correct.
- Words with low frequencies may not get aligned to any word.
- GIZA++ does not account for semantics.

Using parallel PropBanks to enhance word-alignments for verb-predicates:
- Let S and T be a source and a target language, respectively.
- For each verb-predicate v_s ∈ S aligned to some word w_t ∈ T: if w_t is also a verb-predicate and the arguments of v_s and w_t match, consider the alignment correct (top-down matching).
- For each verb-predicate v_s ∈ S aligned to no word in T: if the arguments of v_s match the arguments of some verb-predicate v_t ∈ T, align v_s to v_t (bottom-up matching).

Corpus Description

English Chinese Translation Treebank (ECTB)
- A parallel corpus between English and Chinese, divided into two parts:
  : Xinhua Chinese newswire with literal English translations (4,363 parallel sentences)
  : Sinorama Chinese news magazine with non-literal English translations (12,600 parallel sentences)

Predicate Matching

For each Chinese verb-predicate v_c aligned to some English word w_e, we checked whether w_e is also a verb-predicate.
(pred = predicates, be = be-verbs, else = non-verbs, none = no words)

Top-down Argument Matching

For each Chinese verb v_c aligned to an English verb v_e:
- Convert all Chinese words in the arguments of v_c to their English alignments (skipping ones not aligned to any English word).
- Compare the converted arguments of v_c with the arguments of v_e.
- For each argument, check how many words are matched; if the matching is above a certain threshold, consider the alignment correct.

Measurements
- CA = the set of arguments of v_c, where ca_i ∈ CA
- EA = the set of arguments of v_e, where ea_i ∈ EA
- Macro average argument matching score: the per-argument matching rates, averaged over all arguments.
- Micro average argument matching score: the total number of matched words over all arguments, divided by the total number of argument words.
- (A code sketch of the matching step and both scores follows below.)

Average Top-down Argument Matching Scores

             Xinhua     Sinorama
Macro Avg.   80.55%     53.56%
Micro Avg.   83.91%     52.62%
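The poster's formula images did not survive this export, so the following is a minimal sketch of top-down matching and the two average scores under assumed data structures: an argument set is a dict mapping a role label (e.g. "ARG0") to the set of word tokens filling that role, and `align` is a dict from Chinese tokens to their English alignments. The per-argument score (matched words over converted words), the matching direction, and every function name here are assumptions, not the authors' code.

```python
def translate_argument(ch_words, align):
    """Convert a Chinese argument to its English alignments,
    skipping words that are not aligned to any English word."""
    return {align[w] for w in ch_words if align.get(w) is not None}

def argument_scores(ch_args, en_args, align):
    """Macro and micro average argument matching scores for one
    Chinese/English verb pair. Assumed definitions: per-argument
    score = matched words / converted words; macro = mean of the
    per-argument scores; micro = total matched / total converted."""
    per_arg, matched_total, words_total = [], 0, 0
    for label, ch_words in ch_args.items():
        converted = translate_argument(ch_words, align)
        if not converted:  # no Chinese word in this argument is aligned
            continue
        matched = len(converted & en_args.get(label, set()))
        per_arg.append(matched / len(converted))
        matched_total += matched
        words_total += len(converted)
    macro = sum(per_arg) / len(per_arg) if per_arg else 0.0
    micro = matched_total / words_total if words_total else 0.0
    return macro, micro

def top_down_match(ch_args, en_args, align, threshold=0.4):
    """Keep a GIZA++ verb alignment only if the argument structures
    agree above the (illustrative) macro-score threshold."""
    macro, _ = argument_scores(ch_args, en_args, align)
    return macro >= threshold
```

For example, with align = {'他': 'he', '门': 'door'}, ch_args = {'ARG0': {'他'}, 'ARG1': {'门'}}, and en_args = {'ARG0': {'he'}, 'ARG1': {'the', 'door'}}, argument_scores returns (1.0, 1.0): both converted Chinese argument words are found in the corresponding English arguments.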
Bottom-up Matching

For each Chinese verb v_c aligned to no English word:
- Convert all Chinese words in the arguments of v_c to their English alignments.
- Compare the converted arguments of v_c with the arguments of each English verb v_e that is not aligned to any Chinese verb, and find the one, say v_m, with the maximum micro average score.
- If the micro average score of v_c and v_m is above a certain threshold, align v_c to v_m (a companion code sketch appears at the end of this page).

Average Bottom-up Argument Matching Scores

             Xinhua               Sinorama
Threshold    0.7       0.8        0.7       0.8
Macro Avg.   80.74%    83.99%     77.70%    82.86%
Micro Avg.   82.63%    86.46%     79.45%    85.07%

Evaluations

Test corpus
- English-Chinese parallel corpus provided by Wei Wang (Information Sciences Institute at the University of Southern California)
- 100 parallel sentences, 273 Chinese verb-types (365 verb-tokens)
- Test whether word-alignments found in ECTB can correctly translate Chinese verbs to English verbs

Measurements
- Term coverage (TC): how many Chinese verb-types are covered by word-alignments found in ECTB
- Term expansion (TE): for each covered Chinese verb-type, how many English verb-types are suggested by the word-alignments
- Alignment accuracy (AA): how many of the suggested English verb-types are correct

Refining word-alignments
- Apply only the word-alignments whose macro average scores are above a certain threshold (TH).
- Thresholds: 0 (accept all alignments), 0.4 (accept alignments whose macro average scores are above 40%), and 0.5.
- ATE = average term expansion, AAA = average alignment accuracy

       Xinhua                   Sinorama
TH     TC    ATE    AAA         TC     ATE    AAA
0.0    79    1.77   83.35%      129    2.29   57.76%
0.4    76    1.72   83.54%      93     1.80   65.88%
0.5    76    1.68   83.71%      62     1.58   78.09%

Expanding word-alignments
- Apply only the word-alignments whose macro and micro average scores are above certain thresholds.

               Macro 0.7                Macro 0.8
Micro     TC     ATE    AAA        TC     ATE    AAA
Xinhua
  0.0     22     4.27   50.38%     20     3.35   57.50%
  0.6     21     3.90   54.76%     18     3.39   63.89%
  0.7     19     3.47   55.26%     17     3.12   61.76%
Sinorama
  0.0     37     3.59   18.01%     29     3.14   14.95%
  0.6     31     3.06   15.11%     27     2.93   14.46%
  0.7     21     2.81   11.99%     25     2.60   11.82%

Summary and Future Work

• Top-down argument matching is most effective with non-literal translations, which have proven difficult for GIZA++.
• Bottom-up argument matching shows promise for expanding the coverage of GIZA++ alignments that are based on literal translations.
• In future work, we will try to enhance word-alignments by using automatically labeled PropBanks, NomBanks, and named-entity tags.
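A companion sketch of the bottom-up step referenced above, reusing argument_scores from the earlier sketch and the same hypothetical data structures. Here `en_candidates` (English verbs not aligned to any Chinese verb, each mapped to its argument dict) is an assumed input, and the 0.7 default threshold is illustrative, taken from the bottom-up table.

```python
def bottom_up_match(ch_args, en_candidates, align, threshold=0.7):
    """For a Chinese verb aligned to no English word, propose the
    unaligned English verb with the maximum micro average argument
    matching score, but only if that score clears the threshold."""
    best_verb, best_micro = None, 0.0
    for en_verb, en_args in en_candidates.items():
        _, micro = argument_scores(ch_args, en_args, align)
        if micro > best_micro:
            best_verb, best_micro = en_verb, micro
    # Returning None leaves the Chinese verb unaligned, mirroring the
    # poster's rule of aligning only above a certain score.
    return best_verb if best_micro >= threshold else None
```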