A Comparison of Unsupervised Bilingual Term Extraction Methods Using Phrase Tables
Published in: Technology

  1. A Comparison of Unsupervised Bilingual Term Extraction Methods Using Phrase-Tables. Masamichi Ideue†, Kazuhide Yamamoto†, Masao Utiyama‡, Eiichiro Sumita‡. † Nagaoka University of Technology, Japan; ‡ National Institute of Information and Communications Technology.
  2. Background • Automatic bilingual term extraction • Helpful for human translators • Applicable to other NLP tasks. Goal: develop unsupervised methods for extracting bilingual terms from a phrase-table, and compare them.
  3. Related work • Using an existing bilingual dictionary: Tonoike et al. (2006) translated the words in each source-language term using a bilingual dictionary and combined these translations to form term candidates. • Using a parallel corpus: Itagaki et al. (2007) proposed a supervised method for extracting bilingual terms from the phrase-table built from a parallel corpus. However, we usually have neither annotated data for training supervised methods nor bilingual dictionaries specific to the documents under translation.
  4. [Figure-only slide; no text survives in the transcript.]
  5. Statistical measures. Three statistical scores are each used to eliminate wrong pairs. Score_F: significance of the candidates based on Fisher's exact test. Score_L: strength of the alignment between the words of the candidates. Score_C: termhood of the candidate based on C-value.
  6. Bilingual term counting and combination of measures • Combination of scores: $\mathrm{Score}_{FLC}(T_{J,E}) = \frac{R(\mathrm{Score}_{F2}(T_{J,E})) + R(\mathrm{Score}_{L2}(T_{J,E})) + R(\mathrm{Score}_{C}(T_{J,E}))}{3}$ • Two methods for counting the number of occurrences of a term T. Method 1: count every occurrence of T, regardless of where it occurs. Method 2: count T only when it occurs alone, i.e., do not count occurrences of T as a substring of a longer term.
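The combination of scores above can be sketched in a few lines of Python. The slide does not define R; this sketch assumes R(·) is the candidate's rank under the given score (1 = best), so a lower combined value is better. The function names and the toy scores are hypothetical and only illustrate the averaging step.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank candidates by score, highest score first (rank 1 = best).

    Assumption: R(.) in the slide's formula is this rank; ties are broken
    by insertion order because sorted() is stable.
    """
    ordered = sorted(scores, key=lambda cand: scores[cand], reverse=True)
    return {cand: position + 1 for position, cand in enumerate(ordered)}


def score_flc(score_f2: Dict[str, float],
              score_l2: Dict[str, float],
              score_c: Dict[str, float]) -> Dict[str, float]:
    """Average the three ranks of every candidate (lower = better)."""
    r_f, r_l, r_c = ranks(score_f2), ranks(score_l2), ranks(score_c)
    return {cand: (r_f[cand] + r_l[cand] + r_c[cand]) / 3 for cand in score_f2}


# Hypothetical scores for three candidate pairs "a", "b", "c".
f2 = {"a": 9.0, "b": 5.0, "c": 1.0}
l2 = {"a": 0.6, "b": 0.9, "c": 0.5}
c = {"a": 3.0, "b": 2.0, "c": 1.0}
combined = score_flc(f2, l2, c)
best = min(combined, key=combined.get)
```

Averaging ranks rather than raw scores sidesteps the fact that the three measures live on different scales.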
  7. Experiments • Training corpus: a Japanese-English parallel corpus of about 60,000 pairs, related to apparel products. • 22,543 bilingual term candidates were extracted from the phrase-table. • For each score, 100 bilingual term candidates randomly selected from the top 1,000 candidates were manually evaluated. Evaluation criterion: A = correct; A' = correct depending on context; B = partly correct; C = incorrect.
  8. Translation accuracy (number of candidates per grade, out of the 100 evaluated for each score)

       Score   A    A'   B    C
       F1      43   25   24   8
       L1      77   5    18   0
       C       78   6    14   2
       F2      71   18   8    3
       L2      79   4    17   0
       FLC     87   2    11   0

     Score_F2, Score_L2, Score_C, and Score_FLC can filter out the noisy extracted bilingual terms.
  9. Characteristics of extracted bilingual terms

       Score   Occurrences   Words
       F2      Many          Few
       L2      Many          Many
       C       Few           Many
       FLC     Few           Many

     • Each measure extracts different bilingual term candidates. • The characteristics of Score_FLC showed a tendency similar to those of Score_C. From this, the residual noise of Score_C was filtered by Score_F2 and Score_L2.
  10. Conclusion. We compared three statistical measures for extracting bilingual terms from the phrase-table built from a parallel corpus. The measures differ in the number of words and the number of occurrences of the bilingual terms they extract. The combination of these measures ranks valid bilingual terms highly.
  11. Fisher's exact test. Score_F: significance of the candidates. • Fisher's exact test has been used by Johnson et al. (2007) to select valid phrase pairs from the phrase-table for statistical machine translation. We use the statistic of Fisher's exact test as Score_F to measure the validity of each bilingual term candidate. If the Score_F of a bilingual term candidate is high, the candidate is likely to be valid.
  12. Score_F: contingency table for a candidate pair (J, E)

                 E              not E                 Total
        J        C(J,E)         C(J)-C(J,E)           C(J)
        not J    C(E)-C(J,E)    N-C(J)-C(E)+C(J,E)    N-C(J)
        Total    C(E)           N-C(E)                N

      N: the number of all parallel sentences. C(J): the number of Japanese sentences containing J. C(E): the number of English sentences containing E. C(J,E): the number of parallel sentences containing both J and E. • P_h(C(J,E)) is the probability of observing the contingency table under the null hypothesis that J and E are independent of each other.
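The transcript does not preserve the formula for Score_F, so the following Python sketch rests on stated assumptions: P_h is taken to be the hypergeometric probability of the table, and the score is taken to be the negative log of the one-sided p-value (the sum of P_h over co-occurrence counts at least as large as the observed C(J,E)). The function names are hypothetical. Log-gamma arithmetic keeps the computation stable for corpus-sized N.

```python
import math


def log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_p_h(k: int, c_j: int, c_e: int, n: int) -> float:
    """Log hypergeometric probability of a table with co-occurrence count k."""
    return log_comb(c_j, k) + log_comb(n - c_j, c_e - k) - log_comb(n, c_e)


def score_f(c_je: int, c_j: int, c_e: int, n: int) -> float:
    """Negative log of the one-sided p-value (assumed form of Score_F).

    Sums P_h(k) for k = C(J,E) .. min(C(J), C(E)) with log-sum-exp.
    """
    logs = [log_p_h(k, c_j, c_e, n) for k in range(c_je, min(c_j, c_e) + 1)]
    peak = max(logs)
    log_p_value = peak + math.log(sum(math.exp(x - peak) for x in logs))
    return -log_p_value


# Toy table: N = 10 sentence pairs, J and E each occur in 3, always together.
toy_score = score_f(c_je=3, c_j=3, c_e=3, n=10)
```

In the toy table the only table at least as extreme as the observed one is the observed one itself, with probability 1/C(10,3) = 1/120, so the score equals ln 120.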
  13. Log-likelihood ratio. Score_L: strength of the alignment. • Tonoike et al. (2007) reported that the alignments of the components of a term are useful for automatic bilingual term extraction. We use the word alignments of each candidate term to measure the validity of the candidate.
  14. Alignment information. We used the alignment information produced by Moses (Koehn et al., 2007). [Figure: alignments in the parallel sentences and alignments in $T_{J,E}$.]
  15. C-Value. Score_C: termhood of the candidate. The C-value (Frantzi et al., 1996) has been used to measure the stability of nested multi-word term candidates. Example: color denim pants (C-value = 6.34), color denim (2.0), denim pants (60.33). If the term candidates of both languages are highly ranked in the C-value ranking, the bilingual term candidate is considered valid.
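The slide gives C-value examples but not the formula. The sketch below assumes the widely used log2-length form: log2(|a|) · f(a) for a term that is not nested in any longer candidate, and log2(|a|) · (f(a) − the mean frequency of the longer candidates containing a) otherwise. The frequencies are hypothetical, picked only so that the output is comparable with two of the slide's examples; they are not taken from the authors' data.

```python
import math
from typing import Dict, Tuple

Term = Tuple[str, ...]


def is_nested(short: Term, long: Term) -> bool:
    """True if `short` occurs as a contiguous sub-sequence of a longer term."""
    n = len(short)
    return len(long) > n and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def c_value(term: Term, freq: Dict[Term, int]) -> float:
    """C-value in the assumed log2-length form (single words score 0)."""
    containers = [other for other in freq if is_nested(term, other)]
    weight = math.log2(len(term))
    if not containers:
        return weight * freq[term]
    mean_container_freq = sum(freq[t] for t in containers) / len(containers)
    return weight * (freq[term] - mean_container_freq)


# Hypothetical frequencies (assumption, not from the source).
freq = {
    ("color", "denim", "pants"): 4,
    ("color", "denim"): 6,
}
cv_long = c_value(("color", "denim", "pants"), freq)
cv_short = c_value(("color", "denim"), freq)
```

With these assumed counts the sketch yields about 6.34 for "color denim pants" and 2.0 for "color denim": the nested term is penalized because most of its occurrences are explained by the longer term.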
  16. Bilingual term counting and combination of measures. Our experiments show that counting Method 2 is better than normal counting (Method 1), and that the characteristics of each measure are different. Therefore, we combine them. • Combination of measures: $\mathrm{Score}_{FLC}(T_{J,E}) = \frac{R(\mathrm{Score}_{F2}(T_{J,E})) + R(\mathrm{Score}_{L2}(T_{J,E})) + R(\mathrm{Score}_{C}(T_{J,E}))}{3}$ • Two methods for counting the number of occurrences of a term T. Method 1: count every occurrence of T, regardless of where it occurs. Method 2: count T only when it occurs alone, i.e., do not count occurrences of T as a substring of a longer term.
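The two counting methods can be sketched as follows. This is one plausible reading, not the authors' implementation: sentences are assumed to be pre-tokenized, a "longer term" is assumed to mean any longer candidate in a given candidate list, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Term = Tuple[str, ...]
Span = Tuple[int, int]


def find_spans(tokens: Sequence[str], term: Term) -> List[Span]:
    """Return the [start, end) token spans where `term` occurs."""
    n = len(term)
    return [(i, i + n) for i in range(len(tokens) - n + 1)
            if tuple(tokens[i:i + n]) == term]


def count_method1(tokens: Sequence[str], term: Term) -> int:
    """Method 1: count every occurrence, regardless of context."""
    return len(find_spans(tokens, term))


def count_method2(tokens: Sequence[str], term: Term,
                  candidates: Sequence[Term]) -> int:
    """Method 2: skip occurrences that lie inside a longer candidate term."""
    longer_spans = [span
                    for other in candidates if len(other) > len(term)
                    for span in find_spans(tokens, other)]
    return sum(
        1 for start, end in find_spans(tokens, term)
        if not any(s <= start and end <= e for s, e in longer_spans)
    )


tokens = "color denim pants and denim pants".split()
candidates = [("denim", "pants"), ("color", "denim", "pants")]
m1 = count_method1(tokens, ("denim", "pants"))
m2 = count_method2(tokens, ("denim", "pants"), candidates)
```

Here "denim pants" occurs twice under Method 1 but only once under Method 2, because its first occurrence is part of the longer candidate "color denim pants".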
  17. Examples of the extracted bilingual terms (slide table with columns F2, L2, C and rows A, A', B, C; the column assignment and the split between grades A and A' are not recoverable from the transcript)

      A or A': daiya / diamond; daun jaketto / down jacket; kitake nagame / long length; wanpi-su / one-piece; kata osi / embossed leather; ga-ze sozai / gauze material; siagari / finish; kisetu kan / seasonal look; kobana gara / floral pattern; pointo / accent; iro zukai / coloring; pasu ke-su / card case
      B: uesuto bubun (waist part) / waist; konbou sozai (blend material) / blend; iro oti (faded color) / faded look
      C: sodeguti (cuff) / hem; siruetto bodi- (body silhouette) / item features
