A Comparison of Unsupervised
Bilingual Term Extraction
Methods Using Phrase-Tables
Masamichi Ideue†, Kazuhide Yamamoto†, Masao Utiyama‡, Eiichiro Sumita‡
† Nagaoka University of Technology, Japan
‡ National Institute of Information and Communications Technology
Background
• Automatic bilingual term extraction
• Helpful for human translators
• Applicable to other NLP tasks
Goal
Develop unsupervised methods for extracting bilingual terms from a phrase-table, and compare them.
Related works
• Using a parallel corpus
Itagaki et al. (2007) proposed a supervised method for extracting bilingual terms from the phrase-table built from a parallel corpus.
• Using an existing bilingual dictionary
Tonoike et al. (2006) translated the constituent words of each source-language term using a bilingual dictionary and combined these translations to form term candidates.
However, we usually have neither annotated data for training supervised methods nor bilingual dictionaries specific to the documents under translation.
Statistical measures
Three statistical scores are used to eliminate wrong pairs:
• Score_F : Significance of the candidates, based on Fisher's exact test.
• Score_L : Strength of the alignment between the words of the candidates.
• Score_C : Termhood of the candidate, based on C-value.
Bilingual term counting and
combination of measures
• Combination of scores
\[
\mathrm{Score}_{FLC}(T_{J,E}) = \frac{R(\mathrm{Score}_{F2}(T_{J,E})) + R(\mathrm{Score}_{L2}(T_{J,E})) + R(\mathrm{Score}_{C}(T_{J,E}))}{3}
\]
• Two methods for counting the number of
occurrences of term T
Method 1 : Count every occurrence of T, regardless of where it occurs.
Method 2 : Count T only when it occurs alone, i.e., do not count occurrences of T as a substring of a longer term.
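The difference between the two counting methods can be illustrated with a minimal Python sketch. It assumes tokenised sentences and a fixed set of candidate terms. The function names, the toy sentences, and the rule used to decide that T occurs "alone" (its span is not contained in the span of a longer candidate) are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def find_spans(tokens, terms):
    """Return (start, end, term) for every occurrence of a candidate term."""
    spans = []
    for term in terms:
        n = len(term)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == term:
                spans.append((i, i + n, term))
    return spans

def count_terms(sentences, terms, standalone_only=False):
    """Method 1 when standalone_only is False, Method 2 when it is True."""
    counts = Counter()
    for tokens in sentences:
        spans = find_spans(tokens, terms)
        for start, end, term in spans:
            if standalone_only:
                # Assumption: "alone" means not inside a longer matched candidate.
                nested = any(
                    s <= start and end <= e and (e - s) > (end - start)
                    for s, e, _ in spans
                )
                if nested:
                    continue
            counts[term] += 1
    return counts

# Hypothetical toy data, not taken from the apparel corpus.
sentences = [
    "color denim pants on sale".split(),
    "denim pants".split(),
]
terms = {("denim", "pants"), ("color", "denim", "pants")}

method1 = count_terms(sentences, terms)
method2 = count_terms(sentences, terms, standalone_only=True)
```

Under Method 1, "denim pants" is counted in both sentences. Under Method 2 it is counted only in the second sentence, because in the first it is a substring of "color denim pants".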
Experiments
• Training corpus : Japanese-English parallel corpus, consisting of about 60,000 pairs, related to apparel products.
• 22,543 bilingual term candidates were extracted from the phrase-table.
• For each score, 100 bilingual term candidates randomly selected from the top 1,000 candidates were manually evaluated.
Evaluation criterion
A : correct
A' : correct depending on context
B : partly correct
C : incorrect
Translation accuracy
Score   A    A'   B    C
F1      43   25   24   8
L1      77   5    18   0
C       78   6    14   2
F2      71   18   8    3
L2      79   4    17   0
FLC     87   2    11   0
Score_F2, Score_L2, Score_C, and Score_FLC can filter out the noisy extracted bilingual terms.
Characteristics of extracted
bilingual terms
Score   Occurrences   Words
F2      Many          Few
L2      Many          Many
C       Few           Many
FLC     Few           Many
• Each measure extracts different bilingual term candidates.
• The characteristics of Score_FLC showed a tendency similar to those of Score_C. This suggests that the residual noise of Score_C was filtered out by Score_F2 and Score_L2.
Conclusion
We compared three statistical measures for
extracting bilingual terms from the phrase-
table built from a parallel corpus.
The measures differ in the number of words and the number of occurrences of the bilingual terms they extract.
The combination of these measures
ranks valid bilingual terms highly.
Fisher's exact test
Score_F : Significance of the candidates
• Fisher's exact test has been used by Johnson et al. (2007) to select valid phrase pairs from the phrase-table for statistical machine translation.
We use the statistic of Fisher's exact test as Score_F to measure the validity of each bilingual term candidate. If Score_F of a bilingual term candidate is high, the candidate is likely to be valid.
Contingency table
          E              not E                  Total
J         C(J,E)         C(J)-C(J,E)            C(J)
not J     C(E)-C(J,E)    N-C(J)-C(E)+C(J,E)     N-C(J)
Total     C(E)           N-C(E)                 N
N : the number of all parallel sentences
C(J) : the number of Japanese sentences containing J
C(E) : the number of English sentences containing E
C(J,E) : the number of parallel sentences containing J and E
• P_h(C(J,E)) is the probability of observing the contingency table under the null hypothesis that J and E are independent of each other.
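A minimal Python sketch of a score of this kind is given here. The slide does not show the exact definition of Score_F, so the sketch assumes the form used by Johnson et al. (2007): the negative logarithm of the one-sided p-value, obtained by summing the hypergeometric probabilities P_h(k) for all k ≥ C(J,E). The function names and the example counts are hypothetical.

```python
import math

def log_comb(a, b):
    """Natural log of the binomial coefficient C(a, b)."""
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)

def log_p_h(k, c_j, c_e, n):
    """log P_h(k): probability that exactly k of the N sentence pairs
    contain both J and E when J and E are independent (hypergeometric)."""
    return log_comb(c_j, k) + log_comb(n - c_j, c_e - k) - log_comb(n, c_e)

def score_f(c_je, c_j, c_e, n):
    """Assumed Score_F: -log of the one-sided Fisher p-value.
    Expects counts that form a valid contingency table."""
    logs = [log_p_h(k, c_j, c_e, n) for k in range(c_je, min(c_j, c_e) + 1)]
    # Sum the probabilities in log space to avoid underflow.
    m = max(logs)
    log_p_value = m + math.log(sum(math.exp(x - m) for x in logs))
    return -log_p_value

# Hypothetical counts for a corpus of 60,000 sentence pairs.
strong = score_f(c_je=50, c_j=60, c_e=55, n=60000)
weak = score_f(c_je=5, c_j=60, c_e=55, n=60000)
```

With these counts `strong` is larger than `weak`: the more often J and E co-occur relative to their individual frequencies, the smaller the p-value and the higher the score.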
Log-likelihood Ratio
Score_L : Strength of the alignment
• Tonoike et al. (2007) reported that the alignments of a term's components are useful for automatic bilingual term extraction.
We use the word alignments of each candidate term to measure the validity of the candidates.
Alignment information
We used the alignment information produced by Moses (Koehn et al., 2007).
[Figure: alignments in the parallel sentences and alignments in T_{J,E}]
C-Value
Score_C : Termhood of the candidate
The C-value (Frantzi et al., 1996) has been used to measure the stability of nested multi-word term candidates.
Example: color denim pants (C-value = 6.34), color denim (2.0), denim pants (60.33)
If the term candidates of both languages are highly ranked in the C-value ranking, the bilingual term candidate is considered valid.
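As a rough illustration of how nesting affects the C-value, here is a minimal Python sketch of the commonly cited formulation: log2 of the term length times the term's frequency, reduced by the average frequency of the longer candidates that contain it. The frequencies are hypothetical, chosen so that two of the values shown on the slide come out. The slide does not say which variant of the formula or which counts the authors used, nor how the two monolingual C-values are merged into Score_C.

```python
import math

def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous sub-sequence of the strictly longer term."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
    )

def c_value(term, freq):
    """C-value of `term`, given a dict mapping candidate terms to frequencies."""
    longer = [t for t in freq if contains(t, term)]
    f = freq[term]
    if longer:
        # Subtract the average frequency of the longer terms that contain it.
        f -= sum(freq[t] for t in longer) / len(longer)
    return math.log2(len(term)) * f

# Hypothetical frequencies (terms are tuples of words).
freq = {
    ("color", "denim", "pants"): 4,
    ("color", "denim"): 6,
}

cv_long = c_value(("color", "denim", "pants"), freq)
cv_nested = c_value(("color", "denim"), freq)
```

Here `cv_long` rounds to 6.34 and `cv_nested` is 2.0. In this formulation a single-word term always scores 0, because log2(1) = 0.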
Bilingual term counting and
combination of measures
Our experiments show that the second counting method (Method 2) is better than normal counting (Method 1) and that the characteristics of each measure are different. Therefore, we combine them.
• Two methods for counting the number of occurrences of term T
Method 1 : Count every occurrence of T, regardless of where it occurs.
Method 2 : Count T only when it occurs alone, i.e., do not count occurrences of T as a substring of a longer term.
• Combination of measures
\[
\mathrm{Score}_{FLC}(T_{J,E}) = \frac{R(\mathrm{Score}_{F2}(T_{J,E})) + R(\mathrm{Score}_{L2}(T_{J,E})) + R(\mathrm{Score}_{C}(T_{J,E}))}{3}
\]
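The combined score can be sketched in Python as follows. The slides do not define R, so the sketch assumes that R(·) maps a score to the candidate's rank under that score (1 is best), which makes a lower combined value better. The sketch also assumes that the suffix 2 denotes scores computed with counting Method 2. The candidate names and score values are hypothetical, and ties are not treated specially.

```python
def ranks(scores):
    """Map each candidate to its 1-based rank (higher score = better rank)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {cand: i + 1 for i, cand in enumerate(ordered)}

def score_flc(score_f2, score_l2, score_c):
    """Average of the three ranks for every candidate (assumed meaning of R)."""
    r_f, r_l, r_c = ranks(score_f2), ranks(score_l2), ranks(score_c)
    return {cand: (r_f[cand] + r_l[cand] + r_c[cand]) / 3 for cand in score_f2}

# Hypothetical scores for three candidates.
score_f2 = {"a": 9.1, "b": 3.0, "c": 5.5}
score_l2 = {"a": 0.8, "b": 0.2, "c": 0.9}
score_c = {"a": 6.3, "b": 2.0, "c": 1.0}

combined = score_flc(score_f2, score_l2, score_c)
best_first = sorted(combined, key=combined.get)
```

Averaging ranks rather than raw scores sidesteps the fact that the three measures are on different scales.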
Examples of the extracted bilingual terms
(Japanese term → extracted English term, with literal glosses in parentheses. The original slide arranged the examples in columns F2, L2, and C.)
A : daiya → diamond; daun jaketto → down jacket; kitake nagame → long length; wanpi-su → one-piece; kata osi → embossed leather; ga-ze sozai → gauze material
A' : siagari → finish; kisetu kan → seasonal look; kobana gara → floral pattern; pointo → accent; iro zukai → coloring; pasu ke-su → card case
B : uesuto bubun (waist part) → waist; konbou sozai (blend material) → blend; iro oti (faded color) → faded look
C : sodeguti (cuff) → hem; siruetto bodi- (body silhouette) → item features
A Comparison of Unsuperviesed Bilingual Term Extraction Methods Using Phrase Tables

  • 1. A Comparison of Unsupervised Bilingual Term Extraction Methods Using Phrase-Tables Masamichi Ideue† Kazuhide Yamamoto Masao Utiyama Eiichiro Sumita ‡ Nagaoka University of Technology, Japan † National Institutre of Information and Communications Technology † ‡ ‡
  • 2. Background • Automatic bilingual term extraction • Helpful for human translators • Applicable to other NLP tasks Develop unsupervised methods for extracting bilingual terms from a phrase-table, and compare them. Goal 1
  • 3. Related work • Using a parallel corpus • Using an existing bilingual dictionary Tonoike et al. (2006) translated the words in each source-language term using a bilingual dictionary and combined these translations to form term candidates. Itagaki et al. (2007) proposed a supervised method for extracting bilingual terms from a phrase-table built from a parallel corpus. However, we usually have neither annotated data for training supervised methods nor bilingual dictionaries specific to the documents under translation.
  • 4.
  • 5.
  • 6. Statistical measures Three statistical scores are each used to eliminate wrong pairs. Score_F: significance of the candidate, based on Fisher's exact test. Score_L: strength of the alignment between the words of the candidate. Score_C: termhood of the candidate, based on the C-value.
  • 7. Bilingual term counting and combination of measures • Combination of scores: $\mathrm{Score}_{FLC}(T_{J,E}) = \frac{R(\mathrm{Score}_{F2}(T_{J,E})) + R(\mathrm{Score}_{L2}(T_{J,E})) + R(\mathrm{Score}_{C}(T_{J,E}))}{3}$ • Two methods for counting the number of occurrences of term T Method 1: Counting without regarding where T occurs Method 2: Counting T only when it occurs alone, i.e., we do not count the number of occurrences of term T when it occurs as a substring of a longer term.
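The score combination on this slide can be sketched roughly as follows. This is only an illustration under assumptions: the slide does not define R, so it is taken here to be a candidate's rank under one score (1 = best), which makes a lower combined value better. The function names and the toy scores are hypothetical.

```python
def ranks(scores):
    """Map each candidate to its rank under one score (1 = highest score).

    Assumption: R(.) on the slide is read as this rank; ties are broken
    arbitrarily by sort order.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {cand: i + 1 for i, cand in enumerate(ordered)}


def score_flc(score_f2, score_l2, score_c):
    """Average the three ranks of every candidate (lower is better)."""
    r_f, r_l, r_c = ranks(score_f2), ranks(score_l2), ranks(score_c)
    return {cand: (r_f[cand] + r_l[cand] + r_c[cand]) / 3 for cand in score_f2}


# Hypothetical toy scores for three candidate pairs.
score_f2 = {"a": 9.0, "b": 5.0, "c": 1.0}
score_l2 = {"a": 0.8, "b": 0.9, "c": 0.1}
score_c = {"a": 6.0, "b": 2.0, "c": 1.0}

combined = score_flc(score_f2, score_l2, score_c)
best = min(combined, key=combined.get)
```

Averaging ranks rather than raw values sidesteps the fact that the three scores live on different scales.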
  • 8. Experiments • 22,543 bilingual term candidates were extracted from the phrase-table. • Training corpus: a Japanese-English parallel corpus, consisting of about 60,000 pairs, related to apparel products. • For each score, 100 bilingual term candidates randomly selected from the top 1,000 candidates were manually evaluated. Evaluation criteria: A: correct; A': correct depending on context; B: partly correct; C: incorrect.
  • 9. Translation accuracy (number of candidates per rating, out of 100)
      Score    A    A'   B    C
      F1       43   25   24   8
      L1       77   5    18   0
      C        78   6    14   2
      F2       71   18   8    3
      L2       79   4    17   0
      FLC      87   2    11   0
    Score_F2, Score_L2, Score_C, and Score_FLC can filter out noisy extracted bilingual terms.
  • 10. Characteristics of extracted bilingual terms
      Score    Occurrences   Words
      F2       Many          Few
      L2       Many          Many
      C        Few           Many
      FLC      Few           Many
    • Each measure extracts different bilingual term candidates. • The characteristics of Score_FLC showed a tendency similar to those of Score_C. This suggests that the residual noise of Score_C was filtered out by Score_F2 and Score_L2.
  • 11. Conclusion We compared three statistical measures for extracting bilingual terms from a phrase-table built from a parallel corpus. The measures differ in the number of words and the number of occurrences of the bilingual terms they extract. The combination of these measures ranks valid bilingual terms highly.
  • 12. Fisher's exact test (Score_F: significance of the candidates) • Fisher's exact test was used by Johnson et al. (2007) to select valid phrase pairs from the phrase-table for statistical machine translation. We use the statistic of Fisher's exact test as Score_F to measure the validity of each bilingual term candidate. If the Score_F of a bilingual term candidate is high, the candidate is likely to be valid.
  • 13. Score_F: contingency table for a candidate pair (J, E)
                 E              not E                   Total
      J          C(J,E)         C(J)-C(J,E)             C(J)
      not J      C(E)-C(J,E)    N-C(J)-C(E)+C(J,E)      N-C(J)
      Total      C(E)           N-C(E)                  N
    N: the number of all parallel sentences; C(J): the number of Japanese sentences containing J; C(E): the number of English sentences containing E; C(J,E): the number of parallel sentences containing both J and E. • P_h(C(J,E)) is the probability of observing the contingency table under the null hypothesis that J and E are independent of each other.
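As an illustration of how the contingency table could be turned into a score, here is a minimal sketch using only the Python standard library. It assumes Score_F is the negative log of the one-sided Fisher p-value (the probability of seeing C(J,E) or more co-occurrences under independence), in the style of Johnson et al. (2007); the slides do not state the exact form, and the function names are hypothetical.

```python
from math import comb, log


def fisher_p_value(c_je, c_j, c_e, n):
    """One-sided p-value: P(co-occurrences >= c_je) if J and E are independent.

    Each term of the sum is the hypergeometric probability of one table
    with the same margins C(J), C(E), and N.
    """
    total = comb(n, c_e)
    upper = min(c_j, c_e)
    tail = sum(comb(c_j, k) * comb(n - c_j, c_e - k) for k in range(c_je, upper + 1))
    return tail / total


def score_f(c_je, c_j, c_e, n):
    """Assumed form of Score_F: -log(p). Higher means more significant.

    Note: for extremely small p-values the float underflows to 0.0 and
    log() raises ValueError; a real implementation would work in log space.
    """
    return -log(fisher_p_value(c_je, c_j, c_e, n))


# Toy counts (hypothetical): 10 sentence pairs, J and E each occur in 3.
p_always_together = fisher_p_value(3, 3, 3, 10)
p_once_together = fisher_p_value(1, 3, 3, 10)
```

With these toy counts, a pair that always co-occurs gets a much smaller p-value, and therefore a higher score, than a pair that co-occurs once.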
  • 14. Log-likelihood ratio (Score_L: strength of the alignment) • Tonoike et al. (2007) reported that the alignments of the components of a term are useful for automatic bilingual term extraction. We use the word alignments of each candidate term to measure the validity of the candidate.
  • 15. Alignment information We used the alignment information produced by Moses (Koehn et al., 2007). [Figure: alignments in the parallel sentences; alignments in $T_{J,E}$]
  • 16. C-value (Score_C: termhood of the candidate) The C-value (Frantzi et al., 1996) has been used to measure the stability of nested multi-word term candidates. Example: color denim pants (C-value = 6.34), color denim (2.0), denim pants (60.33). If the term candidates of both languages are highly ranked in the C-value ranking, the bilingual term candidate is considered valid.
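A rough sketch of a C-value computation follows. It assumes the commonly cited log2-based form of the measure (log2 of the term length times the frequency, discounted by the average frequency of the longer candidates that contain the term); the slides do not spell out the formula. The frequencies are hypothetical, chosen so that the first two values of the slide's example are reproduced, and are not data from the slides.

```python
from math import log2


def contains(longer, term):
    """True if `term` occurs as a contiguous word sequence inside `longer`."""
    n = len(term)
    return any(longer[i:i + n] == term for i in range(len(longer) - n + 1))


def c_value(term, freq):
    """C-value of `term`; `freq` maps candidate terms (word tuples) to counts.

    Single-word terms get 0 because log2(1) == 0.
    """
    n = len(term)
    containers = [t for t in freq if len(t) > n and contains(t, term)]
    if not containers:
        # Not nested in any longer candidate.
        return log2(n) * freq[term]
    # Discount by the mean frequency of the longer candidates containing it.
    nested = sum(freq[t] for t in containers) / len(containers)
    return log2(n) * (freq[term] - nested)


# Hypothetical frequencies (assumption, not taken from the slides).
freq = {
    ("color", "denim", "pants"): 4,
    ("color", "denim"): 6,
}

cv_long = c_value(("color", "denim", "pants"), freq)
cv_short = c_value(("color", "denim"), freq)
```

Here "color denim" is penalized because four of its six occurrences are explained by the longer candidate "color denim pants".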
  • 17. Bilingual term counting and combination of measures Our experiments show that counting Method 2 is better than normal counting (Method 1) and that the characteristics of each measure are different. Therefore, we combine the measures. • Two methods for counting the number of occurrences of term T Method 1: Counting regardless of where T occurs Method 2: Counting T only when it occurs alone, i.e., occurrences of T as a substring of a longer term are not counted. • Combination of measures: $\mathrm{Score}_{FLC}(T_{J,E}) = \frac{R(\mathrm{Score}_{F2}(T_{J,E})) + R(\mathrm{Score}_{L2}(T_{J,E})) + R(\mathrm{Score}_{C}(T_{J,E}))}{3}$
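The difference between the two counting methods can be illustrated with a small sketch. It assumes that "occurs alone" means the occurrence is not covered by an occurrence of a longer candidate term in the same sentence; that reading, the function names, and the toy sentences are assumptions rather than details given on the slides.

```python
def find_spans(tokens, term):
    """Return (start, end) positions where `term` occurs in `tokens`."""
    n = len(term)
    return [(i, i + n) for i in range(len(tokens) - n + 1)
            if tuple(tokens[i:i + n]) == term]


def count_method1(sentences, term):
    """Method 1: count every occurrence of the term, wherever it appears."""
    return sum(len(find_spans(s, term)) for s in sentences)


def count_method2(sentences, term, candidates):
    """Method 2: skip occurrences that lie inside a longer candidate term."""
    longer = [c for c in candidates if len(c) > len(term)]
    count = 0
    for s in sentences:
        covering = [span for c in longer for span in find_spans(s, c)]
        for start, end in find_spans(s, term):
            if not any(cs <= start and end <= ce for cs, ce in covering):
                count += 1
    return count


# Hypothetical tokenized sentences and candidate terms.
sentences = [
    ["color", "denim", "pants"],
    ["denim", "pants"],
    ["blue", "denim", "pants", "sale"],
]
candidates = {("denim", "pants"), ("color", "denim", "pants")}

n1 = count_method1(sentences, ("denim", "pants"))
n2 = count_method2(sentences, ("denim", "pants"), candidates)
```

In the toy data, "denim pants" is counted three times by Method 1 but only twice by Method 2, because its first occurrence lies inside the longer candidate "color denim pants".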
  • 18. Examples of the extracted bilingual terms (from Score_F2, Score_L2, and Score_C; romanized Japanese → English)
      A / A': daiya → diamond; daun jaketto → down jacket; kitake nagame → long length; wanpi-su → one-piece; kata osi → embossed leather; ga-ze sozai → gauze material; siagari → finish; kisetu kan → seasonal look; kobana gara → floral pattern; pointo → accent; iro zukai → coloring; pasu ke-su → card case
      B: uesuto bubun (waist part) → waist; konbou sozai (blend material) → blend; iro oti (faded color) → faded look
      C: sodeguti (cuff) → hem; siruetto bodi- (body silhouette) → item features