Consistent Improvement in Translation Quality of
Chinese–Japanese Technical Texts by Adding Additional
Quasi-parallel Training Data
Wei Yang and Yves Lepage
Graduate School of Information, Production and Systems
Waseda University
kevinyoogi@akane.waseda.jp ; yves.lepage@waseda.jp
Bilingual parallel corpora are an extremely important resource, as they are typically used
to train data-driven machine translation systems. Many freely available corpora already
exist for European languages, but almost none for Chinese–Japanese. Building large
bilingual corpora is a problem for less documented language pairs. We construct a
quasi-parallel corpus automatically using analogical associations, starting from a certain
amount of parallel data and a small amount of monolingual data. In SMT experiments,
adding this Chinese–Japanese data to the baseline training corpus improved the evaluation
scores on the same test set over the baseline systems, significantly in some configurations
and slightly in others.
Building analogical clusters according to proportional analogies
• Proportional analogy establishes a general relationship between four objects A, B, C
and D: "A is to B as C is to D". An efficient algorithm for the resolution of analogical
equations was proposed in (Lepage, 1998).¹
$$
A : B :: C : D \;\Rightarrow\;
\begin{cases}
|A|_a - |B|_a = |C|_a - |D|_a, \quad \forall a \\
d(A, B) = d(C, D) \\
d(A, C) = d(B, D)
\end{cases}
$$
• Sentential analogy:
早急に対応して下さい。 : 早急に対応して欲しい。 :: 元に戻して下さい。 : 元に戻して欲しい。
• Analogical cluster: We can cluster sentential analogies as a sequence of lines, where
each line contains one sentence pair and where any two pairs of sentences form a
sentential analogy.
早急に対応して下さい。: 早急に対応して欲しい。
元に戻して下さい。 : 元に戻して欲しい。
やめて下さい。 : やめて欲しい。
• We produced all possible analogical clusters from Chinese and Japanese unrelated
unaligned monolingual data collected from the Web.
                            Chinese   Japanese
# of different sentences     70,000     70,000
# of clusters                23,182     21,975
Such clusters can be considered as rewriting models that can generate new sentences.
• Extracting corresponding clusters by computing similarity according to the classical Dice
formula:
$$
\mathrm{Sim} = \frac{2 \times |S_{zh} \cap S_{ja}|}{|S_{zh}| + |S_{ja}|}
\quad\Rightarrow\quad
\mathrm{Sim}_{C_{zh}-C_{ja}} = \frac{1}{2}\left(\mathrm{Sim}_{left} + \mathrm{Sim}_{right}\right)
$$
$S_{zh}$ and $S_{ja}$ denote the minimal sets of changes across the clusters (on the left or
on the right) in the two languages (after translation and conversion).
Chinese cluster
left part : right part
经典游戏 : 游戏很不错
‘classic game’ : ‘The game is very good.’
喜欢经典 : 很不错喜欢
‘I like classic.’ : ‘Very good, I like it.’
经典啊 : 很不错啊
‘Classic!’ : ‘Very good!’
Japanese cluster
left part : right part
クラシック物語 : この物語はとてもいい
‘classic narrative’ : ‘The narrative is very good.’
クラシック音楽 : この音楽はとてもいい
‘classic music’ : ‘The music is very good.’
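As a rough illustration of the two computations in this section, the following Python sketch checks the three conditions implied by A : B :: C : D and computes the Dice-based similarity between two clusters. It is a minimal sketch under assumptions, not the authors' implementation: |A|_a is read as the number of occurrences of character a in A, d is taken to be the Levenshtein distance (the poster does not say which distance is used), and all function names, as well as the example change sets (English glosses standing in for the changes after translation and conversion), are hypothetical.

```python
from collections import Counter


def levenshtein(s: str, t: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # match or substitute
        prev = curr
    return prev[-1]


def count_differences(x: str, y: str) -> Counter:
    """Per-character difference |x|_a - |y|_a (may be negative or zero)."""
    diff = Counter(x)
    diff.subtract(Counter(y))
    return diff


def satisfies_analogy_constraints(a: str, b: str, c: str, d: str) -> bool:
    """Check the three conditions implied by A : B :: C : D.

    These are necessary conditions only: passing them does not prove
    that the four strings form an analogy.
    """
    diff_ab = count_differences(a, b)
    diff_cd = count_differences(c, d)
    chars = set(diff_ab) | set(diff_cd)
    if any(diff_ab[ch] != diff_cd[ch] for ch in chars):
        return False
    return (levenshtein(a, b) == levenshtein(c, d)
            and levenshtein(a, c) == levenshtein(b, d))


def dice(s_zh: set, s_ja: set) -> float:
    """Dice coefficient between two sets of changes."""
    if not s_zh and not s_ja:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    return 2 * len(s_zh & s_ja) / (len(s_zh) + len(s_ja))


def cluster_similarity(left_zh: set, left_ja: set,
                       right_zh: set, right_ja: set) -> float:
    """Average of the Dice scores of the left and right parts."""
    return 0.5 * (dice(left_zh, left_ja) + dice(right_zh, right_ja))


# Sentential analogy taken from the poster.
print(satisfies_analogy_constraints(
    "早急に対応して下さい。", "早急に対応して欲しい。",
    "元に戻して下さい。", "元に戻して欲しい。"))

# Hypothetical change sets, already mapped to a common vocabulary.
print(cluster_similarity({"classic"}, {"classic"},
                         {"very", "good"}, {"this", "very", "good"}))
```

Averaging the left and right scores means a cluster pair is ranked highly only when the changes on both sides of the colon correspond, which matches the second formula above.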
Generation of new sentences using analogical associations
• Generation of new sentences
We use analogy as an operation by which, given two related forms (a rewriting model) and
a third form, the missing fourth form is coined.² Applied to sentences, this principle
can be illustrated as follows:
早急に対応して下さい。 : 早急に対応して欲しい。 :: 正式版に戻して下さい。 : x
⇒ x = 正式版に戻して欲しい。
• Experiments on new sentence generation and filtering by N-sequences
We eliminate any sentence that contains an N-sequence of a given length unseen in
our data. For valid sentences, we remember their corresponding seed sentences and
the cluster identifiers they were generated from.
                               Chinese      Japanese
# of seed sentences             99,538        97,152
# of clusters                   23,182        21,975
# of candidate sentences   105,038,200    80,183,424
  Q                                29%           40%
# of filtered sentences
  unique                        33,141        40,234
  seed–new–#                    67,099        84,533
  Q                                96%           96%
• Deducing and acquiring quasi-parallel sentences
We deduce translation relations based on the initial parallel corpus and corresponding
clusters between Chinese and Japanese.
Chinese seed–new–#                           67,099
Japanese seed–new–#                          84,533
Chinese–Japanese initial parallel corpus    103,629
Chinese–Japanese corresponding clusters      15,710
Chinese–Japanese quasi-parallel corpus       35,817
A : B :: C_seed : X_new−zh
  Cluster (A : B)   经典游戏 : 游戏很不错
                    喜欢经典 : 很不错喜欢
                    经典啊 : 很不错啊
  Seed (C)          经典电影 ‘classic film’
  ⇒ New (X)         电影很不错 ‘The film is very good.’
                    很不错电影 ‘That’s very good, the film.’
A : B :: C_seed : X_new−ja
  Cluster (A : B)   クラシック物語 : この物語はとてもいい
                    クラシック音楽 : この音楽はとてもいい
  Seed (C)          クラシック映画 ‘classic film’
  ⇒ New (X)         この映画はとてもいい ‘The film is very good.’
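The generation and filtering steps of this section can be sketched as follows. This is a simplified stand-in, not the algorithm of (Lepage, 1998): the solver only handles analogies in which A and B share a prefix and differ in their ending, so it covers the Japanese example with 下さい / 欲しい but not the Chinese cluster above, where material moves from the front to the back. An N-sequence is read here as a character N-gram, which is an assumption, and all function names are hypothetical.

```python
from typing import Iterable, List, Optional, Set


def solve_suffix_analogy(a: str, b: str, c: str) -> Optional[str]:
    """Solve A : B :: C : x when A and B differ only in their ending.

    Returns None when C does not end with the ending of A, i.e. when
    this restricted solver cannot apply the rewriting model to C.
    """
    k = 0  # length of the longest common prefix of A and B
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    suffix_a, suffix_b = a[k:], b[k:]
    if not c.endswith(suffix_a):
        return None
    return c[:len(c) - len(suffix_a)] + suffix_b


def nsequences(sentence: str, n: int) -> Set[str]:
    """All character N-grams of a sentence (empty if it is shorter than n)."""
    return {sentence[i:i + n] for i in range(len(sentence) - n + 1)}


def filter_by_nsequences(candidates: Iterable[str],
                         reference: Iterable[str],
                         n: int) -> List[str]:
    """Keep candidates whose N-sequences all occur in the reference data."""
    seen: Set[str] = set()
    for sentence in reference:
        seen |= nsequences(sentence, n)
    return [c for c in candidates if nsequences(c, n) <= seen]


# Rewriting model and seed sentence taken from the poster.
rewriting_model = ("早急に対応して下さい。", "早急に対応して欲しい。")
seed = "正式版に戻して下さい。"
new_sentence = solve_suffix_analogy(*rewriting_model, seed)
print(new_sentence)

# Tiny hypothetical reference data; the second candidate is a scrambled
# sentence added only to show the filter rejecting something.
reference = ["早急に対応して欲しい。", "正式版に戻して下さい。"]
candidates = [new_sentence, "正式版に欲しい戻して。"]
print(filter_by_nsequences(candidates, reference, n=3))
```

A real pipeline would also record, for each kept sentence, its seed sentence and the identifier of the cluster it came from, as described in the filtering step.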
SMT experiments
• Experimental protocol: To assess the contribution of the generated quasi-parallel corpus,
we compare two SMT systems. The first, the baseline, is built from the initial
ASPEC-JC parallel corpus alone. The second adds the quasi-parallel corpus obtained
using analogical associations and analogical clusters.
Train (baseline)                  Chinese        Japanese
  sentences                       672,315         672,315
  words                        18,847,514      23,480,703
  mean ± std.dev.           28.12 ± 15.20   35.05 ± 18.88
Train (+ quasi-parallel)          Chinese        Japanese
  sentences                       708,132         708,132
  words                        19,212,187      24,512,079
  mean ± std.dev.           27.13 ± 14.19   34.23 ± 17.22
Tune (both experiments)           Chinese        Japanese
  sentences                         2,090           2,090
  words                            60,458          73,177
  mean ± std.dev.           28.93 ± 15.86   35.01 ± 18.87
Test (both experiments)           Chinese        Japanese
  sentences                         2,107           2,107
  words                            59,594          72,027
  mean ± std.dev.           28.28 ± 14.55   34.18 ± 17.43
• Experimental results (with different segmentation tools and Moses versions):
– Segmentation tools: urheen and mecab; Moses 1.0 (significant improvement)
                                       BLEU    NIST     WER     TER   RIBES
  zh-ja  baseline                     29.10  7.5677  0.5352  0.5478  0.7801
  zh-ja  + additional training data   32.03  7.9741  0.5069  0.5172  0.7906
  ja-zh  baseline                     22.98  7.0103  0.5481  0.5711  0.7893
  ja-zh  + additional training data   24.87  7.3208  0.5273  0.5482  0.8013
– Segmentation tools: urheen and mecab; Moses 2.1.1
                                       BLEU    NIST     WER     TER   RIBES
  zh-ja  baseline                     33.41  8.1537  0.4967  0.5061  0.7956
  zh-ja  + additional training data   33.68  8.1820  0.4955  0.5039  0.7964
  ja-zh  baseline                     25.53  7.3885  0.5227  0.5427  0.8053
  ja-zh  + additional training data   25.80  7.4571  0.5176  0.5378  0.8060
– Segmentation tool: kytea; Moses 1.0
                                       BLEU    NIST     WER     TER   RIBES
  zh-ja  baseline                     28.35  7.3123  0.5667  0.5741  0.7610
  zh-ja  + additional training data   28.87  7.4637  0.5566  0.5615  0.7739
  ja-zh  baseline                     22.83  6.9533  0.5633  0.5853  0.7807
  ja-zh  + additional training data   23.18  7.0402  0.5547  0.5778  0.7865
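The figures in this section can be cross-checked with a few lines of arithmetic. The sketch below is only a reading aid: it recomputes the number of sentence pairs added to the training data and the BLEU difference for each configuration, using the values reported in the tables of this section. The variable and function names are hypothetical.

```python
# Training set sizes (sentence pairs) reported in the data tables.
BASELINE_TRAIN = 672_315
AUGMENTED_TRAIN = 708_132

# (configuration, direction) -> (baseline BLEU, BLEU with additional data)
BLEU = {
    ("urheen+mecab, Moses 1.0", "zh-ja"): (29.10, 32.03),
    ("urheen+mecab, Moses 1.0", "ja-zh"): (22.98, 24.87),
    ("urheen+mecab, Moses 2.1.1", "zh-ja"): (33.41, 33.68),
    ("urheen+mecab, Moses 2.1.1", "ja-zh"): (25.53, 25.80),
    ("kytea, Moses 1.0", "zh-ja"): (28.35, 28.87),
    ("kytea, Moses 1.0", "ja-zh"): (22.83, 23.18),
}


def added_sentence_pairs() -> int:
    """Number of sentence pairs added on top of the baseline corpus."""
    return AUGMENTED_TRAIN - BASELINE_TRAIN


def bleu_gains(scores: dict) -> dict:
    """BLEU difference (augmented minus baseline), rounded to 2 decimals."""
    return {key: round(new - base, 2) for key, (base, new) in scores.items()}


print(added_sentence_pairs())
for (config, direction), gain in bleu_gains(BLEU).items():
    print(f"{config:28s} {direction}  {gain:+.2f}")
```

The number of added pairs equals the size of the quasi-parallel corpus reported earlier (35,817), and every configuration shows a positive BLEU difference, the largest being with urheen and mecab under Moses 1.0.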
¹ Yves Lepage. Solving analogies on words: An algorithm. COLING-ACL’98, Volume I, pp. 728-735, Montréal, Aug. 1998.
² Ferdinand de Saussure. Cours de linguistique générale. Payot, Lausanne et Paris, [1ère éd. 1916] edition, 1995.

NelTorrente
 
Pride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School DistrictPride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School District
David Douglas School District
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
TechSoup
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
Jean Carlos Nunes Paixão
 
Assignment_4_ArianaBusciglio Marvel(1).docx
Assignment_4_ArianaBusciglio Marvel(1).docxAssignment_4_ArianaBusciglio Marvel(1).docx
Assignment_4_ArianaBusciglio Marvel(1).docx
ArianaBusciglio
 
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdfANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
Priyankaranawat4
 
Aficamten in HCM (SEQUOIA HCM TRIAL 2024)
Aficamten in HCM (SEQUOIA HCM TRIAL 2024)Aficamten in HCM (SEQUOIA HCM TRIAL 2024)
Aficamten in HCM (SEQUOIA HCM TRIAL 2024)
Ashish Kohli
 
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat  Leveraging AI for Diversity, Equity, and InclusionExecutive Directors Chat  Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
TechSoup
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama UniversityNatural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Akanksha trivedi rama nursing college kanpur.
 
A Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptxA Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptx
thanhdowork
 

Recently uploaded (20)

The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
 
World environment day ppt For 5 June 2024
World environment day ppt For 5 June 2024World environment day ppt For 5 June 2024
World environment day ppt For 5 June 2024
 
Digital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental DesignDigital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental Design
 
Normal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of LabourNormal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of Labour
 
DRUGS AND ITS classification slide share
DRUGS AND ITS classification slide shareDRUGS AND ITS classification slide share
DRUGS AND ITS classification slide share
 
Digital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion DesignsDigital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion Designs
 
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
 
PIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf IslamabadPIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf Islamabad
 
How to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP ModuleHow to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP Module
 
MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...
MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...
MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...
 
Pride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School DistrictPride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School District
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
 
Assignment_4_ArianaBusciglio Marvel(1).docx
Assignment_4_ArianaBusciglio Marvel(1).docxAssignment_4_ArianaBusciglio Marvel(1).docx
Assignment_4_ArianaBusciglio Marvel(1).docx
 
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdfANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
 
Aficamten in HCM (SEQUOIA HCM TRIAL 2024)
Aficamten in HCM (SEQUOIA HCM TRIAL 2024)Aficamten in HCM (SEQUOIA HCM TRIAL 2024)
Aficamten in HCM (SEQUOIA HCM TRIAL 2024)
 
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat  Leveraging AI for Diversity, Equity, and InclusionExecutive Directors Chat  Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
 
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama UniversityNatural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
 
A Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptxA Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptx
 

Wei Yang - 2014 - Consistent Improvement in Translation Quality of Chinese–Japanese Technical Texts by Adding Additional Quasi-parallel Training Data

Analogical clusters produced from the monolingual data:

                            Chinese   Japanese
# of different sentences    70,000    70,000
# of clusters               23,182    21,975

Such clusters can be considered as rewriting models that can generate new sentences.

• Extracting corresponding clusters by computing similarity according to a classical Dice formula:

$\mathrm{Sim} = \dfrac{2 \times |S_{zh} \cap S_{ja}|}{|S_{zh}| + |S_{ja}|} \;\Rightarrow\; \mathrm{Sim}_{C_{zh}-C_{ja}} = \dfrac{1}{2}\left(\mathrm{Sim}_{left} + \mathrm{Sim}_{right}\right)$

$S_{zh}$ and $S_{ja}$ denote the minimal sets of changes across the clusters (both on the left or right) in both languages (after translation and conversion).

Chinese cluster
left part                           : right part
经典游戏 ‘classic game’             : 游戏很不错 ‘The game is very good.’
喜欢经典 ‘I like classic.’          : 很不错喜欢 ‘Very good, I like it.’
经典啊 ‘Classic!’                   : 很不错啊 ‘Very good!’

Japanese cluster
left part                           : right part
クラシック物語 ‘classic narrative’  : この物語はとてもいい ‘The narrative is very good.’
クラシック音楽 ‘classic music’      : この音楽はとてもいい ‘The music is very good.’

Generation of new sentences using analogical associations

• Generation of new sentences
We use analogy as an operation by which, given two related forms (rewriting model) and only one form, the fourth missing form is coined2. Applied on sentences, this principle can be illustrated as follows:

早急に対応して下さい。 : 早急に対応して欲しい。 :: 正式版に戻して下さい。 : x
⇒ x = 正式版に戻して欲しい。

• Experiments on new sentence generation and filtering by N-sequences
We eliminate any sentence that contains an N-sequence of a given length unseen in our data. For valid sentences, we remember their corresponding seed sentences and the cluster identifiers they were generated from.

                                        Chinese        Japanese
# of seed sentences                     99,538         97,152
# of clusters                           23,182         21,975
# of candidate sentences                105,038,200    80,183,424
  Q                                     29%            40%
# of filtered sentences (unique)        33,141         40,234
# of filtered sentences (seed–new–#)    67,099         84,533
  Q                                     96%            96%

• Deducing and acquiring quasi-parallel sentences
We deduce translation relations based on the initial parallel corpus and corresponding clusters between Chinese and Japanese.

Chinese             seed–new–#                 67,099
Japanese            seed–new–#                 84,533
Chinese–Japanese    Initial parallel corpus    103,629
Chinese–Japanese    Corresponding clusters     15,710
Chinese–Japanese    Quasi-parallel corpus      35,817

A : B :: Cseed : Xnew−zh
Cluster:
经典游戏 : 游戏很不错
喜欢经典 : 很不错喜欢
经典啊 : 很不错啊
Seed: 经典电影 ‘classic film’
⇒ 电影很不错 ‘The film is very good.’
⇒ 很不错电影 ‘That’s very good, the film.’

A : B :: Cseed : Xnew−ja
Cluster:
クラシック物語 : この物語はとてもいい
クラシック音楽 : この音楽はとてもいい
Seed: クラシック映画 ‘classic film’
⇒ この映画はとてもいい ‘The film is very good.’

SMT experiments

• Experimental protocol: To assess the contribution of the generated quasi-parallel corpus, we compare two SMT systems. The first one is constructed using the initial given ASPEC-JC parallel corpus. This is the baseline. The second one adds the additional quasi-parallel corpus obtained using analogical associations and analogical clusters.

Baseline (train)              Chinese          Japanese
sentences                     672,315          672,315
words                         18,847,514       23,480,703
mean ± std.dev.               28.12 ± 15.20    35.05 ± 18.88

+ Quasi-parallel (train)      Chinese          Japanese
sentences                     708,132          708,132
words                         19,212,187       24,512,079
mean ± std.dev.               27.13 ± 14.19    34.23 ± 17.22

Both experiments (tune)       Chinese          Japanese
sentences                     2,090            2,090
words                         60,458           73,177
mean ± std.dev.               28.93 ± 15.86    35.01 ± 18.87

Both experiments (test)       Chinese          Japanese
sentences                     2,107            2,107
words                         59,594           72,027
mean ± std.dev.               28.28 ± 14.55    34.18 ± 17.43

• Experimental results (using the different segmentation tools and moses version):

– segmentation tools: urheen and mecab, moses 1.0: significant.

                                   BLEU   NIST    WER     TER     RIBES
zh-ja  baseline                    29.10  7.5677  0.5352  0.5478  0.7801
zh-ja  + additional training data  32.03  7.9741  0.5069  0.5172  0.7906
ja-zh  baseline                    22.98  7.0103  0.5481  0.5711  0.7893
ja-zh  + additional training data  24.87  7.3208  0.5273  0.5482  0.8013

– segmentation tools: urheen and mecab, moses 2.1.1

                                   BLEU   NIST    WER     TER     RIBES
zh-ja  baseline                    33.41  8.1537  0.4967  0.5061  0.7956
zh-ja  + additional training data  33.68  8.1820  0.4955  0.5039  0.7964
ja-zh  baseline                    25.53  7.3885  0.5227  0.5427  0.8053
ja-zh  + additional training data  25.80  7.4571  0.5176  0.5378  0.8060

– segmentation tools: kytea, moses 1.0

                                   BLEU   NIST    WER     TER     RIBES
zh-ja  baseline                    28.35  7.3123  0.5667  0.5741  0.7610
zh-ja  + additional training data  28.87  7.4637  0.5566  0.5615  0.7739
ja-zh  baseline                    22.83  6.9533  0.5633  0.5853  0.7807
ja-zh  + additional training data  23.18  7.0402  0.5547  0.5778  0.7865

1 Yves Lepage. Solving analogies on words: An algorithm, COLING-ACL’98, Volume I, pp. 728-735, Montréal, Aug. 1998.
2 Ferdinand de Saussure. Cours de linguistique générale, Payot, Lausanne et Paris, [1ère éd. 1916] edition, 1995.
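The Dice-based cluster similarity used to extract corresponding clusters can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function names `dice` and `cluster_similarity` are hypothetical, and the change sets are assumed to have already been mapped into a common representation (the "translation and conversion" step mentioned in the poster), here stood in for by English glosses.

```python
def dice(s_zh, s_ja):
    """Classical Dice coefficient between two sets of changes."""
    s_zh, s_ja = set(s_zh), set(s_ja)
    if not s_zh and not s_ja:
        # Assumption: two empty change sets are treated as dissimilar.
        return 0.0
    return 2 * len(s_zh & s_ja) / (len(s_zh) + len(s_ja))


def cluster_similarity(zh_left, zh_right, ja_left, ja_right):
    """Mean of the Dice similarities of the left and right parts."""
    sim_left = dice(zh_left, ja_left)
    sim_right = dice(zh_right, ja_right)
    return 0.5 * (sim_left + sim_right)


# Hypothetical change sets, written as English glosses for illustration.
zh_left, zh_right = {"classic"}, {"very", "good"}
ja_left, ja_right = {"classic"}, {"this", "very", "good"}

# Left parts match fully (1.0); right parts share 2 of 2 + 3 items (0.8).
score = cluster_similarity(zh_left, zh_right, ja_left, ja_right)
print(round(score, 2))
```

A pair of clusters would then be kept as corresponding when this score exceeds some threshold; the poster does not state the threshold, so none is assumed here.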