[Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Evaluating MT Systems with Second Language Proficiency Tests
Takuya Matsuzaki, Akira Fujita, Naoya Todo, Noriko H. Arai
ACL 2015
2015/09/24
AHCLab M1 Makoto Morishita

Abstract
• BLEU has several weaknesses when evaluating MT systems in realistic settings.
• This paper evaluates MT systems using second-language proficiency tests (TOEIC, etc.).
• The results reveal that the context-unawareness of current MT systems severely damages human performance when solving the test problems.

Weak Points of BLEU
1. Unreliability in evaluating short translations
2. Non-interpretability of the scores beyond
numerical comparison
3. Bias towards SMT systems
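
To illustrate the first weak point, here is a minimal sketch of an unsmoothed, single-reference, sentence-level BLEU. It is a simplification for illustration, not the scorer used in the paper; the function names and the example sentences are hypothetical. With the standard 4-gram setting, a translation shorter than four tokens has no 4-grams at all, so even a perfect short translation scores zero.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU on whitespace tokens (illustrative only)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: each candidate n-gram counts at most as often
        # as it appears in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty for candidates that are not longer than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


# A perfect three-token translation still scores 0.0: it contains no 4-grams.
short_perfect = sentence_bleu("I am fine", "I am fine")
# An acceptable paraphrase with no shared tokens also scores 0.0.
paraphrase = sentence_bleu("Thank you very much", "Thanks a lot")
```

Dialogue turns, such as those in the test problems used later in these slides, are often this short, which is why sentence-level scores of this kind can be hard to trust.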

Weak Points of Manual Evaluation
1. It is expensive.
2. It is not easy to analyze the characteristics of MT systems based solely on the evaluation results.

Solution
• Task-based evaluation of MT systems
- Measures human performance in a task
• Human subjects perform a task, such as information extraction, on machine-translated text.

Weak Points of Task-Based Evaluation
• It is expensive.
- Test materials have to be created, and appropriate human subjects have to be gathered.
• This paper uses second-language proficiency tests (SLPTs), such as TOEIC, as the source of test materials.
• Human subjects solve the machine-translated problems, and the MT system is evaluated by their test scores.

Second-Language Proficiency Tests (SLPT)
• There are many SLPTs for many languages.
• They are carefully designed to evaluate various aspects of language ability.
• SLPTs are designed to assess language ability, not general intelligence.
- They can be robust against the heterogeneity (diversity) of the subjects.

Materials
• 40 problems were chosen randomly from the National Center Test for University Admissions (センター試験).
• Each problem consisted of a short conversation between two people.

Materials
• The paper uses multiple-choice dialogue completion problems.

Experiment
• The original problems were in English and were translated into Japanese.
• The human subjects solved the translated problems.
• Translation quality was evaluated based on the rate of correct answers given by the human subjects.

Experiment
• Four translation sources were evaluated.
- G: Google Translate
- Y: Yahoo Translate
- Hs: Human translation that does not consider context
- Ho: Human translation that considers context

Participants
• 320 Japanese junior high school students
- School A: 1st: 80, 2nd: 80, 3rd: 78
- School B: 1st: 82

Extrinsic Evaluation Metric
• CAR: Correct Answer Rate

\[
\mathrm{CAR}_M(p) = \frac{\#\text{ of subjects who correctly answered } M(p)}{\#\text{ of subjects who solved } M(p)}
\]

\[
\mathrm{AvgCAR}_M = \frac{1}{|P|} \sum_{p \in P} \mathrm{CAR}_M(p)
\]
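
The CAR and its average can be computed as in the following minimal sketch. It assumes that M(p) denotes problem p translated by system M and that P is the set of problems. The record layout `(system, problem_id, is_correct)`, the function names, and the sample responses are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def correct_answer_rates(responses):
    """Return {(system, problem): CAR} from (system, problem, is_correct) records."""
    solved = defaultdict(int)
    correct = defaultdict(int)
    for system, problem, is_correct in responses:
        solved[(system, problem)] += 1            # subjects who solved M(p)
        correct[(system, problem)] += int(is_correct)  # ... and answered correctly
    return {key: correct[key] / solved[key] for key in solved}


def average_car(car, system):
    """Average CAR of one system over all problems it was evaluated on."""
    rates = [rate for (s, _), rate in car.items() if s == system]
    return sum(rates) / len(rates)


# Made-up responses: two subjects solved G's translation of p1, one solved p2.
responses = [
    ("G", "p1", True),
    ("G", "p1", False),
    ("G", "p2", True),
    ("Ho", "p1", True),
]
car = correct_answer_rates(responses)
avg_g = average_car(car, "G")  # mean of 0.5 (p1) and 1.0 (p2)
```

Averaging per-problem rates, rather than pooling all answers, gives every problem the same weight even when different numbers of subjects solved each one.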

Robustness against the Heterogeneity of the Human Subjects
• School A, 1st (80) vs. 2nd (80) vs. 3rd (78): No difference
• School A, 1st (80) vs. School B, 1st (82): No difference
→ The participants' heterogeneity did not affect the test results.

System-level Evaluation
• No significant difference was found between Y and Hs.


• Refo: Do not consider context
• Refs: Consider context

Agreement
• For two translations of the same problem, by systems A and B: if an intrinsic measure M scores A's translation higher than B's, and the CAR of A's translation is also higher than that of B's, then M and CAR agree.
• The agreement rate is computed over the individual problems.
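
The agreement check can be sketched as follows. This is a minimal illustration, not the authors' exact procedure: the dictionary layout, the function name, the sample scores, and in particular the decision to skip tied pairs are all assumptions.

```python
from itertools import combinations


def agreement_rate(intrinsic, car):
    """Share of (problem, system pair) cases where both measures rank the pair the same way.

    intrinsic, car: dicts mapping (system, problem) -> score.
    """
    agree = total = 0
    problems = {p for _, p in car}
    for p in problems:
        # Systems that have both an intrinsic score and a CAR for this problem.
        systems = sorted(s for s, q in car if q == p and (s, p) in intrinsic)
        for a, b in combinations(systems, 2):
            d_intrinsic = intrinsic[(a, p)] - intrinsic[(b, p)]
            d_car = car[(a, p)] - car[(b, p)]
            if d_intrinsic == 0 or d_car == 0:
                continue  # assumption: ties are left out of the count
            total += 1
            agree += (d_intrinsic > 0) == (d_car > 0)
    return agree / total if total else float("nan")


# Made-up scores: the measures agree on p1 (Y above G) and disagree on p2.
intrinsic = {("G", "p1"): 0.3, ("Y", "p1"): 0.5, ("G", "p2"): 0.6, ("Y", "p2"): 0.4}
car = {("G", "p1"): 0.4, ("Y", "p1"): 0.7, ("G", "p2"): 0.5, ("Y", "p2"): 0.8}
rate = agreement_rate(intrinsic, car)
```

Because the comparison only looks at which translation ranks higher, it does not require the intrinsic score and the CAR to be on the same scale.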

Agreement Rate
• Agreement Rates between Automatic
Evaluation Metrics and Human Evaluation

Agreement Rate
• Agreement Rates between Intrinsic
Evaluation Metrics and Correct Answer Rate

Agreement Rate
• The human evaluation agrees with the CAR slightly better than the automatic metrics do.
• However, the agreement rate is still below 0.7.
• CAR can be critically damaged by a subtle translation mistake.

Conclusion
• The comparison of the four systems shows that it is important to consider the context of individual sentences when translating dialogues.
• SLPT-based evaluation can capture a different dimension of translation quality.
• SLPT-based evaluation can be robust against the heterogeneity of human subjects.
