WMT14_sakaguchi

Efficient Elicitation of Annotations for
Human Evaluation of Machine Translation
Keisuke Sakaguchi, Matt Post, Benjamin Van Durme
ACL 2014 NINTH WORKSHOP ON STATISTICAL MACHINE TRANSLATION (WMT)

WMT Competition

5-way ranking

Pairwise Comparisons

[Figure: WMT system ranking with clusters, e.g. 1: System A, System E; 3: System B; ...; 12: System J, System C]

Problem
Needs lots of data (94k) to get good clusters

“TrueSkill” lets us do it with less data (1/3)

Models

1.  Expected Wins
2.  Hopkins and May
3.  TrueSkill

TrueSkill is able to:
1.  rank and cluster with much less data
2.  predict pairwise comparisons with higher accuracy
3.  be learned by online learning

Expected Wins

Average relative frequency of wins

Ranked by the score

Ties are ignored.
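
The scoring described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `expected_wins`, the tuple format of the judgments, and the choice to average only over opponents with at least one non-tied comparison are all assumptions.

```python
from collections import defaultdict

def expected_wins(judgments):
    """Score each system by its average relative frequency of wins.

    judgments: iterable of (sys_a, sys_b, outcome), outcome in {">", "<", "="}.
    Ties ("=") are ignored, as on the slide.
    """
    wins = defaultdict(int)  # wins[(a, b)] = number of times a beat b
    systems = set()
    for a, b, outcome in judgments:
        systems.update((a, b))
        if outcome == ">":
            wins[(a, b)] += 1
        elif outcome == "<":
            wins[(b, a)] += 1

    scores = {}
    for a in systems:
        ratios = []
        for b in systems:
            if a == b:
                continue
            total = wins[(a, b)] + wins[(b, a)]
            if total:  # skip opponents with no decisive comparison (assumption)
                ratios.append(wins[(a, b)] / total)
        scores[a] = sum(ratios) / len(ratios) if ratios else 0.0
    return scores

# Toy judgments (made up for illustration).
judgments = [("A", "B", ">"), ("A", "B", ">"), ("A", "B", "<"),
             ("A", "C", ">"), ("B", "C", "="), ("B", "C", ">")]
scores = expected_wins(judgments)
ranking = sorted(scores, key=scores.get, reverse=True)  # ranked by the score
print(ranking)
```

In this toy data, System A wins 2 of 3 decisive comparisons against B and 1 of 1 against C, so its score is the average of 2/3 and 1.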

Hopkins and May: Overview

Source: The cat sat on the couch.
[Figure: candidate translations of the source with ranks 3, 1, 1; the ranks yield pairwise judgments (>, =) on translation quality]

Hopkins and May: Inference

N(µ, σ²)
µ: relative ability (inferred)
σ²: variance (fixed)

TrueSkill (Herbrich et al. 2006)

Player A  #1  1,000,000
Player B  #2  900,000
Player C  #3  850,000
Player D  #4  800,000
Player E  #5  790,000

N(µ, σ²)
µ: system ability (inferred)
σ²: uncertainty of µ (inferred)

How to update?

Observation: S1 wins S2
Shift µ and reduce σ²

How much to update µ and σ²?

µ̂ = µ ± σ² · Surprisal
σ̂² = σ² · (1 − σ² · Surprisal)
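
As a rough illustration of this update, the sketch below applies the win-case equations from Herbrich et al. (2006) to two systems. It is not taken from the slides: the values of `beta` (performance noise) and `eps` (draw margin) are assumed, the step is scaled by a constant `c` that the simplified slide formulas leave out, and the name `update_win` is hypothetical.

```python
import math

def pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def update_win(winner, loser, beta=0.5, eps=0.0):
    """One update after `winner` beats `loser`.

    winner, loser: (mu, sigma2) tuples. beta and eps are assumed values.
    Returns the updated (mu, sigma2) for the winner and for the loser.
    """
    (mu_w, s2_w), (mu_l, s2_l) = winner, loser
    c = math.sqrt(2.0 * beta ** 2 + s2_w + s2_l)
    t = (mu_w - mu_l) / c
    e = eps / c
    v = pdf(t - e) / cdf(t - e)  # surprisal used for the means
    w = v * (v + t - e)          # surprisal used for the variances
    new_winner = (mu_w + s2_w / c * v, s2_w * (1.0 - s2_w / c ** 2 * w))
    new_loser = (mu_l - s2_l / c * v, s2_l * (1.0 - s2_l / c ** 2 * w))
    return new_winner, new_loser

# Two systems with identical priors: the winner moves up, the loser down,
# and both variances shrink.
s1, s2 = update_win((0.0, 1.0), (0.0, 1.0))
print(s1, s2)
```

An unexpected win (the weaker system beating the stronger one) produces a larger surprisal and therefore a larger shift than an expected win.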

Compute Surprisals (for µ)

[Figure: surprisal as a function of t = µS1 − µS2; panels: "S1 wins S2" and "S1 ties S2"]
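
The shape of these surprisal curves can be tabulated with a short sketch. The functions follow the mean-update terms for a win and for a draw in Herbrich et al. (2006); the draw margin `eps` and the grid of `t` values are assumptions, and `t` is taken to be already scaled as in that paper.

```python
import math

def pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def v_win(t, eps):
    """Surprisal for the mean when S1 wins S2."""
    return pdf(t - eps) / cdf(t - eps)

def v_tie(t, eps):
    """Surprisal for the mean when S1 ties S2 (requires eps > 0)."""
    return (pdf(-eps - t) - pdf(eps - t)) / (cdf(eps - t) - cdf(-eps - t))

eps = 0.5  # assumed draw margin
for t in (-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5):
    print(f"t={t:+.1f}  win: {v_win(t, eps):.3f}  tie: {v_tie(t, eps):+.3f}")
```

For a win, the surprisal grows as t becomes more negative (an upset). For a tie, it is zero when the two means are equal and changes sign with t, pulling the two means toward each other.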

Data Collection in Batch Models

[Figure: heat map over pairs of the 12 systems (both axes: systems 1 to 12; color scale 0.0 to 2.0)]

Uniform data collection (inefficient)

Data Collection in TrueSkill

Match Quality

Player A  #1     1,000,000
Player B  #2     900,000
…
Player T  #1001  300,000
Player S  #1002  299,900
Player U  #1003  299,800

[Figure: example pairings labeled "Good Match" and "Bad Match"]

Match Selection

1.  Select S1 (highest σ²)
2.  Compute match probability p̂draw against each other system (with normalizing), e.g. p̂draw = 0.7 and p̂draw = 0.3
3.  Draw a system S2
4.  Update by observation
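
The selection steps above might look like the following sketch. It uses the draw-probability (match quality) formula from Herbrich et al. (2006); the value of `beta`, the dictionary layout of `systems`, and the function names are assumptions, and the update step itself is left out.

```python
import math
import random

def draw_probability(a, b, beta=0.5):
    """Match quality (chance of a draw) for two (mu, sigma2) systems."""
    (mu_a, s2_a), (mu_b, s2_b) = a, b
    denom = 2.0 * beta ** 2 + s2_a + s2_b
    return (math.sqrt(2.0 * beta ** 2 / denom)
            * math.exp(-((mu_a - mu_b) ** 2) / (2.0 * denom)))

def select_match(systems, rng=random):
    """Pick S1 with the highest variance, then sample S2 by match quality."""
    s1 = max(systems, key=lambda name: systems[name][1])
    others = [name for name in systems if name != s1]
    quality = [draw_probability(systems[s1], systems[o]) for o in others]
    total = sum(quality)
    probs = [q / total for q in quality]  # normalize over candidates
    s2 = rng.choices(others, weights=probs, k=1)[0]
    return s1, s2, dict(zip(others, probs))

# Toy systems as name -> (mu, sigma2); the values are made up.
systems = {"A": (0.0, 1.0), "B": (0.1, 0.5), "C": (2.0, 0.5)}
s1, s2, probs = select_match(systems, rng=random.Random(0))
print(s1, s2, probs)
```

System A is selected first because it is the most uncertain, and B is the more likely opponent because its mean is closer to A's, which makes the match more competitive.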

Experiment: Setting

WMT13 dataset (10 language pairs)

Annotation: researchers

Training size: 400, 800, 1600, 3200, 6400

Test size: 2,000

Evaluation: Perplexity and Accuracy
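
For reference, the two evaluation measures can be computed as in this sketch. It uses the standard definitions (perplexity as 2 raised to the negative mean log2 probability of the gold judgment, accuracy as the share of test items where the most probable outcome matches the gold judgment); whether the authors used exactly these definitions is an assumption, as are the function names and data layout.

```python
import math

def perplexity(predictions, gold):
    """predictions: list of dicts mapping ">", "=", "<" to probabilities."""
    log_sum = sum(math.log2(p[g]) for p, g in zip(predictions, gold))
    return 2.0 ** (-log_sum / len(gold))

def accuracy(predictions, gold):
    """Share of items where the most probable outcome equals the gold one."""
    hits = sum(max(p, key=p.get) == g for p, g in zip(predictions, gold))
    return hits / len(gold)

# Toy predictions (made up for illustration).
preds = [{">": 0.6, "=": 0.3, "<": 0.1}, {">": 0.2, "=": 0.5, "<": 0.3}]
gold = [">", "<"]
print(perplexity(preds, gold), accuracy(preds, gold))
```

A model that always guesses uniformly over the three outcomes has a perplexity of 3 under this definition.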

Probability of pairwise judgment

p(judgment | S1, S2), with judgment ∈ {>, =, <}, computed from the distributions N_S1 and N_S2

The radius r is tuned on a held-out development set.
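
One way to obtain such probabilities is sketched below. It assumes that the difference between the two systems is Gaussian with mean µS1 − µS2 and variance σ²S1 + σ²S2, and that a tie is predicted when the difference falls within ±r. This is an illustrative reading, not necessarily the authors' exact formulation, and the function names are hypothetical.

```python
import math

def cdf(x, mean, var):
    """Cumulative distribution of a Gaussian with the given mean and variance."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def judgment_probs(s1, s2, r):
    """s1, s2: (mu, sigma2) tuples; r: radius of the tie region (r >= 0)."""
    mean = s1[0] - s2[0]
    var = s1[1] + s2[1]
    p_less = cdf(-r, mean, var)          # difference below -r
    p_tie = cdf(r, mean, var) - p_less   # difference within [-r, r]
    return {">": 1.0 - p_less - p_tie, "=": p_tie, "<": p_less}

# Toy systems and radius (made-up values).
probs = judgment_probs((0.5, 1.0), (0.0, 1.0), r=0.3)
print(probs)
```

A larger r moves probability mass from the win and loss outcomes to the tie outcome, which is why it has to be tuned on held-out data.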

Experimental Result: Perplexity

[Figure: perplexity (about 2.85 to 3.00) against training data size (1000 to 6000); curves: HopkinsMay, TrueSkill]

H&M showed lower perplexity.

Experimental Result: Accuracy

[Figure: accuracy (about 0.460 to 0.500) against training data size (1000 to 6000); curves: ExpWins, HopkinsMay, TrueSkill]

TrueSkill > Hopkins&May ≒ ExpectedWins

TrueSkill: higher acc. with small data

Efficient Data Collection

First 20%: Uniformly distributed

Last 20%: Diagonal = Competitive matches

Efficient usage of judgments by TrueSkill

[Figure: match distribution at successive stages of training]

Clustering: Experimental Setting

Bootstrap resampling

Rank range (by 95% confidence band)

Cluster systems (if rank ranges overlap)

Training size and the number of clusters
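
The clustering procedure above can be sketched as follows. The scorer `win_rate` is a toy stand-in for whichever model is being evaluated, and the percentile indexing, the seed, and the rule of merging chains of overlapping ranges into one cluster are assumptions made for illustration.

```python
import random

def win_rate(judgments):
    """Toy scorer (an assumption for this sketch): wins divided by comparisons."""
    wins, games = {}, {}
    for a, b, outcome in judgments:
        for s in (a, b):
            games[s] = games.get(s, 0) + 1
            wins.setdefault(s, 0)
        if outcome == ">":
            wins[a] += 1
        elif outcome == "<":
            wins[b] += 1
    return {s: wins[s] / games[s] for s in games}

def rank_ranges(judgments, score_fn, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap the judgments and return a (low, high) rank range per system."""
    rng = random.Random(seed)
    ranks = {}
    for _ in range(n_boot):
        sample = rng.choices(judgments, k=len(judgments))  # resample with replacement
        scores = score_fn(sample)
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, name in enumerate(ordered, start=1):
            ranks.setdefault(name, []).append(rank)
    ranges = {}
    for name, rs in ranks.items():
        rs.sort()
        lo = rs[int(alpha / 2 * len(rs))]
        hi = rs[min(len(rs) - 1, int((1 - alpha / 2) * len(rs)))]
        ranges[name] = (lo, hi)
    return ranges

def cluster(ranges):
    """Group systems whose rank ranges overlap."""
    clusters, current, current_hi = [], [], 0
    for name, (lo, hi) in sorted(ranges.items(), key=lambda kv: kv[1]):
        if current and lo > current_hi:  # gap in ranks: start a new cluster
            clusters.append(current)
            current, current_hi = [], 0
        current.append(name)
        current_hi = max(current_hi, hi)
    if current:
        clusters.append(current)
    return clusters

# Hand-written rank ranges (made up) to show the grouping step.
example = {"A": (1, 2), "E": (1, 2), "B": (3, 3), "J": (11, 12), "C": (11, 12)}
print(cluster(example))
```

With more training data the rank ranges tend to narrow, so fewer ranges overlap and more clusters are produced.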

Clustering: Result

Efficiency of clustering (more clusters, less variance)

TrueSkill > ExpectedWins > Hopkins&May

Summary

TrueSkill is able to …
1.  rank and cluster with less data
2.  predict pairwise comparisons with
higher accuracy
3.  be learned by online learning

Code is available at
http://github.com/keisks/wmt-trueskill

Future Directions

Sentence-level quality estimation

Parameterizing translation difficulty

cf. Item Response Theory (2PL)

Thanks to Mark Hopkins for his comments

Model Comparison at a glance

              Exp. Win  Hopkins&May      TrueSkill
Learning      Batch     Batch+Iteration  Online
Ties allowed  No        Yes              Yes
Variance      None      Fixed            Learned

Update comparison: TS vs. HM

[Figure: updates over iterations; panels: TrueSkill (Online) and H&M (Batch)]

How to update?

N(µ1, σ1²), N(µ2, σ2²)
[Figure: the two distributions, annotated "Translations"]

Observation: S1 wins S2
Decision radius

Observation: S1 draws S2
Decision radius

Update for each iteration: µ → µ̂

Clustering: Result

[Figure panels: Training with 1K; Training with 25K]

Clusters are obtained with less training data (cf. 80K in WMT13 fr-en)

Same ordering (TS is stable and accurate.)

Accuracies when training with N-way free-for-all models, fixing the number of matches
