1. The document proposes the TrueSkill algorithm as an improvement over existing models for ranking machine translation systems based on pairwise comparisons from human evaluators.
2. TrueSkill is shown to outperform baselines by requiring less training data to achieve accurate rankings while also better predicting pairwise preferences.
3. It functions by modeling systems as distributions that are efficiently updated online during a matching process, unlike batch models, allowing more effective data collection and system clustering from fewer annotations.
WMT14_sakaguchi
1. Efficient Elicitation of Annotations for Human Evaluation of Machine Translation
Keisuke Sakaguchi, Matt Post, Benjamin Van Durme
ACL 2014 NINTH WORKSHOP ON STATISTICAL MACHINE TRANSLATION (WMT)
4. WMT Competition
5-way ranking
Pairwise Comparisons
[Figure: systems grouped into ranked clusters. Rank 1: System A, System E; rank 3: System B; ...; rank 12: System J, System C]
5. Problem
Needs lots of data (94k) to get good clusters
6. Problem
Needs lots of data (94k) to get good clusters
“TrueSkill” lets us do it with less data (1/3) :)
8. Models
1. Expected Wins
2. Hopkins and May
3. TrueSkill
   1. rank and cluster with much less data
   2. predict pairwise comparisons with higher accuracy
   3. be learned by online learning
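The Expected Wins baseline listed above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the usual WMT definition (a system's score is its win ratio against each opponent, averaged over opponents, with ties dropped beforehand), and the function name `expected_wins` and the `(winner, loser)` input format are hypothetical.

```python
from collections import Counter


def expected_wins(judgments):
    """Score each system by its average win ratio over opponents.

    judgments: iterable of (winner, loser) pairs. Ties are assumed to
    have been removed beforehand (an assumption of this sketch).
    """
    wins = Counter()
    systems = set()
    for winner, loser in judgments:
        wins[(winner, loser)] += 1
        systems.update((winner, loser))

    scores = {}
    for s in systems:
        ratios = []
        for o in systems:
            if o == s:
                continue
            total = wins[(s, o)] + wins[(o, s)]
            if total:  # skip opponents this system was never compared with
                ratios.append(wins[(s, o)] / total)
        scores[s] = sum(ratios) / len(ratios) if ratios else 0.0
    return scores
```

Because every judgment is counted once in a single pass over the full data set, this is a batch model with no notion of uncertainty about a system's score.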
23. Data Collection in TrueSkill
Match Quality
[Figure: example pairings labeled "Good Match" and "Bad Match"]
Player    Rank     Score
Player A  #1       1,000,000
Player B  #2         900,000
...
Player T  #1001      300,000
Player S  #1002      299,900
Player U  #1003      299,800
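To make match quality concrete, here is a minimal sketch of the two-player match-quality score from the original TrueSkill formulation, where each player (here, each system) has a Gaussian skill with mean `mu` and standard deviation `sigma`. The function name, the `BETA` default, and the example values are assumptions for illustration and are not taken from the slides.

```python
import math

# Assumed performance-noise parameter (the common TrueSkill default of 25/6).
BETA = 25.0 / 6.0


def match_quality(mu_a, sigma_a, mu_b, sigma_b, beta=BETA):
    """Two-player TrueSkill match quality, in (0, 1].

    Close means and small variances give values near 1 (a competitive
    match); very different means give values near 0.
    """
    denom = 2 * beta**2 + sigma_a**2 + sigma_b**2
    scale = math.sqrt(2 * beta**2 / denom)
    return scale * math.exp(-((mu_a - mu_b) ** 2) / (2 * denom))


# Hypothetical skills: an evenly matched pair versus a lopsided pair.
close_pair = match_quality(25.0, 2.0, 25.5, 2.0)
far_pair = match_quality(40.0, 2.0, 25.0, 2.0)
```

Under these assumptions `close_pair` is larger than `far_pair`, which is the sense in which a comparison between neighboring systems is a good match and one between distant systems is a bad match.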
35. Efficient Data Collection
First 20%: Uniformly distributed
Last 20%: Diagonal = Competitive matches
Efficient usage of judgments by TrueSkill
[Figure: matches between system pairs as training progresses]
36. Clustering: Experimental Setting
Bootstrap resampling
Rank range (by 95% confidence band)
Cluster systems (if rank ranges overlap)
Training size and the number of clusters
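The three steps on this slide (bootstrap resampling, rank ranges from a 95% confidence band, clustering by overlapping ranges) can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: `win_rate` is a hypothetical stand-in scorer (any ranking model could be plugged in as `score_fn`), and all function names and defaults are invented for the example.

```python
import random
from collections import Counter


def win_rate(judgments):
    """Stand-in scorer (assumption): fraction of comparisons a system wins."""
    wins, games = Counter(), Counter()
    for winner, loser in judgments:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {s: wins[s] / games[s] for s in games}


def rank_ranges(judgments, score_fn=win_rate, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap the judgments and return each system's (best, worst) rank
    after trimming alpha/2 of the resampled ranks from each tail."""
    rng = random.Random(seed)
    judgments = list(judgments)
    systems = sorted({s for pair in judgments for s in pair})
    ranks = {s: [] for s in systems}
    for _ in range(n_boot):
        sample = rng.choices(judgments, k=len(judgments))  # with replacement
        scores = score_fn(sample)
        ordered = sorted(systems, key=lambda s: scores.get(s, float("-inf")),
                         reverse=True)
        for rank, s in enumerate(ordered, start=1):
            ranks[s].append(rank)
    lo = int(n_boot * alpha / 2)
    hi = n_boot - 1 - lo
    return {s: (sorted(r)[lo], sorted(r)[hi]) for s, r in ranks.items()}


def cluster(ranges):
    """Merge systems whose rank ranges overlap into the same cluster."""
    clusters, current, current_hi = [], [], None
    for system, (lo, hi) in sorted(ranges.items(), key=lambda kv: kv[1]):
        if current and lo > current_hi:  # gap between ranges: new cluster
            clusters.append(current)
            current, current_hi = [], None
        current.append(system)
        current_hi = hi if current_hi is None else max(current_hi, hi)
    if current:
        clusters.append(current)
    return clusters
```

With more judgments the rank ranges tighten, fewer of them overlap, and the number of clusters grows, which is the relationship between training size and cluster count examined on this slide.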
38. Summary
TrueSkill is able to …
1. rank and cluster with less data
2. predict pairwise comparisons with higher accuracy
3. be learned by online learning
Code is available at
http://github.com/keisks/wmt-trueskill
39. Future Directions
Sentence-level quality estimation
Parameterizing translation difficulty
cf. Item Response Theory (2PL)
Thanks to the comments from Mark Hopkins
42. Model Comparison at a glance
              Exp. Win   Hopkins&May       TrueSkill
Learning      Batch      Batch+Iteration   Online
Ties allowed  No         Yes               Yes
Variance      None       Fixed             Learned
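The "Online" and "Learned" entries for TrueSkill can be illustrated with a minimal sketch of the standard two-player TrueSkill update for a single win/loss judgment. It is a simplification, not the authors' code: ties (which the model does allow) are left out, and the `Rating` class, the default values, and `BETA` are assumptions chosen for the example.

```python
import math
from dataclasses import dataclass

BETA = 25.0 / 6.0  # assumed performance noise, not taken from the slides


@dataclass
class Rating:
    mu: float = 25.0           # assumed prior mean
    sigma: float = 25.0 / 3.0  # assumed prior standard deviation


def _pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def _cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def update(winner, loser, beta=BETA):
    """Return new (winner, loser) ratings after one judgment without ties."""
    c2 = 2 * beta**2 + winner.sigma**2 + loser.sigma**2
    c = math.sqrt(c2)
    t = (winner.mu - loser.mu) / c
    # Guard against a vanishing denominator for extreme upsets.
    v = _pdf(t) / max(_cdf(t), 1e-12)
    w = v * (v + t)
    new_winner = Rating(
        mu=winner.mu + winner.sigma**2 / c * v,
        sigma=winner.sigma * math.sqrt(max(1 - winner.sigma**2 / c2 * w, 1e-12)),
    )
    new_loser = Rating(
        mu=loser.mu - loser.sigma**2 / c * v,
        sigma=loser.sigma * math.sqrt(max(1 - loser.sigma**2 / c2 * w, 1e-12)),
    )
    return new_winner, new_loser
```

Each judgment moves the two means apart and shrinks both standard deviations, so ratings are refined one comparison at a time and the variance is learned rather than absent or fixed.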
51. Clustering: Result
[Figure panels: Training with 1K; Training with 25K]
Clusters are obtained with less training data (cf. 80K in WMT13 fr-en)
Same ordering (TrueSkill is stable and accurate)