Thank you for sharing these examples comparing SNMT and UNMT outputs. While automatic metrics have limitations, MacroF1 appears to better capture differences in semantic adequacy compared to BLEU in this case. Continued analysis of various metrics is important to better evaluate progress in machine translation.
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
Macro average: rare types are important too
1. University of Southern California
Information Sciences Institute
!"#$%&'()$"*)+,
-"$),./0)1,'$),230%$4"54,.%%
!"#$$%&'()*#
!"#$%&'()$"*+,)-",-.*!".()(/(-
0")1-%.)(2*$#*+$/(3-%"*4'5)#$%")'
tg@isi.edu
+%,-,. /(.
6-7(*$#*4$&7/(-%*'"8*!"#$%&'()$"*+,)-",-
0")1-%.)(2*$#*9-"".251'")'
weiqiuy@seas.upenn.edu
0(123#13,1%&4,51(2
:),3($& +,3$$5*$#*4$&7/(-%*+,)-",-
;%'"8-).*0")1-%.)(2
lignos@brandeis.edu
6(1#3"#1&7#8
!"#$%&'()$"*+,)-",-.*!".()(/(-
0")1-%.)(2*$#*+$/(3-%"*4'5)#$%")'
jonmay@isi.edu
!"#$#%&#'()&(*++,-(./.0(
2. University of Southern California
Information Sciences Institute
!
67087"5,9714$7:;47%5
• !"#$%$&'()*+,-(.
"#$%&'$#()#*+,#-&'$.%&/$.0*&1.$2)-3
456&7.8#*-9&4:;<&7=2#-&
3. University of Southern California
Information Sciences Institute
>
!%47("47%5
• !"#$%$&'()*+,-(.
• 7#9",1%&:%#;1,15&3%9"1,-.%2&
2.<<%;&<;($&9:#22&,$=#:#19%
– !"#$%&'()&*%)+,)-'&'./)0&.)7?.'?7@#?A.B
• >#;%&38?%2&#;%&,$?(;3#13
12-)$&(34"'$56(7&##'8)%6(.//9:
• @!"#$%&'%$()&*+&,((-(-
#.&/%0.$(&$%$((+
– 1,%2-3&4$5)62-3&!"#$#%&'()*+,-(.(/-01 23234
– !"#$%&'()*+"*,$-%'!"#$%&'()*+
P(x)
- log2 P(x)
,-%./&(0&".)1&2).34.05$.%&2)(1&,)('0&6()74%8
9:;&"(*.0%<&9=>?&"@7.%&
4. University of Southern California
Information Sciences Institute
C
<()$(7)=
/0 12$%3$+(*45*6&*7"#$%$&'()*)$+$*3.7&8*4$'96:/
;0 <3.+7=7'$+76&*=69*4$'96:/>*
./ !$#(0,'1"2*%'*33(332(%,'
4/ !-5%3,#(*2',*367'0#-338+$%&"*+'9:
?0 @3$%7+$+72(*)7==(9(&'(.*#(+A((&*B3-(927.()*$&)*
C&.3-(927.()*D45>*
!"#$%&''(')*+,'*-'')+./01+$)2+3$4&567
5. University of Southern California
Information Sciences Institute
:
>("?;"475*,!.,"1,",!;?47&#?"11,@?"11787)$
%5,23:"?"5#)A,.)14,B)41
8. University of Southern California
Information Sciences Institute
P
!"#$%&#'(()*#'((%+%,-)).,-+/-0'1&,
OP'+988%D'+;)+G9.('%$%9.%9P'+94'%);%:.*:P:*298%(8955%D'+;)+G9.('5
<= "9(+)N9P'+94'0%2.B':4A3'*%:='=@%'I298%:GD)+39.('%3)%'9(A%3CD'
Macro𝐹) =
∑!∈# +$;!
|-|
:; 3"4&5<$='&$%'>+-'"%?*'2+';%;@+,A+B&'CD')4A>+'CD$E+"(F5&*$)4'+*5+'$4?+*5G')
3𝑖𝑐𝑟𝑜𝐹) =
∑!∈# .! × +$;!
∑
!&∈#
.!&
7()%)8&7)23('&*$%&96"008&w' = 𝑅𝑒𝑓𝑠 𝑐 + 𝑘 *$%&0$4)&&𝑘 ≥ 1
o J#(>$#(𝑘 = 1 ; *4&#K(GI(𝑘 → ∞, F𝑖𝑐𝑟𝑜𝐹! → F)<"4𝐹!
o J#(>$#( 𝛽 = 1 3#($<)?#(1/L/6(0L/:(&4(1/6(0//:6(M>$&(?GN#(O-PQ
9. University of Southern California
Information Sciences Institute
Q
E;14787#"47%5,8%$,!"#$%FG,
"1,"5,!.,)("?;"47%5,3)4$7#
èCompare MacroF1 with
Ø BLEU and ChrF1 – popular options
Ø MicroF1 – so we can see the difference micro and macro avg makes
Ø BLEURT – model-based metric based on BERT
* BLEURT has undesirable biases, shown in paper (not talking about here)
10. University of Southern California
Information Sciences Institute
5;
!"#$%F1 (1,<4H)$1
A:B*CD*6EFEG*G-H.B-.(I*:',%$JC*3'.*-K/'5*H-)L3(*#$%*'55*(27-.
6,+$.?HR#$HS#G&%#7$,+-&.R#$O..8&,%2$.R#%#*7-&'$.%&$H$#&7=2#-9&H'7#$&$.)*G,*S&7.&.*#&.$&70.&G#+,%HO-
6&'CD')4A+H(,$E$)4'
0CD$E+-'"%?*
11. University of Southern California
Information Sciences Institute
55
I):CJK 9"4"&4%&.)L4,>("?;"47%5
;TCC
;T:;
;T!P
;TJP
;T!!
;TJ!
;T>>
;TD5
;TCC
;TP>
;TD5
;TDJ
H/L0
/L0
/LR
/LS
/LT
/LU
J5/-",2*'"8*M%'&&'% +-&'"(),.
𝜏
3G&V(W>8)%(X>'Y#8#%&$(
;NE0 43%JC :',%$JC :),%$JC ;NE0OB&-'" ;NE0OB&-8)'"
MacroF1 is a poor indicator of Fluency + Grammar,
but one of the strong indicators of Semantics
Correlation with Fluency, Grammar, and Semantics on English only
12. University of Southern California
Information Sciences Institute
5!
I!.,!)4$7#1+,I751,0)$,!)4$7#
S
0
R
Z
.
R
.
R
[
. .
R
[ [
S
/
0
.
R
Z
S
[
T
!"#$%&#'( !"#)%&#*( !"#+%&#)(
JG%$
O-PQ O-PQ F)<"4]0 FG<"4]0 ,V"]0
• A)".*P*G/&Q-%*$#*()&-.*'*&-(%),*.,$%-8*3)L3-.(*,$%%-5'()$"*H)(3*3/&'"*R/8L-&-"(.
• S;NE0*).*#%$&*(3-*A:B*&-(%),.*7',T'L-?*7%-,$&7/(-8*Q2*('.T*$%L'")U-%.
• :',%$JC*'"8*:),%$JC*/.-*(3-*.'&-*($T-")U-%*'.*;NE0?*$Q(')"-8*/.)"L*+',%-;NE0
MacroF1 has more wins in the recent year -- when systems are mostly fluent, adequacy is a key discriminator
13. University of Southern California
Information Sciences Institute
5>
9%=514$)"3,@J2-,."1M N@JBB.B,OPOPQ
/L.T
/LS.
/L09
/LZ[
/L[Z
/LZR
/LZ.
/LS.
/L.[
/LZ.
/LZ9
/LRR
/LZ.
/LSZ
/L.U
/LZ/
/LSR
/L.U
/L//
/L./
/LZ/
/L[/
/L9/
-^HP* !7HP* O_HP*
𝜏
3G&V(8+!
O-PQ F)<"4]0 FG<"4]0 ,V"]0 O-PQ`^F#)% O-PQ`^F#'G)%
MacroF1 is the strongest indicator of downstream IR task performance
• IR task with queries and docs in different languages
• Translate source docs to target language, and match queries with docs
• MT metric having strong correlation with IR metric (e.g. mAP) is more useful
14. University of Southern California
Information Sciences Institute
5C
R;"?74"47(),9788)$)5#),S)4=))5,
B;0)$(71)A,"5A,T51;0)$(71)A,C!.
16. University of Southern California
Information Sciences Institute
5D
BC!.,(1,TC!.+,9>&>C
Frequent Types:
UNMT > SNMT Rare Types:
SNMT > UNMT
* Only the top 500 types
are shown here
17. University of Southern California
Information Sciences Institute
5J
BC!.,"5A,TC!.
MacroF1 diff: 0.044, favors SNMT; BLEU diff: -0.00087, favors UNMT
Src "Er ist witzig, er ist sarkastisch, witzig, er liebt es, Leute zu unterhalten, und er ist ein
großartiger Kerl, den er im Teamraum haben kann", erklärte er.
Ref "He's funny, he's sarcastic, witty, likes to poke fun at people, and he's a great guy to have in
the team room," he explained.
SNMT "He is funny, he is sarcastic, funny, he loves to talk people, and he is a great guy he can have
in the team room," he explained
UNMT "He's funny, he's sly, funny, he loves to unterhalten people, and he's a great guy he can have
in the team room," he explained.
Problems SNMT: punctuation, wrong_verb. UNMT: synonym, untranslation
-=4=<0%%"9(+)M< :5 G)+' :.3)8'+9.3%3) 2.3+9.5893:). 3A9. QR-S
18. University of Southern California
Information Sciences Institute
5P
BC!.,"5A,TC!.
MacroF1 diff: 0.044, favors SNMT; BLEU diff: -0.00087, favors UNMT
Src Vor 32 Jahren schloss ich mich als Schüler, wegen der Vernachlässigung der Thatcher-Regierung, Labour an. Diese
Vernachlässigung hatte dazu geführt, dass mein Klassenzimmer buchstäblich zusammengebrochen war.
Infolgedessen habe ich versucht, mich für bessere öffentliche Dienstleistungen für diejenigen einzusetzen, die sie am
meisten brauchen. Egal ob als Gemeinderat oder Minister.
Ref Ever since I joined Labour 32 years ago as a school pupil, provoked by the Thatcher government‘s neglect that had left
my comprehensive school classroom literally falling down, I’ve sought to champion better public services for those
who need them most whether as a local councillor or government minister.
SNMT 32 years ago, I joined Labour as a student because of the neglect of the Thatcher government, which had led to my
classroom literally collapsed, and as a result I tried to promote better public services for those who need it most,
whether as a local council or ministers.
UNMT Last 32 years ago, as a student, because of the disdain for the Thatcher-era government, Labour joined Labour.
Problems SNMT: synonym, word_order UNMT: subject, truncation, word_order
-=4= >% "9(+)M< :5 G)+' :.3)8'+9.3%3) 3+2.(93:). 3A9. QR-S
19. University of Southern California
Information Sciences Institute
5Q
ü 45*(2$%3$+76&*$.*"3%+7'%$..*'%$..7=7(9*6&*7"#$%$&'()*+(.+*.(+
ü <3.+7=7()*4$'96:/*$.*$*%(87+7"$+(*(2$%*"(+97'
ü!"#$%&'())$))*$+&,''-$./012''(+3'-45'6789:;9<=
ü!>?+)&#$(*'&()@,'A#>))'B"+CD(B'"+E>#*(&">+'#$&#"$F(B
üBD45*2.*CD45*E3$%7+$+72(*)7==(9(&'(.
!"##$%&
20. University of Southern California
Information Sciences Institute
!;
References
• Mark Steedman. 2008. On Becoming a Discipline. Computational Linguistics 34, 1 (2008), 137–144. https://doi.org/10.
1162/coli.2008.34.1.137 arXiv: https://doi.org/10.1162/coli.2008.34.1.137
• Gowda, Thamme, and Jonathan May. "Finding the Optimal Vocabulary Size for Neural Machine Translation." Proceedings of the
2020 Conference on Empirical Methods in Natural Language Processing: Findings. 2020.
• Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association
for Computational Linguistics, Berlin, Germany, 1715–1725. https://doi.org/10.18653/v1/P16-1162
• Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine
Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for
Computational Linguistics, Philadelphia, Pennsylvania, USA, 311–318. https://doi.org/10.3115/1073083. 1073135
• Maja Popović. 2015. ChrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on
Statistical Machine Translation. Association for Computational Linguistics, Lisbon, Portugal, 392–395. https:
//doi.org/10.18653/v1/W15-3049
• Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning Robust Metrics for Text Generation. In Proceedings
of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online,
7881–7892. https://www.aclweb.org/anthology/2020.acl-main.704
• Ilya Zavorin, Aric Bills, Cassian Corey, Michelle Morrison, Audrey Tong, and Richard Tong. 2020. Corpora for Cross-Language
Information Retrieval in Six Less-Resourced Languages. In Proceedings of the workshop on Cross-Language Search and
Summarization of Text and Speech (CLSSTS2020). European Language Resources Association, Marseille, France, 7–13.
https://www.aclweb.org/anthology/2020.clssts-1.2
21. University of Southern California
Information Sciences Institute
!5
.U'CV,W<T,🙏
• Fork of SacreBLEU: https://github.com/isi-nlp/sacrebleu
• Pull request is under review: https://github.com/mjpost/sacrebleu/pull/153