1. Human Interface Laboratory
On Measuring Gender Bias in Translation of
Gender-neutral Pronouns
2019. 8. 2 @GeBNLP, ACL Workshop
Won Ik Cho, Ji Won Kim, Seok Min Kim, Nam Soo Kim
2. Contents
• Overview: Gender bias in translation?
About bias – Related work
Problem statement in KR-EN
• Constructing an equity evaluation corpus (EEC)
Content-related features
Style-related features
• Measure, Experiment, Analysis
Appropriateness of measure and sentence sets
Quantitative/qualitative analysis
• Discussion
• Done and afterward
3. Overview: Gender bias in translation?
• Gender bias: in view of fairness machine learning
What is bias?
How is the bias in computer systems categorized?
• Pre-existing, technical, and emergent [Friedman and Nissenbaum, 1996]
Bias in view of fairness machine learning?
• Problem of individuality and context rather than of
statistics and system [Binns, 2017]
Examples of gender bias in view of fairness machine learning?
• Image semantic role labeling [Zhao et al., 2017]
• Amazon recruiting issue
What is gender bias in
machine translation?
4. Overview: Gender bias in translation?
• Gender bias in automatic translation
Bias
• Bias in computer systems
– Bias in view of fairness machine learning
» Gender bias in view of fairness machine learning
• Translation gender bias (TGB)!
Seems very specific, but unexpectedly frequent
• Cross- and multi-lingual phenomenon
Why should TGB be measured and mitigated?
• Translation affects people across countries, races, religions, etc.
• Regardless of the system performance, the user experience can be poor
• Amplification of the error is highly probable
5. Overview: Gender bias in translation?
• Previous and concurrent studies?
Assessing gender bias in machine translation: a case study with Google
Translate [Prates et al., 2018]
• Investigates 12 languages with a template sentence
– Assumes no context, e.g., s/he is [xx]
• 1019 occupations, 21 adjectives
• Utilizes p-value for multi-lingual assessment (to EN)
Evaluating Gender Bias in Machine Translation [Stanovsky et al., 2019]
• Investigates 8 languages regarding insertion of grammatical gender
– Assumes a situation with a weak context
• Utilizes the difference in performance for male/female regarding F1 score and
pre-/anti-stereotypical gender role assignment in evaluation (from EN)
• Compares various MT systems
6. Overview: Gender bias in translation?
• Target problem?
Translation of gender-neutral pronouns (close to [Prates et al., 2018])
• Gender-neutral pronoun?
One such as ‘single they’
Here, this includes terms that are used interchangeably with the pronouns
Frequently appears in languages like Korean, Japanese, Turkish, ...
• Why Korean?
Less explored language
Displays various sentence styles
Translation services are popular among users
(many companies provide one)
7. Overview: Gender bias in translation?
• Research questions
What should we consider in making up the corpus for measuring
the bias?
How should the measure be defined, beyond just pointing out the
difference in proportion between the two types of cases?
Does the style of the sentence, and not only its content, influence the
biasedness?
8. Constructing an equity evaluation corpus
• Equity evaluation corpus (EEC):
Constructed to tease out biases related to race and gender
Examples (↘) presented in [Kiritchenko and Mohammad, 2018]
Template sentences are used
How can we make such a corpus in the area of translation?
What ethical constraints should be considered?
9. Constructing an equity evaluation corpus
• Template sentence – 걔는 [xx]해/야
kyay-nun [xx]-hay/ya
s/he-TOP [xx]-do/be
S/he does/is [xx]
[xx]: content word
hay/ya: particles (hay follows an adjective, ya follows a noun)
• Three factors considered
Formality of the gender-neutral pronoun
• kyay (the kid/child; used to indicate someone of the same age or younger)
• ku salam (the person; used in a more formal context)
Politeness of the sentence
• -yo (attached at the end of the sentence to assign politeness)
The sentiment polarity of the content word
• sentiment words (positive, negative)
• occupation words (neutral)
10. Constructing an equity evaluation corpus
• More on sentiment words
Excerpted from the Korean Sentiment Word Dictionary
• Published by Kunsan National University
• Reported to be constructed upon consensus of more than 3 natives
• 124 items for positive, 200 items for negative (root form)
• Single adjective word
– 상냥한 (sang-nyang-han, kind, positive)
• adjective phrase
– 됨됨이가 뛰어난 (toym-toym-i-ka-ttwi-e-nan, be good in manner, positive)
• verb phrase
– 함부로 말하는 (ham-pwu-lo mal-ha-nun, bombard rough words, negative)
• Two questions:
– Do the terms really belong to the positive/negative lexicon categories?
» 3 Korean natives’ unanimous decision
– Does categorizing them into a positive/negative lexicon induce any prejudice?
» Appearance, richness, sexual orientation, disability etc.
11. Constructing an equity evaluation corpus
• More on occupation words
Collected from the official government web site for employment
• A list of 735 occupations was determined by consensus
• Gender specificity had to be concealed
– e.g., 발레리노 (pal-ley-li-no, ballerino), 해녀 (hay-nye, woman diver)
• Occupation titles that show prejudice (respect or hate) toward some groups of
people were checked and had to be excluded
– e.g., 딴따라 (ttan-tta-la, slang for music artists)
Sentiment       #
Positive        124
Negative        200
Occupations     735

- Total: 1,059 content terms
- Formality on/off (x2)
- Politeness on/off (x2)
- 4,236 sentences in total
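The template expansion above (1,059 content terms x 2 formality levels x 2 politeness levels = 4,236 sentences) can be sketched in a few lines. This is a minimal illustration, not the authors' release: the content-word tuples below are tiny stand-ins for the real 124/200/735-item lexicons, and the politeness attachment is simplified (the actual polite copula after a noun is 예요, an irregularity ignored here).

```python
# Sketch of EEC template expansion: pronoun formality x politeness x content word.
from itertools import product

PRONOUNS = {"informal": "걔는", "formal": "그 사람은"}   # kyay-nun / ku salam-un
POLITE_SUFFIX = {"plain": "", "polite": "요"}            # -yo marks politeness

def build_eec(content_words):
    """Expand every (formality, politeness, content word) combination into a
    template sentence '<pronoun> [xx]해/야(요)'. content_words holds tuples of
    (word, particle, polarity); 해 follows adjectives, 야 follows nouns."""
    sentences = []
    for (form, pron), (pol, suf), (word, particle, polarity) in product(
            PRONOUNS.items(), POLITE_SUFFIX.items(), content_words):
        sentences.append({
            "text": f"{pron} {word}{particle}{suf}",
            "formality": form, "politeness": pol, "polarity": polarity,
        })
    return sentences

# toy stand-ins for the 1,059 real content terms
words = [("상냥", "해", "positive"), ("의사", "야", "occupation")]
eec = build_eec(words)
assert len(eec) == len(words) * 4   # 1,059 words would give 4,236 sentences
```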
12. Measure, Experiment, Analysis
• Measure
p_w, p_m, p_n for a sentence set S
• The ratio of the sentences in S whose translation incorporates a pronoun
related to female (she, her, woman, girl, etc.), male (he, him, man, boy, guy, etc.),
or neither, respectively
Define P_i = √(p_w · p_m) + p_n for a sentence set S_i
Let P = AVG(P_i) over all the sentence sets
- Translation gender bias index (TGBI)!
Question:
• Is the measure appropriately defined?
• Does the measure really display how the model is biased?
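The measure can be sketched in code: classify each translated sentence as feminine, masculine, or neither by keyword matching, compute P_i = √(p_w · p_m) + p_n per sentence set, and take the arithmetic average. A minimal sketch only: the keyword lists follow the slide's examples, a sentence mentioning both genders is counted as feminine here, and a real evaluation may need lemmatization or coreference handling.

```python
import math
import re

FEMALE = {"she", "her", "hers", "woman", "girl"}
MALE = {"he", "him", "his", "man", "boy", "guy"}

def classify(translation):
    """Tag an English translation as feminine ('w'), masculine ('m'), or neutral ('n')."""
    tokens = set(re.findall(r"[a-z']+", translation.lower()))
    if tokens & FEMALE:
        return "w"
    if tokens & MALE:
        return "m"
    return "n"

def tgbi(sentence_sets):
    """Average P_i = sqrt(p_w * p_m) + p_n over the sentence sets,
    where each set is a list of English translations of one EEC subset."""
    scores = []
    for translations in sentence_sets:
        counts = {"w": 0, "m": 0, "n": 0}
        for t in translations:
            counts[classify(t)] += 1
        total = len(translations)
        p_w, p_m, p_n = counts["w"] / total, counts["m"] / total, counts["n"] / total
        scores.append(math.sqrt(p_w * p_m) + p_n)
    return sum(scores) / len(scores)
```

A set whose translations all preserve neutrality scores 1.0; a set translated entirely with one gender scores 0.0.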
13. Measure, Experiment, Analysis
• Measure
The appropriateness of measure P_i = √(p_w · p_m) + p_n
• Boundedness
– Given 0 ≤ p_w, p_m, p_n ≤ 1 and p_w + p_m + p_n = 1,
the measure is between 0 and 1
– Can be utilized in analyses with multiple sentence sets
• Optima
– 1 when p_n = 1
» Encourages the preservation of gender-neutrality
– 0 when either p_w or p_m = 1
» Discourages the bias caused by the volume in the corpus
• Considering that most MT systems for KR-EN rarely utilize gender-neutral
expressions, the SQRT function alleviates the penalty of using gender-specific terms
– But still encourages the preservation of gender-neutrality
» e.g., (p_w, p_m, p_n) = (0.3, 0.3, 0.4) yields 0.7 while (0.4, 0.4, 0.2) yields 0.6
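The boundedness claim on this slide can be written out; a short sketch via the AM-GM inequality under the constraint p_w + p_m + p_n = 1:

```latex
% Upper bound of P_i = \sqrt{p_w p_m} + p_n via AM--GM:
\begin{align*}
P_i = \sqrt{p_w p_m} + p_n
    &\le \frac{p_w + p_m}{2} + p_n  && \text{(AM--GM)}\\
    &=   \frac{1 - p_n}{2} + p_n
     =   \frac{1 + p_n}{2} \le 1,
\end{align*}
% with equality iff $p_n = 1$. The lower bound $P_i \ge 0$ is immediate, and
% $P_i = 0$ forces $p_n = 0$ and $p_w p_m = 0$, i.e. $p_w = 1$ or $p_m = 1$.
```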
14. Measure, Experiment, Analysis
• Measure
Does the measure really display how the model is biased?
• Bias caused by the volume imbalance in the corpus (VBias)
– At least so far, male dominancy is shown in various types of articles
» e.g., in description / while posing an example / especially in formal style articles
which are frequently utilized in the training phase ...
• Bias caused by the social prejudice (SBias)
– Relating or assuming a specific gender for specific content terms (in talk, in novels...),
making a hasty guess, etc.
If the target language incorporates gender-neutral pronouns (e.g.,
Japanese), the neutrality is usually preserved. But if not...
• p_n might not have a role in some cases, although the measure still shows the
biasedness
• For a further investigation, should consider if the target language frequently
utilizes gender-neutral expressions
15. Measure, Experiment, Analysis
• Experiment
Seven sentence sets, Three translation services in-use
For each row: P_s (p_w, p_n)
Average: AVG(P_i) for sentence sets (a-g)
Total unbiasedness: GT > NP > KT
But does the high score really mean unbiasedness?
16. Measure, Experiment, Analysis
• Quantitative analysis
VBias seems to be influential, as shown by p_m dominating the others
Content-related features
• Regards sentiment polarity and occupations
• For sentiment polarity, a slight difference: unbiasedness ranks
positive > occupation > negative lexicons
• Overall, male dominancy is shown
Style-related features
• Regards formality and politeness of the expressions
• For politeness, little difference shown between on/off
• For formality, very high male dominancy observed in formal style sentences
– A comment: because more male authors are engaged in formal writing and they
assume a male subject?
– Supported by [Argamon et al., 2003; Qian, 2019] and statistics
(https://www.bls.gov/cps/cpsaat11.htm)
17. Measure, Experiment, Analysis
• Qualitative analysis and inter-system comparison
In some cases, VBias is attenuated when SBias comes into play
• E.g., for the `informal’ case, GT and NP show less male dominancy than KT
– KT, in some sense, assumes a default male, while GT and NP show diversity
– Does it mean that GT and NP are less biased?
Should check `in which way’ the increase in the measure took place
• For GT, stereotypical gender role assignment was mainly observed, which seems
to have lowered the male dominancy
• For NP, anti-stereotypical gender role assignment was frequently observed,
presumably implemented intentionally by the developing team
Our measure shows the tendency, but does not fully show whether social bias
is engaged (and attenuates the volume bias)
• Can be augmented with human evaluation or an automated system that checks
stereotypical gender role assignment, as in [Stanovsky et al., 2019]
18. Discussion
• The target of the measure
Not to arrange the translators in order of biasedness,
also not to tell that the half-half guess is the best,
but to claim that the hasty guess on gender should be avoided
• Recent progress
Google Translate lets users choose between genders [Kuczmarski and
Johnson, 2018] and tries to recognize the context
• More to be advanced
Mitigation can be performed by recognizing the presence of context
and preventing a hasty guess
• Language specificity of the scheme
Most of the studies are targeting [xx]-EN or EN-[xx] translation
• The EEC and measure can be further developed in a multilingual manner that
considers various language families as target languages as well
19. Done and afterward
• Done
Construction of a corpus with template sentences that can check the
preservation of gender-neutrality in KR-EN translation (along with a
detailed guideline)
A measure to evaluate and compare the performance of translation
systems regarding the preservation of gender neutrality of pronouns
Rigorous contemplation on why the preservation of gender neutrality has
to be guaranteed in translation
• Afterward?
Constructing the corpus/measure from a multilingual point of view
Investigating the effect of context (coreference resolution)
Context-sensitive post-processing on the translation result
20. Reference (order of appearance)
• Friedman, Batya, and Helen Nissenbaum. "Bias in computer systems." ACM Transactions on
Information Systems (TOIS) 14.3 (1996): 330-347.
• Binns, Reuben. "Fairness in machine learning: Lessons from political philosophy." arXiv preprint
arXiv:1712.03586 (2017).
• Zhao, Jieyu, et al. "Men also like shopping: Reducing gender bias amplification using corpus-
level constraints." arXiv preprint arXiv:1707.09457 (2017).
• Prates, Marcelo OR, Pedro H. Avelar, and Luís C. Lamb. "Assessing gender bias in machine
translation: a case study with Google Translate." Neural Computing and Applications (2018): 1-
19.
• Stanovsky, Gabriel, Noah A. Smith, and Luke Zettlemoyer. "Evaluating Gender Bias in Machine
Translation." arXiv preprint arXiv:1906.00591 (2019).
• Kiritchenko, Svetlana, and Saif M. Mohammad. "Examining gender and race bias in two hundred
sentiment analysis systems." arXiv preprint arXiv:1805.04508 (2018).
• Argamon, Shlomo, et al. "Gender, genre, and writing style in formal written texts." Text 23.3
(2003): 321-346.
• Qian, Yusu. "Gender Stereotypes Differ between Male and Female Writings." Proceedings of the
57th Conference of the Association for Computational Linguistics: Student Research Workshop.
(2019)
• Kuczmarski, James, and Melvin Johnson. "Gender-aware natural language translation." (2018).
overview:
gender bias in NLP – various problems
translation: real-world problem - example e.g. Turkish, Korean..?
How is it treated in previous works?
Why should it be guaranteed?
problem statement: with KR-EN example
why not investigated in previous works?
why appropriate for investigating gender bias?
what examples are observed?
construction:
what are to be considered?
formality (걔 kyay vs 그 사람 ku salam)
politeness (-어 -e vs -어요 -eyo)
lexicon sentiment polarity (positive & negative & occupation)
+ things to be considered in... (not to threaten the fairness)
- Measure?
how the measure is defined, and proved to be bounded (and have optimum when the condition fits with the ideal case)
concept of Vbias and Sbias – how they are aggregated into the measure << disadvantage?
how the usage is justified despite disadvantages
the strong points?
- Experiment?
how the EEC is used in evaluation, and how the arithmetic averaging is justified
the result: GT > NP > KT?
- Analysis?
quantitative analysis – Vbias and Sbias, significant with style-related features
qualitative analysis – observed with the case of occupation words
Done: TGBI for KR-EN, with an EEC
Afterward: how Sbias can be considered more explicitly? what if among context? how about with other target/source language?