1. Human Interface Laboratory
On Measuring Gender Bias in Translation of
Gender-neutral Pronouns
2019. 8. 2 @GeBNLP, ACL Workshop
Won Ik Cho, Ji Won Kim, Seok Min Kim, Nam Soo Kim
2. Contents
• Overview: Gender bias in translation?
About bias – Related work
Problem statement in KR-EN
• Constructing an equity evaluation corpus (EEC)
Content-related features
Style-related features
• Measure, Experiment, Analysis
Appropriateness of measure and sentence sets
Quantitative/qualitative analysis
• Discussion
• Done and afterward
3. Overview: Gender bias in translation?
• Gender bias: in view of fairness machine learning
What is bias?
How is the bias in computer systems categorized?
• Pre-existing, technical, and emergent [Friedman and Nissenbaum, 1996]
Bias in view of fairness machine learning?
• Problem of individuality and context rather than of
statistics and system [Binns, 2017]
Examples of gender bias in view of fairness machine learning?
• Image semantic role labeling [Zhao et al., 2017]
• Amazon recruiting issue
What is gender bias in
machine translation?
4. Overview: Gender bias in translation?
• Gender bias in automatic translation
Bias
• Bias in computer systems
– Bias in view of fairness machine learning
» Gender bias in view of fairness machine learning
• Translation gender bias (TGB)!
Seems very specific, but unexpectedly frequent
• Cross- and multi-lingual phenomenon
Why should TGB be measured and mitigated?
• Translation affects people across countries, races, religions, etc.
• Regardless of the system performance, the user experience can be poor
• Amplification of the error is highly probable
5. Overview: Gender bias in translation?
• Previous and concurrent studies?
Assessing gender bias in machine translation: a case study with Google
Translate [Prates et al., 2018]
• Investigates 12 languages with a template sentence
– Assumes no context, e.g., s/he is [xx]
• 1019 occupations, 21 adjectives
• Utilizes p-value for multi-lingual assessment (to EN)
Evaluating Gender Bias in Machine Translation [Stanovsky et al., 2019]
• Investigates 8 languages regarding insertion of grammatical gender
– Assumes a situation with a weak context
• Utilizes the difference in performance for male/female regarding F1 score and
pre-/anti-stereotypical gender role assignment in evaluation (from EN)
• Compares various MT systems
6. Overview: Gender bias in translation?
• Target problem?
Translation of gender-neutral pronouns (close to [Prates et al., 2018])
• Gender-neutral pronoun?
One such as ‘single they’
Here, this includes terms that are used interchangeably with the pronouns
Frequently appears in languages like Korean, Japanese, Turkish, ...
• Why Korean?
Less explored language
Displays various sentence styles
Translation services are popular among users
(many companies provide one)
7. Overview: Gender bias in translation?
• Research questions
What should we consider in making up the corpus for measuring
the bias?
How should the measure be defined, beyond just pointing out the
difference in proportion between the two types of cases?
Does the style of the sentence, and not only its content, influence the
biasedness?
8. Constructing an equity evaluation corpus
• Equity evaluation corpus (EEC):
Constructed to tease out biases related to race and gender
Examples (↘) presented in [Kiritchenko and Mohammad, 2018]
Template sentences are used
How can we make such a corpus in the area of translation?
What ethical constraints should be considered?
9. Constructing an equity evaluation corpus
• Template sentence – 걔는 [xx]해/야
kyay-nun [xx]-hay/ya
s/he-TOP [xx]-do/be
S/he does/is [xx]
[xx]: content word
hay/ya: particles (hay follows an adjective, ya follows a noun)
• Three factors considered
Formality of the gender-neutral pronoun
• kyay (the kid/child; used to indicate someone of the same age or younger)
• ku salam (the person; used in a more formal context)
Politeness of the sentence
• -yo (attached at the end of the sentence to assign politeness)
The sentiment polarity of the content word
• sentiment words (positive, negative)
• occupation words (neutral)
10. Constructing an equity evaluation corpus
• More on sentiment words
Excerpted from the Korean Sentiment Word Dictionary
• Published by Kunsan National University
• Reported to be constructed upon consensus of more than 3 natives
• 124 items for positive, 200 items for negative (root form)
• Single adjective word
– 상냥한 (sang-nyang-han, kind, positive)
• adjective phrase
– 됨됨이가 뛰어난 (toym-toym-i-ka-ttwi-e-nan, be good in manner, positive)
• verb phrase
– 함부로 말하는 (ham-pwu-lo mal-ha-nun, bombard rough words, negative)
• Two questions:
– Do the terms really belong to the positive/negative lexicon categories?
» 3 Korean natives’ unanimous decision
– Does categorizing them into a positive/negative lexicon induce any prejudice?
» Appearance, richness, sexual orientation, disability etc.
11. Constructing an equity evaluation corpus
• More on occupation words
Collected from the official government web site for employment
• A list of 735 occupations was determined by consensus
• Gender specificity had to be concealed
– e.g., 발레리노 (pal-ley-li-no, ballerino), 해녀 (hay-nye, woman diver)
• Occupation titles that show prejudice (respect or hate) toward some groups of
people were checked and had to be excluded
– e.g., 딴따라 (ttan-tta-la, slang for music artists)
Sentiment       #
Positive        124
Negative        200
Occupations     735

- Total: 1,059 content terms
- Formality on/off (x2)
- Politeness on/off (x2)
- 4,236 sentences in total
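The template expansion above (1,059 content terms x 2 formality levels x 2 politeness levels = 4,236 sentences) can be sketched in a few lines. This is a minimal illustration, not the authors' release: the content-word tuples below are tiny stand-ins for the real 124/200/735-item lexicons, and the politeness attachment is simplified (the actual polite copula after a noun is 예요, an irregularity ignored here).

```python
# Sketch of EEC template expansion: pronoun formality x politeness x content word.
from itertools import product

PRONOUNS = {"informal": "걔는", "formal": "그 사람은"}   # kyay-nun / ku salam-un
POLITE_SUFFIX = {"plain": "", "polite": "요"}            # -yo marks politeness

def build_eec(content_words):
    """Expand every (formality, politeness, content word) combination into a
    template sentence '<pronoun> [xx]해/야(요)'. content_words holds tuples of
    (word, particle, polarity); 해 follows adjectives, 야 follows nouns."""
    sentences = []
    for (form, pron), (pol, suf), (word, particle, polarity) in product(
            PRONOUNS.items(), POLITE_SUFFIX.items(), content_words):
        sentences.append({
            "text": f"{pron} {word}{particle}{suf}",
            "formality": form, "politeness": pol, "polarity": polarity,
        })
    return sentences

# toy stand-ins for the 1,059 real content terms
words = [("상냥", "해", "positive"), ("의사", "야", "occupation")]
eec = build_eec(words)
assert len(eec) == len(words) * 4   # 1,059 words would give 4,236 sentences
```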
12. Measure, Experiment, Analysis
• Measure
p_w, p_m, p_n for a sentence set S
• The ratio of the sentences in S whose translation incorporates a pronoun
related to female (she, her, woman, girl, etc.), male (he, him, man, boy, guy, etc.),
or neither, respectively
Define P_i = √(p_w · p_m) + p_n for a sentence set S_i
Let P = AVG(P_i) over all the sentence sets
- Translation gender bias index (TGBI)!
Question:
• Is the measure appropriately defined?
• Does the measure really display how the model is biased?
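The measure can be sketched in code: classify each translated sentence as feminine, masculine, or neither by keyword matching, compute P_i = √(p_w · p_m) + p_n per sentence set, and take the arithmetic average. A minimal sketch only: the keyword lists follow the slide's examples, a sentence mentioning both genders is counted as feminine here, and a real evaluation may need lemmatization or coreference handling.

```python
import math
import re

FEMALE = {"she", "her", "hers", "woman", "girl"}
MALE = {"he", "him", "his", "man", "boy", "guy"}

def classify(translation):
    """Tag an English translation as feminine ('w'), masculine ('m'), or neutral ('n')."""
    tokens = set(re.findall(r"[a-z']+", translation.lower()))
    if tokens & FEMALE:
        return "w"
    if tokens & MALE:
        return "m"
    return "n"

def tgbi(sentence_sets):
    """Average P_i = sqrt(p_w * p_m) + p_n over the sentence sets,
    where each set is a list of English translations of one EEC subset."""
    scores = []
    for translations in sentence_sets:
        counts = {"w": 0, "m": 0, "n": 0}
        for t in translations:
            counts[classify(t)] += 1
        total = len(translations)
        p_w, p_m, p_n = counts["w"] / total, counts["m"] / total, counts["n"] / total
        scores.append(math.sqrt(p_w * p_m) + p_n)
    return sum(scores) / len(scores)
```

A set whose translations all preserve neutrality scores 1.0; a set translated entirely with one gender scores 0.0.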
13. Measure, Experiment, Analysis
• Measure
The appropriateness of measure P_i = √(p_w · p_m) + p_n
• Boundedness
– Given 0 ≤ p_w, p_m, p_n ≤ 1 and p_w + p_m + p_n = 1,
the measure is between 0 and 1
– Can be utilized in analyses with multiple sentence sets
• Optima
– 1 when p_n = 1
» Encourages the preservation of gender-neutrality
– 0 when either p_w or p_m = 1
» Discourages the bias caused by the volume in the corpus
• Considering that most MT systems for KR-EN rarely utilize gender-neutral
expressions, the SQRT function alleviates the penalty of using gender-specific terms
– But still encourages the preservation of gender-neutrality
» e.g., (p_w, p_m, p_n) = (0.3, 0.3, 0.4) yields 0.7 while (0.4, 0.4, 0.2) yields 0.6
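The boundedness claim on this slide can be written out; a short sketch via the AM-GM inequality under the constraint p_w + p_m + p_n = 1:

```latex
% Upper bound of P_i = \sqrt{p_w p_m} + p_n via AM--GM:
\begin{align*}
P_i = \sqrt{p_w p_m} + p_n
    &\le \frac{p_w + p_m}{2} + p_n  && \text{(AM--GM)}\\
    &=   \frac{1 - p_n}{2} + p_n
     =   \frac{1 + p_n}{2} \le 1,
\end{align*}
% with equality iff $p_n = 1$. The lower bound $P_i \ge 0$ is immediate, and
% $P_i = 0$ forces $p_n = 0$ and $p_w p_m = 0$, i.e. $p_w = 1$ or $p_m = 1$.
```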
14. Measure, Experiment, Analysis
• Measure
Does the measure really display how the model is biased?
• Bias caused by the volume imbalance in the corpus (VBias)
– At least so far, male dominancy is shown in various types of articles
» e.g., in description / while posing an example / especially in formal style articles
which are frequently utilized in the training phase ...
• Bias caused by the social prejudice (SBias)
– Relating or assuming a specific gender for specific content terms (in talk, in novels...),
making a hasty guess, etc.
If the target language incorporates gender-neutral pronouns (e.g.,
Japanese), the neutrality is usually preserved. But if not...
• p_n might not have a role in some cases, although the measure still shows the
biasedness
• For a further investigation, should consider if the target language frequently
utilizes gender-neutral expressions
15. Measure, Experiment, Analysis
• Experiment
Seven sentence sets, Three translation services in-use
For each row: P_s (p_w, p_n)
Average: AVG(P_i) for sentence sets (a-g)
Total unbiasedness: GT > NP > KT
But does the high score really mean unbiasedness?
16. Measure, Experiment, Analysis
• Quantitative analysis
VBias seems to be influential, as shown by p_m dominating the others
Content-related features
• Regards sentiment polarity and occupations
• For sentiment polarity, a slight difference: unbiasedness ranks
positive > occupation > negative lexicons
• Overall, male dominancy is shown
Style-related features
• Regards formality and politeness of the expressions
• For politeness, little difference shown between on/off
• For formality, very high male dominancy observed in formal style sentences
– A comment: because more male authors are engaged in formal writing and they
assume a male subject?
– Supported by [Argamon et al., 2003; Qian, 2019] and statistics
(https://www.bls.gov/cps/cpsaat11.htm)
17. Measure, Experiment, Analysis
• Qualitative analysis and inter-system comparison
In some cases, VBias is attenuated when SBias comes into play
• E.g., for the `informal’ case, GT and NP show less male dominancy than KT
– KT, in some sense, assumes a default male, while GT and NP show diversity
– Does it mean that GT and NP are less biased?
Should check `in which way’ the increase in the measure took place
• For GT, stereotypical gender role assignment was mainly observed, which seems
to have lowered the male dominancy
• For NP, anti-stereotypical gender role assignment was frequently observed,
presumably implemented intentionally by the developing team
Our measure shows the tendency, but does not fully show whether social bias
is engaged (and attenuates the volume bias)
• Can be augmented with human evaluation or an automated system that checks
stereotypical gender role assignment, as in [Stanovsky et al., 2019]
18. Discussion
• The target of the measure
Not to arrange the translators in order of biasedness,
also not to tell that the half-half guess is the best,
but to claim that the hasty guess on gender should be avoided
• Recent progress
Google Translate lets users choose between genders [Kuczmarski and
Johnson, 2018] and tries to recognize the context
• More to be advanced
Mitigation can be performed by recognizing the presence of context
and preventing a hasty guess
• Language specificity of the scheme
Most of the studies are targeting [xx]-EN or EN-[xx] translation
• The EEC and measure can be further developed in a multilingual manner that
considers various language families as target languages as well
19. Done and afterward
• Done
Construction of a corpus with template sentences that can check the
preservation of gender-neutrality in KR-EN translation (along with a
detailed guideline)
A measure to evaluate and compare the performance of translation
systems regarding the preservation of gender neutrality of pronouns
Rigorous contemplation on why the preservation of gender neutrality has
to be guaranteed in translation
• Afterward?
Constructing the corpus/measure from a multilingual point of view
Investigating the effect of context (coreference resolution)
Context-sensitive post-processing on the translation result
20. Reference (order of appearance)
• Friedman, Batya, and Helen Nissenbaum. "Bias in computer systems." ACM Transactions on
Information Systems (TOIS) 14.3 (1996): 330-347.
• Binns, Reuben. "Fairness in machine learning: Lessons from political philosophy." arXiv preprint
arXiv:1712.03586 (2017).
• Zhao, Jieyu, et al. "Men also like shopping: Reducing gender bias amplification using corpus-
level constraints." arXiv preprint arXiv:1707.09457 (2017).
• Prates, Marcelo OR, Pedro H. Avelar, and Luís C. Lamb. "Assessing gender bias in machine
translation: a case study with Google Translate." Neural Computing and Applications (2018): 1-
19.
• Stanovsky, Gabriel, Noah A. Smith, and Luke Zettlemoyer. "Evaluating Gender Bias in Machine
Translation." arXiv preprint arXiv:1906.00591 (2019).
• Kiritchenko, Svetlana, and Saif M. Mohammad. "Examining gender and race bias in two hundred
sentiment analysis systems." arXiv preprint arXiv:1805.04508 (2018).
• Argamon, Shlomo, et al. "Gender, genre, and writing style in formal written texts." Text 23.3
(2003): 321-346.
• Qian, Yusu. "Gender Stereotypes Differ between Male and Female Writings." Proceedings of the
57th Conference of the Association for Computational Linguistics: Student Research Workshop.
(2019)
• Kuczmarski, James, and Melvin Johnson. "Gender-aware natural language translation." (2018).
overview:
gender bias in NLP – various problems
translation: real-world problem - example e.g. Turkish, Korean..?
How is it treated in previous works?
Why should it be guaranteed?
problem statement: with KR-EN example
why not investigated in previous works?
why appropriate for investigating gender bias?
what examples are observed?
construction:
what are to be considered?
formality (걔 kyay vs 그 사람 ku salam)
politeness (-어 -e vs -어요 -eyo)
lexicon sentiment polarity (positive & negative & occupation)
+ things to be considered in... (not to threaten the fairness)
- Measure?
how the measure is defined, and proved to be bounded (and have optimum when the condition fits with the ideal case)
concept of Vbias and Sbias – how they are aggregated into the measure << disadvantage?
how the usage is justified despite disadvantages
the strong points?
- Experiment?
how the EEC is used in evaluation, and how the arithmetic averaging is justified
the result: GT > NP > KT?
- Analysis?
quantitative analysis – Vbias and Sbias, significant with style-related features
qualitative analysis – observed with the case of occupation words
Done: TGBI for KR-EN, with an EEC
Afterward: how Sbias can be considered more explicitly? what if among context? how about with other target/source language?