SCAFFOLD STEP #4: DIVERSITY PERSPECTIVES WORKSHEET
My full name is Marcell Tywa'n Scott
1 July 2017
1. What is your faculty-approved global issue/problem? - My faculty-approved global issue/problem is racism.
2. Explain how you narrowed your focus to examine some aspect of that issue that affects disenfranchised and underrepresented groups. - I narrowed my focus to represent the different ethnic and cultural groups around the globe. I wanted to highlight the mistreatment, along with the myths of racism, that affect disenfranchised and underrepresented groups.
3. Draft a working thesis for your Diversity Perspectives paper. - Racism is a worldwide problem, caused by ignorance, that differentiates people by skin color and can be resolved through continuous education.
4. What three to five points will you make to explain the significance of the issue? - What racism is. How racism affects our progress in the world. How racism affects growth in communities and businesses.
5. Identify the competing entities (populations) affected by this issue. Which of them are disadvantaged and underrepresented? - The quality of life for specific groups of people is affected tremendously by this topic. There are too many different races in the world to name them all specifically in this form; Jewish people and Americans of African descent are the two big ones most are aware of.
6. For each of the groups identified, what cultural perspective will you present? - I will present how Hitler attempted to exterminate the Jewish people, and I may discuss how their culture changed as a result of this action. I will discuss the everyday discrimination that Americans of African descent have to endure in America. I will deliver this from both the White American and the African American perspectives.
7. What cultural inequalities are evident? What evidence will you use from your literature review and additional sources? - Evident cultural inequities have been acknowledged throughout history for both discriminated groups. In America, Black people have been subjected to mistreatment from elected officials and public servants (police officers). I will utilize evidence from the sources identified in my first proposal, and I will also utilize social media groups to show the reader different sides of the issues of concern.
8. How will you use Hofstede's Cultural Values Framework to explain the issues involved? -
9. How does in-group favoritism influence the competing populations? - I will attempt to describe this to the reader as racism.
10. How has out-group bias manifested itself among those involved in the issue? - Bias that manifests into racism will be a huge portion of my work. I will examine the bias and preconceived prejudice that some people hold.
11. Which justice theory will you choose to frame your argument and why? - I have yet to commit to a justice theory; I would like to stay focused on a moral high ground for this project.
12. What solutions hav ...
Cognitive Systems Institute Speaker Series talk by Mona Diab from George Washington University on May 14, 2015 "Towards Building Effective Computational Sociopragmatics Models of Human Cognition."
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGE (cscpconf)
The anonymity of social networks makes them attractive to purveyors of hate speech seeking to mask their criminal activities online, posing a challenge to the world and to Ethiopia in particular. With this ever-increasing volume of social media data, hate speech identification becomes a challenge, aggravating conflict between citizens of nations. Because of the high rate of production, such big data has become difficult to collect, store, and analyze using traditional detection methods. This paper proposes the application of Apache Spark to hate speech detection to reduce these challenges. The authors developed an Apache Spark-based model to classify Amharic Facebook posts and comments into hate and not hate. They employed Random Forest and Naïve Bayes for learning, and Word2Vec and TF-IDF for feature selection. Tested by 10-fold cross-validation, the model based on Word2Vec embeddings performed best, with 79.83% accuracy. The proposed method achieves a promising result using Spark's unique big-data features.
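The classification step described in the abstract can be sketched without Spark. The following is a minimal bag-of-words multinomial Naïve Bayes in plain Python on toy English stand-in data; the authors' actual pipeline used Apache Spark with Word2Vec and TF-IDF features, and also evaluated Random Forest under 10-fold cross-validation.

```python
# Minimal multinomial Naive Bayes hate/not-hate classifier, as a sketch of
# the learning step described in the abstract. Toy English tokens stand in
# for Amharic text; the paper's actual pipeline used Apache Spark MLlib
# with Word2Vec / TF-IDF features.
import math
from collections import Counter

class NaiveBayesHateClassifier:
    def fit(self, docs, labels):
        """docs: list of token lists; labels: parallel list of class names."""
        self.classes = sorted(set(labels))
        self.priors = {c: math.log(labels.count(c) / len(labels))
                       for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for doc, lab in zip(docs, labels):
            self.counts[lab].update(doc)
        self.vocab = {t for c in self.classes for t in self.counts[c]}
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, doc):
        """Pick the class maximizing the Laplace-smoothed log-likelihood."""
        v = len(self.vocab)
        def score(c):
            return self.priors[c] + sum(
                math.log((self.counts[c][t] + 1) / (self.totals[c] + v))
                for t in doc)
        return max(self.classes, key=score)

# Tiny invented training set for illustration only.
train_docs = [["hate", "them", "all"], ["stupid", "people", "hate"],
              ["nice", "day", "friends"], ["good", "morning", "all"]]
train_labels = ["hate", "hate", "not_hate", "not_hate"]
clf = NaiveBayesHateClassifier().fit(train_docs, train_labels)
```

Laplace (add-one) smoothing keeps unseen tokens from zeroing out a class likelihood, which matters for the short, noisy comments the paper targets.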
Liking violence: A study of hate speech on Facebook in Sri Lanka (Sanjana Hattotuwa)
Based on a report looking at hate and dangerous speech on Facebook in Sri Lanka - http://www.cpalanka.org/liking-violence-a-study-of-hate-speech-on-facebook-in-sri-lanka/
Susan Windsor - Critical Thinking for Testers - EuroSTAR 2010 (TEST Huddle)
EuroSTAR Software Testing Conference 2010 presentation on Critical Thinking for Testers by Susan Windsor. See more at: http://conference.eurostarsoftwaretesting.com/past-presentations/
This presentation has been used to guide workshops on research and academic writing conventions for upperclassmen and first-year graduate students. However, it could be adapted for a first- and second-year student audience. The content is rich, emphasizing reflection, research/inquiry, as well as grammar. This material also demonstrates how to use new media as part of an overall research strategy. The presentation is designed to be presented interactively with writers across the disciplines, multilingual writers, and any writer unfamiliar with the academic writing process. The content is not linear, as many slides could be clipped and customized for integration into a first-year writing course, or even a session or workshop for graduate student writers of any classification.
The words used and our interpretation of images and statistics are an insight into our perspective or bias – our view of the world. Bias influences our attitudes and behaviours towards other people, places and issues. Our experiences, gender, age, class, religion and values all affect our bias. People who are passionate about an issue will generally be quite overt about their bias. People who want to promote a particular point of view may be less overt and more subtle in their use of words and images.
Global education aims to assist students to recognise bias in written and visual texts, consider different points of view and make judgements about how bias can lead to discrimination and inequality.
Researchers have long known that the words of a text have always contained more information than on the surface. As such, texts have been studied for subtexts and other latent or hidden information. One approach has involved the machine-enabled analysis of human sentiment, usually mapped out on a positive-negative polarity. NVivo 11 Plus (a qualitative research tool released in late 2015) enables the automated sentiment analysis of texts (coded research, formal articles, text corpora, Tweetstream datasets, Facebook wall posts, websites, and other sources) based on four categories: very positive, moderately positive, moderately negative, and very negative. The tool feature compares the target text set against a sentiment dictionary and enables coding at different units of analysis: sentence, paragraph, or cell. Further, the sentiment capability extracts the coded text into respective text sets which may be further analyzed using text frequency counts, text searches, automated theme and sub-theme extractions (topic modeling), and data visualizations.
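As an illustration of the dictionary-comparison idea described above, a sentence-level scorer into the four bands might look like the sketch below. The tiny lexicon and cut-off thresholds are invented for the example; NVivo's actual sentiment dictionary and scoring rules are proprietary and considerably more sophisticated.

```python
# Illustrative lexicon-based sentiment banding. The lexicon and cut-offs
# are invented for this sketch; NVivo's actual sentiment dictionary and
# scoring rules are proprietary and more sophisticated.
LEXICON = {"love": 2, "great": 2, "good": 1, "fine": 1,
           "bad": -1, "poor": -1, "awful": -2, "hate": -2}

def sentiment_band(sentence):
    """Score one unit of analysis (here, a sentence) against the lexicon."""
    score = sum(LEXICON.get(w.strip(".,!?").lower(), 0)
                for w in sentence.split())
    if score >= 2:
        return "very positive"
    if score == 1:
        return "moderately positive"
    if score == -1:
        return "moderately negative"
    if score <= -2:
        return "very negative"
    return "uncoded"  # no sentiment terms matched, or they cancelled out
```

Real tools additionally handle negation, intensifiers, and the choice of coding unit (sentence, paragraph, or cell) mentioned above.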
Learning in the Wild: Coding Reddit for Learning and Practice (Priya Kumar)
This presentation introduces a 'learning in the wild' coding schema, an approach developed to support learning analytics researchers interested in understanding the different types of discourse, exploratory talk, and conversational dialogue happening on social media. The research examines how learner-participants ('Redditors') are leveraging subreddit communities to facilitate self-directed informal learning practices on the social networking site.
Assessing How Users Display Self-Disclosure and Authenticity in Conversation with Human-Like Agents: A Case Study of Luda Lee (presented at AACL-IJCNLP 2022)
20/09/17 DevC Seongnam Opening Event
https://festa.io/events/1158
SLU? BERT? Distillation? What are those… and how do you do them… (feat. PyTorch)
This talk covers the difficulty of directly applying a pretrained language model such as BERT to the SLU task of extracting intent from speech, and how this is resolved through knowledge distillation.
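The distillation objective referred to in the talk can be sketched in plain Python (the talk itself used PyTorch): the student model is trained on a weighted mix of hard-label cross-entropy and the KL divergence to the teacher's temperature-softened distribution, following Hinton et al. (2015). The weighting `alpha` and temperature `T` below are arbitrary illustrative values.

```python
# Plain-Python sketch of the knowledge-distillation loss (Hinton et al.,
# 2015) referenced in the talk; the talk itself used PyTorch. The values
# of alpha and temperature here are arbitrary hyperparameters.
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_idx,
                      temperature=2.0, alpha=0.5):
    """alpha * CE(student, hard label) + (1-alpha) * T^2 * KL(teacher || student)."""
    hard_ce = -math.log(softmax(student_logits)[true_idx])
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student))
    return alpha * hard_ce + (1 - alpha) * temperature ** 2 * kl
```

The `T^2` factor compensates for the gradient shrinkage introduced by the softened softmax, so the soft and hard terms stay comparable in scale.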
“To be integrated is to feel secure, to feel connected.” The views and experi... (AJHSSR Journal)
ABSTRACT: Although a significant amount of literature exists on Morocco's migration policies and their successes and failures since their implementation in 2014, there is limited research on the integration of sub-Saharan African children into schools. This paper is part of a Ph.D. research project that aims to fill this gap. It reports the main findings of a study conducted with migrant children enrolled in two public schools in Rabat, Morocco, exploring how integration is defined by the children themselves and identifying the obstacles that they have encountered thus far. The paper uses an inductive approach and primarily focuses on the relationships of children with their teachers and peers as a key aspect of integration for students with a migration background. The study has led to several crucial findings. It emphasizes the significance of speaking Colloquial Moroccan Arabic (Darija) and being part of a community for effective integration. Moreover, it reveals that the use of Modern Standard Arabic as the language of instruction in schools is a source of frustration for students, indicating the need for language policy reform. The study underlines the importance of considering the children's agency when being integrated into mainstream public schools.
KEYWORDS: migration, education, integration, sub-Saharan African children, public school
Improving Workplace Safety Performance in Malaysian SMEs: The Role of Safety ... (AJHSSR Journal)
ABSTRACT: In the Malaysian context, small and medium enterprises (SMEs) experience a significant burden of workplace accidents. A consensus among scholars attributes a substantial portion of these incidents to human factors, particularly unsafe behaviors. This study, conducted in Malaysia's northern region, specifically targeted Safety and Health/Human Resource professionals within the manufacturing sector of SMEs. We gathered a robust dataset comprising 107 responses through a meticulously designed self-administered questionnaire. Employing advanced partial least squares-structural equation modeling (PLS-SEM) techniques with SmartPLS 3.2.9, we rigorously analyzed the data to scrutinize the intricate relationship between safety behavior and safety performance. The research findings unequivocally underscore the palpable and consequential impact of safety behavior variables, namely safety compliance and safety participation, on improving safety performance indicators such as accidents, injuries, and property damages. These results strongly validate the research hypotheses. Consequently, this study highlights the pivotal significance of cultivating safety behavior among employees, particularly in resource-constrained SME settings, as an essential step toward enhancing workplace safety performance.
KEYWORDS: Safety compliance, safety participation, safety performance, SME
1. Building a Dataset to Measure
Toxicity and Social Bias within Language:
A Low-Resource Perspective
Won Ik Cho (SNU ECE)
2022. 6. 22 @FAccT, Seoul, Korea
2. Introduction
• CHO, Won Ik (조원익)
B.S. in EE/Mathematics (SNU, ’10~’14)
Ph.D. student (SNU ECE, ’14~)
• Academic interests
Built Korean NLP datasets on various
spoken language understanding areas
Currently interested in computational
approaches to:
• Dialogue analysis
• AI for social good
3. Contents
• Introduction
• Hate speech in real and cyber spaces
What is hate speech and why does it matter?
Study on hate speech detection
• In English – Dataset and analysis
• Notable approaches in other languages
• Low-resource perspective: Creating a hate speech corpus from
scratch
Analysis on existing language resources
Hate speech as bias detection and toxicity measurement
Building a guideline for data annotation
Worker pilot, crowdsourcing, and agreement
• Challenges of hate speech corpus construction
• Conclusion
4. Contents
Caution! This presentation may contain content that can be offensive to certain groups of people, such as gender bias, racism, or other unethical content, including multimodal materials
5. Contents
• Handled in this tutorial
How to build up a hate speech detection dataset in a specific setting
(language, text domain, etc.)
How to check the validity of the created hate speech corpus
• Less handled in this tutorial
Comprehensive definition of hate speech and social bias in the literature
Reliability of specific ethical guidelines for hate speech corpus construction
6. Hate speech in real and cyber spaces
• What is hate speech and why does it matter?
Difficulty of defining hate speech
• Political and legal term, and not just a theoretical term
• Has no unified/universal definition accepted to all
• Definition differs upon language, culture, domain, discipline, etc.
Definition given by United Nations
• “Any kind of communication in speech, writing or behaviour, that attacks or
uses pejorative or discriminatory language with reference to a person or a
group on the basis of who they are, in other words, based on their religion,
ethnicity, nationality, race, colour, descent, gender or other identity factor.”
– Not a legal definition
– Broader than the notion of “incitement to discrimination, hostility or violence”
prohibited under international human rights law
https://www.un.org/en/hate-speech/understanding-hate-speech/what-is-hate-speech?
7. Hate speech in real and cyber spaces
• What is hate speech and why does it matter?
Hate speech in cyber spaces
• Definition is deductive, but its detection is inductive
• Hate speech appears online as various expressions, including:
– Offensive language
– Pejorative expressions
– Discriminative words
– Profanity terms
– Insulting ... etc.
• Whether to include specific terms or expressions in the category of `hate speech' is a tricky issue
– What if a pejorative expression or profanity term does not target any group or individual?
– What if (sexual) harassment is considered offensive to readers but not to the target figure?
8. Hate speech in real and cyber spaces
• Discussion on hate speech detection
Studies for English
• Waseem and Hovy (2016)
– Annotates tweets upon around 10 features that make the post offensive
A tweet is offensive if it
1. uses a sexist or racial slur.
2. attacks a minority.
3. seeks to silence a minority.
4. criticizes a minority (without a well founded argument).
5. promotes, but does not directly use, hate speech or violent crime.
6. criticizes a minority and uses a straw man argument.
7. blatantly misrepresents truth or seeks to distort views on a minority with unfounded claims.
8. shows support of problematic hash tags. E.g. “#BanIslam”, “#whoriental”, “#whitegenocide”
9. negatively stereotypes a minority.
10. defends xenophobia or sexism.
11. contains a screen name that is offensive, as per the previous criteria, the tweet is
ambiguous (at best), and the tweet is on a topic that satisfies any of the above criteria
Waseem and Hovy, Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter, 2016.
9. Hate speech in real and cyber spaces
• Discussion on hate speech detection
Studies for English
• Davidson et al. (2017)
– Mentions the discrepancy between the theoretical definition and real world
expressions of hate speech
– Puts `offensive’ expressions in between `hate’ and `non-hate’, to incorporate the
expressions that are in the grey area
– Incorporates profanity terms prevalent in social media, which do not necessarily target a minority but induce offensiveness
Davidson et al., Automated Hate Speech Detection and the Problem of Offensive Language, 2017.
10. Hate speech in real and cyber spaces
• Discussion on hate speech detection
Notable approaches in other languages
• Sanguinetti et al. (2018)
– Investigates hate speech for the posts on Italian immigrants
– Beyond hate speech, tags if the post is offensive, aggressive, intensive, has irony and
sarcasm, shows stereotype
– `Stereotype’ as a factor that can be a clue to discrimination
Sanguinetti et al., An Italian Twitter Corpus of Hate Speech against Immigrants, 2018.
• hate speech: no - yes
• aggressiveness: no - weak – strong
• offensiveness: no - weak - strong
• irony: no - yes
• stereotype: no - yes
• intensity: 0 - 1 - 2 - 3 - 4
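As a sketch, the tag set above can be represented and validated in code as follows. The field names are my own rendering of the slide's tags, not identifiers from the original corpus release.

```python
# Sketch: the Sanguinetti et al. (2018) tag set above as a validated
# annotation record. Field names are my own rendering of the slide's tags.
from dataclasses import dataclass

ALLOWED = {
    "hate_speech": {"no", "yes"},
    "aggressiveness": {"no", "weak", "strong"},
    "offensiveness": {"no", "weak", "strong"},
    "irony": {"no", "yes"},
    "stereotype": {"no", "yes"},
    "intensity": {0, 1, 2, 3, 4},
}

@dataclass
class TweetAnnotation:
    text: str
    hate_speech: str = "no"
    aggressiveness: str = "no"
    offensiveness: str = "no"
    irony: str = "no"
    stereotype: str = "no"
    intensity: int = 0

    def validate(self):
        """Raise ValueError if any field falls outside its allowed scale."""
        for field, allowed in ALLOWED.items():
            if getattr(self, field) not in allowed:
                raise ValueError(f"bad value for {field}: {getattr(self, field)!r}")
        return True
```

Encoding the scales explicitly makes annotation files machine-checkable before inter-annotator agreement is computed.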
11. Hate speech in real and cyber spaces
• Discussion on hate speech detection
Notable approaches in other languages
• Assimakopoulos et al. (2020)
– Motivated by the critical analysis of posts made in reaction to news reports on the
Mediterranean migration crisis and LGBTIQ+ matters in Malta
– Annotates Malta web texts
– Investigates the attitude (positive/negative) of the text, and asks for target if negative,
also asking the way the negativeness is conveyed
1. Does the post communicate a positive, negative or neutral attitude? [Positive / Negative / Neutral]
2. If negative, who does this attitude target? [Individual / Group]
• (a) If it targets an individual, does it do so because of the individual’s affiliation to a group? [Yes / No]
If yes, name the group.
• (b) If it targets a group, name the group.
3. How is the attitude expressed in relation to the target group? Select all that apply.
[ Derogatory term / Generalisation / Insult / Sarcasm (including jokes and trolling) / Stereotyping /
Suggestion / Threat ]
4. If the post involves a suggestion, is it a suggestion that calls for violence against the target group? [Yes / No]
Assimakopoulos et al., Annotating for Hate Speech: The MaNeCo Corpus and Some Input from Critical Discourse Analysis, 2020.
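The branching questionnaire above can be sketched as a function that assembles one annotation record. The argument and key names here are my own, since the original scheme is a questionnaire for human annotators, not code.

```python
# Sketch of the four-question MaNeCo annotation flow as a record builder.
# Names are my own; the original is a questionnaire for human annotators.
def maneco_record(attitude, target=None, group=None, means=(),
                  calls_for_violence=None):
    record = {"attitude": attitude}          # Q1: positive/negative/neutral
    if attitude == "negative":
        record["target"] = target            # Q2: "individual" or "group"
        record["group"] = group              # Q2a/b: named group, if any
        record["means"] = sorted(means)      # Q3: derogatory term, insult, ...
        if "suggestion" in means:            # Q4: only asked for suggestions
            record["calls_for_violence"] = bool(calls_for_violence)
    return record
```

Making the later questions conditional on the earlier answers mirrors the scheme's design: annotators never see Q4 unless Q3 included a suggestion.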
12. Hate speech in real and cyber spaces
• Discussion on hate speech detection
Notable approaches in other languages
• Moon et al. (2020)
– Annotation on Korean celebrity news comments
– Investigate the existence of social bias and the degree of toxicity
» Social bias – Gender-related bias and other biases
» Toxicity – Hate/Offensive/None (following Davidson et al. 2017)
Moon et al., BEEP! Korean Corpus of Online News Comments for Toxic Speech Detection, 2020.
Detecting social bias
• Is there a gender-related bias, either explicit or implicit, in the text?
• Are there any other kinds of bias in the text?
• A comment that does not incorporate the bias
Measuring toxicity
• Is strong hate or insulting towards the article’s target or related
figures, writers of the article or comments, etc. displayed in a
comment?
• Although a comment is not as much hateful or insulting as the
above, does it make the target or the reader feel offended?
• A comment that does not incorporate any hatred or insulting
13. Low-resource perspective
• Creating a hate speech corpus from scratch
ASSUMPTION: There is no manually created hate speech detection corpus
so far for the Korean language (was true before July 2020...)
• Generally, clear motivation is required for hate speech corpus construction
– Why?
» Takes resources (time and money)
» Potential mental harm
» Potential attack towards the researchers
– Nonetheless, it is required in some circumstances
» Detecting offensive language in services
» Severe harm has been displayed publicly
14. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 1: Is there anything available?
Analysis on existing language resources
• Language resources on hate speech detection relate to various other similar datasets (though slightly different in definition and goal)
– Dictionary of profanity terms (e.g., hatebase.org)
– Sarcasm detection dataset
– Sentiment analysis dataset
– Offensive language detection dataset
• Why should we search existing resources?
– To lessen the consumption of time and money
– To make the problem easier by building upon existing datasets
– To confirm what we should aim for by creating a new dataset
15. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 1: Is there anything available?
Analysis on existing language resources
• Dictionary of profanity terms
– e.g., https://github.com/doublems/korean-bad-words
• Sarcasm detection dataset
– e.g., https://github.com/SpellOnYou/korean-sarcasm
• Sentiment analysis dataset
– e.g., https://github.com/e9t/nsmc
The datasets may not completely overlap with hate speech corpus, but at
least they can be a good source of annotation
• Here, one should think of:
– Text style
– Text domain
– Appearing types of toxicity and bias
16. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 1: Is there anything available?
Analysis on existing language resources
• Text style
– Written/spoken/web text?
• Text domain
– News/wiki/tweets/chat/comments?
• Appearing types of toxicity and bias
– Gender-related?
– Politics/religion?
– Region/nationality/ethnicity?
• Appearing amount of toxicity and bias
17. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 1: Is there anything available?
Analysis on existing language resources
• Data collection example (BEEP!)
– Comments from the most popular Korean entertainment news platform
» Jan. 2018 ~ Feb. 2020
» 10,403,368 comments from 23,700 articles
» 1,580 articles acquired by stratified sampling
» Top 20 comments in the order of Wilson score on the downvote for each article
– Filter the duplicates and keep comments having more than a single token and fewer than 100 characters
– 10K comments were selected
• The data sampling process strongly shapes the final distribution of the dataset!
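The "top 20 comments by Wilson score on the downvote" step above can be sketched with the lower bound of the Wilson score interval. The helper name, record fields, and vote counts below are illustrative assumptions, not the BEEP! implementation:

```python
import math

def wilson_lower_bound(pos: int, n: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a proportion pos/n.

    Here pos counts downvotes, since comments are ranked by how
    confidently "downvoted" they are; z = 1.96 gives a 95% interval.
    """
    if n == 0:
        return 0.0
    p = pos / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)

# Toy comments: rank by downvote Wilson score, keep the top 20 per article.
comments = [
    {"text": "c1", "down": 90, "total": 100},  # many votes, mostly down
    {"text": "c2", "down": 9,  "total": 10},   # same ratio, fewer votes
    {"text": "c3", "down": 1,  "total": 50},   # barely downvoted
]
ranked = sorted(comments,
                key=lambda c: wilson_lower_bound(c["down"], c["total"]),
                reverse=True)
top20 = ranked[:20]
```

Ranking by the interval's lower bound rather than the raw downvote ratio keeps comments with very few votes from dominating the sample.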
18. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 2: What should we define first?
Hate speech as bias detection and toxicity measurement
• Local definition of hate speech discussed by Korean sociolinguistics society
– Definition of hate speech
» Expressions that discriminate/hate or incite discrimination/hatred/violence
towards some individual or group of people because they have characteristics
as a social minority
– Types of hate speech
» Discriminative bullying
» Discrimination
» Public insult/threatening
» Inciting hatred
Hong et al., Study on the State and Regulation of Hate Speech, 2016.
19. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 2: What should we define first?
Hate speech as bias detection and toxicity measurement
• Set up criteria
– Analyze 'discriminate/hate or incite discrimination/hatred/violence' as a combination
of 'social bias' and 'toxicity'
– Further discussion required on who counts as a social minority
» 'Gender, age, profession, religion, nationality, skin color, political stance' and all
other factors that comprise one's identity
» Criteria for a social minority vs. who will be acknowledged as a social minority
20. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 2: What should we define first?
Hate speech as bias detection and toxicity measurement
• Set up criteria for bias detection
– 'People with a specific characteristic may behave in some way'
– The judgment falls into one of:
» Gender-related bias
» Other biases
» None
Cho and Moon, How Does the Hate Speech Corpus Concern Sociolinguistic Discussions? A Case Study on Korean Online News Comments, 2021.
21. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 2: What should we define first?
Hate speech as bias detection and toxicity measurement
• Set up criteria for toxicity measurement
– Hate
» Hostility towards a specific group or individual
» Often expressed with profanity terms, but profanity alone does not imply hate
– Insult
» Expressions that can harm the prestige of an individual or group
» Various profanity terms are included
– Offensive expressions
» Do not count as hate or insult, but may offend readers
» Include sarcasm, irony, groundless speculation, and unethical expressions
22. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 2: What should we define first?
Hate speech as bias detection and toxicity measurement
• Set up criteria for toxicity measurement
» Severe hate or insult
» Not hateful but offensive or sarcastic
» None
Cho and Moon, How Does the Hate Speech Corpus Concern Sociolinguistic Discussions? A Case Study on Korean Online News Comments, 2021.
23. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 3: What is required for the annotation?
Building a guideline for data annotation
• Stakeholders
– Researchers
– Moderators (crowdsourcing platform)
– Workers
• How is the guideline used?
– Setting up research direction (for researchers)
– Task understanding (for moderators)
– Data annotation (for workers)
24. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 3: What is required for the annotation?
Building a guideline for data annotation
• A guideline is not built all at once!
– Usual process
» Drafting a guideline based on the source corpus
» Researchers' pilot study & guideline update (N iterations)
» Moderators' and researchers' alignment on the guideline
» Worker recruitment & pilot tagging
» Guideline update with worker feedback (cautions & exceptions)
» Final guideline (for the main annotation)
25. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 3: What is required for the annotation?
Building a guideline for data annotation
• Draft guideline
– Built upon a small portion of the source corpus (a few hundred instances)
– Researchers' intuition is heavily involved
– Concept-based description
» e.g., for 'bias':
'People with a specific characteristic may behave in some way'
(instead of listing all stereotyped expressions)
• Pilot study
– Researchers tag a slightly larger portion of the source corpus (~1K instances)
– Fitting researchers' intuition to the proposed concepts
» e.g., "Does this expression contain bias or toxicity?"
(discussion is important, but don't fight!)
– Update descriptions or add examples
– Labeling, re-labeling, re-re-labeling...
26. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 3: What is required for the annotation?
Building a guideline for data annotation
• Pilot study
– Labeling, re-labeling, re-re-labeling... + Agreement?
– Inter-annotator agreement (IAA)
» Calculating the reliability of annotation
» Cohen’s Kappa for two annotators
» Fleiss’ Kappa for more than two annotators
– Sufficiently high agreement? (> 0.6?)
» Let’s go annotating in the wild!
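Cohen's kappa for two annotators can be computed from scratch as below; the toy labels are illustrative. Fleiss' kappa generalizes the same idea to more than two annotators.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's label distribution.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)

a = ["hate", "none", "offensive", "none", "hate", "none"]
b = ["hate", "none", "none",      "none", "hate", "offensive"]
kappa = cohens_kappa(a, b)  # 5/11, about 0.45: below the 0.6 bar above
```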
Pustejovsky and Stubbs, Natural Language Annotation, 2012.
27. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 4: How is the annotation process conducted and evaluated?
Worker pilot, crowdsourcing, and agreement
• Finding a crowdsourcing platform
– Moderator
» Usually an expert in data creation and management
» Comprehends the task, gives feedback in view of workers
» Helps communication between researchers and workers
» Instructs workers, and sometimes pushes them to meet the timeline
» Manages financial or legal issues
» Lets researchers concentrate on the task itself
– Without moderator?
» Researchers are the moderator!
(Unless there are some automated functions in the platform)
– With moderator?
» The closest partner of researchers
28. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 4: How is the annotation process conducted and evaluated?
Worker pilot, crowdsourcing, and agreement
• Finding a crowdsourcing platform
– Existence and experience of the moderator
» Experience of similar dataset construction
» Comprehension of the task & proper feedback
» Sufficient worker pool
» Trust between the moderator and workers
– Reasonable cost estimation
» Appropriateness of price per tagging or reviewing
» Appropriateness of worker compensation
» Fit with the budget
29. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 4: How is the annotation process conducted and evaluated?
Worker pilot, crowdsourcing, and agreement
• Finding a crowdsourcing platform
– Usefulness of the platform UI
» Progress status (In progress, Submitted, Waiting for reviewing... etc.)
» Statistics: The number of workers and reviewers, Average work/review duration...
» Demographics, Worker history by individuals & in total...
30. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 4: How is the annotation process conducted and evaluated?
Worker pilot, crowdsourcing, and agreement
• Pilot tagging (by workers)
– Goal of worker pilot
» Guideline update in workers’ view (especially on cautions & exceptions)
» Worker selection
– Procedure
» Advertisement or recruitment
» Worker tagging
» Researchers’ (or moderators’) review & rejection
» Workers revise & resubmit
31. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 4: How is the annotation process conducted and evaluated?
Worker pilot, crowdsourcing, and agreement
• Details on worker selection process?
– Human checking
» Is the worker's ethical standard not too far from the guideline?
» Is feedback effective for rejected samples?
– Automatic checking
» Enough taggings done?
» Too many cases of skipping the annotation?
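The automatic checks above amount to simple threshold filters; the thresholds and record fields below are illustrative assumptions, not values from the project:

```python
def passes_auto_check(worker, min_tags=50, max_skip_rate=0.2):
    """Automatic part of worker selection: enough taggings done,
    and not too many skipped items (thresholds are project-specific)."""
    if worker["n_tagged"] < min_tags:
        return False
    total = worker["n_tagged"] + worker["n_skipped"]
    return worker["n_skipped"] / total <= max_skip_rate

workers = [
    {"id": "w1", "n_tagged": 120, "n_skipped": 10},  # passes both checks
    {"id": "w2", "n_tagged": 30,  "n_skipped": 2},   # too few taggings
    {"id": "w3", "n_tagged": 100, "n_skipped": 60},  # skips too often
]
selected = [w["id"] for w in workers if passes_auto_check(w)]
```

The human checks (ethical standard, response to feedback) cannot be automated this way and stay with the researchers or moderators.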
UI screenshots provided by Deep Natural AI.
32. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 4: How is the annotation process conducted and evaluated?
Worker pilot, crowdsourcing, and agreement
• Crowdsourcing: A simplified version is required for crowd annotation!
– Multi-class, multi-attribute tagging
» 3 classes for bias
» 3 classes for toxicity
– Given a comment (without context), the annotator should tag each attribute
– Detailed guideline (with examples, cautions, and exceptions) is provided separately
1. What kind of bias does the comment contain?
- Gender bias, Other biases, or None
2. Which is the adequate category for the comment in terms of toxicity?
- Hate, Offensive, or None
33. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 4: How is the annotation process conducted and evaluated?
Worker pilot, crowdsourcing, and agreement
• Main annotation
– Based on the final version of the guideline
» 3~5 annotators (per sample) for usual classification tasks
– Tagging done by selected workers
» Worker selection and education
» Short quiz (for workers not yet selected)
– Annotation toolkit
» Assign samples randomly to workers, with multiple annotators per sample
» Interface developed or provided by the platform (usually takes budget)
» Open-source interfaces (e.g., Label Studio)
– Data check for a further guarantee of quality
» What if there are sufficiently many annotators per sample?
» And if not...?
34. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 4: How is the annotation process conducted and evaluated?
Worker pilot, crowdsourcing, and agreement
• Data selection after main annotation (8,000 samples)
– Data reviewing strategy may differ by subtask
– Researchers decide the final label after adjudication
– Common for bias and toxicity
» Cases where all three annotators differ
– Only for toxicity
» Since toxicity is a continuum of degree,
cases split between only hate (o) and none (x) are investigated again
– Samples where no majority vote is possible are discarded
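The adjudication rules above can be sketched as follows; the label names are the simplified toxicity tags, and the helper names are illustrative:

```python
from collections import Counter

def adjudicate(tags):
    """Majority vote over one sample's annotations; None when there is
    no majority (e.g., all three annotators differ) -> discard or re-review."""
    counts = Counter(tags).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

def needs_reinvestigation(tags):
    """Toxicity is a continuum: samples split between only the extremes
    hate (o) and none (x), with no 'offensive' in between, are re-checked."""
    return set(tags) == {"hate", "none"}

final = adjudicate(["hate", "hate", "none"])               # majority: "hate"
tie = adjudicate(["hate", "offensive", "none"])            # no majority
recheck = needs_reinvestigation(["hate", "none", "none"])  # extremes only
```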
35. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 4: How is the annotation process conducted and evaluated?
Worker pilot, crowdsourcing, and agreement
• Final decision
– Test: 974
» Data tagged while constructing the guideline
(mostly adjusted to the intention of the guideline)
– Validation: 471
» Data which went through tag/review/reject/accept
cycles in the pilot phase,
done with a large number of annotators
(roughly aligned with the guideline)
– Train: 7,896
» Data crowd-sourced with the selected workers,
not fully reviewed, with adjudication only
for some special cases
• Agreement
– 0.492 for bias detection, 0.496 for toxicity measurement
Moon et al., BEEP! Korean Corpus of Online News Comments for Toxic Speech Detection, 2020.
36. Low-resource perspective
• Creating a hate speech corpus from scratch
Step 4: How is the annotation process conducted and evaluated?
Beyond creation - Model training and deployment
• Model training
– Traditionally
» High performance – relatively easy?
» Low performance – relatively challenging?
– But in PLM-based training these days...
» Pretraining corpora
» Model size
» Model architecture
– Model deployment
» Performance & size
» User feedback
Yang, Transformer-based Korean Pretrained Language Models: A Survey on Three Years of Progress, 2021.
37. Challenges
• Challenges of hate speech corpus construction
Context-dependency
• News comment – articles
• Tweets – threads
• Web community comments – posts
Multi-modal or noisy inputs
• Image and audio
– Kiela et al. (2020): Hateful Memes Challenge
• Perturbed texts
– Cho and Kim (2021): Leetspeak, Yaminjeongeum
Kiela et al., The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes, 2020.
Cho and Kim, Google-trickers, Yaminjeongeum, and Leetspeak: An Empirical Taxonomy for Intentionally Noisy User-Generated Text, 2021.
38. Challenges
• Challenges of hate speech corpus construction
Categorical or binary output has limitation
• Limitation of categorizing the degree of intensity
– Hate/offensive/none categorization is sub-optimal
– Poletto et al. (2019): scale-based annotation with an Unbalanced Rating Scale
» Used to determine the label
(or used as a target score?)
Poletto et al., Annotating Hate Speech: Three Schemes at Comparison, 2019.
39. Challenges
• Challenges of hate speech corpus construction
Annotation may require multiple labels
• The aspect of discrimination may differ by attribute
– Gender, race, nationality, ageism...
• Tagging 'all the target attributes' that appear?
– Kang et al. (2022)
» Detailed guideline with terms and concepts defined for each attribute
     Women     Male  Sexual      Race &       Ageism  Regionalism  Religion  Other  Malicious  None
     & family        minorities  nationality
S1   1         0     0           0            1       0            0         0      0          0
S2   0         0     0           0            0       0            0         0      1          0
S3   0         0     0           1            0       0            1         0      0          0
S4   0         0     0           0            0       0            0         0      0          1
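The multilabel rows above can be encoded as one binary vector per sample; the attribute identifiers below paraphrase the column names and are illustrative:

```python
# Target attributes from Kang et al. (2022), paraphrased as identifiers.
ATTRIBUTES = ["women_family", "male", "sexual_minorities", "race_nationality",
              "ageism", "regionalism", "religion", "other", "malicious", "none"]

def encode(labels):
    """Multi-hot vector: 1 where the attribute was tagged for the sample."""
    return [1 if attr in labels else 0 for attr in ATTRIBUTES]

s1 = encode({"women_family", "ageism"})  # row S1 in the table
s4 = encode({"none"})                    # row S4: no hateful attribute
```

Unlike the single-class bias/toxicity tags earlier, each sample here may switch on several attributes at once.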
Kang et al., Korean Online Hate Speech Dataset for Multilabel Classification - How Can Social Science Improve Dataset on Hate Speech?, 2022.
40. Challenges
• Challenges of hate speech corpus construction
Privacy and license issues
• Privacy and licenses can be violated by text crawling
• A hate speech corpus may contain personal information on (public) figures
• Text may have been brought from elsewhere (copy & paste)
How about creating hate (and non-hate) speech from scratch?
• Yang et al. (2022): Recruit workers and enable `anonymous’ text generation!
Yang et al., APEACH: Attacking Pejorative Expressions with Analysis on Crowd-Generated Hate Speech Evaluation Datasets, 2022.
41. Challenges
• Ambiguity is inevitable
Text may incorporate various ways of interpretation
• Text accompanies omission or replacement to trick the monitoring
• Intention is apparent considering the context
• Temporal diachronicity of hate speech
Speech that was not hateful in the past can be interpreted as hate speech today
Diachronicity may reduce the utility of prediction systems
• e.g., [a name of a celebrity who committed a crime] before 20xx / after 20xx
• Boundary of hate speech and freedom of speech
Grey area that cannot be resolved
• Some readers are offended by false positives
• Some users are offended by false negatives
42. Conclusion
• Hate speech is prevalent in both real and cyber spaces
Discussions on hate speech take diverse viewpoints, from academia to
society and industry, and these viewpoints are reflected in dataset construction
• No corpus is built perfectly from the beginning
... and a hate speech corpus is one of the most difficult kinds to create
• Considerations in low-resource hate speech corpus construction
Why? How? How much? How well?
• Still more challenges left
Context, input noise, output format, indecisiveness...
• Takeaways
There is a discrepancy between the theoretical and practical definitions of hate
speech, and their aims may differ
There is no hate speech detection guideline that satisfies ALL, so let's find
the boundary that satisfies the most and improve it
43. References
• Waseem and Hovy, Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter, 2016.
• Davidson et al., Automated Hate Speech Detection and the Problem of Offensive Language, 2017.
• Sanguinetti et al., An Italian Twitter Corpus of Hate Speech against Immigrants, 2018.
• Assimakopoulos et al., Annotating for Hate Speech: The MaNeCo Corpus and Some Input from Critical Discourse Analysis, 2020.
• Moon et al., BEEP! Korean Corpus of Online News Comments for Toxic Speech Detection, 2020.
• Hong et al., Study on the State and Regulation of Hate Speech, 2016.
• Cho and Moon, How Does the Hate Speech Corpus Concern Sociolinguistic Discussions? A Case Study on Korean Online News
Comments, 2021.
• Pustejovsky and Stubbs, Natural Language Annotation, 2012.
• Yang, Transformer-based Korean Pretrained Language Models: A Survey on Three Years of Progress, 2021.
• Kiela et al., The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes, 2020.
• Cho and Kim, Google-trickers, Yaminjeongeum, and Leetspeak: An Empirical Taxonomy for Intentionally Noisy User-Generated
Text, 2021.
• Poletto et al., Annotating Hate Speech: Three Schemes at Comparison, 2019.
• Kang et al., Korean Online Hate Speech Dataset for Multilabel Classification - How Can Social Science Improve Dataset on Hate
Speech?, 2022.
• Yang et al., APEACH: Attacking Pejorative Expressions with Analysis on Crowd-Generated Hate Speech Evaluation Datasets, 2022.