This document summarizes the construction and analysis of a Korean hate speech corpus. It discusses how hate speech was defined and annotated, including the guidelines developed for identifying social bias and measuring toxicity. Over 10,000 online comments were annotated. The analysis found that toxicity usually accompanies biased comments, and that gender-related bias tended to coincide with more toxic expressions than other biases. The corpus was created to support real-world hate speech detection in Korean and to address gaps in previous work.
1. Hate Speech as Toxic and Biased Words:
Construction and Analysis of
Korean Hate Speech Corpus
Won Ik Cho (SNU ECE)
2021. 6. 4 @JWLLP
2. Contents
• Introduction
• Source Corpus
• Guideline and Annotation
• Analysis
• Conclusion
Caution! This presentation contains content that may be offensive
3. Introduction
• Hate speech
What are the aspects of hate speech?
• Hate speech and hatred
• Bad words and insulting
• Discrimination and bias
Various projects are underway in the name of ...
• Abusive language, Toxic words, etc.
There is social agreement that prevalent hate speech `matters’ a lot
However, debates remain on:
• What really is `hate speech’?
• Can certain expressions be called `hate speech’?
• Is hate speech really hateful?
5. Introduction
• Hate speech
Hate speech detection in practice
• Finding and blinding malicious expressions in game or broadcasting chat
• Blinding posts/comments of Youtube, Facebook or Twitter based on detecting
system
Do current practical studies consider theoretical/social discussions?
• Current practical studies in Korean hate speech detection
– Detecting swear words and profanity terms: usually dictionary-based
– Define the sentences that contain such terms as `hate speech’
– OR sometimes define the expressions from certain communities as hate speech
– Little study involving human annotation of the utterances
7. Introduction
• Hate speech
In literature (and in other languages)
• Waseem and Hovy (2016)
– Tags English Twitter posts with around ten or more characteristics that imply hate speech
• Davidson et al. (2017)
– Mentions the discrepancy between the theoretical definition and real world
expressions of hate speech
– Puts `offensive’ expressions in between `hate’ and `non-hate’, to incorporate the
expressions that are in the grey area
• Sanguinetti et al. (2018)
– Investigates hate speech in Italian posts about immigrants
» Beyond hate speech, detects whether the post is offensive, aggressive, or intense, contains irony or sarcasm, or shows a stereotype
» `Stereotype’ as a factor that can be a clue to discrimination
9. Introduction
• Hate speech
Research Questions
• RQ1
– How is hate speech displayed in Korean online comments?
» What is bias and which categories are included in?
» How can we represent the amount of toxicity of expressions?
• RQ2
– What characteristics does the Korean hate speech corpus incorporate?
» Does bias accompany the toxicity of expression?
» Does toxicity vary with the type of bias shown?
10. Source Corpus
• Comments from the most popular Korean entertainment news
platform
Jan. 2018 ~ Feb. 2020
10,403,368 comments from 23,700 articles
Sampling and Filtering
Top 20 comments ranked by Wilson score on downvotes for each of
1,580 articles acquired by stratified sampling
• Filter out duplicates and keep comments having more than a single
token and fewer than 100 characters
• 10K comments were selected
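The sampling and filtering steps above can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: `wilson_lower_bound` and `filter_comments` are assumed names, and the confidence level (z = 1.96) is an assumption.

```python
import math

def wilson_lower_bound(downvotes, total, z=1.96):
    # Lower bound of the Wilson score interval for the downvote
    # proportion; comments with stronger evidence of downvoting
    # rank higher than ones with few total votes.
    if total == 0:
        return 0.0
    p = downvotes / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * total)) / total)
    return (centre - margin) / (1 + z * z / total)

def filter_comments(comments):
    # Deduplicate, then keep comments with more than a single token
    # and fewer than 100 characters, as described above.
    seen, kept = set(), []
    for c in comments:
        if c in seen:
            continue
        seen.add(c)
        if len(c.split()) > 1 and len(c) < 100:
            kept.append(c)
    return kept
```

The Wilson lower bound rewards both a high downvote ratio and a large number of votes, so 90 downvotes out of 100 ranks above 9 out of 10 despite the identical ratio.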
11. Guideline and Annotation
• Formulation
Hate speech
• Discussion based on 1,000 comments out of the total 10,000
• Which factors make a comment `hate speech’?
– Bias
» `People with a specific characteristic may behave in some way’
» May differ from individual judgment
– Hate
» Hostility towards a specific group or individual
» Can be expressed with profanity terms, but such terms do not necessarily imply hate
– Insult
» Expressions that can harm the prestige of an individual or group
» Various profanity terms are included
– Offensive expressions
» Do not count as hate or insult, but may offend readers
» Includes sarcasm, irony, malicious speculation, and unethical expressions
13. Guideline and Annotation
• Formulation
Social bias + Toxicity
• Detection of bias (ternary)
– Gender-related bias (Why?)
– Other biases
– None
» Close to the problem of `detection’
» Why concentrated on gender issue?
• Measuring toxicity (ternary)
– Severe hate or insult
– Not hateful but offensive or sarcastic
– None
» Close to the problem of `amount’
» Why formulated as a problem of intensity?
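One annotated instance under the two ternary attributes above can be sketched as a small record. The field and label names below are illustrative, not necessarily those of the released corpus.

```python
# Illustrative label sets for the two ternary attributes described above.
BIAS_LABELS = {"gender", "others", "none"}   # detection of bias
HATE_LABELS = {"hate", "offensive", "none"}  # amount of toxicity

def validate(record):
    # A record is well-formed when it carries exactly one label
    # from each attribute's label set.
    return record["bias"] in BIAS_LABELS and record["hate"] in HATE_LABELS
```

For example, a comment tagged with gender-related bias and offensive (but not hateful) toxicity would be `{"bias": "gender", "hate": "offensive"}`.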
17. Guideline and Annotation
• Guideline
Multi-label tagging
• 3 classes for bias
• 3 classes for toxicity
Given a comment (without context), the annotator should tag each
attribute
Each comment was provided to three random annotators
• Total 32 participants (in pilot and main tagging phase)
• Female : male = 6 : 4 / 20s : 30s : 40s = 3 : 2 : 1
1. What kind of bias does the comment contain?
- Gender bias, Other biases, or None
2. Which is the adequate category for the comment in terms of toxicity?
- Hate, Offensive, or None
18. Guideline and Annotation
• Pilot tagging – Which workers would fit?
Human checked
• Ethical standard not too far from the guideline?
• Is feedback effective for the rejected samples?
Automatically checked
• Enough taggings done?
• Too frequent cases of skipping the annotation?
19. Guideline and Annotation
• Crowd-sourcing – With selected workers
Per-annotator feedback was not conducted in the sourcing phase
20. Analysis
• Data Post-processing
After whole annotation (8,000 instances)
• Commonly checked for social bias and toxicity
– If all three annotators differ
» Task managers decide the final label after adjudication
• For toxicity
– Since the problem concerns ‘intensity’, only the (o) and (x) cases need to be reorganized
» Final decision after adjudication
• Instances where a majority vote was impossible were discarded
Annotator agreement (Krippendorff’s alpha): overall moderate
• Bias (binary) – 0.767 (Existence of gender-related bias is relatively explicit)
• Bias (ternary) – 0.492
• Hate (ternary) – 0.496
22. Analysis
• Final data
Data split
• Discarded 659 out of 10,000
• Split the rest into train/valid/test
Data composition
• Test: 974
– Data tagged while constructing the guideline (best aligned with the intention of the guideline)
• Valid: 471
– Data which went through tagging/review/reject-and-accept in the pilot phase, done with a large number of annotators (roughly aligned with the guideline)
• Train: 7,896
– Data crowd-sourced with the selected annotators, not fully reviewed but adjudicated for some special cases
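A quick sanity check on the figures above: the three splits account exactly for the comments that survived adjudication.

```python
# Split sizes reported above: 659 of 10,000 comments were discarded,
# and the remainder forms the train/valid/test splits.
total, discarded = 10_000, 659
splits = {"train": 7_896, "valid": 471, "test": 974}
assert sum(splits.values()) == total - discarded  # 9,341 retained
```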
23. Analysis
• Final data
Characteristics
• Toxic comments take a slightly larger portion than `None’
• For bias, the same does not hold
Something to remark
• ‘Lots of toxic expressions in the celebrity news domain’?
– Though we sampled in the order of downvotes, the overall portion does not necessarily reflect the toxicity of random comments
• ‘Higher portion of toxic comments compared to bias’?
– Though the results suggest so, biases are usually implicit and might not have been visible to the users
» So they were not accurately reflected in up/downvotes
25. Analysis
• Final data
Bias and toxicity
• Toxicity is observed in most texts
with gender-related or other biases
– Gender-related bias?
» 93.76% toxic
– Other biases?
» 90.42% toxic
• In contrast, toxic comments do not necessarily contain biases
The category of bias and amount of toxicity
• Gender-related bias appears about 1.4 times as often as other biases in `hate’
– In `offensive’, the portion of gender-related bias drops to about half that of other biases
• Possibly influenced largely by our guideline, but this still suggests that the amount of toxicity in the celebrity news domain is closely tied to gender-related content
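The co-occurrence figures above (e.g. 93.76% of gender-biased comments being toxic) are shares of the following form. The function below is an illustrative sketch over records carrying the two attributes, not the authors' analysis code.

```python
def toxic_share(rows, bias):
    # Share of comments labeled `hate` or `offensive` among those
    # carrying the given bias label.
    biased = [r for r in rows if r["bias"] == bias]
    toxic = [r for r in biased if r["hate"] in ("hate", "offensive")]
    return len(toxic) / len(biased)

rows = [
    {"bias": "gender", "hate": "hate"},
    {"bias": "gender", "hate": "offensive"},
    {"bias": "gender", "hate": "none"},
    {"bias": "others", "hate": "none"},
]
toxic_share(rows, "gender")  # 2 of 3 gender-biased rows are toxic
```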
27. Analysis
• Research questions
RQ1
• How is hate speech displayed
in Korean online comments?
– Social bias and Toxicity
RQ2
• What characteristics does the
Korean hate speech corpus
incorporate?
– Bias usually accompanies toxicity
– Gender-related bias seems to
accompany more toxic expressions
28. Conclusion
• Discussions on hate speech span diverse viewpoints, from academia to society and industry
• Constructing a hate speech corpus in Korean links these discussions so as to be useful for real-world hate speech detection
• We observed bias and toxicity in Korean hate speech, weighted toward gender-related factors in celebrity news comments
• Our future work includes building hate speech corpora for various domains of text, from formal to colloquial, to cover the remaining cases
29. Conclusion
• Model and data release
Annotation guideline
• https://www.notion.so/c1ecb7cc52d446cc93d928d172ef8442
Kaggle competition
• https://www.kaggle.com/c/korean-gender-bias-detection
• https://www.kaggle.com/c/korean-bias-detection/
• https://www.kaggle.com/c/korean-hate-speech-detection/
Github repository
• https://github.com/kocohub/korean-hate-speech
• For easier data importing
Koco package
• https://github.com/inmoonlight/koco
– Library to easily access kocohub datasets
– Kocohub contains KOrean COrpus for natural language processing
» https://github.com/kocohub