1. Introduction
Rise of online communities
Connect different communities
Ease of communication
“Hurl insults, bully & threaten through the use of profanity & hate speech”
Abusive Language
“Kill all Jews, they are a headache”
Straightforward methods to handle this: blacklists & regular expressions (a naive example is sketched after this slide)
“Kill yrself a$$hole”
Tokenization & Normalization issue
Complex methods needed
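A minimal Python sketch of why the straightforward approach breaks down (the blacklist entries and test comments are illustrative, not from the paper): a naive regular-expression blacklist catches the canonical spellings but misses the obfuscated variant.

import re

# Hypothetical blacklist; real systems maintain much larger curated lists.
BLACKLIST = re.compile(r"\b(asshole|kill\s+yourself)\b", re.IGNORECASE)

print(bool(BLACKLIST.search("Kill yourself, asshole")))  # True: canonical spellings are caught
print(bool(BLACKLIST.search("Kill yrself a$$hole")))     # False: obfuscation evades the list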
2. Related Work
Profanity detection: Sood et al. 2012
Hate speech detection: Warner & Hirschberg 2012
Cyberbullying: Dadvar et al. 2013
Generally abusive language detection: Chen et al. 2012, Djuric et al. 2015
8. Majority of the work focused on
supervised classification with
canonical NLP features
Token n-gram features
Hand-crafted regular expressions & blacklist features
Features which model the user’s past behavior: Dadvar et al. 2013
Semi-supervised LDA approach: Xiang et al. 2012
Paragraph2vec approach: Djuric et al. 2015
Usage of many features: Nobata et al. 2016
3. Methodology
Supervised classification methods with
lexical & morphological features to
measure various aspects of user
comments
Hybrid method based on
discriminative & generative classifiers
Binary classification task (abusive or not)
12. Feature classes
Tokens
Characters
Distributional semantics
Methods
Distributional Representation of Comments
(C2V)
Recurrent Neural Network Language Model
(RNNLM)
Support Vector Machine with Naïve Bayes features (NBSVM)
3.1. Distributional Representation of Comments (C2V)
Modeling lexical semantics using a vector space model (Mikolov et al. 2013)
A vector representation with comment
embeddings
A skip-bigram model to train the
embeddings of the words in comments
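As a hedged illustration of training comment embeddings, the sketch below uses gensim's Doc2Vec in PV-DBOW mode with dbow_words=1, which jointly learns skip-gram word vectors and one vector per comment; the toy comments, dimensionality, and hyperparameters are assumptions, not the paper's exact C2V setup.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy comments standing in for the comment corpus.
comments = [
    "kill yrself a$$hole",
    "thanks for the helpful answer",
    "go back to where you came from",
    "great point, i agree with you",
]
corpus = [TaggedDocument(words=c.split(), tags=[i]) for i, c in enumerate(comments)]

# PV-DBOW (dm=0) with dbow_words=1 also trains skip-gram word vectors alongside
# the per-comment vectors, roughly mirroring the training described on this slide.
model = Doc2Vec(corpus, vector_size=100, window=5, min_count=1, dm=0, dbow_words=1, epochs=40)

comment_vector = model.dv[0]  # embedding of the first comment
print(comment_vector.shape)   # (100,)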
3.2. Recurrent Neural Network Language Model (RNNLM)
RNNs have the potential to represent more advanced patterns
Models are trained for both classes
(abusive & clean)
Token n-grams where n = 1, 2, 3, 4, 5
Character n-grams where n = 1, 2, 3, 4, 5
Space is treated as a character, to investigate the character vs. word claim
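A small Python sketch of what treating space as a character looks like when preparing input for a character-level language model; the underscore marker for the space is an assumed convention, not necessarily the paper's exact preprocessing.

def char_tokens(comment: str) -> list:
    """Turn a comment into a character stream, keeping spaces as explicit tokens."""
    # "_" stands in for the space character so it survives whitespace-delimited LM input files.
    return ["_" if ch == " " else ch for ch in comment.lower()]

print(" ".join(char_tokens("Kill yrself")))
# k i l l _ y r s e l f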
RNNLM
Testing: estimate the ratio of the probabilities of the comment belonging to each class via Bayes’ rule
If the probability of a comment under the abusive language model is higher than its probability under the non-abusive language model, the comment is classified as abusive
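A minimal sketch of this decision rule, assuming the two trained language models are available as functions returning log P(comment | class); the function names and the uniform class prior are illustrative assumptions.

import math

def classify(comment, log_p_abusive, log_p_clean, prior_abusive=0.5):
    """Label a comment by comparing class-conditional LM probabilities via Bayes' rule."""
    # log_p_abusive / log_p_clean are hypothetical wrappers around the two trained RNNLMs,
    # each returning log P(comment | class).
    log_ratio = (log_p_abusive(comment) + math.log(prior_abusive)) \
              - (log_p_clean(comment) + math.log(1.0 - prior_abusive))
    label = "abusive" if log_ratio > 0 else "clean"
    return label, log_ratio  # the ratio also serves as the score used to compute AUC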
RNNLM
The ratios are used to calculate the AUC metric
The RNNLM toolkit of Mikolov (2011)
One “word” model
Two character models, “char1” & “char2”
bptt: number of steps to propagate the error back through time
                     word   char1   char2
Hidden layer size      50      50     200
bptt                    4       4      10
3.3. Support Vector Machine with Naïve Bayes Features (NBSVM)
SVM + NB
Compute the log-ratio vector between the average character n-gram counts from abusive & non-abusive comments
Input to the SVM: the log-ratio vector multiplied by the binary pattern of character n-grams in the comment vector
Multi-core LibLinear Library
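A minimal sketch of the NBSVM idea on toy data, using scikit-learn's CountVectorizer and LinearSVC as a stand-in for the multi-core LibLinear library; the toy comments, n-gram range, and add-one smoothing are assumptions for illustration.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy labeled comments (1 = abusive, 0 = clean); illustrative only.
comments = ["kill yrself a$$hole", "u r an idiot", "thanks for the help", "nice write-up"]
labels = np.array([1, 1, 0, 0])

# Character n-gram counts, n = 1..5
vec = CountVectorizer(analyzer="char", ngram_range=(1, 5))
X = vec.fit_transform(comments)

# Naive Bayes log-count ratio between the abusive and clean classes (add-one smoothing assumed)
p = np.asarray(X[labels == 1].sum(axis=0)).ravel() + 1.0
q = np.asarray(X[labels == 0].sum(axis=0)).ravel() + 1.0
r = np.log((p / p.sum()) / (q / q.sum()))

# The SVM input is the binary presence pattern of each n-gram scaled by the log ratio.
X_nb = (X > 0).multiply(r.reshape(1, -1))
clf = LinearSVC(C=1.0).fit(X_nb, labels)

# Apply the same transform at test time.
X_test = (vec.transform(["u r a moron"]) > 0).multiply(r.reshape(1, -1))
print(clf.predict(X_test))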
4. Evaluation
Datasets: the datasets used in Djuric et al. & Nobata et al.
Labels: from a combination of in-house raters, users reactively flagging bad comments, & abusive language pattern detectors
5-fold cross-validation, reporting AUC, recall, precision & F1 score
Baseline: a token n-gram classifier with logistic regression
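A hedged sketch of the token n-gram logistic-regression baseline under 5-fold cross-validation with AUC, using scikit-learn; the tiny dataset and the n-gram range are stand-ins, not the Djuric et al. / Nobata et al. data or the paper's exact configuration.

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Tiny stand-in dataset (1 = abusive, 0 = clean); illustrative only.
comments = [
    "kill yrself a$$hole", "u r an idiot", "go away loser", "shut up moron", "you are trash",
    "thanks for the help", "nice write-up", "great explanation", "have a good day", "well argued",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

baseline = make_pipeline(
    CountVectorizer(analyzer="word", ngram_range=(1, 2)),  # token n-grams (range assumed)
    LogisticRegression(max_iter=1000),
)
auc_scores = cross_val_score(baseline, comments, labels, cv=5, scoring="roc_auc")
print(auc_scores.mean())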
5. Conclusion
Character-based approaches fared best in cases with irregular normalization or obfuscation of words
Showed the superiority of simple character-based approaches over the previous state of the art, as well as over token-based ones & two deep learning approaches
The rise of online communities over the last ten years, in various forms such as message boards, Twitter, discussion forums, etc., has allowed people from disparate backgrounds to connect in a way that would not have been possible before. However, the ease of communication online has made it possible for both anonymous and non-anonymous posters to hurl insults, bully, and threaten through the use of profanity and hate speech.
Straightforward methods to handle abusive language like …
They are conscious bastardizations of words in an effort to evade blacklists, so characters often play an important role in the language of comments.
In this model, every paragraph is mapped to a unique vector, represented by a column in a matrix, and every word is also mapped to a unique vector, represented by a column in another matrix.
The paragraph vector & word vectors are averaged or concatenated to predict the next word in a context.
After training, the paragraph vectors can be used as features for the paragraph and fed directly to conventional ML techniques such as logistic regression, SVM, or k-means.
Advantages
Can work for tasks that do not have enough labeled data
Takes word order into consideration, at least within a small context, in the same way that an n-gram model with a large n would
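To make the "feed these features directly to conventional ML techniques" step concrete, here is a hedged sketch using gensim's Doc2Vec in PV-DM mode (dm=1, which combines the paragraph vector with context word vectors to predict the next word, as described above) followed by scikit-learn's LogisticRegression; all data and hyperparameters are illustrative.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Toy labeled comments (1 = abusive, 0 = clean); illustrative only.
comments = ["kill yrself a$$hole", "u r an idiot", "thanks for the help", "nice write-up"]
labels = [1, 1, 0, 0]

corpus = [TaggedDocument(words=c.split(), tags=[i]) for i, c in enumerate(comments)]

# PV-DM (dm=1): the paragraph vector plus context word vectors predict the next word.
model = Doc2Vec(corpus, vector_size=50, window=3, min_count=1, dm=1, epochs=60)

# Use the learned paragraph vectors as features for a conventional classifier.
X = [model.dv[i] for i in range(len(comments))]
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Infer a vector for an unseen comment and classify it.
new_vec = model.infer_vector("u r such a moron".split())
print(clf.predict([new_vec]))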