1. Introduction
Rise of online communities
Connect different communities
Ease of communication
“Hurl insults, bully & threaten through the use of profanity & hate speech”
Abusive Language
“Kill all Jews, they are a headache”
Straightforward methods to handle this: blacklists & regular expressions (a naive example is sketched after this slide)
“Kill yrself a$$hole”
Tokenization & Normalization issue
Complex methods needed
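A minimal Python sketch of why the straightforward approach breaks down (the blacklist entries and test comments are illustrative, not from the paper): a naive regular-expression blacklist catches the canonical spellings but misses the obfuscated variant.

import re

# Hypothetical blacklist; real systems maintain much larger curated lists.
BLACKLIST = re.compile(r"\b(asshole|kill\s+yourself)\b", re.IGNORECASE)

print(bool(BLACKLIST.search("Kill yourself, asshole")))  # True: canonical spellings are caught
print(bool(BLACKLIST.search("Kill yrself a$$hole")))     # False: obfuscation evades the list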
2. Related Work
Profanity detection: Sood et al. 2012
Hate speech detection: Warner & Hirschberg 2012
Cyberbullying: Dadvar et al. 2013
Generally abusive language detection: Chen et al. 2012, Djuric et al. 2015
8. Majority of the work focused on
supervised classification with
canonical NLP features
Token n-gram features
Hand-crafted regular expressions & blacklist features
Features which model the user’s past behavior: Dadvar et al. 2013
Semi-supervised LDA approach: Xiang et al. 2012
Paragraph2vec approach: Djuric et al. 2015
Usage of many features: Nobata et al. 2016
3. Methodology
Supervised classification methods with
lexical & morphological features to
measure various aspects of user
comments
Hybrid method based on
discriminative & generative classifiers
Binary classification task (abusive or not)
12. Feature classes
Tokens
Characters
Distributional semantics
Methods
Distributional Representation of Comments
(C2V)
Recurrent Neural Network Language Model
(RNNLM)
Support Vector Machine with Naïve Bayes features (NBSVM)
3.1. Distributional Representation of Comments (C2V)
Modeling lexical semantics using a vector space model (Mikolov et al. 2013)
A vector representation with comment
embeddings
A skip-bigram model to train the
embeddings of the words in comments
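As a hedged illustration of training comment embeddings, the sketch below uses gensim's Doc2Vec in PV-DBOW mode with dbow_words=1, which jointly learns skip-gram word vectors and one vector per comment; the toy comments, dimensionality, and hyperparameters are assumptions, not the paper's exact C2V setup.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy comments standing in for the comment corpus.
comments = [
    "kill yrself a$$hole",
    "thanks for the helpful answer",
    "go back to where you came from",
    "great point, i agree with you",
]
corpus = [TaggedDocument(words=c.split(), tags=[i]) for i, c in enumerate(comments)]

# PV-DBOW (dm=0) with dbow_words=1 also trains skip-gram word vectors alongside
# the per-comment vectors, roughly mirroring the training described on this slide.
model = Doc2Vec(corpus, vector_size=100, window=5, min_count=1, dm=0, dbow_words=1, epochs=40)

comment_vector = model.dv[0]  # embedding of the first comment
print(comment_vector.shape)   # (100,)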
3.2. Recurrent Neural Network Language Model (RNNLM)
RNNs have the potential to represent more advanced patterns
Models are trained for both classes
(abusive & clean)
Token n-grams where n = 1, 2, 3, 4, 5
Character n-grams where n = 1, 2, 3, 4, 5
Space is treated as a character, to investigate the character vs. word claim
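A small Python sketch of what treating space as a character looks like when preparing input for a character-level language model; the underscore marker for the space is an assumed convention, not necessarily the paper's exact preprocessing.

def char_tokens(comment: str) -> list:
    """Turn a comment into a character stream, keeping spaces as explicit tokens."""
    # "_" stands in for the space character so it survives whitespace-delimited LM input files.
    return ["_" if ch == " " else ch for ch in comment.lower()]

print(" ".join(char_tokens("Kill yrself")))
# k i l l _ y r s e l f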
RNNLM
Testing: estimate the ratio of the probabilities of the comment belonging to each class via Bayes’ rule
If the probability of a comment under the abusive language model is higher than its probability under the non-abusive language model, the comment is classified as abusive
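A minimal sketch of this decision rule, assuming the two trained language models are available as functions returning log P(comment | class); the function names and the uniform class prior are illustrative assumptions.

import math

def classify(comment, log_p_abusive, log_p_clean, prior_abusive=0.5):
    """Label a comment by comparing class-conditional LM probabilities via Bayes' rule."""
    # log_p_abusive / log_p_clean are hypothetical wrappers around the two trained RNNLMs,
    # each returning log P(comment | class).
    log_ratio = (log_p_abusive(comment) + math.log(prior_abusive)) \
              - (log_p_clean(comment) + math.log(1.0 - prior_abusive))
    label = "abusive" if log_ratio > 0 else "clean"
    return label, log_ratio  # the ratio also serves as the score used to compute AUC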
RNNLM
The ratios are used to calculate the AUC metric
The RNNLM toolkit of Mikolov (2011)
One “word” model
Two character models, “char1” & “char2”
bptt: number of steps to propagate the error back through time
                     word   char1   char2
Hidden layer size      50      50     200
bptt                    4       4      10
3.3. Support Vector Machine with Naïve Bayes Features (NBSVM)
SVM + NB
Compute the log-ratio vector between the average character n-gram counts from abusive & non-abusive comments
Input to the SVM: the log-ratio vector multiplied by the binary pattern of character n-grams in the comment vector
Multi-core LibLinear Library
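A minimal sketch of the NBSVM idea on toy data, using scikit-learn's CountVectorizer and LinearSVC as a stand-in for the multi-core LibLinear library; the toy comments, n-gram range, and add-one smoothing are assumptions for illustration.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy labeled comments (1 = abusive, 0 = clean); illustrative only.
comments = ["kill yrself a$$hole", "u r an idiot", "thanks for the help", "nice write-up"]
labels = np.array([1, 1, 0, 0])

# Character n-gram counts, n = 1..5
vec = CountVectorizer(analyzer="char", ngram_range=(1, 5))
X = vec.fit_transform(comments)

# Naive Bayes log-count ratio between the abusive and clean classes (add-one smoothing assumed)
p = np.asarray(X[labels == 1].sum(axis=0)).ravel() + 1.0
q = np.asarray(X[labels == 0].sum(axis=0)).ravel() + 1.0
r = np.log((p / p.sum()) / (q / q.sum()))

# The SVM input is the binary presence pattern of each n-gram scaled by the log ratio.
X_nb = (X > 0).multiply(r.reshape(1, -1))
clf = LinearSVC(C=1.0).fit(X_nb, labels)

# Apply the same transform at test time.
X_test = (vec.transform(["u r a moron"]) > 0).multiply(r.reshape(1, -1))
print(clf.predict(X_test))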
4. Evaluation
Datasets: the datasets used in Djuric et al. & Nobata et al.
Labels: from a combination of in-house raters, users reactively flagging bad comments, & abusive language pattern detectors
5-fold cross-validation, reporting AUC, recall, precision & F1 score
Baseline: a token n-gram classifier with logistic regression
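A hedged sketch of the token n-gram logistic-regression baseline under 5-fold cross-validation with AUC, using scikit-learn; the tiny dataset and the n-gram range are stand-ins, not the Djuric et al. / Nobata et al. data or the paper's exact configuration.

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Tiny stand-in dataset (1 = abusive, 0 = clean); illustrative only.
comments = [
    "kill yrself a$$hole", "u r an idiot", "go away loser", "shut up moron", "you are trash",
    "thanks for the help", "nice write-up", "great explanation", "have a good day", "well argued",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

baseline = make_pipeline(
    CountVectorizer(analyzer="word", ngram_range=(1, 2)),  # token n-grams (range assumed)
    LogisticRegression(max_iter=1000),
)
auc_scores = cross_val_score(baseline, comments, labels, cv=5, scoring="roc_auc")
print(auc_scores.mean())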
5. Conclusion
Character-based approaches fared best in cases with irregular normalization or obfuscation of words
Showed the superiority of simple character-based approaches over the previous state of the art, as well as over token-based ones & two deep learning approaches
The rise of online communities over the last ten years, in various forms such as message boards, Twitter, discussion forums, etc., has allowed people from disparate backgrounds to connect in a way that would not have been possible before. However, the ease of communication online has made it possible for both anonymous and non-anonymous posters to hurl insults, bully, and threaten through the use of profanity and hate speech.
Straightforward methods to handle abusive language like …
They are conscious bastardizations of words in an effort to evade blacklists, so characters often play an important role in the language of comments.
In this model, every paragraph is mapped to a unique vector, represented by a column in a matrix, and every word is also mapped to a unique vector, represented by a column in another matrix.
The paragraph vector & word vectors are averaged or concatenated to predict the next word in a context.
After training, the paragraph vectors can be used as features for the paragraph and fed directly to conventional ML techniques such as logistic regression, SVM, or k-means.
Advantages
Can work for tasks that do not have enough labeled data
Takes word order into consideration, at least within a small context, in the same way that an n-gram model with a large n would
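To make the "feed these features directly to conventional ML techniques" step concrete, here is a hedged sketch using gensim's Doc2Vec in PV-DM mode (dm=1, which combines the paragraph vector with context word vectors to predict the next word, as described above) followed by scikit-learn's LogisticRegression; all data and hyperparameters are illustrative.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Toy labeled comments (1 = abusive, 0 = clean); illustrative only.
comments = ["kill yrself a$$hole", "u r an idiot", "thanks for the help", "nice write-up"]
labels = [1, 1, 0, 0]

corpus = [TaggedDocument(words=c.split(), tags=[i]) for i, c in enumerate(comments)]

# PV-DM (dm=1): the paragraph vector plus context word vectors predict the next word.
model = Doc2Vec(corpus, vector_size=50, window=3, min_count=1, dm=1, epochs=60)

# Use the learned paragraph vectors as features for a conventional classifier.
X = [model.dv[i] for i in range(len(comments))]
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Infer a vector for an unseen comment and classify it.
new_vec = model.infer_vector("u r such a moron".split())
print(clf.predict([new_vec]))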