SMU Master of IT in Business
ISSS610 APPLIED MACHINE LEARNING
Project Group 5:
Ayushi JAISWAL
Milouni DESAI
SOH Hui Shan
TEO Yaling
Toxic Comment Classification
Agenda
• Introduction
- Problem Statement
- Dataset
- EDA
- Overall Approach
• Initial Approach
- Challenges
• Revised Approach
- Flow Chart
- Feature Engineering
- Model Results & Feature Importance
- Examples
Problem Statement
Background
Online discussion is an integral aspect of social
media platforms but it has become an avenue
for abuse and harassment.
Example comment: "Hi! I am back again! Last warning! Stop undoing my edits or die!"
6 Classes: Toxic, Severe Toxic, Obscene, Threat, Insult, Identity Hate
Objective
Build a multi-label classification
model to detect different types of
toxicity for each comment.
Dataset
• Kaggle Competition Dataset
• 150k comments from Wikipedia Talk pages
• 6 Classes of Toxicity
- Toxic
- Severe toxic
- Obscene
- Threat
- Insult
- Identity hate
Dataset can be downloaded from: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/overview
Dataset: Sample Records
id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate
1 | Hi! I am back again! Last warning! Stop undoing my edits or die! | 1 | 0 | 0 | 1 | 0 | 0
2 | Gays are disgusting. It's just my opinion but gays are disgusting. | 1 | 0 | 0 | 0 | 1 | 1
3 | IM GOING TO KILL YOU ALL!! I'm serious, you are all mean nasty idiots who deserve to die and I'm going to throw every single one of you in a shredder!!!!!!!!! | 1 | 1 | 0 | 1 | 1 | 0
The last six columns are the 6 classes.
EDA
Class distribution (number of comments per label):
Toxic | 15,294
Obscene | 8,449
Insult | 7,877
Severe Toxic | 1,595
Identity Hate | 1,405
Threat | 478
• About 90% of comments are clean.
• Word clouds for all toxic labels look very similar to each other.
[Figures: word cloud for toxic comments; word cloud for clean comments]
Approach
• EDA: Data Visualization of Distributions and Word Clouds
• Text Preprocessing: Remove punctuation, Lemmatization, Remove stopwords
• Feature Engineering: Bag of Words, TF-IDF, Word Embeddings
• Modelling: Naïve Bayes, Convolutional Neural Networks (CNN), Long Short-Term Memory Networks (LSTM)
• Validation & Evaluation: Precision, Recall, F1-score, Accuracy
Initial Approach: Results
Results of 6-label Classification*
* with subset of ~50k records
High average accuracy (96%), but very low micro-recall and micro-F1 score.
Why? Let's look at the confusion matrix…
Metrics used:
avg_accuracy = sum(accuracy) / no_of_labels
micro_precision = sum(TP) / (sum(TP) + sum(FP))
micro_recall = sum(TP) / (sum(TP) + sum(FN))
micro_f1_score = (2 * micro_precision * micro_recall) / (micro_precision + micro_recall)
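The four definitions above can be sketched in plain Python as follows. The function name `micro_metrics` and the per-label counts are hypothetical and chosen only for illustration; they are not results from the project.

```python
def micro_metrics(per_label):
    """Compute average accuracy and micro-averaged precision/recall/F1.

    per_label: one dict of tp/fp/fn/tn counts for each label.
    """
    tp = sum(c["tp"] for c in per_label)
    fp = sum(c["fp"] for c in per_label)
    fn = sum(c["fn"] for c in per_label)

    # Accuracy is computed per label, then averaged over the labels.
    accuracies = [
        (c["tp"] + c["tn"]) / (c["tp"] + c["tn"] + c["fp"] + c["fn"])
        for c in per_label
    ]
    avg_accuracy = sum(accuracies) / len(per_label)

    # Micro-averaging pools the counts of all labels before dividing.
    micro_precision = tp / (tp + fp) if tp + fp else 0.0
    micro_recall = tp / (tp + fn) if tp + fn else 0.0
    denom = micro_precision + micro_recall
    micro_f1 = 2 * micro_precision * micro_recall / denom if denom else 0.0

    return {
        "avg_accuracy": avg_accuracy,
        "micro_precision": micro_precision,
        "micro_recall": micro_recall,
        "micro_f1": micro_f1,
    }


# Hypothetical counts for two rare labels, 1,000 comments each.
counts = [
    {"tp": 20, "fp": 5, "fn": 80, "tn": 895},
    {"tp": 5, "fp": 5, "fn": 45, "tn": 945},
]
scores = micro_metrics(counts)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

With these made-up counts, the large number of true negatives keeps average accuracy above 90% while micro-recall stays below 20%, which is the same pattern the slide describes.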
Initial Approach: Challenges
High Accuracy but High False Negatives
[Confusion matrices: Embeddings + Logistic Regression; Embeddings + Neural Network]
• High FN is an issue; we want it to be low
• High accuracy is driven by imbalanced data (90% clean)
How do we deal with these issues?
Revised Approach
1. Down-sampling 60:40 to deal with imbalanced data
Share of comments per label, before and after down-sampling:
Label | Before Down-Sampling | After Down-Sampling
Clean | 90% | 59%
Toxic | 10% | 39%
Obscene | 5% | 22%
Insult | 5% | 21%
Identity Hate | 1% | 4%
Severe Toxic | 1% | 4%
Threat | 0% | 1%
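A minimal sketch of down-sampling to a 60:40 ratio, using only the standard library. The slides do not give the exact procedure, so this assumes that every comment with at least one label is kept and that clean comments are sampled at random; `downsample_clean`, the row layout and the toy data are hypothetical.

```python
import random


def downsample_clean(rows, clean_share=0.6, seed=0):
    """Keep all labelled rows and sample clean rows to reach clean_share.

    rows: list of (comment_text, labels) where labels is a tuple of 0/1 flags.
    """
    flagged = [row for row in rows if any(row[1])]
    clean = [row for row in rows if not any(row[1])]

    # Number of clean rows needed so that clean : flagged = 60 : 40.
    target_clean = round(len(flagged) * clean_share / (1 - clean_share))
    target_clean = min(target_clean, len(clean))

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    result = flagged + rng.sample(clean, target_clean)
    rng.shuffle(result)
    return result


# Toy data with the original 90:10 imbalance.
rows = [(f"clean comment {i}", (0, 0, 0, 0, 0, 0)) for i in range(90)]
rows += [(f"toxic comment {i}", (1, 0, 0, 0, 0, 0)) for i in range(10)]

balanced = downsample_clean(rows)
n_clean = sum(1 for _, labels in balanced if not any(labels))
print(len(balanced), n_clean)
```

Only the majority class is reduced, so no labelled comment is discarded.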
Revised Approach
2. Conditional Probabilities of Each Class
[Figure: probability matrix]
The other 5 labels are highly dependent on Toxic, which leads us to a 2-level classification approach.
P(Toxic | Severe Toxic) = 100%
P(Toxic | Obscene) = 95%
P(Toxic | Insult) = 95%
P(Toxic | Identity Hate) = 94%
P(Toxic | Threat) = 95%
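Conditional probabilities of this kind can be estimated directly from the label columns. The sketch below is an illustration only: the first three rows are the sample records from the Dataset slide, the fourth row is hypothetical, and `p_toxic_given` is a made-up helper name.

```python
LABELS = ("toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate")


def p_toxic_given(label_rows, label):
    """Estimate P(Toxic | label) as the share of rows with `label` that are also toxic."""
    idx = LABELS.index(label)
    with_label = [row for row in label_rows if row[idx] == 1]
    if not with_label:
        return None  # the label never occurs, so the estimate is undefined
    return sum(row[0] for row in with_label) / len(with_label)


label_rows = [
    (1, 0, 0, 1, 0, 0),  # sample record 1
    (1, 0, 0, 0, 1, 1),  # sample record 2
    (1, 1, 0, 1, 1, 0),  # sample record 3
    (0, 0, 0, 0, 1, 0),  # hypothetical: an insult not marked as toxic
]

for label in LABELS[1:]:
    print(label, p_toxic_given(label_rows, label))
```

On the full training data the same calculation would give the percentages listed on the slide.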
Revised Approach: Flow Chart
2-level Classification Approach
1. Down Sampling
2. Feature Engineering
3. 1st Level: Binary Classifier (Toxic vs. Clean)
4. Pass only comments that are predicted as toxic
5. 2nd Level: Multi-label Classifier (Severe Toxic, Threat, Obscene, Insult, Identity Hate)
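The routing between the two levels can be sketched as below. This shows the control flow only: the two classifiers are passed in as plain functions, and the keyword-based stand-ins (`toy_is_toxic`, `toy_sub_labels`) are hypothetical placeholders, not the trained models from the project.

```python
import re


def two_level_predict(comments, is_toxic, sub_labels_of):
    """Run the 2nd-level classifier only on comments the 1st level flags as toxic."""
    results = []
    for comment in comments:
        if is_toxic(comment):
            # 2nd level: assign the remaining labels to a predicted-toxic comment.
            results.append({"toxic"} | set(sub_labels_of(comment)))
        else:
            # Predicted clean: no labels, and the 2nd level is never called.
            results.append(set())
    return results


def _tokens(comment):
    return set(re.findall(r"[a-z]+", comment.lower()))


def toy_is_toxic(comment):
    """Hypothetical stand-in for the binary classifier."""
    return bool(_tokens(comment) & {"die", "kill", "idiots"})


def toy_sub_labels(comment):
    """Hypothetical stand-in for the multi-label classifier."""
    tokens = _tokens(comment)
    labels = set()
    if tokens & {"die", "kill"}:
        labels.add("threat")
    if "idiots" in tokens:
        labels.add("insult")
    return labels


comments = [
    "Hi! I am back again! Last warning! Stop undoing my edits or die!",
    "Thanks for fixing the citation.",
]
predictions = two_level_predict(comments, toy_is_toxic, toy_sub_labels)
print(predictions)
```

One consequence of this design is that a toxic comment missed at the 1st level can never receive a 2nd-level label, so 1st-level recall limits the recall of the whole pipeline.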
Revised Approach: Flow Chart
2-level Classification Approach: evaluation metrics at each level
• 1st Level, Binary Classifier (Toxic vs. Clean): Accuracy, Precision, Recall, F1-score
• 2nd Level, Multi-label Classifier (Severe Toxic, Threat, Obscene, Insult, Identity Hate): Average Accuracy, Micro-Precision, Micro-Recall, Micro-F1-score
• Both levels combined: Overall Accuracy
Revised Approach:
Feature Engineering
Word-level Features
Pre-processing Steps:
• Retained case structure (lower case, upper case)
• Removed URLs, HTML tags, stop words
• Removed punctuation, except for “!”
• Lemmatization
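A rough sketch of the pre-processing steps using only the standard library. Lemmatization is left out because it needs an NLP library, the `STOP_WORDS` set is a tiny hypothetical list, and `preprocess` is an illustrative name rather than the project's code.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "or", "my", "i", "am"}


def preprocess(comment):
    """Return tokens with case kept, '!' kept, and URLs/HTML/punctuation removed."""
    text = re.sub(r"<[^>]+>", " ", comment)             # strip HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # strip URLs
    text = re.sub(r"[^\w\s!]|_", " ", text)             # strip punctuation except "!"
    text = text.replace("!", " ! ")                     # make each "!" its own token
    return [tok for tok in text.split() if tok.lower() not in STOP_WORDS]


example = "<b>Hi!</b> I am back again! See http://example.com, STOP undoing my edits or die!"
tokens = preprocess(example)
print(tokens)
```

Keeping case and exclamation marks preserves signals such as shouting ("STOP") that lower-casing and full punctuation removal would discard.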
Vectorization Methods:
• Bag of Words (BoW)
• Term Frequency Inverse Document Frequency (TF-IDF)
• Binary Representation of each Word
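The three vectorization methods differ only in the value stored for each word. The sketch below builds all three for a toy corpus; it assumes the plain TF-IDF variant tf × ln(N / df), which may differ from the smoothed weighting a library would apply, and `vectorize` is a hypothetical helper.

```python
import math
from collections import Counter


def vectorize(docs):
    """Return (vocab, bow, binary, tfidf) for a list of tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))

    bow, binary, tfidf = [], [], []
    for doc in docs:
        counts = Counter(doc)
        bow.append([counts[t] for t in vocab])                  # raw counts
        binary.append([1 if counts[t] else 0 for t in vocab])   # presence only
        tfidf.append([counts[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, bow, binary, tfidf


docs = [["stop", "stop", "edits"], ["thanks", "edits"]]
vocab, bow, binary, tfidf = vectorize(docs)
print(vocab)
print(bow[0], binary[0], tfidf[0])
```

A word that appears in every document ("edits" here) gets a TF-IDF weight of zero under this variant, while a repeated, document-specific word ("stop") is weighted up.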
Word Embeddings:
• Pre-trained GloVe word embeddings
Comment-level Features (figures: total length; number of unique words)
• Word count
• Character count
• Word density (Word count / Character count)
• Total length
• Capitals
• Proportion of capitals
• Number of exclamation marks
• Number of unique words
• Proportion of unique words
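Most of these comment-level features are simple counts and ratios. The sketch below is one possible reading: the slides list both "Character count" and "Total length" without defining the difference, so only a character count is computed, and the choice of denominators for the two proportions is an assumption. `comment_features` is a hypothetical name.

```python
def comment_features(comment):
    """Compute simple count and ratio features for one comment."""
    words = comment.split()
    word_count = len(words)
    char_count = len(comment)
    capitals = sum(1 for ch in comment if ch.isupper())
    unique_words = len({w.lower() for w in words})
    return {
        "word_count": word_count,
        "char_count": char_count,
        # Word density as defined on the slide: word count / character count.
        "word_density": word_count / char_count if char_count else 0.0,
        "capitals": capitals,
        # Assumption: proportion of capitals is taken over all characters.
        "capitals_ratio": capitals / char_count if char_count else 0.0,
        "exclamation_marks": comment.count("!"),
        "unique_words": unique_words,
        # Assumption: proportion of unique words is taken over all words.
        "unique_ratio": unique_words / word_count if word_count else 0.0,
    }


features = comment_features("IM GOING TO KILL YOU ALL!!")
for name, value in features.items():
    print(name, value)
```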
Revised Approach: Model Results
Best Performing Model: Embeddings + CNN
Model Selected: TF-IDF + SVM
Reasons:
• Less complex model at a cost of < 0.5% reduction in
f1-score.
• Simpler features (TF-IDF vs. Embeddings)
• Greater interpretability of model’s results as feature
importance can easily be extracted with SVM unlike
NN/CNN
1st level Binary Classification
Level 1: Binary Label
Model | Accuracy | Precision | Recall | F1-Score
BoW + Naïve Bayes | 87.53% | 87.76% | 87.53% | 87.59%
TF-IDF + SVM | 88.74% | 88.70% | 88.74% | 88.70%
Binary + Logistic Regression | 88.35% | 88.31% | 88.35% | 88.30%
TF-IDF + NN | 88.86% | 88.81% | 88.86% | 88.81%
TF-IDF + CNN | 88.06% | 88.01% | 88.06% | 88.02%
Embeddings + CNN | 89.20% | 89.20% | 89.10% | 89.10%
Comparing Between Initial and Final Approach:
• Reduction in False Negatives (predicted as Clean when actually Toxic)
• Improvement in True Positives (predicted as Toxic when actually Toxic)
Confusion matrices (rows: Actual, columns: Predicted; 0 = Clean, 1 = Toxic):
Initial Approach: Embeddings + Logistic Regression
Actual | Predicted 0 | Predicted 1 | Total
0 | 9203 | 8 | 9211
1 | 777 | 204 | 981
Final Approach: TF-IDF + SVM
Actual | Predicted 0 | Predicted 1 | Total
0 | 6173 | 556 | 6729
1 | 680 | 3565 | 4245
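The effect of the revised approach can be checked by recomputing accuracy and recall from the two confusion matrices on this slide. Assigning the 9203 / 8 / 777 / 204 matrix to the initial approach is an inference: its roughly 90% clean share matches the original data, and the other matrix reproduces the 88.74% accuracy reported for TF-IDF + SVM. `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision and recall for the positive (Toxic) class."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Counts read from the slide (rows: actual, columns: predicted).
initial = binary_metrics(tn=9203, fp=8, fn=777, tp=204)   # assumed: initial approach
final = binary_metrics(tn=6173, fp=556, fn=680, tp=3565)  # assumed: final approach

for name, scores in (("initial", initial), ("final", final)):
    print(f"{name}: accuracy={scores['accuracy']:.2%}, recall={scores['recall']:.2%}")
```

The initial model has the higher accuracy (about 92%) yet finds only about a fifth of the toxic comments, whereas the final model gives up a few points of accuracy to reach a recall of roughly 84%.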
Revised Approach: Model Results
2nd level Multi-label Classification
Best Performing Model: Embeddings + CNN
Model Selected: Binary + Logistic Regression
Reasons:
• Less complex model at a cost of < 1% reduction in micro
f1-score
• Simpler features (Binary representation vs. Embeddings)
• Greater interpretability of model’s results as feature
importance can easily be extracted with logistic
regression.
Level 2: Multi-Label
Model | Average Accuracy | Micro-Precision | Micro-Recall | Micro-F1-Score
BoW + Naïve Bayes | 82.76% | 63.15% | 72.14% | 67.35%
TF-IDF + SVM | 87.11% | 78.21% | 66.13% | 71.66%
Binary + Logistic Regression | 87.33% | 77.29% | 68.79% | 72.79%
TF-IDF + NN | 85.38% | 74.53% | 61.78% | 67.56%
TF-IDF + CNN | 86.21% | 73.99% | 67.95% | 70.84%
Embeddings + CNN | 86.90% | 72.70% | 74.70% | 73.40%
Revised Approach: Model Results
2nd level Multi-label Classification
Comments:
• Multi-label classification produced low precision and recall for certain labels because they have significantly fewer records.
• For 3 labels (Severe Toxic, Threat, Identity Hate), False Negatives are high, i.e. comments are predicted as not having the label when they are actually Severe Toxic / Threat / Identity Hate.
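Label | Class | Precision | Recall | F1-score | No. of Records
Severe Toxic | 1 | 0.59 | 0.23 | 0.33 | 396
Threat | 1 | 0.59 | 0.35 | 0.44 | 117
Identity Hate | 1 | 0.67 | 0.44 | 0.53 | 383
Insult | 1 | 0.75 | 0.72 | 0.73 | 2118
Obscene | 1 | 0.83 | 0.80 | 0.81 | 2256
Confusion matrices (rows: Actual, columns: Predicted):
Severe Toxic
Actual | Predicted 0 | Predicted 1 | Total
0 | 3818 | 63 | 3881
1 | 306 | 90 | 396
Threat
Actual | Predicted 0 | Predicted 1 | Total
0 | 4131 | 29 | 4160
1 | 76 | 41 | 117
Identity Hate
Actual | Predicted 0 | Predicted 1 | Total
0 | 3811 | 83 | 3894
1 | 213 | 170 | 383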
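As a check, the Severe Toxic row of the table can be recomputed from its confusion matrix (TP = 90, FP = 63, FN = 306) with the standard per-label definitions. The symbols P, R and F_1 are shorthand introduced here for precision, recall and F1-score.

\[
P = \frac{TP}{TP + FP} = \frac{90}{90 + 63} \approx 0.59, \qquad
R = \frac{TP}{TP + FN} = \frac{90}{90 + 306} \approx 0.23, \qquad
F_1 = \frac{2PR}{P + R} = \frac{2 \cdot 90}{2 \cdot 90 + 63 + 306} \approx 0.33
\]

The low recall comes from the 306 Severe Toxic comments that were predicted as 0, against only 90 that were caught.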
Revised Approach: Final Model
Selection of Final Model:
1st level: TF-IDF + SVM
2nd level: Binary + Logistic Regression
Overall Accuracy on Test Results (levels 1 and 2 combined):
Average Accuracy | Micro-Precision | Micro-Recall | Micro-F1-Score
95.71% | 86.40% | 83.25% | 84.79%
Results by level (L1 = Level 1: Binary Label; L2 = Level 2: Multi-Label):
Model | L1 Accuracy | L1 Precision | L1 Recall | L1 F1-Score | L2 Average Accuracy | L2 Micro-Precision | L2 Micro-Recall | L2 Micro-F1-Score
BoW + Naïve Bayes | 87.53% | 87.76% | 87.53% | 87.59% | 82.76% | 63.15% | 72.14% | 67.35%
TF-IDF + SVM | 88.74% | 88.70% | 88.74% | 88.70% | 87.11% | 78.21% | 66.13% | 71.66%
Binary + Logistic Regression | 88.35% | 88.31% | 88.35% | 88.30% | 87.33% | 77.29% | 68.79% | 72.79%
TF-IDF + NN | 88.86% | 88.81% | 88.86% | 88.81% | 85.38% | 74.53% | 61.78% | 67.56%
TF-IDF + CNN | 88.06% | 88.01% | 88.06% | 88.02% | 86.21% | 73.99% | 67.95% | 70.84%
Embeddings + CNN | 89.20% | 89.20% | 89.10% | 89.10% | 86.90% | 72.70% | 74.70% | 73.40%
Revised Approach: Feature Importance
[Figures: feature importance charts for each label: Toxic, Severe Toxic, Obscene, Identity Hate, Threat, Insult]
Revised Approach: Examples
Comment | True Labels | Predicted Labels
Hello, you disgusting transsexual. You are a grotesque blight upon the earth. Please kill yourself. | Toxic, Obscene, Threat, Identity Hate, Insult | Toxic, Obscene, Threat, Identity Hate, Insult
Wtf, Why You deleting Maashel? Bad things will happen to you now. | Toxic, Threat | Toxic
am going to shoot you in the head and laugh as your brains splatter onto the ground. | Toxic, Severe Toxic, Threat | Toxic, Threat
Thank You
