ashu ppt final.pptx

Computer Engineering Department
V.V.P Engineering College
Made By: Jejani Yasmin (170470702004)
Guided By: Prof. Maulik Dhamecha
Co-Guided By: Prof. Sagar Virani

Introduction
Motivation
Challenges
Literature Survey
Objective
Problem Statement
Proposed System
Conclusion
References

What Is Hate Speech?
• Speech that attacks a person or group on the basis of attributes such as race, religion, ethnic origin, national origin, sex, disability, sexual orientation, or gender identity.
• It is intended to insult, offend, or intimidate a person or group.
• Hate speech is a crime in many jurisdictions.
• Social networking sites promote free speech, not hate speech.

• Hate speech is a worldwide problem.
• The internet has spread and social networking sites have grown rapidly.
• Online social networks provide anonymity.
• Hate speech also affects businesses and can cause serious real-life conflicts, including murder and suicide.
• Detecting it helps keep social media a viable medium of communication.

• No single word in a statement clearly marks it as hate speech.
• Online social networks are full of ironic and joking content that may sound offensive but in reality is not.
• Detecting hate speech is especially challenging at election time.
• The proposed system must provide better accuracy.
• Example: "I hate seeing them losing every time, it's just unfair."
• Example: "If we want the opinion of a woman, we'll ask you dear... for now keep quiet" [3].

• A clustering algorithm divides data into groups according to their similarity [6].
• Classification is a data mining function that assigns the items in a collection to target classes [7].
• Hate speech is detected by classifying text with machine learning.
• Algorithms (the usual suspects):
Decision tree,
Naïve Bayes,
Random forest,
Support vector machine
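As an illustration of how one of the listed classifiers assigns a tweet to a target class, here is a minimal multinomial naïve Bayes sketch in plain Python. The toy tweets, the labels, and the function names `train_naive_bayes` and `predict` are assumptions made for this example; they do not come from the cited papers or from the proposed system.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Count documents per class and words per class (multinomial model)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, tokens):
    """Return the class with the highest log-posterior, using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                count = word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set, already tokenized.
train_docs = [
    "you people are disgusting trash".split(),
    "go back home you filthy trash".split(),
    "what a lovely sunny day".split(),
    "lovely match today great team".split(),
]
train_labels = ["hate", "hate", "not_hate", "not_hate"]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, "filthy trash".split()))
print(predict(model, "lovely day".split()))
```

A real system would train on a labelled tweet corpus and would typically use a library implementation rather than this hand-written version.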

• Hybrid data mining algorithms logically combine multiple existing techniques to improve performance and give better results [11].
• The hybrid approach uses clustering together with classification to classify hate speech and improve classification accuracy [4].
• The proposed system modifies the hybrid approach so that clustering refines the data before classification, to improve accuracy.

Title: Automated Cyberbullying Detection using Clustering Appearance Patterns [2]
Authors & Journal: Walisa Romsaiyud, Kodchakorn na Nakornphanom, Pimpaka Prasertsilp, Piyaporn Nurarak, Pirom Konglerd (IEEE, 2017)
Literature: The algorithm includes two main methods:
• partitioning the entire dataset into clusters
• describing each partition by word frequencies in a multinomial-model feature vector, and using the probability of words occurring in a document to predict the eight classes.
Remarks: Future work should study how to improve computation time and cost on different data types from many data sets.
Reference: Walisa Romsaiyud, Kodchakorn na Nakornphanom, Pimpaka Prasertsilp, Piyaporn Nurarak, Pirom Konglerd (2017 IEEE) “Automated Cyberbullying Detection using Clustering Appearance Patterns”

Title: Hate Speech Detection in the Indonesian Language: A Dataset and Preliminary Study [1]
Authors & Journal: Ika Alfina, Rio Mulia, Mohamad Ivan Fanany, and Yudo Ekanata (IEEE, 2017)
Literature:
• Feature extraction uses word n-grams (n=1, n=2), character n-grams (n=3, n=4), and negative sentiment.
• Classification is performed with naïve Bayes, SVM, Bayesian logistic regression, and random forest decision tree.
Remarks:
• An F-measure of 93.5% was achieved with the word n-gram feature and random forest decision tree.
• Results also show that the word n-gram feature outperformed the character n-gram feature.
Reference: Ika Alfina, Rio Mulia, Mohamad Ivan Fanany, and Yudo Ekanata (ICACSIS 2017) “Hate Speech Detection in the Indonesian Language: A Dataset and Preliminary Study”
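The word and character n-gram features mentioned in this study can be illustrated with a short plain-Python sketch. Whitespace tokenization, lowercasing, and the helper names `word_ngrams` and `char_ngrams` are assumptions made for this example; the paper's own preprocessing may differ.

```python
def word_ngrams(text, n):
    """Return the word n-grams of a text, using simple whitespace tokenization."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return the character n-grams of a text (spaces are kept as characters)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sample = "keep quiet dear"
unigrams = word_ngrams(sample, 1)    # word n-grams with n=1
bigrams = word_ngrams(sample, 2)     # word n-grams with n=2
trigrams = char_ngrams("hate", 3)    # character n-grams with n=3
fourgrams = char_ngrams("hate", 4)   # character n-grams with n=4

print(unigrams, bigrams, trigrams, fourgrams)
```

Counting how often each n-gram occurs in a tweet turns these lists into the feature vectors that a classifier consumes.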

Title: Hate Speech on Twitter: A Pragmatic Approach to Collect Hateful and Offensive Expressions and Perform Hate Speech Detection [3]
Authors & Journal: Hajime Watanabe, Mondher Bouazizi, and Tomoaki Ohtsuki (IEEE, 2018)
Literature:
• The approach is based on unigrams and patterns that are automatically collected from the training set. These patterns and unigrams are then used, among other features, to train a machine learning algorithm.
• Binary and ternary classification reach accuracies of 87.4% and 78.4%, respectively.
Remarks: Results show that J48 outperforms SVM.
Reference: Hajime Watanabe, Mondher Bouazizi, and Tomoaki Ohtsuki (2018 IEEE) “Hate Speech on Twitter: A Pragmatic Approach to Collect Hateful and Offensive Expressions and Perform Hate Speech Detection”

Title: Combining Classification and Clustering for Tweet Sentiment Analysis [4]
Authors & Journal: Luiz F. S. Coletta, Nádia F. F. da Silva, Eduardo R. Hruschka, Estevam R. Hruschka Jr. (IEEE Brazilian Conference on Intelligent Systems, 2014)
Literature:
• SVM classifiers are combined with a cluster ensemble.
• Similar instances in the same cluster are more likely to share the same class.
• Results are better than when using SVM alone.
Remarks:
• How to find “good” data partitions to compose a cluster ensemble deserves further attention.
Reference: Luiz F. S. Coletta, Nádia F. F. da Silva, Eduardo R. Hruschka, Estevam R. Hruschka Jr. (2014 IEEE Brazilian Conference on Intelligent Systems) “Combining Classification and Clustering for Tweet Sentiment Analysis”

Title: Combining Clustering with Classification: A Technique to Improve Classification Accuracy [5]
Authors & Journal: Yaswanth Kumar Alapati et al. (International Journal of Computer Science Engineering (IJCSE), 2016)
Literature:
• Clustering is applied prior to classification on real-life data.
• K-means and hierarchical clustering are used for clustering, and naïve Bayes and a neural network for classification; the accuracy of all combinations is compared.
Remarks: Results show that clustering prior to classification gives better accuracy.
Reference: Yaswanth Kumar Alapati et al. (International Journal of Computer Science Engineering (IJCSE)) “Combining Clustering with Classification: A Technique to Improve Classification Accuracy”

Title: Automated Cyberbullying Detection using Clustering Appearance Patterns [2]
Algorithm: K-means and naïve Bayes.
Result: The main objective of the paper is to partition abusive messages from big data streams using K-means clustering and naïve Bayes.

Title: An Improved Malicious Behavior Detection Via k-Means and Decision Tree [12]
Algorithm: K-means and decision tree, used to detect malicious behaviour.
Result: KMDT detected malicious behaviours more accurately than the individual methods and other combined methods.

Title: Combining Classification and Clustering for Tweet Sentiment Analysis [4]
Algorithm: SVM with k-medoids.
Result: An SVM classifier combined with cluster ensembles can offer better accuracy than a stand-alone SVM. The algorithm is named C^3E-SE, and the clustering algorithm used is k-medoids.

Title: Improving Classification in Data mining using Hybrid algorithm [11]
Algorithm: K-means and decision tree, combined as the hybrid algorithm.
Result: This approach solves the problem of burdening the decision tree with large datasets by dividing the data samples into clusters.

Title: Classification using Latent Dirichlet Allocation with Naive Bayes Classifier to detect Cyber Bullying in Twitter [13]
Algorithm: LDA and naïve Bayes.
Result: LDA is used to identify key terms that serve as the feature vector, and it provides better accuracy with naïve Bayes.

• The proposed approach combines clustering and classification to detect hate speech with better accuracy.
• The proposed system modifies the hybrid algorithm so that only the data that require it go on to the classification stage.
• Find the strong swear words that clearly mark hate speech. A Hatebase dictionary is also available on data.world; I modified it to keep only the swear words.

• Use clustering and classification techniques to detect hate speech.
• Use clustering as a refinement step for classification.
• Implement the hybrid approach to detect hate speech with better results.

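[Figure: flow of the proposed system. Input (tweets) → Pre-processing → Feature extraction → Clustering. Clustering yields three groups: high-swear-word tweets (labelled hate speech), extremely positive tweets (labelled not hate speech), and other tweets, which go to Classification and are labelled hate speech or not hate speech. Finally, accuracy and precision are compared.]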

Step 1: Collect the Twitter data.
Step 2: Preprocess the tweets (remove URLs, tokenize, lemmatize; for example, 'caring' is lemmatized to 'care').
Step 3: Feature extraction:
• Bag of words
• N-grams
• Sentiment-based features using positive and negative lexicons
• Swear-word lexicons

Step 4: Apply clustering to the entire data set to partition it into clusters:
Cluster 1: tweets with many swear words, which clearly mark them as hate speech.
Cluster 2: tweets containing only positive words, which are classified as non-hate speech.
Cluster 3: the remaining tweets.
Step 5: Perform classification on cluster 3.
Step 6: Output hate speech or non-hate speech.
Step 7: Compare accuracy and precision.
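The steps above can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not the author's implementation: the lexicons and the lemma table are tiny hypothetical stand-ins, a threshold rule stands in for the clustering step, and a simple lexicon comparison stands in for the trained classifier.

```python
import re

# Hypothetical toy lexicons and lemma table (assumptions for this example only).
SWEAR_WORDS = {"scum", "trash", "vermin"}
POSITIVE_WORDS = {"love", "great", "happy", "care"}
NEGATIVE_WORDS = {"hate", "unfair", "awful"}
LEMMAS = {"caring": "care", "loved": "love", "hated": "hate"}

URL_RE = re.compile(r"https?://\S+")


def preprocess(tweet):
    """Step 2: remove URLs, tokenize, and lemmatize with a lookup table."""
    text = URL_RE.sub(" ", tweet.lower())
    tokens = re.findall(r"[a-z']+", text)
    return [LEMMAS.get(tok, tok) for tok in tokens]


def extract_features(tokens):
    """Step 3: count lexicon hits (swear, positive, and negative words)."""
    return {
        "swear": sum(t in SWEAR_WORDS for t in tokens),
        "positive": sum(t in POSITIVE_WORDS for t in tokens),
        "negative": sum(t in NEGATIVE_WORDS for t in tokens),
    }


def assign_cluster(feats, swear_threshold=2):
    """Step 4 stand-in: a threshold rule used here in place of real clustering."""
    if feats["swear"] >= swear_threshold:
        return 1  # many swear words
    if feats["positive"] > 0 and feats["swear"] == 0 and feats["negative"] == 0:
        return 2  # only positive words
    return 3      # everything else


def classify_remaining(feats):
    """Step 5 stand-in: a placeholder rule where a trained classifier would go."""
    if feats["swear"] + feats["negative"] > feats["positive"]:
        return "hate"
    return "not_hate"


def detect(tweet):
    """Steps 2 to 6: only cluster 3 tweets reach the classification stage."""
    feats = extract_features(preprocess(tweet))
    cluster = assign_cluster(feats)
    if cluster == 1:
        return "hate"
    if cluster == 2:
        return "not_hate"
    return classify_remaining(feats)


def accuracy_and_precision(predicted, actual, positive="hate"):
    """Step 7: accuracy over all tweets, precision for the positive class."""
    pairs = list(zip(predicted, actual))
    correct = sum(p == a for p, a in pairs)
    true_pos = sum(p == positive and a == positive for p, a in pairs)
    pred_pos = sum(p == positive for p in predicted)
    precision = true_pos / pred_pos if pred_pos else 0.0
    return correct / len(actual), precision


# Hypothetical labelled tweets; the third one is the ambiguous case from the Challenges slide.
tweets = [
    "They are scum and vermin http://t.co/x",
    "Caring people make me happy",
    "I hate seeing them losing every time, it's just unfair",
    "Those people are trash",
]
actual = ["hate", "not_hate", "not_hate", "hate"]

predicted = [detect(t) for t in tweets]
accuracy, precision = accuracy_and_precision(predicted, actual)
print(predicted, accuracy, precision)
```

On this toy data the third tweet is wrongly flagged as hate speech, which is exactly the kind of ambiguous case that a trained classifier on cluster 3 is meant to handle better than a lexicon rule.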

Criterion | Naive Bayes | K-Nearest Neighbor | Decision Trees
Accuracy in general | Average | Good | Good
Speed of learning | Excellent | Excellent | V. good
Speed of classification | Excellent | Average | Excellent
Tolerance to missing values | Excellent | Average | V. good
Tolerance to irrelevant attributes | Good | Good | V. good
Tolerance to noise | V. good | Average | Good
Attempts for incremental learning | Excellent | Excellent | Good
Explanation ability / transparency of knowledge / classification | Excellent | Good | Excellent
Support for multi-class classification | Naturally extended | Excellent | Excellent

• According to [9], random forest provides all the benefits of a decision tree, gives better results on large data sets, helps avoid the overfitting problem, and also handles missing values in the dataset.
• It classifies more instances correctly than a decision tree [9].
• Random forests provide information about the importance of each variable and about the proximity of the data points to one another [8].

• Real-time Twitter data is very large, so we need a clustering algorithm that gives good results on large data sets.
• According to [6], a comparison of clustering algorithms shows that k-means is better for large data sets, while hierarchical clustering gives better results for small data sets.
• K-means is also one of the simplest algorithms, which is why it is chosen for clustering the data.
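To make the choice of k-means concrete, here is a minimal plain-Python sketch of the algorithm on two-dimensional points. The points are hypothetical (swear-word count, positive-word count) pairs, and the initial centroids are fixed by hand so the run is deterministic; a real system would use a library implementation and a proper initialization strategy.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, max_iters=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                mean = tuple(sum(dim) / len(members) for dim in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return assignments, centroids


# Hypothetical features per tweet: (swear-word count, positive-word count).
points = [(3, 0), (4, 0), (0, 3), (0, 2), (1, 1), (1, 0)]
initial = [(3, 0), (0, 3), (1, 1)]

assignments, centroids = kmeans(points, initial)
print(assignments)
print(centroids)
```

With these inputs the tweets separate into a swear-heavy group, a positive group, and a mixed group, which mirrors the three clusters used by the proposed system.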

• Hate speech detection here combines clustering with classification.
• Clustering is used to refine the data for classification.
• The hybrid approach is used to provide better accuracy in hate speech detection.

[1] Ika Alfina, Rio Mulia, Mohamad Ivan Fanany, Yudo Ekanata (ICACSIS 2017) “Hate Speech Detection in the Indonesian Language: A Dataset and Preliminary Study”
[2] Walisa Romsaiyud, Kodchakorn na Nakornphanom, Pimpaka Prasertsilp, Piyaporn Nurarak, Pirom Konglerd (2017 IEEE) “Automated Cyberbullying Detection using Clustering Appearance Patterns”
[3] Hajime Watanabe, Mondher Bouazizi, Tomoaki Ohtsuki (2018 IEEE Access) “Hate Speech on Twitter: A Pragmatic Approach to Collect Hateful and Offensive Expressions and Perform Hate Speech Detection”
[4] Luiz F. S. Coletta, Nádia F. F. da Silva, Eduardo R. Hruschka, Estevam R. Hruschka Jr. (2014 IEEE Brazilian Conference on Intelligent Systems) “Combining Classification and Clustering for Tweet Sentiment Analysis”
[5] Yaswanth Kumar Alapati et al. (International Journal of Computer Science Engineering (IJCSE)) “Combining Clustering with Classification: A Technique to Improve Classification Accuracy”

[6] Osama Abu Abbas (The International Arab Journal of Information Technology) “Comparisons Between Data Clustering Algorithms”
[7] Omkar Ardhapure, Gayatri Patil, Disha Udani, Kamlesh Jetha (IJRET) “Comparative Study of Classification Algorithm for Text Based Categorization”
[8] Prajwala T R (International Journal of Advanced Research in Computer and Communication Engineering, Vol. 4, Issue 1, January 2015) “A Comparative Study on Decision Tree and Random Forest Using R Tool”
[9] Jehad Ali, Rehanullah Khan, Nasir Ahmad, Imran Maqsood (IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 5, No 3, September 2012) “Random Forests and Decision Trees”
[10] Hamed Jelodar, Yongli Wang, Chi Yuan, Xia Feng “Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey”
[11] Akanksha Ahlawat, Bharti Suri (2016 IEEE) “Improving Classification in Data mining using Hybrid algorithm”

[12] Warusia Yassin, Siti Rahayu, Faizal Abdollah and Hazlin Zin ((IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 7, No. 12, 2016) “An Improved Malicious Behavior Detection Via k-Means and Decision Tree”
[13] K. Nalini and L. Jaba Sheela (Indian Journal of Science and Technology, Vol 9(28), DOI: 10.17485/ijst/2016/v9i28/93825, July 2016) “Classification using Latent Dirichlet Allocation with Naive Bayes Classifier to detect Cyber Bullying in Twitter”
[14] Niyati Aggrawal (Computer Reviews Journal 2018) “Detection of Offensive Tweets: A Comparative Study”
[15] Timothy Pratama, Ayu Purwarianti (2017 IEEE) “Topic Classification and Clustering on Indonesian Complaint Tweets for Bandung Government using Supervised and Unsupervised Learning”
[16] Paula Fortuna (INESC TEC), Sérgio Nunes (INESC TEC and Faculty of Engineering, University of Porto) (ACM Computing Surveys, July 2018) “A Survey on Automatic Detection of Hate Speech in Text”

[17] Naufal Riza Fatahillah, Pulut Suryati, Cosmas Haryawan (2017 International Conference on Sustainable Information Engineering and Technology (SIET)) “Implementation Of Naive Bayes Classifier Algorithm On Social Media (Twitter) To The Teaching Of Indonesian Hate Speech”
[18] Pete Burnap and Matthew L. Williams (Policy & Internet, 7:2) “Cyber Hate Speech on Twitter: An Application of Machine Classification and Statistical Modeling for Policy and Decision Making”
[19] Jiawei Han, Micheline Kamber, Jian Pei, “Data Mining: Concepts and Techniques”
