PACIS_Fake_Entity_Detection_552.pptx

ConvGRUText: A Deep Learning
Method for Fake Text Detection on
Online Social Media
Gaurav Sarin [1,2] & Pradeep Kumar [2]
[1] Delhi School of Business, VIPS-TC
[2] Indian Institute of Management, Lucknow
India

Presentation Outline
Introduction
Related Research
Research Problem and Objectives
Research Methodology
Proposed Solution
Experimental Setup and Results
Conclusions
Future Research Directions and Limitations
Q&As

Introduction
The issue of deception detection is tackled as linguistic one not dependent on the source utilized (Saquete, E. et
al., 2020)
Fake news detection continues to receive rising focus from research communities and industry professionals
(Alkhodair, S. A. et al., 2019)
Major transmission of fake news has negative impact on both individuals and society (Zhang, X. et al., 2019)
Uncovering fake reviews on social media is decisive and technically challenging
Human eye cannot precisely differentiate between true and false text, so need for automated system

Related Research
Amazon, continues to sue individuals and companies, which offer to post untruthful reviews on behalf of sellers
as a paid service (Kieler et al., 2016)
Issue with applying classification methods to detect fake text is dearth of ground truth datasets (Aghakhani et al.,
2015)
Rubin et al., 2015 stated that there are different kinds of fake news, each with potential textual indicators
Ott et al., 2014 designed a multi-domain dataset by asking Turkers to pen down fake reviews
Qazvinian et al., 2011 utilized unigram, bigram and part-of-speech (PoS) features of tweets to detect rumour
Liu et al., 2006 stated that text classification is a construction problem of models that can classify new documents
into pre-defined classes

Related Research …
Ren et al., 2017 stated that Convolutional Neural Network (CNN) provides best representation for sentence and
Gated Recurrent Unit (GRU) provides best representation for document
As per Jurafsky et al., 2015 the sequential models like RNN and LSTM can be utilized for semantic composition
as well
Deep Learning methods CNN and RNN (LeCun et al., 2015), have garnered great success in image recognition
and NLP jobs (Yih et al., 2014), sentence modelling (Kalchbrenner et al. 2014), and other tasks (Collobert et al.,
2011)

Related Research …
The “Social Distance Theory” postulated by DePaulo et al., 1996 states that social non-acceptance of
deception and psychological negative effect to the deceivers leads to deceivers’ distancing themselves from
deceit and persons they attempt to deceive
The term “Deceptive Communication” factors in deliberately hiding the truth by either exclusion or authority
(Ekman and Friesen, 1969) and is a message consciously shared by sender to serve a false notion or conclusion
by the receiver. The “Interpersonal Deception Theory” (Buller et al., 1996) boosts this notion, justifying how
deceiver critically manages the communication attitude in reply to the feelings and mistrust of the receivers
The “Media Richness Theory” states that deceptive actors may most likely utilize communication mediums,
which give several clues, an option to customize and to share quick feedback (synchronous and immediate) –
aiding them to strengthen the non-committal message characteristic and regulate deceptive communication
strategy at runtime to blur their deceptive purpose (Daft et al., 1987)

Research Problem and Objectives
Research Problem
For a given text coming from different online sources such as news, reviews, posts, tweets or blogs, the task is to
figure out if it is a fake or real text using deep learning hybrid method in the online social media space
Research Objectives
To design Deep Neural Network (DNN) based hybrid model for fake text identification
To check whether DNN model outclasses machine learning models
Figure out best text representations methods and their impact on classification metrics

Research Methodology
Text pre-processing (stemming, removal of stop words, removal of punctuations, removal of numbers and
removal of special characters) done
Quantitative approach used in this work using three real-world datasets
A state-of-art novel deep learning model, ConvGRUText was designed

Proposed Solution (Process View)
Text Collection
Text pre-processing and
representation
Data Mining
Knowledge Discovery

Sentence
Collect text
from multiple
sources
Feature
Engineering
Data
Representation
Uni-Gram N-Gram Document
Term
Frequency
TermFrequency-
InverseDocumentFrequency Boolean
Validation
using ground
truth dataset
Proposed Solution (Model View)

ConvGRUText Network Diagram
Data Inputs CNN Layer 1
CNN Layer 2
[ReLU Activation]
GRU Layer 1
[ReLU Activation]
Text Classification (Fake or Real)
[Sigmoid Activation]

Experimental Setup and Results
Datasets – https://www.kaggle.com/jruvika/fake-news-detection, https://www.kaggle.com/rtatman/deceptive-
opinion-spam-corpus (M. Ott et al., 2014) and https://www.kaggle.com/c/nlp-getting-started
First one 4000 records, second one 1600 records and third one 7500 records. 2 attributes – News text and News
label (fake or real) or Review text and Review label (fake or real) or Tweet text and Tweet label (fake or real) used
in data analysis
Confusion matrix for fake news, reviews and tweets for different text representations created and compared
Different models and classification metrics for fake news, reviews and tweets compared
Receiver Operating Characteristics curves created for fake news, reviews and tweets

Results (News TF Models)
Metrics/
Models
Accuracy Recall Precision F1-Score Specificity AUC
Naïve Bayes (NB) 0.7466 0.8021 0.6070 0.6910 0.7162 0.7379
Decision Tree (DT) 0.8989 0.9058 0.8743 0.8897 0.8932 0.8974
Logistic Regression (LR) 0.9076 0.9360 0.8610 0.8969 0.8862 0.9047
Random Forest (RF) 0.9262 0.9070 0.9385 0.9224 0.9443 0.927
Support Vector Machine
(SVM)
0.9326 0.9545 0.8984 0.9256 0.9154 0.9305
Artificial Neural Network
(ANN)
0.9596 0.9648 0.9558 0.9602 0.9543 0.9597
ConvGRUText (Highest) 0.9761 0.9781 0.9701 0.9741 0.9744 0.9762

Results (News TF-IDF Models)
Metrics/
Models
NB 0.8826 0.8784 0.8690 0.8736 0.8863 0.8818
DT 0.9076 0.9167 0.8824 0.8992 0.9002 0.9060
LR 0.9201 0.9282 0.8984 0.9130 0.9134 0.9188
RF 0.9451 0.9365 0.9465 0.9414 0.9527 0.9452
SVM 0.9476 0.9611 0.9251 0.9427 0.9365 0.9462
ANN 0.962 0.9671 0.9581 0.9625 0.9567 0.9621

Results (News Boolean Models)
Metrics/
Models
NB 0.7481 0.7893 0.6293 0.7 0.7237 0.7409
DT 0.8953 0.9008 0.872 0.8861 0.8907 0.8938
LR 0.9076 0.936 0.861 0.8969 0.8862 0.9047
RF 0.9262 0.907 0.9385 0.9224 0.9443 0.92
SVM 0.9326 0.9545 0.8984 0.9256 0.9154 0.9305
ANN 0.9561 0.9781 0.9349 0.956 0.935 0.9565

Results (Reviews TF Models)
Metrics/
Models
RF 0.7031 0.7152 0.675 0.6945 0.6923 0.7031
DT 0.7406 0.7692 0.6875 0.726 0.7175 0.7406
NB 0.7875 0.7614 0.8375 0.7976 0.8194 0.7875
ANN 0.7934 0.8011 0.8057 0.8033 0.7848 0.7928
LR 0.8281 0.8302 0.825 0.8275 0.8261 0.8281
SVM 0.8281 0.8344 0.8187 0.8264 0.8221 0.8281

Results (Reviews TF-IDF Models)
Metrics/
Models
DT 0.7031 0.7019 0.7063 0.704 0.7044 0.7031
RF 0.7531 0.7341 0.7938 0.7627 0.7755 0.7531
ANN 0.8054 0.8235 0.8 0.8115 0.7866 0.8057
NB 0.8094 0.8037 0.8188 0.8111 0.8153 0.8094
SVM 0.8188 0.811 0.8312 0.8209 0.8269 0.8188
LR 0.8344 0.8365 0.8313 0.8338 0.8323 0.8344

Results (Reviews Boolean Models)
Metrics/
Models
RF 0.7031 0.7152 0.675 0.6945 0.6923 0.7031
DT 0.7406 0.7692 0.6875 0.726 0.7175 0.7406
ANN 0.8084 0.8323 0.7943 0.8128 0.7844 0.8091
NB 0.7875 0.7614 0.8375 0.7976 0.8194 0.7875
SVM 0.8281 0.8344 0.8187 0.8264 0.8221 0.8281
LR 0.8281 0.8302 0.825 0.8275 0.8261 0.8281

Results (Tweets TF Models)
Metrics/
Models
DT 0.6767 0.8063 0.3187 0.4568 0.6504 0.6309
SVM 0.714 0.723 0.5344 0.6145 0.7098 0.691
ANN 0.7176 0.7106 0.5478 0.6186 0.721 0.6937
NB 0.7073 0.6798 0.5938 0.6338 0.7273 0.6928
LR 0.722 0.7165 0.5766 0.6389 0.7249 0.7034
RF 0.724 0.7018 0.6141 0.655 0.7372 0.7099

Results (Tweets TF-IDF Models)
Metrics/
Models
DT 0.6953 0.7798 0.3984 0.5273 0.6718 0.6574
ANN 0.7151 0.7126 0.5141 0.5972 0.7163 0.6896
RF 0.7133 0.6923 0.5906 0.6374 0.7254 0.6976
NB 0.6353 0.5525 0.7641 0.6412 0.7545 0.6518
SVM 0.7233 0.7176 0.5797 0.6413 0.7263 0.705
LR 0.7333 0.7308 0.5938 0.6552 0.7347 0.7155

Results (Tweets Boolean Models)
Metrics/
Models
DT 0.6873 0.7422 0.6704 0.4094 0.5277 0.6518
SVM 0.714 0.7230 0.7098 0.5344 0.6145 0.691
NB 0.694 0.6574 0.7186 0.5906 0.6222 0.6808
LR 0.722 0.7165 0.7249 0.5766 0.6389 0.7034
ANN 0.7291 0.7094 0.7397 0.5964 0.648 0.7104
RF 0.724 0.7018 0.7372 0.6141 0.655 0.7099

ROC Curves (News ConvGRUText a. TF b. TF-IDF c. Boolean)
a b
c

ROC Curves (Reviews ConvGRUText a. TF b. TF-IDF c. Boolean)
a
c
b

ROC Curves (Tweets ConvGRUText a. TF b. TF-IDF c. Boolean)
a b
c

News Dataset Comparison
Metrics/Models Accuracy Recall Precision F1-Score Specificity AUC
ConvGRUText TF 0.9761 0.9781 0.9701 0.9741 0.9744 0.9762
ConvGRUText TF-
IDF
0.9811 0.9918 0.968 0.9797 0.972 0.9819
ConvGRUText
Boolean
0.9899 0.9945 0.9837 0.9891 0.986 0.9902

Reviews Dataset Comparison
ConvGRUText TF 0.8375 0.822 0.8535 0.8375 0.8535 0.8377
ConvGRUText TF-
IDF
0.8562 0.8527 0.8633 0.858 0.8598 0.8563
ConvGRUText
Boolean
0.9593 0.8466 0.8734 0.8598 0.8726 0.8596

Tweets Dataset Comparison
ConvGRUText TF 0.77 0.7251 0.7354 0.7302 0.7953 0.7644
ConvGRUText TF-
IDF
0.7613 0.694 0.7351 0.714 0.7791 0.753
ConvGRUText
Boolean
0.772 0.6801 0.763 0.7192 0.7775 0.7606

Conclusions
Online businesses capitalize on machine learning models to find out if a text is fake or real. They can use the
proposed hybrid deep learning model for fake text identification
Regulation is required by governments to stop the advancement of fake texts. This will need appropriate public
policy changes
If fake texts are controlled then market fluctuations will be less, leading to a stable economic environment
Data Analytics is the enabler to solve this critical business problem of distinguishing between fake and real texts
with high predictive accuracy, hence assisting the policy makers, businesses, communities, consumers and
society

Future Research Directions and Limitations
Execute developed hybrid model on other domains to further test generalizability
Larger dataset to test scalability of model
Could have used word embeddings such as GloVe, FastText or Word2Vec for data representations

Questions and Answers (10 mins)

PACIS_Fake_Entity_Detection_552.pptx

PACIS_Fake_Entity_Detection_552.pptx

Recommended

Recommended

More Related Content

Featured

Featured (20)

PACIS_Fake_Entity_Detection_552.pptx