Fake News Detection on Social Media using Machine Learning

Dept. of CSE, IESCE, Chittilappilly
1. INTRODUCTION
Social media is one of the most accessible news sources for many people worldwide
because of its low cost, quick access and fast spread. However, this comes with
confusing signals and a significant risk of exposure to 'false stories' written to mislead
readers. Such information can sway public opinion and allow malicious groups to influence
the outcome of public events, such as elections. Fake and misleading news can have a real
impact on those who find themselves its targets. Examples of fake news include
information about the global COVID-19 pandemic (self-verification of infection, claims
that the virus spreads depending on temperature, and claims about vaccination);
unverified statements by political figures in public addresses, for instance about
military invasions or about development and public works; false and misleading images
used to malign or praise people; and manipulated video and audio.
Fake news today takes many forms, from satirical articles to fabricated stories and
planned government propaganda in some outlets. Fake news and the lack of trust in the
media are growing problems with huge ramifications for our society. A purposely
misleading story is obviously "fake news", but social media discourse has lately been
changing the definition: some people now use the term to dismiss facts that run counter to
their preferred viewpoints. One individual's view becomes information for others, who
build their picture of the world on that biased and unverified information. As information
of this kind grows, society comes to run on false ideas. Individuals rarely verify such
information, as they are busy in their own personal and virtual worlds. A society built
on false and biased ideas is a ticking bomb, ready to burst whenever a new idea threatens
the dominance of the existing one, which is good neither for the individual nor for
society.
However, in order to solve the problem, it is necessary to understand what fake news is
and how techniques from machine learning and natural language processing help us to
detect it.

2. LITERATURE REVIEW
Monther Aldwairi and Ali Alwahedi [1] observe that fake news and clickbait interfere
with a user's ability to discern useful information from Internet services, especially when
news becomes critical for decision making. Given the changing landscape of the modern
business world, fake news has become more than just a marketing problem, and it warrants
serious effort from security researchers. Any attempt to manipulate or troll the Internet
through fake news or clickbait must be countered effectively. The authors propose a
simple but effective approach that lets users install a simple tool in their personal
browser and use it to detect and filter out potential clickbait. Preliminary experiments
conducted to assess the method showed outstanding performance in identifying possible
sources of fake news. Since that work began, a few fake news datasets have become
available, and the authors are expanding their approach to test its effectiveness against
the new datasets.
Xinyi Zhou and Reza Zafarani [2] stress the importance of multidisciplinary fake news
research, reviewing and organizing fake news detection studies from multiple perspectives,
including the news content and the medium on which the news spreads. The rate of
detection, i.e., the response time for deciding whether the news is real or fake, was
measured to be very slow. They detail fact extraction, knowledge base (KB) or knowledge
graph (KG) construction, and fact checking. Several open issues and potential research
tasks remain. First, when collecting facts to construct a KB (KG), one concern is the
sources from which the facts are extracted. In addition to traditional sources such as
Wikipedia, other sources, e.g., fact-checking websites that contain expert analysis and
justification for checked news content, might help provide high-quality domain knowledge.
However, such sources have rarely been considered in current research. As fake news
research is evolving, the authors accompany their survey with an online repository that
provides summaries and timely updates on research developments on fake news, including
tutorials, recent publications and methods, datasets and other related resources.
Srishti Agrawal and Vaishali Arora [3] take the key expressions of news items in a form
that needs to be verified. The filtered data is stored in a MongoDB database. A data
pre-processing unit is responsible for preparing the data for the further processing
that is required. Classification depends on the number of tweets, number of hashtags,
number of followers, verified-user status, sentiment score, number of retweets, and NLP
methods. Stance detection is used to examine the stance of the author, so three results
rather than two are expected: the stance can be Agreed, Neutral or Disagreed. Stance
detection is a psychological model and has many other applications. Once all the classes
have been considered, a news story can be classified as fake or genuine using
classification algorithms, and an authenticity score for the story is given. However, when
the detected stance is Neutral, the story is reported as neither true nor false. The
result is then itself confusing, since the user cannot tell whether to trust the story,
which defeats the very purpose of building the program.
H. Parveen Sultana and Srijan Malhotra [4] report results that are not satisfactory across
a variety of news. The results show that the SVM and logistic regression classifiers have
the best performance on their dataset, with SVM performing slightly better than logistic
regression. The same can be seen from the F1 scores. The training data is largely based
on US politics and economic news, so in their test cases news statements related to US
politics were classified correctly and fake news was detected, whereas test cases with
news related to technology were predicted wrongly. The biggest drawback of this problem
is that the data is erratic, which means that any type of prediction model can have
anomalies and can make mistakes. For future improvement, concepts like POS tagging,
word2vec and topic modelling can be utilized. These would give the model much more depth
in terms of feature extraction and fine-tuned classification.
Rajendra Chatse and Pradeep Kumar Kale [5] note that the process of their project was
tedious and not easy even for an expert. The system proceeds through login and
registration, Twitter data scraping, conversion of the Twitter data to CSV, application
of NLP, and an algorithm that predicts positive, negative and neutral sentiment, followed
by fake news detection. The paper describes a simple fake news detection method based on
one machine learning algorithm, the naive Bayes classifier. The goal of the research is
to examine how naive Bayes works for this particular problem, given a manually labeled
news dataset, and to support the idea of using artificial intelligence for fake news
detection. Further, the technique can be applied to social platforms like Facebook and
Twitter by adding recent news and enhancing the fake news detection system. The main
drawback is that the stored dataset had to be labeled manually, which is time consuming
and not convenient for large datasets. The difference between this paper and other papers
on similar topics is that the naive Bayes classifier was used specifically for fake news
detection. The authors tested the difference in accuracy obtained with articles of
different lengths, and introduced web scraping, which gives an insight into how the
dataset can be updated on a regular basis to check the truthfulness of recently updated
Facebook posts.
Shruthy S Shetty and KB Shreejith [6] note that fake news detection on social media
has recently become an emerging research area that is attracting attention. Fake news is
generated on purpose to mislead readers into believing false information, which makes it
difficult and non-trivial to detect based on content. Fake news on social media has been
occurring for several years; however, there is no agreed definition of the term "fake
news". To better guide future directions of fake news detection research, appropriate
classifications are necessary. Social media has proved to be a powerful source for
spreading fake news, and it is important to utilize some of the emerging patterns for
fake news detection on social media. The main drawback here is the SVM algorithm, which
is not suitable for large datasets. SVM does not perform very well when the dataset is
noisy, i.e., when the target classes overlap. In cases where the number of features for
each data point exceeds the number of training samples, the SVM will underperform.
Nerissa Pereira and Sirman Dabreo [7] present a model for fake news detection using a
variety of machine learning and deep learning algorithms. In the first level of
implementation, they investigated four different classifiers and compared their
accuracies. The model that achieves the highest accuracy is LSTM, at 93%. Fake news
detection is a popular and trending research area with an extremely scarce number of
datasets. The model was run against the existing dataset and performs well against it.
At the next level, the authors analyzed real-time data from Twitter. Here the model was
trained using the logistic regression algorithm, because the LSTM did not perform well on
real-time tweets of considerably small length. The accuracy of tweet classification
using logistic regression was found to be around 87%. There is no visual presentation in
the result, so future work needs to verify not just the language but also the images and
audio embedded in the content. The method is Twitter oriented only, so news that is not
on Twitter cannot be analyzed or predicted as real or fake, and such data is useless to
the system.
Z Khanam, B N Alwasel and H Sirafi [8] focus on detecting fake news by reviewing it in
two stages: characterization and disclosure. In the first stage, the basic concepts and
principles of fake news in social media are highlighted. During the discovery stage,
current methods for detecting fake news using different supervised learning algorithms
are reviewed. The text-analysis-based detection approaches presented in the paper use
models based on speech characteristics and predictive models that do not fit with the
other current models. Among the approaches reviewed, one uses a naive Bayes classifier
to detect fake news from different sources, with an accuracy of 74%. Another uses
combined ML algorithms, but these depend on an unreliable probability threshold, with
85-91% accuracy. A third uses naive Bayes to detect fake news from different social
media websites, but the results were not accurate for untruthful sources.
Christian Janze and Marten Risius [9] suggest that fake news sites could falsely
suggest probity by selecting names, profile pictures and logos similar to those of
reliable sources. Thus, the respective source-centric attributes should be considered in
future. Their study considered only the most apparent features of the news post, which
are probably the most influential due to their exposed position. However, characteristics
of the actual news text should also be assessed to determine whether it is real or fake
news. Beyond these considerations, the authors excluded some seemingly relevant metrics,
such as the percentage of post likes and the overall number of reactions, due to
multicollinearity. Other limitations concern the generalizability of the findings. The
news detection in the work revolves only around political topics. While these are
currently of predominant public interest, fake news can also target other areas such as
science, sports or economics, which are not part of the study's sample. Nevertheless,
since no topic-specific features are considered, the authors are confident in the
generalizability of their results. Furthermore, only messages from Facebook were
considered, which are structurally and functionally distinct from those on other social
media platforms. While Facebook is the social media platform where most news is
consumed, other platforms are also subject to fake news and need individual means of
detection. In addition, future advances in natural language generation could potentially
bypass the detection system by incorporating these findings to create fake news that is
indistinguishable from genuine news. Considering the alleged substantial effects of fake
news on recent political events, the automatic detection of fake news has important
practical consequences. For future research, the study provides a starting point to
improve the detection of fake news, which could also be expanded to other topics and
tested using data from additional social media platforms. Current efforts of major
platform operators to manually tag fake news are not an efficient process.
Mykhailo Granik et al. [3] show a simple approach for fake news detection using a naive
Bayes classifier. The approach was implemented as a software system and tested against a
dataset of Facebook news posts. The posts were collected from five Facebook pages each
from the right and from the left, as well as three large mainstream political news pages
(Politico News). They achieved a classification accuracy of approximately 74%.
Classification accuracy for fake news is slightly worse. This is caused by the skewness
of the dataset: only 4.9% of it is fake news.
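The effect of such a skewed dataset can be illustrated with a small sketch. The counts
below are hypothetical and chosen only to match the 4.9% share of fake news; they are not
results from the cited paper. The sketch shows that a trivial baseline which labels every
post as real reaches a high overall accuracy while detecting no fake news at all, which is
why accuracy alone can mislead on skewed data.

```python
# Hypothetical counts for illustration only; these are not results from the cited work.
TOTAL_POSTS = 1000
FAKE_POSTS = 49  # 4.9% of the posts, matching the skew described in the text

y_true = ["fake"] * FAKE_POSTS + ["real"] * (TOTAL_POSTS - FAKE_POSTS)
# A trivial baseline that labels every post as real.
y_pred = ["real"] * TOTAL_POSTS


def accuracy(truth, predicted):
    """Fraction of posts whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(truth, predicted))
    return correct / len(truth)


def recall(truth, predicted, label):
    """Fraction of posts with true label `label` that were predicted as `label`."""
    relevant = [p for t, p in zip(truth, predicted) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


baseline_accuracy = accuracy(y_true, y_pred)
baseline_fake_recall = recall(y_true, y_pred, "fake")
print(f"accuracy = {baseline_accuracy:.3f}")                # accuracy = 0.951
print(f"recall on fake posts = {baseline_fake_recall:.3f}")  # recall on fake posts = 0.000
```

Per-class measures such as recall or the F1 score are therefore worth reporting alongside
accuracy when the classes are this unbalanced.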
Himank Gupta et al. [10] give a framework based on different machine learning approaches
that deals with various problems, such as insufficient accuracy, time lag (BotMaker) and
the high processing time needed to handle thousands of tweets in one second. They start
with 400,000 tweets from the HSpam14 dataset, which they characterize as 150,000 spam
tweets and 250,000 non-spam tweets. They also derive some lightweight features along with
the top 30 words that provide the highest information gain from a bag-of-words model.
They achieved an accuracy of 91.65% and surpassed the existing solution by approximately
18%.
Marco L Della Vedova et al. [11] first propose a novel ML fake news detection method
which, by combining news content and social context features, outperforms existing
methods in the literature, increasing accuracy up to 78.8%. Second, they implement the
method within a Facebook Messenger chatbot and validate it with a real-world
application, obtaining a fake news detection accuracy of 81.7%. Their goal was to
classify a news item as reliable or fake. They first describe the datasets used for
their tests, then the content-based approach they implemented and the method they
propose to combine it with a social-based approach from the literature. The resulting
dataset is composed of 15,500 posts coming from 32 pages (14 conspiracy pages and 18
scientific pages), with more than 2,300,000 likes by 900,000+ users; 8,923 (57.6%) of the
posts are hoaxes and 6,577 (42.4%) are non-hoaxes.
Cody Buntain et al. [12] develop a method for automating fake news detection on Twitter
by learning to predict accuracy assessments in two credibility-focused Twitter datasets:
CREDBANK, a crowdsourced dataset of accuracy assessments for events in Twitter, and
PHEME, a dataset of potential rumors in Twitter and journalistic assessments of their
accuracies. They apply this method to Twitter content sourced from BuzzFeed's fake news
dataset. A feature analysis identifies the features that are most predictive for
crowdsourced and journalistic accuracy assessments, the results of which are consistent
with prior work. The method relies on identifying highly retweeted conversation threads
and uses the features of these threads to classify stories, limiting the work's
applicability to the set of popular tweets. Since the majority of tweets are rarely
retweeted, the method is usable on only a minority of Twitter conversations.
Shivam B Parekh et al. [13] aim to present an insight into the characterization of news
stories in the modern diaspora, combined with the different content types of news
stories and their impact on readers. They then examine existing fake news detection
approaches, which are heavily based on text analysis, and describe popular fake news
datasets. They conclude by identifying four key open research challenges that can guide
future research. It is a theoretical approach that illustrates fake news detection by
analyzing psychological factors.

3. SOFTWARE DEVELOPMENTS
3.1 PROPOSED MODEL
Social media is one of the most accessible news sources for many people worldwide
because of its low cost, quick access and fast spread. However, this comes with
confusing signals and a significant risk of exposure to 'false stories' written to mislead
readers. Such information can sway public opinion and allow malicious groups to influence
the outcome of public events, such as elections. Fake news today takes many forms, from
satirical articles to fabricated stories and planned government propaganda in some
outlets. Fake news and the lack of trust in the media have huge ramifications for our
society.
We therefore propose a system to detect fake news, which is a classic text classification
problem with a straightforward proposition: build a model that can differentiate between
"Real" news and "Fake" news. The method is as follows. Pre-processing the data is a normal
first step before training and evaluating machine learning algorithms, which are only as
good as the data they are fed. It is crucial that the data is formatted properly and that
meaningful features are included, so that the data is consistent enough to give the best
possible results. A TF-IDF vectorizer is used to extract features from the content, and
these extracted features are used to train the ML algorithm (a passive aggressive
classifier).
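The proposed pipeline can be sketched as follows, assuming scikit-learn is installed.
This is a minimal illustration rather than the project's actual implementation: the toy
corpus, the labels and the parameter values (stop_words, max_iter, random_state) are
assumptions made for the example, and a real system would load a labeled news dataset
instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, invented for illustration only.
train_texts = [
    "Scientists confirm miracle cure hidden by doctors",
    "Secret plot revealed, share before it is deleted",
    "Shocking miracle diet melts fat overnight",
    "Parliament passes annual budget after debate",
    "Central bank keeps interest rates unchanged",
    "City council approves new public transport plan",
]
train_labels = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL"]

# TF-IDF turns each text into a weighted term vector; the passive aggressive
# classifier then learns a linear decision boundary on those vectors.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    PassiveAggressiveClassifier(max_iter=50, random_state=0),
)
model.fit(train_texts, train_labels)

new_texts = [
    "Miracle cure revealed by secret doctors",
    "Council debates transport budget",
]
predictions = model.predict(new_texts)
for text, label in zip(new_texts, predictions):
    print(f"{label}: {text}")
```

Wrapping the vectorizer and the classifier in one pipeline means that new text passes
through exactly the same feature extraction as the training data before prediction.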
3.2 EXISTING SYSTEM
Existing systems take a simple approach to fake news detection using a KNN classifier.
Class probabilities for a feature are obtained with KNN, which suffers from
misclassification and low prediction accuracy. In such a model, both the training and
testing data are first pre-processed by removing unwanted punctuation and words; feature
extraction is then used to extract the needed information from the pre-processed data.
Supervised machine learning algorithms are applied to the extracted features for
prediction, and classification is then done using SVM to predict whether the news is
fake or real.
Other existing work uses Support Vector Machine and Naive Bayes models. A layered
classification model capable of learning order dependence in sequence prediction problems
has also been used, and it was demonstrated that two layers were sufficient to detect
more complex features.
The main drawbacks of the existing system are:
Models with more layers can perform better but are also more difficult to train. One
layer works for simple problems, and two layers are sufficient to find relatively complex
features.
Developing a false perception about someone is one major drawback of fake news.
3.3 REQUIREMENTS SPECIFICATION
3.3.1 HARDWARE REQUIREMENTS
RAM capacity: 8 GB minimum, 16 GB or higher
CPU: Intel Core i5 6th Generation processor or higher
Accessories: Computer system powerful enough to handle the computing power necessary
3.3.2 SOFTWARE REQUIREMENTS
Operating system: Microsoft Windows 10 or Ubuntu
Language: Python 3.6
Tools: Anaconda, NumPy, Matplotlib, Sklearn
3.3.3 PYTHON
Python is an interpreted, object-oriented, high-level programming language with dynamic
semantics. Its high-level built in data structures, combined with dynamic typing and
dynamic binding, make it very attractive for Rapid Application Development, as well as
for use as a scripting or glue language to connect existing components together. Python's
simple, easy to learn syntax emphasizes readability and therefore reduces the cost of
program maintenance. Python supports modules and packages, which encourages
program modularity and code reuse. The Python interpreter and the extensive standard
library are available in source or binary form without charge for all major platforms, and
can be freely distributed.

Often, programmers fall in love with Python because of the increased productivity it
provides. Since there is no compilation step, the edit-test-debug cycle is incredibly fast.
Debugging Python programs is easy: a bug or bad input will never cause a segmentation
fault. Instead, when the interpreter discovers an error, it raises an exception. When the
program doesn't catch the exception, the interpreter prints a stack trace. A source level
debugger allows inspection of local and global variables, evaluation of arbitrary
expressions, setting breakpoints, stepping through the code a line at a time, and so on.
The debugger is written in Python itself, testifying to Python's introspective power. On
the other hand, often the quickest way to debug a program is to add a few print statements
to the source: the fast edit-test-debug cycle makes this simple approach very effective.
Python is dynamically typed and garbage collected. It supports multiple programming
paradigms, including structured (particularly procedural), object oriented and functional
programming. It is often described as a "batteries included" language due to its
comprehensive standard library. Python's large standard library provides tools suited to
many tasks, and is commonly cited as one of its greatest strengths. For Internet-facing
applications, many standard formats and protocols such as MIME and HTTP are
supported. It includes modules for creating graphical user interfaces, connecting to
relational databases, generating pseudorandom numbers, arithmetic with arbitrary-precision
decimals, manipulating regular expressions, and unit testing. Python consistently
ranks as one of the most popular programming languages.
3.3.4 ANACONDA
Anaconda is an open-source distribution of the Python and R programming languages for
scientific computing (data science, machine learning applications, large-scale data
processing, predictive analytics, etc.) that aims to simplify package management and
deployment. Package versions in Anaconda are managed by the package management system,
conda, which analyzes the current environment before executing an installation to avoid
disrupting other frameworks and packages.
The Anaconda distribution comes with over 250 packages automatically installed. Over
7,500 additional open-source packages can be installed from PyPI as well as with the
conda package and virtual environment manager. It also includes a GUI (graphical user
interface), Anaconda Navigator, as a graphical alternative to the command-line interface.
Navigator allows users to launch applications and manage conda packages, environments
and channels without using command-line commands. It can search for packages, install
them in an environment, run the packages and update them.
The distribution includes data-science packages suitable for Windows, Linux and macOS.
It is developed and maintained by Anaconda, Inc., which was founded by Peter Wang and
Travis Oliphant in 2012. As an Anaconda, Inc. product, it is also known as Anaconda
Distribution or Anaconda Individual Edition, while other products from the company are
Anaconda Team Edition and Anaconda Enterprise Edition, both of which are not free.
With its large collection of data science libraries, Anaconda is a convenient choice for
data science work, and it provides an easily manageable environment setup from which a
project can be deployed.
3.3.5 NUMPY
NumPy is a Python library used for working with arrays. NumPy stands for Numerical
Python. It also has functions for working in the domains of linear algebra, Fourier
transforms and matrices. NumPy was created in 2005 by Travis Oliphant. It is an
open-source project and can be used freely. NumPy is written partially in Python, but
most of the parts that require fast computation are written in C or C++. Python has
lists that serve the purpose of arrays, but they are slow to process. NumPy aims to
provide an array object that is up to 50x faster than traditional Python lists. The array
object in NumPy is called ndarray, and it provides many supporting functions that make
working with it very easy. Arrays are used very frequently in data science, where speed
and resources are very important. Unlike lists, NumPy arrays are stored in one contiguous
block of memory, so processes can access and manipulate them very efficiently. This
behavior is called locality of reference in computer science, and it is the main reason
why NumPy is faster than lists. NumPy is also optimized to work with the latest CPU
architectures. The source code for NumPy is hosted on GitHub.

3.3.6 PANDAS
Pandas is an open-source Python package that is widely used for data science, data
analysis and machine learning tasks. It provides high-performance data manipulation in
Python and is built on top of another package, NumPy, which provides support for
multi-dimensional arrays; NumPy is therefore required for Pandas to work. As one of the
most popular data wrangling packages, Pandas works well with many other data science
modules in the Python ecosystem and is included in many Python distributions, including
commercial vendor distributions like ActiveState's ActivePython. Pandas makes it simple
to do many of the time-consuming, repetitive tasks associated with working with data.
The name Pandas is derived from "panel data", an econometrics term for multidimensional
data. It was developed by Wes McKinney in 2008. Data analysis requires a lot of
processing, such as restructuring, cleaning or merging. Different tools are available for
fast data processing, such as NumPy, SciPy, Cython and Pandas, but we prefer Pandas
because working with it is fast, simple and more expressive than with the other tools.
Before Pandas, Python was capable of data preparation but provided only limited support
for data analysis. Pandas enhanced these capabilities: it can perform the five
significant steps required for processing and analysis of data irrespective of its
origin, i.e., load, manipulate, prepare, model and analyze.
3.3.7 MATPLOTLIB
Matplotlib is a cross-platform data visualization and graphical plotting library for
Python and its numerical extension NumPy. As such, it offers a viable open-source
alternative to MATLAB. Developers can also use Matplotlib's APIs (Application Programming
Interfaces) to embed plots in GUI applications. A Python Matplotlib script is structured
so that a few lines of code are all that is required in most instances to generate a
visual data plot.
Matplotlib is a visualization library for 2D plots of arrays, built on NumPy arrays and
designed to work with the broader SciPy stack. It was introduced by John Hunter in 2002.
One of the greatest benefits of visualization is that it gives us visual access to huge
amounts of data in easily digestible form. Matplotlib offers several plot types, such as
line, bar, scatter and histogram.
3.3.8 SKLEARN
Scikit-learn (Sklearn) is a widely used and robust library for machine learning in
Python. It provides a selection of efficient tools for machine learning and statistical
modeling, including classification, regression, clustering and dimensionality reduction,
via a consistent interface in Python. The library, which is largely written in Python, is
built upon NumPy, SciPy and Matplotlib.
It was originally called scikits.learn and was initially developed by David Cournapeau as
a Google Summer of Code project in 2007. Later, in 2010, Fabian Pedregosa, Gael
Varoquaux, Alexandre Gramfort and Vincent Michel, from INRIA (French Institute for
Research in Computer Science and Automation), took the project to another level and made
the first public release (v0.1 beta) on 1st Feb. 2010.

4. MODULE DESCRIPTION
4.1 DATA PREPROCESSING
Data preprocessing is the normal first step before training and evaluating a machine
learning model. Machine learning algorithms are only as good as the data fed to them, so
it is important to format the data correctly and to include relevant features so that the
data is consistent enough to produce the best possible outcomes. Stop-word removal,
tokenization, lowercasing, sentence segmentation, and punctuation removal are all
examples of data refinement. Irrelevant information must be removed, which also reduces
the size of the data. We created a generic processing function that, for each document,
removes punctuation and non-letter characters and then converts the text to lower case.
The cleaning steps include removing all non-alphanumeric characters, deleting stop
words, and dropping rows with missing values.
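The cleaning steps described above can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual processing function: the names clean_text and clean_rows and the tiny stop-word list are assumptions made for the example.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real project would use a
# fuller list, such as the one selected by stop_words='english' in scikit-learn).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "at"}


def clean_text(text):
    """Lowercase, strip non-letter characters, tokenize, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation, digits, symbols -> spaces
    tokens = text.split()                   # simple whitespace tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]


def clean_rows(rows):
    """Skip missing or blank rows and clean the remaining ones."""
    return [clean_text(row) for row in rows if row and row.strip()]


print(clean_text("BREAKING: The Senate is voting at 9 PM!!!"))
```

Replacing unwanted characters with spaces, rather than deleting them, keeps words such as "well-known" from being fused into a single token.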
4.2 FEATURE EXTRACTION
Feature extraction is a dimensionality-reduction step that turns the raw data into a
smaller, more manageable set of features for computation. A distinguishing characteristic
of large data sets is the large number of variables, which demand substantial computing
resources to process. To begin, we extract a number of linguistic features for the fake
news detection models. A model built on a count vectorizer (using word tallies) or on a
term frequency-inverse document frequency (TF-IDF) matrix can only go so far, because
these representations do not consider important qualities such as word order and context.
It is quite possible for two articles with similar word counts to be completely different
in meaning. The data science community has responded by taking action against this
problem.
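The limitation noted above, that count-based features ignore word order, can be seen in a small sketch. The two sentences and the helper name bag_of_words are made up for illustration; the helper only mimics what a count vectorizer records.

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase token occurs, ignoring word order."""
    return Counter(text.lower().split())


a = "the senator denied the report"
b = "the report denied the senator"

# Same word tallies, different meaning: a count-based model cannot tell them apart.
print(bag_of_words(a) == bag_of_words(b))
print(a == b)
```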
4.3 ALGORITHM TRAINING
Using training data in machine learning programs is a simple idea, and the way it works
is equally straightforward. The training set is the initial data used to help a program
learn how to apply computational intelligence techniques and produce refined results.
Further sets of data, called validation and test sets, may be used in addition to it. The
trained model can process not only individual data points, but also whole data sequences.
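As a minimal sketch of how training, validation, and test sets relate, the following splits a list of samples into three disjoint parts. The function name split_dataset and the 70/10/20 proportions are assumptions for the example; the sample code in this report uses only a train/test split.

```python
import random


def split_dataset(samples, val_fraction=0.1, test_fraction=0.2, seed=7):
    """Shuffle samples and split them into train, validation, and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

The validation set is used while tuning the model, and the test set is held back until the end so that the reported accuracy reflects unseen data.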

4.4 PREDICTION
A data set is usually separated into a training set and a test set. The majority of the
data is used for training, while only a small portion is used for testing. A prediction is
a statement about a particular outcome, and forecasting can be helpful for planning.
Using the training data, the machine can predict the output; the test data also passes
through the same preprocessing and feature extraction. The message box module is used to
build an interface for finding out whether a statement is fake or original. In today's
society, it is crucial to monitor fake stories online, as news is produced quickly because
of easily accessible technology. There are seven major categories of false stories, and
counterfeit news content can be textual or visual. Linguistic as well as non-linguistic
indicators can be analyzed with several techniques to identify false news. Although
several of these methods are usually effective at identifying fake news, they have
limitations.

5. METHODOLOGY
The main objective is to detect fake news, which is a classic text classification problem
with a straightforward proposition: build a model that can differentiate between "Real
news" and "Fake news". The method consists of the following steps:
1. Acquiring and loading the data.
2. Cleaning the dataset.
3. Removing extra symbols.
4. Removing punctuation.
5. Removing the stop words.
6. Stemming.
7. Tokenization.
8. Feature Extractions.
9. TF-IDF vectorizer.
10. Count vectorizer with TF-IDF transformer.
11. Machine learning model training and verification.
Preprocessing the data is the normal first step before training and evaluating machine
learning algorithms. Machine learning algorithms are only as good as the data you feed
them. It is crucial that the data is formatted properly and that meaningful features are
included, so that it is consistent enough to give the best possible results. A TF-IDF
vectorizer is used to extract features from the content, and the extracted features are
used to train the ML algorithm (a passive aggressive classifier).
Fig. 5.1 Architecture
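Steps 9 and 10 of the method list describe two routes to the same features. As a minimal sketch, with toy documents made up for illustration, scikit-learn's CountVectorizer followed by TfidfTransformer yields the same matrix as TfidfVectorizer when both use default settings:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy documents (not taken from the project's dataset).
docs = [
    "president signs new climate bill",
    "aliens sign secret climate bill",
    "new study links coffee to longer life",
]

# Step 10: raw term counts first, then TF-IDF re-weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Step 9: the same computation in a single object.
one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))
```

The two-step form is useful when the raw counts are also needed; otherwise the single vectorizer is shorter.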

6. EXPERIMENTAL ANALYSIS
6.1 SAMPLE CODE
import pickle

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Read the data
df = pd.read_csv('news.csv')

# Shape and head
print('Rows and columns:', df.shape)
print('First 5 rows:', df.head())
labels = df.label
print('labels:', labels.head())

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(
    df['text'], labels, test_size=0.2, random_state=7)

# Initialize a TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

# Fit and transform train set, transform test set
tfidf_train = tfidf_vectorizer.fit_transform(x_train)
tfidf_test = tfidf_vectorizer.transform(x_test)

# Initialize a PassiveAggressiveClassifier
pac = PassiveAggressiveClassifier(max_iter=100)
pac.fit(tfidf_train, y_train)

# Save the fitted vectorizer and classifier
with open('vectors.pickle', 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)
with open('fakenews.pickle', 'wb') as f:
    pickle.dump(pac, f)

# Load them back
with open('fakenews.pickle', 'rb') as pkl:
    pac = pickle.load(pkl)
with open('vectors.pickle', 'rb') as vec:
    tf_vect = pickle.load(vec)

# Predict on the test set and calculate accuracy
y_pred = pac.predict(tfidf_test)
score = accuracy_score(y_test, y_pred)
print(f'Accuracy: {round(score * 100, 2)}%')

# Build confusion matrix
print(confusion_matrix(y_test, y_pred, labels=['FAKE', 'REAL']))

# Classify a single headline
text = 'Watch The Exact Moment Paul Ryan Committed Political Suicide At A Trump Rally (VIDEO)'
tf_text = tf_vect.transform([text])
pred = pac.predict(tf_text)
print(pred)
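The sample code stores the vectorizer and the classifier in two separate pickle files, which must always be loaded together. One possible alternative, shown here only as a sketch and not as the project's implementation, is to wrap both in a scikit-learn Pipeline so that a single object is saved and new text always passes through the same vectorizer. The toy texts and labels below are made up for illustration.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline

# Toy training data (made up for illustration; not taken from news.csv).
texts = [
    "aliens endorse candidate in secret moon meeting",
    "miracle pill cures every disease overnight",
    "shocking secret celebrity clone exposed",
    "parliament passes annual budget after debate",
    "central bank keeps interest rates unchanged",
    "city council approves new bus routes",
]
labels = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL"]

# One object holds both the vectorizer and the classifier.
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", PassiveAggressiveClassifier(max_iter=100, random_state=7)),
])
model.fit(texts, labels)

# Round-trip through pickle in memory; pickle.dump to a file works the same way.
restored = pickle.loads(pickle.dumps(model))

headline = ["secret miracle pill exposed"]
print(restored.predict(headline))
```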

6.2 IMPLEMENTATION
Fig. 6.1 User Interface
Fig. 6.2 Home page
Fig. 6.3 Admin login
Fig. 6.4 User login
Fig. 6.5 News Uploading
Fig. 6.6 News prediction
Fig. 6.7 User registration

7. CONCLUSION
The concept of deception detection in social media is relatively new, and there is
ongoing research in the hope that scholars can find more accurate ways to detect false
information in this booming, fake-news-infested domain. For this reason, this research
may help other researchers discover which combination of methods should be used to
detect fake news in social media accurately. The proposed method described in this paper
is an idea for a more accurate fake news detection algorithm. It is important that we have
some mechanism for detecting fake news or, at the very least, an awareness that not
everything we read on social media may be true, so we always need to think critically.
This way we can help people make more informed decisions, and they will not be fooled
into believing what others want them to believe.
Fake news interferes with a user's ability to discern useful information from Internet
services, especially when news becomes critical for decision making. Considering the
changing landscape of the modern business world, the issue of fake news has become
more than just a marketing problem, and it warrants serious effort from security
researchers. It is imperative that any attempts to manipulate or troll the Internet through
fake news are countered effectively. We proposed a simple but effective approach that
allows users to install a simple tool in their personal browser and use it to detect and
filter out potential clickbait. The preliminary experimental results, conducted to assess
the method's ability to attain its intended objective, showed outstanding performance in
identifying possible sources of fake news. Since we started this work, a few fake news
databases have been made available, and we are currently expanding our approach using
R to test its effectiveness against the new datasets.
In the 21st century, the majority of tasks are done online. Newspapers that were earlier
preferred as hard copies are now being replaced by applications such as Facebook and
Twitter and by news articles read online. WhatsApp forwards are also a major source.
The growing problem of fake news only makes things more complicated, and it tends to
change or hamper people's opinions and attitudes towards the use of digital technology.
When a person is deceived by such news, one possible consequence is that they start
believing that their perceptions about a particular topic are true as assumed. Thus, in
order to curb the phenomenon, we have developed our fake news detection system, which
takes input from the user and classifies it as true or fake. To implement this, various NLP
and machine learning techniques have to be used. The model is trained using an
appropriate dataset, and its performance is evaluated using various performance measures.
The best model, i.e. the model with the highest accuracy, is used to classify the news
headlines or articles. As evident above, for static search our best model came out to be
Logistic Regression with an accuracy of 65%. We then used grid search parameter
optimization to improve the performance of logistic regression, which gave us an
accuracy of 75%. Hence we can say that if a user feeds a particular news article or its
headline into our model, there is a 75% chance that it will be classified according to its
true nature. The user can check the news article or keywords online and can also check
the authenticity of the website. The accuracy of the dynamic system is 93%, and it
increases with every iteration. We intend to build our own dataset, which will be kept up
to date with the latest news. All live news and the latest data will be kept in a database
using a web crawler and an online database.
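The grid search parameter optimization mentioned above can be sketched with scikit-learn's GridSearchCV. This is only an illustration under stated assumptions: the toy texts, the pipeline layout, and the parameter grid are hypothetical, since the report does not list the values that were actually searched.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data (made up for illustration; far too small for a meaningful score).
texts = [
    "aliens endorse candidate in secret moon meeting",
    "miracle pill cures every disease overnight",
    "shocking secret celebrity clone exposed",
    "secret cure hidden by doctors exposed",
    "parliament passes annual budget after debate",
    "central bank keeps interest rates unchanged",
    "city council approves new bus routes",
    "council debate on budget continues",
]
labels = ["FAKE", "FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL", "REAL"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: every combination is trained and scored by cross-validation.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 2))
```

After fitting, search.best_estimator_ is the pipeline refitted on all the data with the best parameter combination, ready to classify new headlines.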

8. REFERENCES
[1] Abu-Nimeh, S., Chen, T., Alzubi, 0., 2011. Malicious and spam posts in online social
networks. Computer 44, 23-28. doi:10.l 109/MC.2011.222.
[2] Al Messabi, K., Aldwairi, M., Al Yousif, A., Thoban, A., Belqasmi, F., 2018.
Malware detection using dns records and domain name features",in: International
Conference on Future Networks and Distributed Systems (ICFNDS), ACM. URL:
https://doi.org/10.1145/3231053.3231082.
[3] Aldwairi, M., Abu-Dalo, A.M., Jarrah, M., 2017a. Pattern matching of signature-
based ids using myers algorithm under mapreduce frame-work. EURASIP J. Information
Security 2017, URL: http://dblp.uni-trier.de/db/journals/ejisec/ejisec2017.html# Aldw
airiAJ17.
[4] Aldwairi, M., Al-Salman, R., 2011. Malurls: Malicious urls classification system, in:
Annual International Conference on Information Theoryand Applications, GSTF Digital
Library (GSTF-DL), Singapore. doi:10.5176/978-981-08-8113-9_1TA201l-29. the best
paper award.
[5] Aldwairi, M., Alsaadi, H.H., 2017. Flukes: Autonomous log forensics, intelligence
and visualization tool, in: Proceedings of the InternationalConference on Future Networks
and Distributed Systems, ACM, New York, NY, USA. pp. 33:1-3
6] Aldwairi, M., Hasan, M., Balbahaith, Z., 2017b. Detection of drive-by download
attacks usmg machine learning approach. Int. J. Inf. Sec.Priv. 11, 16-28. URL:
https://doi.org/10.4018/IJISP.2017100102, doi:10.4018/IJISP.2017100102.
[7] Balmas, M., 2014. When fake news becomes real: Combined exposure to multiple
news sources and political attitudes of inefficacy, alienation,and cynicism.
Communication Research 41, 430-454. doi:10.1177/0093650212453600.
[8] Baym, G., Jones, J.P., 2012. News parody in global perspective: Politics, power, and
resistance.PopularCommunicationl0,213.URL:https://doi.org/10.1080/15405702.2012.63
856 6, doi: I 0.1080/15405702.2012.638566.
[9] Brewer, P.R., Young, D.G., Morreale, M., 2013. The impact ofreal news about fake
news": Intertextual processes and political satire. In-ternational Journal of Public Opinion

Research 25, 323-343. URL: http://dx.doi.org/l 0.1093/ijpor/edt0I 5, doi: IO. I
093/ijpor/edt0I5
[10] Chakraborty, A., Paranjape, B., Kakarla, S., Ganguly, N., 2016. Stop clickbait:
Detecting and preventing clickbaits in online news media,in: 2016 IEEE/ACM
International Conference on Advances in Social Networks Analysis and Mining
(ASONAM), pp. 9-16. doi:10.1109/ASONAM.2016.7752207
[11] Chen, Y., Conroy, N.J., Rubin, V.L., 2015. News in an online world: The need for an
"automatic crap detector", in: Proceedings of the 78thASIS&T Annual Meeting:
Infonnation Science with Impact: Research in and for the Community, American Society
for Information Science,SilverSprings,MD,USA.pp.81:1- 81:4.URL:http://dl.acm.org/
citation.cfm?id=2857070.2857151.
[12] Conroy, N.J., Rubin, V.L., Chen, Y., 2015. Automatic deception detection: Methods
for finding fake news, in: Proceedings of the 78th ASIS&TAnnual Meeting: Information
Science with Impact: Research in and for the Community, American Society for
Information Science, SilverSprings,MD,USA.pp.82:1- 82:4.URL:http://dl.acm.org/
citation.cfm?id=2857070.2857152.
l13] Hassid, J., 2011. Four models of the fourth estate: A typology of contemporary
chinese journalists. The China Quarterly 208, 813832.doi:10.1017/S0305741011001019.
[14] Lewis, S., 2011. Journalists, social media, and the use of humor on twitter. The
Electronic Journal of Communication/ La Revue Electronicde Communication 21, 1-2.
[15] Marchi, R., 2012. With facebook, biogs, and fake news, teens reject journalistic
objectivity. Journal of Communication Inquiry 36, 246-262. URL: https://doi.org/
10.1177/0196859912458700, doi:10.1177/0196859912458700.
[16] Masri, R., Aldwairi, M., 2017. Automated malicious advertisement detection using
virustotal, urlvoid, and trendmicro, in: 2017 8th Interna-tional Conference on Information
and Communication Systems (ICICS), pp. 336-341. doi:10.1109/IACS.2017.7921994.
[17] Nah, F.F.H., 2015. Fake-website detection tools : Identifying elements that promote
individuals use and enhance their performance 1 .introduction.[18] Pogue, D., 2017. How
to stamp out fake news. Scientific American 316, 24-24. doi:10.1038/scientific
american0217-24.

[19] Qbeitah, M.A., Aldwairi, M., 2018. Dynamic malware analysis of phishing emails,
in: 2018 9th International Conference on Information andCommunication Systems
(ICICS), pp. 18-24. doi:10.1109/IACS.2018.8355435.
[20] Riedel, B., Augenstein, I., Spithourakis, G.P., Riedel, S., 2017. A simple but tough-
to-beat baseline for the fake news challenge stance detectiontask. CoRR abs/1707.03264.
URL: http://arxiv.org/abs/1707.03264, arXiv:1707.03264
[21] Rubin, V.L., Chen, Y., Conroy, N.J., 2015. Deception detection for news: Three
types of fakes, in: Proceedings of the 78th ASIS&T AnnualMeeting: Information Science
with Impact: Research in and for the Community, American Society for Information
Science, Silver Springs,MD, USA. pp. 83:1-83:4. URL: http://dl.acm.org/citation.
cfm?id=2857070.2857153.
[22] Shu, K., Sliva, A., Wang, S., Tang, J., Liu, H., 2017. Fake news detection on social
media: Adata mmmg perspective. SIGKDDExplor. Newsl.19, 22-36. URL:
http://doi.acm.org/10.1145/3137597.3137600, doi:10.1145/3137597.3137600.
[23] Smith, J., Leavitt, A., Jackson, G., 2018. Designing new ways to give context to
news stories. https://medium.com/facebook-design/designing-new-ways-to-give-context-
to-newsstories-f 6cl 3604f450.
[24] Spicer, R.N., 2018. Lies, Damn Lies, Alternative Facts, Fake News, Propaganda,
Pinocchios, Pants on Fire, Disinformation, Misin-formation, Post-Truth, Data, and
Statistics. Springer International Publishing, Cham. pp. 1-31. URL:
https://doi.org/10.1007/978-3-319- 69820-5_1, doi:10.1007/978-3-319-69820-5_1.
[25] of Waikato, U., 2017. Waikato environment for knowledge analysis. URL:
https://www.cs.waikato.ac.nz/ml/weka/.
