1. SRI SHANMUGHA COLLEGE OF ENGINEERING AND TECHNOLOGY
FAKE NEWS DETECTION
USING PYTHON AND MACHINE LEARNING
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
PRESENTED BY
RAJESHWARI.J (732720243021)
SHANILA RINSI.T (732720243025)
SHAMILSHA (732720243024)
GUIDED BY
JASEENASH.R , AP/CSE
2. ABSTRACT
• Fake news is news designed to deliberately spread hoaxes, propaganda,
misinformation, or falsehoods, and its volume is overwhelming.
• The main objective of this project is to detect fake news, a classic
text classification problem with a straightforward proposition: build a
model that can differentiate between real news and fake news.
• We use the Logistic Regression, Decision Tree, Gradient Boosting, and
Random Forest algorithms to detect fake news, with term frequency-inverse
document frequency (TF-IDF) used to vectorize the news content.
3. DOMAIN INTRODUCTION
• Fake news and lack of trust in the media are growing problems with huge
ramifications in our society. Obviously, a purposely misleading story is
“fake news”, but lately the blathering discourse on social media is changing
its definition.
• Disinformation, also known as fake news, is overwhelming.
• The phrase “fake news” was named word of the year in 2016 by the
Macquarie Dictionary. Thus, the traditional media is urged to be more
creative so that it can gain more attention from the public.
• During a social event such as the coronavirus pandemic in 2020, official
organizations have been countering fake news from the beginning. EPI-WIN, a
project led by the World Health Organization, is available online to provide
credible information and flag fake news about the disease.
4. PROBLEM DEFINITION
• The effects of fake news can be political, economic, business-related,
organizational, health-related, or even personal. A model is needed
that can differentiate between “Real” news and “Fake” news.
• The extensive spread of fake news has the potential for extremely
negative impacts on individuals and society. Therefore, fake news
detection has become a challenging task in today's world.
• Using scikit-learn, we build a TF-IDF vectorizer on the provided dataset,
then initialize a Passive Aggressive Classifier and fit the model. In the
end, the accuracy score and the confusion matrix tell us how well our model
fares.
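The workflow in the last bullet can be sketched as follows. This is a minimal illustration, not the project's actual code: the eight-sentence corpus and its labels are hypothetical stand-ins for the real dataset, and the parameter values (test_size, max_iter, random_state) are assumptions chosen only to make the example reproducible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for the real news dataset.
texts = [
    "Scientists confirm water found on the moon surface",
    "Government publishes annual budget report today",
    "Local council approves new public library funding",
    "University study links exercise to better sleep",
    "Aliens secretly control the world banking system",
    "Miracle pill cures every disease overnight doctors shocked",
    "Celebrity clone replaced president in secret plot",
    "Drinking bleach makes you immune to all viruses",
]
labels = ["REAL"] * 4 + ["FAKE"] * 4

# Hold out a stratified test split so both classes appear in it.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit the TF-IDF vocabulary on the training text only, then reuse it.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_train = vectorizer.fit_transform(X_train)
tfidf_test = vectorizer.transform(X_test)

# Initialize and fit the Passive Aggressive Classifier.
clf = PassiveAggressiveClassifier(max_iter=50, random_state=0)
clf.fit(tfidf_train, y_train)

# Evaluate with the accuracy score and the confusion matrix.
y_pred = clf.predict(tfidf_test)
accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred, labels=["FAKE", "REAL"])
print(f"Accuracy: {accuracy:.2f}")
print(matrix)
```

With a corpus this small the accuracy figure is not meaningful; the point is only the order of the steps.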
5. EXISTING SYSTEM
• How to enforce user privacy preferences.
• How to secure data when it is stored in the PDS.
• Users are not skilled enough to translate their privacy requirements
into privacy preferences.
• Average users might have difficulty properly setting potentially
complex privacy preferences.
DISADVANTAGES OF EXISTING SYSTEM
• The personal data we produce digitally are scattered and managed by
different providers (e.g., online social media, hospitals, banks,
airlines, etc.).
• Users are losing control over their data, whose protection is the
responsibility of the data provider.
• Users cannot fully exploit their data, since each provider keeps a separate
view of them.
6. PROPOSED SYSTEM
By utilizing the Logistic Regression, Decision Tree, Gradient Boosting, and
Random Forest algorithms, we build our model to increase performance and
accuracy.
Advantages
• The proposed system is cost effective.
• Does not require any external hardware implementation.
• Early detection reduces fatalities.
• System-generated results are less prone to error.
• Reduces human effort and intervention.
• Drastically reduces time compared to manual detection.
• Accuracy in prediction.
• Flexible and portable.
7. OBJECTIVES
• This project applies NLP (Natural Language Processing) techniques
to detect 'fake news', that is, misleading news stories that come
from non-reputable sources.
• The main objective of this project is to detect fake news by
using Natural Language Processing techniques.
8. REQUIREMENTS SPECIFICATION
HARDWARE REQUIREMENTS
• System : Intel core i5.
• Hard Disk : 500GB.
• RAM : 4GB.
• Any desktop or laptop system with above configuration or higher level.
SOFTWARE REQUIREMENTS
• Operating System : Windows XP/7/10
• Coding language : Python and various libraries.
• Software : Anaconda.
• IDE : Jupyter Notebook, Pycharm.
• Front-end : HTML.
• Back-end : Flask framework
9. METHODOLOGY
DATA COLLECTION
In this project, the dataset is taken from Kaggle.com. The size of
the dataset is 6335 × 4, meaning that there are 6335 rows and 4
columns. The columns are named ‘URLs’, ‘Headline’, ‘Body’ and
‘Label’. The first column identifies the news item, the second and third
are the title and text, and the fourth column holds the label denoting
whether the news is REAL or FAKE.
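A quick way to confirm the layout described above is to load the file with pandas and check its shape and label balance. The sketch below is only illustrative: the three rows and the example.com URLs are invented, and an in-memory CSV is used so the snippet runs without the Kaggle file. With the real data, `io.StringIO(csv_text)` would be replaced by the path to the downloaded CSV.

```python
import io

import pandas as pd

# Hypothetical three-row sample with the four columns named on this slide.
csv_text = """URLs,Headline,Body,Label
http://example.com/a,Budget report published,The government released its annual budget.,REAL
http://example.com/b,Aliens run the banks,Anonymous sources claim aliens control banking.,FAKE
http://example.com/c,Library funding approved,The council approved new library funding.,REAL
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)                    # (rows, columns)
print(list(df.columns))            # column names
print(df["Label"].value_counts())  # number of REAL and FAKE items
```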
FEATURE EXTRACTION
To analyze and model text after it has been preprocessed, it must
first be converted into features. Techniques include Bag of Words
and TF-IDF Vectorizer.
10. Term Frequency-Inverse Document Frequency
The TF-IDF weight of a word increases proportionally with the number of times
the word appears in a document but is offset by the word's frequency in the
overall corpus. While TF-IDF is a good basic metric for extracting descriptive
terms, it does not take into consideration a word’s position or context.
Bag of words
This model analyzes the text from all input documents and converts it into
bag-of-words form. For example, given a set of text documents, we can build
one bag of words that contains all the distinct words from all the texts.
11. CLASSIFICATION
Naive Bayes Classifier
In machine learning, Naive Bayes classifiers are a family of simple
“probabilistic classifiers” based on applying Bayes’ theorem with
strong (naive) independence assumptions between the features.
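As a rough illustration, the sketch below trains scikit-learn's MultinomialNB on word counts. The four training sentences are hypothetical. Under the independence assumption, each word contributes to the class probability separately, so a new text made up of words seen only in FAKE examples is assigned to FAKE.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training examples, two per class.
train_texts = [
    "aliens control banks secret plot",
    "miracle pill cures disease secret",
    "council approves budget report",
    "university publishes study report",
]
train_labels = ["FAKE", "FAKE", "REAL", "REAL"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# alpha=1.0 is Laplace smoothing, so unseen words do not zero out a class.
model = MultinomialNB(alpha=1.0)
model.fit(X_train, train_labels)

X_new = vectorizer.transform(["secret aliens plot"])
predicted = model.predict(X_new)[0]
probabilities = model.predict_proba(X_new)[0]
print(predicted, dict(zip(model.classes_, probabilities.round(3))))
```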
Passive Aggressive Classifier
The Passive Aggressive Classifier is an online learning algorithm, which
makes it well suited to detecting fake news on social media, where new data
is added every second.
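Online learning means the model can be updated batch by batch instead of being retrained from scratch. The sketch below illustrates this with `partial_fit`. The two small batches are hypothetical, and HashingVectorizer is an assumption made here so that no vocabulary has to be fixed before new text arrives.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Stateless vectorizer: maps any text to a fixed number of hashed features.
vectorizer = HashingVectorizer(n_features=2**10, alternate_sign=False)
clf = PassiveAggressiveClassifier(random_state=0)
classes = np.array(["FAKE", "REAL"])  # must be declared on the first call

# Hypothetical stream of labelled batches arriving over time.
batches = [
    (["Aliens secretly control the banks", "Council approves library funding"],
     ["FAKE", "REAL"]),
    (["Miracle pill cures every disease", "University publishes sleep study"],
     ["FAKE", "REAL"]),
]

for texts, labels in batches:
    X = vectorizer.transform(texts)
    # Update the existing weights using only the new batch.
    clf.partial_fit(X, labels, classes=classes)

prediction = clf.predict(vectorizer.transform(["Secret miracle pill hidden by banks"]))
print(prediction[0])
```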
12. LANGUAGE USED FOR IMPLEMENTATION
Python is an interpreted, object-oriented programming language similar
to Perl, which has gained popularity because of its clear syntax and
readability.
The source code is freely available and open for modification and reuse.
Features Of Python
• Easy to understand and read.
• Interpreted Language.
• Cross-platform Language
• Free and Open Source
• Object-Oriented Language
• Extensible
• GUI Programming Support
13. Advantages Of Python
Presence of Third-Party Modules
Open Source and Community Development
Learning Ease and Support Available
User-friendly Data Structures
Productivity and Speed.
FRONT-END: HTML
HTML provides a means to create structured documents by denoting structural
semantics for text such as headings, paragraphs, lists, links, quotes and other items.
Browsers do not display the HTML tags but use them to interpret the content of the page.
BACK-END: FLASK FRAMEWORK
Flask is a micro web framework written in Python that can be used for building complex
database-driven websites, starting with mostly static pages. It is classified as a micro
framework because it does not require particular tools or libraries. Applications that use
the Flask framework include Pinterest and LinkedIn.
14. PLATFORM
PyCharm is an integrated development environment (IDE) used in computer
programming, specifically for the Python language. It is developed by the Czech
company JetBrains. It provides code analysis, a graphical debugger, an integrated unit
tester, integration with version control systems (VCS), and supports web development
with Django as well as Data Science with Anaconda.
• Python refactoring: includes rename, extract method, introduce variable,
introduce constant, pull up, push down and others.
• Support for web frameworks: Django, web2py and Flask .
• Integrated Python debugger.
JUPYTER NOTEBOOK
A Jupyter Notebook can be converted to a number of open standard output formats, such
as HTML, presentation slides, LaTeX, PDF, reStructuredText, Markdown, and Python, through
‘Download As’ in the web interface or via the nbconvert library. The nbviewer service
takes a URL to any publicly available notebook document, converts it to HTML on the fly,
and displays it to the user.
15. ALGORITHMS
LOGISTIC REGRESSION
• Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique.
• It is used for predicting the categorical dependent variable using a given set of
independent variables.
• It gives probabilistic values that lie between 0 and 1.
• Logistic Regression is very similar to Linear Regression; the difference lies in how
they are used.
• Logistic regression is used for solving the classification problems.
STEPS IN LOGISTIC REGRESSION
• Data pre-processing
• Fitting Logistic Regression to the training set
• Predicting the test result
• Testing the accuracy of the result (creation of the confusion matrix)
• Visualizing the test set result
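The steps above can be sketched with scikit-learn as follows. The data are synthetic (generated by `make_classification`) rather than news text, and the feature scaling, split ratio, and random seeds are assumptions made only for illustration. The visualization step is omitted to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Step 1: data pre-processing (synthetic two-class data, split and scaled).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 2: fit Logistic Regression to the training set.
model = LogisticRegression()
model.fit(X_train, y_train)

# Step 3: predict the test result (labels and class-1 probabilities).
y_pred = model.predict(X_test)
probabilities = model.predict_proba(X_test)[:, 1]

# Step 4: test the accuracy of the result with a confusion matrix.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
```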
16. DECISION TREE CLASSIFICATION
• Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
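To make the node, branch, and leaf vocabulary concrete, the sketch below fits a small tree and prints its rules with `export_text`. The two numeric features (count of exclamation marks, whether a source is cited) and the six samples are hypothetical and are not taken from the project's dataset.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features per article: [exclamation_marks, cites_source (0/1)].
X = [[0, 1], [1, 1], [5, 0], [7, 0], [0, 1], [6, 0]]
y = ["REAL", "REAL", "FAKE", "FAKE", "REAL", "FAKE"]

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

# Internal nodes test a feature, branches are the rules, leaves give the class.
rules = export_text(tree, feature_names=["exclamation_marks", "cites_source"])
print(rules)

prediction = tree.predict([[8, 0]])[0]
print(prediction)
```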
GRADIENT BOOSTING CLASSIFIER
• Gradient Boosting is a popular boosting algorithm. In gradient boosting, each
predictor corrects its predecessor’s error.
• In contrast to AdaBoost, the weights of the training instances are not tweaked;
instead, each predictor is trained using the residual errors of its predecessor as labels.
• Gradient Boosted Trees is a technique whose base learner is CART
(Classification and Regression Trees).
• The below diagram explains how gradient boosted trees are trained for regression
problems.
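The residual-fitting idea for regression can be sketched by hand with small regression trees. This is a simplified illustration under squared-error loss, not scikit-learn's GradientBoostingClassifier; the noisy sine data, tree depth, learning rate, and number of rounds are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical regression data: a noisy sine curve.
rng = np.random.default_rng(0)
X = np.linspace(0, 6, 80).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

learning_rate = 0.5
# Start from a constant prediction (the mean of the targets).
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(20):
    # Each new tree is trained on the residual errors of the current ensemble.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Add a shrunken version of the tree's output to the running prediction.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"MSE before boosting: {initial_mse:.4f}, after: {final_mse:.4f}")
```

Because each tree predicts the mean residual in every leaf, each round cannot increase the training error for a learning rate between 0 and 1.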
17. SOURCE CODE
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
# Modelling Algorithms
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import PassiveAggressiveClassifier
# Modelling Helpers
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
# Computations
import itertools
# Visualization
import matplotlib.pyplot as plt
# Load the data files
test = pd.read_csv("test.csv")
submit = pd.read_csv("submit.csv")
# NOTE: this reads test.csv a second time, which is why the train and test
# shapes printed later are identical; a separate training file was presumably intended.
train = pd.read_csv("test.csv")
train.head()
18.
   id     title                                              author                   text
0  20800  Specter of Trump Loosens Tongues, if Not Purse...  David Streitfeld         PALO ALTO, Calif. — After years of scorning...
1  20801  Russian warships ready to strike terrorists ne...  NaN                      Russian warships ready to strike terrorists ne...
2  20802  #NoDAPL: Native American Leaders Vow to Stay A...  Common Dreams            Videos #NoDAPL: Native American Leaders Vow to...
3  20803  Tim Tebow Will Attempt Another Comeback, This ...  Daniel Victor            If at first you don’t succeed, try a different...
4  20804  Keiser Report: Meme Wars (E995)                    Truth Broadcast Network  42 mins ago 1 Views 0 Comments 0 Likes 'For th...
print(f"Train Shape : {train.shape}")
print(f"Test Shape : {test.shape}")
print(f"Submit Shape : {submit.shape}")
Train Shape : (5200, 4)
Test Shape : (5200, 4)
Submit Shape : (5200, 2)
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5200 entries, 0 to 5199
Data columns (total 4 columns):
19.
 #   Column  Non-Null Count  Dtype
 0   id      5200 non-null   int64
 1   title   5078 non-null   object
 2   author  4697 non-null   object
 3   text    5193 non-null   object
dtypes: int64(1), object(3)
memory usage: 162.6+ KB
train.isnull().sum()
id          0
title     122
author    503
text        7
dtype: int64
train.dtypes.value_counts()
object 3
int64 1
dtype: int64
# Replace missing values with a single space so string concatenation works
test = test.fillna(' ')
train = train.fillna(' ')
# Create a column with all the data available
test['total'] = test['title'] + ' ' + test['author'] + ' ' + test['text']
train['total'] = train['title'] + ' ' + train['author'] + ' ' + train['text']
# Have a glance at our training set
train.info()
train.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5200 entries, 0 to 5199
Data columns (total 5 columns):
20.
 #   Column  Non-Null Count  Dtype
 0   id      5200 non-null   int64
 1   title   5200 non-null   object
 2   author  5200 non-null   object
 3   text    5200 non-null   object
 4   total   5200 non-null   object
dtypes: int64(1), object(4)
memory usage: 203.2+ KB
21. OUTPUT
   id     title                                              author                   text                                               total
0  20800  Specter of Trump Loosens Tongues, if Not Purse...  David Streitfeld         PALO ALTO, Calif. — After years of scorning...     Specter of Trump Loosens Tongues, if Not Purse...
1  20801  Russian warships ready to strike terrorists ne...                           Russian warships ready to strike terrorists ne...  Russian warships ready to strike terrorists ne...
2  20802  #NoDAPL: Native American Leaders Vow to Stay A...  Common Dreams            Videos #NoDAPL: Native American Leaders Vow to...  #NoDAPL: Native American Leaders Vow to Stay A...
3  20803  Tim Tebow Will Attempt Another Comeback, This ...  Daniel Victor            If at first you don’t succeed, try a different...  Tim Tebow Will Attempt Another Comeback, This ...
4  20804  Keiser Report: Meme Wars (E995)                    Truth Broadcast Network  42 mins ago 1 Views 0 Comments 0 Likes 'For th...  Keiser Report: Meme Wars (E995) Truth Broadcas...
22. CONCLUSION:
• In our project, we used the Logistic Regression, Decision Tree,
Gradient Boosting, and Random Forest classifiers. These algorithms
were helpful for predicting the truthfulness of the news a user inputs.
• After the user enters a news item, the model makes its prediction using
features extracted by Count Vectorization and TF-IDF.
• Both feature types are useful for assessing the accuracy achieved.
FUTURE SCOPE
• In the future, combining statistical linear models with context-related
metrics can be explored to increase accuracy while keeping time complexity
low. For instance, a complex detection method can be set up with the Passive
Aggressive classifier (PA) as the first screening step.
• Other specialized machine learning techniques that take metadata into
account can also be used to increase accuracy.