Fake News and Their Detection
Data Science and Big Data Analysis
Professor: Antonino Nocera
Team Name: 4V’s
Group members:
Arnold Fonkou
Vignesh Kumar Kembu
Ashina Nurkoo
Seyedkourosh Sajjadi
WELFake
The Fake News Detection (WELFake) dataset contains 72,134 news articles: 35,028 real and 37,106 fake.
This dataset is part of ongoing research on "Fake News Prediction on Social Media Website" within the doctoral program of Mr. Pawan Kumar Verma, and is partially supported by the ARTICONF project funded by the European Union's Horizon 2020 research and innovation programme.
Columns:
- Serial number (starting from 0)
- Title (the news headline)
- Text (the news content)
- Label (0 = fake, 1 = real)
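As a quick sanity check before ingestion, the class balance can be verified directly from the CSV. A minimal sketch with pandas, assuming the source file is named WELFake_Dataset.csv (hypothetical filename):

import pandas as pd

# Load the raw dataset (filename assumed) and check the class balance
df = pd.read_csv("WELFake_Dataset.csv")
print(df["label"].value_counts())  # expected: 37,106 fake (0), 35,028 real (1)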
Architecture
(Diagram) Data Stream → Ingestion (PySpark) → Hadoop (HDFS, MapReduce) → MongoDB → Analysis (Sandbox)
Ingestion
From CSV to JSON
Data Conversion
We converted the file into JSON to be closer to a real-world scenario (a sketch follows after these steps).
Reading Data
Using PySpark
We used the Spark DataFrame API to read our big data.
Saving to Hadoop
Write into Hadoop
We read from the DataFrame and then write it to Hadoop.
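The conversion step itself is not shown on the slides; a minimal PySpark sketch, assuming the source CSV is named WELFake_Dataset.csv (hypothetical filename) and the target is the JSON file read in the next section:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CSV to JSON").getOrCreate()

# Read the original CSV (with header) and rewrite it as JSON
df = spark.read.option("header", "true").csv("WELFake_Dataset.csv")
df.write.json("project_data_sample.json")

Note that Spark writes the result as a directory of JSON Lines part files; the multiline option used in the next section is only needed when reading a single pretty-printed JSON file.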
Reading Section
import findspark
findspark.init()
import pyspark
from pyspark.sql import *

spark = SparkSession.builder \
    .master("local[1]") \
    .appName("PySpark Read JSON") \
    .getOrCreate()

# Reading a multiline JSON file
multiline_dataframe = spark.read.option("multiline", "true") \
    .json("project_data_sample.json")
multiline_dataframe.head()
Saving Section
multiline_dataframe.write.save('/usr/local/hadoop/user3/dsba1.json', format='json')
The data can then be read back and displayed:
# SQLContext is deprecated; reading back through the SparkSession avoids it
df = spark.read.format('json').load('/usr/local/hadoop/user3/dsba1.json')
df.show()
Hadoop
Components: HDFS, MapReduce
Mapper (BoW Creation)
Read Lines
Input Data
The data is given as input lines to the mapper.
Extract Text
Title and Text Extraction
After reading each line as a JSON object, we extract the title and the text of that piece of news.
Tokenize
Word Extraction
We perform some data cleaning and then extract each individual word.
Text Cleaning
import sys
import re
import json

def clean_text(text):
    # Remove URLs, then anything that is not a letter or whitespace
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
    text = re.sub(r'[^a-zA-Z\s]+', '', text)
    return text

Tokenizing
def tokenize(text):
    if not isinstance(text, str):
        text = str(text)
    text = clean_text(text)
    text = text.lower()
    return text.split()

Execution
for line in sys.stdin:
    line = line.strip()
    try:
        json_obj = json.loads(line)
    except json.JSONDecodeError:
        continue
    title = json_obj.get("title", "")
    text = json_obj.get("text", "")
    title_words = tokenize(title)
    text_words = tokenize(text)
    for word in title_words + text_words:
        print(f"{word}\t1")
Reducer
Read Lines
Input Data
The data is given as input lines, each containing two elements.
Initialize Counter
Word and Count Extraction
After reading each line, we split it into a word and its associated count.
Create BoW
Dictionary
Create a dictionary with each word as the key and its accumulated count as the value.
Counter Initialization
import sys
from collections import Counter
import json

bag_of_words = Counter()

Execution
for line in sys.stdin:
    line = line.strip()
    try:
        word, count = line.split("\t")
    except ValueError:
        continue
    count = int(count)
    bag_of_words[word] += count

with open('bow_data.json', 'w') as f:
    json.dump(bag_of_words, f)
Moving to MongoDB
import json
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
db = client['bow']
collection = db['bow_collection']

with open('bow_data.json', 'r') as f:
    bow_data = json.load(f)
collection.insert_one(bow_data)
Performing MapReduce Operation
In the Terminal:
cat db.json | python3 bow_mapper.py | sort | python3 bow_reducer.py
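The shell pipe above simulates the MapReduce job locally. On a real cluster, the same unmodified scripts could be submitted through Hadoop Streaming; a sketch, in which the jar path and the HDFS input/output paths are assumptions:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files bow_mapper.py,bow_reducer.py \
    -mapper "python3 bow_mapper.py" \
    -reducer "python3 bow_reducer.py" \
    -input /user3/db.json \
    -output /user3/bow_output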
HDFS
When dealing with big data, we could partition the dataset into a number of batches instead of saving it as a single file.
Instead of:
multiline_dataframe.write.save('/usr/local/hadoop/user3/dsba1.json', format='json')
Use:
partitioned_df = multiline_dataframe.repartition(4, "Unnamed: 0")
partitioned_df.write.save('/usr/local/hadoop/user3/dsba1.json', format='json')
partition_counts = partitioned_df.rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
print(partition_counts)
# Output: [482, 480, 519, 519]
MongoDB
Create Database
Create a database to contain the data.
Import From Hadoop
Import the JSON file from Hadoop via PySpark.
View Data & Backup
View the data and, if it was inserted correctly, create a backup before starting the modifications.
Clean Data
Remove non-alphanumeric characters.
Display Modified Data
Display the modified content to view the changes.
Creating Database
Use an existing database or create a new one:
>use dsdb_dev
Viewing Data
>use dsdb_dev
>show collections
>db.fake_real_news.find()
>db.fake_real_news.aggregate([{$group: {_id: "$label", rest_number: {$sum: 1}}}])
Creating a Copy
In the Terminal:
mongodump --db dsdb_dev --collection fake_real_news --out /home/ds/Documents/
Importing From Hadoop
In the Terminal:
mongoimport --db dsdb_dev --collection fake_real_news --file /usr/local/hadoop/user3/dsba1.json/part-00000-d1623440-4fde-4b72-b87d-5943bec596d3-c000.json
Importing from Hadoop Using PySpark
# Read a part file back from HDFS and take an 80% sample
df = spark.read.json("/usr/local/hadoop/user3/dsba1.json/part-00000-d1623440-4fde-4b72-b87d-5943bec596d3-c000.json")
sampled_df = df.sample(fraction=0.8, seed=42)

from pymongo import MongoClient
conn = MongoClient()
db = conn.dsdb_dev
collection = db['sampled_data']

# Dump the sample to a JSON Lines file, then insert it into MongoDB
json_data = sampled_df.toJSON().collect()
with open('sampled_data.json', 'w') as file:
    for line in json_data:
        file.write(line + '\n')

import json
with open('sampled_data.json') as file:
    data = file.readlines()
collection.insert_many([json.loads(line) for line in data])
Data Cleaning
>db.fake_real_news.aggregate([
    {'$project': {'_id': 1, 'Unnamed: 0': 1, 'label': 1, 'text': 1, 'title': 1}}
]).forEach(function(doc) {
    if (doc.title) {
        var newTitle = doc.title.replace(/[^a-zA-Z0-9 ]/g, '');
        db.fake_real_news.update({'_id': doc._id}, {'$set': {'title': newTitle}});
    }
});
Modified Content Display
>db.fake_real_news.aggregate([
    {'$project': {'_id': 1, 'Unnamed: 0': 1, 'label': 1, 'text': 1, 'title': 1}}
]);
The file is now ready for word occurrence counting,
which can be done using Jupyter Notebook and
PyMongo.
Backup Restoration
If needed, restore the initial collection:
>db.fake_real_news.drop()
mongorestore --db dsdb_dev --collection fake_real_news /home/ds/Documents/dsdb_dev/fake_real_news.bson
Count the Number of Words
db.fake_real_news.aggregate([
    {'$match': {
        'label': "0"  # keep only fake news (label = 0)
    }},
    {'$project': {
        # Split the lowercase title into an array of words
        'words': {'$split': [{'$toLower': '$title'}, ' ']}
    }},
    {'$unwind': '$words'},  # one document per word
    {'$group': {
        '_id': {'word': '$words'},  # group by word
        'count': {'$sum': 1}
    }},
    {'$project': {
        # Return only the word, its count, and the id
        'word': '$_id.word',
        'count': 1
    }},
    {'$match': {'word': {'$ne': None}}},  # exclude null or missing values
    {'$match': {'$expr': {'$ne': ['$word', '']}}},  # exclude empty strings
    {'$sort': {'count': -1}}
])
Hypotheses
H1
Fake news is written with heavier use of stop words.
Metric: the average number of stop words in the title should be higher for fake news.
H2
Real news should be short and crisp in order to convey its value quickly.
Metric: fake news should be longer than real news.
H1
We used NLTK to extract stop words from the title column and compared the averages between fake and real titles.
The hypothesis is false: as the figure shows, fake titles (label 0) contain fewer stop words on average than real ones (label 1).
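A minimal sketch of the comparison, assuming the titles and labels have been pulled into a pandas DataFrame (the two stand-in rows below replace the real data from MongoDB):

import pandas as pd
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Stand-in rows; in practice df is loaded from MongoDB
df = pd.DataFrame({
    'title': ["You will not believe what they found", "Senate passes the budget bill"],
    'label': [0, 1],
})

# Count stop words per title and compare the class averages
df['stop_count'] = df['title'].apply(
    lambda t: sum(1 for w in str(t).lower().split() if w in stop_words))
print(df.groupby('label')['stop_count'].mean())  # 0 = fake, 1 = real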
H2
The hypothesis is true: as the figures show, fake news (label 0) tends to be longer than real news (label 1).
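The length comparison is a one-liner on the same kind of DataFrame; a sketch with stand-in rows (real data would come from MongoDB):

import pandas as pd

# Stand-in rows; in practice df holds the full news text from MongoDB
df = pd.DataFrame({
    'text': ["a short real story", "a much longer fabricated story " * 10],
    'label': [1, 0],
})
df['text_len'] = df['text'].str.split().str.len()
print(df.groupby('label')['text_len'].mean())  # 0 = fake, 1 = real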
Insights on Data & Pre-processing
To gain quick insights from the data, we used word clouds for the titles overall and for the fake and real subsets.
(Word cloud figures: Real News, Fake News)
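A minimal sketch for generating the per-class word clouds with the wordcloud package, assuming a DataFrame df with title and label columns as in the H1 sketch:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# One word cloud per class, built from the titles
for label, name in [(1, "Real News"), (0, "Fake News")]:
    titles = " ".join(df[df['label'] == label]['title'].astype(str))
    cloud = WordCloud(width=800, height=400, background_color='white').generate(titles)
    plt.imshow(cloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(name)
    plt.show()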
Null Values
The title column contains some null
values, which may cause issues in data
analysis or processing.
We need to fill the null values in the title
column to ensure accurate data analysis.
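A one-line fix suffices here; a sketch, assuming the same DataFrame df with a title column:

# Replace missing titles with an empty string so text processing does not fail
df['title'] = df['title'].fillna('')
assert df['title'].isna().sum() == 0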
Text Normalization
To further prepare the data, we applied text normalization techniques, including converting
the title and text to lowercase and removing punctuation marks.
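A sketch of this normalization step, mirroring the earlier MapReduce cleaning but at the DataFrame level (df with title and text columns assumed):

import string

# Lowercase the text and strip punctuation characters
def normalize(text):
    text = str(text).lower()
    return text.translate(str.maketrans('', '', string.punctuation))

for col in ['title', 'text']:
    df[col] = df[col].apply(normalize)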
Classification Model
For the binary classification of the news, we chose the Random Forest classifier.
The data was split into feature (X) and label (y) variables, and a 67/33 train-test split was performed.
A bag of words was built from the news text (X_train and X_test), removing English stop words.
The labels y_train and y_test hold the class of the news (fake = 0, real = 1).
The training data was fed to a RandomForestClassifier with 500 trees; the model was then evaluated on the test data, and the resulting confusion matrix is shown below.
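A minimal scikit-learn sketch of this pipeline; the 500 trees, English stop-word removal, and 67/33 split come from the slide, while random_state and the use of CountVectorizer for the bag of words are assumptions:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Split features and labels (67/33 train-test)
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.33, random_state=42)

# Bag of words on the news text, removing English stop words
vectorizer = CountVectorizer(stop_words='english')
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# Random forest with 500 trees, evaluated with a confusion matrix
clf = RandomForestClassifier(n_estimators=500, random_state=42)
clf.fit(X_train_bow, y_train)
print(confusion_matrix(y_test, clf.predict(X_test_bow)))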
Thank You For Your Attention!