BAHIR DAR UNIVERSITY
BAHIR DAR INSTITUTE OF TECHNOLOGY
SCHOOL OF RESEARCH AND POST GRADUATE STUDIES
FACULTY OF COMPUTING
Department of Information Technology
Graduate Seminar
Title: Twitter Fake Account Detection
Submitted by: Fentanesh Bezie BDU1300731
fantiebez@gmail.com
Submitted to: Mr. Arham D (Asst. Prof.)
MAY 11, 2021
BAHIR DAR, ETHIOPIA
Abstract
Millions of people use social networking sites like Twitter and Facebook, and their interactions
with these sites have influenced their lives. This prevalence in social networking has resulted in a
number of issues, including the likelihood of false information being exposed to users through fake
accounts, resulting in the spread of malicious material. This situation has the potential to cause
significant harm to society in the real world. The researcher investigated and presented a tool for
detecting fake Twitter accounts, and analyzed the results of the Naïve Bayes algorithm after
processing the dataset with a supervised discretization technique known as Entropy
Minimization Discretization (EMD) on numerical features.
Keywords- machine learning; social media; Twitter; spam detection; fake detection
1. Introduction
Over the past two decades, social networking has grown explosively. Its various forms
have spawned a range of online activities that attract a vast number
of users. At the same time, social networking sites suffer from a growing number of fake
accounts.
The term "fake accounts" refers to accounts that do not belong to real people. Fake accounts can
spread false information, deceptive web ratings, and spam. Users with fake accounts engage in
prohibited behavior and violate Twitter's rules. Such behavior includes automated account interactions and
attempts to deceive or confuse people, for example posting harmful links, creating multiple accounts,
posting repeatedly on the same subject or posting duplicates, posting links with unrelated tweets, and
abusing the reply and mention features.
Real accounts are those that follow Twitter's rules. Tweets may be sent as e-mail attachments or
as SMS text messages. Twitter allows users to send and receive 140-character messages directly
from their smartphones through a variety of Web-based services. Twitter disseminates information
to a vast number of users in real time.
Millions of people use social networking sites like Twitter and Facebook, and their interactions
with these sites have influenced their lives. This prevalence in social networking has resulted in a
number of issues, including the likelihood of false information being exposed to users through fake
accounts, resulting in the spread of malicious material. This situation can cause significant harm
to society in the real world.
Spammers are a major issue on social media since they can use fake identities for a variety of
purposes. One of these purposes is spreading rumors, which can have a significant impact on a specific
company or even the whole community. Given the impact of social media on society, the researcher
detects fake profile accounts on the Twitter online social network to prevent the dissemination of
fake news, advertisements, and fake followers.
2. Objective
The aim of this study is to identify fake Twitter profile accounts in order to prevent the
dissemination of false information, advertisements, and fake followers.
3. Methodology
3.1 Tools and Techniques
The researcher uses the Naïve Bayes algorithm and the Entropy Minimization Discretization (EMD)
technique.
3.1.1 Naïve Bayes algorithm
Naïve Bayes is a supervised learning algorithm that is fast and simple to understand. It performs
well in multi-class prediction and tends to work better with categorical input variables than with
numerical ones. Since Naïve Bayes classifiers have a high success rate in text classification, they
are commonly used in spam filtering and sentiment analysis.
The predictive attributes are assumed to be conditionally independent given the class. Let C be a
random variable that represents an instance's class, and X be a vector of random variables that
represents the attribute values. Let c stand for a specific class label, and x for a specific vector
of attribute values.
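Under the conditional independence assumption, the classifier assigns an instance to the class c that maximizes P(C = c) multiplied by the product of P(X_i = x_i | C = c) over all attributes. The following is a minimal sketch of that rule for categorical attributes, with add-one (Laplace) smoothing added as an assumption so that unseen values do not produce zero probabilities. The attribute values and toy training rows are hypothetical and are not taken from the study's dataset.

```python
from collections import Counter, defaultdict
import math


class CategoricalNaiveBayes:
    """Naive Bayes for categorical features with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.n = len(labels)
        self.class_counts = Counter(labels)
        # value_counts[(feature index, class)][value] = number of occurrences
        self.value_counts = defaultdict(Counter)
        # Distinct values seen per feature, needed for the smoothing denominator.
        self.values = [set() for _ in range(len(rows[0]))]
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[(i, label)][value] += 1
                self.values[i].add(value)
        return self

    def log_posterior(self, row, label):
        # Unnormalized: log P(c) + sum_i log P(x_i | c)
        score = math.log(self.class_counts[label] / self.n)
        for i, value in enumerate(row):
            count = self.value_counts[(i, label)][value]
            total = self.class_counts[label]
            score += math.log((count + 1) / (total + len(self.values[i])))
        return score

    def predict(self, row):
        return max(self.class_counts, key=lambda c: self.log_posterior(row, c))


# Hypothetical toy data: (profile image, follower level) -> class
rows = [
    ("default_image", "few_followers"),
    ("default_image", "few_followers"),
    ("custom_image", "many_followers"),
    ("custom_image", "many_followers"),
    ("custom_image", "few_followers"),
]
labels = ["fake", "fake", "real", "real", "real"]

model = CategoricalNaiveBayes().fit(rows, labels)
prediction_a = model.predict(("default_image", "few_followers"))
prediction_b = model.predict(("custom_image", "many_followers"))
```

Working in log space avoids numerical underflow when many attribute probabilities are multiplied together.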
3.1.2 Entropy minimization discretization (EMD) technique
EMD is a supervised discretization technique. It evaluates candidate cut points, which are the
midpoints of each pair of adjacent values in the sorted data. To evaluate a cut point, the data is
divided into two intervals and the class information entropy is calculated. The candidate with the
minimum entropy is selected. This process is repeated recursively, always selecting the best cut
point, and a minimum description length (MDL) criterion is applied to decide when to stop. The
researcher used this technique for the experiments because of its success. Given a set of instances
S, a feature A, and a partition boundary T, the class information entropy of the partition induced
by T, E(A, T, S), is given by Equation (2):
\[ E(A, T, S) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2) \tag{2} \]
where S_1 and S_2 are the subsets of S on either side of T and Ent denotes the class entropy of a subset.
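As an illustration of a single step of this procedure, the sketch below scans the midpoints between adjacent sorted values of one numerical feature and returns the cut point with the lowest class information entropy E(A, T, S), computed as the size-weighted average of the class entropies of the two resulting intervals. The recursion and the MDL stopping criterion are deliberately left out, and the follower counts used as toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Class entropy Ent(S) in bits."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def best_cut_point(values, labels):
    """Return (T, E) minimizing the class information entropy E(A, T, S)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        left_value, right_value = pairs[i - 1][0], pairs[i][0]
        if left_value == right_value:
            continue  # no boundary between equal values
        cut = (left_value + right_value) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Size-weighted average of the entropies of the two intervals.
        e = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if best is None or e < best[1]:
            best = (cut, e)
    return best


# Hypothetical follower counts and class labels.
followers = [3, 5, 8, 120, 150, 400]
classes = ["fake", "fake", "fake", "real", "real", "real"]
cut, cut_entropy = best_cut_point(followers, classes)
```

In this toy example the two classes are perfectly separable, so the selected boundary lies midway between 8 and 120 and both intervals are pure.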
3.2 Dataset preparation
For the experiments, the researcher created a dataset using the Twitter API. Twitter
exposes its data, such as tweets and several attributes about tweets, through the
API. Requests can be made to the Twitter API from a server-side scripting language, and results
are returned in JSON format, which is easy to parse. There are four main objects in the Twitter API:
Tweets, Users, Entities, and Places. Each of these objects has many attributes. The researcher
selected 16 attributes as features for the Naïve Bayes learning algorithm.
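To make the JSON-to-feature step concrete, the sketch below parses a user object and flattens a few of its fields into a feature dictionary. The field names follow the Twitter API v1.1 user object, but the sample response is hand-written with hypothetical values, and the fields shown are an illustrative selection, not the 16 attributes used in the study. The `followers_per_friend` ratio is a derived feature assumed here for illustration only.

```python
import json

# Hand-written sample shaped like a Twitter API v1.1 user object (hypothetical values).
sample_response = """
{
  "screen_name": "example_user",
  "description": "",
  "followers_count": 12,
  "friends_count": 1850,
  "statuses_count": 40,
  "default_profile_image": true
}
"""


def extract_features(user):
    """Map a user object (dict) to a flat feature dict."""
    followers = user.get("followers_count", 0)
    friends = user.get("friends_count", 0)
    return {
        "followers_count": followers,
        "friends_count": friends,
        "statuses_count": user.get("statuses_count", 0),
        "has_description": bool(user.get("description")),
        "default_profile_image": bool(user.get("default_profile_image", False)),
        # Derived ratio; +1 avoids division by zero for accounts following nobody.
        "followers_per_friend": followers / (friends + 1),
    }


features = extract_features(json.loads(sample_response))
```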
The researcher prepared the dataset for the experiments. The data were labeled manually by
three individuals, and only the accounts on which all three agreed (the intersection of their
decisions) were put in the dataset. Class decisions were made by examining the username, background
image, profile image, follower and friend counts, account description, number of tweets, and content
of the tweets. In total, data for 501 fake and 499 real accounts were collected. The evaluation
metrics are accuracy, F-measure, and the confusion matrix.
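One way to read the labeling rule is that an account enters the dataset only when all three annotators assign it the same class. A minimal sketch of that filter follows, assuming this interpretation. The account identifiers and labels are hypothetical.

```python
def consensus_labels(annotations):
    """Keep only accounts for which every annotator gave the same label.

    annotations: dict mapping account id -> list of labels, one per annotator.
    """
    return {
        account: labels[0]
        for account, labels in annotations.items()
        if len(set(labels)) == 1
    }


# Hypothetical account ids and labels from three annotators.
annotations = {
    "account_a": ["fake", "fake", "fake"],
    "account_b": ["real", "fake", "real"],  # disagreement: dropped
    "account_c": ["real", "real", "real"],
}
dataset = consensus_labels(annotations)
```

Dropping disputed accounts trades dataset size for label reliability.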
First experiment: applying the Naïve Bayes learning algorithm to the dataset using all
attributes without discretization. In the first experiment, 861 of the 1000 instances
were classified correctly (86.1% accuracy); 112 of 501 fake accounts were classified as real
and 27 of 499 real accounts were classified as fake. The weighted average F-measure is 0.860.
Second experiment: applying the Naïve Bayes learning algorithm to the dataset after
discretization. In the second experiment, 909 of the 1000 instances were classified correctly
(90.9% accuracy); 60 of 501 fake accounts were classified as real and 31 of 499 real accounts
were classified as fake. The weighted average F-measure increased to 0.909.
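The reported figures can be checked directly from the misclassification counts. The sketch below computes accuracy and the support-weighted F-measure from a 2x2 confusion matrix, assuming the F-measure is the usual F1 score averaged over the two classes with weights proportional to class size. With the counts given for the two experiments it yields accuracies of 86.1% and 90.9% and weighted F-measures that round to 0.860 and 0.909. The counts for the second experiment (60 and 31 errors) correspond to 909 correctly classified instances.

```python
def evaluate(fake_as_fake, fake_as_real, real_as_real, real_as_fake):
    """Accuracy and support-weighted F-measure from a 2x2 confusion matrix."""
    total = fake_as_fake + fake_as_real + real_as_real + real_as_fake
    accuracy = (fake_as_fake + real_as_real) / total

    def f_measure(tp, fp, fn):
        # F1 = 2TP / (2TP + FP + FN)
        return 2 * tp / (2 * tp + fp + fn)

    f_fake = f_measure(fake_as_fake, real_as_fake, fake_as_real)
    f_real = f_measure(real_as_real, fake_as_real, real_as_fake)
    n_fake = fake_as_fake + fake_as_real
    n_real = real_as_real + real_as_fake
    weighted_f = (n_fake * f_fake + n_real * f_real) / total
    return accuracy, weighted_f


# Experiment 1: 112 of 501 fake misclassified, 27 of 499 real misclassified.
acc1, f1 = evaluate(501 - 112, 112, 499 - 27, 27)
# Experiment 2: 60 of 501 fake misclassified, 31 of 499 real misclassified.
acc2, f2 = evaluate(501 - 60, 60, 499 - 31, 31)
```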
4. Critiques
4.1 Strengths
✓ The researcher uses many attributes from the user object, which contains account-wide details,
so fake accounts can be spotted quickly.
4.2 Weaknesses
✓ The data were labeled manually, so labeling errors may have occurred.
✓ The sample is small. Collecting more data would likely improve the
performance.
References
Yazan Boshmaf et al. (February 2015). Íntegro: Leveraging Victim Prediction for Robust Fake
Account Detection in OSNs. 15, 8-11.
Buket Erşahin, Özlem Aktaş, Deniz Kılınç, and Ceyhun Akyol. (2017). Twitter Fake
Account Detection. IEEE.
Supraja Gurajala, Joshua S. White, Brian Hudson, and Jeanna N. Matthews. (15, July 27). Fake
Twitter accounts: Profile characteristics obtained using an activity-based pattern detection
approach.
Vladislav Kontsevoi, Naim Lujan, and Adrian Orozco. (May 14, 2014). Detecting Subversion
of Twitter.