DETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNING
International Journal of Computer Science & Information Technology (IJCSIT) Vol 11, No 5, October 2019
DOI: 10.5121/ijcsit.2019.11507
Ananya Dey¹, Hamsashree Reddy², Manjistha Dey³ and Niharika Sinha⁴
¹National Institute of Technology, Tiruchirappalli, India
²PES University, Bangalore, India
³RV College of Engineering, Bangalore, India
⁴Manipal Institute of Technology, Karnataka, India
ABSTRACT
With the advent of the Internet and social media, while millions of people have benefitted from the vast sources of information available, there has also been an enormous rise in cyber-crime, particularly crime targeted towards women. According to a 2019 report in the Economic Times [4], India witnessed a 457% rise in cybercrime between 2011 and 2016. Most speculate that this is due to the impact of social media platforms such as Facebook, Instagram and Twitter on our daily lives. While these platforms certainly help in creating a sound social network, creating a user account on them usually requires just an email ID, so a real-life person can create multiple fake IDs and impostors can easily be made. Unlike the real world, where rules and regulations are imposed to identify oneself uniquely (for example, when issuing a passport or driver’s licence), admission to the virtual world of social media requires no such checks. In this paper, we study Instagram accounts in particular and try to classify an account as fake or real using Machine Learning techniques, namely Logistic Regression and the Random Forest algorithm.
KEYWORDS
Logistic Regression, Random Forest Algorithm, Median Imputation, Maximum Likelihood Estimation, k-fold Cross-Validation, Overfitting, Out-of-Bag Data, Recall, Identity Theft, Angler Phishing.
1. INTRODUCTION
Instagram is an online photo and video sharing social networking platform that has been available on both
Android and iOS since 2012. As of May 2019, there are over a billion users registered on Instagram.
In recent years, third-party automated apps, called bots, have been found operating on Instagram. While these can impersonate a user and tarnish their reputation, leading to ‘identity theft’, there have also been growing instances of malicious promotion of a company’s brand image, known as “influencer marketing”. These days a number of businesses use social media to attend to their customers’ needs, which has led to yet another malpractice called Angler phishing, in which attackers pose as a brand’s customer-support account. All these malpractices have made it vital to implement strong fraud-detection techniques, and hence we propose our solution.
2. LITERATURE SURVEY
Previously a lot of work has been done on other platforms like Facebook and Twitter, but not much has been done for Instagram. Each of these social media platforms differs in terms of the features that have to be considered, the strategies used, etc. Some of the past work includes: [1] Worth its Weight in Likes: Towards Detecting Fake Likes on Instagram: this paper concentrates on analysing likes and identifying the genuine ones, so as to reduce the effect of fake likes on the Instagram influencer market. The authors used a simple feed-forward neural network, a Multi-Layer Perceptron (MLP), which obtained a precision of about 83%. [2] Identifying Fake Profiles in LinkedIn: a number of features were considered to train the dataset using neural networks, SVMs and principal component analysis; the precision achieved was 84%. [3] Detection of Fake Profiles in Social Media: a literature review of approaches to detecting fake social media accounts, classified into approaches aimed at analysing individual accounts. Our proposed method is thus a novel approach in terms of the platform chosen and the algorithms used, such as Random Forest for classification.
Figure 1: Proposed System
3. MATERIALS AND METHODS
In this section we present the materials and methods used for the research work.
Data set information:
The dataset has been taken from https://www.kaggle.com/free4ever1/instagram-fake-spammer-genuine-accounts. It consists of two CSV files: train.csv (19 KB) and test.csv (4 KB). The dependent variable, which indicates whether an account is fake or not, is categorical and takes two values: 0 (not fake) and 1 (fake). The distribution of the training dataset is such that 50% of the profiles are fake and the remaining 50% are legitimate. Below is a table denoting the parameters that have been considered (in the column Profile feature), their ranges of values, their mean values and what each feature denotes.
Table 1: Dataset Features
Figure 2: Snapshot of Training Dataset
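The loading step can be sketched as follows. The Kaggle CSV files are not bundled here, so a small inline sample stands in for train.csv, and the column names (including the target column "fake") are assumptions based on Table 1, not confirmed by the paper:

```python
# Sketch: load the training data and verify the 50/50 class balance
# described above. Column names here are illustrative assumptions.
import io
import pandas as pd

sample_train_csv = io.StringIO(
    "#posts,#followers,#follows,fake\n"
    "12,300,250,0\n"
    "0,5,900,1\n"
    "45,1200,400,0\n"
    "1,10,1500,1\n"
)
train = pd.read_csv(sample_train_csv)   # in practice: pd.read_csv("train.csv")
print(train["fake"].value_counts(normalize=True))   # proportions of each class
```

With the real train.csv, the proportions printed would both be 0.5, matching the balanced split described above.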
Exploratory Data Analysis: This is a critical initial data-investigation process carried out to discover patterns in the dataset and spot anomalies with the help of summary statistics and graphical representations. The various sub-processes performed are described below.
a. Missing Value Treatment
The given dataset had no missing values. Missing values occur in a dataset mostly due to real-world data collection problems and can be treated either through deletion or imputation. The presence of missing values reduces the amount of data available for analysis, compromising the statistical power of the study and eventually the reliability of its results.
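The two standard treatments mentioned above can be sketched in pandas (the column names below are illustrative, not taken from the paper's dataset):

```python
# Sketch: deletion vs. imputation for missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({"#posts": [10, np.nan, 25], "#followers": [100, 250, np.nan]})

dropped = df.dropna()                               # deletion: discard incomplete rows
imputed = df.fillna(df.median(numeric_only=True))   # imputation: fill with column medians

print(df.isna().sum().sum())       # 2 missing cells before treatment
print(dropped.isna().sum().sum(), imputed.isna().sum().sum())   # 0 0
```

Deletion shrinks the dataset (here from 3 rows to 1), which is why imputation is often preferred when data is scarce.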
b. Outlier Detection
Outliers are extreme values that deviate from the usual data values in the dataset. If outliers are present in the dataset, accuracy is reduced significantly because the model learns from this noise in the training data and could become over-fit. After careful analysis using graphs, we concluded that the following features had outliers present in them: nums/length username, full name words, description length, #posts, #followers and #follows. To deal with these outliers, we used median imputation, calculating the median of each of these sets of values. Note that we do not include the outliers in the median calculation. Once we obtain the median, we replace all the outliers with the calculated median value.
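This median-imputation step can be sketched as below. The paper identifies outliers graphically without naming a numeric rule, so the 1.5 × IQR cutoff used here is an assumption; the key detail from the text, that the median is computed over the non-outlier values only, is preserved:

```python
# Sketch: replace outliers with the median of the non-outlier values.
import numpy as np

def impute_outliers_with_median(x):
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mask = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)   # flag outliers (assumed rule)
    median = np.median(x[~mask])                          # median excludes outliers
    x[mask] = median
    return x

followers = [120, 150, 130, 140, 100000]   # one extreme value
print(impute_outliers_with_median(followers))
```

Here the extreme follower count 100000 is replaced by 135, the median of the remaining values.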
c. Bivariate Analysis
This is done to understand the relationship between two variables and the strength of the association between them. We calculated the correlation matrix and concluded the absence of high multicollinearity between the variables. This is one of the assumptions to be satisfied before building a logistic regression model.
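The multicollinearity check can be sketched as follows. The paper does not state a correlation cutoff, so the 0.8 threshold and the synthetic predictors below are assumptions:

```python
# Sketch: flag predictor pairs with high absolute correlation.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "#posts": rng.normal(size=200),
    "#followers": rng.normal(size=200),
    "description length": rng.normal(size=200),
})
corr = X.corr().abs()
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and corr.loc[a, b] > 0.8]   # assumed cutoff of 0.8
print(high)   # an empty list means no high multicollinearity
```

An empty list here corresponds to the paper's conclusion that no high multicollinearity is present.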
Once data pre-processing is done, we can safely move on to the algorithms. We have been provided with a labelled training dataset and can therefore proceed with applying supervised learning algorithms that map the input to the output. For the scope of this paper we have considered two commonly used classification algorithms, i.e., Logistic Regression and the Random Forest algorithm. Each of them is explained in depth below.
1. Logistic Regression
The assumptions of this model include the absence of outliers in the dataset and the absence of high correlations between the predictors, both of which have been taken care of in the preceding steps. In logistic regression, the probabilities are predicted using the logit function: values greater than or equal to the decision boundary belong to one class, while values lower than it belong to the other.
We first run the glm function in R to perform the regression and find the beta coefficients and p-values for each of the features. The beta coefficient, which is calculated based on maximum likelihood estimation, is an indicator of how strongly the predictor variable influences the dependent variable. Based on the p-values, we remove those variables whose values are greater than 0.05 and re-run the model. Finally, we performed K-fold cross-validation to check for overfitting.
Figure 3: Model Summary
2. Random Forest Algorithm
This is an ensemble learning model that performs well on classification tasks. Since very few assumptions are attached to it, data preparation is less challenging. We used the randomForest function in
R and set the ntree parameter (denoting the number of trees) to 500 and the number of variables tried at each split to 3. Random record selection is the first task: each tree is trained on roughly ⅔ of the total training data, and some variables (say m) are selected at random out of all the variables; these m variables are used to split the node. For each tree, the misclassification rate is calculated using the leftover (about 36.8%) out-of-bag data. This gives us the Out-Of-Bag (OOB) error rate. The forest chooses the classification having the most votes over all the trees in the forest. This is the RF score, and the percentage of YES votes received is the predicted probability.
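The same setup can be sketched with scikit-learn (the paper uses R's randomForest; ntree = 500 and the 3 variables per split map to n_estimators and max_features, and the synthetic dataset below is an illustrative stand-in):

```python
# Sketch: random forest with out-of-bag error, mirroring the R settings above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=500, max_features=3,
                            oob_score=True, random_state=0).fit(X, y)
print(round(1 - rf.oob_score_, 3))   # OOB error rate from each tree's leftover data
```

Setting oob_score=True makes each tree score the ~36.8% of records it never saw during training, giving the OOB error estimate described above without a separate validation split.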
4. RESULTS AND ANALYSIS
In this section we present the results of the two classification models.
After building the models using the training dataset, we apply them to unseen data, i.e., the test dataset. We create the confusion matrix based on these predictions and calculate various performance metrics as discussed below.
1. Logistic Regression
Figure 4: Confusion matrix for Logistic Regression (T = True, F=False, P=Positive, N=Negative)
We calculate the different model metrics as follows:
Precision: the total number of correctly classified positive examples divided by the total number of predicted positive examples.
Precision = TP / (TP + FP) = 57 / (57 + 8) = 87.6 %
Recall: the total number of correctly classified positive examples divided by the total number of actual positive examples.
Recall = TP / (TP + FN) = 57 / (57 + 3) = 95 %
F1 score: the harmonic mean of precision and recall.
F1 score = (2 × Precision × Recall) / (Precision + Recall) = 91.15 %
Accuracy: calculated using the following formula.
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (57 + 52) / (57 + 52 + 8 + 3) = 90.8 %
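The four metrics can be computed directly from the confusion-matrix counts in Figure 4 (TP = 57, TN = 52, FP = 8, FN = 3). Note that computing from the raw counts gives 87.7 and 91.2 rather than the 87.6 and 91.15 quoted in the text; the small differences come from rounding intermediate values:

```python
# Sketch: classification metrics from raw confusion-matrix counts.
def metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

p, r, f1, acc = metrics(tp=57, tn=52, fp=8, fn=3)
print(round(p * 100, 1), round(r * 100, 1), round(f1 * 100, 1), round(acc * 100, 1))
# → 87.7 95.0 91.2 90.8
```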
Based on the K-fold cross-validation curve, we infer that using k = 3 gives optimal results.
Figure 5: ROC Curve for Logistic Regression
2. Random Forest Algorithm
Figure 6: Confusion matrix for Random Forest (T = True, F = False, P = Positive, N = Negative)
On a similar note, we calculate the model metrics for the Random Forest algorithm.
Precision = TP / (TP + FP) = 55 / (55 + 4) = 93.2 %
Recall = TP / (TP + FN) = 55 / (55 + 5) = 91.6 %
F1 score = (2 × Precision × Recall) / (Precision + Recall) = 92.42 %
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (55 + 56) / (55 + 56 + 4 + 5) = 92.5 %
Figure 7: Variable Importance Graph
5. CONCLUSIONS
While going through previous research on the detection of fake profiles on social media platforms, we realized that not much has been done on Instagram as a social network platform in particular. Hence, we targeted our approach at the same. In this paper, we introduced a novel approach for detecting fake user profiles on Instagram based on certain features, using concepts of machine learning. We used two models for this, Logistic Regression and Random Forest, achieving accuracies of 90.8% and 92.5% respectively. Such high accuracies have not been attained in previous work conducted on other social media platforms (the highest accuracy achieved before this was 86%).
REFERENCES
[1] Indira Sen, Anupama Aggarwal, Shiven Mian. 2018. "Worth its Weight in Likes: Towards Detecting Fake Likes on Instagram". In ACM International Conference on Information and Knowledge Management.
[2] Shalinda Adikari, Kaushik Dutta. 2014. "Identifying Fake Profiles in LinkedIn". In Pacific Asia Conference on Information Systems.
[3] Aleksei Romanov, Alexander Semenov, Oleksiy Mazhelis and Jari Veijalainen. 2017. "Detection of Fake Profiles in Social Media". In 13th International Conference on Web Information Systems and Technologies.
[4] https://telecom.economictimes.indiatimes.com/news/india-saw-457-rise-in-cybercrime-in-fiveyears-study/67455224
[5] Todor Mihaylov, Preslav Nakov. 2016. "Hunting for Troll Comments in News Community Forums". In Association for Computational Linguistics.
[6] Ml-cheatsheet.readthedocs.io. 2019. Logistic Regression — ML Cheatsheet documentation. [Online] Available at: https://ml-cheatsheet.readthedocs.io/en/latest/logistic_regression.html#binarylogistic-regression [Accessed 10 Jun. 2019].
[7] Schoonjans, F. 2019. ROC curve analysis with MedCalc. [Online] MedCalc. Available at: https://www.medcalc.org/manual/roc-curves.php [Accessed 10 Jun. 2019].
[8] Kietzmann, J.H., Hermkens, K., McCarthy, I.P., Silvestre, B.S., 2011. Social media? Get serious! Understanding the functional building blocks of social media. Bus. Horiz., SPECIAL ISSUE: SOCIAL MEDIA 54, 241–251. doi:10.1016/j.bushor.2011.01.005.
[9] Krombholz, K., Hobel, H., Huber, M., Weippl, E., 2015. Advanced Social Engineering Attacks. J Inf Secur Appl 22, 113–122. doi:10.1016/j.jisa.2014.09.005.