Tunisian Republic
Ministry of Higher Education
and Scientific Research
University of Tunis El Manar
Higher Institute of Computer Science
Master's Thesis
Presented in order to obtain the
Master's Degree in Information and Technology
Mention: Information and Technology
Specialty: Software Engineering (GL)
By:
Wajdi KHATTEL
Proposal of a Terrorist Detection Model in
Social Networks
Presented on 07.12.2019
In front of a jury composed of:
President:
Evaluator:
Academic supervisor:
Laboratory supervisor:
Najet AROUS
Olfa EL MOURALI
Ramzi GUETARI
Nour El Houda BEN CHAABENE
Realized within
Academic year: 2018-2019
Laboratory Supervisor
Academic Supervisor
I authorize the student to submit his internship report for a defense
Signature
I authorize the student to submit his internship report for a defense
Signature
On 22/11/2019
Ramzi Guetari
On 22/11/2019
Nour El Houda Ben Chaabene
Dedications
I want to dedicate this humble work to:
My parents Abderraouf and Sonia for all the pain they have been through and all the
sacriļ¬ces they made in order for me to reach this level and for me to be what I am today.
To my sister Yosra and her husband Jamel for their patience, continuous support and
care.
To all the members of my family and my dearest friends for the best times and laughs
we had, and for sticking by my side when I needed it.
For all those I love and all those who love me, and to all who helped whom I forgot to mention.
With Love,
Wajdi Khattel.
Acknowledgements
I would like ļ¬rst to thank and express my very profound gratitude to my academic advisor,
Mrs. Nour EL Houda BEN CHAABENE for the huge eļ¬€ort and sacriļ¬ce she gave the entire
time and also for believing in our capacities and her patience, motivation, and immense
knowledge. Her guidance helped us in all the time of research and writing of this thesis.
My academic Professor, Mr. Ramzi GUETARI, for his big support and generosity and his
continuous welcome in his oļ¬ƒce that was always open whenever I ran into a trouble spot or
had a question about our research, and steering us in the right direction whenever I needed it.
Also anyone who contributed to this work for the support, even spiritually especially the last
couple of weeks.
With Gratitude
Wajdi Khattel.
Table of Contents
General Introduction 1
I State of the art 3
1 Anomaly Detection in Social Media . . . . . . . . . . . . . . . . . . . . . . . 4
1.1 Activity-based Detection . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Graph-based Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Terrorist Detection in Social Media . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Existing Content-based Models . . . . . . . . . . . . . . . . . . . . . 11
2.2 Existing Graph-input Analysis . . . . . . . . . . . . . . . . . . . . . 13
II Existing Techniques 16
1 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.1 Textual-Content Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.1.1 Text Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.1.2 Data Representation . . . . . . . . . . . . . . . . . . . . . . 20
1.2 Image-Content Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.2.1 CNN: Convolutional Layer . . . . . . . . . . . . . . . . . . 23
1.2.2 CNN: Pooling Layer . . . . . . . . . . . . . . . . . . . . . . 25
1.2.3 CNN: Fully-Connected Layer . . . . . . . . . . . . . . . . . 26
1.3 Numerical-Content Data . . . . . . . . . . . . . . . . . . . . . . . . . 26
2 Data Classiļ¬cation in Machine Learning . . . . . . . . . . . . . . . . . . . . 26
2.1 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
III Proposed Model 29
1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.1 Offline Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.2 Online Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2 Proposed Model Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.1 Model Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2 Content-Based Classification . . . . . . . . . . . . . . . . . . . . . . 34
2.2.1 Text Classification Model . . . . . . . . . . . . . . . . . . . 34
2.2.2 Image Classification Model . . . . . . . . . . . . . . . . . . 36
2.2.3 General Information Classification Model . . . . . . . . . . 37
2.3 Decision Making . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.4 Global Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
IV Implementation and Results 43
1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
1.1 Offline Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
1.1.1 Textual-Content Data . . . . . . . . . . . . . . . . . . . . . 44
1.1.2 Image-Content Data . . . . . . . . . . . . . . . . . . . . . . 46
1.1.3 General Information Data . . . . . . . . . . . . . . . . . . . 48
1.2 Online Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
1.2.1 Facebook Data . . . . . . . . . . . . . . . . . . . . . . . . . 49
1.2.2 Instagram Data . . . . . . . . . . . . . . . . . . . . . . . . . 51
1.2.3 Twitter Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2 Model Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.1 Text Classiļ¬cation Model . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.1.1 NLP Process . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.1.2 Data Vectorization . . . . . . . . . . . . . . . . . . . . . . . 54
2.1.3 Data Classiļ¬cation . . . . . . . . . . . . . . . . . . . . . . . 55
2.2 Image Classiļ¬cation Model . . . . . . . . . . . . . . . . . . . . . . . 56
2.2.1 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . 56
2.2.2 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . 57
2.3 General Information Classification Model . . . . . . . . . . . . . 58
2.4 Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3 Results Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
V Conclusions and Perspectives 63
Bibliography 65
List of Figures
I.1 Uniļ¬ed User Proļ¬ling (UUP) system with cyber security perspective . . . . 6
I.2 User Proļ¬ling Method in Authorization Logs . . . . . . . . . . . . . . . . . 7
I.3 Context-aware graph-based approach framework . . . . . . . . . . . . . . . 8
I.4 Forum user proļ¬ling approach framework . . . . . . . . . . . . . . . . . . . 9
I.5 Transfer-Learning CNN Framework . . . . . . . . . . . . . . . . . . . . . . . 12
I.6 Multidimensional Key Actor Detection Framework . . . . . . . . . . . . . . 14
II.1 An example of morphemes extraction . . . . . . . . . . . . . . . . . . . . . . 18
II.2 An example of syntax analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 19
II.3 An example of semantic network . . . . . . . . . . . . . . . . . . . . . . . . 20
II.4 Curved Edge Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
III.1 Multi-dimensional Network . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
III.2 Text Classiļ¬cation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
III.3 Image Classiļ¬cation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
III.4 General Information Classiļ¬cation Model . . . . . . . . . . . . . . . . . . . 38
III.5 Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
III.6 Model Workļ¬‚ow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
IV.1 Twitter Searching Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
IV.2 Sample of news headlines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
IV.3 Word Cloud of our Textual Data . . . . . . . . . . . . . . . . . . . . . . . . . 46
IV.4 Sample of Terrorists images . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
IV.5 Sample of Military/News images . . . . . . . . . . . . . . . . . . . . . . . . 48
IV.6 Age Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
IV.7 Relationship Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
IV.8 Gender Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
IV.9 Facebook Graph API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
IV.10 An example of data augmentation . . . . . . . . . . . . . . . . . . . . . . 57
IV.11 Class Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
List of Tables
I.1 Anomaly detection existing works comparison . . . . . . . . . . . . . . . . 10
I.2 Activity-based techniques comparison . . . . . . . . . . . . . . . . . . . . . 13
II.1 Comparison of word embedding methods . . . . . . . . . . . . . . . . . . . 22
IV.1 Textual-Content Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
IV.2 Image-Content Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
IV.3 Text Models Metric Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
IV.4 Image Models Metric Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
IV.5 General Information Models Metric Scores . . . . . . . . . . . . . . . . . . . 59
IV.6 Model Testing Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Acronyms
UUP Uniļ¬ed User Proļ¬ling
CERT Computer Emergency Response Team
NATOPS Naval Air Training and Operating Procedures Standardization
SVM Support Vector Machine
SMP Social Media Processing
M-SN Multiple Social Networks
M-IT Multiple Input Types
T-UBC Time-based User Behavior Changes
T-FBC Time-based Future Behavior's Changes
ISIS Islamic State of Iraq and Syria
URL Uniform Resource Locator
LSTM Long Short-Term Memory
CNN Convolutional Neural Network
API Application Program Interface
GTD Global Terrorism Database
GDELT Global Data on Events Location and Tone
SNA Social Network Analysis
NLP Natural Language Processing
FOPL First Order Predicate Logic
TF-IDF Term Frequency-Inverse Document Frequency
CBOW Continuous Bag Of Words
RGB Red, Green, Blue
START Study of Terrorism And Responses to Terrorism
PIRUS Proļ¬les of Individual Radicalization In the United States
HTTP HyperText Transfer Protocol
RDF Resource Description Framework
REST REpresentational State Transfer
JSON JavaScript Object Notation
SDK Software Development Kit
RAM Random Access Memory
CPU Central Processing Unit
GPU Graphics Processing Unit
NLTK Natural Language ToolKit
DA Data Augmentation
TL Transfer Learning
VGG Visual Geometry Group
FB Facebook
IG Instagram
T Twitter
General Introduction
The emergence of social networks has made communication and idea sharing
easier. Several of them have become some of the most popular sources of
information, namely Facebook, Twitter, LinkedIn, etc.
Within the last decade, the number of people across the world using these websites
has kept increasing, surpassing a billion active users per day [1]. Most of these
users are there to interact with friends and family and to meet new people who share
their interests. Other users, such as business owners, are there to communicate with their
target audience, promote their brand or receive feedback from customers.
Although this facilitated communication could be used in a friendly way, other
users, such as bullies, spammers and hackers, exploit it in harmful ways. One of
the most dangerous categories is terrorist groups, who are among those who profit
most from this advantage: inciting other people, promoting their groups and planning
attacks has become very simple for them.
Detecting these groups accurately and quickly has become one of the most
important tasks for social network owners. Several approaches and methods have been
proposed to that end, such as manual monitoring and firewalls. But as the number of
those individuals keeps increasing, accurate and fully automated approaches must be
used. Fortunately, the evolution of new technologies, especially the rise of machine
learning, has made that task easier.
In this thesis, we propose a model that learns the characteristics that describe a
terrorist individual. Additionally, the model learns by itself the new characteristics that
define terrorist behaviors, since the abnormal behaviors of our sociocultural environment
change over time.
The ļ¬rst chapter presents some existing works that deal with the issue of anomaly
detection in general and terrorism detection in particular, to give the reader a general
idea of the research carried out in this domain.
The second chapter presents some existing techniques in the machine learning
domain in order for us to implement our proposed model.
The third chapter introduces the basis of our proposed model from a theoretical
perspective so that we can implement the model's design.
The fourth chapter presents the practical part of our work, where we go through the
pipeline of our model's implementation and discuss the results.
Finally, we end with a general conclusion and perspectives.
I. State of the art
This chapter presents an overview of some existing works that deal with the issue of
anomaly detection in general and terrorism detection in particular. We begin by
defining the concept of anomaly and highlighting the importance of its application in the
social media area. We then present an overview of some applied anomaly detection
and terrorist detection works, categorized by their input format. The purpose of this
chapter is to give the reader a general idea of the research carried out in the detection of
anomalies and terrorism.
Introduction
Social media's main objective is to provide a platform for people to communicate
and share their thoughts. Although most users use it in a friendly way, many others
can take advantage of this ease of communication to plan attacks or incite others to
adopt extremist behaviors. Therefore, it is extremely important to detect these users
accurately and quickly. Such users are often referred to as anomalies due to their
abnormal behavior.
Chapter I. State of the art
Abnormal behaviors are behaviors that differ from, or follow an unusual pattern
compared to, what is defined as normal sociocultural behavior.
Our main objective behind this research is to study the characteristics that describe
an anomalous individual. In social media, however, an anomalous user will certainly
try to hide his anomalousness; time is therefore important, since we will be looking for
peaks and deviations from his/her usual behavior pattern. Nevertheless, what is considered
abnormal in today's sociocultural context could become normal after a period of time;
thus, we should take the behavior's evolution into consideration when defining abnormal
behavior.
Different models and approaches have been proposed toward solving this problem. Based
on their input format, we can categorize them into activity-based detection, where the
input data is the user's activity, and graph-based detection, where the input data is a
graph of multiple users.
However, "anomaly" itself is too abstract a term; this motivated us to work on a single
concrete type of anomaly: terrorism. To consider an individual a terrorist, we must first
define what a terrorist is, since there is no universal agreement on the definition of a
terrorist [2]. Facebook, in its definition of dangerous individuals and organizations,
attempted to define terrorism as follows:
Terrorism: Any nongovernmental organization that engages in premeditated acts of vio-
lence against persons or property to intimidate a civilian population, government or interna-
tional organization in order to achieve a political, religious or ideological aim. [3]
Since we are working with social media, we decided to adopt that definition.
In the following sections, we start by presenting the existing activity-based and graph-
based anomaly detection proposals, then we focus on the terrorist detection works.
1 Anomaly Detection in Social Media
This section presents the existing models and approaches for anomaly detection in
social media, categorized by their input format. We examine whether the latest proposals
can identify future anomalous-behavior changes, the user's over-time behavior
changes, and the usage of multiple social networks.
1.1 Activity-based Detection
Activity-based detection approaches consider users to be largely independent of
each other. An individual is defined by his/her own activities, which determine whether
his/her behavior is abnormal.
In [4], the authors presented a survey of the available user profiling methods for
anomaly detection, then proposed their own anomaly detection model. They showed
the advantages and disadvantages of each model from a cybersecurity perspective; some
models used operating system logs and web browser history as data sources, while
others were more focused on social networks such as Twitter and Facebook. Their
analysis revealed that the models based on history and logs were more limited and less
consistent, since one cannot really know whether the same user is the only one using
that operating system or web browser, whereas the social network-related models were
more consistent because they are private-account-based approaches that also include
users' interactions with each other, which leads to better results. Based on other methods'
data sources, they defined a user profile representation with a vector of 7 main feature
categories:
• Users' interests features
• Knowledge and skills features
• Demographic information features
• Intention features
• Online and offline behaviour features
• Social media activity features
• Network traffic features
Each feature category contains several features and sub-grouped features, which
finally led to more than 270 features that are mostly security-related. Their proposed
model, called "Unified User Profiling" (Fig. I.1), mainly collects the data from
the different sources, then cleans and parses it to obtain structured data, which
finally yields a user profile vector that the administrator can monitor across
different categories to detect anomalies based on the user's activity.
While their model is mostly complete in terms of features and considers different
social networks, it is still limited in that it does not detect anomalies automatically.
Figure I.1: Uniļ¬ed User Proļ¬ling (UUP) system with cyber security perspective
In [5], the writers proposed a pattern recognition method that, given a user profile
vector, takes the user's daily activity and creates a time-series pattern for each activity
he/she performs (Fig. I.2). Then, each time the user is involved in an activity, the
new behaviour is compared to his/her behavioral pattern for that activity. If a deviation
from the normal behavior occurs, it is flagged as suspicious; but since a minor deviation
does not always mean suspicion, the activity is also compared to a behavioral model of
all system users, so that false alarms are kept to a minimum. Their model is a random
forest trained on the CERT dataset along with a private dataset acquired from NextLabs,
and it achieved over 97% accuracy.
This method showed great results for insider threat detection, which can be considered
a single social network; it is thus still limited by not supporting multiple social networks,
and it cannot learn future abnormal behaviors automatically over time.
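The two-stage comparison described above can be sketched in a few lines. The z-score test and the threshold of 3 standard deviations are illustrative assumptions of ours, not details from [5], whose actual classifier is a trained random forest:

```python
from statistics import mean, stdev

def z_score(history, value):
    # How many standard deviations `value` lies from the series mean.
    s = stdev(history)
    return abs(value - mean(history)) / s if s else float("inf")

def is_suspicious(user_history, value, all_users_values, threshold=3.0):
    # Flag an activity only when it deviates BOTH from the user's own
    # time-series pattern AND from the behavioral model of all system
    # users, which keeps false alarms at a minimum.
    return (z_score(user_history, value) > threshold
            and z_score(all_users_values, value) > threshold)
```

For example, a user who normally performs around 10 of some activity per day would only be flagged for a count like 100 if that count is also unusual across the whole user population.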
Figure I.2: User Proļ¬ling Method in Authorization Logs
1.2 Graph-based Detection
Graph-based detection approaches consider user interactivity by analyzing a snapshot
of a network. Each user can have relations with other users, such as mentions, shares
and likes.
There are two approaches: static and dynamic. In static graph-based detection, the
analysis is done on a single snapshot of the network, while in dynamic graph-based
detection, the analysis is done in a time-based way by analyzing a series of snapshots.
In [6], the writers proposed an anomaly detection framework in which, at each timestamp
t, each user within a network has an activity score and a mutual score with the other users.
The scores are based on the user's activities and other users' interactions with those
activities. A mutual agreement matrix is then produced to represent those scores, with the
users' activity scores on the matrix diagonal. The users' scores are passed into an anomaly
scoring function that the authors proposed, and thresholded to determine whether a
user is anomalous or not (Fig. I.3). As data sources, they used the "CMU-CERT Insider
Threat Dataset" and the "NATOPS Gesture Dataset", then compared the results of
their framework to other known models. Their model was by far the best: it reached an
area-under-curve score of around 0.95, while other models such as SVM and clustering
scored around 0.89.
Although the framework far exceeded expectations for detecting insider threats and
supports over-time behavior changes, it is still limited by not considering different input
data types, such as images and texts, and by not analyzing multiple networks
simultaneously.
Figure I.3: Context-aware graph-based approach framework
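To make the matrix idea concrete, here is a minimal sketch: the diagonal holds each user's activity score, the off-diagonal entries hold mutual scores, and a per-user score is thresholded. The particular scoring function used here (gap between a user's own activity and the mean agreement with peers) and the 0.5 threshold are our assumptions for illustration; they are not the scoring function proposed in [6].

```python
def flag_anomalies(M, threshold=0.5):
    """M is a mutual agreement matrix: M[i][i] is user i's activity score
    and M[i][j] (i != j) the mutual score between users i and j."""
    n = len(M)
    flags = []
    for i in range(n):
        # Mean agreement of user i with all other users.
        mutual = sum(M[i][j] for j in range(n) if j != i) / (n - 1)
        # A user whose own activity diverges strongly from peer
        # agreement is flagged as anomalous.
        flags.append(abs(M[i][i] - mutual) > threshold)
    return flags
```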
In [7], the authors proposed a user profiling approach based on user behavior features
and social network connection features (Fig. I.4). The first set of features (user behavior
features) is the foundation of the user representation, composed of post content statistics,
post content semantics and user behavior statistics. The social network connection
features are a set of features that lead to the construction of a network of similar users
with similar network representations. The experimental results showed that using the
network connections improved the model's overall score. Their approach reached second
place among around 900 participants in the SMP 2017 User Profiling Competition.
This work showed that the use of graphs and the consideration of user interactivity is an
improvement toward grouping individuals and thus detecting anomalous communities.
The limitation of this work is that it cannot detect a category's future behavior changes.
Figure I.4: Forum user profiling approach framework
1.3 Summary
Within the scope of our research on anomaly detection in social media, we studied
different papers. Table I.1 presents the advantages and limitations of those papers in terms
of their support of multiple social networks (M-SN), support of multiple input data types
such as text and images (M-IT), support of over-time user behavior changes (T-UBC),
and their ability to learn future new abnormal behavior changes (T-FBC).
Paper | Description | Input Format | M-SN | M-IT | T-UBC | T-FBC
Lashakry et al., 2019 [4] | Proposed model for user profile creation to monitor users | User's Activity | ✓ | ✓ | ✗ | ✗
Zamanian et al., 2019 [5] | Proposed model for user activity pattern recognition with random forest | User's Activity | ✗ | ✗ | ✓ | ✗
Bhattacharjee et al., 2017 [6] | Proposed a probabilistic anomaly classifier model | Graph of users | ✗ | ✗ | ✓ | ✓
Chen et al., 2018 [7] | Proposed a user profiling framework that can be used to detect anomalous users | Graph of users | ✗ | ✓ | ✗ | ✗
Table I.1: Anomaly detection existing works comparison
None of the mentioned works considered all of these functionalities together.
Therefore, we decided to work on a model that supports all of them. To facilitate
that, we considered a hybrid architecture where the input format is graph-based, to
include user interactivity and ease the detection of communities, while also focusing on
the user's activity to solve our main problem of identifying the characteristics that
describe an anomalous individual.
2 Terrorist Detection in Social Media
As we decided on a hybrid architecture with both graph-input and activity-based
detection, we identified the existing terrorist detection works that focus on users'
social media content, and other works that treat a graph as an input. In this section, we
present those papers to get a broader overview of how to solve our problem.
2.1 Existing Content-based Models
In this section, we focus on models that treat the content of the activities an
individual can be involved in on social media. These serve as a proof of concept
for our implementation.
In [8], the writers implemented a model that detects extremists in social media based
on information related to usernames, profiles and textual content. They built their
dataset from Twitter by looking for hashtags related to extremism, which resulted in
around 1.5M tweets. From these, they extracted 150 ISIS-related accounts that had posted
those tweets and had been reported to the Twitter Safety account (@TwitterSafety) by
normal users, plus 150 normal users to obtain a balanced dataset, along with 3k unlabeled
samples.
Afterwards, they categorized the features into 3 major groups:
• Twitter handle (username) related features: length, number of unique characters
and Kolmogorov complexity of the username.
• Profile related features: this group contains 7 features related to the user's
profile, such as the profile description, the number of followers and the location.
• Content related features: the number of URLs, the number of hashtags and the
sentiment of the content.
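The handle-related features above can be computed directly from the username string. Since true Kolmogorov complexity is uncomputable, the sketch below substitutes a compression ratio as a proxy; this is an assumption on our part, as [8] does not detail its estimator:

```python
import zlib

def handle_features(handle: str) -> dict:
    # Username features in the spirit of [8]: length, number of unique
    # characters, and a compression ratio approximating Kolmogorov
    # complexity (lower = more repetitive, hence "simpler").
    raw = handle.encode("utf-8")
    return {
        "length": len(handle),
        "unique_chars": len(set(handle)),
        "complexity": len(zlib.compress(raw)) / max(len(raw), 1),
    }
```

A repetitive handle such as "abcabcabc..." compresses well and gets a low complexity score, while a random-looking handle of the same length does not.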
Based on this dataset, they tried to answer two research questions:
• Are extremists on Twitter inclined to adopt similar handles?
• Can we infer the labels (extremist vs. non-extremist) of unseen handles based on
their proximity to the labeled instances?
After their experiments with different supervised and semi-supervised approaches, both
questions had a positive answer. SVM had the best precision score, 0.96, which shows
the significance of the proposed feature set, while char-LSTM had the best precision-
recall score, 0.76, minimizing the number of false negatives.
This work presented different ways of collecting the necessary data for an extremist
detection task. It also showed that the use of different input data types from social media
can help detect extremists. The limitation of this model is that it does not support
over-time user behavior change, and it cannot learn future extremist behaviors.
In [9], the authors presented a convolutional neural network (CNN) to detect
suspicious e-crimes and terrorist involvement by classifying social media image
contents. They used three different datasets, of which we are only interested in the
terrorism images dataset. Based on the transfer learning technique, they took the CNN
architecture of the ImageNet model [10] and reduced its network size by lowering
the kernel size of each layer, coming up with a new, smaller network (Fig. I.5). In their
results, this architecture outperformed the default ImageNet model by around 1% in
mean average precision and took half of its execution time.
This paper showed that detecting terrorists based on their social media image contents
is possible, along with the advantage of using transfer learning rather than building
a CNN from scratch. However, their model supports only one type of data: images.
Figure I.5: Transfer-Learning CNN Framework
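The transfer-learning idea — keep a pretrained feature extractor frozen and retrain only a small classification head on the new task — can be illustrated with a deliberately tiny sketch. The fixed projection below merely stands in for a pretrained convolutional base; it is our own toy construction and has nothing to do with the actual ImageNet weights used in [9].

```python
import math

def frozen_base(x):
    # Stand-in for a pretrained convolutional base: its "weights" are
    # fixed and are never updated during fine-tuning.
    return (x[0] + x[1], x[0] - x[1])

def train_head(samples, labels, lr=0.5, epochs=300):
    """Train only the small logistic-regression head on top of the
    frozen base (this is the transfer-learning step)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            f = frozen_base(x)
            z = w[0] * f[0] + w[1] * f[1] + b
            p = 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))
            g = p - y  # gradient of the log-loss w.r.t. z
            w = [w[i] - lr * g * f[i] for i in range(2)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    f = frozen_base(x)
    return 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else 0
```

In a real setting, the frozen base would be the convolutional layers of a pretrained CNN, and only the final dense layers would be retrained on the new image dataset.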
In Table I.2, we present the content-based models that we analyzed, with their
advantages and limitations.
Paper | Description | Advantages | Limits
Alvari et al., 2019 [8] | (Semi-)supervised model of extremist detection based on user's general information and textual-content data | - Proof of concept of detection based on textual content and general information. - Supports multiple input data types. | - Cannot support multiple social networks. - Cannot detect if a user is adopting new behaviors over time. - Cannot learn future behavior changes.
Chitrakar et al., 2016 [9] | Image classification model using CNN and transfer learning | - Proof of concept of image-content-based detection. - Highlighted a model improvement technique: transfer learning. | - Cannot support multiple input data types. - Cannot learn future behavior changes.
Table I.2: Activity-based techniques comparison
2.2 Existing Graph-input Analysis
In this section, we study the existing works that use a graph as the input for the
terrorist detection in social media problem.
In [11], the authors proposed a framework that treats a multidimensional network as
the input for the identification of terrorist network key actors. The dimensions represent
the types of relationships or interactions in a social media platform. The workflow of
their framework starts by building a multidimensional network through a keyword-based
search on a social media platform; that network is then mapped to a single-layer network
using certain mapping functions. To detect the key actors, they use several centrality
measures, such as Degree Centrality and Betweenness Centrality. The output of the
framework is a ranked list of the key actors within the network. The framework's
effectiveness was evaluated with a ground truth dataset of 16 months of Twitter data.
Fig. I.6 presents the workflow of this framework.
This work presented the usage of multidimensional networks and how we can analyze
them to detect a terrorist network's key actors. Their usage of multiple dimensions could
be more efficient if they considered multiple social media platforms instead of multiple
relationship and interaction types.
Figure I.6: Multidimensional Key Actor Detection Framework
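The map-then-rank pipeline can be sketched as follows. The weighted-sum mapping function and the layer weights are hypothetical choices of ours; [11] only specifies that some mapping function collapses the dimensions into a single-layer network before centrality measures are applied.

```python
# Each dimension of the multidimensional network is one interaction
# type; edges are (user, user) pairs with an interaction count.
layers = {
    "mentions": {("u1", "u2"): 3, ("u2", "u3"): 1},
    "retweets": {("u1", "u2"): 2, ("u1", "u3"): 4},
}
weights = {"mentions": 1.0, "retweets": 0.5}  # hypothetical mapping weights

def flatten(layers, weights):
    """Map the multidimensional network to a single-layer weighted graph."""
    flat = {}
    for name, edges in layers.items():
        for edge, count in edges.items():
            flat[edge] = flat.get(edge, 0.0) + weights[name] * count
    return flat

def rank_key_actors(flat):
    """Rank users by weighted degree, a simple centrality measure."""
    score = {}
    for (u, v), w in flat.items():
        score[u] = score.get(u, 0.0) + w
        score[v] = score.get(v, 0.0) + w
    return sorted(score, key=score.get, reverse=True)
```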
In [12], the writers created a survey on social network analysis for counter-terrorism,
in which they described the data collection methods and the different types of analysis.
The two sources of data are online social networks and offline social networks. Online
social networks are the social media websites that allow users to interact with other users
by sending messages and posting information; these are websites like Facebook, Twitter
and YouTube, from which we collect data using their APIs. On the other hand, offline
social networks are the real-life social networks based on relations like financial
transactions, locations, events, etc.; these are the public databases such as the Global
Terrorism Database (GTD) [13] and the Global Data on Events Location and Tone (GDELT) [14].
Furthermore, they analyzed the different centrality measures that indicate the
importance and position of a node in a network, such as:
ā€¢ Degree Centrality: A node with higher degree value is often considered as an active
actor in a network. The degree value is the number of connections linked to a node.
[15]
ā€¢ Closeness Centrality: A node with higher closeness value can quickly access other
nodes in a network. The closeness value is a measure for how fast a node can reach
other nodes. [15]
ā€¢ Betweenness Centrality: A node with higher betweenness value is often considered
as an inļ¬‚uencer in a network. The betweenness value is the number of shortest
paths between any pair that pass through a node. We can see this as which node
acts as a bridge to make communities in a network. [15]
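As an illustrative sketch, the degree and closeness measures above can be computed on a small hypothetical network in pure Python (the 5-node graph and its edges are invented for the example; production work would rather use a library such as NetworkX):

```python
from collections import deque

# Hypothetical undirected toy network: node -> set of neighbours
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}

def degree_centrality(g, node):
    # Number of connections, normalized by the maximum possible degree
    return len(g[node]) / (len(g) - 1)

def closeness_centrality(g, node):
    # BFS gives shortest-path distances from `node` to every other node
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in g[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # Closeness: how fast the node reaches all others (higher = faster)
    return (len(g) - 1) / sum(dist[v] for v in g if v != node)
```

Here node "B" has the highest degree (three connections out of four possible), which matches the intuition of an "active actor" described above.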
Finally, they compared some SNA tools based on functionality, platform, license type
and file formats. They concluded that when doing social network analysis, the main
challenge is the data itself: the privacy of users is a very sensitive issue, and most of the
time the data tends to be incomplete, with many missing and fake nodes and relations,
which often leads to incorrect analysis results. This survey provided us with the different
data collection methods as well as the graph analysis methodologies.
Conclusion
In this chapter, we presented some existing works that have dealt with anomaly detection
in general and terrorist detection in particular, using different approaches. To the
best of our knowledge, the existing methods did not deal with terrorism in multidimensional
graphs by combining different types of classification in a time-based way. This
motivated us to propose a model of terrorism detection in multidimensional graphs that
supports different types of input data and can also detect behavioral changes over time.
In the next chapter, we review the existing techniques needed to implement our proposed model.
II Existing Techniques
This chapter presents the techniques necessary to implement our proposed model.
We begin by presenting the different input data types that we are considering and the
techniques used for the analysis of each type. Then, we present the classification models
to use and how they work.
Introduction
Each social network hosts ample input data that can be shared on it; identifying
these data types and choosing which ones we will work with is an important task
toward achieving our goal. In our previous analysis of the different existing proposals,
the authors of [4] identified nearly 270 security-related anomaly detection features,
some of which were social media activity features. We analyzed those features and, based
on [8, 9], grouped them into three data type categories, namely: textual-content data,
image-content data and numerical-content data. To classify an individual based on these
content data, different classification models exist.
In the next sections, we begin by giving an overview of the identified input data
types and their analysis approaches; then we present the different classification models.
1 Data Types
In this section, we briefly introduce each type of data along with the chosen approach
toward its analysis and classification.
1.1 Textual-Content Data
Textual-content data mainly consists of characters that are part of a certain language and
can be read by a human being. We begin by presenting the chosen text analysis approach;
then we decide on a data representation technique to transform the text into numerical input.
1.1.1 Text Analysis
In text analysis, the most commonly used technique is Text Mining.
Text Mining is the process of extracting high-quality information from textual data,
where the information can be patterns or matching structures in text, without consideration
of its semantics. Its outcomes are mostly statistical information
such as frequency and correlation of words. [16]
In the terrorism detection domain, we are interested in knowing what the user is trying to
incite with a post and whether it is serious, sarcastic or reporting news. To differentiate
these cases, we need to go through semantic analysis rather than working with words as mere objects.
One of the most important text mining processing methodologies that also considers the
semantics of words is Natural Language Processing.
Natural Language Processing (NLP) is the process of making the computer understand
the language spoken by humans, along with the semantics and sentiments it conveys,
by performing morphological, syntactical and semantic analysis [16].
The first step in NLP is morphology processing, which involves analyzing the
structure of words by studying their construction from primitive meaningful units called
morphemes. This will help us divide the different words/phrases of a document into
tokens that will be used in later analysis.
Morphemes are the smallest units with a meaning in a word. There are two types
of morphemes, namely Stems and Affixes, where the stem is the base or root of a
word and an affix can be a prefix, an infix or a suffix. Affixes never appear in isolation;
they are always combined with a stem.
Taking the example of Fig. II.1, we can see how we split a word into a stem, which carries
the main meaning of the word, and some affixes.
Figure II.1: An example of morphemes extraction
Tokens are words, keywords, phrases or symbols that form a useful semantic unit
for processing. We refer to their extraction process as Tokenization. A token is mainly
composed of a lemma + part-of-speech tag + grammatical features. Example:
• plays → play (lemma) + Noun (part-of-speech tag) + plural (grammatical feature)
• plays → play (lemma) + Verb (part-of-speech tag) + singular (grammatical feature)
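As a toy illustration of morpheme splitting, the following sketch separates a stem from a suffix using a tiny hand-written suffix list (the suffix inventory and the minimum-stem-length heuristic are assumptions made for the example; real morphological analysers are far richer):

```python
# Toy morphological analyser: split a word into (stem, affix) using a
# hypothetical, hand-written suffix inventory.
SUFFIXES = ["ing", "ed", "es", "s"]

def split_morphemes(word):
    for suffix in SUFFIXES:
        # Require a few stem characters so "is" is not split into "i" + "s"
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)], suffix
    return word, None  # no known affix: the word is a bare stem

print(split_morphemes("plays"))   # ('play', 's')
print(split_morphemes("killed"))  # ('kill', 'ed')
```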
Having finished studying the structure of the words, we have to examine their arrangement
and combination in a sentence, using syntax analysis.
In a sentence, the arrangement of words follows the precise rules of the language's grammar.
Taking the example of the sentence Three people were killed in an incident today and following
the English grammar parser, we end up with the example of Fig. II.2, where we have some
grammatical groups such as S for sentence, NP for noun phrase, VP for verb phrase,
NN for singular nouns and NNS for plural nouns.
Figure II.2: An example of syntax analysis
This analysis enables the machine to understand the relationships between the
words and the different references.
After structuring the words and studying their relationships, it is time for the machine
to understand the meaning of the words and phrases along with the context of the
document. Focusing on the relationships between words and elements such as synonyms,
antonyms and hyponyms (hierarchical order of meaning), the semantic system is
able to build blocks composed of:
• Entities: Individuals or instances.
• Concepts: Categories of individuals, or classes.
• Relations: Relationships between entities and concepts.
• Predicates: Verb structures or semantic roles.
These can be represented through methods such as first-order predicate logic (FOPL),
semantic networks and conceptual dependency.
Fig. II.3 illustrates an example of semantic networks using our last example of the
sentence Three people were killed in an incident today.
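A semantic network like the one in this example can be sketched as a set of (entity, relation, entity) triples; the entity and relation names below are purely illustrative:

```python
# Minimal semantic-network sketch for the sentence
# "Three people were killed in an incident today", stored as
# (entity, relation, entity) triples -- names are invented for the example.
triples = [
    ("kill_event", "is_a",     "incident"),
    ("kill_event", "victim",   "people"),
    ("people",     "quantity", "three"),
    ("kill_event", "time",     "today"),
]

def related(subject, relation):
    """Query all objects linked to `subject` via `relation`."""
    return [o for s, r, o in triples if s == subject and r == relation]
```

A query such as `related("kill_event", "victim")` then traverses the network the same way the semantic system follows its relation edges.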
Figure II.3: An example of semantic network
Based on these semantics, the machine can now learn the meaning of the words and
the text; thus, from this point it is possible to learn the meaning of the user's textual data.
1.1.2 Data Representation
After going through the text analysis, our machine can now understand the meaning
of the textual-content data. But in order to build a classifier that will automatically
categorize current and future data, our data must be numerical, so that mathematical
rules can be applied while its semantics are preserved.
Word embedding is one of the most popular representations of textual data: it transforms
a word in a document into a vector of numerical features, where close vectors generally
mean that the corresponding words share the same meaning or context, so the
data does not lose its semantics.
According to our research, the most widely used word embedding techniques are Word2Vec
and Term Frequency-Inverse Document Frequency (TF-IDF).
Word2Vec uses two different approaches, namely Continuous Bag Of Words (CBOW)
and Skip-Gram, both based on neural networks that take a context as input and
use back-propagation to learn [17]. Mathematically, Word2Vec
tries to maximize the probability of the next word wt given the previous context h. Thus,
the probability P(wt | h) is given by Equation II.1, where score(wt, h) computes the compatibility
of wt with the context h and softmax is the well-known softmax function.

P(wt | h) = softmax(score(wt, h))    (II.1)
CBOW learns the embedding of a word by predicting it based on the surrounding
words, which are considered as the context here.
Skip-Gram learns the embedding of a word by considering the current word as the
context and predicting the surrounding words.
According to [17], Skip-Gram is able to function with less data and represents rare words
better, while CBOW is faster and represents frequent words more clearly.
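The softmax of Equation II.1 can be sketched in a few lines of Python (the raw scores passed in are hypothetical; in Word2Vec they would come from the network's output layer):

```python
import math

def softmax(scores):
    # Subtract the maximum score for numerical stability before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def p_next_word(scores_per_word, index):
    """P(wt | h) = softmax(score(wt, h)) for the candidate word at `index`."""
    return softmax(scores_per_word)[index]
```

The outputs always sum to 1, so they can be read directly as the probabilities of Equation II.1.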
TF-IDF represents words with weights. These weights are based on the product of the
term frequency and the inverse document frequency. In simpler terms, words that occur
frequently throughout the document should be given very little weight or significance.
For example, in English, such terms include: the, or and and. They do not provide
much value. However, if a word appears rarely, or appears frequently but
only in one or two places, then it is identified as a more important word and should
be weighted as such [18].
Term-Frequency (TF) is the percentage of occurrences of a term t in a document
d. As illustrated in Equation II.2, we calculate term frequency by dividing the number of
times a term t appears in a document d by the total number of words in the document d.

tf(t,d) = n(t,d) / Σ_term n(term,d)    (II.2)

n(t,d): the number of occurrences of term t in the document d.
Σ_term n(term,d): the sum of the occurrences of all the terms that appear in the document d,
which is the total number of words in the document d.
Word2Vec
  Advantages: optimized memory usage; fast execution time.
  Disadvantages: contains a lot of noisy data; does not work well with ambiguity.
TF-IDF
  Advantages: the vocabulary is built with words that identify the category; extracts relevant information.
  Disadvantages: high memory usage; the closest words are not similar in meaning but in the category of the document's context.
Table II.1: Comparison of word embedding methods
Inverse-Document-Frequency (IDF) ranks a term t by its relevance within a document
collection. Equation II.3 shows the mathematical formula to calculate inverse document
frequency: we take the total number of documents N and divide it by df(t), the number
of documents that contain the term t.

idf(t) = log_e(N / df(t))    (II.3)
Finally, to get the weight w(t,d) of the word t in a document d using TF-IDF, we
multiply tf(t,d) by idf(t), as shown in Equation II.4.

w(t,d) = tf(t,d) ∗ idf(t)    (II.4)
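Equations II.2-II.4 can be sketched directly in Python (the three-document corpus is invented for the example, and documents are naively split on whitespace):

```python
import math

def tf(term, doc):
    # Equation II.2: occurrences of `term` divided by total words in `doc`
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    # Equation II.3: log of (number of documents / documents containing term)
    df = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    # Equation II.4: the final TF-IDF weight
    return tf(term, doc) * idf(term, docs)

docs = ["the cat sat on the mat", "the dog barked", "cat and dog play"]
```

Note that a ubiquitous word such as "the" (present in most documents) gets a near-zero IDF, which is exactly the down-weighting of common words described above.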
As found in reviews of existing research, such as [18], Word2Vec performs better in
terms of memory, execution time and embedding quality for words similar in context
and meaning, while TF-IDF performs better at identifying the words that determine the
document's category; in other words, it detects the keywords that identify a category of
documents. Table II.1 summarizes the advantages and disadvantages of each method.
1.2 Image-Content Data
This type of data is anything that is a visual representation of something. Different
approaches are available for image processing, but as determined in [10], the convolutional
neural network is by far the best-performing method for image classification in
terms of precision and execution time.
A Convolutional Neural Network (CNN) is a deep learning algorithm and an extension of
the neural network that is distinguished from other methods by its ability to consider spatial
structure and translation invariance. This means that regardless of where an object
is located in an image, it is still considered the same object [19]. The advantage of
having a multidimensional input, unlike regular neural networks that use a vector as
input, makes it perform better with image data, since an image usually has three color
channels (RGB), which makes it a three-dimensional matrix. Taking the example of a
32×32 image with 3 color channels, we would have 32×32×3 = 3072 weights for a regular
neural network; if we go for a 512×512 image, we would have 512×512×3 = 786432
weights. This results in huge calculations as well as over-fitting from having too
much information and detail. [20]
A simple CNN is a sequence of layers: a convolutional layer, a pooling layer and a fully-connected
layer. In a typical CNN, there are several rounds of convolution/pooling before
we proceed to the fully-connected layer.
1.2.1 CNN: Convolutional Layer
Each convolutional layer of the network has a set of feature maps that can recognize
increasingly complex patterns/shapes in a hierarchical manner. Instead of regular matrix
multiplications, the convolutional layer uses convolution calculations. To do that, the
convolutional layer needs to construct filters and apply calculations with them, while using
some optimization techniques such as Striding and Padding.
Filters are used to detect patterns in an image; they also offer weight sharing. For
example, a filter which detects a curved edge (Fig. II.4) matches the left corner of an image,
but may also match the bottom-right corner of the image if both corners have curved
edges.
Figure II.4: Curved Edge Filter
Calculations are matrix multiplications that are used to apply a filter on an input image.
Let us consider the following 5×5 input image and 3×3 filter, producing a 3×3 output:

Input image:      Filter:      Output:
0 0 1 1 0         1 1 0        ? ? ?
1 1 3 1 2    ∗    0 0 1    =   ? ? ?
1 0 1 4 2         1 0 0        ? ? ?
0 2 2 1 0
3 4 1 0 0

In order to get the value of the first '?', we need to apply the filter to the first 3×3 block of
pixels: ? = (0∗1) + (0∗1) + (1∗0) + (1∗0) + (1∗0) + (3∗1) + (1∗1) + (0∗0) + (1∗0) = 4. Then
we continue: the value next to '?' is the value of the second 3×3 block of pixels, in which
'3' is the center. This means we moved by 1 pixel to the right.
Input image:      Filter:      Output:
0 0 1 1 0         1 1 0        4 ? ?
1 1 3 1 2    ∗    0 0 1    =   ? ? ?
1 0 1 4 2         1 0 0        ? ? ?
0 2 2 1 0
3 4 1 0 0

? = (0∗1)+(1∗1)+(1∗0)+(1∗0)+(3∗0)+(1∗1)+(0∗1)+(1∗0)+(4∗0) = 2. And so on.
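The worked example above can be reproduced with a short pure-Python sketch of the convolution (strictly speaking a cross-correlation, as in the example) with a configurable stride:

```python
def convolve(image, kernel, stride=1):
    """Valid cross-correlation of a square image with a square kernel."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Element-wise product of the kernel with the current block
            acc = 0
            for a in range(k):
                for b in range(k):
                    acc += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(acc)
        out.append(row)
    return out

image = [[0, 0, 1, 1, 0],
         [1, 1, 3, 1, 2],
         [1, 0, 1, 4, 2],
         [0, 2, 2, 1, 0],
         [3, 4, 1, 0, 0]]
kernel = [[1, 1, 0],
          [0, 0, 1],
          [1, 0, 0]]
print(convolve(image, kernel))  # [[4, 2, 5], [3, 10, 8], [6, 6, 6]]
```

The first two output values, 4 and 2, match the hand calculations above; passing `stride=2` skips every other position, as described in the next paragraph.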
Striding is a parameter defining how many pixels we move to calculate the
next value. It is mainly used to reduce the amount of calculation, as values next to each
other are likely to be similar. In our last example the stride was 1, meaning we only
moved the 3×3 window by 1 pixel to get the next value. Usually, we use a value of 2 or 3,
since in most cases a shift of 2-3 pixels is enough to produce a variation or a change of pattern.
Padding is used to prevent information loss. In our example, when we applied the filter,
we never considered the values of the first/last rows and the first/last columns
as centers of the 3×3 block. To fix that, we add zero padding, which adds new rows/columns
filled with 0.
0 0 1 1 0        0 0 0 0 0 0 0
1 1 3 1 2        0 0 0 1 1 0 0
1 0 1 4 2   ⇒    0 1 1 3 1 2 0
0 2 2 1 0        0 1 0 1 4 2 0
3 4 1 0 0        0 0 2 2 1 0 0
                 0 3 4 1 0 0 0
                 0 0 0 0 0 0 0
1.2.2 CNN: Pooling Layer
The pooling layer is used to determine what information is critical and what constitutes
irrelevant detail. There are many types of pooling layers, such as the max pooling layer and
the average pooling layer. With max pooling, we look at a neighborhood of pixels and
keep only the maximum value.
Considering a 2×2 max pooling with a stride of 2:
1 0 0 1
3 2 0 2        3 2
0 0 4 2   ⇒    4 4
4 1 0 1

For each 2×2 block we took the maximum value, and each time we moved by two pixels
(the stride) to get the next 2×2 block.
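The same 2×2 max pooling with stride 2 can be sketched as:

```python
def max_pool(image, size=2, stride=2):
    """Max pooling: keep only the maximum of each size×size window."""
    out = []
    for i in range(0, len(image) - size + 1, stride):
        row = []
        for j in range(0, len(image[0]) - size + 1, stride):
            window = [image[i + a][j + b]
                      for a in range(size) for b in range(size)]
            row.append(max(window))
        out.append(row)
    return out

img = [[1, 0, 0, 1],
       [3, 2, 0, 2],
       [0, 0, 4, 2],
       [4, 1, 0, 1]]
print(max_pool(img))  # [[3, 2], [4, 4]]
```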
1.2.3 CNN: Fully-Connected Layer
A fully-connected layer is a layer in which all the inputs are connected to all the
outputs. In a CNN, it is used to finally determine the class that will be assigned to our
main input. Before proceeding to the fully-connected layer, we have to use a technique
called flattening in order to generate the vector that this layer requires.
Flattening:
• Each 2D matrix of pixels is turned into 1 column of pixels.
• Each one of our 2D matrices is placed on top of another.
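The two steps above can be sketched as concatenating the values of every feature map into a single vector (the 2×2 maps below are invented for the example):

```python
def flatten(feature_maps):
    # Turn each 2D feature map into a 1D sequence and stack them into one vector
    vector = []
    for fmap in feature_maps:
        for row in fmap:
            vector.extend(row)
    return vector

# Two hypothetical 2×2 feature maps become one 8-element input vector
print(flatten([[[3, 2], [4, 4]], [[1, 0], [0, 1]]]))  # [3, 2, 4, 4, 1, 0, 0, 1]
```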
1.3 Numerical-Content Data
Numerical-content data is data based on numbers that can be statistically interpreted.
This type of data does not require pre-processing; thus, it can be fitted directly
into a model. The models for this type of data are mostly the general statistical
machine learning models that we present later.
2 Data Classification in Machine Learning
Machine learning is a subset of the artificial intelligence domain that makes the
machine able to automatically gain knowledge from experience without being explicitly
programmed. Following statistical and mathematical concepts, it looks for patterns
in the data we provide, learns them and makes better decisions in the future. [21]
Several learning methods exist in Machine Learning:
• Supervised Learning: Given a sample of data and the desired output, the machine
should learn a function that maps the inputs to the outputs.
• Unsupervised Learning: Given a sample of data without the output, the machine
should learn a function that categorizes these samples based on learned patterns.
• Semi-Supervised Learning: Given a small amount of data with the desired output
(labeled data) and other data without output (unlabeled data), the machine should
learn a function that can label the unlabeled data using the knowledge learned from
the labeled data.
• Reinforcement Learning: Given a sample of data and a set of actions with associated
rewards, the machine should learn a function that finds the optimal actions for
achieving the maximum reward.
Classification is part of supervised learning, in which the machine categorizes
newly observed data based on the patterns learned for each category from the training
data. In the following sections, we present the most common classification algorithms.
2.1 Support Vector Machines
A support vector machine model is a representation of the data in a space where examples
of the same category are close to each other and the groups of examples of different
categories are separated by a clear gap that is as wide as possible. Newly observed
examples are then predicted to be part of a category based on the side of the gap on
which they fall. [22]
2.2 Logistic Regression
Logistic regression is a statistical model that analyses data in which there is at least
one feature that can determine the outcome. Using a logistic function, it models a
binary output measured with a dichotomous variable. Since the output is binary, it can
only be used for binary classification problems. To use it for a multi-class problem, N
logistic regression models should be trained, where N is the number of classes; each
model is trained on one class with a one-vs-all approach. [23]
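A minimal sketch of binary logistic regression trained by gradient descent on a one-feature toy dataset (the data, learning rate and epoch count are invented for the example; a multi-class problem would train one such model per class, one-vs-all):

```python
import math

def sigmoid(z):
    # The logistic function mapping any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit a single-feature logistic regression with plain gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            w -= lr * (p - y) * x  # gradient of the log-loss w.r.t. w
            b -= lr * (p - y)      # gradient of the log-loss w.r.t. b
    return w, b

# Toy separable data: feature below 1.5 -> class 0, above -> class 1
xs, ys = [0.0, 1.0, 2.0, 3.0], [0, 0, 1, 1]
w, b = train_logistic(xs, ys)
preds = [int(sigmoid(w * x + b) > 0.5) for x in xs]
```

After training, thresholding the sigmoid output at 0.5 recovers the dichotomous (binary) label described above.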
2.3 Neural Networks
A neural network is a network in which we have multiple layers of perceptrons. A
perceptron is the elementary unit of an artificial neural network, introduced
as a model of biological neurons in 1959 [24]. The output of each perceptron in a layer is
connected as an input to each perceptron of the next layer, which makes such a layer known
as a fully connected layer. A neural network must have an input layer, an output layer and,
in between, a hidden layer. Any neural network with more than one hidden layer is considered
a deep neural network. [20]
Conclusion
In this chapter, we studied the existing techniques needed to perform classification
on textual-content data, image-content data and numerical-content data. In the next
chapter, we detail the basis of our proposed model.
III Proposed Model
This chapter introduces a novel time-based terrorism detection model that works
with multidimensional networks and different types of input data. The output of
our model is the set of nodes that belong to terrorist regions in a graph, across the dimensions
of the multidimensional network. To identify this type of node, we first have to
determine what the terrorist regions are and how to create them. Then, we examine the
network to estimate a terrorism score for each node in a dynamic way, in order to detect
behavioral changes over time.
First, we introduce the purpose of the model along with the proposed research questions;
then we present the sources of data. After that, we present in detail the theoretical
approach toward constructing our model, and we finish with a conclusion.
Introduction
Nowadays, social networks provide many types of data that could be used, such as
images, texts and videos, but most of the existing models work on a specific type of data
from a specific social network.
Our proposed model tries to overcome this limitation by supporting a multidimensional
network as input, in order to be able to use data from multiple social networks at the
same time, along with support for different input data types. In addition, the model
also considers the evolution of an individual's behavior over time to detect
deviations from the usual behavior pattern. Furthermore, the model adapts itself to
this behavioral evolution in order to stay up to date with new abnormal behaviors.
Before describing the basis of the model construction, it is first necessary to present the
research questions that will be used as a metric to track the accuracy of our proposed
model in solving the main research problem of this thesis, which is the study of the
characteristics that describe a terrorist on different social media platforms.
The research questions being posed are as follows:
Q1: Can we identify the behavior of a terrorist based on his/her social media content?
Q2: Can machine learning help automatically detect whether a user is adopting terrorist
behavior over time?
Q3: Do terrorists adopt the same behavior on different social networks?
In order to answer these research questions, we have to pass through some phases:
• Phase 1: Identifying the available data sources
• Phase 2: Determining the convenient classification approach
• Phase 3: Estimating the terrorism score calculation
First, we start by collecting the necessary data for each user. Then, we create a multidimensional
network where each dimension represents a social network. Once the network
is ready, it is used as input to our model, where each feature from each social
network is mapped to its respective sub-model. Finally, a decision score is calculated.
If a node is detected as a terrorist, the model is re-trained with those new inputs
to stay up to date with the newest (unseen) terrorist behaviors; in case the model loses
accuracy after being updated, it is reverted to the last version. Additionally, each
node is passed to the model every time it is involved in a new activity; that way, the
node can also be flagged as a terrorist if the user adopts terrorist behavior over time.
1 Data Collection
As part of phase 1, the data sources of the different data types should be identified.
As presented in the last chapter, there exist three types of data:
• Textual-Content Data: These include posts, comments, image captions, text in an image, etc.
• Image-Content Data: These are posted photos, the profile picture, etc.
• Numerical-Content Data: These are age, number of friends, average posts per day, etc.
Several other pieces of information exist in social media, such as username, gender and
relationship status. Therefore, instead of keeping the numerical-content data category,
we opted for another category named general information data, containing the existing
numerical-content data in addition to the user's information data. We next present the
data sources for the different data contents that we have. As mentioned in [12], we can
categorize the data sources into two categories, namely offline data sources and online
data sources.
In this section, we provide the sources of both offline and online data that are used in
order to retrieve our target data types for model training and later prediction. As a strategy
for training the model and precisely distinguishing terrorism from other similar data, we
decided to consider terrorist content as positive labels against military and news content
as negative labels; as these types of content are related, training them against each
other will make the model more precise.
1.1 Offline Data Sources
Offline data is the data used for model training, gathered from public
terrorism datasets. For each input type, we used a different dataset. All of them define
terrorism from the American point of view.
For the textual-content data, we were inspired by [8] to use the Twitter API to gather tweets
that contain terrorism-related hashtags, as well as tweets from terrorist accounts that were
reported to Twitter's safety account (@twittersafety), ensuring that they are not anti-terrorist
accounts. With that, we will create our offline textual-content dataset, in which we consider
those tweets as positive labels against terrorism news tweets and news headlines
gathered from other public datasets, such as the Global Terrorism Database (GTD) [13], as
negative labels. We will also use the Google Translate API, since some accounts may publish
tweets in different languages.
For the image-content data, we did not find a public terrorism-related image dataset
within the scope of our research. We therefore decided to use manual web scraping with
Google Images as our data source. We will manually gather images of terrorist individuals
and images inciting terrorism, which are our positive labels, and contrast
them against military and terrorism news images, which are our negative labels.
For the general information data, the Study of Terrorism And Responses to Terrorism (START)
published a database called Profiles of Individual Radicalization In the United States
(PIRUS) [25], which contains approximately 145 features about many radical profiles in
the United States, from which we will extract our project's relevant features, namely
age, gender, relationship status, etc.
1.2 Online Data Sources
Online data is the social network data used for prediction and future model
re-training. Its sources are the public APIs provided by the social networks.
For social media, we decided to study three popular websites that have similar data contents
and that can also be linked together: Facebook, Instagram and Twitter.
Facebook provides the Graph API, an HTTP-based API service for accessing Facebook
social graph objects [26]. With the right permissions, the Graph API allows you to query
public data as well as create content [27]. The data is rich in semantics, since the Graph
API uses the RDF format as a return type. [28]
Instagram, as part of Facebook, also provides the Graph API for business accounts [29].
For normal user accounts, it offers a REST API that returns JSON objects for querying public
data. [30]
Twitter provides a REST API with a JSON return format that offers several public
data queries, as well as private data with the right permissions. [31, 32]
2 Proposed Model Design
With the data preparation phase ready, we can now determine the classification
approach that we will use along with the terrorism score formula, thus completing
phase 2 and phase 3.
In this section, we explain the theoretical side of the necessary steps toward constructing
our proposed model. As previously mentioned, the model takes a graph as input.
It then performs an individual, content-based classification, with a decision-making component
that calculates the final score of the node and uses a threshold to determine whether the
user is a terrorist.
2.1 Model Input
Inspired by [11], the best way to represent our input data is a multidimensional
network. However, unlike in their proposal, the dimensions in our work represent each
social network used.
Let G = (V, E, D) denote an undirected, unweighted multidimensional graph, in which
V is a set of nodes representing the users, D reflects the dimensions, which are the social
networks, and E = {(u, v, d); u, v ∈ V, d ∈ D} represents the set of edges, which are the connections
between users representing things such as relationships, shared comments or post
sharing. Fig. III.1 illustrates an example of what this network looks like.
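A minimal sketch of this G = (V, E, D) structure in Python, with invented user and network names, stores each edge together with its dimension:

```python
# Sketch of the multidimensional graph G = (V, E, D): each edge carries
# the dimension (social network) it belongs to. All names are illustrative.
V = {"alice", "bob", "carol"}
D = {"facebook", "twitter", "instagram"}
# E = {(u, v, d)}: undirected edges tagged with a dimension
E = {
    ("alice", "bob", "facebook"),
    ("alice", "bob", "twitter"),
    ("bob", "carol", "instagram"),
}

def neighbours(node, dimension):
    """All users connected to `node` within one social network."""
    out = set()
    for u, v, d in E:
        if d == dimension:
            if u == node:
                out.add(v)
            elif v == node:
                out.add(u)
    return out
```

The same pair of users can thus be linked on several dimensions at once, which is exactly what distinguishes this representation from a single-network graph.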
At each timestamp, the user's data is inserted into our model to compute his/her
score. A timestamp here is each time the user is involved in a new activity, which is the
method used by [5].
Figure III.1: Multi-dimensional Network
2.2 Content-Based Classification
The model itself contains three different sub-models, one for each content type we have.
2.2.1 Text Classification Model
As mentioned in the previous chapter, before applying machine learning classification
models to textual content, we have to perform text analysis and transform the text into
numerical input that a model can understand.
As illustrated in Fig. III.2, when the textual data is received, it first has to
pass through the NLP process. Once that is done, it has to be represented in a numerical
way. In the last chapter, we presented a comparison between two word embedding
techniques, namely Word2Vec and TF-IDF. We chose TF-IDF because, as we are solving a
classification problem, we are more interested in differentiating the categories than in
representing the similarity of word meanings. Now that our machine can understand
our textual data, and the data itself can be represented numerically, we can pass it to
any machine learning model. As a strategy, we decided that in the implementation
phase we would try the different models mentioned in the last chapter,
such as Support Vector Machines, Logistic Regression and Neural Networks, then compare
their results to assess which one performs better.
Figure III.2: Text Classification Model
2.2.2 Image Classification Model
In the previous chapter, we presented the convolutional neural network as the model to use
for image classification. But designing a CNN requires ample parameter tuning and
the adding/removing of convolution blocks to find the best architecture, while re-training
the model each time. This is a hugely time-consuming task. To overcome it, there
is a technique called Transfer Learning that can help obtain better results faster.
Transfer Learning is a technique that makes a model benefit from knowledge gained while solving another, similar problem. For example, a model that learned to recognize cars could use its knowledge to recognize trucks [33]. This is done by taking a pre-trained model, changing a few layers, usually the last ones, and re-training only those layers.
It is shown in [34] that transfer learning can bring a huge improvement in accuracy, execution time and memory usage.
Another known limitation that we usually encounter in image classification is not having diverse enough data or enough samples. A solution to that is the Data Augmentation technique.
Data Augmentation is a technique for generating more data: having little data and not enough variation leads to a bottleneck in neural network models, which usually require thousands of diverse training samples to be able to generalize. It relies on techniques such as:
• Flipping: flip the image horizontally or vertically.
• Rotating: rotate the image by some degrees.
• Scaling: re-scale the image, making it larger or smaller.
• Cropping: crop a part of the image.
• Translating: move the image in some direction.
• Adding Gaussian noise: add noisy points to the image.
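As a toy illustration of these operations (using a small NumPy array as a stand-in image, not our actual dataset):

```python
import numpy as np

# A tiny dummy "image" standing in for real pixel data.
image = np.arange(9).reshape(3, 3)

flipped_h = np.fliplr(image)            # horizontal flip
flipped_v = np.flipud(image)            # vertical flip
rotated = np.rot90(image)               # rotate by 90 degrees
translated = np.roll(image, 1, axis=1)  # shift one pixel right (wraps around)
noisy = image + np.random.normal(0, 1, image.shape)  # add Gaussian noise
```

Each transformed array is a new training sample with the same label as the original.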
Applying data augmentation can help in improving the model score as discussed
in [35].
Therefore, as illustrated in Fig. III.3, once we have image data, it passes through our
trained CNN model resulting in an image-content score.
Figure III.3: Image Classification Model
2.2.3 General Information Classification Model
For the general information model, the features require little pre-processing for the machine to understand them. We have to apply some encoding techniques to the non-numerical data, then fit it to a supervised machine learning classification model.
For non-numerical features such as gender and relationship status, we have to encode the values numerically. As these are binary, we can use 0 and 1. For non-binary values,
we have to use techniques such as one-hot encoding or label (integer) encoding, the latter typically paired with a sparse categorical cross-entropy loss.
As for the username, we can apply some feature engineering to create relevant features from it, such as its length, the number of unique characters and other important information, as discussed in [8].
Other numerical features, such as the age, the number of friends and the number of followers, can be passed directly to the model.
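As a small sketch of this preparation step (the profile fields and the exact username features below are illustrative, following the ideas in [8]):

```python
# A made-up profile; field names mirror the kind of data we gather.
profile = {"gender": "female", "relationship": "single",
           "username": "user_4213", "age": 27,
           "friends": 120, "followers": 45}

def encode_profile(p):
    """Turn a raw profile into purely numerical features."""
    return {
        # binary features -> 0/1
        "gender": 0 if p["gender"] == "male" else 1,
        "relationship": 0 if p["relationship"] == "single" else 1,
        # engineered username features
        "username_length": len(p["username"]),
        "username_unique_chars": len(set(p["username"])),
        "username_digits": sum(c.isdigit() for c in p["username"]),
        # numerical features passed through unchanged
        "age": p["age"],
        "friends": p["friends"],
        "followers": p["followers"],
    }
```

The resulting dictionary of numbers can then be fed to any supervised classifier.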
In the implementation phase, we try different classification models and compare their results to select the one that performs best.
Figure III.4: General Information Classification Model
2.3 Decision Making
Now that we have a model for each data type, we can move to phase 3, where we propose a calculation formula that provides a score for each user.
While doing our work and based on the available features, we noticed that the textual content and the image content have more impact on the user behavior than the general information, which could be misleading. Therefore, as a compromise, we decided to give each input a weight relative to its impact on determining the anomaly of the user.
Taking 3 scores, one for each sub-model, {s1, s2, s3} and 3 weights {α1, α2, α3}, each node u ∈ V on each dimension d ∈ D has a terrorism score for that dimension, S(u)_d, as in (III.1).

S(u)_d = Σ_{i=1}^{3} (α_i × s(u)_i)    (III.1)
Now each user has a score for each dimension based on the sub-model scores of that dimension, but as an output we want a single score. For that, given 3 dimensions, each user must have a terrorism score ST(u) as in (III.2).

ST(u) = (Σ_{d=1}^{3} S(u)_d) / 3    (III.2)
Now that each user u ∈ V has a terrorism score ST(u), we have to decide whether that user is a terrorist or not. This is done by defining a certain threshold γ where:

ST(u) ≥ γ ⇒ Terrorist
ST(u) < γ ⇒ Not Terrorist    (III.3)
The values of the weights α_i and the threshold γ are determined in the implementation phase.
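Putting (III.1)–(III.3) together, the decision step can be sketched as follows; the weight and threshold values shown here are placeholders, since the real values are only fixed in the implementation phase:

```python
ALPHA = (0.4, 0.4, 0.2)   # example sub-model weights, summing to 1
GAMMA = 0.5               # example decision threshold

def dimension_score(sub_model_scores, alpha=ALPHA):
    """S(u)_d: weighted sum of the 3 sub-model scores on one dimension (III.1)."""
    return sum(a * s for a, s in zip(alpha, sub_model_scores))

def terrorism_score(dimension_scores):
    """ST(u): average of the per-dimension scores (III.2)."""
    return sum(dimension_scores) / len(dimension_scores)

def is_terrorist(st, gamma=GAMMA):
    """Decision rule (III.3)."""
    return st >= gamma
```

For instance, three dimension scores of 0.2, 0.4 and 0.6 average to 0.4, which falls below a threshold of 0.5, so the user is not flagged.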
2.4 Global Model
After defining the different components of our model, let us present its design along with the workflow of how to use it. Fig. III.5 shows what our model looks like using the example of a single user with three dimensions: the Facebook, Twitter and Instagram data.
Figure III.5: Proposed Model
Fig. III.6 illustrates the workflow of our model. Each time a user is involved in an activity, the user's data passes through our model. If the user's behavior is detected as terrorist, we re-train the model with this new data to keep it updated with new, unseen behaviors. If the model loses accuracy after re-training, we revert to the last existing model.
Figure III.6: Model Workflow
Conclusion
In this chapter, we presented our proposed approach, starting from the research questions that we are looking to solve. Then, we showed the different phases to follow in order to answer those questions. Finally, we explored the steps to follow toward the
construction of our model.
The next chapter details the achievements and the different results.
Chapter IV. Implementation and Results
This chapter presents the practical part of our work. We go through the pipeline of our implementation, starting with data gathering, then the model creation, and we finish with the interpretation of the results and a response to the research questions.
1 Data Collection
In this section, we explain how to gather the data that we identified in the last chapter. As discussed, there are two types of data: the offline and the online data. In the next sections, we implement the data gathering solution for each of them.
1.1 Offline Data
To train the models, we relied on offline data, namely the public datasets related to our problem.
In the last chapter, we selected a data source for each input type; we implement their gathering scripts in the next sections.
43
Chapter IV. Implementation and Results
1.1.1 Textual-Content Data
For the textual data, we have two sources:
• Positive labels: tweets of banned Twitter accounts.
• Negative labels: news headlines from the GTD.
Our positive labels are the data containing terrorist textual content. Our strategy was to gather tweets of the banned users that were reported to the @twittersafety account and that contained terrorism-related hashtags when they were reported. This can be done through the Twitter API or the Twitter search tool. Fig. IV.1 illustrates an example of our searches, looking for tweets that were reported to or mentioned the twittersafety account and contained the hashtags #ISIS, #terrorist, #Daech, #IslamicState.
Figure IV.1: Twitter Searching Tool
While doing our research, we found that an organization had already done this process and extracted over 17k clean terrorist tweets of ISIS users, publishing them as a
Kaggle dataset called How ISIS Uses Twitter [36].
For our negative labels, we need content related to terrorism in an opposite way, such
as news reporting on terrorism. For that, we will be using the news headlines from the
Global Terrorism Database (GTD) [13]. Fig. IV.2 presents a sample of 4 rows from the
GTD news headlines.
Figure IV.2: Sample of news headlines
Our ļ¬nal dataset contains the merge of the tweets labeled as terrorist, and the GTD
data labeled as news. Fig IV.3 shows the word cloud of the most appearing keywords
from our dataset, that includes both positive and negative labels.
Figure IV.3: Word Cloud of our Textual Data
Our samples total approximately 300k, of which about 122k are terrorist data and around 181k are news headlines. Table IV.1 presents the exact numbers in our dataset.
Label Number of samples
Positive labels 122619
Negative labels 181691
Total Data 304310
Table IV.1: Textual-Content Dataset
1.1.2 Image-Content Data
As discussed in our research, the source of the image data is Google Images, from which we gather images manually. Luckily, a Python package called google_images_download [37] exists, which allows us to automate this task by choosing the keywords we are looking for and the number of images needed.
We wrote a script that downloaded around five hundred images of terrorists and incitement acts, in addition to another five hundred images of military and terrorism news. Unfortunately, the images were not 100% related to what we were looking for; therefore, we had to manually verify the gathered images and remove the unrelated ones.
After cleaning the data and keeping only related images, we had around 200 terrorist images and 300 military and news images. Table IV.2 gives the exact numbers of images in our dataset. Fig. IV.4 and Fig. IV.5 show three random images of each category.
Label Number of samples
Positive labels 219
Negative labels 314
Total Data 533
Table IV.2: Image-Content Dataset
Figure IV.4: Sample of Terrorists images
Figure IV.5: Sample of Military/News images
1.1.3 General Information Data
For general information data, we used the Profiles of Individual Radicalization in the United States (PIRUS) [25] public dataset, from which we extracted the ages, genders and relationship statuses of 135 extremists; these are our positive labels. As for the negative labels, we use the online data to build our dataset.
Fig. IV.6, Fig. IV.7 and Fig. IV.8 show the distribution of each feature within our positive-label data.
Figure IV.6: Age Distribution
Figure IV.7: Relationship Distribution
Figure IV.8: Gender Distribution
1.2 Online Data
In this section, we will implement the necessary scripts that will gather the online
data from our selected social media platforms: Facebook, Instagram and Twitter.
1.2.1 Facebook Data
Facebook provides an HTTP-based API called the Graph API. A public SDK called facebook-sdk helps us write an automated Facebook data gathering script in Python.
To use the Facebook Graph API, it is necessary to pass an access token that has the relevant permissions to access the social graph objects being queried. In the Facebook social graph, each object has some fields related to its type; for example,
the User object contains information about the user profile, such as the age, relationship status and gender. The main objects we are interested in are the User, the Post and the Comment. To access each graph object, you pass the id of an object of that type. Therefore, we cannot access posts and comments directly, since the post ids are contained in the list of posts of the User object, and the same goes for comments, which are part of the posts. Fig. IV.9 shows a representation of the Facebook Graph API.
Figure IV.9: Facebook Graph API
Our script starts by obtaining the user's information along with the list of post ids. Then, it accesses all the posts by looping through the post ids from the posts field of the User object and retrieves the necessary information. After that, it extracts the comments by looping through the comment ids from the comments field of each Post object. Finally, it parses the textual and image data from those posts and comments.
The following code is an example of how to get the user information along with the posts data.
graph = facebook.GraphAPI(access_token=access_token, version=3.1)
user_information = graph.get_object(
    id='me', fields='id,name,age_range,gender,relationship_status')

posts_ids = []
posts_object = graph.get_object(id='me', fields='posts')
posts_ids.extend(posts_object['posts']['data'])
# First paging cursor, if any
next_page = posts_object['posts'].get('paging', {}).get('next')
while next_page is not None:
    response = requests.get(next_page)
    new_data = json.loads(response.content)
    posts_ids.extend(new_data['data'])
    try:
        next_page = new_data['paging']['next']
    except KeyError:
        next_page = None

for post in posts_ids:
    post_data = graph.get_object(
        id=post['id'],
        fields='created_time,full_picture,message,shares,likes.summary(1)')
    post_data['likes'] = post_data['likes']['summary']['total_count']
    try:
        post_data['shares'] = post_data['shares']['count']
    except KeyError:
        post_data['shares'] = 0
1.2.2 Instagram Data
For Instagram, the task is easier, as it provides a plain REST API with JSON output where each endpoint is accessible directly through any HTTP request module. In Python, we use the requests module with the Instagram endpoint https://api.instagram.com/v1/, where we can access the user's information through /users/self/?access_token={} and the posts through /users/self/media/recent/?access_token={}.
The following code shows how our script gathers information from Instagram.
# User data
response = requests.get(
    'https://api.instagram.com/v1/users/self/'
    '?access_token={}'.format(access_token))
user = json.loads(response.content)['data']

# Posts data
response = requests.get(
    'https://api.instagram.com/v1/users/self/media/recent/'
    '?access_token={}'.format(access_token))
data = json.loads(response.content)
for post in data['data']:
    _id = post['id']
    creation_timestamp = post['created_time']
    created_time = datetime.fromtimestamp(
        int(creation_timestamp)).strftime('%Y-%m-%d %H:%M:%S')
    message = post['caption']['text'] if post['caption'] is not None else ''
    img_url = post['images']['standard_resolution']['url']
    post_data = dict(created_time=created_time, id=_id,
                     message=message, img_url=img_url)
1.2.3 Twitter Data
Similar to Instagram, Twitter also provides a REST API; however, it also offers a Python SDK that makes the API easier to use. In order to use it, we have to pass 4 access keys: consumer key, consumer secret, access token key and access token secret. Each key has permissions that allow access to either the user's private data or the public Twitter data.
The following code is an example of how we loaded the tweets using the Twitter Python SDK.
api = twitter.Api(consumer_key=consumer_key,
                  consumer_secret=consumer_secret,
                  access_token_key=access_token_key,
                  access_token_secret=access_token_secret)
user_id = api.VerifyCredentials().AsDict()['id']
tweets = api.GetUserTimeline(user_id=user_id)
for tweet in tweets:
    tweet = tweet.AsDict()
    _id = tweet['id']
    created_time = tweet['created_at']
    message = tweet['text'] if tweet['text'] is not None else ''
    tweet_data = dict(created_time=created_time,
                      id=_id, message=message)
2 Model Implementation
In the next sections, we implement the different components that lead toward constructing our proposed model.
For each sub-model, we split the dataset of that content type into 80% training data and 20% testing data. All the models are implemented on the same machine, provided by Kaggle, a data science platform, with the following hardware:
• RAM: 16 GB
• CPU count: 2
• GPU: Tesla K80
• Disk: 5 GB
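The 80/20 split described above can be sketched with Scikit-Learn's train_test_split (the data below is a stand-in, not our dataset):

```python
from sklearn.model_selection import train_test_split

X = list(range(100))     # stand-in samples
y = [i % 2 for i in X]   # stand-in binary labels

# 80% training data, 20% testing data; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # 80 20
```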
2.1 Text Classification Model
The steps to construct our text classification model were first to have the NLP pipeline ready for data pre-processing, then to vectorize the data with TF-IDF and pass it to a classification model.
2.1.1 NLP Process
In practice, the NLP process consists of tokenization, removal of stop words and lemmatization. In the following code, we use the Natural Language Toolkit (NLTK) Python package to perform these steps. We start with regular expressions that remove unnecessary text that disrupts the process, such as links and dates. Then, we split the text into tokens, remove the stopwords (common filler words like 'a', 'the', 'that', 'on') and lemmatize the words by determining the root word based on its part-of-speech tag (adjective, verb, noun).
def process_text(text):
    nltk_processed_data = []
    # Remove links
    text = re.sub(r'https?://\S+', '', text, flags=re.MULTILINE)
    # Remove dates such as 12/31/2019 and times such as 10:30:00
    text = re.sub(r'(?:[0-9]{1,2}[:/-]){2}[0-9]{2,4}', '', text,
                  flags=re.MULTILINE)
    for w in tokenizer.tokenize(text):
        word = w.lower()
        if not is_stopword(word=word):
            processed_text = wordnet_lemmatizer.lemmatize(
                word, get_wordnet_pos(word))
            nltk_processed_data.append(processed_text)
    return nltk_processed_data
2.1.2 Data Vectorization
To use our data in classification models, we have to vectorize it into semantic numerical data. In the last chapter, we chose TF-IDF as our vectorizer. Scikit-Learn
offers the TfidfVectorizer module, usable with just two lines. We defined the object parameters as follows:
• max_df: maximum document frequency for a word to be kept in the vocabulary ⇒ 0.95 (a word must appear in at most 95% of the documents)
• min_df: minimum document frequency for a word to be kept in the vocabulary ⇒ 0.1 (a word must appear in at least 10% of the documents)
• ngram_range: range of word counts considered as a single token ⇒ (1,3) (from 1 word to 3 words)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=0.95, min_df=0.1, ngram_range=(1,3))
X = vectorizer.fit_transform(train_data)
The data used to train the TfidfVectorizer is around 205K samples.
After vectorizing the data, TF-IDF identified 330 features, which makes our data shape: (n_training_samples, n_dimensions) ⇒ (243448, 330)
We use the trained vectorizer to transform the testing data, as follows:
transformed_data = vectorizer.transform(test_data).toarray()
2.1.3 Data Classification
As mentioned in the last chapter, we try three classification models, namely Logistic Regression, Support Vector Machine and Neural Network. The best performing model will later be used in our global model.
To implement the Logistic Regression and the Support Vector Machine models, we used Scikit-Learn, a Python machine learning library that offers many well-known models. We trained these two models with their default parameter values.
For the Neural Network, we used Keras, a framework that works on top of TensorFlow. The architecture of our model is composed of three layers, with 16 neurons, 8 neurons and 1 neuron respectively. The first two layers have a 'relu' activation, well proven for its performance, and the last layer has a 'sigmoid' activation, as it is our output layer and we
have a binary classification problem. The model is compiled with 'binary_crossentropy' as the loss function and 'adam' as the optimizer. For the training parameters, we used 20 epochs with a batch size of 128 and 20% validation data extracted from the training data.
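A sketch of the described architecture in Keras (assuming the 330 TF-IDF features from the vectorization step as input; training data is omitted here):

```python
from tensorflow.keras import layers, models

# 16 -> 8 -> 1 neurons, as described; input is the 330-dimensional
# TF-IDF vector.
model = models.Sequential([
    layers.Dense(16, activation="relu", input_shape=(330,)),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])
# Training would then be:
# model.fit(X_train, y_train, epochs=20, batch_size=128,
#           validation_split=0.2)
```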
Table IV.3 shows the different metric scores for each model along with the training execution time. These models were trained and tested with the same data and on the same machine. The model that we use in our global model is the Neural Network, as it has the best F1-score with a reasonable training time.
Model Name Accuracy F1-Score Training time
Logistic Regression 0.9726 0.9674 39.9 secs
SVM 0.9626 0.9548 6h 48min 33s
Neural Network 0.9774 0.9719 1min 11s
Table IV.3: Text Models Metric Scores
2.2 Image Classification Model
In the last chapter, we chose the convolutional neural network as our image classification model, along with optimization techniques, namely Transfer Learning and Data Augmentation. Therefore, as a first step, we have to implement our data augmentation functions, then define which base model's learnt knowledge will be used in our model.
2.2.1 Data Augmentation
For data augmentation, a Python package called imgaug provides all the different augmentation techniques. In the following code, we show an example of how to use the augmenters of the imgaug library, applying a random augmentation technique to the image.
from imgaug import augmenters as iaa

img_augmentor = iaa.Sequential([
    # Select one of the augmentation techniques randomly
    iaa.OneOf([
        iaa.Affine(rotate=0),
        iaa.Affine(rotate=90),
        iaa.Affine(rotate=180),
        iaa.Affine(rotate=270),
        iaa.Fliplr(0.5),
        iaa.Flipud(0.5),
    ])], random_order=True)

# Apply the augmentation technique on the image
image_aug = img_augmentor.augment_image(image)
Fig. IV.10 shows an example of two images generated through the data augmentation code above.
Figure IV.10: An example of data augmentation
After applying data augmentation on the training data, we generated an additional 30% of data, resulting in a total of approximately 550 images.
2.2.2 Transfer Learning
Many pre-trained models exist nowadays, but each is focused on a specific problem. In our case, we work mostly with faces and objects like guns, so the pre-trained model VGG16 [38] is well suited to our problem.
To adapt VGG16 to our problem, we remove its fully-connected layers, freeze the training of the remaining layers and add two new layers. The first has 16 neurons and a 'relu' activation. The second, our output layer, has 1 neuron and a 'sigmoid'
activation. The loss function is 'binary_crossentropy' with 'adam' as the optimizer. Since image classification can be a complex task and we have a small amount of data, we train the model for 5000 epochs with a batch size of 32, using an early stopping strategy of 250 rounds.
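A minimal sketch of this adaptation in Keras could look as follows. Note that weights=None is used here only to avoid downloading the pre-trained weights; the real setup would load weights='imagenet':

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# VGG16 without its fully-connected head (include_top=False).
base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the convolutional blocks

# Add the two new trainable layers described above.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```

Only the two new Dense layers are updated during training, which is what makes transfer learning fast on small datasets.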
In Table IV.4, we present the scores of the different combinations of our two new CNN layers, with and without the pre-trained model and with and without the data generated by data augmentation. While the scores were measured on the same testing data, the training data differs when data augmentation is used. Using both DA and TL together resulted in better scores with a reasonable training time; therefore, we use that combination in our global model.
Model Accuracy F1-Score Training time
CNN 0.7631 0.7219 3min 50secs
CNN + DA 0.7781 0.7463 4min 12secs
CNN + TL 0.8291 0.8103 8min 48secs
CNN + DA + TL 0.8571 0.8454 9min 23secs
Table IV.4: Image Models Metric Scores
2.3 General Information Classification Model
For the general information, we follow the same strategy used in the text classification: we work with three classification models, namely Logistic Regression, Support Vector Machine and Neural Network, and the best performing model will later be used in our global model.
For the Logistic Regression and the Support Vector Machine, we used the default Scikit-Learn parameter values.
For the Neural Network, however, we used an architecture of four layers with 16 neurons, 8 neurons, 4 neurons and 1 neuron respectively. A 'relu' activation is used for the first three layers and a 'sigmoid' activation for the last layer. The model is compiled with 'binary_crossentropy' as the loss function and 'adam' as the optimizer. For the training parameters, we use 200 epochs with a batch size of 32 and 20% validation data extracted from the training data.
Table IV.5 illustrates the metric scores of the models trained with the same data on the same machine. For the global model, we use the SVM, as it exceeds by far the performance of the other models.
Model Name Accuracy F1-Score Training time
Logistic Regression 0.7650 0.7873 5 secs
SVM 0.8300 0.8495 7 secs
Neural Network 0.8173 0.8325 48.6 secs
Table IV.5: General Information Models Metric Scores
2.4 Proposed Model
In this part, we go through our proposed model's workflow to put things together and implement the missing components.
Our model's input is a multidimensional network; therefore, we have to implement a parser that maps the data into the corresponding sub-model. This is solved by creating objects in which we store the data in a convenient way before passing it to the sub-models. Fig. IV.11 illustrates our class diagram, in which we store each user's data. The general user information is in the User object, while the Post object, which can also be a Comment, holds both the image and the textual data.
Figure IV.11: Class Diagram
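A sketch of these classes as Python dataclasses (attribute names are illustrative, based on the fields gathered earlier, not a verbatim copy of Fig. IV.11):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Post:
    """A post or comment; carries the textual and image content."""
    id: str
    created_time: str
    message: str = ""              # textual content
    img_url: Optional[str] = None  # image content, if any

@dataclass
class User:
    """General user information plus the user's posts."""
    id: str
    name: str
    age_range: str = ""
    gender: str = ""
    relationship_status: str = ""
    posts: List[Post] = field(default_factory=list)

u = User(id="1", name="John Doe")
u.posts.append(Post(id="p1", created_time="2019-01-01", message="hello"))
```

The parser fills these objects from the gathered data, then forwards each field to the matching sub-model.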
The second component of our model is the set of sub-models that receive the input data. For that, we use the chosen pre-trained models for each input type, each outputting a score.
The next component is the decision making, where we have to interpret the output scores of the sub-models, calculate the terrorism score and decide on the user's extremeness. The calculation formula was already defined in the last chapter, but the values of the threshold γ and the model factors α were not yet decided. For the factors, since we have more features in the image and textual content than in the general information, we set them as follows:
• Text-model factor: 0.4 (40%)
• Image-model factor: 0.4 (40%)
• Information-model factor: 0.2 (20%)
As for the threshold, since we do not have enough real online data to decide it in a scientific way, we agreed to keep it neutral with a value of 0.5 (50%).
The model itself is adapted to change over time; thus, a component that re-trains and reverts a model must be implemented as well. For that, we have a database where we store the last model's score, and a Python function checks whether the score improved after re-training the model on the new terrorist user's data.
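This logic can be sketched as follows, where `train` and `evaluate` stand in for the real sub-model training and scoring routines (the function names are illustrative):

```python
def retrain_or_revert(current_model, current_score, new_data,
                      train, evaluate):
    """Re-train on new data; keep the new model only if it scores
    at least as well, otherwise revert to the previous one."""
    candidate = train(current_model, new_data)
    candidate_score = evaluate(candidate)
    if candidate_score >= current_score:
        return candidate, candidate_score   # keep the improved model
    return current_model, current_score     # revert to the last model
```

In our setting, the stored score comes from the database and the returned pair overwrites it after each detection.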
With those components ready, our model's implementation is finished and the model is ready to be tested.
3 Results Interpretation
In this section, we test our model on a network to see whether we can answer the research questions posed at the beginning of our proposal.
The network passed to the model is composed of two real users (U1 and U2) that are non-terrorist and one generated terrorist user (U3), as we could not find an available terrorist user. The input was tested at only a single timestamp t, due to the lack of historical data.
As we can see in Table IV.6, which presents the scores predicted for those users by each sub-model on each social network (Facebook: FB, Instagram: IG, Twitter: T), the model performed well, predicting the anomalousness of the users correctly. Based on these results, we can see that a terrorist can be detected from his/her social media content; thus, our answer to Q1 is positive. We can also notice that the scores on the same data type from different social networks are mostly similar, except for the text content on Instagram, which consists only of image captions; this means that our answer to Q3 is positive.
User Text-Model Score Image-Model Score Information-Model Score Final Score
FB IG T FB IG T FB IG T
U1 0.084 0.084 0.079 0.031 0.068 0.063 0.265 0.318 0.345 0.116
U2 0.059 0.054 0.078 0.013 0.054 0.115 0.530 0.445 0.276 0.133
U3 0.859 0.298 0.854 0.658 0.877 0.816 0.530 0.637 0.690 0.705
Table IV.6: Model Testing Results
After detecting the user U3 as a terrorist, the sub-models were re-trained by appending the new data extracted from U3 to the old data. The new scores of the sub-models increased by an average of 0.01. Although this increase could be considered negligible, over time it will help keep our model up to date with new terrorism content; thus, if a user starts adopting new terrorism behaviors that the model
was not trained on in the first place, the user will still be detected as a terrorist; therefore, our answer to Q2 is positive.
Conclusion
In this chapter, we presented the implementation of our solution, starting with the data gathering, then the sub-model training and the construction of our proposed model, and we finished by testing our model and answering our research questions.
Chapter V. Conclusions and Perspectives
In this thesis, we proposed a terrorist detection model that takes multidimensional networks as its input format and that also supports different input data types such as texts and images. Our model can also detect whether a user is adopting a new behavior over time, and the model itself can automatically learn new terrorism behaviors.
We started by presenting the existing works carried out in the anomaly and terrorism detection domains. Then, we discussed the existing techniques for processing and classifying data in an automated way. After that, we presented the model's design and the theoretical perspective of the workflow. Finally, we implemented the model and discussed the results.
The model itself showed good results on two real users and one generated user by predicting their anomalousness correctly. Despite the fact that the amount of online data used for testing is very small, this is still a proof of concept that our proposed model can be implemented and put into a production environment.
Although we tried to overcome the limitations of other existing models, our proposed model is still limited by not supporting some functionalities, such as:
• Graph analysis: we could use graph analysis methodologies to detect communities, since our input data is a network.
• Support for videos: we could add another sub-model for video classification, since videos are among the most important content on social media.
The model's accuracy can also be improved by using larger datasets, which would also allow the threshold and the sub-model factors to be calculated in a more rigorous way.
Bibliography
[1] Shannon Greenwood, Andrew Perrin, and Maeve Duggan. Social media update
2016. Pew Research Center, 11(2), 2016.
[2] Alex P Schmid. The definition of terrorism. In The Routledge handbook of terrorism
research, pages 57ā€“116. Routledge, 2011.
[3] Facebook community standards. URL https://www.facebook.com/
communitystandards/dangerous_individuals_organizations.
[4] Arash Habibi Lashkari, Min Chen, and Ali A Ghorbani. A survey on user profiling
model for anomaly detection in cyberspace. Journal of Cyber Security and Mobility, 8
(1):75ā€“112, 2019.
[5] Zahedeh Zamanian, Ali Feizollah, Nor Badrul Anuar, Laiha Binti Mat Kiah,
Karanam Srikanth, and Sudhindra Kumar. User profiling in anomaly detection of
authorization logs. In Computational Science and Technology, pages 59ā€“65. Springer,
2019.
[6] Sreyasee Das Bhattacharjee, Junsong Yuan, Zhang Jiaqi, and Yap-Peng Tan. Context-
aware graph-based analysis for detecting anomalous activities. In 2017 IEEE Inter-
national Conference on Multimedia and Expo (ICME), pages 1021ā€“1026. IEEE, 2017.
[7] Di Chen, Qinglin Zhang, Gangbao Chen, Chuang Fan, and Qinghong Gao. Forum
user profiling by incorporating user behavior and social network connections. In
International Conference on Cognitive Computing, pages 30ā€“42. Springer, 2018.
[8] Hamidreza Alvari, Soumajyoti Sarkar, and Paulo Shakarian. Detection of violent
extremists in social media. arXiv preprint arXiv:1902.01577, 2019.
[9] Pradip Chitrakar, Chengcui Zhang, Gary Warner, and Xinpeng Liao. Social media
image retrieval using distilled convolutional neural network for suspicious e-crime
65
Bibliography
and terrorist account detection. In 2016 IEEE International Symposium on Multimedia
(ISM), pages 493ā€“498. IEEE, 2016.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoļ¬€rey E Hinton. Imagenet classiļ¬cation with
deep convolutional neural networks. In Advances in neural information processing
systems, pages 1097ā€“1105, 2012.
[11] George Kalpakis, Theodora Tsikrika, Stefanos Vrochidis, and Ioannis Kompatsiaris.
Identifying terrorism-related key actors in multidimensional social networks. In
International Conference on Multimedia Modeling, pages 93ā€“105. Springer, 2019.
[12] Pankaj Choudhary and Upasna Singh. A survey on social network analysis for
counter-terrorism. International Journal of Computer Applications, 112(9):24ā€“29,
2015.
[13] Gary LaFree and Laura Dugan. Introducing the global terrorism database. Terrorism
and Political Violence, 19(2):181ā€“204, 2007.
[14] Kalev Leetaru and Philip A Schrodt. Gdelt: Global data on events, location, and
tone, 1979ā€“2012. In ISA annual convention, volume 2, pages 1ā€“49. Citeseer, 2013.
[15] Linton C Freeman. Centrality in social networks conceptual clariļ¬cation. Social
networks, 1(3):215ā€“239, 1978.
[16] EDUCBA contributors. Text mining vs natural language process-
ing - top 5 comparisons, Aug 2019. URL https://www.educba.com/
important-text-mining-vs-natural-language-processing/.
[17] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeļ¬€rey Dean. Eļ¬ƒcient estimation of
word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[18] Shivangi Singhal. Data representation in nlp, Jul 2019. URL https://medium.com/
@shiivangii/data-representation-in-nlp-7bb6a771599a.
[19] Eric Kauderer-Abrams. Quantifying translation-invariance in convolutional neural
networks. arXiv preprint arXiv:1801.01450, 2017.
66
Master's Thesis

  • 1. Tunisian Republic Ministry of Higher Education and Scientific Research University of Tunis El Manar Higher Institute of Computer Science Masterā€™s Thesis Presented in order to obtain the Masterā€™s Degree in Information and Technology Mention: Information and Technology Specialty : Software Engineering (GL) By: Wajdi KHATTEL Proposal of a Terrorist Detection Model in Social Networks Presented on 07.12.2019 In front of jury composed of: President: Evaluator: Academic supervisor: Laboratory supervisor: Najet AROUS Olfa EL MOURALI Ramzi GUETARI Nour El Houda BEN CHAABENE Realized within Academic year : 2018-2019
  • 2. Laboratory Supervisor Academic Supervisor I authorize the student to submit his internship report for a defense Signature I authorize the student to submit his internship report for a defense Signature Le 22/11/2019 Ramzi Guetari Le 22/11/2019 Nour El Houda Ben Chaabene
  • 3. Dedications I want to dedicate this humble work to: My parents Abderraouf and Sonia for all the pain they have been through and all the sacriļ¬ces they made in order for me to reach this level and for me to be what I am today. To my sister Yosra and her husband Jamel for their patience, continuous support and care. To all the members of my family and my dearest friends for the best times and laughs we had and sticking by my side the time I needed. For all those I love and all those who love me. To all who helped that I forgot to mention. With Love, Wajdi Khattel. iii
  • 4. Acknowledgements I would like ļ¬rst to thank and express my very profound gratitude to my academic advisor, Mrs. Nour EL Houda BEN CHAABENE for the huge eļ¬€ort and sacriļ¬ce she gave the entire time and also for believing in our capacities and her patience, motivation, and immense knowledge. Her guidance helped us in all the time of research and writing of this thesis. My academic Professor, Mr. Ramzi GUETARI, for his big support and generosity and his continuous welcome in his oļ¬ƒce that was always open whenever I ran into a trouble spot or had a question about our research, and steering us in the right direction whenever I needed it. Also anyone who contributed to this work for the support, even spiritually especially the last couple of weeks. With Gratitude Wajdi Khattel. iv
  • 5. Table of Contents General Introduction 1 I State of the art 3 1 Anomaly Detection in Social Media . . . . . . . . . . . . . . . . . . . . . . . 4 1.1 Activity-based Detection . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2 Graph-based Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2 Terrorist Detection in Social Media . . . . . . . . . . . . . . . . . . . . . . . 10 2.1 Existing Content-based Models . . . . . . . . . . . . . . . . . . . . . 11 2.2 Existing Graph-input Analysis . . . . . . . . . . . . . . . . . . . . . 13 II Existing Techniques 16 1 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.1 Textual-Content Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.1.1 Text Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.1.2 Data Representation . . . . . . . . . . . . . . . . . . . . . . 20 1.2 Image-Content Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 1.2.1 CNN: Convolutional Layer . . . . . . . . . . . . . . . . . . 23 1.2.2 CNN: Pooling Layer . . . . . . . . . . . . . . . . . . . . . . 25 1.2.3 CNN: Fully-Connected Layer . . . . . . . . . . . . . . . . . 26 1.3 Numerical-Content Data . . . . . . . . . . . . . . . . . . . . . . . . . 26 2 Data Classiļ¬cation in Machine Learning . . . . . . . . . . . . . . . . . . . . 26 2.1 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.2 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 v
  • 6. Table of Contents III Proposed Model 29 1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 1.1 Oļ¬„ine Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 1.2 Online Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2 Proposed Model Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.1 Model Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.2 Content-Based Classiļ¬cation . . . . . . . . . . . . . . . . . . . . . . . 34 2.2.1 Text Classiļ¬cation Model . . . . . . . . . . . . . . . . . . . 34 2.2.2 Image Classiļ¬cation Model . . . . . . . . . . . . . . . . . . 36 2.2.3 General Information Classiļ¬cation Model . . . . . . . . . . 37 2.3 Decision Making . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.4 Global Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 IV Implementation and Results 43 1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 1.1 Oļ¬„ine Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 1.1.1 Textual-Content Data . . . . . . . . . . . . . . . . . . . . . 44 1.1.2 Image-Content Data . . . . . . . . . . . . . . . . . . . . . . 46 1.1.3 General Information Data . . . . . . . . . . . . . . . . . . . 48 1.2 Online Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 1.2.1 Facebook Data . . . . . . . . . . . . . . . . . . . . . . . . . 49 1.2.2 Instagram Data . . . . . . . . . . . . . . . . . . . . . . . . . 51 1.2.3 Twitter Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 2 Model Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 2.1 Text Classiļ¬cation Model . . . . . . . . . . . . . . . . . . . . . . . . . 54 2.1.1 NLP Process . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 2.1.2 Data Vectorization . . . . . . . . . . . . . . . . . . . . . . . 
54 2.1.3 Data Classiļ¬cation . . . . . . . . . . . . . . . . . . . . . . . 55 2.2 Image Classiļ¬cation Model . . . . . . . . . . . . . . . . . . . . . . . 56 2.2.1 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . 56 2.2.2 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . 57 2.3 General Information Classiļ¬cation Model . . . . . . . . . . . . . . . 58 vi
  • 7. Table of Contents 2.4 Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3 Results Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 V Conclusions and Perspectives 63 Bibliography 65 vii
  • 8. List of Figures I.1 Uniļ¬ed User Proļ¬ling (UUP) system with cyber security perspective . . . . 6 I.2 User Proļ¬ling Method in Authorization Logs . . . . . . . . . . . . . . . . . 7 I.3 Context-aware graph-based approach framework . . . . . . . . . . . . . . . 8 I.4 Forum user proļ¬ling approach framework . . . . . . . . . . . . . . . . . . . 9 I.5 Transfer-Learning CNN Framework . . . . . . . . . . . . . . . . . . . . . . . 12 I.6 Multidimensional Key Actor Detection Framework . . . . . . . . . . . . . . 14 II.1 An example of morphemes extraction . . . . . . . . . . . . . . . . . . . . . . 18 II.2 An example of syntax analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 19 II.3 An example of semantic network . . . . . . . . . . . . . . . . . . . . . . . . 20 II.4 Curved Edge Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 III.1 Multi-dimensional Network . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 III.2 Text Classiļ¬cation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 III.3 Image Classiļ¬cation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 III.4 General Information Classiļ¬cation Model . . . . . . . . . . . . . . . . . . . 38 III.5 Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 III.6 Model Workļ¬‚ow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 IV.1 Twitter Searching Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 IV.2 Sample of news headlines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 IV.3 Word Cloud of our Textual Data . . . . . . . . . . . . . . . . . . . . . . . . . 46 IV.4 Sample of Terrorists images . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 IV.5 Sample of Military/News images . . . . . . . . . . . . . . . . . . . . . . . . 48 IV.6 Age Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
48 IV.7 Relationship Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 IV.8 Gender Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 IV.9 Facebook Graph API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 viii
  • 9. List of Figures IV.10An example of data augmentation . . . . . . . . . . . . . . . . . . . . . . . . 57 IV.11Class Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 ix
  • 10. List of Tables I.1 Anomaly detection existing works comparison . . . . . . . . . . . . . . . . 10 I.2 Activity-based techniques comparison . . . . . . . . . . . . . . . . . . . . . 13 II.1 Comparison of word embedding methods . . . . . . . . . . . . . . . . . . . 22 IV.1 Textual-Content Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 IV.2 Image-Content Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 IV.3 Text Models Metric Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 IV.4 Image Models Metric Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 IV.5 General Information Models Metric Scores . . . . . . . . . . . . . . . . . . . 59 IV.6 Model Testing Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 x
  • 11. Acronyms UUP Uniļ¬ed User Proļ¬ling CERT Computer Emergency Response Team NATOPS Naval Air Training and Operating Procedures Standardization SVM Support Vector Machine SMP Social Media Processing M-SN Multiple Social Networks M-IT Multiple Input Types T-UBC Time-based User Behavior Changes T-FBC Time-based Future Behaviorā€™s Changes ISIS Islamic State of Iraq and Syria URL Uniform Resource Locator LSTM Long Short-Term Memory CNN Convolutional Neural Network API Application Program Interface GTD Global Terrorism Database GDELT Global Data on Events Location and Tone SNA Social Network Analysis NLP Natural Language Processing FOPL First Order Predicate Logic xi
  • 12. TF-IDF Term Frequency-Inverse Document Frequency CBOW Continuous Bag Of Words RGB Red, Green, Blue START Study of Terrorism And Responses to Terrorism PIRUS Proļ¬les of Individual Radicalization In the United States HTTP HyperText Transfer Protocol RDF Resource Description Framework REST REpresentational State Transfer JSON JavaScript Object Notation SDK Software Development Kit RAM Random Access Memory CPU Central Processing Unit GPU Graphics Processing Unit NLTK Natural Language ToolKit DA Data Augmentation TL Transfer Learning VGG Visual Geometry Group FB Facebook IG Instagram T Twitter xii
  • 13. General Introduction The appearing of social networks has created an ease of communication and idea sharing. Several of them has now become one of the most popular sources of information, namely: Facebook, Twitter, LinkedIn, etc. Within the last decade, the number of the people across the word that uses those websites kept increasing to overcome billion of active users per day [1]. Most of these users are there to interact with their friends, family and meet new people that shares their interests. Other users such as business owners are there to communicate with their target audience for promoting their brand or receiving feedback from customers. Although this facilitated way of communication could be used in a friendly way, there are other users that take that advantage in a harmful way such as bullies, spammers and hackers. One of the most dangerous categories is terrorist groups. They are one of the most proļ¬table users of this advantage. The ability for them to incite other people, promote their groups and planning attacks has become very simple. Detecting these groups in an accurate and a fast way has become one of the most important tasks for social network owners. Several approaches and methods has been proposed to that end such as manual monitoring and ļ¬rewalls. But, as the number of those individuals kept increasing, an accurate and fully automated approaches must be used. Fortunately, the evolution of new technologies, especially the appearing of machine learning, has made that task easier. In this thesis, we propose a model that learns the characteristics that describes a terrorist individual. Additionally, the model learns by itself the new characteristics that deļ¬ne terrorism behaviors, since the abnormal behaviors of our social-culture changes over-time. 
The ļ¬rst chapter presents some existing works that deal with the issue of anomaly detection in general and terrorism detection in particular, to give the reader a general idea of the research carried out in this domain. 1
The second chapter presents some existing techniques in the machine learning domain that we need in order to implement our proposed model. The third chapter introduces the basis of our proposed model from a theoretical perspective so that we can design it. The fourth chapter presents the practical part of our work, where we go through our model's implementation pipeline and discuss the results. Finally, we end with a general conclusion and perspectives.
Chapter I. State of the art

This chapter presents an overview of some existing works that deal with the issue of anomaly detection in general and terrorism detection in particular. We begin by defining the concept of anomaly and pointing out the importance of its application in the social media area. We then present an overview of some applied anomaly detection and terrorist detection works, categorized by their input format. The purpose of this chapter is to give the reader a general idea of the research carried out in the detection of anomalies and terrorism.

Introduction

Social media's main objective is to provide a platform for people to communicate and share their thoughts. Although most users use it in a friendly way, many others can exploit this ease of communication to plan attacks or incite others to adopt extremist behaviors. It is therefore extremely important to be able to detect these users accurately and quickly. Such users are often referred to as anomalies because of their abnormal behaviors.
Abnormal behaviors are behaviors that differ from, or follow an unusual pattern compared to, what is defined as normal sociocultural behavior. Our main objective behind this research is to study the characteristics that describe an anomalous individual. In social media, however, an anomalous user will certainly hide his anomalousness; time is therefore important, since we will be looking for peaks and deviations from his/her usual behavior pattern. Moreover, what is considered abnormal in today's socioculture could become normal after a period of time; thus, we should take the behavior's evolution into consideration when defining abnormal behavior.

Different models and approaches have been proposed toward solving this problem. Based on their input format, we can categorize them into activity-based detection, where the input data is the user's activity, and graph-based detection, where the input data is a graph of multiple users. However, anomaly itself is too abstract a term, which motivated us to work on only one concrete type of anomaly: terrorism.

To consider an individual a terrorist, we first have to define what a terrorist is, since there is no universal agreement on the definition [2]. Facebook, in its definition of dangerous individuals and organizations, defines terrorism as follows:

Terrorism: Any nongovernmental organization that engages in premeditated acts of violence against persons or property to intimidate a civilian population, government or international organization in order to achieve a political, religious or ideological aim. [3]

Since we are working with social media, we decided to adopt that definition. In the following sections, we start by presenting the existing activity-based and graph-based anomaly detection proposals, then we put our focus on the terrorist detection works.
1 Anomaly Detection in Social Media

This section presents the existing models and approaches for anomaly detection in social media, categorized by their input format. We examine whether the latest proposals can identify future changes in anomalous behavior, support changes in the user's behavior over time, and make use of multiple social networks.

1.1 Activity-based Detection

Activity-based detection approaches consider users to be largely independent from each other. An individual is defined by his/her own activities, which determine whether his/her behavior is abnormal.

In [4], the authors presented a survey of the available user profiling methods for anomaly detection, then proposed their own anomaly detection model. They showed the advantages and disadvantages of each model from a cybersecurity perspective: some models used operating system logs and web browser history as data sources, while others were more focused on social networks such as Twitter and Facebook. Their analysis revealed that the models based on history and logs were more limited and less consistent, since one cannot really know whether the same user is the only one using that operating system or web browser. The social-network-based models were more consistent, because they are private-account-based approaches that also include users' interactions with each other, which leads to better results. Based on the data sources of other methods, they defined a user profile representation as a vector of 7 main feature categories:

• Users' interests features
• Knowledge and skills features
• Demographic information features
• Intention features
• Online and offline behavior features
• Social media activity features
• Network traffic features

Each feature category contains features and sub-grouped features, which finally leads to more than 270 features that are mostly security-related. Their proposed model, called "Unified User Profiling" (Fig. I.1), mainly collects the data from
the different sources, then cleans and parses it to obtain structured data, which finally yields a user profile vector that the administrator can monitor across different categories to detect anomalies based on user activity. While their model is mostly complete in terms of features and considers different social networks, it is still limited in that it does not detect anomalies automatically.

Figure I.1: Unified User Profiling (UUP) system with cyber security perspective

In [5], the writers proposed a pattern recognition method that, given a user profile vector, takes the user's daily activity and creates a time-series pattern for each activity the user performs (Fig. I.2). Then, each time the user is involved in an activity, the new behavior is compared to his/her behavioral pattern for that activity. If a deviation from the normal behavior occurs, it is flagged as suspicious; but since a minor deviation does not always mean suspicion, the activity is also compared to a behavioral model of all system users so that false alarms are kept to a minimum. Their model is a random forest trained on the CERT dataset along with a private dataset acquired from NextLabs, and it achieved over 97% accuracy. This method showed great results for insider threat detection, which can be considered a single social network; it is thus still limited by not supporting multiple social networks, and it cannot learn future abnormal behaviors automatically over time.
Figure I.2: User Profiling Method in Authorization Logs

1.2 Graph-based Detection

Graph-based detection approaches consider user interactivity by analyzing a snapshot of a network. Each user can have relations with other users, such as mentions, shares and likes. There are two approaches: static and dynamic. In static graph-based detection, the analysis is done on a single snapshot of the network, while in dynamic graph-based detection, the analysis is done in a time-based way by analyzing a series of snapshots.

In [6], the writers proposed an anomaly detection framework in which, at each timestamp t, each user within a network has an activity score and a mutual score with other users. The scores are based on the user's activities and on the interactions of other users with these activities. A mutual agreement matrix is then produced to represent those scores, with the users' activity scores on the matrix diagonal. The users' scores are passed into an anomaly scoring function that the authors proposed and thresholded to decide whether a user is anomalous (Fig. I.3). As data sources, they used the CMU-CERT Insider Threat Dataset and the NATOPS Gesture Dataset, then compared the results of their framework to other known models. Their model was by far the best: it reached an area-under-curve score of around 0.95, while other models such as SVM and clustering
were around 0.89. Although the framework far exceeded the expected results for detecting insider threats and supports over-time behavior changes, it is still limited by not considering different input data types, such as images and texts, and by not analyzing multiple networks simultaneously.

Figure I.3: Context-aware graph-based approach framework

In [7], the authors proposed a user profiling approach based on user behavior features and social network connection features (Fig. I.4). The first set of features (user behavior features) is the foundation of the user representation and is composed of post content statistics, post content semantics, and user behavior statistics. The social network connection features are a set of features that lead to the construction of a network of similar users with similar network representations. The experimental results showed that using the network connections improved the model's overall score. Their approach reached second place among around 900 participants in the SMP 2017 User Profiling Competition. This work showed that the use of graphs and the consideration of user interactivity is an improvement toward grouping individuals and thus detecting anomalous communities. The limitation of this work is that it cannot detect future changes in a category's behavior.
Figure I.4: Forum user profiling approach framework

1.3 Summary

Within the scope of our research on anomaly detection in social media, we studied different papers. Table I.1 presents the advantages and limitations of those papers in terms of their support of multiple social networks (M-SN), support of multiple input data types such as text and images (M-IT), support of over-time user behavior changes (T-UBC), and their ability to learn future changes in abnormal behavior (T-FBC).
Paper                           Description                                     Input Format
Lashakry et al., 2019 [4]       Proposed model for user profile creation        User's activity
                                to monitor users
Zamanian et al., 2019 [5]       Proposed model for user activity pattern        User's activity
                                recognition with random forest
Bhattacharjee et al., 2017 [6]  Proposed a probabilistic anomaly                Graph of users
                                classifier model
Chen et al., 2018 [7]           Proposed a user profiling framework that        Graph of users
                                can be used to detect anomalous users

Table I.1: Anomaly detection existing works comparison

None of the mentioned works considered all of these functionalities together. Therefore, we decided to work on a model that supports all of them. To facilitate that, we adopted a hybrid architecture in which the input format is graph-based, to include user interactivity and ease the detection of communities, but which also focuses on the user's activity to solve our main problem of identifying the characteristics that describe an anomalous individual.

2 Terrorist Detection in Social Media

Since we decided on a hybrid architecture with both graph-input and activity-based detection, we identified the existing terrorist detection works that focus on users' social media content as well as those that treat a graph as an input. In this section, we present those papers to get a broader overview of how to solve our problem.
2.1 Existing Content-based Models

In this section, we focus on models that treat the content of the activities an individual can engage in on social media. These serve as a proof of concept for our implementation.

In [8], the writers implemented a model that detects extremists in social media based on information related to usernames, profiles, and textual content. They built their dataset from Twitter by looking for hashtags related to extremism, which resulted in around 1.5M tweets; they then extracted 150 ISIS-related accounts that posted those tweets and had been reported to the Twitter Safety account (@TwitterSafety) by regular users, plus 150 normal users to obtain a balanced dataset, along with 3k unlabeled accounts. Afterwards, they categorized the features into 3 major groups:

• Twitter handle (username) related features: length, number of unique characters, and Kolmogorov complexity of the username.
• Profile related features: this group contains 7 features related to the user's profile, such as the profile description, the number of followers, and the location.
• Content related features: the number of URLs, the number of hashtags, and the sentiment of the content.

Based on this dataset, they tried to answer two research questions:

• Are extremists on Twitter inclined to adopt similar handles?
• Can we infer the labels (extremist vs. non-extremist) of unseen handles based on their proximity to the labeled instances?

After their experiments with different supervised and semi-supervised approaches, both questions had a positive answer. SVM had the best precision score, 0.96, which shows the significance of the proposed feature set, but char-LSTM had the best precision-recall score, 0.76, minimizing the number of false negatives. This work presented different ways of collecting the necessary data for an extremist detection task.
They also showed that the use of different input data types from social media can help detect extremists. The limitation of this model is that it does not support over-time user behavior change and cannot learn future extremist behaviors.

In [9], the authors presented a convolutional neural network (CNN) to detect suspicious e-crimes and terrorist involvement by classifying social media image contents. They used three different kinds of datasets, of which we are only interested in the terrorism images dataset. Based on the transfer learning technique, they took the CNN architecture of the ImageNet model [10] and reduced its network size by lowering the kernel size of each layer, ending up with a new, smaller network (Fig. I.5). In their results, this architecture outperformed the default ImageNet model by around 1% in mean average precision and took half of its execution time. This paper showed that detecting terrorists based on their social media image content is possible, and highlighted the advantage of using transfer learning rather than building a CNN from scratch. However, their model supports only one type of data: images.

Figure I.5: Transfer-Learning CNN Framework

In Table I.2 we present the content-based models that we analyzed, with their advantages and limitations.
Alvari et al., 2019 [8]
  Description: (semi-)supervised extremist detection model based on the user's general information and textual-content data.
  Advantages: proof of concept of detection based on textual content and general information; supports multiple input data types.
  Limits: cannot support multiple social networks; cannot detect whether a user is adopting new behaviors over time; cannot learn future behavior changes.

Chitrakar et al., 2016 [9]
  Description: image classification model using CNN and transfer learning.
  Advantages: proof of concept of image-content-based detection; highlighted a model improvement technique, transfer learning.
  Limits: cannot support multiple input data types; cannot learn future behavior changes.

Table I.2: Activity-based techniques comparison

2.2 Existing Graph-input Analysis

In this section, we study the existing works that use a graph as an input for the problem of terrorist detection in social media.

In [11], the authors proposed a framework that treats a multidimensional network as an input for the identification of the key actors of a terrorist network. The dimensions represent the types of relationships or interactions on a social media platform. The workflow of their framework starts by building a multidimensional network through a keyword-based search on a social media platform; that network is then mapped to a single-layer network using certain mapping functions. To detect the key actors, they use several centrality measures
such as degree centrality and betweenness centrality. The output of the framework is a ranked list of the key actors within the network. The framework's effectiveness was evaluated on a ground-truth dataset covering a 16-month period of Twitter data. Fig. I.6 presents the workflow of this framework. This work demonstrated the usage of multidimensional networks and how we can analyze them to detect a terrorist network's key actors. Their usage of multiple dimensions could be more efficient if they considered multiple social media platforms instead of multiple relationship and interaction types.

Figure I.6: Multidimensional Key Actor Detection Framework

In [12], the writers created a survey on social network analysis for counter-terrorism in which they cover data collection methods and the different types of analysis. The two sources of data are online social networks and offline social networks. Online social networks are social media websites that allow users to interact with each other by sending messages and posting information; these are websites like Facebook, Twitter and YouTube, from which we collect data using their APIs. Offline social networks, on the other hand, are real-life social networks based on relations such as financial transactions, locations and events; these are public databases such as the Global Terrorism Database (GTD) [13] and the Global Data on Events, Location and Tone (GDELT) [14].
Furthermore, they analyzed the different centrality measures that indicate the importance and position of a node in a network, such as:

• Degree centrality: a node with a higher degree value is often considered an active actor in a network. The degree value is the number of connections linked to a node. [15]
• Closeness centrality: a node with a higher closeness value can quickly access other nodes in a network. The closeness value is a measure of how fast a node can reach other nodes. [15]
• Betweenness centrality: a node with a higher betweenness value is often considered an influencer in a network. The betweenness value is the number of shortest paths between any pair of nodes that pass through the node; we can see this as identifying which nodes act as bridges between communities in a network. [15]

Finally, they compared some SNA tools based on functionality, platform, license type and file formats. They concluded that, when doing social network analysis, the main challenge is the data itself: user privacy is a very sensitive issue, and the data often tends to be incomplete, with many missing or fake nodes and relations, which frequently leads to incorrect analysis results. This survey provided us with the different data collection methods as well as graph analysis methodologies.

Conclusion

In this chapter, we presented some existing works that have dealt with anomaly detection in general and terrorist detection in particular, following different approaches. To the best of our analysis, the existing methods did not deal with terrorism in multidimensional graphs while combining different types of classification in a time-based way. This motivated us to provide a terrorism detection model for multidimensional graphs that supports different types of input data and can also detect over-time behavior change.
In the next chapter, we initiate research on the existing techniques needed to implement our proposed model.
Chapter II. Existing Techniques

This chapter presents the techniques necessary to implement our proposed model. We begin by presenting the different input data types that we consider and the techniques used to analyze each type. Then, we present the classification models to use and how they work.

Introduction

Each social network hosts ample data of various kinds; identifying these data types and choosing which ones we will work with is an important task toward achieving our goal. In our previous analysis of the existing proposals, the authors of [4] identified nearly 270 security-related anomaly detection features, some of which were social media activity features. We analyzed those features and, based on [8, 9], grouped them into three data type categories, namely textual-content data, image-content data, and numerical-content data. Different classification models exist to classify an individual based on such content. In the next sections, we begin by giving an overview of the identified input data types and their analysis approaches; we then present the different classification models.
1 Data Types

In this section, we briefly introduce each type of data, along with the chosen approach to its analysis and classification.

1.1 Textual-Content Data

Textual-content data consists mainly of characters that are part of a certain language and can be read by a human being. We begin by presenting the chosen text analysis approach, then we decide on a data representation technique to transform the text into numerical input.

1.1.1 Text Analysis

The most commonly used technique in text analysis is text mining. Text mining is the process of extracting high-quality information from textual data, where the information can be patterns or matching structures in text, without consideration of its semantics. Its outcome is mostly statistical information such as the frequency and correlation of words. [16]

In the terrorism detection domain, we are interested in knowing what the user is trying to incite with a post and whether it is serious, sarcastic, or reporting news. To differentiate these, we need to go through semantic analysis rather than treating words as mere objects. One of the most important text-mining processing methodologies, which also considers the semantics of words, is natural language processing.

Natural language processing (NLP) is the process of making the computer understand the language spoken by humans, along with the semantics and sentiments it conveys, through analyses such as morphological, syntactic and semantic analysis [16].

The first step in NLP is morphology processing, which involves analyzing the structure of words by studying their construction from primitive meaningful units called
morphemes. This will help us divide the different words/phrases of a document into tokens that will be used in later analysis.

Morphemes are the smallest meaningful units in a word. There are two types of morphemes, namely stems and affixes: the stem is the base or root of a word, and an affix can be a prefix, an infix or a suffix. Affixes never appear in isolation; they are always combined with a stem. Taking the example of Fig. II.1, we can see how a word is split into a stem, which carries the main meaning of the word, and some affixes.

Figure II.1: An example of morpheme extraction

Tokens are words, keywords, phrases or symbols that constitute a useful semantic unit for processing. We refer to their extraction process as tokenization. A token is mainly composed of a lemma + a part-of-speech tag + grammatical features. Example:

• plays → play (lemma) + Noun (part-of-speech tag) + plural (grammatical feature)
• plays → play (lemma) + Verb (part-of-speech tag) + singular (grammatical feature)

After studying the structure of the words, we have to examine their arrangement and combination in a sentence using syntax analysis. In a sentence, the arrangement of words follows the precise rules of the language's grammar. Taking the example sentence "Three people were killed in an incident today" and following an English grammar parser, we end up with the example of Fig. II.2, where we have grammatical groups such as S for sentence, NP for noun phrase, VP for verb phrase, NN for singular nouns and NNS for plural nouns.
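The tokenization and stem/affix splitting described above can be sketched in plain Python (a toy illustration only: the suffix list and helper names are our own simplifications, not a real morphological analyzer):

```python
# Toy tokenizer and suffix stripper illustrating stems and affixes.
# The suffix inventory below is a deliberately tiny assumption.
SUFFIXES = ("ing", "ed", "s")

def tokenize(sentence):
    """Split a sentence into lowercase word tokens, dropping punctuation."""
    return [w.strip(".,!?").lower() for w in sentence.split()]

def strip_suffix(token):
    """Return (stem, suffix), removing the longest known suffix if one fits."""
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[:-len(suf)], suf
    return token, ""

print(tokenize("Three people were killed in an incident today"))
print(strip_suffix("plays"))   # ('play', 's')
print(strip_suffix("killed"))  # ('kill', 'ed')
```

A real pipeline would instead use a lemmatizer and part-of-speech tagger (for example, those shipped with NLTK) to produce the lemma + tag + grammatical-feature tokens described above.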
Figure II.2: An example of syntax analysis

This analysis enables the machine to understand the relationships between the words and the different references.

After structuring the words and studying their relationships, it is time for the machine to understand the meaning of the words and phrases, along with the context of the document. Focusing on the relationships between words and elements such as synonyms, antonyms and hyponyms (hierarchical order of meaning), the semantic system is able to build blocks composed of:

• Entities: individuals or instances.
• Concepts: categories of individuals, or classes.
• Relations: relationships between entities and concepts.
• Predicates: verb structures or semantic roles.

These can be represented through methods such as first-order predicate logic (FOPL), semantic networks and conceptual dependency. Fig. II.3 illustrates an example of a semantic network using our previous example sentence, "Three people were killed in an incident today".
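The semantic building blocks above (entities, concepts, relations, predicates) can be sketched as a tiny store of subject-relation-object triples (a toy sketch; the triples and relation names are our own illustrative assumptions, not the output of a real semantic parser):

```python
# A toy semantic network for "Three people were killed in an incident today",
# stored as (subject, relation, object) triples.
triples = [
    ("people", "is_a", "concept"),
    ("incident", "is_a", "concept"),
    ("kill", "is_a", "predicate"),
    ("kill", "patient", "people"),
    ("kill", "location", "incident"),
    ("kill", "time", "today"),
    ("people", "count", "three"),
]

def related(node):
    """Return every triple in which the node appears as subject or object."""
    return [t for t in triples if node in (t[0], t[2])]

print(related("people"))  # the three triples mentioning "people"
```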
Figure II.3: An example of semantic network

Based on these semantics, the machine can now learn the meaning of the words and the text; from this point it is therefore possible to learn the meaning of the user's textual data.

1.1.2 Data Representation

After going through text analysis, our machine can understand the meaning of the textual content data. But in order to build a classifier that will automatically categorize current and future data, the data must be numerical, so that mathematical rules can be applied while its semantics are preserved. Word embedding is one of the most popular representations of textual data: it transforms a word in a document into a vector of numerical features, where close vectors usually mean that the words share the same meaning or appear in the same context, so the data does not lose its semantics. During our research, the most used word embedding techniques were Word2Vec and Term Frequency-Inverse Document Frequency (TF-IDF).

Word2Vec uses two different approaches, namely Continuous Bag Of Words (CBOW) and Skip-Gram, both based on neural networks that take a context as input and use back-propagation to learn [17]. The mathematical background of Word2Vec tries to maximize the probability of the next word wt given the previous context h. Thus,
the probability P(wt | h) is given in Equation II.1, where score(wt, h) computes the compatibility of wt with the context h and softmax is the standard softmax function.

P(wt | h) = softmax(score(wt, h))   (II.1)

CBOW learns the embedding of a word by predicting it from the surrounding words, which are considered the context. Skip-Gram learns the embedding of a word by considering the current word as the context and predicting the surrounding words. According to [17], Skip-Gram is able to work with less data and represents rarer words better, while CBOW is faster and represents frequent words more clearly.

TF-IDF represents words with weights based on the product of the term frequency and the inverse document frequency. In simpler terms, words that occur frequently throughout the document should be given very little weight or significance; in English, for example, such terms include "the", "or" and "and", which do not provide much value. However, if a word appears rarely, or appears frequently but only in one or two places, it is identified as a more important word and should be weighted accordingly [18].

Term frequency (TF) is the proportion of occurrences of a term t in a document d. As illustrated in Equation II.2, we calculate the term frequency by dividing the number of times a term t appears in a document d by the total number of words in the document d.

tf(t, d) = n(t, d) / Σ_term n(term, d)   (II.2)

where n(t, d) is the number of occurrences of term t in the document d, and Σ_term n(term, d) is the sum of the occurrences of all the terms that appear in the document d, i.e. the total number of words in the document d.
Inverse document frequency (IDF) ranks a term t by its relevance across documents. Equation II.3 shows the formula: we take the total number of documents N and divide it by df_t, the number of documents that contain the term t.

idf(t) = log_e(N / df_t)   (II.3)

Finally, to get the weight w(t, d) of the word t in a document d using TF-IDF, we multiply tf(t, d) by idf(t), as shown in Equation II.4.

w(t, d) = tf(t, d) × idf(t)   (II.4)

As found in our review of existing research, such as [18], Word2Vec performs better in terms of memory, execution time and embedding quality for words that are similar in context and meaning, while TF-IDF performs better at identifying the words that determine a document's category; in other words, it detects the keywords that identify a category of documents. Table II.1 summarizes the advantages and disadvantages of each method.

Word2Vec
  Advantages: optimized memory usage; fast execution time.
  Disadvantages: contains a lot of noisy data; does not work well with ambiguity.

TF-IDF
  Advantages: the vocabulary is built with words that identify the category; extracts relevant information.
  Disadvantages: high memory usage; the closest words are not similar in meaning but in the category of the document's context.

Table II.1: Comparison of word embedding methods
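Equations II.2-II.4 can be sketched directly in Python (a minimal sketch; the two-document toy corpus is our own illustrative assumption, and real pipelines would use a library implementation):

```python
import math

# Minimal TF-IDF computation following Equations II.2-II.4.
docs = [
    ["the", "attack", "was", "planned", "the", "night", "before"],
    ["the", "weather", "was", "nice"],
]

def tf(term, doc):
    """Term frequency (Eq. II.2): occurrences of the term / words in the doc."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency (Eq. II.3): log_e(N / df_t)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tfidf(term, doc, corpus):
    """TF-IDF weight (Eq. II.4): tf * idf."""
    return tf(term, doc) * idf(term, corpus)

# "the" occurs in every document, so idf = log(2/2) = 0 and its weight vanishes,
# while "attack" occurs in only one document and receives a positive weight.
print(tfidf("the", docs[0], docs))     # 0.0
print(tfidf("attack", docs[0], docs))  # (1/7) * log(2), about 0.099
```

This matches the intuition given above: ubiquitous words like "the" get zero weight, while rare, document-specific words are the ones that identify a category.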
1.2 Image-Content Data

This type of data is anything that is a visual representation of something. Different approaches are available for image processing, but as determined in [10], the convolutional neural network is by far the best-performing method for image classification in terms of precision and execution time.

A convolutional neural network is a deep learning algorithm and an extension of the neural network, distinguished from other methods by its ability to account for spatial structure and translation invariance. This means that regardless of where an object is located in an image, it is still recognized as the same object [19]. The advantage of having a multidimensional input, unlike regular neural networks that use a vector as input, makes it perform better on image data, since images usually have three color channels (RGB), which makes them three-dimensional matrices. Taking the example of a 32×32 image with 3 color channels, we would have 32×32×3 = 3072 weights per neuron in a regular neural network; for a 512×512 image, we would have 512×512×3 = 786432 weights. This results in huge calculations as well as over-fitting due to having too much information and detail. [20]

A simple CNN is a sequence of layers: convolutional layers, pooling layers and a fully-connected layer. In a typical CNN, there are several rounds of convolution/pooling before we proceed to the fully-connected layer.

1.2.1 CNN: Convolutional Layer

Each convolutional layer of the network has a set of feature maps that can recognize increasingly complex patterns/shapes in a hierarchical manner. Instead of regular matrix multiplications, the convolutional layer uses convolution calculations. To do that, it constructs filters and applies calculations with them, while using optimization techniques such as striding and padding.
Filters are used to detect patterns in an image; they also offer weight sharing. For example, a filter which detects a curved edge (Fig. II.4) matches the left corner of an image but may also match the bottom-right corner of the image if both corners have curved edges.

• 36. Chapter II. Existing Techniques

Figure II.4: Curved Edge Filter

Calculations are matrix multiplications used to apply a filter to an input image. Let us consider the following 5×5 input and 3×3 filter:

    | 0 0 1 1 0 |     | 1 1 0 |     | ? ? ? |
    | 1 1 3 1 2 |  *  | 0 0 1 |  =  | ? ? ? |
    | 1 0 1 4 2 |     | 1 0 0 |     | ? ? ? |
    | 0 2 2 1 0 |
    | 3 4 1 0 0 |

In order to get the value of the first '?', we apply the filter to the first 3×3 block of pixels:

? = (0*1) + (0*1) + (1*0) + (1*0) + (1*0) + (3*1) + (1*1) + (0*0) + (1*0) = 4.

Then we continue: the value next to '?' is the value of the second 3×3 block of pixels, in which '3' is the center. This means we moved by 1 pixel to the right. 24
• 37. Chapter II. Existing Techniques

    | 0 0 1 1 0 |     | 1 1 0 |     | 4 ? ? |
    | 1 1 3 1 2 |  *  | 0 0 1 |  =  | ? ? ? |
    | 1 0 1 4 2 |     | 1 0 0 |     | ? ? ? |
    | 0 2 2 1 0 |
    | 3 4 1 0 0 |

? = (0*1) + (1*1) + (1*0) + (1*0) + (3*0) + (1*1) + (0*1) + (1*0) + (4*0) = 2. And so on.

Striding is a parameter for how many pixels we move to calculate the next value. It is mainly used to reduce the calculation, since values next to each other are more likely to be similar. In our last example the stride was 1: we moved the window by only 1 pixel to get the next value. Usually, we use a value of 2 or 3, since in most cases being 2-3 pixels apart marks a variation or a change of pattern.

Padding is used to prevent information loss. In our example, when we applied the filter, we never considered the values of the first/last rows and the first/last columns as centers of the 3×3 block. To fix that, we add zero padding, which adds new rows/columns filled with 0:

    | 0 0 1 1 0 |        | 0 0 0 0 0 0 0 |
    | 1 1 3 1 2 |        | 0 0 0 1 1 0 0 |
    | 1 0 1 4 2 |   ⇒    | 0 1 1 3 1 2 0 |
    | 0 2 2 1 0 |        | 0 1 0 1 4 2 0 |
    | 3 4 1 0 0 |        | 0 0 2 2 1 0 0 |
                         | 0 3 4 1 0 0 0 |
                         | 0 0 0 0 0 0 0 |

1.2.2 CNN: Pooling Layer

The pooling layer is used to determine what information is critical and what constitutes irrelevant detail. There are many types of pooling layers, such as the max pooling layer and the average pooling layer.
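The convolution, stride and padding mechanics described above can be sketched with NumPy. This is an illustrative re-implementation that reproduces the worked example, not the thesis code or a library-grade convolution:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    # Slide the kernel over the image and take the element-wise
    # multiply-and-sum at each position ("valid" mode, no padding).
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# the 5x5 input and 3x3 filter from the worked example
image = np.array([[0, 0, 1, 1, 0],
                  [1, 1, 3, 1, 2],
                  [1, 0, 1, 4, 2],
                  [0, 2, 2, 1, 0],
                  [3, 4, 1, 0, 0]])
kernel = np.array([[1, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])

result = convolve2d(image, kernel)             # result[0, 0] == 4, result[0, 1] == 2
padded = convolve2d(np.pad(image, 1), kernel)  # zero padding keeps a 5x5 output
```

With stride 1 and no padding, the 5×5 input shrinks to a 3×3 output; zero-padding by one pixel preserves the 5×5 size, exactly as argued above.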
With max pooling, we look at a neighborhood of pixels and keep only the maximum value. Consider a 2×2 max pooling with a stride of 2: 25
• 38. Chapter II. Existing Techniques

    | 1 0 0 1 |
    | 3 2 0 2 |   ⇒   | 3 2 |
    | 0 0 4 2 |       | 4 4 |
    | 4 1 0 1 |

For each 2×2 block we took the maximum value, and each time we moved by two pixels (the stride) to get the next 2×2 block.

1.2.3 CNN: Fully-Connected Layer

A fully-connected layer is a layer in which all the inputs are connected to all the outputs. In a CNN it is used to finally determine the class that will be assigned to our main input. Before proceeding to the fully-connected layer, we have to use a technique called flattening in order to generate the vector needed by this layer.

Flattening:
• Each 2D matrix of pixels is turned into 1 column of pixels.
• Each one of our 2D matrices is placed on top of another.

1.3 Numerical-Content Data

Numerical-content data is data based on numbers that can be statistically interpreted. This type of data does not require pre-processing; thus, it can be directly fitted into a model. The models for this type of data are mostly the general statistical machine learning models that we present later.

2 Data Classification in Machine Learning

Machine learning is a subset of the artificial intelligence domain that makes the machine able to automatically gain knowledge from experience without being explicitly programmed. By following statistical and mathematical concepts, it looks for patterns in the data we provide, learns them and makes better decisions in the future. [21]

Several learning methods exist in Machine Learning: 26
• 39. Chapter II. Existing Techniques

• Supervised Learning: Given a sample of data and the desired output, the machine should learn a function that maps the inputs to the outputs.
• Unsupervised Learning: Given a sample of data without the output, the machine should learn a function that categorizes these samples based on learned patterns.
• Semi-Supervised Learning: Given a small amount of data with the desired output (labeled data) and other data without output (unlabeled data), the machine should learn a function that can label the unlabeled data using the knowledge learned from the labeled data.
• Reinforcement Learning: Given a sample of data, a set of actions and rewards related to those actions, the machine should learn a function that finds the optimal actions toward achieving maximum rewards.

Classification is part of supervised learning, in which the machine categorizes newly observed data based on the learned patterns of each category from the training data. In the following sections, we present the most common classification algorithms.

2.1 Support Vector Machines

A support vector machine model is a representation of the data in a space. Examples of the same category are close to each other. The group of examples in a category is separated by a clear gap, as wide as possible, from the examples of another category. New observed examples are then predicted to be part of a category based on the side of the gap on which they fall. [22]

2.2 Logistic Regression

Logistic regression is a statistical model that analyses data in which there is at least one feature that could determine the outcome. By using a logistic function, it models a binary output measured with a dichotomous variable. Since the output is binary, it can only be used for binary classification problems.
To use it for a multi-class problem, N logistic regression models should be trained, where N is the number of classes; each model is trained on one class with a one-vs-all approach. [23] 27
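The one-vs-all scheme can be sketched from scratch with NumPy on a toy, well-separated dataset. This is an illustration under invented data, not the thesis implementation (which would typically rely on a library such as scikit-learn):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, lr=0.5, epochs=8000):
    # plain gradient descent on the logistic loss for one binary model
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        err = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b

def one_vs_rest_fit(X, y):
    # N binary models, each trained on "this class vs all the others"
    return [train_binary(X, (y == c).astype(float)) for c in np.unique(y)]

def one_vs_rest_predict(models, X):
    # pick the class whose binary model is the most confident
    scores = np.column_stack([sigmoid(X @ w + b) for w, b in models])
    return scores.argmax(axis=1)

# three invented, linearly separable clusters (N = 3 classes)
X = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5],
              [5.0, 0.0], [5.5, 0.0], [5.0, 0.5],
              [0.0, 5.0], [0.5, 5.0], [0.0, 5.5]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
models = one_vs_rest_fit(X, y)
```

Each of the N binary models outputs a probability of belonging to its class; prediction takes the argmax over those probabilities.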
• 40. Chapter II. Existing Techniques

2.3 Neural Networks

A neural network is a network in which we have multiple layers of perceptrons. A perceptron is the elementary unit of an artificial neural network, introduced as a model of biological neurons in 1959 [24]. The output of each perceptron in a layer is connected as an input to each perceptron of the next layer, which is why such a layer is known as a fully connected layer. A neural network must have an input layer, an output layer and, in between, a hidden layer. Any neural network with more than one hidden layer is considered a deep neural network. [20]

Conclusion

In this chapter, we studied the existing techniques needed to perform classification on textual-content data, image-content data and numerical-content data. In the next chapter, we detail the basis of our proposed model. 28
• 41. III Proposed Model

This chapter introduces a novel time-based terrorism detection model that works with multidimensional networks and different types of input data. The output of our model is the set of nodes that belong to terrorist regions in a graph, across the dimensions of the multidimensional network. To identify this type of node, we first have to determine what the terrorist regions are and how to create them. Then, we examine the network to estimate a terrorism score for each node in a dynamic way, in order to detect behavior changes over time. First, we introduce the purpose of the model along with the proposed research questions; then we present the sources of data. After that, we present in detail the theoretical approach toward constructing our model, and we finish with a conclusion.

Introduction

Nowadays, social networks provide many types of data that could be used, such as images, texts and videos, but most existing models work on a specific type of data from a specific social network. Our proposed model tries to overcome this limitation by supporting a multidimensional network as input, in order to be able to use data from multiple social networks at the 29
• 42. Chapter III. Proposed Model

same time, along with support for different input data types. In addition, the model considers the evolution of an individual's behavior over time to detect deviation from the usual behavior pattern. Furthermore, the model adapts itself to the behavior's evolution so as to stay updated with new abnormal behaviors.

Before describing the basis of the model construction, it is first necessary to present the research questions that will be used as a metric to track the accuracy of our proposed model in solving the main research problem of the thesis, which is the study of the characteristics that describe a terrorist on different social media platforms. The research questions being posed are as follows:

Q1: Can we identify the behavior of a terrorist based on his/her social media content?
Q2: Can machine learning help automatically detect if a user is adopting a terrorism behavior over time?
Q3: Do terrorists adopt the same behavior on different social networks?

In order to answer these research questions, we have to pass through several phases:

• Phase 1: Identifying the available data sources
• Phase 2: Determining the convenient classification approach
• Phase 3: Estimating the terrorism score calculation

First, we start by collecting the necessary data for each user. Then, we create a multidimensional network where each dimension represents a social network. Once the network is ready, it is used as input to our model, where each feature from each social network is mapped to its respective sub-model. Finally, a decision score is calculated. If the node is detected as a terrorist, the model is re-trained with those new inputs to stay up-to-date with the newest (unseen) terrorist behaviors; in case the model loses accuracy once updated, it is reverted to the last version.
Additionally, each node is passed to the model each time it is involved in a new activity; that way, a node can also be flagged as terrorist once the user adopts terrorism behavior over time. 30
• 43. Chapter III. Proposed Model

1 Data Collection

As part of phase 1, the data sources for the different data types should be identified. As discussed in the last chapter, there exist three types of data:

• Textual-Content Data: posts, comments, image captions, text in an image, etc.
• Image-Content Data: posted photos, profile pictures, etc.
• Numerical-Content Data: age, number of friends, average posts per day, etc.

Several other pieces of information exist on social media, such as username, gender and relationship. Therefore, instead of keeping the numerical-content data category, we opted for another category named general information data, comprising the existing numerical-content data in addition to the user's profile information.

We present next the data sources of the different data contents that we have. As mentioned in [12], we can categorize the data sources into two categories, namely: offline data sources and online data sources. In this section, we provide the sources of both offline and online data that are used in order to retrieve our target data types for model training and later prediction. As a strategy for training the model and precisely distinguishing terrorism from other similar data, we decided to consider terrorist contents as positive labels against military and news contents as negative labels; as these types of contents are related, training them against each other will make the model more precise.

1.1 Offline Data Sources

Offline data is the data used for the model training, which was gathered from public terrorism datasets. For each input type, we used a different dataset. All of them define terrorism from the American point of view.
Taking inspiration from [8], for the textual-content data we use the Twitter API to gather tweets that contain terrorism-related hashtags and tweets from terrorist accounts that were reported to Twitter's safety account (@twittersafety), ensuring that they are not anti-terrorist accounts.

• 44. Chapter III. Proposed Model

With that, we create our offline textual-content dataset, where we consider those tweets as positive labels against terrorism news tweets and news headlines gathered from other public datasets, such as the Global Terrorism Database (GTD) [13], as negative labels. We will also be using the Google Translate API, since some accounts may publish tweets in different languages.

For the image-content data, we did not find a public terrorism-related image dataset within the scope of our research. We decided to use a manual web scraping method with Google Images as our data source. We will be manually gathering images of terrorist individuals and of incitement to terrorism, which are our positive labels, and contrasting them against military and terrorism news images, which are our negative labels.

For general information data, the Study of Terrorism And Responses to Terrorism (START) published a database called Profiles of Individual Radicalization In the United States (PIRUS) [25], which contains approximately 145 features about many radical profiles in the United States, from which we extract our project's relevant features: age, gender, relationship, etc.

1.2 Online Data Sources

Online data is the social network data that is part of the prediction and of future model re-training. The sources for it are the public APIs provided by the social networks. For social media, we decided to study three popular platforms that have similar data contents and that can also be linked together: Facebook, Instagram and Twitter.

Facebook provides the Graph API, which is an HTTP-based API service to access the Facebook social graph objects [26]. With the right permissions, the Graph API allows you to query public data as well as create content [27]. The data is rich with semantics, since the Graph API utilizes the RDF format as a return type. [28] Instagram, as part of Facebook, also provides the Graph API for business accounts [29].
For normal user accounts it provides a REST API that returns JSON objects for querying public data. [30] Twitter hands over a REST API with a JSON return format that provides several public data queries, as well as private data with the right permissions. [31, 32] 32
• 45. Chapter III. Proposed Model

2 Proposed Model Design

With the data preparation phases ready, we can now determine the classification approach that we will be using, along with the terrorism score formula, thus completing phase 2 and phase 3. In this section, we explain the theoretical side of the necessary steps toward constructing our proposed model. As previously mentioned, the model takes a graph as input. Then comes an individual, content-based classification with a decision-making component that calculates the final score of the node and applies a threshold to determine whether the user is a terrorist.

2.1 Model Input

Inspired by [11], the best way to represent our input data is a multidimensional network. However, unlike their proposal, the dimensions in our work represent each social network used. Let G = (V, E, D) denote an undirected, unweighted multidimensional graph, in which V is a set of nodes representing the users, D reflects the dimensions, which are the social networks, and E = {(u, v, d); u, v ∈ V, d ∈ D} represents the set of edges, i.e. the connections between users, which represent things such as relationships, shared comments or post sharing. Fig. III.1 illustrates what this network looks like. At each timestamp, the user has his/her data inserted into our model to obtain his/her score. The timestamp here is each time the user is involved in a new activity, which is the method used by [5]. 33
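As a minimal illustration of the definition G = (V, E, D) above, the multidimensional graph can be sketched as a set of (u, v, d) triples. The class name, dimension labels and sample users here are invented for illustration, not from the thesis:

```python
class MultidimensionalGraph:
    """Undirected, unweighted graph whose edges are tagged with a dimension."""

    def __init__(self, dimensions):
        self.dimensions = set(dimensions)   # D: the social networks
        self.nodes = set()                  # V: the users
        self.edges = set()                  # E: (u, v, d) triples

    def add_edge(self, u, v, d):
        assert d in self.dimensions
        self.nodes.update((u, v))
        # store both orientations so the graph behaves as undirected
        self.edges.add((u, v, d))
        self.edges.add((v, u, d))

    def neighbors(self, u, d):
        # users connected to u within the single dimension d
        return {v for (a, v, dim) in self.edges if a == u and dim == d}

# illustrative usage with invented users
G = MultidimensionalGraph({"facebook", "twitter", "instagram"})
G.add_edge("alice", "bob", "facebook")
G.add_edge("alice", "carol", "twitter")
```

Keeping the dimension on each edge lets the model later compute a per-dimension score S(u)_d by looking only at the edges of that social network.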
• 46. Chapter III. Proposed Model

Figure III.1: Multi-dimensional Network

2.2 Content-Based Classification

The model itself contains three different sub-models, one for each content type we have.

2.2.1 Text Classification Model

As mentioned in the previous chapter, before applying machine learning classification models to textual content, we have to perform text analysis and transform the text into numerical input that a model can understand. As illustrated in Fig. III.2, when the textual data is received, it first has to 34
• 47. Chapter III. Proposed Model

pass through the NLP process. Once that is done, it has to be represented in a numerical way. In the last chapter, we presented a comparison between two word embedding techniques, namely Word2Vec and TF-IDF. We chose TF-IDF because, for a classification problem, we are more interested in differentiating the categories than in representing the similarity of word meanings. Now that our machine can understand our textual data and the data itself can be represented numerically, we can pass it to any machine learning model. As a strategy, we decided that in the implementation phase we would try the different models mentioned in the last chapter, such as Support Vector Machines, Logistic Regression and Neural Networks, then compare their results to assess which one performs best.

Figure III.2: Text Classification Model 35
• 48. Chapter III. Proposed Model

2.2.2 Image Classification Model

In the previous chapter, we presented the convolutional neural network as a model to use for image classification. However, designing a CNN requires ample parameter tuning and the adding/removing of convolution blocks to find the best architecture, while re-training the model each time. This is a hugely time-consuming task. To overcome it, there is a technique called transfer learning that can help obtain better results faster.

Transfer Learning is a technique that makes a model benefit from knowledge gained while solving another, similar problem. For example, a model that learned to recognize cars could use its knowledge to recognize trucks [33]. This is done by taking a pre-trained model, changing a few layers, usually the last ones, and re-training only those layers. It is shown in [34] that transfer learning can bring a huge improvement in accuracy, execution time and memory usage.

Another known limitation that we usually encounter in image classification is not having diverse enough data or enough samples. A solution to that is the data augmentation technique.

Data Augmentation is a technique for generating more data, because having little data and not enough variation leads to a bottleneck in neural network models, which usually require thousands of training samples with diverse variation to be able to generalize the learning. This is done using techniques such as:

• Flipping: Flip the image horizontally or vertically.
• Rotating: Rotate the image by some degrees.
• Scaling: Re-scale the image by making it larger or smaller.
• Cropping: Crop a part of the image.
• Translating: Move the image in some direction.
• Adding Gaussian Noise: Add noisy points to the image. 36
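A few of the augmentation operations listed above can be sketched with plain NumPy array operations. This is a toy illustration on a stand-in "image"; real pipelines typically use library utilities such as Keras' ImageDataGenerator:

```python
import numpy as np

img = np.arange(12).reshape(3, 4)          # stand-in for a grayscale image

flipped_h = img[:, ::-1]                    # flip horizontally
flipped_v = img[::-1, :]                    # flip vertically
rotated   = np.rot90(img)                   # rotate by 90 degrees
shifted   = np.roll(img, shift=1, axis=1)   # translate right by one pixel
noisy     = img + np.random.normal(0.0, 0.1, img.shape)  # add Gaussian noise
```

Each transform yields a new training sample with the same label, which is what lets a small dataset stand in for a larger, more varied one.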
• 49. Chapter III. Proposed Model

Applying data augmentation can help improve the model score, as discussed in [35]. Therefore, as illustrated in Fig. III.3, once we have image data, it passes through our trained CNN model, resulting in an image-content score.

Figure III.3: Image Classification Model

2.2.3 General Information Classification Model

For the general information model, the features do not require pre-processing for the machine to understand them. We have to follow some encoding techniques for the non-numerical data, then fit that to a supervised machine learning classification model. Non-numerical features such as gender and relationship have to be encoded into numerical values. As these are binary, we can use 0 and 1. For non-binary values, we have to use techniques such as one-hot encoding or integer label encoding (the sparse format expected by a sparse categorical cross-entropy loss).

As for the username, we can apply some feature engineering to create relevant features from it, such as its length, its number of unique characters and other important information, as discussed in [8]. Other numerical features, such as the age, the number of friends and the number of followers, can be passed directly to the model. In the implementation phase we try different classification models and compare their results to select the one that performs best.

Figure III.4: General Information Classification Model 38
• 51. Chapter III. Proposed Model

2.3 Decision Making

Now that we have a model for each data type, we can move to phase 3, where we propose a calculation formula that provides a score for each user. While doing our work, and based on the available features, we noticed that the textual content and the image content have more impact on characterizing the user's behavior than the general information, which could be misleading. Therefore, as a compromise, we decided to give each input data a weight relative to its impact on determining the anomaly of the user. Taking 3 scores, one per sub-model, {s1, s2, s3} and 3 weights {α1, α2, α3}, each node u ∈ V on each dimension d ∈ D has the terrorism score S(u)_d of that dimension, as in (III.1):

S(u)_d = Σ_{i=1..3} α_i × s(u)_i   (III.1)

Now each user has a score for each dimension, based on the sub-model scores of that dimension, but as an output we want a single score. For that, given 3 dimensions, each user has a terrorism score S_T(u), as in (III.2):

S_T(u) = (Σ_{d=1..3} S(u)_d) / 3   (III.2)

Now that each user u ∈ V has a terrorism score S_T(u), we have to decide whether that user is a terrorist or not. This is done by defining a certain threshold γ where:

S_T(u) ≥ γ ⇒ Terrorist
S_T(u) < γ ⇒ Not Terrorist   (III.3)

The values of the weights α_i and the threshold γ are determined in the implementation phase.

2.4 Global Model

Having defined the different components of our model, let us present its design along with the workflow of how to use it. Fig. III.5 shows what our model looks like, using the example of a single user with three dimensions that are the Facebook, Twitter and Instagram data. 39
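The scoring chain of Equations III.1, III.2 and III.3 can be written directly in Python. The weight and threshold values below are placeholders only, since the thesis determines the actual α_i and γ in the implementation phase:

```python
ALPHA = [0.4, 0.4, 0.2]   # placeholder weights for text, image, general-info
GAMMA = 0.7               # placeholder decision threshold

def dimension_score(sub_scores, weights=ALPHA):
    # Equation III.1: weighted sum of the three sub-model scores s(u)_i
    return sum(a * s for a, s in zip(weights, sub_scores))

def terrorism_score(per_dimension_scores):
    # Equation III.2: average of the per-dimension scores S(u)_d
    return sum(per_dimension_scores) / len(per_dimension_scores)

def is_flagged(score, gamma=GAMMA):
    # Equation III.3: threshold decision
    return score >= gamma
```

A user's final label is therefore is_flagged(terrorism_score([S_facebook, S_twitter, S_instagram])).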
• 52. Chapter III. Proposed Model

Figure III.5: Proposed Model

Fig. III.6 illustrates the workflow of our model. Each time a user is involved in an activity, the user's data passes through our model. If the user's behavior is detected as terrorist, we re-train the model with this new data to keep it updated with new, unseen behaviors. If the model loses accuracy after re-training, we revert to the last existing model. 40
• 53. Chapter III. Proposed Model

Figure III.6: Model Workflow

Conclusion

In this chapter, we presented our proposed approach, starting from the research questions that we are looking to answer. Then, we showed the different phases to follow in order to answer those questions. Finally, we explored the steps to follow toward the 41
• 54. Chapter III. Proposed Model

construction of our model. The next chapter will detail the achievements and the different results. 42
• 55. IV Implementation and Results

This chapter presents the practical part of our work. We go through the pipeline of our implementation, starting with data gathering, then the model creation, and we finish with the interpretation of the results and a response to the research questions.

1 Data Collection

In this section, we explain how to gather the data that we identified in the last chapter. As discussed, there are two types of data: offline and online data. In the next sections, we implement the data gathering solution for each of them.

1.1 Offline Data

To train the models, our strategy was to use an offline dataset built from public datasets related to our problem. In the last chapter, we selected a data source for each input type; we implement their gathering scripts in the next sections. 43
• 56. Chapter IV. Implementation and Results

1.1.1 Textual-Content Data

For the textual data, we have two sources:

• Positive labels: Tweets of banned Twitter accounts.
• Negative labels: News headlines from the GTD.

Our positive labels are the data that contain terrorist textual content. Our strategy was to gather tweets of the banned users that were reported to the @twittersafety account and that also contained terrorism-related hashtags when they were reported; this can be done through the Twitter API or the Twitter search tool. Fig. IV.1 illustrates an example of our searches, looking for tweets that were reported to or mentioned the twittersafety account and contained the hashtags #ISIS, #terrorist, #Daech, #IslamicState.

Figure IV.1: Twitter Searching Tool

While doing our research, we found out that an organization had already performed this process, extracted over 17k items of clean terrorist data from ISIS users, and published them in a

• 57. Chapter IV. Implementation and Results

Kaggle dataset called How ISIS Uses Twitter [36]. For our negative labels, we need content related to terrorism in an opposite way, such as news reporting on terrorism. For that, we use the news headlines from the Global Terrorism Database (GTD) [13]. Fig. IV.2 presents a sample of 4 rows from the GTD news headlines.

Figure IV.2: Sample of news headlines

Our final dataset merges the tweets, labeled as terrorist, and the GTD data, labeled as news. Fig. IV.3 shows the word cloud of the most frequent keywords in our dataset, including both positive and negative labels. 45
• 58. Chapter IV. Implementation and Results

Figure IV.3: Word Cloud of our Textual Data

The number of samples totals approximately 300k, of which about 122k are terrorist data and around 181k are news headlines. Table IV.1 presents the exact numbers in our dataset.

Label           | Number of samples
Positive labels | 122619
Negative labels | 181691
Total Data      | 304310

Table IV.1: Textual-Content Dataset

1.1.2 Image-Content Data

As discussed in our research, the source of the image data is Google Images, and we gather images from it manually. Luckily for us, a Python package called google_images_download [37] exists, which allows us to automate this task by choosing the keywords that we are looking for and the number of images needed. 46
• 59. Chapter IV. Implementation and Results

We ran a script that downloaded around five hundred images of terrorist persons and incitement acts, in addition to another five hundred images of military and terrorism news. Unfortunately, the images were not 100% related to what we were looking for; therefore, we had to manually verify the gathered images and remove the non-related ones. After cleaning the data and keeping only related images, we had around 200 terrorist images and 300 military and news images. Table IV.2 gives the exact numbers of images in our dataset. Fig. IV.4 and Fig. IV.5 show three random images of each category.

Label           | Number of samples
Positive labels | 219
Negative labels | 314
Total Data      | 533

Table IV.2: Image-Content Dataset

Figure IV.4: Sample of Terrorists images 47
• 60. Chapter IV. Implementation and Results

Figure IV.5: Sample of Military/News images

1.1.3 General Information Data

For general information data, we used the Profiles of Individual Radicalization In the United States (PIRUS) [25] public dataset, from which we extracted the ages, genders and relationship statuses of 135 extremist persons; these are our positive labels. As for the negative labels, we use the online data to build our dataset. Fig. IV.6, Fig. IV.7 and Fig. IV.8 show the distribution of each feature within our positive-label data.

Figure IV.6: Age Distribution 48
• 61. Chapter IV. Implementation and Results

Figure IV.7: Relationship Distribution
Figure IV.8: Gender Distribution

1.2 Online Data

In this section, we implement the scripts that gather the online data from our selected social media platforms: Facebook, Instagram and Twitter.

1.2.1 Facebook Data

Facebook provides the HTTP-based API called Graph API. A public SDK called facebook-sdk helps us write an automated Facebook data gathering script in Python. To use the Facebook Graph API, it is necessary to pass an access token that has the relevant permissions to access the social graph objects being queried. In the Facebook social graph, each object has fields related to its type; for example, 49
• 62. Chapter IV. Implementation and Results

the object User contains information about the user profile, such as the age, relationship and gender. The main objects that we are interested in are the User, the Post and the Comment. To access each graph object, you pass the id of an object of that type. Therefore, we cannot access posts and comments directly, since the post ids are contained in the list of posts of the User object, and likewise comments are part of the posts. Fig. IV.9 shows a representation of the Facebook Graph API.

Figure IV.9: Facebook Graph API

Our script starts by obtaining the information of the user along with the list of post ids. Then, it accesses all the posts by looping through the list of post ids from the posts field in the User object and retrieves the necessary information from them. After that, it extracts the comments by looping through the list of comment ids from the comments field in each Post object. Finally, it parses the textual and image data from those posts and comments. The following code is an example of how to get the user information along with the posts data:

    import json
    import requests
    import facebook  # the facebook-sdk package

    graph = facebook.GraphAPI(access_token=access_token, version=3.1)

• 63. Chapter IV. Implementation and Results

    user_information = graph.get_object(
        id='me',
        fields='id,name,age_range,gender,relationship_status')

    posts_ids = []
    posts_object = graph.get_object(id='me', fields='posts')
    posts_ids.extend(posts_object['posts']['data'])
    # follow the pagination links until no further page is returned
    next_page = posts_object['posts'].get('paging', {}).get('next')
    while next_page is not None:
        response = requests.get(next_page)
        new_data = json.loads(response.content)
        posts_ids.extend(new_data['data'])
        try:
            next_page = new_data['paging']['next']
        except KeyError:
            next_page = None

    for post in posts_ids:
        post_data = graph.get_object(
            id=post['id'],
            fields='created_time,full_picture,message,shares,likes.summary(1)')
        post_data['likes'] = post_data['likes']['summary']['total_count']
        try:
            post_data['shares'] = post_data['shares']['count']
        except KeyError:
            post_data['shares'] = 0

1.2.2 Instagram Data

For Instagram, the task is easier, as it provides a regular REST API with JSON output, where each endpoint is accessed directly through any HTTP request module. In Python, we use the requests module with the Instagram endpoint https://api.instagram.com/v1/, where we can access the user's information through /users/self/?access_token={} and the posts through /users/self/media/recent/?access_token={}. The following code shows how our script gathers information from Instagram: 51
```python
import json
from datetime import datetime

import requests

# User data
response = requests.get(
    'https://api.instagram.com/v1/users/self/?access_token={}'.format(access_token))
user = json.loads(response.content)['data']

# Posts data
response = requests.get(
    'https://api.instagram.com/v1/users/self/media/recent/?access_token={}'.format(access_token))
data = json.loads(response.content)
for post in data['data']:
    _id = post['id']
    creation_timestamp = post['created_time']
    created_time = datetime.fromtimestamp(
        int(creation_timestamp)).strftime('%Y-%m-%d %H:%M:%S')
    message = post['caption']['text'] if post['caption'] is not None else ''
    img_url = post['images']['standard_resolution']['url']
    post_data = dict(created_time=created_time, id=_id,
                     message=message, img_url=img_url)
```

1.2.3 Twitter Data

Like Instagram, Twitter provides a REST API; in addition, it hands over a Python SDK that makes the API easier to use. In order to use it, we have to pass 4 access keys: consumer key, consumer secret, access token key and access token secret. Each key carries permissions that grant access to either the user's private data or the public Twitter data. The following code is an example of how we loaded the tweets using the Twitter Python SDK.
```python
import twitter  # python-twitter package

api = twitter.Api(consumer_key=consumer_key,
                  consumer_secret=consumer_secret,
                  access_token_key=access_token_key,
                  access_token_secret=access_token_secret)

user_id = api.VerifyCredentials().AsDict()['id']
tweets = api.GetUserTimeline(user_id=user_id)
for tweet in tweets:
    tweet = tweet.AsDict()
    _id = tweet['id']
    created_time = tweet['created_at']
    message = tweet['text'] if tweet['text'] is not None else ''
    tweet_data = dict(created_time=created_time, id=_id, message=message)
```

2 Model Implementation

In the next sections, we implement the different components that lead toward constructing our proposed model. For each sub-model, we split the dataset of that content type into 80% training data and 20% testing data. All the models are implemented on the same machine, provided by Kaggle, a data science platform, with the following hardware:

• RAM: 16 GB
• CPU count: 2
• GPU: Tesla K80
• Disk: 5 GB
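The 80/20 split mentioned above can be sketched as a small helper; `split_dataset` is a hypothetical name introduced here for illustration, and in practice Scikit-Learn's `train_test_split` performs the same job:

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=42):
    """Shuffle the samples and split them into training and testing subsets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# 100 dummy samples -> 80 for training, 20 for testing
train_data, test_data = split_dataset(range(100))
```

Fixing the seed keeps the split reproducible across the different sub-models.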
2.1 Text Classification Model

The steps to construct our text classification model are first to have the NLP pipeline ready for data pre-processing, then to vectorize the data with TF-IDF and pass it to a classification model.

2.1.1 NLP Process

In practice, the NLP process consists of tokenization, removal of stop words, and lemmatization. In the following code we use the Natural Language Toolkit (NLTK) Python package for these steps. We start with regular expressions that remove text that disrupts the process, such as links and dates. Then, we split the text into tokens, remove the stop words (common low-information words like 'a', 'the', 'that', 'on') and lemmatize the remaining words by determining the root word based on its part-of-speech tag (adjective, verb, noun).

```python
import re

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
wordnet_lemmatizer = WordNetLemmatizer()

# is_stopword and get_wordnet_pos are helper functions defined
# elsewhere in our script.
def process_text(text):
    nltk_processed_data = []
    # Strip links and date-like patterns
    text = re.sub(r'https?://\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'(?:[0-9]+[:/-]){2}[0-9]{2,4}', '', text, flags=re.MULTILINE)
    for w in tokenizer.tokenize(text):
        word = w.lower()
        if not is_stopword(word=word):
            processed_text = wordnet_lemmatizer.lemmatize(
                word, get_wordnet_pos(word))
            nltk_processed_data.append(processed_text)
    return nltk_processed_data
```

2.1.2 Data Vectorization

To use our data in classification models, we have to vectorize it into semantic numerical data. In the last chapter, we defined TF-IDF as our vectorizer model. Scikit-Learn
offers the 'TfidfVectorizer' module in its package, usable in two lines. We defined the object parameters as follows:

• max_df: maximum document frequency for a word to be kept in the vocabulary ⇒ 0.95 (a word must appear in at most 95% of the documents)
• min_df: minimum document frequency for a word to be kept in the vocabulary ⇒ 0.1 (a word must appear in at least 10% of the documents)
• ngram_range: number of words to consider as a single token ⇒ (1,3) (from 1 word to 3 words)

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.95, min_df=0.1, ngram_range=(1, 3))
X = vectorizer.fit_transform(train_data)
```

The data used to train the 'TfidfVectorizer' consists of around 205K samples. After vectorizing the data, TF-IDF identified 330 feature vectors, which makes our data shape:

(n_training_samples, n_dimensions) ⇒ (243448, 330)

We use the trained vectorizer to transform the testing data, as follows:

```python
transformed_data = vectorizer.transform(test_data).toarray()
```

2.1.3 Data Classification

As mentioned in the last chapter, we will try three classification models, namely Logistic Regression, Support Vector Machine and Neural Network. The best performing model will later be used for our global model.

To implement the Logistic Regression and Support Vector Machine models, we used Scikit-Learn, a Python machine learning library that offers many well-known models. We trained these two models with their default parameter values.

For the Neural Network, we used Keras, a framework that works on top of TensorFlow. The architecture of our model is composed of three layers, with 16, 8 and 1 neurons respectively. The first two layers have a 'relu' activation, widely used for its performance, and the last layer has a 'sigmoid' activation, as it is our output layer and we
have a binary classification problem. The model is compiled with 'binary_crossentropy' as loss function and 'adam' as optimizer. For the training parameters, we used 20 epochs with a batch size of 128 and 20% of validation data extracted from the training data.

Table IV.3 shows the different metric scores for each model along with the training execution time. These models are trained and tested with the same data and on the same machine. The model that we will use in our global model is the Neural Network, as it has the best F1-score with a reasonable training time.

Model Name            Accuracy   F1-Score   Training time
Logistic Regression   0.9726     0.9674     39.9 secs
SVM                   0.9626     0.9548     6h 48min 33s
Neural Network        0.9774     0.9719     1min 11s

Table IV.3: Text Models Metric Scores

2.2 Image Classification Model

In the last chapter, we defined the convolutional neural network as our image classification model, along with two optimization techniques, namely Transfer Learning and Data Augmentation. Therefore, as a first step, we have to implement our data augmentation functions, then define which base model's learnt knowledge will be used in our model.

2.2.1 Data Augmentation

For data augmentation, we use the Python package 'imgaug', which provides all the different data augmentation techniques. The following code shows an example of the 'imgaug' augmenters, where a randomly chosen augmentation technique is applied to the image.

```python
from imgaug import augmenters as iaa

img_augmentor = iaa.Sequential([
    # Select one of the augmentation techniques randomly
    iaa.OneOf([
        iaa.Affine(rotate=0),
        iaa.Affine(rotate=90),
        iaa.Affine(rotate=180),
        iaa.Affine(rotate=270),
        iaa.Fliplr(0.5),
        iaa.Flipud(0.5),
    ])], random_order=True)

# Apply the augmentation technique on the image
image_aug = img_augmentor.augment_image(image)
```

Fig.IV.10 shows an example of two images generated through the data augmentation code above.

Figure IV.10: An example of data augmentation

After applying data augmentation on the training data, we generated an additional 30% of data, resulting in a total of approximately 550 images.

2.2.2 Transfer Learning

Many pre-trained models exist nowadays, but each is focused on a specific problem. In our case, we work mostly with faces and objects like guns, so the pre-trained model VGG16 [38] is the most suitable for our problem.

To adapt VGG16 to our problem, we remove its fully-connected layers, freeze the training of the remaining layers and add two new layers. The first will have 16 neurons and a 'relu' activation. The second, our output layer, will have 1 neuron and a 'sigmoid'
activation. The loss function will be 'binary_crossentropy' with 'adam' as optimizer. Since image classification can be a complex task and we have a small amount of data, we train the model for up to 5000 epochs with a batch size of 32, with an early stopping strategy of 250 rounds without improvement.

In Table IV.4, we present the scores of the different combinations of our two added CNN layers, with and without the pre-trained model (TL) and with and without the data generated by data augmentation (DA). While the scores were measured on the same testing data, the training data differs when data augmentation is used. Using both DA and TL together resulted in the best scores with an acceptable training time; therefore, this combination will be used in our global model.

Model           Accuracy   F1-Score   Training time
CNN             0.7631     0.7219     3min 50secs
CNN + DA        0.7781     0.7463     4min 12secs
CNN + TL        0.8291     0.8103     8min 48secs
CNN + DA + TL   0.8571     0.8454     9min 23secs

Table IV.4: Image Models Metric Scores

2.3 General Information Classification Model

For the general information, we follow the same strategy used in the text classification: we work with three classification models, namely Logistic Regression, Support Vector Machine and Neural Network, and the best performing model will later be used for our global model. For the Logistic Regression and the Support Vector Machine we used the default Scikit-Learn parameter values.

For the Neural Network, however, we used an architecture of four layers with 16, 8, 4 and 1 neurons respectively. A 'relu' activation is used for the first three layers and a 'sigmoid' activation for the last layer. The model is compiled with 'binary_crossentropy' as loss function and 'adam' as optimizer. For the training parameters, we use 200 epochs with a batch size of 32 and 20% of validation data extracted from the training data.
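To make the four-layer architecture above concrete, its forward pass can be sketched with NumPy. The input width of 10 features and the random weights are illustrative assumptions; in the real model, Keras learns the weights during training:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
sizes = [10, 16, 8, 4, 1]  # input width of 10 is an assumption
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Run one sample through relu, relu, relu, then a sigmoid output."""
    a = x
    for w, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ w + b)
    return sigmoid(a @ weights[-1] + biases[-1])

score = forward(rng.standard_normal(10))  # a single value in (0, 1)
```

The sigmoid output is what lets the score be read as a probability of the positive class.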
Table IV.5 illustrates the metric scores of the models trained with the same data on the same machine. For the global model, we will use the SVM, as it clearly outperforms the other models.

Model Name            Accuracy   F1-Score   Training time
Logistic Regression   0.7650     0.7873     5 secs
SVM                   0.8300     0.8495     7 secs
Neural Network        0.8173     0.8325     48.6 secs

Table IV.5: General Information Models Metric Scores

2.4 Proposed Model

In this part, we go through our proposed model's workflow to put things together and implement the missing components.

Our model's input is a multidimensional network; therefore, we have to implement a parser that maps the data onto the corresponding sub-model. We solve this by creating objects in which we store the data in a convenient way before passing it to the sub-models. Fig.IV.11 illustrates our class diagram, in which we store each user's data. The general user information is held by the User object, while the Post object, which can also represent a Comment, holds both image and textual data.
Figure IV.11: Class Diagram

The second component of our model is the set of sub-models that receive the input data. For that, we use the chosen pre-trained model of each input type, each outputting one score.

The next component is the decision making, where we have to interpret the output scores of the sub-models, calculate the terrorism score and decide on the user's extremeness. The calculation formula was already defined in the last chapter, but the values of the threshold γ and the model factors α were not yet decided. For the factors, since we have more features in the image and textual content than in the general information, we set them as follows:

• Text-Model factor: 0.4 (40%)
• Image-Model factor: 0.4 (40%)
• Information-Model factor: 0.2 (20%)

As for the threshold, since we do not have enough real online data to decide on it in a scientific way, we keep it neutral with a value of 0.5 (50%).

The model itself has to adapt to change over time; thus, a component that re-trains and, if needed, reverts a model must be implemented as well. For that, we have a database storing the last model's score and a Python function that checks whether the score improved after re-training the model on the new terrorist user's data.
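The decision component can then be sketched as follows. The helper names (`FACTORS`, `terrorism_score`) are ours, and the per-network averaging step is an assumption, though one consistent with the final scores reported in Table IV.6:

```python
FACTORS = {'text': 0.4, 'image': 0.4, 'info': 0.2}  # the alpha factors above
THRESHOLD = 0.5  # the neutral gamma threshold

def terrorism_score(scores):
    """Average each sub-model's per-network scores, then apply the factors."""
    return sum(factor * (sum(scores[model]) / len(scores[model]))
               for model, factor in FACTORS.items())

def is_extremist(scores):
    return terrorism_score(scores) >= THRESHOLD

# Per-network scores (Facebook, Instagram, Twitter) for two example users
u1 = {'text': [0.084, 0.084, 0.079],
      'image': [0.031, 0.068, 0.063],
      'info': [0.265, 0.318, 0.345]}
u3 = {'text': [0.859, 0.298, 0.854],
      'image': [0.658, 0.877, 0.816],
      'info': [0.530, 0.637, 0.690]}
```

With these values, `terrorism_score` yields approximately 0.116 for u1 and 0.705 for u3, so only u3 crosses the 0.5 threshold.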
With those components ready, our model's implementation is finished and the model is ready to be tested.

3 Results Interpretation

In this section, we test our model with a network to see whether we can answer the research questions posed at the beginning of our proposal. The network passed to the model is composed of two real non-terrorist users (U1, U2) and one generated terrorist user (U3), as no real terrorist user was available. The input was tested at only a single timestamp t, due to the lack of historical data.

As we can see in Table IV.6, which presents the scores predicted for those users by each sub-model on each social network (Facebook: FB, Instagram: IG, Twitter: T), the model performed well by correctly predicting the anomalousness of the users. Based on these results, we can see that a terrorist can be detected from his/her social media content; thus, our answer to Q1 is positive. We can also notice that the scores on the same data type from different social networks are mostly similar, except for the text content on Instagram, which consists only of image captions; this means that our answer to Q3 is positive.

User   Text-Model Score       Image-Model Score      Information-Model Score   Final Score
       FB     IG     T        FB     IG     T        FB     IG     T
U1     0.084  0.084  0.079    0.031  0.068  0.063    0.265  0.318  0.345       0.116
U2     0.059  0.054  0.078    0.013  0.054  0.115    0.530  0.445  0.276       0.133
U3     0.859  0.298  0.854    0.658  0.877  0.816    0.530  0.637  0.690       0.705

Table IV.6: Model Testing Results

After detecting the user U3 as a terrorist, the sub-models were re-trained with the new data extracted from U3 appended to the old data. The new score of each sub-model increased by an average of 0.01. Although this increase could be considered negligible, over time it will keep our model up-to-date with new terrorism content; thus, if a user starts adopting new terrorism behaviors that the model
was not trained on in the first place, the user will still be detected as a terrorist; therefore our answer to Q2 is positive.

Conclusion

In this chapter, we presented the implementation of our solution, starting from the data gathering, then the sub-model training and the construction of our proposed model, and we finished by testing our model and answering our research questions.
Chapter V. Conclusions and Perspectives

In this thesis, we proposed a terrorist detection model that takes multidimensional networks as input format and that supports different input data types such as texts and images. Our model can also detect whether a user is adopting a new behavior over time, and the model itself can automatically learn new terrorism behaviors.

We started by presenting the existing works carried out in the anomaly and terrorism detection domains. Then, we discussed the existing techniques for automated data processing and data classification. After that, we presented the model's design and the theoretical perspective of the workflow. Finally, we implemented the model and discussed the results.

The model showed good results on two real users and one generated user by predicting their anomalousness correctly. Despite the fact that the amount of online data used for testing is very small, this still serves as a proof-of-concept that our proposed model can be implemented and put into a production environment.

Although we tried to overcome the limitations of other existing models, our proposed model is still limited in that it does not support some functionalities, such as:

• Graph analysis: We can use graph analysis methodologies to detect communities, since our input data is a network.
• Support of videos: We can add another sub-model for video classification, since videos are among the most important content types on social media.

The model's accuracy can also be improved by using larger datasets, which would also allow a more principled calculation of the threshold and the sub-model factors.
Bibliography

[1] Shannon Greenwood, Andrew Perrin, and Maeve Duggan. Social media update 2016. Pew Research Center, 11(2), 2016.
[2] Alex P Schmid. The definition of terrorism. In The Routledge Handbook of Terrorism Research, pages 57–116. Routledge, 2011.
[3] Facebook community standards. URL https://www.facebook.com/communitystandards/dangerous_individuals_organizations.
[4] Arash Habibi Lashkari, Min Chen, and Ali A Ghorbani. A survey on user profiling model for anomaly detection in cyberspace. Journal of Cyber Security and Mobility, 8(1):75–112, 2019.
[5] Zahedeh Zamanian, Ali Feizollah, Nor Badrul Anuar, Laiha Binti Mat Kiah, Karanam Srikanth, and Sudhindra Kumar. User profiling in anomaly detection of authorization logs. In Computational Science and Technology, pages 59–65. Springer, 2019.
[6] Sreyasee Das Bhattacharjee, Junsong Yuan, Zhang Jiaqi, and Yap-Peng Tan. Context-aware graph-based analysis for detecting anomalous activities. In 2017 IEEE International Conference on Multimedia and Expo (ICME), pages 1021–1026. IEEE, 2017.
[7] Di Chen, Qinglin Zhang, Gangbao Chen, Chuang Fan, and Qinghong Gao. Forum user profiling by incorporating user behavior and social network connections. In International Conference on Cognitive Computing, pages 30–42. Springer, 2018.
[8] Hamidreza Alvari, Soumajyoti Sarkar, and Paulo Shakarian. Detection of violent extremists in social media. arXiv preprint arXiv:1902.01577, 2019.
[9] Pradip Chitrakar, Chengcui Zhang, Gary Warner, and Xinpeng Liao. Social media image retrieval using distilled convolutional neural network for suspicious e-crime and terrorist account detection. In 2016 IEEE International Symposium on Multimedia (ISM), pages 493–498. IEEE, 2016.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[11] George Kalpakis, Theodora Tsikrika, Stefanos Vrochidis, and Ioannis Kompatsiaris. Identifying terrorism-related key actors in multidimensional social networks. In International Conference on Multimedia Modeling, pages 93–105. Springer, 2019.
[12] Pankaj Choudhary and Upasna Singh. A survey on social network analysis for counter-terrorism. International Journal of Computer Applications, 112(9):24–29, 2015.
[13] Gary LaFree and Laura Dugan. Introducing the Global Terrorism Database. Terrorism and Political Violence, 19(2):181–204, 2007.
[14] Kalev Leetaru and Philip A Schrodt. GDELT: Global data on events, location, and tone, 1979–2012. In ISA Annual Convention, volume 2, pages 1–49. Citeseer, 2013.
[15] Linton C Freeman. Centrality in social networks: conceptual clarification. Social Networks, 1(3):215–239, 1978.
[16] EDUCBA contributors. Text mining vs natural language processing - top 5 comparisons, Aug 2019. URL https://www.educba.com/important-text-mining-vs-natural-language-processing/.
[17] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[18] Shivangi Singhal. Data representation in NLP, Jul 2019. URL https://medium.com/@shiivangii/data-representation-in-nlp-7bb6a771599a.
[19] Eric Kauderer-Abrams. Quantifying translation-invariance in convolutional neural networks. arXiv preprint arXiv:1801.01450, 2017.