DeepScan: Exploiting Deep Learning for Malicious Account Detection in Location-Based Social Networks

DeepScan: Exploiting Deep Learning for
Malicious Account Detection in
Location-Based Social Networks
Yang Chen
School of Computer Science
Fudan University
chenyang@fudan.edu.cn

Emergence of Location-Based Social
Networks (LBSNs)
• Primary functions
– Check-ins
– Reviews
2

LBSN Data for Urban Computing
• Rich spatiotemporal information of massive users
• Understand human behaviors in the context of urban
computing
• Prospective applications
– Life style modeling and analysis [Yuan et al., ACM
COSN’13]
– City dynamic studying [Silva et al., ACM TOIT’14]
– Venue recommendation [Zhao et al., ACM TIST’14]
– Retail business survival [D’Silva et al., ACM IMWUT’18]
3

Malicious Accounts in LBSNs
• Incentive for the attackers
– Winning awards (virtual coins, badges, …)
– Promoting some selected venues
Ensure LBSN users to access reliable information 4

The Dianping Service
Review Check-in
http://www.dianping.com/member/UID
Followings
Followers
Location
Nickname
5

Data Collection
• Crawling
– Time period: Mar. 11, 2017 – Apr. 16, 2017
– Tool: a Python-based crawler using the Selenium webdriver
library
• Amount: 107,616 randomly selected Dianping users
– Focus on those who have conducted at least 5 check-ins and
published at least 5 tips
• Data entry format
– Demographic information (user ID, gender, registration date,
birthday, number of followings, number of followers)
– A sequence of check-in records
– A sequence of posted reviews
– Favorited venues
6

Account Annotation
• 14 undergraduate students as volunteers
• 30 USD per 2000 annotations
• Each Dianping user is examined by at least 2 volunteers
– If the two volunteers have different opinions, we ask the third volunteer to
break the tie
3930 malicious users 9928 legitimate users
Ground-truth dataset
7

Drawbacks of Previous Solutions
• Supervised machine learning-based approaches
– Feature selection is critical
– Most of the proposed features rely on an aggregate
view of user activities
• Example: total number of check-ins, total number of visited
cities, average travel speed
• However, such a view cannot reflect the dynamics
of the fine-grained activities
– Example: the number of venues she has visited within
each interval along the timeline
8

Usefulness of Time Series Information
 The fine-grained activity sequences of a user can be extracted as time series
features to distinguish between legitimate and malicious users
Legitimate Malicious
3:10pm
3:11pm
3:12pm
7:10am
7:35am
8:40am
Each user has a total number of 3 check-ins
9

Recurrent Neural Networks (RNN)
 A chain-like architecture, i.e., an array of recurrently connected
cells (neurons)
 Elements of a time sequence will be fed into this array sequentially
 Each cell memorizes the information has been calculated so far
The hidden state that memorizes the information until
the t-th time interval
The input vector of
the t-th time interval
ℎ 𝑡 = 𝑓(𝑊 ∙ 𝑥 𝑡, ℎ 𝑡−1 + 𝑏)
10

Problems of the Standard RNNs
Example: I grew up in France… I speak fluent French.
 In standard RNNs, the inputs are processed in temporal
order, current hidden state of the cell tends to be largely
based on the latest inputs
 The current hidden state might still be correlated to long-
distance pieces of information in the inputs 11

Long Short-Term Memory (LSTM)
 LSTMs are able to remember their inputs over a long period of
time
 LSTMs contain their information in a memory, that is much like
the memory of a computer because the LSTM can read, write
and delete information from its memory.
12

A LSTM Cell
 Each cell in LSTM has three gates, i.e., the input, output and forget gates
 With the help of these gates, the memory cell can control the fraction of
information to get into or out of the cell, and generate a hidden state at
each time interval accordingly 13

Bidirectional LSTM
 Bi-LSTM network comprises two parallel LSTM layers, which are the
backward LSTM and the forward LSTM
 For every element in a given sequence, the bidirectional LSTM has the
information about all elements before and after it
 Make use of past and future information 14

DeepScan System
• Time series features
– Bi-LSTM network to handle the time series information
• Conventional features
– Spatiotemporal features, social connectivity features, UGC features,
demographic features
• The decision maker
– A supervised machine learning-based classifier
Time Series User Activity Modeling
…
Fully
Connected
…
𝒙 𝟐 𝒙 𝒕 𝒙 𝒏
…
Decision Maker
…
𝒙 𝟏
…
…
…
…
Forward LSTM……
Backward LSTM
Concat
Softmax
xt represents
activities within
the t-th time
interval Conventional
Features
15

Training and Test
• Training (and validation) dataset
– Pick a machine learning algorithm
– Apply grid search by sweeping a grid of parameters
• Find the parameters that could achieve the highest prediction
performance (“best” parameters)
• Test dataset
– Evaluate the trained model
Original dataset
Features Labels
Training Set
Test Set
ML Algorithm
Trained Model
Parameter tuning
Predicted
Labels
16

Implementation
• LSTM-based time series analysis
– TensorFlow
• Decision maker
– XGBoost
– Weka
17

Metrics
• Precision
– The fraction of predicted malicious accounts who
are really harmful
• Recall
– The fraction of malicious users who are detected
accurately
• F1-score
– The harmonic mean of precision and recall
𝐹1 =
2 ∙ 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 ∙ 𝑅𝑒𝑐𝑎𝑙𝑙
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑅𝑒𝑐𝑎𝑙𝑙
18

Evaluation Outline
• Performance comparison between Bi-LSTM
and Uni-LSTM
• Comparison of different supervised machine
learning algorithms for the decision maker
• Feature importance
• Contribution of each subset of features
• Comparison between DeepScan and an
existing SVM-based approach
19

Performance Comparison Between
Bi-LSTM and Uni-LSTM
 Bi-LSTM achieves a better prediction performance than Uni-LSTM
consistently
 We obtain the best results when setting the interval length as one day
20

Comparison of Different Supervised Machine
Learning Algorithms for the Decision Maker
Algorithm Precision Recall F1-score
XGBoost 0.950 0.978 0.964
RF 0.942 0.982 0.962
J48 0.941 0.982 0.961
SVMr 0.940 0.970 0.955
SVMp 0.940 0.955 0.948
BN 0.757 0.829 0.791
 XGBoost performs the best
 We adopt XGBoost as the default algorithm of DeepScan
21

Feature Importance
 The two most influential features are the two LSTM-based time series features
22

Contribution of Each Subset of Features
Feature sets Precision Recall F1-score
DeepScan 0.950 0.978 0.964
-Time series features 0.863 0.864 0.864
-Spatiotemporal features 0.941 0.981 0.961
-UGC features 0.943 0.980 0.963
-Social features 0.942 0.977 0.959
-Demographic features 0.943 0.977 0.960
Excluding the set of time series features reduces the
prediction performance the most
23

Contribution of Each Subset of
Features (cont.)
 Starting from a random guess classifier, we add one subset of
features at a time
 Adding the set of LSTM-based time series features could increase
the F1-score the most
Random Guess 0.390 0.500 0.438
+Time series features 0.941 0.982 0.961
+Spatiotemporal features 0.872 0.876 0.874
+UGC features 0.741 0.932 0.825
+Social features 0.723 0.930 0.832
+Demographic features 0.724 0.961 0.826
24

Comparison with an Existing SVM-Based Approach
DeepScan 0.950 0.978 0.964
Zhang et al. 0.912 0.924 0.918
Zhang et al. + Time series features 0.941 0.981 0.961
 DeepScan yields a better performance
 Adding the time series features can also improve Zhang et al.’s approach
25

Observations
• We consider the users’ activities from a dynamic view
• Time series features: more discriminative and informative
than conventional features
Solution: DeepScan
• A deep learning-based malicious account detection system
• Can be leveraged by third-party vendors conveniently
Evaluation
• Using the real data collected from Dianping
• Achieving an excellent prediction performance with an F1-
score of 0.964
Conclusion
26

Future Work for DeepScan
• Deep learning-based user classification
– Discovery of malicious accounts, opinion leaders,
outliers…
• New deep learning algorithms
– Clockwork RNN [Koutník et al., ICML’14], Phased LSTM
[Neil et al., NIPS’16], Dilated RNN [Chang et al., NIPS’17]
• Data-driven urban computing
– Human mobility modeling and analysis
– Social-aware venue recommendation
27

DeepScan: Exploiting Deep Learning for Malicious Account Detection in Location-Based Social Networks

Recommended

Recommended

More Related Content

Similar to DeepScan: Exploiting Deep Learning for Malicious Account Detection in Location-Based Social Networks

Similar to DeepScan: Exploiting Deep Learning for Malicious Account Detection in Location-Based Social Networks (20)

Recently uploaded

Recently uploaded (20)

DeepScan: Exploiting Deep Learning for Malicious Account Detection in Location-Based Social Networks