SlideShare a Scribd company logo
DeepScan: Exploiting Deep Learning for
Malicious Account Detection in
Location-Based Social Networks
Yang Chen
School of Computer Science
Fudan University
chenyang@fudan.edu.cn
Emergence of Location-Based Social
Networks (LBSNs)
• Primary functions
– Check-ins
– Reviews
2
LBSN Data for Urban Computing
• Rich spatiotemporal information of massive users
• Understand human behaviors in the context of urban
computing
• Prospective applications
– Life style modeling and analysis [Yuan et al., ACM
COSN’13]
– City dynamic studying [Silva et al., ACM TOIT’14]
– Venue recommendation [Zhao et al., ACM TIST’14]
– Retail business survival [D’Silva et al., ACM IMWUT’18]
3
Malicious Accounts in LBSNs
• Incentive for the attackers
– Winning awards (virtual coins, badges, …)
– Promoting some selected venues
Ensure LBSN users to access reliable information 4
The Dianping Service
Review Check-in
http://www.dianping.com/member/UID
Followings
Followers
Location
Nickname
5
Data Collection
• Crawling
– Time period: Mar. 11, 2017 – Apr. 16, 2017
– Tool: a Python-based crawler using the Selenium webdriver
library
• Amount: 107,616 randomly selected Dianping users
– Focus on those who have conducted at least 5 check-ins and
published at least 5 tips
• Data entry format
– Demographic information (user ID, gender, registration date,
birthday, number of followings, number of followers)
– A sequence of check-in records
– A sequence of posted reviews
– Favorited venues
6
Account Annotation
• 14 undergraduate students as volunteers
• 30 USD per 2000 annotations
• Each Dianping user is examined by at least 2 volunteers
– If the two volunteers have different opinions, we ask the third volunteer to
break the tie
3930 malicious users 9928 legitimate users
Ground-truth dataset
7
Drawbacks of Previous Solutions
• Supervised machine learning-based approaches
– Feature selection is critical
– Most of the proposed features rely on an aggregate
view of user activities
• Example: total number of check-ins, total number of visited
cities, average travel speed
• However, such a view cannot reflect the dynamics
of the fine-grained activities
– Example: the number of venues she has visited within
each interval along the timeline
8
Usefulness of Time Series Information
 The fine-grained activity sequences of a user can be extracted as time series
features to distinguish between legitimate and malicious users
Legitimate Malicious
3:10pm
3:11pm
3:12pm
7:10am
7:35am
8:40am
Each user has a total number of 3 check-ins
9
Recurrent Neural Networks (RNN)
 A chain-like architecture, i.e., an array of recurrently connected
cells (neurons)
 Elements of a time sequence will be fed into this array sequentially
 Each cell memorizes the information has been calculated so far
The hidden state that memorizes the information until
the t-th time interval
The input vector of
the t-th time interval
ℎ 𝑡 = 𝑓(𝑊 ∙ 𝑥 𝑡, ℎ 𝑡−1 + 𝑏)
10
Problems of the Standard RNNs
Example: I grew up in France… I speak fluent French.
 In standard RNNs, the inputs are processed in temporal
order, current hidden state of the cell tends to be largely
based on the latest inputs
 The current hidden state might still be correlated to long-
distance pieces of information in the inputs 11
Long Short-Term Memory (LSTM)
 LSTMs are able to remember their inputs over a long period of
time
 LSTMs contain their information in a memory, that is much like
the memory of a computer because the LSTM can read, write
and delete information from its memory.
12
A LSTM Cell
 Each cell in LSTM has three gates, i.e., the input, output and forget gates
 With the help of these gates, the memory cell can control the fraction of
information to get into or out of the cell, and generate a hidden state at
each time interval accordingly 13
Bidirectional LSTM
 Bi-LSTM network comprises two parallel LSTM layers, which are the
backward LSTM and the forward LSTM
 For every element in a given sequence, the bidirectional LSTM has the
information about all elements before and after it
 Make use of past and future information 14
DeepScan System
• Time series features
– Bi-LSTM network to handle the time series information
• Conventional features
– Spatiotemporal features, social connectivity features, UGC features,
demographic features
• The decision maker
– A supervised machine learning-based classifier
Time Series User Activity Modeling
…
Fully
Connected
…
𝒙 𝟐 𝒙 𝒕 𝒙 𝒏
…
Decision Maker
…
𝒙 𝟏
…
…
…
…
Forward LSTM……
Backward LSTM
Concat
Softmax
xt represents
activities within
the t-th time
interval Conventional
Features
15
Training and Test
• Training (and validation) dataset
– Pick a machine learning algorithm
– Apply grid search by sweeping a grid of parameters
• Find the parameters that could achieve the highest prediction
performance (“best” parameters)
• Test dataset
– Evaluate the trained model
Original dataset
Features Labels
Training Set
Test Set
ML Algorithm
Trained Model
Parameter tuning
Predicted
Labels
16
Implementation
• LSTM-based time series analysis
– TensorFlow
• Decision maker
– XGBoost
– Weka
17
Metrics
• Precision
– The fraction of predicted malicious accounts who
are really harmful
• Recall
– The fraction of malicious users who are detected
accurately
• F1-score
– The harmonic mean of precision and recall
𝐹1 =
2 ∙ 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 ∙ 𝑅𝑒𝑐𝑎𝑙𝑙
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑅𝑒𝑐𝑎𝑙𝑙
18
Evaluation Outline
• Performance comparison between Bi-LSTM
and Uni-LSTM
• Comparison of different supervised machine
learning algorithms for the decision maker
• Feature importance
• Contribution of each subset of features
• Comparison between DeepScan and an
existing SVM-based approach
19
Performance Comparison Between
Bi-LSTM and Uni-LSTM
 Bi-LSTM achieves a better prediction performance than Uni-LSTM
consistently
 We obtain the best results when setting the interval length as one day
20
Comparison of Different Supervised Machine
Learning Algorithms for the Decision Maker
Algorithm Precision Recall F1-score
XGBoost 0.950 0.978 0.964
RF 0.942 0.982 0.962
J48 0.941 0.982 0.961
SVMr 0.940 0.970 0.955
SVMp 0.940 0.955 0.948
BN 0.757 0.829 0.791
 XGBoost performs the best
 We adopt XGBoost as the default algorithm of DeepScan
21
Feature Importance
 The two most influential features are the two LSTM-based time series features
22
Contribution of Each Subset of Features
Feature sets Precision Recall F1-score
DeepScan 0.950 0.978 0.964
-Time series features 0.863 0.864 0.864
-Spatiotemporal features 0.941 0.981 0.961
-UGC features 0.943 0.980 0.963
-Social features 0.942 0.977 0.959
-Demographic features 0.943 0.977 0.960
Excluding the set of time series features reduces the
prediction performance the most
23
Contribution of Each Subset of
Features (cont.)
 Starting from a random guess classifier, we add one subset of
features at a time
 Adding the set of LSTM-based time series features could increase
the F1-score the most
Feature sets Precision Recall F1-score
Random Guess 0.390 0.500 0.438
+Time series features 0.941 0.982 0.961
+Spatiotemporal features 0.872 0.876 0.874
+UGC features 0.741 0.932 0.825
+Social features 0.723 0.930 0.832
+Demographic features 0.724 0.961 0.826
24
Comparison with an Existing SVM-Based Approach
Feature sets Precision Recall F1-score
DeepScan 0.950 0.978 0.964
Zhang et al. 0.912 0.924 0.918
Zhang et al. + Time series features 0.941 0.981 0.961
 DeepScan yields a better performance
 Adding the time series features can also improve Zhang et al.’s approach
25
Observations
• We consider the users’ activities from a dynamic view
• Time series features: more discriminative and informative
than conventional features
Solution: DeepScan
• A deep learning-based malicious account detection system
• Can be leveraged by third-party vendors conveniently
Evaluation
• Using the real data collected from Dianping
• Achieving an excellent prediction performance with an F1-
score of 0.964
Conclusion
26
Future Work for DeepScan
• Deep learning-based user classification
– Discovery of malicious accounts, opinion leaders,
outliers…
• New deep learning algorithms
– Clockwork RNN [Koutník et al., ICML’14], Phased LSTM
[Neil et al., NIPS’16], Dilated RNN [Chang et al., NIPS’17]
• Data-driven urban computing
– Human mobility modeling and analysis
– Social-aware venue recommendation
27
Thank you!

More Related Content

Similar to DeepScan: Exploiting Deep Learning for Malicious Account Detection in Location-Based Social Networks

sigir16
sigir16sigir16
Real Time Object Dectection using machine learning
Real Time Object Dectection using machine learningReal Time Object Dectection using machine learning
Real Time Object Dectection using machine learning
pratik pratyay
 
slide-171212080528.pptx
slide-171212080528.pptxslide-171212080528.pptx
slide-171212080528.pptx
SharanrajK22MMT1003
 
Stream Processing Overview
Stream Processing OverviewStream Processing Overview
Stream Processing Overview
Maycon Viana Bordin
 
Mantis qcon nyc_2015
Mantis qcon nyc_2015Mantis qcon nyc_2015
Mantis qcon nyc_2015
neerajrj
 
Unit i
Unit iUnit i
ruSMART 2013 presentation
ruSMART 2013 presentationruSMART 2013 presentation
ruSMART 2013 presentation
Oscar Rodríguez Rocha
 
BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6
Rod Soto
 
Influence of time and length size feature selections for human activity seque...
Influence of time and length size feature selections for human activity seque...Influence of time and length size feature selections for human activity seque...
Influence of time and length size feature selections for human activity seque...
ISA Interchange
 
The Case for a Signal Oriented Data Stream Management System
The Case for a Signal Oriented Data Stream Management SystemThe Case for a Signal Oriented Data Stream Management System
The Case for a Signal Oriented Data Stream Management System
Reza Rahimi
 
fmelleHumanActivityRecognitionWithMobileSensors
fmelleHumanActivityRecognitionWithMobileSensorsfmelleHumanActivityRecognitionWithMobileSensors
fmelleHumanActivityRecognitionWithMobileSensors
Fridtjof Melle
 
Final PPT.pptx
Final PPT.pptxFinal PPT.pptx
Final PPT.pptx
samarth70133
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
Abhishek M Shivalingaiah
 
IRJET- Criminal Recognization in CCTV Surveillance Video
IRJET-  	  Criminal Recognization in CCTV Surveillance VideoIRJET-  	  Criminal Recognization in CCTV Surveillance Video
IRJET- Criminal Recognization in CCTV Surveillance Video
IRJET Journal
 
Data Visualization and Communication by Big Data
Data Visualization and Communication by Big DataData Visualization and Communication by Big Data
Data Visualization and Communication by Big Data
IRJET Journal
 
Mobile Crowdsensing with Mobile Agents
Mobile Crowdsensing with Mobile AgentsMobile Crowdsensing with Mobile Agents
Mobile Crowdsensing with Mobile Agents
Teemu Leppänen
 
Currency Detection using TensorFlow
Currency Detection using TensorFlowCurrency Detection using TensorFlow
Currency Detection using TensorFlow
IRJET Journal
 
various studies of prediction E-consumer
various studies of prediction E-consumer various studies of prediction E-consumer
various studies of prediction E-consumer
omaraldabash
 
Rise of the machines -- Owasp israel -- June 2014 meetup
Rise of the machines -- Owasp israel -- June 2014 meetupRise of the machines -- Owasp israel -- June 2014 meetup
Rise of the machines -- Owasp israel -- June 2014 meetup
Shlomo Yona
 
Behaviometrics: Behavior Modeling from Heterogeneous Sensory Time-Series
Behaviometrics: Behavior Modeling from Heterogeneous Sensory Time-SeriesBehaviometrics: Behavior Modeling from Heterogeneous Sensory Time-Series
Behaviometrics: Behavior Modeling from Heterogeneous Sensory Time-Series
Jiang Zhu
 

Similar to DeepScan: Exploiting Deep Learning for Malicious Account Detection in Location-Based Social Networks (20)

sigir16
sigir16sigir16
sigir16
 
Real Time Object Dectection using machine learning
Real Time Object Dectection using machine learningReal Time Object Dectection using machine learning
Real Time Object Dectection using machine learning
 
slide-171212080528.pptx
slide-171212080528.pptxslide-171212080528.pptx
slide-171212080528.pptx
 
Stream Processing Overview
Stream Processing OverviewStream Processing Overview
Stream Processing Overview
 
Mantis qcon nyc_2015
Mantis qcon nyc_2015Mantis qcon nyc_2015
Mantis qcon nyc_2015
 
Unit i
Unit iUnit i
Unit i
 
ruSMART 2013 presentation
ruSMART 2013 presentationruSMART 2013 presentation
ruSMART 2013 presentation
 
BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6
 
Influence of time and length size feature selections for human activity seque...
Influence of time and length size feature selections for human activity seque...Influence of time and length size feature selections for human activity seque...
Influence of time and length size feature selections for human activity seque...
 
The Case for a Signal Oriented Data Stream Management System
The Case for a Signal Oriented Data Stream Management SystemThe Case for a Signal Oriented Data Stream Management System
The Case for a Signal Oriented Data Stream Management System
 
fmelleHumanActivityRecognitionWithMobileSensors
fmelleHumanActivityRecognitionWithMobileSensorsfmelleHumanActivityRecognitionWithMobileSensors
fmelleHumanActivityRecognitionWithMobileSensors
 
Final PPT.pptx
Final PPT.pptxFinal PPT.pptx
Final PPT.pptx
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
 
IRJET- Criminal Recognization in CCTV Surveillance Video
IRJET-  	  Criminal Recognization in CCTV Surveillance VideoIRJET-  	  Criminal Recognization in CCTV Surveillance Video
IRJET- Criminal Recognization in CCTV Surveillance Video
 
Data Visualization and Communication by Big Data
Data Visualization and Communication by Big DataData Visualization and Communication by Big Data
Data Visualization and Communication by Big Data
 
Mobile Crowdsensing with Mobile Agents
Mobile Crowdsensing with Mobile AgentsMobile Crowdsensing with Mobile Agents
Mobile Crowdsensing with Mobile Agents
 
Currency Detection using TensorFlow
Currency Detection using TensorFlowCurrency Detection using TensorFlow
Currency Detection using TensorFlow
 
various studies of prediction E-consumer
various studies of prediction E-consumer various studies of prediction E-consumer
various studies of prediction E-consumer
 
Rise of the machines -- Owasp israel -- June 2014 meetup
Rise of the machines -- Owasp israel -- June 2014 meetupRise of the machines -- Owasp israel -- June 2014 meetup
Rise of the machines -- Owasp israel -- June 2014 meetup
 
Behaviometrics: Behavior Modeling from Heterogeneous Sensory Time-Series
Behaviometrics: Behavior Modeling from Heterogeneous Sensory Time-SeriesBehaviometrics: Behavior Modeling from Heterogeneous Sensory Time-Series
Behaviometrics: Behavior Modeling from Heterogeneous Sensory Time-Series
 

Recently uploaded

Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
Vineet
 
SAP BW4HANA Implementagtion Content Document
SAP BW4HANA Implementagtion Content DocumentSAP BW4HANA Implementagtion Content Document
SAP BW4HANA Implementagtion Content Document
newdirectionconsulta
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Marlon Dumas
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
hyfjgavov
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
nyvan3
 
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdfreading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
perranet1
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
dataschool1
 
Sid Sigma educational and problem solving power point- Six Sigma.ppt
Sid Sigma educational and problem solving power point- Six Sigma.pptSid Sigma educational and problem solving power point- Six Sigma.ppt
Sid Sigma educational and problem solving power point- Six Sigma.ppt
ArshadAyub49
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
Vineet
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
Alireza Kamrani
 
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
yuvarajkumar334
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
hqfek
 
Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
Vineet
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
TeukuEriSyahputra
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
Vineet
 
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdfNamma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
22ad0301
 
Econ3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdfEcon3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdf
blueshagoo1
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
ywqeos
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
keesa2
 

Recently uploaded (20)

Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
 
SAP BW4HANA Implementagtion Content Document
SAP BW4HANA Implementagtion Content DocumentSAP BW4HANA Implementagtion Content Document
SAP BW4HANA Implementagtion Content Document
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
 
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdfreading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
 
Sid Sigma educational and problem solving power point- Six Sigma.ppt
Sid Sigma educational and problem solving power point- Six Sigma.pptSid Sigma educational and problem solving power point- Six Sigma.ppt
Sid Sigma educational and problem solving power point- Six Sigma.ppt
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
 
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
 
Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
 
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdfNamma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
 
Econ3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdfEcon3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdf
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
 

DeepScan: Exploiting Deep Learning for Malicious Account Detection in Location-Based Social Networks

  • 1. DeepScan: Exploiting Deep Learning for Malicious Account Detection in Location-Based Social Networks Yang Chen School of Computer Science Fudan University chenyang@fudan.edu.cn
  • 2. Emergence of Location-Based Social Networks (LBSNs) • Primary functions – Check-ins – Reviews 2
  • 3. LBSN Data for Urban Computing • Rich spatiotemporal information of massive users • Understand human behaviors in the context of urban computing • Prospective applications – Life style modeling and analysis [Yuan et al., ACM COSN’13] – City dynamic studying [Silva et al., ACM TOIT’14] – Venue recommendation [Zhao et al., ACM TIST’14] – Retail business survival [D’Silva et al., ACM IMWUT’18] 3
  • 4. Malicious Accounts in LBSNs • Incentive for the attackers – Winning awards (virtual coins, badges, …) – Promoting some selected venues Ensure LBSN users to access reliable information 4
  • 5. The Dianping Service Review Check-in http://www.dianping.com/member/UID Followings Followers Location Nickname 5
  • 6. Data Collection • Crawling – Time period: Mar. 11, 2017 – Apr. 16, 2017 – Tool: a Python-based crawler using the Selenium webdriver library • Amount: 107,616 randomly selected Dianping users – Focus on those who have conducted at least 5 check-ins and published at least 5 tips • Data entry format – Demographic information (user ID, gender, registration date, birthday, number of followings, number of followers) – A sequence of check-in records – A sequence of posted reviews – Favorited venues 6
  • 7. Account Annotation • 14 undergraduate students as volunteers • 30 USD per 2000 annotations • Each Dianping user is examined by at least 2 volunteers – If the two volunteers have different opinions, we ask the third volunteer to break the tie 3930 malicious users 9928 legitimate users Ground-truth dataset 7
  • 8. Drawbacks of Previous Solutions • Supervised machine learning-based approaches – Feature selection is critical – Most of the proposed features rely on an aggregate view of user activities • Example: total number of check-ins, total number of visited cities, average travel speed • However, such a view cannot reflect the dynamics of the fine-grained activities – Example: the number of venues she has visited within each interval along the timeline 8
  • 9. Usefulness of Time Series Information  The fine-grained activity sequences of a user can be extracted as time series features to distinguish between legitimate and malicious users Legitimate Malicious 3:10pm 3:11pm 3:12pm 7:10am 7:35am 8:40am Each user has a total number of 3 check-ins 9
  • 10. Recurrent Neural Networks (RNN)  A chain-like architecture, i.e., an array of recurrently connected cells (neurons)  Elements of a time sequence will be fed into this array sequentially  Each cell memorizes the information has been calculated so far The hidden state that memorizes the information until the t-th time interval The input vector of the t-th time interval ℎ 𝑡 = 𝑓(𝑊 ∙ 𝑥 𝑡, ℎ 𝑡−1 + 𝑏) 10
  • 11. Problems of the Standard RNNs Example: I grew up in France… I speak fluent French.  In standard RNNs, the inputs are processed in temporal order, current hidden state of the cell tends to be largely based on the latest inputs  The current hidden state might still be correlated to long- distance pieces of information in the inputs 11
  • 12. Long Short-Term Memory (LSTM)  LSTMs are able to remember their inputs over a long period of time  LSTMs contain their information in a memory, that is much like the memory of a computer because the LSTM can read, write and delete information from its memory. 12
  • 13. A LSTM Cell  Each cell in LSTM has three gates, i.e., the input, output and forget gates  With the help of these gates, the memory cell can control the fraction of information to get into or out of the cell, and generate a hidden state at each time interval accordingly 13
  • 14. Bidirectional LSTM  Bi-LSTM network comprises two parallel LSTM layers, which are the backward LSTM and the forward LSTM  For every element in a given sequence, the bidirectional LSTM has the information about all elements before and after it  Make use of past and future information 14
  • 15. DeepScan System • Time series features – Bi-LSTM network to handle the time series information • Conventional features – Spatiotemporal features, social connectivity features, UGC features, demographic features • The decision maker – A supervised machine learning-based classifier Time Series User Activity Modeling … Fully Connected … 𝒙 𝟐 𝒙 𝒕 𝒙 𝒏 … Decision Maker … 𝒙 𝟏 … … … … Forward LSTM…… Backward LSTM Concat Softmax xt represents activities within the t-th time interval Conventional Features 15
  • 16. Training and Test • Training (and validation) dataset – Pick a machine learning algorithm – Apply grid search by sweeping a grid of parameters • Find the parameters that could achieve the highest prediction performance (“best” parameters) • Test dataset – Evaluate the trained model Original dataset Features Labels Training Set Test Set ML Algorithm Trained Model Parameter tuning Predicted Labels 16
  • 17. Implementation • LSTM-based time series analysis – TensorFlow • Decision maker – XGBoost – Weka 17
  • 18. Metrics • Precision – The fraction of predicted malicious accounts who are really harmful • Recall – The fraction of malicious users who are detected accurately • F1-score – The harmonic mean of precision and recall 𝐹1 = 2 ∙ 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 ∙ 𝑅𝑒𝑐𝑎𝑙𝑙 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑅𝑒𝑐𝑎𝑙𝑙 18
  • 19. Evaluation Outline • Performance comparison between Bi-LSTM and Uni-LSTM • Comparison of different supervised machine learning algorithms for the decision maker • Feature importance • Contribution of each subset of features • Comparison between DeepScan and an existing SVM-based approach 19
  • 20. Performance Comparison Between Bi-LSTM and Uni-LSTM  Bi-LSTM achieves a better prediction performance than Uni-LSTM consistently  We obtain the best results when setting the interval length as one day 20
  • 21. Comparison of Different Supervised Machine Learning Algorithms for the Decision Maker Algorithm Precision Recall F1-score XGBoost 0.950 0.978 0.964 RF 0.942 0.982 0.962 J48 0.941 0.982 0.961 SVMr 0.940 0.970 0.955 SVMp 0.940 0.955 0.948 BN 0.757 0.829 0.791  XGBoost performs the best  We adopt XGBoost as the default algorithm of DeepScan 21
  • 22. Feature Importance  The two most influential features are the two LSTM-based time series features 22
  • 23. Contribution of Each Subset of Features Feature sets Precision Recall F1-score DeepScan 0.950 0.978 0.964 -Time series features 0.863 0.864 0.864 -Spatiotemporal features 0.941 0.981 0.961 -UGC features 0.943 0.980 0.963 -Social features 0.942 0.977 0.959 -Demographic features 0.943 0.977 0.960 Excluding the set of time series features reduces the prediction performance the most 23
  • 24. Contribution of Each Subset of Features (cont.)  Starting from a random guess classifier, we add one subset of features at a time  Adding the set of LSTM-based time series features could increase the F1-score the most Feature sets Precision Recall F1-score Random Guess 0.390 0.500 0.438 +Time series features 0.941 0.982 0.961 +Spatiotemporal features 0.872 0.876 0.874 +UGC features 0.741 0.932 0.825 +Social features 0.723 0.930 0.832 +Demographic features 0.724 0.961 0.826 24
  • 25. Comparison with an Existing SVM-Based Approach Feature sets Precision Recall F1-score DeepScan 0.950 0.978 0.964 Zhang et al. 0.912 0.924 0.918 Zhang et al. + Time series features 0.941 0.981 0.961  DeepScan yields a better performance  Adding the time series features can also improve Zhang et al.’s approach 25
  • 26. Observations • We consider the users’ activities from a dynamic view • Time series features: more discriminative and informative than conventional features Solution: DeepScan • A deep learning-based malicious account detection system • Can be leveraged by third-party vendors conveniently Evaluation • Using the real data collected from Dianping • Achieving an excellent prediction performance with an F1- score of 0.964 Conclusion 26
  • 27. Future Work for DeepScan • Deep learning-based user classification – Discovery of malicious accounts, opinion leaders, outliers… • New deep learning algorithms – Clockwork RNN [Koutník et al., ICML’14], Phased LSTM [Neil et al., NIPS’16], Dilated RNN [Chang et al., NIPS’17] • Data-driven urban computing – Human mobility modeling and analysis – Social-aware venue recommendation 27