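Network security is a distinctive research field, in part because the "adversary" is not text, images, or any other form of inert data, but living people: black hat hackers who spend their days hunting for system and network vulnerabilities and devising ever more sophisticated attacks for whatever gain they can obtain. In network security research we therefore cannot "presuppose" how attackers will behave. We have to look for traces in real data, and use large volumes of data to detect and resolve behaviors, whether they have already occurred or are about to occur, that may endanger the security and privacy of users' data. In this talk, I will introduce data-driven network security research and use several real research cases to show what kinds of security problems the statistical analysis of real data can help us solve.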
6. 陳昇瑋 / A Data Scientist's Unpublished Security Research Case Files
Area 3: Social Computing
Gender Swapping and User Behaviors in Online Social Games
[Figure labels: M.F = gender-swapped; M.M = straight]
[1] Jing-Kai Lou, Kunwoo Park, Meeyoung Cha, Juyong Park, Chin-Laung Lei, and Kuan-Ta Chen,
Gender Swapping and User Behaviors in Online Social Games, ACM WWW 2013.
The research problem
For an unknown phone number
No Google results (or no useful information)
No user tags / reports
Not a Whoscall user
Can we determine whether it is a malicious number?
Telemarketing call?
Scam call?
Pornographic call?
Wrong number?
Rationale
We believe it’s possible to identify a malicious number
because of …
Whoscall user base (= potential sensors)
4 million installations
1 million active users (daily)
10 million phone calls (daily)
So, when a phone number reaches a Whoscall user, we
could possibly determine whether the number is
malicious or not based on its previous call behavior.
Our Steps
Recruit a group of volunteer Whoscall users as our sensors
Collect phone call logs from these sensors for a month
Compare these phone call logs with user reports (block records)
Use machine learning techniques to build a predictor for unknown phone numbers
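
The first three steps can be sketched as a small data-preparation routine. This is only an illustration under stated assumptions: the record layout, the feature names, and the sample values are all hypothetical and are not taken from the Whoscall study. It groups call logs by caller, derives a few per-number features, and labels each number from the block records.

```python
from collections import defaultdict
from statistics import median

# Hypothetical call-log records collected from sensor phones:
# (caller_hash, callee_hash, duration_seconds, answered)
call_logs = [
    ("n1", "u1", 5, True),
    ("n1", "u2", 0, False),
    ("n1", "u3", 4, True),
    ("n2", "u1", 120, True),
    ("n2", "u1", 300, True),
]

# Hypothetical user reports (block records): callers that some user blocked.
blocked = {"n1"}


def build_examples(logs, blocked_numbers):
    """Aggregate per-caller features and attach a label from block records."""
    by_caller = defaultdict(list)
    for caller, callee, duration, answered in logs:
        by_caller[caller].append((callee, duration, answered))

    examples = {}
    for caller, calls in by_caller.items():
        features = {
            "n_calls": len(calls),
            "n_distinct_callees": len({callee for callee, _, _ in calls}),
            "median_duration": median(d for _, d, _ in calls),
            "answer_rate": sum(a for _, _, a in calls) / len(calls),
        }
        # Label is True when at least one user blocked this caller.
        examples[caller] = (features, caller in blocked_numbers)
    return examples


examples = build_examples(call_logs, blocked)
```

The resulting (features, label) pairs are the kind of input a supervised classifier could be trained on in the last step; which features and which learning algorithm the study actually used is not stated here.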
Privacy Concerns
User privacy is the highest priority
Phone numbers are stored as MD5 hash codes (a one-way hash, so the raw numbers are not stored)
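
A minimal sketch of storing a phone number as an MD5 digest, using Python's standard `hashlib`. The `pseudonymize` helper and its normalization rule (dropping spaces and dashes) are assumptions made for illustration, not the project's actual procedure.

```python
import hashlib


def pseudonymize(phone_number: str) -> str:
    """Return the MD5 hex digest of a phone number with separators removed."""
    # Assumed normalization: keep only digits and a leading "+" sign.
    digits = "".join(ch for ch in phone_number if ch.isdigit() or ch == "+")
    return hashlib.md5(digits.encode("utf-8")).hexdigest()


# The same number written two ways maps to the same digest.
a = pseudonymize("+886 2 1234-5678")
b = pseudonymize("+886212345678")
```

One caveat worth keeping in mind: the space of valid phone numbers is small, so an unsalted digest can in principle be matched by hashing every candidate number. A keyed hash (for example `hmac` with a secret key) is one common way to make that harder.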
Our Goal
Predict whether a number is malicious as EARLY as
possible
In order to prevent further victims…
Our goal: accurate and FAST detection
Dynamic observation period
When do we need a malicious-number prediction?
Answer: at the moment a phone call reaches a Whoscall user
[Figure: time axis showing four phone calls followed by an unknown call (?), with the observation window marked]
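
One way to read the observation window is as the set of earlier calls from the same number that fall inside a look-back period ending when the new call arrives. The sketch below assumes a fixed window length in seconds and sorted timestamps; both are illustrative choices, and `calls_in_window` is a hypothetical helper.

```python
from bisect import bisect_left


def calls_in_window(call_times, query_time, window_seconds):
    """Return past call timestamps in [query_time - window_seconds, query_time).

    call_times must be sorted in ascending order.
    """
    start = bisect_left(call_times, query_time - window_seconds)
    end = bisect_left(call_times, query_time)
    return call_times[start:end]


# Hypothetical timestamps (seconds) of earlier calls from one number.
history = [100, 400, 900, 1500, 2000]

# A call reaches a user at t=1600; look back over a 1000-second window.
observed = calls_in_window(history, query_time=1600, window_seconds=1000)
```

Because the window is anchored to the arrival time of each incoming call, the set of observed calls changes from one prediction to the next, and features would be recomputed from `observed` each time.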
Work in Progress
Feature selection (to avoid overfitting)
Anti-countermeasures
Semi-supervised training
Online learning
Personalized penalty setting
Crowdsourced prediction model refinement
And much more…
Why should I use R?
“I already know Python. Why would I ever want to learn a
new language that already does a subset of what Python
already does?”
1. R is designed to do data-processing and statistical analysis,
and it makes it easy.
2. More support in R (e.g., the Use R! series)
3. More active development in R for data analysis
4. Who are you sharing your code with?
Introduction to R (Alex Storer, IQSS)
Demo of R Basics
examine data.frame
table, hist, ecdf, color
plot, cor
barplot, boxplot on dur.call.med
Final Words of Warning
“Using R is a bit akin to smoking. The
beginning is difficult, one may get
headaches and even gag the first few
times. But in the long run, it becomes
pleasurable and even addictive. Yet,
deep down, for those willing to be
honest, there is something not fully
healthy in it.” --Francois Pinard
Sensitive Info on SNS: A LOT!
Personal info
Photos, Diary, Schedule
Groups, Pages, Likes
Connections with friends
Friends’ information
Friends’ photos, demographics, and so on
Interactions with friends
Conversations
Messages
Stealthy Use: Tips 1, 2, 3!!
People let browsers manage their passwords
Entering a password on mobile devices is cumbersome
People leave SNS logged in when they are temporarily away
The Loophole Of User Identity Process
[Figure: timeline of the whole duration of SNS use, from log-in to log-out, with log-in authentication marked]
The account is protected by the log-in authentication process.
We need continuous authentication to ensure security for the whole duration of SNS use.
Data Collection: HTTP Spying
• Intercept all HTTP communications (including AJAX requests and responses) between the subject's PC and Facebook servers
18 Different Actions On Facebook
We define 18 common actions on Facebook and categorize them into two groups: interactive actions and page-switching actions.
Interactive actions are those in which the user interacts with a specific target person.
Page-switching actions are those that lead the browser to another Facebook page.
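
The two-group categorization can be sketched as a lookup from action name to group, followed by a per-session summary. The action names below are hypothetical stand-ins, since the 18 actions are not listed here, and the choice of "fraction of actions per group" as a feature is an assumption made for illustration.

```python
from collections import Counter

# Hypothetical action names; the 18 actions themselves are not listed.
INTERACTIVE = {"like", "comment", "send_message"}
PAGE_SWITCHING = {"view_profile", "view_photos", "view_friend_list"}
GROUPS = ("interactive", "page_switching", "other")


def group_of(action):
    """Map an action name to its group, or 'other' if it is unknown."""
    if action in INTERACTIVE:
        return "interactive"
    if action in PAGE_SWITCHING:
        return "page_switching"
    return "other"


def session_features(actions):
    """Return the fraction of a session's actions that fall in each group."""
    if not actions:
        return {group: 0.0 for group in GROUPS}
    counts = Counter(group_of(action) for action in actions)
    return {group: counts[group] / len(actions) for group in GROUPS}


session = ["view_profile", "view_photos", "like", "view_friend_list"]
features = session_features(session)
```

For this sample session, three of the four actions are page switches and one is interactive, so the feature vector leans heavily toward page switching.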
The Evidence Of General Diversity
Stalkers pay more attention to reading or searching for interesting or older information hidden in expandable pages.
The Evidence Of General Diversity (Cont'd)
Stalkers tend to avoid traceable actions such as adding a comment or pressing the Like button.
What Stalkers Do Not Care About
Stalkers tend to ignore most of the newsfeed, and show less interest in expanding comments, group/fan pages, or the list of who likes a post.
What Acquainted Stalkers Care About
Acquainted stalkers are usually interested in the account's friend list, message pages, and profile cards.
What Stranger Stalkers Care
Stranger stalkers are interested in account owners' profiles and photos. They are also more willing to check non-friends' pages and external links.
Detection Performance
We randomly permute the data points 20 times, run 10-fold cross validation on each permutation, and record the mean and standard deviation of the accuracies.
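
The evaluation protocol (20 random permutations, 10-fold cross validation on each, then mean and standard deviation of accuracy) can be sketched as follows. The nearest-centroid classifier is only a stand-in, since the model used in the study is not named here, and the toy data is synthetic.

```python
import random
from statistics import mean, stdev


def centroid_classifier(train):
    """Fit a nearest-centroid classifier; a stand-in for the real model."""
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict(x):
        # Pick the class whose centroid has the smallest squared distance.
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    return predict


def kfold_accuracy(data, k=10):
    """Accuracy of one k-fold cross validation over data in its given order."""
    correct = 0
    for fold in range(k):
        test = data[fold::k]
        train = [d for i, d in enumerate(data) if i % k != fold]
        predict = centroid_classifier(train)
        correct += sum(predict(x) == y for x, y in test)
    return correct / len(data)


def repeated_cv(data, repeats=20, k=10, seed=0):
    """Permute the data `repeats` times; return mean and std of accuracies."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        accuracies.append(kfold_accuracy(shuffled, k))
    return mean(accuracies), stdev(accuracies)


# Synthetic, well-separated toy data (not from the study).
toy = [((i * 0.01, 0.0), 0) for i in range(50)] + [
    ((5.0 + i * 0.01, 1.0), 1) for i in range(50)
]
mean_acc, std_acc = repeated_cv(toy)
```

Repeating the permutation matters because a single 10-fold split can be lucky or unlucky; the standard deviation across the 20 runs shows how sensitive the accuracy is to the split.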
Important Features for Early Detection
We count the features with the three most positive and three most negative weights w within 7 minutes, which gives us hints for modifying the early detection model.
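
Selecting the most strongly weighted features from a linear model can be sketched as a sort over the weight vector. The feature names and weight values below are hypothetical, and which class a positive weight points toward is not stated here, so the sketch reports only the sign.

```python
def top_weighted_features(weights, n=3):
    """Return the n most positive and n most negative features by weight."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1])
    most_negative = [name for name, w in ranked[:n] if w < 0]
    most_positive = [name for name, w in reversed(ranked[-n:]) if w > 0]
    return most_positive, most_negative


# Hypothetical weights from a linear model trained on early-session data.
weights = {
    "view_photos": 1.4,
    "view_friend_list": 0.9,
    "expand_page": 0.7,
    "view_profile": 0.2,
    "read_newsfeed": -0.3,
    "expand_comments": -0.6,
    "like": -1.1,
    "comment": -1.3,
}
positive, negative = top_weighted_features(weights)
```

Running this once per trained model and tallying how often each feature appears in either list (for example with `collections.Counter`) would give the kind of count described above.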