Classifying Phishing URLs Using Recurrent Neural Networks

Classifying Phishing URLs Using
Recurrent Neural Networks
Sergio Villegas
Javier Vargas
*Alejandro Correa Bahnsen
Easy Solutions Research
Eduardo Contreras Bohorquez
Fabio A. Gonzalez
MindLab Research Group,
Universidad Nacional de
Colombia

Industry recognition
A leading global provider of electronic
fraud prevention for financial institutions
and enterprise customers
385 customers
In 30 countries
100 million
Users protected
27+ billion
Online connections monitored
About Easy Solutions®
Easy Solutions to be Acquired by New Joint Venture Creating Global, Secure Infrastructure Company

Phishing
3
Phishing is the act of defrauding an online
user in order to obtain personal information
by posing as a trustworthy institution or
entity.

Why Phishing Detection is Hard
5
Original Website Only Using Images Subtle Changes

Is It Phishing?
Ideal Phishing Detection System
7
Machine
Learning
Algorithm

Ideal Phishing Detection System - Issues
8
Issues with full content
analysis:
• Time consuming
• Impractical to process
millions of websites per day
• Hard to implement for
small devices

There is always the need for an URL
9

Database of URLs
1,000,000 Phishing URLs from PhishTank
10
http://moviesjingle.com/auto/163.com/index.php
1,000,000 Legitimate URLs from Common Crawl
http://paypal.com.update.account.toughbook.cl/8a30e847925afc597516
1aeabe8930f1/?cmd=_home&dispatch=d09b78f5812945a73610edf38
http://msystemtech.ru/components/com_users/Italy/zz/Login.php?run=
_login-submit&session=68bbd43c854147324d77872062349924
https://www.sanfordhealth.org/ChildrensHealth/Article/73980
http://www.grahamleader.com/ci_25029538/these-are-5-worst-super-
bowl-halftime-shows&defid=1634182
http://www.carolinaguesthouse.co.uk/onlinebooking/?industrytype=1&
startdate=2013-09-05&nights=2&location&productid=25d47a24-6b74

CLASSIFYING PHISHING USING
URL LEXICAL AND
STATISTICAL FREQUENCIES
11

URL Lexical and Statistical Frequencies
12
http://www.papaya.com/secure_login.php
URL length Alexa
Ranking
Path length
URL Entropy
# of .com
Punctuation
count
TLD count
Is IP?
Euclidean
distance
KS & KL
distance

13
http://www.papaya.com/secure_login.php
URL length Alexa
Ranking
Path length
URL Entropy
# of .com
Punctuation
count
TLD count
Is IP?
Euclidean
distance
KS & KL
distance
Is It Phishing?

14
3-Fold CV Accuracy Recall Precision
Average 93.47% 93.28% 93.64%
Deviation 0.01% 0.02% 0.03%
Results:

15
Feature
Importance

MODELING PHISHING URLS
WITH RECURRENT
NEURAL NETWORKS
16

Normal Neural Network
17
Source: https://en.wikipedia.org/wiki/Artificial_neural_network

Recurrent Neural Networks RNN
Have loops!
19

The Problem of Long-Term Dependencies
20
Short term dependencies are easy
long term …

Long-Short Term Memory Networks LSTM
21
RNN contains
a single layer
LSTM contains
four interacting
layers
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Long-Short Term Memory Networks LSTM
22
Key idea: Cell State

LSTM Step-by-Step
23
Step 1. Decide what information is going to be used

LSTM Step-by-Step
24
Step 2. Which new information is stored

LSTM Step-by-Step
25
Step 3. Update old cell state

LSTM Step-by-Step
26
Step 4. Make prediction

Modeling Architecture for URL Classification
27
URL
h
t
t
p
:
/
/
w
w
w
.
p
a
p
a
y
a
.
c
o
m
One hot
Encoding
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
Embedding
3.2 1.2 … 1.7
6.4 2.3 … 2.6
6.4 3.0 … 1.7
3.4 2.6 … 3.4
2.6 3.8 … 2.6
3.5 3.2 … 6.4
1.7 4.2 … 6.4
8.6 2.4 … 6.4
4.3 2.9 … 6.4
2.2 3.4 … 3.4
3.2 2.6 … 2.6
4.2 2.2 … 3.5
2.4 3.2 … 1.7
2.9 1.7 … 8.6
3.0 6.4 … 2.6
2.6 6.4 … 3.8
3.8 3.4 … 3.2
3.3 2.6 … 2.2
3.1 2.2 … 2.9
1.8 3.2 … 3.0
2.5 6.4 … 2.6
LSTM
LSTM
LSTM
LSTM
Sigmoid
…

Long-Short Term Memory Networks
28
3-Fold CV Accuracy Recall Precision
Average 98.76% 98.93% 98.60%
Deviation 0.04% 0.02% 0.02%
Results:

Models Comparison
29
90%
91%
92%
93%
94%
95%
96%
97%
98%
99%
100%
Accuracy Recall Precision
Long-Short Term Memory Network Random Forest

Models Comparison
30
Model
Random Forest
Long-Short Term
Memory Network
Memory
Consumption (MB)
289
0.56
Evaluation Time
(URLs per sec)
942
281
Training Time
(minutes)
2.95
238.7

What we learned
• Discerning URLs by their patterns is a good predictor of
phishing websites
• LSTM model shows an overall higher prediction
performance without the need of expert knowledge to
create the features
31

Thank you!
Any questions or comments, please let me know.
Alejandro Correa Bahnsen, PhD
Chief Data Scientist
acorrea@easysol.net

Classifying Phishing URLs Using Recurrent Neural Networks

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (13)

Similar to Classifying Phishing URLs Using Recurrent Neural Networks

Similar to Classifying Phishing URLs Using Recurrent Neural Networks (20)

Recently uploaded

Recently uploaded (20)

Classifying Phishing URLs Using Recurrent Neural Networks