
Classifying Phishing URLs Using Recurrent Neural Networks

As the technical skills and costs required to deploy phishing attacks decrease, we are witnessing an unprecedented level of scams, which increases the need for better methods to proactively detect phishing threats. In this work, we explored the use of URLs as input for machine learning models applied to phishing site prediction. We compared a feature-engineering approach followed by a random forest classifier against a novel method based on recurrent neural networks. We determined that the recurrent neural network approach provides an accuracy rate of 98.7% without the need for manual feature creation, outperforming the random forest method by 5%. This makes it a scalable and fast-acting proactive detection system that does not require full content analysis.

Published in: Data & Analytics

  1. 1. Classifying Phishing URLs Using Recurrent Neural Networks. Authors: Sergio Villegas, Javier Vargas, *Alejandro Correa Bahnsen, Eduardo Contreras Bohorquez, Fabio A. Gonzalez. Affiliations: Easy Solutions Research; MindLab Research Group, Universidad Nacional de Colombia
  2. 2. About Easy Solutions®: A leading global provider of electronic fraud prevention for financial institutions and enterprise customers. 385 customers in 30 countries; 100 million users protected; 27+ billion online connections monitored. Industry recognition. Easy Solutions to be Acquired by New Joint Venture Creating Global, Secure Infrastructure Company
  3. 3. Phishing: Phishing is the act of defrauding an online user in order to obtain personal information by posing as a trustworthy institution or entity.
  4. 4. Typical Phishing Example
  5. 5. Why Phishing Detection is Hard: Original Website; Only Using Images; Subtle Changes
  6. 6. Ideal Phishing Detection System: Machine Learning Algorithm; Is It Phishing?
  7. 7. Ideal Phishing Detection System - Issues. Issues with full content analysis: • Time consuming • Impractical to process millions of websites per day • Hard to implement for small devices
  8. 8. There is always the need for a URL
  9. 9. Database of URLs
     1,000,000 Phishing URLs from PhishTank:
     http://moviesjingle.com/auto/163.com/index.php
     http://paypal.com.update.account.toughbook.cl/8a30e847925afc5975161aeabe8930f1/?cmd=_home&dispatch=d09b78f5812945a73610edf38
     http://msystemtech.ru/components/com_users/Italy/zz/Login.php?run=_login-submit&session=68bbd43c854147324d77872062349924
     1,000,000 Legitimate URLs from Common Crawl:
     https://www.sanfordhealth.org/ChildrensHealth/Article/73980
     http://www.grahamleader.com/ci_25029538/these-are-5-worst-super-bowl-halftime-shows&defid=1634182
     http://www.carolinaguesthouse.co.uk/onlinebooking/?industrytype=1&startdate=2013-09-05&nights=2&location&productid=25d47a24-6b74
  10. 10. CLASSIFYING PHISHING USING URL LEXICAL AND STATISTICAL FREQUENCIES
  11. 11. URL Lexical and Statistical Frequencies. Example URL: http://www.papaya.com/secure_login.php Features: URL length, Alexa Ranking, Path length, URL Entropy, # of .com, Punctuation count, TLD count, Is IP?, Euclidean distance, KS & KL distance
  12. 12. URL Lexical and Statistical Frequencies. Example URL: http://www.papaya.com/secure_login.php Features: URL length, Alexa Ranking, Path length, URL Entropy, # of .com, Punctuation count, TLD count, Is IP?, Euclidean distance, KS & KL distance. Output: Is It Phishing?
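A few of the lexical features named on this slide can be computed from the URL string alone. The following is a minimal sketch, not the authors' implementation: the function names, the exact feature definitions, and the IPv4 pattern are assumptions made for illustration. Alexa ranking, TLD count, and the Euclidean, KS, and KL distances are left out because they need external data or a reference distribution that the slides do not describe.

```python
import math
import re
import string
from collections import Counter
from urllib.parse import urlparse

# Assumed, simplified pattern for a dotted-quad IPv4 host.
IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def shannon_entropy(text):
    """Shannon entropy (bits per character) of the character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def lexical_features(url):
    """Return simple lexical features computed from the URL string alone."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "path_length": len(parsed.path),
        "url_entropy": shannon_entropy(url),
        "dot_com_count": url.count(".com"),
        "punctuation_count": sum(ch in string.punctuation for ch in url),
        "is_ip": bool(IPV4_RE.match(host)),
    }


features = lexical_features("http://www.papaya.com/secure_login.php")
```

Each feature dictionary would become one row of the table fed to the classifier.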
  13. 13. URL Lexical and Statistical Frequencies. Results (3-Fold CV): Accuracy 93.47% average, 0.01% deviation; Recall 93.28% average, 0.02% deviation; Precision 93.64% average, 0.03% deviation
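For reference, the three reported metrics and their per-fold summary can be computed as in the sketch below. It assumes binary labels with 1 meaning phishing, and it uses the population standard deviation for the "Deviation" row; the slides do not say which deviation was used. The function names and the toy labels are illustrative only.

```python
from statistics import mean, pstdev


def binary_metrics(y_true, y_pred):
    """Accuracy, recall, and precision for binary labels (1 = phishing)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


def summarize_folds(fold_metrics):
    """Average and deviation of each metric across cross-validation folds."""
    summary = {}
    for name in fold_metrics[0]:
        values = [m[name] for m in fold_metrics]
        summary[name] = (mean(values), pstdev(values))
    return summary


# Toy labels for one fold, repeated three times to mimic 3-fold CV output.
fold = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
summary = summarize_folds([fold, fold, fold])
```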
  14. 14. URL Lexical and Statistical Frequencies: Feature Importance
  15. 15. MODELING PHISHING URLS WITH RECURRENT NEURAL NETWORKS
  16. 16. Normal Neural Network. Source: https://en.wikipedia.org/wiki/Artificial_neural_network
  17. 17. Recurrent Neural Networks (RNN): They have loops!
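The "loop" can be made concrete with a one-unit recurrent layer in which the hidden state from the previous step feeds into the current update. This is a minimal sketch under assumed scalar weights (`w_x`, `w_h`, `b` are hypothetical names), not the network used in the talk.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a one-unit recurrent layer over a sequence; return all hidden states."""
    h = h0
    states = []
    for x in inputs:
        # The previous state h feeds back into the update: this is the loop.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


states = rnn_forward([1.0, 0.0], w_x=1.0, w_h=1.0, b=0.0)
```

The second state is nonzero even though the second input is zero, because it still carries information from the first input.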
  18. 18. The Problem of Long-Term Dependencies: Short-term dependencies are easy; long-term …
  19. 19. Long Short-Term Memory Networks (LSTM): RNN contains a single layer; LSTM contains four interacting layers. Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  20. 20. Long Short-Term Memory Networks (LSTM). Key idea: Cell State
  21. 21. LSTM Step-by-Step. Step 1. Decide what information is going to be used
  22. 22. LSTM Step-by-Step. Step 2. Which new information is stored
  23. 23. LSTM Step-by-Step. Step 3. Update old cell state
  24. 24. LSTM Step-by-Step. Step 4. Make prediction
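The four steps can be sketched as one update of a single-unit LSTM cell. The code follows the standard LSTM formulation (forget gate, input gate with candidate, cell-state update, output gate); mapping those gates onto Steps 1 to 4 is one reading of the slides, and the scalar weights and the `params` layout are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a single-unit LSTM cell; returns (h, c)."""

    def gate(name, activation):
        # Each gate has its own input weight, recurrent weight, and bias.
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # Step 1: how much old cell state to keep
    i = gate("input", sigmoid)         # Step 2: how much new information to store
    g = gate("candidate", math.tanh)   # Step 2: candidate values to store
    c = f * c_prev + i * g             # Step 3: update the old cell state
    o = gate("output", sigmoid)        # Step 4: how much cell state to expose
    h = o * math.tanh(c)               # Step 4: output used for the prediction
    return h, c


# Hypothetical all-zero parameters, chosen only to make the arithmetic easy to follow.
zero_params = {name: (0.0, 0.0, 0.0) for name in ("forget", "input", "candidate", "output")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=zero_params)
```

With all-zero parameters every sigmoid gate evaluates to 0.5 and the candidate to 0, so the cell state is simply halved.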
  25. 25. Modeling Architecture for URL Classification [Diagram: the URL characters (h t t p : / / w w w . p a p a y a . c o m) pass through One hot Encoding, Embedding, LSTM, and Sigmoid layers]
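The first stage of this architecture, turning a URL into a fixed-length sequence of character codes, can be sketched as follows. The vocabulary (printable ASCII), the padding index 0, and the sequence length of 150 are assumptions; the slides do not specify them. The embedding, LSTM, and sigmoid layers are not reproduced here.

```python
import string

# Assumed vocabulary: printable ASCII; index 0 is reserved for padding/unknown.
VOCAB = {ch: idx for idx, ch in enumerate(string.printable, start=1)}
MAX_LEN = 150  # hypothetical fixed sequence length; not given in the slides


def encode_url(url, max_len=MAX_LEN):
    """Map each character to an integer index, then left-pad or truncate to max_len."""
    ids = [VOCAB.get(ch, 0) for ch in url[:max_len]]
    return [0] * (max_len - len(ids)) + ids


def one_hot(ids, vocab_size=len(VOCAB) + 1):
    """Expand integer indices into one-hot rows (one row per character)."""
    return [[1 if col == idx else 0 for col in range(vocab_size)] for idx in ids]


encoded = encode_url("http://www.papaya.com", max_len=25)
rows = one_hot(encoded)
```

In practice an embedding layer usually consumes the integer indices directly, which is equivalent to multiplying the one-hot rows by the embedding matrix.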
  26. 26. Long Short-Term Memory Networks. Results (3-Fold CV): Accuracy 98.76% average, 0.04% deviation; Recall 98.93% average, 0.02% deviation; Precision 98.60% average, 0.02% deviation
  27. 27. Models Comparison [Bar chart: Accuracy, Recall, and Precision for the Long Short-Term Memory Network and Random Forest; axis from 90% to 100%]
  28. 28. Models Comparison. Random Forest: Memory Consumption 289 MB; Evaluation Time 942 URLs per sec; Training Time 2.95 minutes. Long Short-Term Memory Network: Memory Consumption 0.56 MB; Evaluation Time 281 URLs per sec; Training Time 238.7 minutes
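The trade-off in this table is easier to see as ratios. The sketch below uses only the figures from the slide; the one-million-URL workload in `minutes_to_score` is a hypothetical example, not a measurement from the talk.

```python
# Figures copied from the comparison slide.
rf = {"memory_mb": 289, "urls_per_sec": 942, "train_min": 2.95}
lstm = {"memory_mb": 0.56, "urls_per_sec": 281, "train_min": 238.7}

# How many times more memory the random forest uses.
memory_ratio = rf["memory_mb"] / lstm["memory_mb"]
# How many times faster the random forest evaluates URLs.
throughput_ratio = rf["urls_per_sec"] / lstm["urls_per_sec"]
# How many times longer the LSTM takes to train.
training_ratio = lstm["train_min"] / rf["train_min"]


def minutes_to_score(n_urls, urls_per_sec):
    """Minutes needed to evaluate n_urls at a given throughput."""
    return n_urls / urls_per_sec / 60


rf_minutes = minutes_to_score(1_000_000, rf["urls_per_sec"])
lstm_minutes = minutes_to_score(1_000_000, lstm["urls_per_sec"])
```

On these numbers the random forest needs roughly 516 times more memory, evaluates about 3.4 times faster, and trains about 81 times faster than the LSTM.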
  29. 29. What we learned • Discerning URLs by their patterns is a good predictor of phishing websites • The LSTM model shows higher overall prediction performance without the need for expert knowledge to create the features
  30. 30. Free to use
  31. 31. Thank you! If you have any questions or comments, please let me know. Alejandro Correa Bahnsen, PhD, Chief Data Scientist, acorrea@easysol.net