Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Local Image Descriptor Based Phishing Web Page Recognition as an Open-Set Problem

21 views

Published on

With the advent of e-commerce, digital services and social media, scammers have changed their way to gain illegal benefits in various forms such as capturing the credit card information or exploiting personal cloud accounts which is termed as phishing. For this reason, against this cyber crime, last two decades have witnessed a variety of combatting methodologies like HTML content based similarity analysis, URL based classification and recently visual similarity based matching since phishing web pages visually mimic to their legitimate counterparts in order to create an illusion to deceive innocent users. To this end, in this study, we propose a computer vision and machine learning based approach in order to classify whether a suspicious web page is phishing and further recognize its original brand name. In this regard, we have utilized and investigated two different local image descriptors namely Scale Invariant Feature Transform (SIFT) and DAISY. Apart from their common properties such as scale invariance, the aforementioned descriptors have apparent differences such that in addition to rotational invariance, SIFT employs key-point based sampling whereas DAISY applies dense sampling by default. Therefore, we first aimed to investigate the feasibility of these two local image descriptors in addition to revealing the effects of sampling strategy and rotational invariance in problem domain. Furthermore, in order to create a discriminative representation of a web page, we followed the bag of visual words (BOVW) approach having different vocabulary sizes such as 50, 100, 200 and 400. In order to evaluate the proposed approach, we have utilized a publicly available phishing dataset including snapshots of webpages sampled from both 14 different highly phished brands and ordinary legitimate web pages yielding a challenging open-set problem. The aforementioned dataset involves 1313 training and 1539 testing image samples in total. The visual features extracted via SIFT and DAISY were first transformed to a BOVW histogram and fed to three different machine learning methods such as SVM, Random Forest and XGBoost. According to the conducted experiments, based on a 400-D visual vocabulary, SIFT descriptor along with XGBoost has been found as the best descriptor-learner configuration having reached up to 89.34% validation accuracy with 0.76% false positive rate. Moreover, SIFT has outperformed DAISY descriptor in all settings. As a result, it has been shown that SIFT descriptors equipped with BOVW representation can be effectively used for brand identification of phishing web pages.

Published in: Engineering
  • Be the first to comment

  • Be the first to like this

Local Image Descriptor Based Phishing Web Page Recognition as an Open-Set Problem

  1. 1. Local Image Descriptor Based Phishing Web Page Recognition as an Open-Set Problem Ahmet Selman Bozkır, Esra Eroglu and Murat Aydos Hacettepe University Department of Computer Engineering, TURKEY Baskent University, Department of Management Sciences, TURKEY
  2. 2. Topics • What is phishing? • Types of phishing • Facts about phishing • Existing approaches • Why vision based scheme? • Proposed Method • Details of Phish-IRIS dataset • Experiments and results • Conclusion
  3. 3. What is phishing? • Phishing is a scamming activity which is based on creating a visual illusion on innocent users by providing fake web pages which mimic their legitimate targets in order to steal valuable digital data such as credit card information or e-mail passwords. Phone phreaking + fishing -> «phishing»
  4. 4. Types of Phishing • Spear phishing (Person focused) • Clone phishing (Bulk) • Whaling (Person/Institution Focused) • Rogue WIFI (Mitm)
  5. 5. Facts and figures * Source: PhishLabs 2016 Phishing Trends & Intelligence Report
  6. 6. Facts and figures • In 2017, 700.ooo unique phishing attacks have been reported* * Source: 2017 Quarter 3 Phishing Reports of APWG
  7. 7. Facts and figures Average life time of phishing pages is 32 hours • Risk of zero-day attacks getting higher due to not being discovered by blacklists 32h * Source: APWG, Phishing activity trends paper. [Online]. Available at http://www/antiphishing.org/resources/apwg-papers/
  8. 8. Facts and figures Consumer-oriented phishing attacks targeted • financial institutions • cloud storage/file hosting sites • webmail and online services • ecommerce sites • payment services. 90% * Source: PhishLabs 2016 Phishing Trends & Intelligence Report
  9. 9. Facts and figures • financial institutions • payment services. * Source: PhishLabs 2016 Phishing Trends & Intelligence Report • cloud storage/file hosting sites • cryptocurrency
  10. 10. Existing Anti-Phishing Approaches Content & Blacklist CANTINA [1] SpoofGuard[2] NetCraft [3] DOM based Medvet et al.[4] Zhang et al. [5] Fu et al. [6] Vision based Maurer et al.[7] Verilog [8] DeltaPhish [10] Other Chen et al.[9]
  11. 11. Why a vision based scheme? • Substition of textual HTML elements with <IMG> or applet like rich internet application (RIA) contents • Zero day attacks need pro-active solutions • Dynamic / AJAX type content loading • Different DOM organizations between legitimate and target phishing version. • Robustness against complex backgrounds or page layouts • And the most important is vision based solutions are in concordance with human perception
  12. 12. Our Proposal: Use of SIFT and DAISY descriptors in Bag of Visual Words Representation Scale Invariant Feature Transform • Lowe, 2004 • Local patch based key point driven sampling • Scale Invariance • Rotation Invariance • Robustness against illumination
  13. 13. Our Proposal: Use of SIFT and DAISY descriptors in Bag of Visual Words Representation DAISY Descriptor • Tola et al., 2010 • Local Patch based dense sampling • Scale Invariance • Robustness against illumination
  14. 14. Architecture - Flow
  15. 15. Phish-IRIS Data Set Publicly available at https://web.cs.hacettepe.edu.tr/~selman/phish-iris-dataset
  16. 16. Phish-IRIS Data Set • Lack of a common phishing dataset tailored for vision based antiphishing • Based on real world observation and literature. • 14 heavily phished target brands + legitimate samples • Collected between March and May 2018, Phishtank + Openphish • Open-set problem -> Collect “other” legitimate samples • Distinct screenshots • 1313 Training + 1539 Testing samples • Screenshots were collected via a specially implemented Java based wrapper equipped with Selenium Web Driver
  17. 17. Phish-IRIS Data Set Brand Name Training Samples Testing Samples Adobe 43 27 Alibaba 50 26 Amazon 18 11 Apple 49 15 Bank of America 81 35 Chase Bank 74 37 Dhl 67 42 Dropbox 75 40 Facebook 87 57 Linkedin 24 14 Microsoft 65 53 Paypal 121 93 Wellsfargo 89 45 Yahoo 70 44 Other 400 1000 Total 1313 1539 • Ratio of train/test for brands: • 2/3 (roughly) • Ratio of “unknown/other”: • 4/10
  18. 18. Experiments • We made variour experiments with 2 types of descriptors on SVM, XGBoost and Random Forest algorithms • We have tested different visual word counts such as 50, 100, 200 and 400 in order to understand whether sparsity or weak/strong features affect the prediction quality • Assessments have been carried out on test images via built machine learning models trained on training images • Evaluations were carried out regarding to True Positive Rate, False Positive Rate and F1 measures
  19. 19. Results – 1: SIFT based prediction Word count Algorithm Train acc Test acc TPR FPR F1 bov-50 svm 0.611 0.7732 0.7732 0.016 0.77 bov-50 xgboost 0.725 0.8187 0.8187 0.012 0.82 bov-50 random forest 0.729 0.842 0.842 0.112 0.84 bov-100 svm 0.674 0.803 0.803 0.014 0.80 bov-100 xgboost 0.762 0.846 0.8466 0.010 0.85 bov-100 random forest 0.749 0.860 0.860 0.0099 0.86 bov-200 svm 0.747 0.837 0.837 0.011 0.84 bov-200 xgboost 0.799 0.8589 0.8589 0.01 0.86 bov-200 random forest 0.774 0.8823 0.8823 0.0084 0.88 bov-400 svm 0.821 0.8758 0.875 0.008 0.88 bov-400 xgboost 0.827 0.8934 0.893 0.0076 0.89 bov-400 random forest 0.8 0.8875 0.8875 0.0080 0.89
  20. 20. Results – 2: DAISY based prediction Word count Algorithm Train acc Test acc TPR FPR F1 bov-50 svm 0,648 0,7465 0,746 0,018 0,74 bov-50 xgboost 0,678 0,7849 0,784 0,015 0,78 bov-50 random forest 0,699 0,816 0,816 0,013 0,8 bov-100 svm 0,709 0,7758 0,775 0,016 0,77 bov-100 xgboost 0,709 0,7953 0,795 0,014 0,79 bov-100 random forest 0,715 0,8226 0,822 0,012 0,81 bov-200 svm 0,725 0,7901 0,79 0,014 0,79 bov-200 xgboost 0,722 0,8174 0,817 0,013 0,81 bov-200 random forest 0,719 0,831 0,831 0,012 0,82 bov-400 svm 0,725 0,818 0,818 0,818 0,81 bov-400 xgboost 0,725 0,8122 0,812 0,013 0,8 bov-400 random forest 0,716 0,8356 0,835 0,011 0,82
  21. 21. Comparison with (Bozkir et al. 2018*) Method # of Features Learner TPR FPR F1 JCD 5040 random forest 0.891 0.131 0.886 FCTH 5760 random forest 0.895 0.114 0.891 CEDD 4320 random forest 0.895 0.137 0.89 CLD 360 random forest 0.878 0.146 0.873 SCD 3584 svm 0.906 0.085 0.905 Our best (Sift) 400 xgboost 0.893 0.0076 0.89 * Dalgic, F.C., Bozkir A.S., Aydos, M., “Phish-IRIS: A New Approach for Vision Based Brand Prediction of Phishing Web Pages via Compact Visual Descriptors”, ISMSIT, Kızılcahamam, Ankara, 2018
  22. 22. Conclusion • SIFT features based prediction outperforms the DAISY based recognition • We have found that required time for computation of SIFT is less than DAISY • One key finding we discovered is the importance of sampling strategy. Key point based sampling yields a better result • More the visual words we extract, more the accuracy we achieve. So, sparsity has not been found as a problem for SIFT and DAISY. Inference takes only 0.32 seconds on a PC equipped with Intel 8750 + 16 GB Memory • Scalable Color Descriptor still surpasses SIFT in terms of TPR and FPR meaning that color information is important as edge / contour / textures. However, SIFT achieves a better false positive rate with much less number of features • Consequently, SIFT is a suitable candidate for phishing web page detection/recognition • Future work: Use of Deep Convolutional Neural Networks
  23. 23. References 1. Y. Zhang, J. Hong, L. Cranor, CANTINA: A Content-Based Approach to Detecting Phishing Web Sites, WWW 2007 2. Chou, N., R. Ledesma, Y. Teraguchi, D. Boneh, and J.C. Mitchell. Client-Side Defense against Web-Based Identity Theft. In Proceedings of The 11th Annual Network and Distributed System Security Symposium (NDSS '04). 3. Netcraft, Netcraft Anti-Phishing Toolbar. Visited: April 20, 2016. http://toolbar.netcraft.com/ 4. E. Medvet, E. Kirda and C. Krueger, Visual-Similarity-Based Phishing Detection, Securecomm ’08 International Conference on Security and Privacy in Communication Networks, 2008 5. W. Zhang, H. Lu, B. Xu and H. Yang, Web Phishing Detection Based on Page Spatial Layout Similarity, Informatica, vol. 37, pp. 231-244, 2013. 6. A.Y. Fu, L. Wenyin and X. Deng, Detecting Phishing Web Pages with Visual Similarity Assesment based Earth Mover’s Distance (EMD), IEEE Transactions on Dependable and Secure Computing, pp. 301-311, 2006. 7. M.E. Maurer and D. Herzner, Using visual website similarity for phishing detection and reporting, In CHI’12 Extended Abstacts on Human Factors in Computing Systems, 2012. 8. G. Wang, H. Liu, S. Becerra, K. Wang, Verilog: Proactive Phishing Detection via Logo Recognition, Technical Report CS2011-0669, UC San Diego, 2011. 9. T. Chen, S. Dick, J. Miller, Detecting Visually Similar Web Pages: Application to Phishing Detection, ACM Transactions on Internet and Technology, 10(2), 2010

×