In this paper, we detail an anti-phishing detection system that employs HOG features.
* This presentation includes a voice recording
Use of HOG Descriptors in Phishing Detection
Ahmet Selman Bozkir, Ebru Akcapinar Sezer
Hacettepe University Computer Engineering Department, TURKEY
ISDFS 2016
Topics
• What is phishing?
• Facts and the rise of phishing attacks
• Existing approaches
• Why a vision-based scheme?
• HOG descriptors
• Demonstration of developed method
• Experiments and Results
• Conclusion
What is phishing?
• Phishing is a scam that creates a visual illusion for computer
users by presenting fake web pages that mimic their legitimate
targets, in order to steal valuable digital data such as credit
card information or e-mail passwords.
Phone phreaking + fishing -> «phishing»
Facts and figures
• In 2012-2013, 37.3 million users were affected by
phishing attacks*
* Source: 2013 Verizon Data Breach Investigation Report
Facts and figures
• 1 million confirmed
malicious phishing sites
on over 130,000 unique
domains. (as of 2013)
* Source: PhishLabs 2016 Phishing Trends & Intelligence Report
Facts and figures
• The average lifetime of a phishing page is 32 hours
• The risk of zero-day attacks is rising because such pages
are not yet discovered by blacklists
* Source: APWG, Phishing activity trends paper. [Online].
Available at http://www.antiphishing.org/resources/apwg-papers/
Existing Anti-Phishing Approaches
• Content & blacklist based: CANTINA [1], SpoofGuard [2], Netcraft [3]
• DOM based: Medvet et al. [4], Zhang et al. [5], Fu et al. [6]
• Vision based: Maurer et al. [7], Verilogo [8]
• Other: Chen et al. [9]
Why a vision-based scheme?
• Substitution of textual HTML elements with <IMG> or applet-like
contents
• Zero day attacks need pro-active solutions
• Dynamic / AJAX type content loading
• Different DOM organizations between legitimate and fake web
pages
• More robust to complex backgrounds or page layouts
• Most importantly, vision-based solutions are in line with
human perception
Methodology: HOG Features and Descriptors
• HOG: Histogram of Oriented Gradients
• Introduced by Dalal & Triggs (2005)
• A good way to characterize and capture
local object appearance or shapes by
utilizing distribution of intensity
gradients or edge directions.
• Preferred because:
(i) HOG descriptors are able to capture visual
cues of the overall page layout;
(ii) they provide a certain degree
of rotation and translation invariance.
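The idea of a histogram of oriented gradients can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `hog_cell_histograms`, the choice of 9 orientation bins, and the per-cell L2 normalisation are assumptions, and the overlapping block normalisation of the original Dalal & Triggs descriptor is omitted.

```python
import numpy as np

def hog_cell_histograms(gray, cell=64, bins=9):
    """Per-cell, L2-normalised histograms of unsigned gradient orientation."""
    gray = np.asarray(gray, dtype=np.float64)
    gy, gx = np.gradient(gray)                      # derivatives along rows, then columns
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0  # unsigned orientation in degrees
    # Map each pixel to an orientation bin; clip guards against rounding at 180.
    bin_index = np.minimum((angle / (180.0 / bins)).astype(int), bins - 1)

    rows, cols = gray.shape[0] // cell, gray.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            window = (slice(r * cell, (r + 1) * cell),
                      slice(c * cell, (c + 1) * cell))
            # Each pixel votes for its bin, weighted by gradient magnitude.
            hist[r, c] = np.bincount(bin_index[window].ravel(),
                                     weights=magnitude[window].ravel(),
                                     minlength=bins)
    norms = np.linalg.norm(hist, axis=2, keepdims=True)
    return hist / np.maximum(norms, 1e-12)

# Synthetic 128 x 256 "screenshot": a horizontal intensity ramp.
page = np.tile(np.arange(256.0), (128, 1))
descriptor = hog_cell_histograms(page, cell=64)
print(descriptor.shape)  # (2, 4, 9)
```

Because each histogram summarises a whole cell, small shifts of content inside a cell change the descriptor only slightly, which is the source of the limited translation tolerance mentioned above.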
Experiments
• For the phishing web page dataset, we collected 50 unique
phishing pages reported to PhishTank between 14 December 2015
and 5 January 2016.
• For the legitimate web page pairs, we collected 18 legitimate
home pages from the Alexa top 500 web site directory.
We then shuffled the page URLs to obtain 100 distinct
legitimate home page pairs.
• Cells 64 pixels wide and 128 pixels wide were employed
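The slides do not describe the shuffling procedure in detail. One way to obtain distinct pairs is sketched below: 18 pages yield 18 × 17 / 2 = 153 possible unordered pairs, so 100 distinct pairs can be drawn without repetition. The helper name `sample_distinct_pairs`, the fixed seed, and the placeholder URLs are assumptions made for illustration.

```python
import itertools
import random

def sample_distinct_pairs(urls, n_pairs, seed=0):
    """Draw n_pairs distinct unordered pairs from a list of URLs."""
    all_pairs = list(itertools.combinations(urls, 2))  # every unordered pair once
    if n_pairs > len(all_pairs):
        raise ValueError("not enough pages for the requested number of pairs")
    return random.Random(seed).sample(all_pairs, n_pairs)

# Placeholder URLs standing in for the 18 legitimate home pages.
urls = [f"https://site{i}.example/" for i in range(18)]
pairs = sample_distinct_pairs(urls, 100)
print(len(pairs))  # 100
```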
Results - 1
STATISTICS OF PHISHING AND THEIR TARGET PAGE PAIRS IN HOG-64 AND HOG-128
Similarity of Pairs of Phishing Pages (50 pages)
Statistic             HOG, 64 px cells    HOG, 128 px cells
min                   51.873 %            49.910 %
max                   98.861 %            98.390 %
mean                  78.868 %            78.637 %
standard deviation    12.147 %            10.963 %
STATISTICS OF UNIQUE LEGITIMATE PAGE PAIRS IN HOG-64 AND HOG-128
Similarity of Pairs of Legitimate Pages (100 unique pairs)
Statistic             HOG, 64 px cells    HOG, 128 px cells
min                   38.420 %            45.683 %
max                   74.459 %            77.092 %
mean                  60.739 %            66.012 %
standard deviation    11.026 %            9.492 %
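The slides do not state how the similarity percentage is computed. As a sketch only, the example below assumes histogram intersection between two non-negative descriptors and summarises a list of scores with the same four statistics reported in the tables. The function names, the use of the sample standard deviation (`ddof=1`), and the toy scores are assumptions and are not taken from the experiments.

```python
import numpy as np

def similarity_percent(h1, h2):
    """Histogram intersection of two non-negative descriptors, in percent.

    Assumes both descriptors have a non-zero sum.
    """
    h1 = np.asarray(h1, dtype=np.float64).ravel()
    h2 = np.asarray(h2, dtype=np.float64).ravel()
    h1 = h1 / h1.sum()   # normalise so each descriptor sums to 1
    h2 = h2 / h2.sum()
    return 100.0 * np.minimum(h1, h2).sum()

def summarize(scores):
    """Min, max, mean and sample standard deviation of similarity scores."""
    scores = np.asarray(scores, dtype=np.float64)
    return {"min": scores.min(), "max": scores.max(),
            "mean": scores.mean(), "std": scores.std(ddof=1)}

# Toy scores for illustration only.
stats = summarize([50.0, 70.0, 90.0])
print(stats["mean"], stats["std"])
```

Under this measure, identical descriptors score 100 % and descriptors with no overlapping mass score 0 %.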
Discussion and Conclusion
• This work is the first study that employs HOG in phishing detection
• It provides a robust method for phishing detection, as it is purely vision based and
able to capture local visual cues on the web page surface.
• However, we identified some shortcomings.
• Image contents in phishing web pages generally differ from those in the legitimate
ones, so invariance to image content must be provided in order to achieve better and
more robust phishing detection.
• The method must be also verified with a more comprehensive dataset.
References
1. Y. Zhang, J. Hong and L. Cranor, CANTINA: A Content-Based Approach to Detecting Phishing Web Sites, WWW 2007.
2. N. Chou, R. Ledesma, Y. Teraguchi, D. Boneh and J.C. Mitchell, Client-Side Defense against Web-Based Identity Theft, In Proceedings of the 11th Annual Network and Distributed System Security Symposium (NDSS '04).
3. Netcraft, Netcraft Anti-Phishing Toolbar. Visited: April 20, 2016. http://toolbar.netcraft.com/
4. E. Medvet, E. Kirda and C. Kruegel, Visual-Similarity-Based Phishing Detection, SecureComm ’08 International Conference on Security and Privacy in Communication Networks, 2008.
5. W. Zhang, H. Lu, B. Xu and H. Yang, Web Phishing Detection Based on Page Spatial Layout Similarity, Informatica, vol. 37, pp. 231-244, 2013.
6. A.Y. Fu, L. Wenyin and X. Deng, Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover’s Distance (EMD), IEEE Transactions on Dependable and Secure Computing, pp. 301-311, 2006.
7. M.E. Maurer and D. Herzner, Using Visual Website Similarity for Phishing Detection and Reporting, In CHI ’12 Extended Abstracts on Human Factors in Computing Systems, 2012.
8. G. Wang, H. Liu, S. Becerra and K. Wang, Verilogo: Proactive Phishing Detection via Logo Recognition, Technical Report CS2011-0669, UC San Diego, 2011.
9. T. Chen, S. Dick and J. Miller, Detecting Visually Similar Web Pages: Application to Phishing Detection, ACM Transactions on Internet Technology, 10(2), 2010.