Phishing Attacks: Trends, Detection Systems and Computer Vision as a Promising Approach

DR. AHMET SELMAN BOZKIR
DEPT. OF COMPUTER ENGINEERING - HACETTEPE UNIVERSITY
SEMINAR

Today
• What is phishing?
• Facts and current trends
• Types of phishing
• Examples of attack types
• Why has the phishing problem not been solved yet?
• Phishing detection methods in the literature
• Vision based analysis and various studies, challenges
• What have we done so far? Our vision
• Conclusion

What is Phishing?
• Phishing is a criminal mechanism employing both social
engineering and technical subterfuge to steal consumers’
personal identity data and financial account credentials.
• Social engineering schemes prey on unwary victims by
fooling them into believing they are dealing with a
trusted, legitimate party, such as by using deceptive email
addresses and email messages.
(APWG – Anti-Phishing Working Group)
• Phone phreaking + fishing -> «phishing»

Underlying Truth
• In 350 BC, Aristotle noted that “our senses can be trusted
but they can be easily fooled”.
• A study by Richard Gregory claims that only 20% of our
visual perception comes through our eyes; the remaining
part relies on our inferences.
• Actual reason: careless operations

A typical life cycle of a phishing campaign

Facts and Current Trends
• SaaS/Webmail (36%)
• Payment (22%)
• Financial Inst. (18%)
• Other (9%)
• E-commerce, Retail (3%)
• Social Media (3%)
• Cloud Storage, Hosting (3%)
• Telecommunications (3%)

Financial Loss
• BEC: Business Email Compromise

Use of SSL in Phishing Websites

Types of phishing attacks
• Typical phishing
• Spear phishing
• Whaling
From typical phishing to whaling: more quantity / less profit -> less quantity / more profit

Example e-mails for “typical phishing”

Example e-mails for “spear phishing”

An example e-mail of “whaling phishing”

Why has the phishing problem not been solved yet?
• Even Highly Trained Users Are Clicking
When reading a hundred emails in the middle of a stressful workday, even the most well-trained and observant employee will click on a malicious email.
• Phishing Attacks Are Increasingly Sophisticated
- Employees are taught to look for typos and poor grammar to identify a text lure, but over the last year, attackers have improved their spelling and learned to match legitimate messages.
- More phishing sites are using HTTPS certificates in order to fool users with the green “secure” icon in the browser that, ironically, users will interpret as “safe”.
- Domain spoofing and domain impersonation are more sophisticated: the attacker can send from an authentic Microsoft address.

Why has the phishing problem not been solved yet?
• Phishing Has Become Too Targeted for Traditional Spam-Type Filters
Broad, spam-like phishing attacks are easily caught. Targeted, customized phishing attacks are hard to catch and on the rise: spear-phishing attacks, especially business email compromise (BEC), have almost doubled since the beginning of the year, made easier by the large-scale data breaches last year.
• Targeted Attacks Have Become Psychologically More Sophisticated
- Attackers have learned to combine personalized information with a number of effective motivators.
- Fear, urgency, and curiosity were the top motivators in previous years, but they have been replaced by entertainment, social and reward recognition.

Combatting Methods against Phishing
• The URL
• The Source Code (DOM)
• The Image/Screenshot
• Domain Knowledge (Web Information)

Classification of Anti-Phishing Methods
• Blacklist: Google Safe Browsing API; Rosiello et al. (2007); Han et al. (2008); PDA – Jain & Gupta (2016)
• URL: Sahingoz et al. (2017); CatchPhish – Rao et al. (2018); URLNet – Hung et al. (2018); PDRCNN – Wang et al. (2019)
• DOM: CANTINA+ (2011); Marchal et al. (2016); Buber et al. (2017); Jain & Gupta (2018)
• Visual Similarity: Maurer et al. (2013); Verilogo (2015); DeltaPhish (2017); PhishIRIS – Dalgic et al. (2018)
Spectrum: less resource -> time consuming / more resource / robust to “zero-hour” attacks

The URL
• The Uniform Resource Locator (URL) is the address of a resource,
in this case a web page, on the World Wide Web.
• Many researchers use this source of information in their studies to
extract key features to identify a phishing webpage.
• While some of them propose solutions using hand-crafted
(lexical) features, others apply features learned by machine
learning.

Some Phishing URLs
http://www.cnhedge.cn/js/index.htm?http://us.battle.net/login/en/?ref=http://spdfozrus.battle.net/d3/en/index
http://www.arvindudyog.com/bright/bright/drake/bright/45886564bea8a9f07a8055347163a4a3/
http://amcnamibia.com/wp-admin/file/files/db/file.dropbox/
http://www.arvindudyog.com/papa/
http://www.iowasaferoutes.org/wp-content/plugins/wpsecone/dhl/
http://www.imanaforums.com/neomodules/accesst/
http://ausbuildblog.com.au/wp-content/heaven/index.php
http://fengshuireview.com/upload/free.mobile.fr/facturtion/finale/free/
http://searchenginetricks.ca/cam/config/webmail/
http://www.i-robot.kiev.ua/self/dropbox/dropbox/dropbox/
http://www.justaskaron.com/octapharma.com.ca/
http://i-robot.kiev.ua/self/dropbox/dropbox/dropbox/index.php
http://kiltonmotor.com/others/m.i.php?n=1774256418&rand.13inboxlight.aspxn.1774256418&rand=13inboxlightaspxn.1774256418&username1=&username
http://www.sindhuratna.com/new2015/document.php
http://justaskaron.com/octapharma.com.ca/index.php
http://www.alexsandroleiloes.com.br/admin/beats/verification-folder.php
http://www.vantaiduccuong.com/soutdoc/es/
http://www.pt-tkbi.com/providernet/provider/provider/webmail/securenow/webnet/
http://www.alhadbaa.org/googledrive/
http://www.parfumwangimurah.com/g9/
http://proseind.cl/new/index.php
http://annstringer.com/storagechecker/domain/ii.php

Lexical URL Features
• #dots
• #special characters
• #suffixes
• Length of URL
• Length of the query string
• Subdomain name
• Suspicious characters / Punycode
• TLD Name and its length
• Domain Name
• The depth of the subdomain
• Having an SSL certificate (https)
• ….
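
A minimal sketch of how a few of the lexical features listed above could be computed with Python's standard library. The function name, the set of "special" characters and the naive subdomain-depth rule (host labels left of the last two) are assumptions made for illustration; they are not taken from any of the cited studies.

```python
from urllib.parse import urlparse

# Illustrative choice of "special" characters (an assumption, not from the slides).
SPECIAL_CHARS = set("@-_~%=&?+")


def lexical_features(url: str) -> dict:
    """Extract a handful of hand-crafted lexical features from a URL string."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    labels = host.split(".") if host else []
    return {
        "num_dots": url.count("."),
        "num_special_chars": sum(ch in SPECIAL_CHARS for ch in url),
        "url_length": len(url),
        "query_length": len(parsed.query),
        "tld": labels[-1] if labels else "",
        "tld_length": len(labels[-1]) if labels else 0,
        # Naive depth: labels left of "domain.tld"; wrong for suffixes like "com.au".
        "subdomain_depth": max(len(labels) - 2, 0),
        # Punycode labels (internationalized domain names) start with "xn--".
        "has_punycode": any(label.startswith("xn--") for label in labels),
        "uses_https": parsed.scheme == "https",
    }


example = lexical_features(
    "http://www.iowasaferoutes.org/wp-content/plugins/wpsecone/dhl/"
)
for name, value in example.items():
    print(f"{name}: {value}")
```

A production system would likely use a public-suffix list to separate the registered domain from its subdomains instead of the two-label rule used here.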

Most discriminative 4-grams: chi-square
• “%20(”: 99.35901350685741
• “.log”: 155.82961566651434
• “logi”: 1947.7954010788872
• “ogin”: 2096.632706999275
• “secu”: 895.0781029132113
• “/wp-”: 1629.5131963112008
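
The scores above rank character 4-grams by how strongly they separate phishing from legitimate URLs. The sketch below shows one way such a ranking could be produced, using the 2x2 presence/absence form of the chi-square statistic. The slides do not say which formulation or tooling was used, so this form, the function names and the four toy URLs are all assumptions.

```python
from collections import Counter


def char_ngrams(text, n=4):
    """Return the set of distinct character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def chi_square_scores(urls, labels, n=4):
    """Score each n-gram with the 2x2 chi-square statistic.

    labels: 1 for phishing, 0 for legitimate.
    """
    total = len(urls)
    n_pos = sum(labels)
    n_neg = total - n_pos
    pos_counts, neg_counts = Counter(), Counter()
    for url, label in zip(urls, labels):
        (pos_counts if label else neg_counts).update(char_ngrams(url, n))

    scores = {}
    for gram in set(pos_counts) | set(neg_counts):
        a = pos_counts[gram]   # phishing URLs containing the n-gram
        b = neg_counts[gram]   # legitimate URLs containing the n-gram
        c = n_pos - a          # phishing URLs without it
        d = n_neg - b          # legitimate URLs without it
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[gram] = total * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


# Toy data, invented for illustration only.
toy_urls = [
    "http://a.com/login",
    "http://b.com/login",
    "http://c.com/home",
    "http://d.com/news",
]
toy_labels = [1, 1, 0, 0]

scores = chi_square_scores(toy_urls, toy_labels)
for gram, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(repr(gram), round(score, 3))
```

N-grams that occur in every URL (such as "http" in the toy data) score zero because they carry no class information.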

The Source Code
• Consists of HTML DOM, JavaScript and CSS components.
• Provides the main markup directives for layout
information.

The source code is no longer applicable!
• Thanks to the capabilities of JavaScript and large libraries such as
React.js and Angular.js, web page implementation is shifting
from static rendering to dynamic rendering.
• Ajax and dynamic content loading
• Misuse of HTML tags
• Countless markup combinations can produce the same rendering!
• Thus, HTML, CSS or tag similarity is not guaranteed to be a reliable
source of evidence!
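
To illustrate the point about markup combinations, the sketch below compares two hypothetical login snippets that carry the same visible text but are built from different tags. Both snippets and the Jaccard-over-tag-sets measure are invented for illustration; they do not reproduce any specific DOM-similarity method from the literature.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect start-tag names and visible text from an HTML snippet."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def summarize(html):
    """Return (set of tag names, visible text joined by spaces)."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return set(parser.tags), " ".join(parser.text)


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Two hypothetical pages: same visible text, different markup.
page_a = ("<form><label>Password</label><input type='password'>"
          "<button>Sign in</button></form>")
page_b = ("<div><span>Password</span><input type='password'>"
          "<a><b>Sign in</b></a></div>")

tags_a, text_a = summarize(page_a)
tags_b, text_b = summarize(page_b)
print("same visible text:", text_a == text_b)
print("tag-set similarity:", jaccard(tags_a, tags_b))
```

The visible text matches while the tag sets share only one element, which is why tag-level similarity alone can miss a page that looks the same to the user.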

Phish-Sense
• Introduces a fusion of information extracted
from lexical features and various n-gram
models to capture phishing URL patterns.
• The chi-square method is used for feature
selection.
• 71,250 samples were provided.
• Outperforms all traditional methods with a
TP rate of 98.24%, but is itself outperformed
by deep learning methods!

URLNet - 2018
• One of the first published
works based on deep
learning methods.
Le, Hung, et al. "URLNet: learning a URL representation with deep learning for malicious URL detection." arXiv preprint arXiv:1802.03162 (2018).
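
The slides give no architectural details for URLNet. As a hedged illustration of what "learning a URL representation" typically starts from, the sketch below turns a URL into a fixed-length sequence of character IDs, a common input format for character-level neural models. The alphabet, the padding/unknown IDs and the maximum length of 200 are assumptions, not values from the paper.

```python
import string

# Assumed alphabet: lowercase letters, digits and common URL punctuation.
ALPHABET = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
PAD, UNK = 0, 1  # reserved IDs for padding and unknown characters
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}


def encode_url(url, max_len=200):
    """Map a URL to a list of max_len integer IDs (truncate, then right-pad)."""
    ids = [CHAR_TO_ID.get(ch, UNK) for ch in url.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


encoded = encode_url("http://www.alhadbaa.org/googledrive/")
print(len(encoded), encoded[:10])
```

A sequence like this would then be fed to an embedding layer followed by convolutional or recurrent layers; that part is omitted here.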

Visual similarity or vision based analysis?
Logo / Screenshot of whole page / Image with layout
• DOM tree similarity
• Visual features
• CSS similarity
• Layout similarity via VIPS
(block and overall layout)

Can computer vision help us?
• 47%-83% of newly found phishing pages are added to blacklists within 12 hours. Zero-day
attacks need proactive solutions!
• Predefined or hand-crafted heuristics are evaded by attackers
• 23% of users do not even look at the address bar! (Dhamija et al.)
• Substitution of textual HTML elements with <IMG> or applet-like rich internet
application (RIA) content such as Flash, ActiveX, Silverlight
• Loading of dynamic/AJAX-based content, IFRAME
• Different DOM organizations between the legitimate page and its phishing version
• Robustness against complex backgrounds or page layouts
• Brand recognition can be done in a holistic manner
• Language and source code independence
• Most importantly, vision based solutions are in concordance with human
perception

Challenges related to vision based anti-phishing
• Lack of a well curated dataset
• Vast number of brands
• High intra-class variation among the phishing samples of brands
• Inconsistent layouts
• Unrelated layouts and color schemes
• Data leakage, which biases the evaluation

Phish-Iris Dataset
Publicly available at https://web.cs.hacettepe.edu.tr/~selman/phish-iris-dataset

HOG and MPEG7 like compact visual descriptors (2016, 2018)
• Based on global image similarity via descriptors
• Processes the screenshot of the whole web page.
• 92% accuracy.
- Bozkir, Ahmet Selman, and Ebru Akcapinar Sezer. "Use of HOG descriptors in phishing detection." 2016 4th International Symposium on Digital Forensic and Security (ISDFS). IEEE, 2016
- F. C. Dalgic, A. S. Bozkir, and M. Aydos, “Phish-iris: A new approach for vision based brand prediction of phishing web pages via compact visual descriptors,” in Proceedings of the IEEE International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), 2018
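
For readers unfamiliar with HOG, the sketch below computes a heavily simplified gradient-orientation histogram for a single cell of a grayscale image given as a list of rows. Full HOG additionally splits the image into many cells, groups them into overlapping blocks and normalizes per block; none of that is shown, and the 9-bin unsigned-gradient setup is a common default assumed here, not a setting reported in the cited papers.

```python
import math


def orientation_histogram(image, bins=9):
    """Single-cell, L2-normalized histogram of unsigned gradient orientations.

    image: 2-D list of grayscale values; border pixels are skipped.
    """
    hist = [0.0] * bins
    height, width = len(image), len(image[0])
    bin_width = 180.0 / bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = image[y][x + 1] - image[y][x - 1]   # horizontal gradient
            gy = image[y + 1][x] - image[y - 1][x]   # vertical gradient
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned
            hist[min(int(angle / bin_width), bins - 1)] += magnitude
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]


# Toy 4x4 images: a vertical edge and a horizontal edge.
vertical_edge = [[0, 0, 10, 10] for _ in range(4)]
horizontal_edge = [[0] * 4, [0] * 4, [10] * 4, [10] * 4]

print(orientation_histogram(vertical_edge))
print(orientation_histogram(horizontal_edge))
```

Two screenshots could then be compared by a distance between their descriptors, which is the general idea behind global image similarity.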

WhiteNet (Phishing Website Detection by Visual Whitelists)
• Consists of three CNNs structured as Siamese networks.
• 2 steps in training stage (81% top-1 match)
• Based on FaceNet.
- Sahar Abdelnabi, Katharina Krombholz and Mario Fritz, WhiteNet: Phishing Website Detection by Visual Whitelists, https://arxiv.org/pdf/1909.00300.pdf, 2019
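
FaceNet-style models with three weight-sharing branches are commonly trained with a triplet loss, which pulls an anchor embedding towards a positive (same brand) and away from a negative (different brand). The sketch below shows that loss on plain Python lists. The margin of 0.2 and the toy embeddings are assumptions, and the slides do not state the exact loss or training schedule WhiteNet uses.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(
        0.0,
        squared_distance(anchor, positive)
        - squared_distance(anchor, negative)
        + margin,
    )


# Toy 2-D embeddings, invented for illustration.
anchor = [0.0, 0.0]       # screenshot of a whitelisted login page
same_brand = [0.0, 1.0]   # another page of the same brand
other_brand = [3.0, 4.0]  # a page of a different brand

print(triplet_loss(anchor, same_brand, other_brand))   # well separated
print(triplet_loss(anchor, other_brand, same_brand))   # violates the margin
```

At detection time, a query screenshot would be embedded and matched against the whitelist by nearest-neighbour distance.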

Verilogo: proactive phishing detection via logo recognition
• SIFT based keypoint matching over 400/200 px stripes
• Pairwise comparison (not scalable)
• 6 seconds/image
• 352 image dataset
G. Wang et al., Verilogo: Proactive Phishing Detection via Logo Recognition, 2010

LogoSENSE
• Object detection strategy with max-margin loss SVM and HOG
• 0.04 seconds to analyze on CPU (~1024*1024 px)
• A dedicated dataset covering 15 brands with 1530 training + 1979 testing images (1000 legitimate samples)
Bozkir and Aydos, LogoSENSE: A Companion HOG based Logo Detection Scheme for Phishing Web Page and E-mail Brand Recognition (under revision)

Scr2Seg: Screenshot to Segments by deep learning
• A deep learning semantic segmentation approach to
understand the page layout by just looking at the
screenshot, without needing anything else
• 197 pixel-wise annotated screenshots were collected
• Up to 85% mIoU has been achieved
• Data collection is ongoing
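
The mIoU figure above is the mean intersection-over-union across segmentation classes. A minimal sketch of the metric for two label maps follows. The tiny 2x2 maps and the rule of skipping classes absent from both maps are illustrative assumptions, since evaluation protocols differ on that detail.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target.

    pred, target: 2-D lists of integer class labels with the same shape.
    """
    ious = []
    for cls in range(num_classes):
        intersection = union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                in_pred, in_target = p == cls, t == cls
                intersection += in_pred and in_target
                union += in_pred or in_target
        if union:  # skip classes absent from both maps
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy label maps: class 0 could be background, class 1 a page segment.
predicted = [[0, 0],
             [1, 1]]
annotated = [[0, 1],
             [1, 1]]

print(mean_iou(predicted, annotated, num_classes=2))
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.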

Conclusion
• Due to the capabilities of JavaScript and large libraries such as React.js and Angular.js,
web page construction is shifting from static rendering to dynamic rendering.
Therefore, using HTML and CSS content in future solutions may not be as feasible as it
used to be.
• Combined with legitimate domain compromise, SSL is no longer effective evidence
of trust.
• Computer vision based approaches work similarly to human perception and are gaining
popularity for both phishing e-mail and web page identification and brand recognition.
• Low FPR is crucial!
• A standard and well curated benchmark dataset is required!
• Incorporating online learning could be beneficial
• Image understanding and its aggregation with URL based features are promising
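
Since a low false positive rate (FPR) is singled out above, here is a minimal sketch of how FPR and the true positive rate (TPR) are computed from binary predictions. The label convention (1 = phishing, 0 = legitimate) and the toy vectors are assumptions for illustration.

```python
def detection_rates(y_true, y_pred):
    """Return (TPR, FPR) for binary labels where 1 = phishing, 0 = legitimate."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    tpr = tp / (tp + fn) if tp + fn else 0.0  # share of phishing pages caught
    fpr = fp / (fp + tn) if fp + tn else 0.0  # share of legitimate pages flagged
    return tpr, fpr


# Toy ground truth and predictions, invented for illustration.
truth = [1, 1, 0, 0, 0, 0]
predictions = [1, 0, 1, 0, 0, 0]

tpr, fpr = detection_rates(truth, predictions)
print(f"TPR={tpr:.2f} FPR={fpr:.2f}")
```

Because legitimate pages vastly outnumber phishing pages in real traffic, even a small FPR can mean many wrongly blocked sites.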

12.2.2020
THANKS FOR LISTENING
