This document presents a literature review and proposed methodology for detecting malicious URLs. It discusses prior work on blacklisting, heuristic approaches, and machine learning techniques for malicious URL detection. The proposed methodology develops three categories of features - URL lexical features based on TF-IDF, source code features to identify obfuscated JavaScript, and network features like payload size. These features would be used to train a machine learning model like SVM to classify URLs as malicious or benign in real-time. The goal is to automatically detect new web attacks and force attackers to make tradeoffs to evade detection.
3. INTRODUCTION
Many rogue websites trick users into revealing
sensitive information, which leads to theft of money or
identity, or into installing malware on the user's system.
A URL (Uniform Resource Locator) is the global address
of documents (resources) on the World Wide Web.
A URL has two components:
protocol identifier
resource name (specifies the IP address or the
domain name where the resource is located)
4. INTRODUCTION
Types of attacks using malicious URLs include:
Drive-by Download
Phishing and Social Engineering
Spam
Drive-by Download is the unintentional download of
malware upon merely visiting a URL. These attacks are
carried out by exploiting vulnerabilities in plug-ins or by
inserting malicious code through JavaScript.
5. INTRODUCTION
Phishing and social engineering attacks trick the user
into revealing private information by pretending to be
genuine web pages.
Spam is the use of unsolicited messages for the
purpose of advertising or phishing.
6. LITERATURE SURVEY
Title of
Paper
Details of
Publication
Description
Malicious
URL
detection
using
Machine
Learning :
A survey
Doyen
Sahoo,Chengh
ao Liu &
Steven CH Hoi
August 2019
The authors presented a
survey on malicious URL
detection using machine
learning techniques. They
discussed the existing
studies for malicious URL
detection paritcularly in the
forms of developing new
feature representation &
designing new learning
algorithm
7. LITERATURE SURVEY
Title of Paper: Automatic Detection for JavaScript
Obfuscation Attacks in Web Pages through String
Pattern Analysis
Details of Publication: YoungHan Choi, TaeGhyoon Kim,
SeokJin Choi
Description: The authors present an analysis system to
detect lexical and string obfuscation in JavaScript
malware. They identify a set of 11 features that
characterize obfuscated code and use them to train a
machine learning classifier.
8. LITERATURE SURVEY
Title of Paper: Kopis: Detecting Malware Domains at the
Upper DNS Hierarchy
Details of Publication: Manos Antonakakis et al.
Description: The authors propose a novel detection
system called Kopis for detecting malware-related
domain names. Kopis passively monitors DNS traffic at
the upper levels of the DNS hierarchy.
9. PROBLEM DEFINITION
The current situation demands strong information
security, since many people have suffered from leakage
of personal information.
Detection of malicious URLs and identification of threat
types using machine learning are critical to thwart cyber
attacks like spamming, phishing and malware.
10. OBJECTIVES
The main objectives of our work are to:
Survey the evolving trends in malicious URL detection
Analyse the variety of detection techniques that have
changed over time
11. METHODOLOGY
The categories of strategies used for detecting
malicious URLs are:
Blacklists (and Heuristics)
Machine Learning
12. METHODOLOGY
Blacklisting or Heuristic Approaches: These
approaches maintain a list of URLs that are known to
be malicious. Whenever a new URL is visited, a
database lookup is performed. If the URL is present in
the blacklist, it is considered malicious and a warning
is generated; otherwise it is assumed to be benign.
Blacklisting suffers from the inability to maintain an
exhaustive list of all possible malicious URLs, as new
URLs are easily generated daily, making it impossible
for blacklists to detect new threats.
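The lookup described above amounts to a simple set-membership test. A minimal sketch (the blacklist entries here are hypothetical, not from any real feed) might look like:

```python
# Minimal sketch of a blacklist lookup. Entries are hypothetical examples.
BLACKLIST = {
    "malware-host.example/dropper.exe",
    "phish-login.example/account/verify",
}

def check_url(url: str) -> str:
    """Return 'malicious' if the URL is blacklisted, else assume 'benign'."""
    # Strip the scheme so the lookup is scheme-independent
    bare = url.split("://", 1)[-1]
    return "malicious" if bare in BLACKLIST else "benign"

print(check_url("http://malware-host.example/dropper.exe"))  # malicious
print(check_url("http://new-threat.example/payload"))        # benign (a miss: unseen URLs pass)
```

The second call illustrates the weakness discussed above: any URL not already in the list is silently assumed benign.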
13. METHODOLOGY
This is of particularly critical concern when attackers
generate new URLs algorithmically and can thus bypass
all blacklists. Despite these problems, blacklists, due to
their simplicity and efficiency, continue to be one of the
most commonly used techniques in many anti-virus
systems today.
14. METHODOLOGY
Heuristic approaches are an extension of blacklist
methods, wherein the idea is to create a blacklist of
signatures. Common attacks are identified and a
signature is assigned to each attack type. Intrusion
Detection Systems can scan web pages for such
signatures and raise a flag if suspicious behaviour is
found. These methods have better generalization
capabilities than blacklisting, as they can detect threats
in new URLs as well. However, such methods can be
designed for only a limited number of common threats
and cannot generalize to all types of (novel) attacks.
Moreover, using obfuscation techniques, it is not
difficult to bypass them.
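The signature scanning described above can be sketched as pattern matching over page source. The two signatures below are hypothetical toy rules; real IDS rule sets are far richer:

```python
import re

# Hypothetical toy signatures; real IDS rules are much more elaborate.
SIGNATURES = {
    "iframe-injection": re.compile(r"<iframe[^>]*\bvisibility\s*:\s*hidden", re.I),
    "eval-unescape":    re.compile(r"eval\s*\(\s*unescape\s*\(", re.I),
}

def scan_page(html: str) -> list[str]:
    """Return the names of all signatures matched in the page source."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(html)]

page = '<html><iframe style="visibility:hidden" src="x"></iframe></html>'
print(scan_page(page))  # ['iframe-injection']
```

As the slide notes, trivial obfuscation (e.g. building the iframe tag from string fragments at runtime) defeats such fixed patterns.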
15. METHODOLOGY
A more specific version of heuristic approaches is the
analysis of the execution dynamics of the webpage.
Here too, the idea is to look for signatures of malicious
activity, such as unusual process creation, repeated
redirection, etc. These methods require visiting the
webpage, so the URL can actually launch an attack. As
a result, such techniques are often implemented in a
controlled environment like a disposable virtual
machine. They are very resource intensive and require
complete execution of the code. Another drawback is
that a website may not launch an attack immediately
after being visited and may thus go undetected.
16. METHODOLOGY
Machine Learning Approaches: These analyze the
information of a URL and its corresponding website by
extracting good feature representations of URLs and
training a prediction model on training data of both
malicious and benign URLs. There are two types of
features: static features and dynamic features. In static
analysis we analyze the webpage based on information
available without executing the URL (i.e., without
executing JavaScript or other code). The features
extracted include lexical features from the URL string,
information about the host, and sometimes even the
HTML and JavaScript content. Since no execution is
required, these methods are safer than dynamic
methods.
17. METHODOLOGY
The underlying assumption is that the distribution of
these features differs between malicious and benign
URLs. Using this distribution information, a prediction
model can be built which makes predictions on new
URLs. Due to the relatively safer environment for
extracting important information, and the ability to
generalize to all types of threats, static analysis
techniques have been extensively explored with
machine learning. Dynamic analysis techniques include
monitoring the behaviour of systems that are potential
victims to look for anomalies, for example by monitoring
system call sequences for abnormal behaviour.
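The static machine-learning pipeline described above can be sketched roughly as follows, assuming scikit-learn is available. The URLs and labels are toy, hand-labeled placeholders, not real data:

```python
# Rough sketch of the static ML approach: tokenize URLs, weight tokens
# with TF-IDF, and fit a linear SVM. All URLs/labels are illustrative toys.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

urls = [
    "login-verify.example/account/update",  # toy malicious
    "free-prize.example/win/now",           # toy malicious
    "en.wikipedia.org/wiki/URL",            # toy benign
    "docs.python.org/3/library/re.html",    # toy benign
]
labels = [1, 1, 0, 0]  # 1 = malicious, 0 = benign

# Tokens are runs of alphanumerics, so domain and path both contribute.
model = make_pipeline(
    TfidfVectorizer(token_pattern=r"[A-Za-z0-9]+"),
    LinearSVC(),
)
model.fit(urls, labels)
print(model.predict(["secure-login-verify.example/update"]))
```

A real system would of course train on a large labeled corpus and combine lexical features with the source-code and network features developed below.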
18. METHODOLOGY
FEATURES: We develop 3 different categories of
features to detect malicious URLs.
1. URL lexical features:
We approach the URL as an NLP problem. We use
term frequency-inverse document frequency (tf-idf)
to weigh the importance of a token in the URL as a way
to associate URL tokens with labels. Tokens include
anything in the URL, including both the domain and the
path. tf-idf can be defined as
tf-idf(t, d, D) = tf(t, d) * idf(t, D), where
tf(t, d) = f(t, d) / max{f(w, d) : w ∈ d}
idf(t, D) = log(|D| / |{d ∈ D : t ∈ d}|)
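The two formulas above can be computed directly. This sketch tokenizes URLs on non-alphanumeric separators and scores one token; the example URLs are hypothetical:

```python
# Direct implementation of the tf and idf definitions above:
#   tf(t, d)  = f(t, d) / max{f(w, d) : w in d}
#   idf(t, D) = log(|D| / |{d in D : t in d}|)
import math
import re
from collections import Counter

def tokenize(url: str) -> list[str]:
    # Split a URL into tokens on any non-alphanumeric separator
    return re.findall(r"[A-Za-z0-9]+", url.lower())

def tf(t: str, d: list[str]) -> float:
    counts = Counter(d)
    return counts[t] / max(counts.values())

def idf(t: str, D: list[list[str]]) -> float:
    containing = sum(1 for d in D if t in d)
    return math.log(len(D) / containing)

# Toy corpus of tokenized URLs (hypothetical examples)
corpus = [tokenize(u) for u in (
    "paypal-login.example/verify",
    "example.com/login",
    "news.example.org/article",
)]
doc = corpus[0]
print(round(tf("login", doc) * idf("login", corpus), 3))  # 0.405
```

Here "login" appears in two of the three documents, so its idf is log(3/2); every token in the first URL occurs once, so its tf is 1.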
19. METHODOLOGY
We also exploit the hierarchical nature of the
subdomains by splitting along each separator and
saving a bigram consisting of any subdomain plus the
top-level domain. We hope to catch phishing
patterns or other suspicious URLs in the process.
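The subdomain-plus-TLD bigram idea above can be sketched in a few lines; the hostname used here is a hypothetical phishing-style example:

```python
# Sketch of the subdomain-bigram feature: pair each label in the
# hostname with the top-level domain. Hostname is illustrative only.
def subdomain_bigrams(host: str) -> list[tuple[str, str]]:
    labels = host.split(".")
    tld = labels[-1]
    # Pair every label except the TLD itself with the TLD
    return [(label, tld) for label in labels[:-1]]

print(subdomain_bigrams("secure.paypal-login.example.com"))
# [('secure', 'com'), ('paypal-login', 'com'), ('example', 'com')]
```

A bigram like ('paypal-login', 'com') on a non-PayPal domain is exactly the kind of phishing pattern this feature hopes to surface.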
2. Source code features: JavaScript exploits are
typically obfuscated to prevent detection by automated
or manual analysis. Here is an example of one
exploitative script we found in our malicious sample
21. METHODOLOGY
Thus, we can use the ratio of special character
subsequences (non-English for "en" websites) to script
length.
In addition, attackers who choose to reconstruct
functions before calling them require the use of special
functions, such as fromCharCode, eval, document.write,
escape, etc. They can also include the malicious code in
an iframe. We count these keywords and use them as
one feature.
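Both source-code features above, the keyword count and the special-character ratio, can be sketched as follows. The keyword list mirrors the slide; the sample script is a harmless stand-in, not real malware:

```python
# Sketch of the two source-code features: a count of suspicious
# keywords (simple substring count) and the ratio of special
# characters to script length. Sample script is a harmless stand-in.
import re

SUSPICIOUS = ("fromCharCode", "eval", "document.write", "escape", "iframe")

def keyword_count(script: str) -> int:
    return sum(script.count(k) for k in SUSPICIOUS)

def special_char_ratio(script: str) -> float:
    # Fraction of characters outside letters, digits, and whitespace
    specials = len(re.findall(r"[^A-Za-z0-9\s]", script))
    return specials / max(len(script), 1)

sample = 'eval(unescape("%68%65%6C%6C%6F"))'
print(keyword_count(sample))                 # 2 ("eval", "escape" inside "unescape")
print(round(special_char_ratio(sample), 2))  # 0.33
```

The high special-character ratio here comes from the percent-encoded payload, the pattern the ratio feature is meant to flag.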
22. METHODOLOGY
3. Network features:
Although we have explored a variety of network features,
including latency, DNS query data, domain registry
data, and payload size, we have only captured
payload size for our tests. Executables can be arbitrarily
long, and obfuscated scripts may add to payload size as
well.
23. METHODOLOGY
Attacker strategy:
The growing threat to mobile web users could be
mitigated by automatic URL detection. By using a
trained SVM, one could check URLs fast enough to
deploy in a real-time service.
This means users can use a preemptive service without
impacting their mobile experience. As the old saying
goes, an ounce of prevention is worth a pound of cure,
but only if the solution is palatable. Attackers may
certainly make tradeoffs to outwit the features we have
selected. However, such evasion isn't free. For example,
using more legitimate-sounding URLs in phishing
attempts may bypass suspicious
24. METHODOLOGY
bigram detection, but may result in fewer click-throughs
by scrupulous users. Or, reducing special character
code sequences in obfuscation may work, but only by
increasing script size or by using less obfuscation and
risking detection by malicious code pattern detectors.
Our hope is that by adding the appropriate features, a
machine learning based system would be able to force
attackers to make tradeoffs in web-based attacks.
25. CONCLUSION
By using a trained SVM, it is possible to provide a
real-time service to check for malware URLs, regardless
of the browsing device used. In general, using a
machine learning approach to discover malicious URLs
and web attackers is a potentially significant approach,
especially when considering the scale at which
machines themselves have been used to automatically
generate, obfuscate, or permute attacks. We hope to
see more research put forward in this endeavor to
further reduce the space of feasible attacks.
26. REFERENCES
[1] Antonakakis, Manos, et al. "Kopis: Detecting Malware
Domains at the Upper DNS Hierarchy."
http://static.usenix.org/events/sec11/tech/slides/antonakakis.pdf
[2] Choi, YoungHan, TaeGhyoon Kim, SeokJin Choi.
"Automatic Detection for JavaScript Obfuscation Attacks
in Web Pages through String Pattern Analysis."
http://www.sersc.org/journals/IJSIA/vol4 no2 2010/2.pdf
[3] Doyen Sahoo, Chenghao Liu, Steven C.H. Hoi.
"Malicious URL Detection Using Machine Learning: A
Survey." August 2019, 37 pages.