1. AutoBLG: Automatic URL Blacklist Generator Using Search Space Expansion and Filters
Bo Sun (1), Mitsuaki Akiyama (2), Takeshi Yagi (2), Mitsuhiro Hatada (1), Tatsuya Mori (1)
(1) Waseda University
(2) NTT Secure Platform Laboratories
IEEE ISCC 2015
2. Background(1)
• The estimated number of drive-by-download
attacks is 4.3 M per day
[Figure: pie chart, "The number of web-based attacks"; segment labels: 7%, 93%; legend: other attacks, drive-by-download attack]
3. Background(2)
• What is a drive-by-download attack?
[Figure: user -> (clicks on URL) -> Landing page URL -> Exploit URL (exploits vulnerabilities) -> Malware download URL (downloads malware automatically)]
4. Background(3)
• What is a URL blacklist?
[Figure: the URLs requested by the user (landing page URL, exploit URL, malware download URL) are matched against the URL blacklist, which lists landing page URLs, exploit URLs, and malware download URLs; a matching URL is blocked]
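The matching step on this slide can be illustrated with a minimal sketch. It assumes exact matching on a lightly normalized URL string; the class name `UrlBlacklist`, the `normalize` helper, and the `.example` hosts are hypothetical and are not taken from the talk.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Lowercase scheme and host and drop the fragment (illustrative only)."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


class UrlBlacklist:
    """Exact-match blacklist over normalized URLs (an assumed design)."""

    def __init__(self, urls):
        self._entries = {normalize(u) for u in urls}

    def is_blocked(self, url: str) -> bool:
        return normalize(url) in self._entries


# Placeholder entries, one per stage of a drive-by-download chain.
blacklist = UrlBlacklist([
    "http://landing.example/index.html",
    "http://exploit.example/kit.php?id=1",
    "http://malware.example/payload.exe",
])

listed = blacklist.is_blocked("HTTP://Landing.Example/index.html#top")
unseen = blacklist.is_blocked("http://fresh.example/new.html")
```

A URL that is not yet listed passes the check unblocked, which is why the freshness of the list matters.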
5. Background(4)
• However, a URL blacklist cannot cope with previously
unseen malicious URLs
• It is crucial to keep the listed URLs up to date for a URL
blacklist to remain effective
• This requires collecting fresh malicious URLs
7. Goal
• Our main objective is to accelerate the process of
generating a URL blacklist automatically.
Idea
• Input: existing malicious URLs
• Expansion: enlarge the search space
• Reduction: filter the search space (machine learning)
• Output: new malicious URLs
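The expand-then-reduce idea can be sketched as a small skeleton. The function `generate_candidates` and its `expand` and `score` callables are hypothetical stand-ins for the expansion sources and the machine-learning filter; the toy lambdas at the end only make the example runnable.

```python
from typing import Callable, Iterable, List


def generate_candidates(
    seeds: Iterable[str],
    expand: Callable[[str], Iterable[str]],
    score: Callable[[str], float],
    top_k: int,
) -> List[str]:
    """Expand the search space from seed URLs, then keep the top-k by score."""
    seed_set = set(seeds)
    candidates = set()
    for seed in seed_set:
        candidates.update(expand(seed))   # Expansion: grow the search space
    candidates -= seed_set                # already-known URLs are not "new"
    # Reduction: rank by filter score, break ties by URL for determinism
    ranked = sorted(candidates, key=lambda u: (-score(u), u))
    return ranked[:top_k]


# Toy stand-ins (assumptions): expansion appends suffixes, score is length.
result = generate_candidates(
    seeds=["x"],
    expand=lambda s: [s + "1", s + "22", s],
    score=len,
    top_k=1,
)
```

Only the candidates that survive the reduction step would be handed to the more expensive verification tools.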
11. URL Expansion(3)
Passive DNS Database
Example domains:
sediscoXXXXXX.gruXXX.com
vorXXXXXXX.zdjecXXXki.com
Pipeline: Seed -> Pre-processing -> Passive DNS Database -> Search Engine -> Web Crawler
12. URL Expansion(4)
Search Engine and Web Crawler
Example URLs:
http://100XXXXXwebcam.bXXX.pl/island-XXX-wXX.html
http://100XXXXXwebcam.bXXX.pl/isteam-XXXX.html
Pipeline: Seed -> Pre-processing -> Passive DNS Database -> Search Engine -> Web Crawler
13. URL Filtration
Image from http://www.primalsecurity.net/0xc-python-tutorial-python-malware/
• Inputs: existing malicious URLs and unknown URLs
• Similarity search over HTML features, using Bayesian Sets
14. URL Verification
• Three tools for verification of drive-by-download
attacks
Ø Web client honeypot (Marionette)
Ø Antivirus software
Ø VirusTotal online service
15. Performance Evaluation
• The number of URLs after URL Expansion: 59,394
• Without URL Filtration: more than 100 hours
• With URL Filtration: approximately 6 hours
• Adopting a high-performance filter accelerates the process of
generating blacklist URLs
16. Results(1)
[Figure: percentage per verification tool; labels: Web client honeypot, Antivirus software, VirusTotal; values: 1.16%, 3.8%, 16.5%]
• Web client honeypot: definitely malicious
Ø the URLs redirected to exploit web pages
• Antivirus software: highly suspicious
Ø the URLs contained several HTTP objects (malicious
JavaScript or executable malware) that were
detected by the antivirus checkers
• VirusTotal: suspicious
Ø further manual inspection is needed
17. Results(2)
• Some URLs are identified by multiple tools
• After eliminating duplicates, 106 of the 600 extracted URLs
were detected as malicious or suspicious
• Of the 106 discovered URLs, seven are completely new
URLs that had not been listed in VirusTotal
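The figures reported on the evaluation and results slides can be related with a few lines of arithmetic. This is a sketch that assumes the 600 extracted URLs are what the filter kept out of the 59,394 expanded URLs, and that treats "more than 100 hours" as a lower bound of 100; the variable names are illustrative.

```python
expanded_urls = 59_394      # URLs after URL Expansion
extracted_urls = 600        # URLs passed on to verification
detected_urls = 106         # malicious or suspicious after de-duplication
new_urls = 7                # not previously listed in VirusTotal
hours_without_filter = 100  # lower bound ("more than 100 hours")
hours_with_filter = 6       # approximate

# Share of expanded URLs that the filter removed before verification.
reduction = 1 - extracted_urls / expanded_urls
# Share of extracted URLs that at least one tool flagged.
hit_rate = detected_urls / extracted_urls
# Lower bound on the speed-up from filtering.
speedup = hours_without_filter / hours_with_filter

print(f"reduction: {reduction:.1%}")
print(f"hit rate:  {hit_rate:.1%}")
print(f"speed-up:  at least {speedup:.1f}x")
```

Under these assumptions the reduction comes out at roughly 99%, in line with the figure quoted in the summary.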
18. Limitation and future work
Item | Limitation | Future work
Search Engine | Only the top-50 search results are retrieved | Accelerate the web search engine process
Web Crawler | Evaded by ‘cloaking techniques’ | Develop more sophisticated tools
Query Pattern | Misses several malicious URLs | Increase the number of query patterns
URL Verification | Only two versions of browser or plug-in | Adopt a low-interaction honeypot
Online operation | Not fully online due to the URL Expansion part | Pipeline the URL expansion step
19. Summary
• We have proposed the AutoBLG framework
Ø it is lightweight
Ø it discovers new and previously unknown drive-by-download URLs
Ø it also finds other suspicious URLs that need further analysis
• Key idea
Ø the use of search space expansion and filters
• We proposed a high-performance filter
Ø it reduced the number of URLs to be investigated with
the dynamic analysis systems by 99%
Ø while successfully finding new URLs that had not
been listed in a widely used URL
reputation system
21. URL Filtration(1)Feature Extraction
HTML Feature | Difference from previous works
The number of elements with a small area | Frameset tags; border, frameborder, framespacing
The number of suspicious words in the script’s content | Strings such as shellcode, shcode
The number of URLs with a different domain | Only URLs with a different domain are counted
Same as previous works:
Ø The number of iframe and frame tags
Ø The number of hidden elements
Ø The number of meta refresh tags
Ø The number of out-of-place elements
Ø The number of embed and object tags
Ø The presence of unescape behavior
Ø The number of setTimeout functions
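Several of the listed features can be counted with the Python standard library alone, as in the rough sketch below. It is not the authors' extractor: the class `FeatureParser`, the feature names, and the hidden-element heuristic (inline `display:none` or `visibility:hidden`) are assumptions, and the small-area and out-of-place features are left out because they need layout information.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit

SUSPICIOUS_WORDS = ("shellcode", "shcode")  # strings named on the slide


class FeatureParser(HTMLParser):
    """Collects tag-level counts and the text of script elements."""

    def __init__(self, page_host):
        super().__init__()
        self.page_host = page_host.lower()
        self.counts = {"frames": 0, "meta_refresh": 0, "embed_object": 0,
                       "hidden": 0, "external_urls": 0}
        self._in_script = False
        self.script_text = []

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag in ("iframe", "frame"):
            self.counts["frames"] += 1
        elif tag in ("embed", "object"):
            self.counts["embed_object"] += 1
        elif tag == "meta" and a.get("http-equiv", "").lower() == "refresh":
            self.counts["meta_refresh"] += 1
        elif tag == "script":
            self._in_script = True
        # Assumed heuristic: an element is hidden if its inline style says so.
        style = a.get("style", "").replace(" ", "").lower()
        if "display:none" in style or "visibility:hidden" in style:
            self.counts["hidden"] += 1
        # Count only URLs whose host differs from the page's own host.
        for key in ("src", "href"):
            host = urlsplit(a.get(key, "")).hostname
            if host and host.lower() != self.page_host:
                self.counts["external_urls"] += 1

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.script_text.append(data)


def extract_features(html, page_host):
    parser = FeatureParser(page_host)
    parser.feed(html)
    parser.close()
    script = "".join(parser.script_text)
    features = dict(parser.counts)
    features["suspicious_words"] = sum(
        script.lower().count(w) for w in SUSPICIOUS_WORDS)
    features["set_timeout"] = len(re.findall(r"\bsetTimeout\s*\(", script))
    features["has_unescape"] = int(bool(re.search(r"\bunescape\s*\(", script)))
    return features


sample = """<html><head>
<meta http-equiv="refresh" content="0;url=http://other.example/">
</head><body>
<iframe src="http://evil.example/x" style="display: none"></iframe>
<script>var shellcode = unescape("%u9090"); setTimeout(function(){}, 10);</script>
</body></html>"""
features = extract_features(sample, "site.example")
```

The resulting dictionary is one feature vector per page; a similarity search would then operate on such vectors.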
22. URL Filtration(2)Similarity Search
Similarity Search: Bayesian Sets
• Analogy: Google Sets retrieves related items from the web space
(e.g. Toyota, Nissan, Honda, BMW, Ford, Audi, Mitsubishi, Mazda, Volkswagen)
• In AutoBLG, the search runs over all unknown URLs
• Query: several existing malicious URLs
(malicious URLs that were created with the same exploit kit)
• Output: the scores of all URLs in descending order;
the higher the score, the more likely the URL is malicious
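A minimal sketch of Bayesian Sets scoring follows, using the standard closed form for binary features with a Beta-Bernoulli prior. It assumes the HTML features have been binarized, which the slides do not state; the function names and the default prior strength `c=2.0` are likewise assumptions.

```python
import math


def bayesian_sets_scores(items, query_idx, c=2.0):
    """Return the log Bayesian Sets score of every item.

    items: list of equal-length binary feature vectors (0/1 values).
    query_idx: indices of the items that form the query (seed) set.
    c: prior strength scaling the empirical feature means (assumed default).
    """
    n_items = len(items)
    n_feat = len(items[0])
    n_query = len(query_idx)
    const = 0.0
    weights = [0.0] * n_feat
    for j in range(n_feat):
        mean = sum(x[j] for x in items) / n_items
        if mean <= 0.0 or mean >= 1.0:
            continue  # constant feature: prior is degenerate, so skip it
        alpha, beta = c * mean, c * (1.0 - mean)
        hits = sum(items[i][j] for i in query_idx)
        alpha_q = alpha + hits             # posterior after seeing the query
        beta_q = beta + n_query - hits
        const += (math.log(alpha + beta) - math.log(alpha + beta + n_query)
                  + math.log(beta_q) - math.log(beta))
        weights[j] = (math.log(alpha_q) - math.log(alpha)
                      - math.log(beta_q) + math.log(beta))
    # The log score is linear in the item's features.
    return [const + sum(w * x for w, x in zip(weights, item)) for item in items]


def rank_by_score(items, query_idx, c=2.0):
    """Item indices sorted by descending score (ties keep index order)."""
    scores = bayesian_sets_scores(items, query_idx, c)
    return sorted(range(len(items)), key=lambda i: -scores[i])


# Toy data: items 0 and 1 share features, items 2 and 3 share other features.
toy_items = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]]
toy_scores = bayesian_sets_scores(toy_items, query_idx=[0])
toy_ranking = rank_by_score(toy_items, query_idx=[0])
```

Because the log score is a linear function of the features, scoring a large pool of unknown URLs reduces to one weighted sum per URL, which is what makes this kind of filter cheap compared with dynamic analysis.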
23. The range of experiment
Experiments: Preliminary Experiment, Performance Evaluation
Steps in URL Expansion
• Commercial blacklist
• Pre-processing
• Passive DNS database
• Search Engine
• Web crawler
Steps in URL Filtration
• Feature Extraction
• Similarity Search
Tools in URL Verification
• Web Client Honeypot
• Antivirus Software
• VirusTotal
24. Preliminary Experiment
[Figure: the number of malicious URLs (0 to 3) found within the top-K URLs (axis ticks 10^0 to 10^3), for Query Pattern 1 and Query Pattern 2]
• Experiment Data
Ø The number of benign URLs: 10,000
Ø The number of malicious URLs: 6
• Experiment Result
Ø Each of the two query patterns identifies three different
malicious URLs within its top 300 scores; together they
extract all six malicious URLs
Ø We therefore took the top 300 scores as the
threshold for URL filtration.
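The top-300 threshold can be applied per query pattern and the selections merged, as in the sketch below. Taking the union of the per-pattern selections is an assumption about how the patterns are combined; `top_k` and `filter_candidates` are hypothetical names, and the toy scores are invented for illustration.

```python
def top_k(scores, k):
    """URLs with the k highest scores (ties broken by URL for determinism)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [url for url, _ in ranked[:k]]


def filter_candidates(scores_by_pattern, k=300):
    """Union of the top-k URLs selected under each query pattern."""
    selected = set()
    for scores in scores_by_pattern:
        selected.update(top_k(scores, k))
    return selected


# Toy scores for two query patterns over the same three URLs.
pattern1 = {"a": 3.0, "b": 2.0, "c": 1.0}
pattern2 = {"a": 0.5, "b": 0.1, "c": 0.9}
kept = filter_candidates([pattern1, pattern2], k=1)
```

With two query patterns and k = 300, this scheme passes at most 600 URLs on to verification.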