AutoBLG: Automatic URL Blacklist Generator Using Search Space Expansion and Filters
Bo Sun (1), Mitsuaki Akiyama (2), Takeshi Yagi (2), Mitsuhiro Hatada (1), Tatsuya Mori (1)
(1) Waseda University
(2) NTT Secure Platform Laboratories
IEEE ISCC 2015

Background (1)
•  The estimated number of drive-by-download attacks is 4.3 million per day
[Figure: pie chart "The number of web-based attacks" with two segments, "other attacks" and "drive-by-download attack"; values shown: 7% and 93%]

Background (2)
•  What is a drive-by-download attack?
[Diagram: user clicks on a URL -> Landing page URL -> Exploit URL (exploits vulnerabilities) -> Malware download URL (downloads malware automatically)]

Background (3)
•  What is a URL blacklist?
[Diagram: user -> Landing page URL -> Exploit URL -> Malware download URL. The URL blacklist lists the landing page URL, the exploit URL, and the malware download URL; a request that matches an entry is blocked.]
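
The matching step on this slide can be sketched as a set lookup. This is a minimal illustration, not the deck's implementation: the `normalize` rule (lowercased host plus path) and the `example` hostnames are assumptions made for the sketch.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to 'host/path' so trivially different spellings match.
    This normalization rule is an assumption for the sketch."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    return host + path


class UrlBlacklist:
    def __init__(self, urls):
        self._entries = {normalize(u) for u in urls}

    def is_blocked(self, url):
        # Exact match against known entries only.
        return normalize(url) in self._entries


# Hypothetical entries standing in for landing, exploit, and download URLs.
blacklist = UrlBlacklist([
    "http://landing.example/index.html",
    "http://exploit.example/kit",
    "http://download.example/payload.exe",
])
```

A URL that is not yet listed passes the check, which is the limitation the next slide raises.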
Background (4)
•  However, a URL blacklist cannot cope with previously unseen malicious URLs
•  To keep a URL blacklist effective, it is crucial to keep its URLs up to date
=> We need to collect fresh malicious URLs

Background (5)
[Diagram: a web client honeypot scans the wild Internet, which holds 30 trillion unique URLs]

Goal
•  Our main objective is to accelerate the process of generating a URL blacklist automatically.
Idea
[Diagram: Input: existing malicious URLs -> search space expansion -> search space reduction by a filter (machine learning) -> Output: new malicious URLs]

AutoBLG Framework
•  Three primary components:
Ø  URL Expansion
Ø  URL Filtration
Ø  URL Verification
Image credits: http://www.itguyswa.com.au/free-antivirus-protection/, https://www.virustotal.com/ja/, http://www.soumu.go.jp/main_content/000174846.pdf

URL Expansion (1) Seed
[Pipeline: Seed -> Pre-processing -> Passive DNS Database -> Search Engine -> Web Crawler]
Example seed URLs:
http://2339XXX.net/main
http://auth.veXXXXX.com

URL Expansion (2) Pre-processing
[Pipeline: Seed -> Pre-processing -> Passive DNS Database -> Search Engine -> Web Crawler]
Example IP addresses:
11X.5X.1XX.XX4
2X.XXX.X99.X2

URL Expansion (3) Passive DNS Database
[Pipeline: Seed -> Pre-processing -> Passive DNS Database -> Search Engine -> Web Crawler]
Example domain names:
sediscoXXXXXX.gruXXX.com
vorXXXXXXX.zdjecXXXki.com

URL Expansion (4) Search Engine and Web Crawler
[Pipeline: Seed -> Pre-processing -> Passive DNS Database -> Search Engine -> Web Crawler]
Example URLs:
http://100XXXXXwebcam.bXXX.pl/island-XXX-wXX.html
http://100XXXXXwebcam.bXXX.pl/isteam-XXXX.html
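
The four expansion slides can be read as a chain: seed URLs to IP addresses, IP addresses to co-hosted domain names, domain names to candidate URLs. The sketch below follows that reading under stated assumptions. `resolve`, `passive_dns`, and `search` are hypothetical callables backed here by in-memory toy data, not real service APIs. The `top_n=50` default mirrors the "Top-50 search results" limitation listed later in the deck. Fetching each page's HTML (the web crawler step) is left out.

```python
from urllib.parse import urlsplit


def expand(seed_urls, resolve, passive_dns, search, top_n=50):
    """Expand seed URLs into candidate URLs (illustrative sketch only)."""
    # Pre-processing: seed URL -> host name -> IP addresses.
    ips = set()
    for url in seed_urls:
        host = urlsplit(url).hostname
        if host:
            ips.update(resolve(host))
    # Passive DNS: IP address -> domain names seen on that address.
    domains = set()
    for ip in sorted(ips):
        domains.update(passive_dns(ip))
    # Search engine: domain name -> up to top_n URLs, without duplicates.
    urls, seen = [], set()
    for domain in sorted(domains):
        for url in search(domain)[:top_n]:
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


# Toy data standing in for DNS, a passive DNS database, and a search index.
A_RECORDS = {"seed.example": ["192.0.2.10"]}
PDNS = {"192.0.2.10": ["seed.example", "neighbor.example"]}
INDEX = {
    "seed.example": ["http://seed.example/main"],
    "neighbor.example": ["http://neighbor.example/a.html",
                         "http://neighbor.example/b.html"],
}

candidates = expand(
    ["http://seed.example/main"],
    resolve=lambda host: A_RECORDS.get(host, []),
    passive_dns=lambda ip: PDNS.get(ip, []),
    search=lambda domain: INDEX.get(domain, []),
)
```

Passing the lookups in as callables keeps the sketch runnable offline; a real deployment would replace them with calls to whatever DNS, passive DNS, and search services are available.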
URL Filtration
[Diagram: HTML features are extracted from existing malicious URLs and from unknown URLs, then compared by similarity search (Bayesian sets)]
Image credit: http://www.primalsecurity.net/0xc-python-tutorial-python-malware/
URL Verification
•  Three tools for verification of drive-by-download attacks
Ø  Web client honeypot (Marionette)
Ø  Antivirus software
Ø  VirusTotal online service

Performance Evaluation
•  Number of URLs produced by URL Expansion: 59,394
•  Without URL Filtration: more than 100 hours
•  With URL Filtration: approximately 6 hours
=> Adopting a high-performance filter accelerates the process of generating blacklist URLs
Results (1)
Tool | Detected
Web client honeypot | 1.16%
Antivirus software | 3.8%
VirusTotal | 16.5%
•  Web client honeypot: definitely malicious
Ø  the pages contained redirections to exploit web pages
•  Antivirus software: highly suspicious
Ø  the pages contained several HTTP objects that were detected by the antivirus checkers (malicious JavaScript or executable malware)
•  VirusTotal: suspicious
Ø  further manual inspection is needed
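
One way to combine the three tools is to give each URL the strongest label among the tools that flagged it, which also removes duplicates across tools. The sketch below assumes that rule; the label priority follows the slide, but the `label_urls` helper and the input layout are hypothetical.

```python
# Tools in descending order of confidence, with the label from the slide.
LABELS = [
    ("honeypot", "definitely malicious"),
    ("antivirus", "highly suspicious"),
    ("virustotal", "suspicious"),
]


def label_urls(detections):
    """Map each detected URL to its strongest label.

    `detections` maps a tool name to the URLs that tool flagged
    (an assumed input layout for this sketch).
    """
    labels = {}
    for tool, label in LABELS:
        for url in detections.get(tool, ()):
            # setdefault keeps the label from the higher-priority tool.
            labels.setdefault(url, label)
    return labels


example = label_urls({
    "honeypot": ["http://a.example/"],
    "antivirus": ["http://a.example/", "http://b.example/"],
    "virustotal": ["http://b.example/", "http://c.example/"],
})
```

Because the result is keyed by URL, `len(example)` is the number of distinct flagged URLs even when several tools report the same one.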
Results (2)
•  Some URLs were identified by multiple tools
•  After eliminating duplicates, 106 of the 600 extracted URLs were detected as malicious or suspicious
•  Of these 106 URLs, seven were completely new URLs not yet listed in VirusTotal
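
The figures quoted on the evaluation and results slides can be checked with a few lines of arithmetic. The numbers are taken from the slides; treating "more than 100 hours" as exactly 100 is an assumption, so the speedup is a lower bound.

```python
expanded = 59_394     # URLs produced by URL Expansion
extracted = 600       # URLs kept by URL Filtration
flagged = 106         # URLs detected as malicious or suspicious
hours_without = 100   # "more than 100 hours", used as a lower bound
hours_with = 6        # "approximately 6 hours"

reduction = 1 - extracted / expanded   # share of URLs filtered out
hit_rate = flagged / extracted         # share of kept URLs that were flagged
speedup = hours_without / hours_with   # lower bound on the time saved

print(f"reduction: {reduction:.2%}")
print(f"hit rate:  {hit_rate:.2%}")
print(f"speedup:   at least {speedup:.1f}x")
```

The reduction comes out just under 99%, consistent with the figure given in the summary.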
  
Limitations and future work
Item | Limitation | Future work
Search engine | Only the top-50 search results are retrieved | Accelerate the web search engine process
Web crawler | Evaded by "cloaking techniques" | Develop more sophisticated tools
Query pattern | Misses several malicious URLs | Increase the number of query patterns
URL verification | Only two versions of browser or plug-in | Adopt a low-interaction honeypot
Online operation | Not fully online due to the URL Expansion part | Pipeline the URL expansion step
Summary
•  We have proposed the AutoBLG framework
Ø  lightweight
Ø  finds new and previously unknown drive-by-download URLs
Ø  finds other suspicious URLs that need further analysis
•  Key idea
Ø  the use of search space expansion and filters
•  We proposed a high-performance filter
Ø  it reduced the number of URLs to be investigated with the dynamic analysis systems by 99%
Ø  while still finding new URLs that have not been listed in a widely used URL reputation system

Thank you for listening
URL Filtration (1) Feature Extraction
HTML feature | Difference from previous works
The number of elements with a small area | Frameset tags: border, frameborder, framespacing
The number of suspicious words in the script's content | Some strings such as shellcode, shcode
The number of URLs with a different domain | Only count URLs with a different domain
The number of iframe and frame tags | same
The number of hidden elements | same
The number of meta refresh tags | same
The number of out-of-place elements | same
The number of embed and object tags | same
The presence of unescape behavior | same
The number of setTimeout functions | same
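
Most of the features in this table are simple counts over a page's HTML. The sketch below shows how several of them could be computed with the Python standard library. It is an assumed approach, not the deck's extractor: the `SMALL_AREA_MAX` threshold, the style-based test for hidden elements, and all feature names are hypothetical, and the "out-of-place elements" feature is omitted.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit

SUSPICIOUS_WORDS = ("shellcode", "shcode")  # strings named on the slide
SMALL_AREA_MAX = 30  # hypothetical pixel-area threshold, not from the source


class _FeatureParser(HTMLParser):
    def __init__(self, page_host):
        super().__init__()
        self.page_host = page_host
        self.in_script = False
        self.script_chunks = []
        self.counts = dict.fromkeys(
            ("frames", "hidden", "meta_refresh", "embed_object",
             "small_area", "external_urls"), 0)

    def handle_starttag(self, tag, attrs):
        a = {name: (value or "") for name, value in attrs}
        if tag in ("iframe", "frame"):
            self.counts["frames"] += 1
        if tag in ("embed", "object"):
            self.counts["embed_object"] += 1
        if tag == "meta" and a.get("http-equiv", "").lower() == "refresh":
            self.counts["meta_refresh"] += 1
        style = a.get("style", "").replace(" ", "").lower()
        if "display:none" in style or "visibility:hidden" in style:
            self.counts["hidden"] += 1
        width, height = a.get("width", ""), a.get("height", "")
        if (width.isdecimal() and height.isdecimal()
                and int(width) * int(height) <= SMALL_AREA_MAX):
            self.counts["small_area"] += 1
        # Count only links whose host differs from the page's own host.
        for name in ("src", "href"):
            host = urlsplit(a.get(name, "")).hostname
            if host and host != self.page_host:
                self.counts["external_urls"] += 1
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.script_chunks.append(data)


def extract_features(html, page_url):
    """Return a dict of HTML feature counts for one page."""
    parser = _FeatureParser(urlsplit(page_url).hostname)
    parser.feed(html)
    parser.close()
    script = "".join(parser.script_chunks)
    features = dict(parser.counts)
    features["suspicious_words"] = sum(
        script.lower().count(word) for word in SUSPICIOUS_WORDS)
    features["unescape"] = int("unescape(" in script)
    features["set_timeout"] = len(re.findall(r"\bsetTimeout\s*\(", script))
    return features
```

Calling `extract_features(html_text, "http://landing.example/index.html")` yields one feature vector per page; those vectors are what the similarity search on the next slide would compare.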
URL Filtration (2) Similarity Search
Similarity search: Bayesian Sets
•  Analogy: Google Sets
Ø  Query: Toyota, Nissan, Honda
Ø  Search space: the web
Ø  Output: BMW, Ford, Audi, Mitsubishi, Mazda, Volkswagen
•  AutoBLG
Ø  Query: several existing malicious URLs (malicious URLs created with the same exploit kit)
Ø  Search space: all unknown URLs
Ø  Output: the scores of all URLs in descending order; the higher the score, the more likely the URL is malicious
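
A minimal sketch of Bayesian Sets scoring on binary feature vectors follows, using the standard Beta-Bernoulli formulation of the algorithm. How AutoBLG binarizes its HTML features is not stated on the slide, so the 0/1 inputs, the prior scale `c`, and the `eps` clamp are assumptions made for illustration.

```python
import math


def bayesian_sets_scores(items, query_idx, c=2.0, eps=1e-6):
    """Log Bayesian Sets score of every item given the query items.

    items: list of equal-length 0/1 feature vectors.
    query_idx: indices of the query items (known malicious URLs).
    """
    n_items, n_feat, n_query = len(items), len(items[0]), len(query_idx)
    scores = [0.0] * n_items
    for j in range(n_feat):
        # Prior from the empirical feature mean, clamped away from 0 and 1.
        mean = sum(row[j] for row in items) / n_items
        mean = min(max(mean, eps), 1 - eps)
        alpha, beta = c * mean, c * (1 - mean)
        # Posterior parameters after observing the query set.
        on = sum(items[i][j] for i in query_idx)
        alpha_t, beta_t = alpha + on, beta + n_query - on
        const = (math.log(alpha + beta) - math.log(alpha + beta + n_query)
                 + math.log(beta_t) - math.log(beta))
        weight = (math.log(alpha_t) - math.log(alpha)
                  - math.log(beta_t) + math.log(beta))
        for i in range(n_items):
            scores[i] += const + weight * items[i][j]
    return scores


def rank_unknown(items, query_idx):
    """Indices of non-query items, highest score first."""
    scores = bayesian_sets_scores(items, query_idx)
    unknown = [i for i in range(len(items)) if i not in set(query_idx)]
    return sorted(unknown, key=lambda i: -scores[i])


# Toy vectors: items 0 and 1 are the query; item 2 shares their features.
toy = [[1, 1, 0], [1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1], [0, 1, 0]]
ranking = rank_unknown(toy, [0, 1])
```

The score is linear in the item's features, so ranking a large pool of unknown URLs costs one pass over the data once the per-feature weights are computed.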
  
Scope of the experiments
[Diagram labels: Preliminary Experiment, Performance Evaluation]
Component | Steps or tools
URL Expansion (steps) | Commercial blacklist, Pre-processing, Passive DNS database, Search engine, Web crawler
URL Filtration (steps) | Feature extraction, Similarity search
URL Verification (tools) | Web client honeypot, Antivirus software, VirusTotal

Preliminary Experiment
[Figure: number of malicious URLs found (y-axis, 0 to 3) versus top-K URLs (x-axis, log scale, 10^0 to 10^3), one curve each for Query Pattern 1 and Query Pattern 2]
•  Experiment data
Ø  Number of benign URLs: 10,000
Ø  Number of malicious URLs: 6
•  Experiment result
Ø  Each of the two query patterns identified three different malicious URLs within its top 300 scores; together they extracted all six malicious URLs
Ø  We therefore used the top 300 scores as the threshold for URL filtration
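
The thresholding described above can be sketched as taking the top-K URLs of each query pattern and merging the results. The `filter_candidates` helper and the score dictionaries are hypothetical; only the default `k=300` comes from the slide.

```python
def top_k(scores, k):
    """URLs with the k highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def filter_candidates(score_lists, k=300):
    """Union of the top-k URLs from every query pattern, without duplicates."""
    selected, seen = [], set()
    for scores in score_lists:
        for url in top_k(scores, k):
            if url not in seen:
                seen.add(url)
                selected.append(url)
    return selected


# Toy scores for two query patterns over the same three URLs.
pattern1 = {"u1": 0.9, "u2": 0.1, "u3": 0.5}
pattern2 = {"u1": 0.2, "u2": 0.8, "u3": 0.3}
kept = filter_candidates([pattern1, pattern2], k=1)
```

With two query patterns the merged list holds at most 2 x k URLs, and fewer when the patterns agree on some of them.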
  
