Malicious URL detection using machine learning

Beyond Blacklists: Malicious URL Detection Using Machine Learning

Who am I?
• Information Security Investigator at Cisco.
• Completed M.Tech at IIT Jodhpur in 2014.
• Areas of interest include machine learning, computer vision, and AI.
• Email: satyamiitj89@gmail.com

Malicious websites
Phishing: which one is real?

Problem in a Nutshell
• URL features to identify malicious websites
  • No context, no content
• Different classes of URLs
  • Benign, spam, phishing, exploits, scams...
  • For now, distinguish benign vs. malicious
Example: facebook.com vs. fblight.com

Information about new websites

State of the Practice
• Current approaches
  • Blacklists [SORBS, URIBL, SURBL, Spamhaus]
  • Learning on hand-tuned features [Garera et al., 2007]
• Limitations
  • Cannot predict unlisted sites
  • Cannot account for new features
• Arms race: fast feedback cycle is critical
• More automated approach?

URL Classification System
[Diagram: Label, Example, Hypothesis]

Data Sets
• Malicious URLs
  • 5,000 from PhishTank (phishing)
  • 15,000 from Spamscatter (spam, phishing, etc.)
• Benign URLs
  • 15,000 from Yahoo Web directory
  • 15,000 from DMOZ directory
• Malicious × Benign → 4 data sets
• 30,000 – 55,000 features per data set

Algorithms
• Logistic regression w/ L1-norm regularization
  • Implicit feature selection
  • Easier to interpret
• Other models
  • Naive Bayes
  • Support vector machines (linear, RBF kernels)
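
The "implicit feature selection" of L1-regularized logistic regression can be illustrated with a small sketch. The talk does not say which solver was used; the version below is a plain-Python proximal gradient loop, and the toy data, `lam`, `step`, and `iters` values are all assumptions chosen for illustration. The second feature is only weakly correlated with the label, and the L1 penalty drives its weight to zero.

```python
import math


def soft_threshold(value: float, amount: float) -> float:
    """Proximal operator of the L1 norm: shrink toward zero by `amount`."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_l1_logreg(X, y, lam=0.05, step=0.5, iters=500):
    """Minimise mean log-loss + lam * ||w||_1 by proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        # Gradient step on the smooth loss, then L1 shrinkage on the weights.
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad_w)]
        b -= step * grad_b  # the bias is not regularised
    return w, b


# Toy data (hypothetical): feature 0 determines the label,
# feature 1 agrees with it only three times out of four.
malicious = [[1, 1], [1, 1], [1, 1], [1, -1]]
benign = [[-a, -c] for a, c in malicious]
X = malicious + benign
y = [1] * len(malicious) + [0] * len(benign)

w, b = fit_l1_logreg(X, y)
print("weights:", w, "bias:", b)
```

A weight that ends at exactly zero means the feature can be dropped from the model, which is also what makes the result easier to inspect.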

Features to consider?
1) Blacklists
2) Simple heuristics
3) Domain name registration
4) Host properties
5) Lexical

(1) Blacklist Queries
• List of known malicious sites
• Providers: SORBS, URIBL, SURBL, Spamhaus
• Blacklist queries as features
  • http://www.bfuduuioo1fp.mobi → In blacklist? Yes
  • http://fblight.com → In blacklist? No
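
Turning blacklist queries into features can be sketched as one binary feature per provider. The real providers are queried over the network; here `BLACKLISTS` is a hypothetical in-memory snapshot whose contents are made up for illustration, and `blacklist_features` is an illustrative helper name.

```python
from urllib.parse import urlsplit

# Hypothetical local snapshots; which provider lists which host is invented.
BLACKLISTS = {
    "SORBS": {"www.bfuduuioo1fp.mobi"},
    "URIBL": {"www.bfuduuioo1fp.mobi"},
    "SURBL": set(),
    "Spamhaus": {"www.bfuduuioo1fp.mobi"},
}


def blacklist_features(url: str, blacklists=BLACKLISTS) -> dict:
    """Return one 0/1 feature per provider: is the URL's host listed?"""
    host = urlsplit(url).hostname or ""
    return {f"in_{name}": int(host in listed)
            for name, listed in blacklists.items()}


print(blacklist_features("http://www.bfuduuioo1fp.mobi"))
print(blacklist_features("http://fblight.com"))
```

Keeping the providers as separate features lets the classifier learn how much to trust each list, rather than treating any single listing as a verdict.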

(2) Manually-Selected Features
• Considered by previous studies
  • IP address in hostname?
  • Number of dots in URL
  • WHOIS (domain name) registration date
• Examples
  • http://72.23.5.122/www.bankofamerica.com/
  • http://www.bankofamerica.com.qytrpbcw.stopgap.cn/ (stopgap.cn registered 28 June 2009)
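
The first two manually selected features can be computed from the URL string alone. The sketch below is one possible implementation; `manual_features` is an illustrative name, and the WHOIS registration date is left out because it needs a network lookup.

```python
import ipaddress
from urllib.parse import urlsplit


def manual_features(url: str) -> dict:
    """Two hand-picked features: IP address as hostname, and dot count."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)  # raises ValueError for a domain name
        ip_in_hostname = 1
    except ValueError:
        ip_in_hostname = 0
    return {"ip_in_hostname": ip_in_hostname, "num_dots": url.count(".")}


for url in ("http://72.23.5.122/www.bankofamerica.com/",
            "http://www.bankofamerica.com.qytrpbcw.stopgap.cn/"):
    print(url, manual_features(url))
```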

(3) WHOIS Features
• Domain name registration
  • Date of registration, update, expiration
  • Registrant: Who registered the domain?
  • Registrar: Who manages the registration?
• Example: all registered on 29 June 2009 by SpamMedia
  • http://sleazysalmon.com
  • http://angryalbacore.com
  • http://mangymackerel.com
  • http://yammeringyellowtail.com
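
One way to turn a WHOIS record into classifier features is sketched below. The record layout (a dict with `registered`, `registrant`, and `registrar` keys), the registrar name, and the reference date are all assumptions for illustration; a real record would come from a WHOIS lookup.

```python
from datetime import date


def whois_features(record: dict, today: date) -> dict:
    """Domain age as a number, registrant and registrar as indicator features."""
    features = {"domain_age_days": (today - record["registered"]).days}
    features["registrant:" + record["registrant"]] = 1
    features["registrar:" + record["registrar"]] = 1
    return features


# Hypothetical record modelled on the slide's example.
record = {
    "registered": date(2009, 6, 29),
    "registrant": "SpamMedia",
    "registrar": "ExampleRegistrar",  # made-up name
}
print(whois_features(record, today=date(2009, 7, 6)))
```

Because every domain from the same registrant shares the `registrant:SpamMedia` feature, a weight learned from one of them carries over to the others.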

(4) Host-Based Features
• Blacklisted? (SORBS, URIBL, SURBL, Spamhaus)
• WHOIS: registrar, registrant, dates
• IP address: Which ASes/IP prefixes?
• DNS: TTL? PTR record exists/resolves?
• Geography-related: Locale? Connection speed?
Example IP prefixes: 75.102.60.0/22, 69.63.176.0/20
Example sites: facebook.com, fblight.com
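
The IP prefix question can be sketched as a membership test per known prefix, using Python's standard `ipaddress` module. The two prefixes are the ones shown on the slide; the host addresses passed in and the `prefix_features` name are hypothetical.

```python
import ipaddress

PREFIXES = [
    ipaddress.ip_network("75.102.60.0/22"),
    ipaddress.ip_network("69.63.176.0/20"),
]


def prefix_features(ip: str, prefixes=PREFIXES) -> dict:
    """One 0/1 feature per prefix: does the host's address fall inside it?"""
    addr = ipaddress.ip_address(ip)
    return {f"prefix:{net}": int(addr in net) for net in prefixes}


# Hypothetical resolved addresses, chosen to land in each prefix.
print(prefix_features("69.63.176.13"))
print(prefix_features("75.102.63.200"))
```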

(5) Lexical Features
• Tokens in URL hostname + path
• Length of URL
• Entropy of the domain name
Example: http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
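
The three lexical features can be sketched as follows. The tokenisation rule (split on non-alphanumeric characters), the `host:`/`path:` prefixes, and the choice to compute entropy over the whole hostname are assumptions; the slides do not specify them.

```python
import math
import re
from collections import Counter
from urllib.parse import urlsplit


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(text).values())


def lexical_features(url: str) -> dict:
    parts = urlsplit(url)
    host = parts.hostname or ""
    features = {"url_length": len(url), "host_entropy": shannon_entropy(host)}
    # Bag of words, with hostname and path tokens kept in separate namespaces.
    for tok in re.split(r"[^A-Za-z0-9]+", host):
        if tok:
            features["host:" + tok] = 1
    for tok in re.split(r"[^A-Za-z0-9]+", parts.path):
        if tok:
            features["path:" + tok] = 1
    return features


print(lexical_features("http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll"))
```

Every distinct token becomes its own feature, which is why the lexical set grows to thousands of dimensions on a real data set.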

Which feature sets?
Feature set             # Features
Blacklist               4
Manual                  3
WHOIS                   4,000
Host-based              13,000
Lexical                 17,000
Full                    30,000
w/o WHOIS/Blacklist     26,000

Beyond Blacklists
[Figure: Blacklist vs. Full features on the Yahoo-PhishTank data set]
Full features: higher detection rate for a given false positive rate
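
Comparing classifiers by "detection rate for a given false positive rate" can be sketched as below: pick the strictest threshold that flags no more than the allowed share of benign URLs, then measure how many malicious URLs score above it. The function name and the toy scores are hypothetical and are not results from the talk.

```python
def detection_rate_at_fpr(scores, labels, max_fpr):
    """Fraction of malicious URLs (label 1) flagged when at most
    `max_fpr` of the benign URLs (label 0) are flagged."""
    benign = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    malicious = [s for s, y in zip(scores, labels) if y == 1]
    allowed = int(max_fpr * len(benign))  # benign URLs we may flag
    # Flag only scores strictly above the first benign score we must spare.
    threshold = benign[allowed] if allowed < len(benign) else float("-inf")
    return sum(1 for s in malicious if s > threshold) / len(malicious)


# Toy classifier scores: higher means "more likely malicious".
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(detection_rate_at_fpr(scores, labels, max_fpr=0.0))
print(detection_rate_at_fpr(scores, labels, max_fpr=0.25))
```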

Limitations
• False positives
  • Sites hosted in disreputable ISP
  • Guilt by association
• False negatives
  • Compromised sites
  • Free hosting sites
  • Hosted in reputable ISP
• Future work: Web page content

Conclusion
• Detect malicious URLs with high accuracy
  • Only using URL
  • Diverse feature set helps: 86.5% w/ 18,000+ features
  • Proof of concept working in lab
• Future work
  • Scaling up for deployment

References
• Ma, Justin, et al. "Beyond blacklists: learning to detect malicious web sites from suspicious URLs." Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009.
