Shady Paths: Leveraging Surfing Crowds to Detect Malicious Web Pages
1. Shady Paths: Leveraging Surfing Crowds to Detect Malicious Web Pages
Gianluca Stringhini, Christopher Kruegel, and Giovanni Vigna
University of California, Santa Barbara
2. The Web is a Dangerous Place
• Drive-by downloads
• Social engineering
3. Current Detection Techniques
Static Analysis
• Look for suspicious elements in URLs, JavaScript, and Flash
• Evaded by obfuscation
Dynamic Analysis
• Visit the web page (honeyclients) and look for signs of exploitation
• Evaded by cloaking
Both can only detect attacks that exploit vulnerabilities!
5. Redirection Graphs
By analyzing the characteristics of the set of visitors and of the redirection graph, we can determine whether the destination page is malicious
• No need to analyze the final page!
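The redirection graph idea above can be sketched in a few lines: merge the redirection chains observed for many users into one graph per final page. This is a minimal, assumed construction for illustration; the node/edge representation is not the paper's exact data structure.

```python
from collections import defaultdict

def build_redirection_graph(chains):
    """Aggregate the redirection chains of many users into one graph.

    `chains` is a list of URL sequences, one per user, all ending at the
    same final page. Nodes are URLs; a directed edge (a, b) means some
    user was redirected from a to b. (Illustrative sketch only.)
    """
    edges = defaultdict(set)
    nodes = set()
    for chain in chains:
        nodes.update(chain)
        for src, dst in zip(chain, chain[1:]):
            edges[src].add(dst)
    return nodes, edges

# Two users reach the same final page through different entry points
# but the same intermediate redirector (all URLs are made up).
chains = [
    ["http://blog.example/post", "http://redir.example/r", "http://evil.example/"],
    ["http://forum.example/t1", "http://redir.example/r", "http://evil.example/"],
]
nodes, edges = build_redirection_graph(chains)
```

Aggregating many users' chains is what makes the shared intermediate node (a "hub") visible, even though no single chain reveals it.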
6. Legitimate Uses of Redirections
• Inform that a web page has moved
• Login functionality
• Advertisements
We cannot flag all redirections as malicious
Luckily, malicious redirection graphs look different
12. Our System: SpiderWeb
We leverage the differences between legitimate and malicious redirection graphs for detection
Three components:
• Data collection
• Creation of redirection graphs
• Classification component
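The three components listed above can be read as one pipeline. The following is a minimal sketch under stated assumptions: chains arrive as (user, URL-sequence) pairs and a trained classifier is supplied; all names and interfaces are illustrative, not SpiderWeb's actual ones.

```python
from collections import defaultdict

def spiderweb_pipeline(raw_events, classify):
    """Run the three stages end to end: collect navigation data, build one
    redirection graph (here simplified to an edge set) per final URL, and
    classify each graph with the supplied model."""
    # 1. Data collection: group observed chains by the URL they land on.
    by_final = defaultdict(list)
    for _user, chain in raw_events:
        by_final[chain[-1]].append(chain)
    # 2. Graph creation: merge each group's chains into directed edges.
    graphs = {
        final_url: {(a, b) for c in chains for a, b in zip(c, c[1:])}
        for final_url, chains in by_final.items()
    }
    # 3. Classification: label every graph with the supplied model.
    return {url: classify(edges) for url, edges in graphs.items()}
```

The point of the structure is that the classifier only ever sees the aggregated graph, never the content of the final page.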
13. Data Collection
SpiderWeb needs navigation data from a diverse population of users
Dataset obtained from a large AV vendor:
• Users of a browser security tool
• Data collection was opt-in only
• Data was anonymized
15. Classification Component
Five categories of features:
• Client features (3 features)
• Referrer features (4 features)
• Landing page features (4 features)
• Final page features (5 features)
These four categories capture how diverse these elements are: distinct URLs, parameters, TLDs, whether the domain is an IP address
• Redirection graph features (12 features)
Length of chains, same country across referrer and final page, intra-domain redirections, hubs
We use Support Vector Machines for classification
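As a concrete illustration of the diversity-style features above, here is a hedged sketch that computes a few of them over a graph's referrer URLs. The exact feature definitions are assumptions, not the paper's; in SpiderWeb the resulting vectors would then feed the SVM classifier.

```python
from urllib.parse import urlparse
import ipaddress

def referrer_features(referrer_urls):
    """Compute a few diversity features over the referrer URLs of one
    redirection graph (illustrative definitions, not the paper's)."""
    parsed = [urlparse(u) for u in referrer_urls]

    def is_ip(host):
        # True if the host part is a literal IP address.
        try:
            ipaddress.ip_address(host)
            return True
        except ValueError:
            return False

    return {
        # How many distinct referrer URLs point into this graph.
        "distinct_urls": len(set(referrer_urls)),
        # Diversity of query strings across referrers.
        "distinct_params": len({p.query for p in parsed}),
        # Diversity of top-level domains.
        "distinct_tlds": len({p.hostname.rsplit(".", 1)[-1]
                              for p in parsed if p.hostname}),
        # Whether any referrer domain is a bare IP address.
        "any_ip_domain": any(is_ip(p.hostname) for p in parsed if p.hostname),
    }
```

Intuitively, malicious campaigns pull visitors from many unrelated places, so high diversity in these fields is a useful signal.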
17. Evaluation Dataset
388,098 redirection chains, collected over two months
• 34,011 final URLs
• 13,780 distinct user IP addresses per week
• 145 countries
Labeled dataset for training:
• 2,533 redirection chains leading to 1,854 malicious URLs
• 2,466 redirection chains leading to 510 legitimate URLs
18. Analysis of the Classifier
SpiderWeb’s performance depends on the complexity of the redirection graph
• Complexity ≥ 6 yields no false positives and no false negatives
• Our dataset is limited → we discard graphs with complexity < 4
We must accept a certain amount of FPs and FNs
• Full URL grouping: 1.2% FP rate, 17% FN rate
Redirection-graph-specific features are the most important: without them, the FN rate rises to 67%
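The complexity threshold above can be sketched as a simple filter. Note that the slides do not define the complexity metric; the node-count proxy below is purely an assumption for illustration.

```python
def graph_complexity(edges):
    """Toy proxy for the "complexity" score: the number of distinct
    nodes in the redirection graph. SpiderWeb's actual metric is not
    specified on this slide; this definition is an assumption."""
    nodes = {n for e in edges for n in e}
    return len(nodes)

def filter_graphs(graphs, min_complexity=4):
    """Discard graphs below the threshold, mirroring the slide's rule
    of dropping graphs with complexity < 4."""
    return {url: e for url, e in graphs.items()
            if graph_complexity(e) >= min_complexity}
```

Whatever the exact metric, the trade-off is the same: a higher threshold removes ambiguous small graphs and lowers error rates, at the cost of leaving those graphs unclassified.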
19. Detection in the Wild
3,549 redirection graphs with complexity ≥ 4
564 flagged as malicious → 3,368 URLs
778 URLs undetected by the AV vendor
• We could not confirm 1.5% of them
• Effectively complements state of the art
20. Comparison with Previous Work
A few previous systems leverage redirection information to detect malicious web pages
These systems also use other types of information:
• WarningBird: uses Twitter profile information
• SURF: SEO-specific
If this additional information is not present, SpiderWeb outperforms previous systems
21. Possible Use Cases
Offline detection (blacklist)
Online detection
• Users get infected until the required “complexity” is reached
We performed a chronological experiment: SpiderWeb would have protected 93% of users
22. Discussion
Limitations
• Graphs with high complexity are required
• Groupings are not perfect
• Attackers might redirect users to legitimate pages
Attackers might make their redirections look legitimate
• Stop using cloaking (makes them easier to detect by previous systems)
• Stop using hubs (raises the bar for attackers)
23. Conclusions
• We showed that malicious and legitimate redirection graphs differ
• We presented a system that analyzes redirection graphs to detect malicious web pages
• We showed that our system is effective and complements existing systems