Real-Time Detection of Malware
Downloads via Large-Scale
URL→File→Machine Graph Mining
Babak Rahbarinia ; Marco Balduzzi ; Roberto Perdisci
AsiaCCS 2016, June 02, Xi’an, China
1
Introduction
Traditional AV is dead?
Signature-based VS. Statistical-based
Traditional AVs' ineffectiveness (they don’t work!)
  polymorphism, code obfuscation, packers, ...
URL blacklisting
  static, lags behind
  time-consuming analysis of individual URLs
Local VS. Global
  Local: looks at one potential malware at a time
  Global: leverages global situational awareness
2
Introduction
Large-scale analysis of behavioral patterns
  “Who - where - what” relationship
  Global situational awareness
Graph-based machine learning
  Combination of system- and network-level info
Mastino:
  Real-time and concurrent detection of download events
  Real-world deployment on millions of machines (Internet-scale)
3
Approach
4
Approach
5
(Quick) Related Work
Static+dynamic detection [Many]
Graph mining detection: Polonium [KDD10]
  Offline approach VS real-time
  Only files classification VS + URLs (download event)
  Bipartite VS tripartite graph
  Proprietary reputation function VS open
AMICO [Esorics13]
  HTTP-centric VS protocol-independent
  Only works in LANs VS “move across networks”
Google’s CAMP [NDSS13]
  Browser-centric VS system-centric
6
Download Graph
URLs
Files Machines
7
Annotations
URLs
Files Machines
● Age of URL, domain, path, IP
● Size
● Lifetime, prevalence
● Packed, signed
● Download behavior
● Client processes
8
Labeling
URLs
● B (benign): Alexa (-hosting)
● M (malicious): GSB + WRS
Files
● B (benign): Grid + VT
● M (malicious): VT
Machines
Machines’ reputations based on their download/activity history
9
Features and classifier
Files Features
f behavior-based features = {URL stats, machine stats}
  URL stats: over f's neighboring URLs (url1 ... url4), compute min, max, med, avg, and std of
    URL’s R + R of [FQD, e2LD, path, path pattern, query string, query pattern]
  Machine stats: over f's neighboring machines (machine1 ... machine3), compute min, max, med, avg, and std of
    Machine’s R
+
f intrinsic features = {file size, prevalence, packed, signed, ...}
URLs Features
u behavior-based features = {files stats, machine stats}, computed over u + {all URLs sharing a component with u}
  Files stats: over the neighboring files (file1 ... file4), compute min, max, med, avg, and std of
    File’s R
  Machine stats: over the neighboring machines (machine1 ... machine3), compute min, max, med, avg, and std of
    Machine’s R
+
u intrinsic features = {URL, FQD, e2LD recency}
13
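The behavior-based features above summarize the R values of a node's neighbors with min, max, med, avg, and std. A minimal sketch of that aggregation follows. The function names, the sample scores, the all-zero fallback for nodes with no scored neighbors, and the use of the population standard deviation are assumptions made for illustration, not details taken from the slides.

```python
from statistics import mean, median, pstdev

def reputation_stats(reputations):
    """Summarize neighbor R values as [min, max, med, avg, std]."""
    if not reputations:
        # Assumption: a node with no scored neighbors gets all-zero statistics.
        return [0.0] * 5
    return [
        min(reputations),
        max(reputations),
        median(reputations),
        mean(reputations),
        pstdev(reputations),  # assumption: population std; the slides only say "std"
    ]

def file_behavior_features(url_reputations, machine_reputations):
    """Concatenate URL stats and machine stats into one feature vector for a file."""
    return reputation_stats(url_reputations) + reputation_stats(machine_reputations)

# Hypothetical R values for the URLs and machines adjacent to one file.
features = file_behavior_features([0.9, 0.7, 0.8], [0.2, 0.4])
print(features)
```

The same helper could be reused for the URL side by passing file and machine R values instead.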
Example #1
[Figure: URLs / Files / Machines graph with nodes U1, U2, F1, F2, F3, G1, G2]
What could be said about F1 and F2?
14-16
Example #2
[Figure: URLs / Files / Machines graph around URL u and file F1]
What could be said about F1?
All neighbors are unknown
Consider all URLs that share the same components as u (e.g., FQD, Path)
All URL components:
* FQD
* e2LD
* Path
* Path pattern
* Query string
* Query string pattern
* IP
* IP/24
17-20
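Splitting a URL into components and checking whether two URLs share one can be sketched as follows. The path-pattern rule is inferred from the single example given later in the deck (/f/1392240240/1255385580/2 -> /H1/I10/I10/I1): digit runs become I<length> and hexadecimal strings become H<length>. Keeping other segments verbatim is an assumption, as are the function names and the example.invalid hosts. e2LD, query pattern, IP, and IP/24 are left out because they need a public-suffix list or DNS data.

```python
import string
from urllib.parse import urlsplit

def segment_pattern(segment):
    """Map one path segment to a type+length token, e.g. '2' -> 'I1'."""
    if segment.isdigit():
        return f"I{len(segment)}"  # decimal integer of that length
    if segment and all(c in string.hexdigits for c in segment):
        return f"H{len(segment)}"  # hexadecimal string of that length
    return segment  # assumption: any other segment is kept verbatim

def path_pattern(path):
    """Generalize a URL path segment by segment."""
    return "/".join(segment_pattern(s) for s in path.split("/"))

def url_components(url):
    """Extract a subset of the components listed on the slide."""
    parts = urlsplit(url)
    return {
        "fqd": parts.hostname,
        "path": parts.path,
        "path_pattern": path_pattern(parts.path),
        "query": parts.query,
    }

def share_component(url_a, url_b):
    """True if the two URLs agree on at least one non-empty component."""
    a, b = url_components(url_a), url_components(url_b)
    return any(a[key] and a[key] == b[key] for key in a)

# Hypothetical hosts; the paths are the two from the deck's case study.
u1 = "http://one.example.invalid/f/1392240240/1255385580/2"
u2 = "http://two.example.invalid/f/1392240120/4165299987/2"
print(path_pattern(urlsplit(u1).path), share_component(u1, u2))
```

One limitation of this rule is that short words made only of the letters a to f would also be treated as hexadecimal.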
Deployment
[Figure: timeline Day 1, Day 2, ..., Yesterday, Today, with a time window of 10 days]
Trained classifiers: URL classifier, SHA1 classifier
Real-time classification of URLs & SHA1s
Detection of malicious download events
21-22
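The 10-day time window can be illustrated by selecting the download events that fall in the days leading up to today. This is only a sketch: the tuple layout, the function name, and the choice to end the window at yesterday (so that today's events are the ones being classified) are assumptions based on the timeline figure.

```python
from datetime import date, timedelta

WINDOW_DAYS = 10  # "Time Window of 10 days" from the slide

def events_in_window(events, today, window_days=WINDOW_DAYS):
    """Return events observed during the window_days days before `today`.

    Each event is a (day, url, file_sha1, machine) tuple (hypothetical layout).
    Assumption: the window ends at yesterday; today's events are excluded.
    """
    start = today - timedelta(days=window_days)
    return [event for event in events if start <= event[0] < today]

# Hypothetical events around an arbitrary date.
today = date(2014, 3, 11)
events = [
    (date(2014, 2, 28), "http://a.example.invalid/x", "sha1-aaa", "machine-1"),
    (date(2014, 3, 1), "http://a.example.invalid/x", "sha1-aaa", "machine-2"),
    (date(2014, 3, 10), "http://b.example.invalid/y", "sha1-bbb", "machine-2"),
    (date(2014, 3, 11), "http://c.example.invalid/z", "sha1-ccc", "machine-3"),
]
window = events_in_window(events, today)
print(len(window))
```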
Data Collection
7 months of data (Jan to Aug 2014)
Download event d = (u, f, m): URL, file, machine
Hundreds of thousands of machines, files, URLs
Millions of nodes
Labeling:
  Files: VirusTotal, GRID [Trend]
  URLs: Alexa, Google Safe Browsing, WRS [Trend]
Annotations:
  File census and GUID census [Trend]
  VirusTotal (signed, ...)
23
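Treating each download event as a (URL, file, machine) triple, the URL→File→Machine graph can be held in a few adjacency maps. The class below is a minimal sketch under that reading; the class name, attribute names, and sample identifiers are hypothetical and not taken from the deck.

```python
from collections import defaultdict

class DownloadGraph:
    """Tripartite graph with URL-file edges and file-machine edges."""

    def __init__(self):
        self.files_by_url = defaultdict(set)
        self.urls_by_file = defaultdict(set)
        self.machines_by_file = defaultdict(set)
        self.files_by_machine = defaultdict(set)

    def add_event(self, url, file_hash, machine_id):
        """Record one download event d = (u, f, m)."""
        self.files_by_url[url].add(file_hash)
        self.urls_by_file[file_hash].add(url)
        self.machines_by_file[file_hash].add(machine_id)
        self.files_by_machine[machine_id].add(file_hash)

# Hypothetical events: the same file served from two URLs to two machines.
events = [
    ("http://a.example.invalid/x.exe", "sha1-aaa", "machine-1"),
    ("http://b.example.invalid/y.exe", "sha1-aaa", "machine-2"),
    ("http://b.example.invalid/y.exe", "sha1-bbb", "machine-2"),
]
graph = DownloadGraph()
for u, f, m in events:
    graph.add_event(u, f, m)
print(sorted(graph.urls_by_file["sha1-aaa"]))
```

Keeping both directions of each edge makes it cheap to look up a file's URLs and machines, which is what neighbor-based features need.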
Train & test for new download events
Detection results on new download events over 7 periods of 5 days (35 days total)
[Figures: detection results for files and for URLs]
24
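Combined detection of download events
(u = m) ∨ (f = m) → d = m
  (here m = malicious: a download event d is flagged if its URL u or its file f is classified as malicious)
1 day experiment (5 months)
Efficiency: requests are served in ~0.16 sec
84% of detections: 0-days (unknown)
25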
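The combination rule on this slide is a logical OR over the two classifier verdicts. A minimal sketch, assuming each classifier is exposed as a function returning True for "malicious"; the set-based stand-ins, names, and sample values below are hypothetical placeholders for the trained URL and SHA1 classifiers.

```python
from typing import Callable, NamedTuple

class DownloadEvent(NamedTuple):
    url: str
    file_sha1: str
    machine: str

def is_malicious_download(
    event: DownloadEvent,
    url_is_malicious: Callable[[str], bool],
    file_is_malicious: Callable[[str], bool],
) -> bool:
    """Flag the event if the URL OR the file is classified as malicious."""
    return url_is_malicious(event.url) or file_is_malicious(event.file_sha1)

# Stand-in classifiers (assumption): simple set lookups instead of trained models.
bad_urls = {"http://bad.example.invalid/setup.exe"}
bad_files = {"sha1-bad"}

def url_clf(url: str) -> bool:
    return url in bad_urls

def file_clf(sha1: str) -> bool:
    return sha1 in bad_files

flagged = is_malicious_download(
    DownloadEvent("http://ok.example.invalid/tool.exe", "sha1-bad", "machine-1"),
    url_clf,
    file_clf,
)
print(flagged)
```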
Case Study #1
Wuachos.A Dropper
Filename file_saw.exe
URLs with _no_ reputation
Low prevalence
Invalid signature
Path pattern with R of 0.72 (malicious) [*]
1,445 URLs serving 182 polymorphic malware
[*] /f/1392240240/1255385580/2 , /f/1392240120/4165299987/2 -> /H1/I10/I10/I1
26
Case Study #2
Somoto Adware
Filename FreeZipSetup-[d].exe
Packed, short lifetime, prevalence = 0
1 related machine downloaded 1 known sample during our time window T = 10 days
Detected a campaign of 695 samples
616 were unknown to VirusTotal
61 unknown +6 months
27
Case Study #3
TTAWinCDM Spyware
Machine and URL with _no_ reputation
Low lifetime, prevalence, and number of countries
Mismatch on downloading process
Acrobat process VS. non-authoritative domain
Flash 0-day (+2 months)
28
Bonus #1
Analysis of Window T
29
Bonus #2
Features Analysis
[Figures: files analysis, URLs analysis]
30
Conclusions
Mastino: real-time detection of malware downloads by passive client monitoring
Content-agnostic, behavioral analysis
Large-scale real-world deployment
Over 95% TP / 0.5% FP
0-days
31
Thank you!
@embyte
http://www.madlab.it
Babak Rahbarinia ; Marco Balduzzi ; Roberto Perdisci
Questions?
32
