Statistical Structures:
Fingerprinting Malware for
Classification and Analysis
Daniel Bilar
Wellesley College (Wellesley, MA)
Colby College (Waterville, ME)
bilar <at> alum dot dartmouth dot org
Why Structural Fingerprinting?
Goal: Identifying and classifying malware
Problem: For any single fingerprint, a balance
between over-fitting (type II error) and under-fitting
(type I error) is hard to achieve
Approach: View binaries simultaneously from
different structural perspectives and perform
statistical analysis on these ‘structural
fingerprints’
Different Perspectives
Structural Perspective    Description                                           Statistical Fingerprint          static / dynamic?
Assembly instruction      Count different instructions                          Opcode frequency distribution    Primarily static
Win 32 API call           Observe API calls made                                API call vector                  Primarily dynamic
System Dependence Graph   Explore graph-modeled control and data dependencies   Graph structural properties      Primarily static
Idea: Multiple perspectives may increase likelihood of
correct identification and classification
Fingerprint:
Opcode frequency distribution
Synopsis: Statically disassemble the binary, tabulate
the opcode frequencies and construct a statistical
fingerprint with a subset of said opcodes.
Goal: Compare opcode fingerprint across non-
malicious software and malware classes for quick
identification and classification purposes.
Main result: ‘Rare’ opcodes explain more data
variation than common ones
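The tabulation step in the synopsis can be sketched as follows. This is a minimal illustration, assuming the disassembler has already produced a flat list of mnemonics; the function name `opcode_fingerprint` and the sample listing are hypothetical and are not the InstructionCounter plugin used in the talk.

```python
from collections import Counter


def opcode_fingerprint(mnemonics, top=5):
    """Return (total, rows) where rows lists the `top` most frequent opcodes
    as (opcode, count, percent_of_total)."""
    counts = Counter(m.lower() for m in mnemonics)
    total = sum(counts.values())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    rows = [(op, n, 100.0 * n / total) for op, n in ranked[:top]]
    return total, rows


# Hypothetical disassembly listing (illustrative only, not from any sample).
listing = ["mov", "push", "mov", "call", "pop", "mov", "cmp", "push", "jz", "mov"]
total, rows = opcode_fingerprint(listing, top=3)
print(f"totalopcodes: {total}")
for rank, (op, n, pct) in enumerate(rows, start=1):
    print(f"{rank:04d}. {n:06d} {pct:5.2f}% {op}")
```

The printed layout loosely mirrors the per-file output shown on the procedure slides (rank, count, percentage, mnemonic).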
Goodware: Opcode Distribution
Procedure:
1. Inventoried PEs (EXE, DLL,
etc) on XP box with
Advanced Disk Catalog
2. Chose random EXE samples
with MS Excel and Index
your Files
3. Ran IDA with modified
InstructionCounter plugin on
sample PEs
4. Augmented IDA output files
with PEID results (compiler)
and general ‘functionality
class’ (e.g. file utility, IDE,
network utility, etc)
5. Wrote Java parser for raw
data files and fed JAMA’ed
matrix into Excel for analysis
---------.exe
-------.exe
---------.exe
size: 122880
totalopcodes: 10680
compiler: MS Visual C++ 6.0
class: utility (process)
0001. 002145 20.08% mov
0002. 001859 17.41% push
0003. 000760 7.12% call
0004. 000759 7.11% pop
0005. 000641 6.00% cmp
Malware: Opcode Distribution
Procedure:
1. Booted VMPlayer with XP
image
2. Inventoried PEs from C. Ries
malware collection with
Advanced Disk Catalog
3. Fixed 7 classes (e.g. virus,
rootkit, etc), chose random
PEs samples with MS Excel
and Index your Files
4. Ran IDA with modified
InstructionCounter plugin on
sample PEs
5. Augmented IDA output files
with PEID results (compiler,
packer) and ‘class’
6. Wrote Java parser for raw data
files and fed JAMA’ed matrix
into Excel for analysis
Giri.5209
Gobi.a
---------.b
size: 12288
totalopcodes: 615
compiler: unknown
class: virus
0001. 000112 18.21% mov
0002. 000094 15.28% push
0003. 000052 8.46% call
0004. 000051 8.29% cmp
0005. 000040 6.50% add
AFXRK2K4.root.exe
vanquish.dll
Aggregate (Goodware):
Opcode Breakdown
Pie chart (labelled opcodes):
mov 25%, push 19%, call 9%, pop 6%, cmp 5%, jz 4%, lea 4%,
test 3%, jmp 3%, add 3%, jnz 3%, retn 2%, xor 2%, and 1%
20 EXEs
(size-blocked random samples from
538 inventoried EXEs)
~1,520,000 opcodes read
192 out of 398 possible opcodes
found
72 opcodes in pie chart account for
>99.8%
14 opcodes labelled account for ~90%
Top 5 opcodes account for ~64 %
Aggregate (Malware):
Opcode Breakdown
Pie chart (labelled opcodes):
mov 30%, push 16%, call 10%, pop 6%, cmp 4%, jz 4%, jmp 3%,
lea 3%, add 3%, test 3%, retn 3%, jnz 3%, xor 3%, sub 1%
67 PEs
(class-blocked random samples
from 250 inventoried PEs)
~665,000 opcodes read
141 out of 398 possible
opcodes found (two undocu-
mented)
60 opcodes in pie chart
account for >99.8%
14 opcodes labelled account
for >92%
Top 5 opcodes account for
~65%
Class-blocked (Malware):
Opcode Breakdown Comparison
Aggregate breakdown:
mov 30 %, push 16 %, call 10 %, pop 6 %, cmp 4 %, jz 4 %, jmp 4 %,
lea 3 %, add 3 %, test 3 %, retn 3 %, jnz 2 %, xor 2 %, sub 1 %
Pie charts shown per class: Aggregate, worms, virus, tools, trojans,
rootkit (kernel), rootkit (user), bots
Top 14 Opcodes: Frequency
Opcode  Goodware  Kernel RK  User RK  Tools   Bot     Trojan  Virus   Worms
mov     25.3%     37.0%      29.0%    25.4%   34.6%   30.5%   16.1%   22.2%
push    19.5%     15.6%      16.6%    19.0%   14.1%   15.4%   22.7%   20.7%
call     8.7%      5.5%       8.9%     8.2%   11.0%   10.0%    9.1%    8.7%
pop      6.3%      2.7%       5.1%     5.9%    6.8%    7.3%    7.0%    6.2%
cmp      5.1%      6.4%       4.9%     5.3%    3.6%    3.6%    5.9%    5.0%
jz       4.3%      3.3%       3.9%     4.3%    3.3%    3.5%    4.4%    4.0%
lea      3.9%      1.8%       3.3%     3.1%    2.6%    2.7%    5.5%    4.2%
test     3.2%      1.8%       3.2%     3.7%    2.6%    3.4%    3.1%    3.0%
jmp      3.0%      4.1%       3.8%     3.4%    3.0%    3.4%    2.7%    4.5%
add      3.0%      5.8%       3.7%     3.4%    2.5%    3.0%    3.5%    3.0%
jnz      2.6%      3.7%       3.1%     3.4%    2.2%    2.6%    3.2%    3.2%
retn     2.2%      1.7%       2.3%     2.9%    3.0%    3.2%    2.0%    2.3%
xor      1.9%      1.1%       2.3%     2.1%    3.2%    2.7%    2.1%    2.3%
and      1.3%      1.5%       1.0%     1.3%    0.5%    0.6%    1.5%    1.6%
Comparison Opcode Frequencies
Perform distribution tests for top
14 opcodes on 7 classes of
malware:
Rootkit (kernel + user)
Virus and Worms
Trojan and Tools
Bots
Investigate: Which, if any,
opcode frequency is significantly
different for malware?
Top 14 Opcode Testing (z-scores)
Opcode  Kernel RK  User RK  Tools   Bot     Trojan  Virus   Worms
mov      36.8       20.6      2.0    70.1    28.7   -27.9   -20.1
push    -15.5      -21.0      4.6   -59.9   -31.2    12.1     6.9
call    -17.0        1.2      5.2    26.0    10.6     2.6    -0.3
pop     -22.0      -13.5      4.9     5.1     9.8     4.8    -1.1
cmp       7.4       -3.5     -0.6   -30.8   -21.2     4.7    -1.8
jz       -7.4       -6.1      0.9   -20.9   -11.0     1.4    -4.4
lea     -16.2       -8.4     10.9   -29.2   -18.3    11.5     4.2
test    -12.2        0.0     -6.6   -14.6     1.8    -0.2    -3.4
jmp       8.5       11.7     -5.0    -2.2     5.0    -2.3    20.4
add      22.9       10.8     -6.4   -13.5    -0.1     4.3     0.5
jnz       8.7        7.4    -11.7   -12.2    -0.9     5.3     8.0
retn     -5.5        2.5    -12.3    18.4    17.8    -1.4     2.6
xor      -8.9        6.7     -2.6    29.5    15.3     2.7     7.7
and       1.9       -7.3     -0.7   -33.6   -17.0     2.4     5.9
Legend (cell shading, opcode frequency vs goodware): High, Higher, Similar, Lower, Low
Tests suggest opcode frequency is roughly 1/3 same, 1/3 lower, 1/3 higher
vs goodware
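The slides do not spell out how the z-scores are computed. One plausible reading, given that the References list Haberman (1973) under statistical testing, is that they are adjusted standardized residuals of a class-by-opcode contingency table. The sketch below works under that assumption; the function name and the toy counts are hypothetical.

```python
import math


def adjusted_residuals(table):
    """Adjusted standardized residuals for a contingency table.

    `table` is a list of rows of observed counts. For cell (i, j):
        z = (O - E) / sqrt(E * (1 - row_i/n) * (1 - col_j/n))
    where E = row_i * col_j / n is the expected count under independence.
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    result = []
    for i, row in enumerate(table):
        z_row = []
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            variance = expected * (1 - row_tot[i] / n) * (1 - col_tot[j] / n)
            z_row.append((observed - expected) / math.sqrt(variance))
        result.append(z_row)
    return result


# Toy table (made-up counts): rows = goodware / malware class,
# columns = occurrences of one opcode / all other opcodes.
toy = [[30, 10],
       [10, 30]]
for z_row in adjusted_residuals(toy):
    print([round(z, 2) for z in z_row])
```

Under the independence hypothesis these residuals are approximately standard normal, so large absolute values flag cells whose frequency deviates from goodware.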
Top 14 Opcodes Results Interpretation
Cramér’s V (in %)
Krn    Usr   Tools  Bot    Trojan  Virus  Worm
10.3   6.1   4.0    15.0   9.5     5.6    5.2

Kernel-mode Rootkit: most # of deviations → handcoded assembly;
‘evasive’ opcodes ?
Virus + Worms: few # of deviations; more jumps → smaller size, simpler
malicious function, more control flow ?
Tools: (almost) no deviation in top 5 opcodes → more ‘benign’ (i.e. similar
to goodware) ?
Most frequent 14 opcodes weak predictor
Explains just 5-15% of variation!
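Cramér's V condenses a contingency table into a single association strength between 0 and 1, which is how the "explains x% of variation" figures are reported. A minimal sketch using the textbook definition V = sqrt(chi² / (n · (min(rows, cols) − 1))); the function name and toy tables are hypothetical, and the exact tables used in the talk are not given on the slides.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0]))
    return math.sqrt(chi2 / (n * (k - 1)))


print(cramers_v([[30, 10], [10, 30]]))  # moderate association
print(cramers_v([[10, 10], [10, 10]]))  # no association
```

Unlike the z-scores, V is normalised by the sample size, so classes with very different opcode counts can be compared on one scale.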
Rare 14 Opcodes (parts per million)
Opcode  Goodware  Kernel RK  User RK  Tools  Bot   Trojan  Virus  Worms
bt        30         0         34       47    70     83      0     118
fdivp     37         0          0       35    52     52      0      59
fild     357         0         45        0   133    115      0     438
fstcw     11         0          0        0    22     21      0      12
imul    1182      1629       1849      708   726    406    755    1126
int       25      4028        981      921     0      0    108       0
nop      216       136        101       71     7     42    647      83
pushf    116         0         11       59     0      0     54      12
rdtsc     12         0          0        0    11      0    108       0
sbb     1078       588       1330     1523   431    458   1133     782
setb       6         0         68       12    22     52      0      24
setle     20         0          0        0     0     21      0       0
shld      22         0         45       35     4      0     54      24
std       20       272         56       35    48     31      0      95
Rare 14 Opcode Testing (z-scores)
Opcode  Kernel RK  User RK  Tools  Bot    Trojan  Virus  Worms
bt       -1.2       -0.4     0.7    6.6    5.9    -0.7    4.8
fdivp    -1.3       -2.2    -0.3    3.8    2.8    -0.8    1.3
fild     -4.3       -6.5    -6.1   -1.5   -0.8    -2.6    2.1
fstcw    -0.7       -1.2    -1.0    3.3    2.2    -0.4    0.2
imul     -3.3        1.3    -5.9    4.4   -1.4    -1.7    0.9
int      45.0       26.2    28.7   -1.8   -1.0     2.4   -1.4
nop      -2.3       -3.6    -3.2   -5.0   -1.6     4.5   -2.3
pushf    -2.4       -3.7    -1.8   -3.9   -2.2    -0.7   -2.6
rdtsc    -0.7       -1.2    -1.1    1.1   -0.7     3.8   -0.9
sbb      -6.5       -2.0     3.4   -2.2    0.3     0.8   -2.0
setb     -0.5        4.7     0.6    4.6    7.9    -0.3    2.1
setle    -1.0       -1.6    -1.4   -1.6    1.3    -0.6   -1.2
shld     -1.0        0.6     0.6   -1.1   -0.9     1.0    0.2
std       4.8        1.4     0.8    0.3    2.4    -0.6    4.8
Legend (cell shading, opcode frequency vs goodware): High, Higher, Similar, Lower, Low
Tests suggest opcode frequency is roughly 1/10 lower, 1/5 higher, 7/10 same
vs goodware
Rare 14 Opcodes: Interpretation
Cramér’s V (in %)
Krn   Usr   Tools  Bot   Trojan  Virus  Worm
63    36    42     17    16      10     12

NOP: Virus makes use → NOP sled, padding ?
INT: Rootkits (and tools) make heavy use of software interrupts →
tell-tale sign of RK ?
Infrequent 14 opcodes much better predictor!
Explains 12-63% of variation
Summary: Opcode Distribution
Malware opcode
frequency distribution
seems to deviate
significantly from non-
malicious software
‘Rare’ opcodes explain
more frequency
variation than common
ones
Compare opcode
fingerprints against
various software classes
for quick identification
and classification
Opcodes: Further directions
Acquire more samples and software class differentiation
Investigate sophisticated tests for stronger control of
false discovery rate and type I error
Study n-way association with more factors (compiler,
type of opcodes, size)
Go beyond isolated opcodes to semantic ‘nuggets’ (size-
wise between isolated opcodes and basic blocks)
Investigate equivalent opcode substitution effects
Related Work
M. Weber (2002): PEAT – Toolkit for Detecting
and Analyzing Malicious Software
R. Chinchani (2005): A Fast Static Analysis
Approach to Detect Exploit Code Inside
Network Flows
S. Stolfo (2005): Fileprint Analysis for
Malware Detection
Fingerprint: Win 32 API calls
Goal: Classify malware quickly into a family (set of
variants make up a family)
Synopsis: Observe and record Win32 API calls made
by malicious code during execution, then compare
them to calls made by other malicious code to find
similarities
Joint work with Chris Ries
Main result: Simple model yields > 80% correct
classification, call vectors seem robust towards
different packers
Win 32 API call: System overview
[Diagram components: Data Collection, Log File, Vector Builder, Database,
Vector Comparison, Malicious Code Family]
Data Collection: Run malicious code, recording
Win32 API calls it makes
Vector Builder: Build count vector from collected API
call data and store in database
Comparison: Compare vector to all other vectors in the
database to see if it is related to any of them
Win 32 API Call: Data Collection
[Diagram labels: Win 2000, VMWare, WinXP, Linux, Relayer, Fake DNS Server,
Honeyd, Malware, Logger]
Malware runs for short period of time on VMWare
machine, can interact with fake network
API calls recorded by logger, passed on to Relayer
Relayer forwards logs to file, console
Win 32 API Call: Call Recording
Malicious process is started in suspended state
DLL is injected into process’s address space
When DLL’s DllMain() function is executed, it hooks
the Win32 API function
[Diagram: function call before hooking (Calling Function, Target Function)
and function call after hooking (Calling Function, Hook, Trampoline,
Target Function)]
Hook records the call’s time and arguments, calls the
target, records the return value, and then returns the
target’s return value to the calling function.
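The record-call-return pattern of the hook can be illustrated in a language-neutral way. The sketch below is only an analogy in Python, not the DLL-injection mechanism from the talk: it wraps a function so that each call's time, arguments and return value are logged before the result is handed back to the caller. The names `hook`, `call_log` and `find_close` are hypothetical stand-ins.

```python
import functools
import time

call_log = []  # hypothetical in-memory log; the talk forwards logs to a Relayer


def hook(target):
    """Wrap `target` so every call is recorded, then forwarded unchanged."""
    @functools.wraps(target)
    def wrapper(*args, **kwargs):
        started = time.time()              # record the call's time
        result = target(*args, **kwargs)   # call the original target
        call_log.append({
            "name": target.__name__,
            "time": started,
            "args": args,
            "kwargs": kwargs,
            "ret": result,                 # record the return value
        })
        return result                      # hand it back to the caller
    return wrapper


def find_close(handle):
    """Hypothetical stand-in for an API function (not the real Win32 call)."""
    return handle != 0


find_close = hook(find_close)  # "install" the hook
print(find_close(7), call_log[-1]["name"], call_log[-1]["args"])
```

The caller sees the same return value as before hooking; only the side effect of logging is added.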
Win 32 API call: Call Vector
Each column of the vector represents a hooked function
and holds the # of times it was called
1200+ different functions recorded during execution
For each malware specimen, vector values recorded
to database
Function Name     Number of Calls
FindClose         62
FindFirstFileA    12
CloseHandle       156
EndPath           0
…                 …
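Building such a count vector from a call log might look like the sketch below. It assumes the log has already been reduced to a sequence of function names and that the list of hooked functions fixes the column order; both the helper name `build_call_vector` and the tiny log are hypothetical.

```python
from collections import Counter


def build_call_vector(logged_calls, hooked_functions):
    """Return call counts in the fixed column order of `hooked_functions`.

    Functions that were hooked but never called get a count of 0.
    """
    counts = Counter(logged_calls)
    return [counts[name] for name in hooked_functions]


# Column order is an assumption; the talk records 1200+ functions.
hooked = ["FindClose", "FindFirstFileA", "CloseHandle", "EndPath"]
log = ["FindFirstFileA", "FindClose", "CloseHandle", "CloseHandle", "FindClose"]
print(build_call_vector(log, hooked))
```

Keeping the column order fixed across specimens is what makes two vectors comparable element by element.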
Win 32 API call: Comparison
Computes cosine similarity measure csm
between vector and each vector in the database
$$\mathrm{csm}(\vec{v}_1, \vec{v}_2) = \frac{\vec{v}_1}{\lVert \vec{v}_1 \rVert} \cdot \frac{\vec{v}_2}{\lVert \vec{v}_2 \rVert} = \frac{\vec{v}_1 \cdot \vec{v}_2}{\lVert \vec{v}_1 \rVert \, \lVert \vec{v}_2 \rVert}$$
If csm(vector, most similar vector in the
database) > threshold → vector is classified
as member of family_most-similar-vector
Otherwise vector classified as member of
family_no-variants-yet
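The comparison and threshold rule can be sketched in a few lines. This is an illustration under stated assumptions: the database is modelled as a plain list of (family, vector) pairs, and the family names, vectors and the default threshold of 0.8 are placeholders rather than values from the talk's database.

```python
import math


def csm(v1, v2):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0  # all-zero vectors are treated as dissimilar


def classify(vector, database, threshold=0.8):
    """Return (family, score) of the most similar known vector if its score
    exceeds `threshold`, otherwise ("no-variants-yet", score)."""
    best_family, best_score = None, -1.0
    for family, known in database:
        score = csm(vector, known)
        if score > best_score:
            best_family, best_score = family, score
    if best_family is not None and best_score > threshold:
        return best_family, best_score
    return "no-variants-yet", best_score


# Hypothetical database entries (illustrative numbers only).
db = [("FamilyA", [60, 10, 150, 1]), ("FamilyB", [0, 50, 0, 40])]
print(classify([62, 12, 156, 0], db))
print(classify([0, 0, 5, 90], db))
```

Because cosine similarity ignores vector length, a specimen that makes the same mix of calls but runs for a shorter time still scores close to its family.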
Win 32 API call: Results
Collected 77 malware samples
Family                   # of members   # correct
Apost                    1              0
Banker / Nibu / Tarno    4 / 1 / 2      4 / 1 / 2
Beagle                   15             14
Blaster                  1              1
Frethem                  3              2
Gibe                     1              1
Inor                     2              0
Klez                     1              1
Mitgleider               2              2
MyDoom                   10             8
MyLife                   5              5
Netsky                   8              8
Sasser                   3              2
Randex                   2              1
Spybot                   1              0
Pestlogger               1              1
Welchia                  6              6
SDBot                    8              5
Moega                    3              3

Threshold   correct %   correct #   false fam.   both   miss. fam.
0.7         0.8         62          5            8      2
0.75        0.8         62          5            6      4
0.8         0.82        63          3            6      5
0.85        0.82        63          2            4      8
0.9         0.79        61          1            4      10
0.95        0.79        61          2            3      11
0.99        0.62        48          0            2      27

Classification made by 17
major AV scanners produced
21 families (some aliases)
~80 % correct with
csm threshold 0.8
Discrepancies
Misclassifications
Win 32 API call: Packers
Wide variety of
different packers used
within same families
Dynamic Win 32 API
call fingerprint seems
robust towards packer

Variant      Packer          Identified
Netsky.AB    PECompact       yes
Netsky.B     UPX             no
Netsky.C     PEtite          yes
Netsky.D     PEtite          yes
Netsky.K     tElock          yes
Netsky.P     FSG             yes
Netsky.S     PE-Patch, UPX   yes
Netsky.Y     PE Pack         yes

8 Netsky variants in
sample, 7 identified
Summary: Win 32 API calls
Simple model yields > 80% correct classification
Resolved discrepancies between some AV scanners
Dynamic API call vectors seem robust towards
different packers
Allows researchers and analysts to quickly identify
variants reasonably well, without manual analysis
API call : Further directions
Acquire more malware samples for better variant
classification
Explore resiliency to obfuscation techniques
(substitutions of Win 32 API calls, call spamming)
Investigate patterns of ‘call bundles’ instead of just
isolated calls for richer identification
Replace VSM with finite state automaton that captures
rich set of call relations
Related Work
R. Sekar et al (2001): A Fast Automaton-Based
Method for Detecting Anomalous Program
Behaviour
J. Rabek et al (2003): DOME – Detection of
Injected, Dynamically Generated, and
Obfuscated Malicious Code
K. Rozinov (2005): Efficient Static Analysis of
Executables for Detecting Malicious Behaviour
Fingerprint: PDG measures
Goal: Compare ‘graph structure’ fingerprint of
unknown binaries across non-malicious software and
malware classes for identification, classification and
prediction purposes
Synopsis: Represent binaries as a System
Dependence Graph, extract graph features to construct
‘graph-structural’ fingerprints for particular software
classes
Main result: Work in progress
Program Dependence Graph
A PDG models intra-procedural dependences:
Data Dependence:
Program statements
compute data that are
used by other statements.
Control Dependence:
Arise from the ordered
flow of control in a
program.
Picture from J. Stafford (Colorado, Boulder)
Material from Codesurfer
Program Dependence Graph
System Dependence Graph
An SDG models control, data, and call dependences in
a program
The SDG of a program
is the aggregate of its
PDGs, augmented
with the calls
between functions
Picture from J. Stafford (Colorado, Boulder)
System Dependence Graph
Material from Codesurfer
Graph measures as a fingerprint
Binaries represented as a System
Dependence Graph (SDG)
Tally distributions on graph measures:
Edge weights (weight is jump distance)
Node weight (weight is number of statements in basic block)
Centrality (“How important is the node”)
Clustering Coefficient (“probability of connected neighbours”)
Motifs (“recurring patterns”)
→ Statistical structural fingerprint
Example: H. Flake (BH 05) Graph-structural measure
Primer: Graphs
Graphs (Networks)
are made up of
vertices (nodes) and
edges
An edge connects two
vertices
Nodes and edges can
have different weights
(b,c)
Edges can have
directions (d)
Picture from Mark Newman (2003)
Modeling with Graphs
Category: Example (nodes, edge)

Social (set of people/groups with some interaction between them):
  Friendship (people, friendship bond)
  Business (companies, business dealings)
  Movies (actors, collaboration)
  Phone calls (number, call)
Information (information linked together):
  Citation (paper, cited)
  Thesaurus (words, synonym)
  www (html pages, URL links)
Technology (Transmission) (resource or commodity distribution/transmission):
  Power grid (power station, lines)
  Telephone
  Internet (routers, physical links)
Biological (biological ‘entities’ interacting):
  Genetic regulatory network (proteins, dependence)
  Cardiovascular (organs, veins) [also transmission]
  Food (predator, prey)
  Neural (neurons, axons)
Physical Science (physical ‘entities’ interacting):
  Chemistry (conformation of polymers, transitions)
Measures: Centrality
Centrality tries to measure the ‘importance’ of a
vertex
Degree centrality: “How many nodes are
connected to me?”
Closeness centrality: “How close am I to
all other nodes?”
Betweenness centrality: “How important
am I for any two nodes?”
Freeman metric computes centralization for
entire graph
Measure:
Degree centrality
Degree centrality: “How many nodes are connected to me?”
$$M_d(n_i) = \sum_{j=1,\, j \neq i}^{N} edge(n_i, n_j)$$
Normalized:
$$M'_d(n_i) = \frac{\sum_{j=1,\, j \neq i}^{N} edge(n_i, n_j)}{N - 1}$$
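A short sketch of degree centrality on an undirected graph, assuming the graph is stored as a dictionary that maps each node to the set of its neighbours (so edge(n_i, n_j) is 1 when n_j is in that set and 0 otherwise). The representation and the example star graph are assumptions for illustration.

```python
def degree_centrality(adj, node, normalized=False):
    """Number of nodes directly connected to `node`; divided by N - 1 if normalized."""
    degree = sum(1 for other in adj if other != node and other in adj[node])
    if normalized:
        return degree / (len(adj) - 1)
    return degree


# Star graph: node 0 is connected to every other node.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(degree_centrality(star, 0), degree_centrality(star, 0, normalized=True))
print(degree_centrality(star, 1), degree_centrality(star, 1, normalized=True))
```

The normalized value is 1 exactly when a node is adjacent to all N - 1 other nodes, which makes graphs of different sizes comparable.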
Measure:
Closeness centrality
Closeness centrality: “How close am I to all other nodes?”
$$M_C(n_i) = \left[ \sum_{j=1}^{N} d(n_i, n_j) \right]^{-1}$$
Normalized:
$$M'_C(n_i) = M_C(n_i) \, (N - 1)$$
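Closeness can be computed with one breadth-first search per node. The sketch below assumes an unweighted, undirected, connected graph with at least two nodes, stored as a dictionary of neighbour sets, and takes d(n_i, n_j) to be the hop count of a shortest path; those choices are assumptions for illustration.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def closeness_centrality(adj, node, normalized=False):
    """Inverse of the summed distances to all other nodes; times N - 1 if normalized."""
    dist = bfs_distances(adj, node)
    if len(dist) != len(adj):
        raise ValueError("graph must be connected")
    score = 1.0 / sum(dist.values())
    return score * (len(adj) - 1) if normalized else score


star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(closeness_centrality(star, 0, normalized=True))
print(closeness_centrality(star, 1, normalized=True))
```

In the star graph the hub reaches everyone in one hop, so its normalized closeness is the maximum possible value.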
Measure:
Betweenness centrality
Betweenness centrality: “How important am I for any two nodes?”
$$C_B(n_i) = \sum_{j \neq k \neq i} \frac{g_{jk}(n_i)}{g_{jk}}$$
Normalized:
$$C'_B(n_i) = \frac{C_B(n_i)}{(N - 1)(N - 2) / 2}$$
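In Freeman's definition, g_jk is the number of shortest paths between n_j and n_k, and g_jk(n_i) is the number of those that pass through n_i. The brute-force sketch below follows that definition directly for an unweighted, undirected graph stored as a dictionary of neighbour sets (an assumed representation), summing over unordered pairs so that the (N − 1)(N − 2)/2 normalisation applies. It favours clarity over speed; faster algorithms exist for large graphs.

```python
from collections import deque
from itertools import combinations


def bfs_counts(adj, source):
    """Return (dist, sigma): hop distances and number of shortest paths from source."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:            # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:   # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness_centrality(adj, node, normalized=False):
    info = {n: bfs_counts(adj, n) for n in adj}
    dist_i, sigma_i = info[node]
    total = 0.0
    for j, k in combinations([n for n in adj if n != node], 2):
        dist_j, sigma_j = info[j]
        if k not in dist_j or node not in dist_j or k not in dist_i:
            continue  # pair not connected, or node unreachable
        if dist_j[node] + dist_i[k] == dist_j[k]:
            # shortest j-k paths through node = (j-to-node paths) * (node-to-k paths)
            total += sigma_j[node] * sigma_i[k] / sigma_j[k]
    if normalized:
        n = len(adj)
        return total / ((n - 1) * (n - 2) / 2)
    return total


star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(betweenness_centrality(star, 0, normalized=True))
print(betweenness_centrality(star, 1, normalized=True))
```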
Local structure:
Clusters and Motifs
Graphs can be decomposed into constitutive
subgraphs
Subgraph: A subset of nodes of the original
graph and of edges connecting them (does not
have to contain all the edges of a node)
Cluster: Connected subgraph
Motif: Recurring subgraph in networks at
higher frequencies than expected by random
chance
Measure: Clustering Coefficient
Slide from Albert (Penn State)
Measure: Network motifs
A motif in a network is a
subgraph ‘pattern’
Recurs in networks at
higher frequencies than
expected by random
chance
Motif may reflect underlying generative
processes, design principles and
constraints and driving dynamics
of the network
Motif example:
Found in
Biochemistry (Transcriptional regulation)
Neurobiology (Neuron connectivity)
Ecology (food web)
Engineering (electronic circuits) ..
and maybe Computer Science (PDGs) ??
Feed-forward loop:
X regulates Y and Z
Y regulates Z
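Counting occurrences of the feed-forward loop in a directed graph is straightforward. The sketch below assumes the graph is a dictionary mapping each node to the set of its successors, and it only counts occurrences; deciding whether the pattern is a motif would additionally require comparing the count against randomised graphs, which is not shown here.

```python
def count_feed_forward_loops(succ):
    """Count ordered triples (x, y, z) of distinct nodes with edges
    x -> y, x -> z and y -> z."""
    count = 0
    for x, x_out in succ.items():
        for y in x_out:
            if y == x:
                continue
            for z in succ.get(y, ()):
                if z != x and z != y and z in x_out:
                    count += 1
    return count


# X regulates Y and Z, Y regulates Z.
ffl = {"X": {"Y", "Z"}, "Y": {"Z"}, "Z": set()}
cycle = {"X": {"Y"}, "Y": {"Z"}, "Z": {"X"}}
print(count_feed_forward_loops(ffl), count_feed_forward_loops(cycle))
```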
Summary: SDG measures
Goal: Compare ‘graph structure’ fingerprint of
unknown binaries across non-malicious software and
malware classes for identification, classification and
prediction purposes
Synopsis: Represent binaries as a System
Dependence Graph, extract graph features to construct
‘graph-structural’ fingerprints for particular software
classes
Main result: Work in progress
Related Work
H. Flake (2005): Compare, Port, Navigate
M. Christodorescu (2003): Static Analysis
of Executables to Detect Malicious
Patterns
A. Kiss (2005): Using Dynamic
Information in the Interprocedural Static
Slicing of Binary Executables
References
Statistical testing:
S. Haberman (1973): The Analysis of Residuals in Cross-Classified Tables,
pp. 205-220
B.S. Everitt (1992): The Analysis of Contingency Tables (2nd Ed.)
Network Graph Measures and Network Motifs:
L. Amaral et al (2000): Classes of Small-World Networks
R. Milo, U. Alon et al (2002): Network Motifs: Simple Building Blocks of
Complex Networks
M. Newman (2003): The structure and function of complex networks
D. Bilar (2006): Science of Networks. http://cs.colby.edu/courses/cs298
System Dependence Graphs:
GrammaTech Inc.: Static Program Dependence Analysis via Dependence
Graphs. http://www.codesurfer.com/papers/
Á. Kiss et al (2003). Interprocedural Static Slicing of Binary Executables

More Related Content

Similar to BH-US-06-Bilar.pdf

150104 3 methods for-binary_analysis_and_valgrind
150104 3 methods for-binary_analysis_and_valgrind150104 3 methods for-binary_analysis_and_valgrind
150104 3 methods for-binary_analysis_and_valgrindRaghu Palakodety
 
Explainable Artificial Intelligence (XAI) 
to Predict and Explain Future Soft...
Explainable Artificial Intelligence (XAI) 
to Predict and Explain Future Soft...Explainable Artificial Intelligence (XAI) 
to Predict and Explain Future Soft...
Explainable Artificial Intelligence (XAI) 
to Predict and Explain Future Soft...
Chakkrit (Kla) Tantithamthavorn
 
MINI PROJECT s.pptx
MINI PROJECT s.pptxMINI PROJECT s.pptx
MINI PROJECT s.pptx
arjunchithanoor
 
PROVIDING CYBER SECURITY SOLUTION FOR MALWARE DETECTION USING SUPPORT VECTOR ...
PROVIDING CYBER SECURITY SOLUTION FOR MALWARE DETECTION USING SUPPORT VECTOR ...PROVIDING CYBER SECURITY SOLUTION FOR MALWARE DETECTION USING SUPPORT VECTOR ...
PROVIDING CYBER SECURITY SOLUTION FOR MALWARE DETECTION USING SUPPORT VECTOR ...
IRJET Journal
 
Zero day malware detection
Zero day malware detectionZero day malware detection
Zero day malware detection
sujeeshkumarj
 
Detection of vulnerabilities in programs with the help of code analyzers
Detection of vulnerabilities in programs with the help of code analyzersDetection of vulnerabilities in programs with the help of code analyzers
Detection of vulnerabilities in programs with the help of code analyzers
PVS-Studio
 
Keith J. Jones, Ph.D. - MALGAZER: AN AUTOMATED MALWARE CLASSIFIER WITH RUNNIN...
Keith J. Jones, Ph.D. - MALGAZER: AN AUTOMATED MALWARE CLASSIFIER WITH RUNNIN...Keith J. Jones, Ph.D. - MALGAZER: AN AUTOMATED MALWARE CLASSIFIER WITH RUNNIN...
Keith J. Jones, Ph.D. - MALGAZER: AN AUTOMATED MALWARE CLASSIFIER WITH RUNNIN...
Keith Jones, PhD
 
Snippets, Scans and Snap Decisions: How Component Identification Methods Impa...
Snippets, Scans and Snap Decisions: How Component Identification Methods Impa...Snippets, Scans and Snap Decisions: How Component Identification Methods Impa...
Snippets, Scans and Snap Decisions: How Component Identification Methods Impa...
Sonatype
 
Janus - Automation of duplicate code detection and refactoring
Janus - Automation of duplicate code detection and refactoringJanus - Automation of duplicate code detection and refactoring
Janus - Automation of duplicate code detection and refactoring
UmbertoAzadi
 
Java Performance & Profiling
Java Performance & ProfilingJava Performance & Profiling
Java Performance & Profiling
Isuru Perera
 
MACHINE LEARNING APPROACH TO LEARN AND DETECT MALWARE IN ANDROID
MACHINE LEARNING APPROACH TO LEARN AND DETECT MALWARE IN ANDROIDMACHINE LEARNING APPROACH TO LEARN AND DETECT MALWARE IN ANDROID
MACHINE LEARNING APPROACH TO LEARN AND DETECT MALWARE IN ANDROID
IRJET Journal
 
Finding Zero-Days Before The Attackers: A Fortune 500 Red Team Case Study
Finding Zero-Days Before The Attackers: A Fortune 500 Red Team Case StudyFinding Zero-Days Before The Attackers: A Fortune 500 Red Team Case Study
Finding Zero-Days Before The Attackers: A Fortune 500 Red Team Case Study
DevOps.com
 
Accelerated Windows Debugging 3 training public slides
Accelerated Windows Debugging 3 training public slidesAccelerated Windows Debugging 3 training public slides
Accelerated Windows Debugging 3 training public slides
Dmitry Vostokov
 
Esem2010 shihab
Esem2010 shihabEsem2010 shihab
Esem2010 shihabSAIL_QU
 
Application of data mining based malicious code detection techniques for dete...
Application of data mining based malicious code detection techniques for dete...Application of data mining based malicious code detection techniques for dete...
Application of data mining based malicious code detection techniques for dete...UltraUploader
 
A malware detection method for health sensor data based on machine learning
A malware detection method for health sensor data based on machine learningA malware detection method for health sensor data based on machine learning
A malware detection method for health sensor data based on machine learning
jaigera
 
Classification of Malware based on Data Mining Approach
Classification of Malware based on Data Mining ApproachClassification of Malware based on Data Mining Approach
Classification of Malware based on Data Mining Approach
ijsrd.com
 
Applications of genetic algorithms to malware detection and creation
Applications of genetic algorithms to malware detection and creationApplications of genetic algorithms to malware detection and creation
Applications of genetic algorithms to malware detection and creationUltraUploader
 
Malware analysis and detection using reverse Engineering, Available at: www....
Malware analysis and detection using reverse Engineering,  Available at: www....Malware analysis and detection using reverse Engineering,  Available at: www....
Malware analysis and detection using reverse Engineering, Available at: www....
Research Publish Journals (Publisher)
 
Malware Classification Using Structured Control Flow
Malware Classification Using Structured Control FlowMalware Classification Using Structured Control Flow
Malware Classification Using Structured Control Flow
Silvio Cesare
 

Similar to BH-US-06-Bilar.pdf (20)

150104 3 methods for-binary_analysis_and_valgrind
150104 3 methods for-binary_analysis_and_valgrind150104 3 methods for-binary_analysis_and_valgrind
150104 3 methods for-binary_analysis_and_valgrind
 
Explainable Artificial Intelligence (XAI) 
to Predict and Explain Future Soft...
Explainable Artificial Intelligence (XAI) 
to Predict and Explain Future Soft...Explainable Artificial Intelligence (XAI) 
to Predict and Explain Future Soft...
Explainable Artificial Intelligence (XAI) 
to Predict and Explain Future Soft...
 
MINI PROJECT s.pptx
MINI PROJECT s.pptxMINI PROJECT s.pptx
MINI PROJECT s.pptx
 
PROVIDING CYBER SECURITY SOLUTION FOR MALWARE DETECTION USING SUPPORT VECTOR ...
PROVIDING CYBER SECURITY SOLUTION FOR MALWARE DETECTION USING SUPPORT VECTOR ...PROVIDING CYBER SECURITY SOLUTION FOR MALWARE DETECTION USING SUPPORT VECTOR ...
PROVIDING CYBER SECURITY SOLUTION FOR MALWARE DETECTION USING SUPPORT VECTOR ...
 
Zero day malware detection
Zero day malware detectionZero day malware detection
Zero day malware detection
 
Detection of vulnerabilities in programs with the help of code analyzers
Detection of vulnerabilities in programs with the help of code analyzersDetection of vulnerabilities in programs with the help of code analyzers
Detection of vulnerabilities in programs with the help of code analyzers
 
Keith J. Jones, Ph.D. - MALGAZER: AN AUTOMATED MALWARE CLASSIFIER WITH RUNNIN...
Keith J. Jones, Ph.D. - MALGAZER: AN AUTOMATED MALWARE CLASSIFIER WITH RUNNIN...Keith J. Jones, Ph.D. - MALGAZER: AN AUTOMATED MALWARE CLASSIFIER WITH RUNNIN...
Keith J. Jones, Ph.D. - MALGAZER: AN AUTOMATED MALWARE CLASSIFIER WITH RUNNIN...
 
Snippets, Scans and Snap Decisions: How Component Identification Methods Impa...
Snippets, Scans and Snap Decisions: How Component Identification Methods Impa...Snippets, Scans and Snap Decisions: How Component Identification Methods Impa...
Snippets, Scans and Snap Decisions: How Component Identification Methods Impa...
 
Janus - Automation of duplicate code detection and refactoring
Janus - Automation of duplicate code detection and refactoringJanus - Automation of duplicate code detection and refactoring
Janus - Automation of duplicate code detection and refactoring
 
Java Performance & Profiling
Java Performance & ProfilingJava Performance & Profiling
Java Performance & Profiling
 
MACHINE LEARNING APPROACH TO LEARN AND DETECT MALWARE IN ANDROID
MACHINE LEARNING APPROACH TO LEARN AND DETECT MALWARE IN ANDROIDMACHINE LEARNING APPROACH TO LEARN AND DETECT MALWARE IN ANDROID
MACHINE LEARNING APPROACH TO LEARN AND DETECT MALWARE IN ANDROID
 
Finding Zero-Days Before The Attackers: A Fortune 500 Red Team Case Study
Finding Zero-Days Before The Attackers: A Fortune 500 Red Team Case StudyFinding Zero-Days Before The Attackers: A Fortune 500 Red Team Case Study
Finding Zero-Days Before The Attackers: A Fortune 500 Red Team Case Study
 
Accelerated Windows Debugging 3 training public slides
Accelerated Windows Debugging 3 training public slidesAccelerated Windows Debugging 3 training public slides
Accelerated Windows Debugging 3 training public slides
 
Esem2010 shihab
Esem2010 shihabEsem2010 shihab
Esem2010 shihab
 
Application of data mining based malicious code detection techniques for dete...
Application of data mining based malicious code detection techniques for dete...Application of data mining based malicious code detection techniques for dete...
Application of data mining based malicious code detection techniques for dete...
 
A malware detection method for health sensor data based on machine learning
A malware detection method for health sensor data based on machine learningA malware detection method for health sensor data based on machine learning
A malware detection method for health sensor data based on machine learning
 
Classification of Malware based on Data Mining Approach
Classification of Malware based on Data Mining ApproachClassification of Malware based on Data Mining Approach
Classification of Malware based on Data Mining Approach
 
Applications of genetic algorithms to malware detection and creation
Applications of genetic algorithms to malware detection and creationApplications of genetic algorithms to malware detection and creation
Applications of genetic algorithms to malware detection and creation
 
Malware analysis and detection using reverse Engineering, Available at: www....
Malware analysis and detection using reverse Engineering,  Available at: www....Malware analysis and detection using reverse Engineering,  Available at: www....
Malware analysis and detection using reverse Engineering, Available at: www....
 
Malware Classification Using Structured Control Flow
Malware Classification Using Structured Control FlowMalware Classification Using Structured Control Flow
Malware Classification Using Structured Control Flow
 

BH-US-06-Bilar.pdf

  • 1. Statistical Structures: Fingerprinting Malware for Classification and Analysis Daniel Bilar Wellesley College (Wellesley, MA) Colby College (Waterville, ME) bilar <at> alum dot dartmouth dot org
  • 2. Why Structural Fingerprinting? Goal: Identifying and classifying malware. Problem: For any single fingerprint, the balance between over-fitting (type II error) and under-fitting (type I error) is hard to achieve. Approach: View binaries simultaneously from different structural perspectives and perform statistical analysis on these ‘structural fingerprints’.
  • 3. Different Perspectives
      Structural Perspective   Statistical Fingerprint        Description                                          static / dynamic?
      Assembly instruction     Opcode frequency distribution  Count different instructions                         Primarily static
      Win 32 API call          API call vector                Observe API calls made                               Primarily dynamic
      System Dependence Graph  Graph structural properties    Explore graph-modeled control and data dependencies  Primarily static
      Idea: Multiple perspectives may increase likelihood of correct identification and classification
  • 4. Fingerprint: Opcode frequency distribution Synopsis: Statically disassemble the binary, tabulate the opcode frequencies and construct a statistical fingerprint with a subset of said opcodes. Goal: Compare opcode fingerprint across non-malicious software and malware classes for quick identification and classification purposes. Main result: ‘Rare’ opcodes explain more data variation than common ones
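The synopsis above can be sketched in a few lines of Python. This is a minimal illustration, not the deck's IDA plugin or Java parser: the input format (a flat list of mnemonics from some disassembler) and the function names `opcode_fingerprint` and `parts_per_million` are assumptions made for the example.

```python
from collections import Counter

def opcode_fingerprint(mnemonics, subset=None):
    """Return {opcode: relative frequency} for a disassembly listing.

    mnemonics: iterable of opcode mnemonics (hypothetical input format).
    subset: optional iterable of opcodes to keep in the fingerprint.
    """
    counts = Counter(m.lower() for m in mnemonics)
    total = sum(counts.values())
    if total == 0:
        return {}
    keep = list(counts) if subset is None else list(subset)
    # Frequencies are relative to ALL opcodes read, not just the subset.
    return {op: counts.get(op, 0) / total for op in keep}

def parts_per_million(fingerprint):
    """Scale relative frequencies to ppm, the unit used for rare opcodes."""
    return {op: freq * 1_000_000 for op, freq in fingerprint.items()}

# Illustrative listing only; real input would come from a disassembler.
listing = ["mov", "push", "call", "mov", "pop", "cmp", "mov", "int"]
fp = opcode_fingerprint(listing, subset=["mov", "push", "int"])
print(fp)
print(parts_per_million(fp))
```

Keeping the denominator at the total opcode count means fingerprints built from different subsets stay comparable with one another.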
  • 5. Goodware: Opcode Distribution
      Procedure:
      1. Inventoried PEs (EXE, DLL, etc) on XP box with Advanced Disk Catalog
      2. Chose random EXE samples with MS Excel and Index your Files
      3. Ran IDA with modified InstructionCounter plugin on sample PEs
      4. Augmented IDA output files with PEID results (compiler) and general ‘functionality class’ (e.g. file utility, IDE, network utility, etc)
      5. Wrote Java parser for raw data files and fed JAMA’ed matrix into Excel for analysis
      Sample files: ---------.exe, -------.exe, ---------.exe
      Sample output:
      size: 122880
      totalopcodes: 10680
      compiler: MS Visual C++ 6.0
      class: utility (process)
      0001. 002145 20.08% mov
      0002. 001859 17.41% push
      0003. 000760 7.12% call
      0004. 000759 7.11% pop
      0005. 000641 6.00% cmp
  • 6. Malware: Opcode Distribution
      Procedure:
      1. Booted VMPlayer with XP image
      2. Inventoried PEs from C. Ries malware collection with Advanced Disk Catalog
      3. Fixed 7 classes (e.g. virus, rootkit, etc), chose random PE samples with MS Excel and Index your Files
      4. Ran IDA with modified InstructionCounter plugin on sample PEs
      5. Augmented IDA output files with PEID results (compiler, packer) and ‘class’
      6. Wrote Java parser for raw data files and fed JAMA’ed matrix into Excel for analysis
      Sample files: Giri.5209, Gobi.a, ---------.b, AFXRK2K4.root.exe, vanquish.dll
      Sample output:
      size: 12288
      totalopcodes: 615
      compiler: unknown
      class: virus
      0001. 000112 18.21% mov
      0002. 000094 15.28% push
      0003. 000052 8.46% call
      0004. 000051 8.29% cmp
      0005. 000040 6.50% add
  • 7. Aggregate (Goodware): Opcode Breakdown
      Pie chart: mov 25%, push 19%, call 9%, pop 6%, cmp 5%, jz 4%, lea 4%, test 3%, jmp 3%, add 3%, jnz 3%, retn 2%, xor 2%, and 1%
      20 EXEs (size-blocked random samples from 538 inventoried EXEs)
      ~1,520,000 opcodes read
      192 out of 420 possible opcodes found (a second copy of this text on the slide reads 398)
      72 opcodes in pie chart account for >99.8%
      14 opcodes labelled account for ~90%
      Top 5 opcodes account for ~64%
  • 8. Aggregate (Malware): Opcode Breakdown
      Pie chart: mov 30%, push 16%, call 10%, pop 6%, cmp 4%, jz 4%, jmp 3%, lea 3%, add 3%, test 3%, retn 3%, jnz 3%, xor 3%, sub 1%
      67 PEs (class-blocked random samples from 250 inventoried PEs)
      ~665,000 opcodes read
      141 out of 420 possible opcodes found, two undocumented (a second copy of this text on the slide reads 398)
      60 opcodes in pie chart account for >99.8%
      14 opcodes labelled account for >92%
      Top 5 opcodes account for ~65%
  • 9. Class-blocked (Malware): Opcode Breakdown Comparison
      Aggregate breakdown: mov 30%, push 16%, call 10%, pop 6%, cmp 4%, jz 4%, jmp 4%, lea 3%, add 3%, test 3%, retn 3%, jnz 2%, xor 2%, sub 1%
      Per-class pie charts over the same opcodes: worms, virus, tools, trojans, rootkit (kernel), rootkit (user), bots
  • 10. Top 14 Opcodes: Frequency
      Opcode  Goodware  Kernel RK  User RK  Tools  Bot    Trojan  Virus  Worms
      mov     25.3%     37.0%      29.0%    25.4%  34.6%  30.5%   16.1%  22.2%
      push    19.5%     15.6%      16.6%    19.0%  14.1%  15.4%   22.7%  20.7%
      call    8.7%      5.5%       8.9%     8.2%   11.0%  10.0%   9.1%   8.7%
      pop     6.3%      2.7%       5.1%     5.9%   6.8%   7.3%    7.0%   6.2%
      cmp     5.1%      6.4%       4.9%     5.3%   3.6%   3.6%    5.9%   5.0%
      jz      4.3%      3.3%       3.9%     4.3%   3.3%   3.5%    4.4%   4.0%
      lea     3.9%      1.8%       3.3%     3.1%   2.6%   2.7%    5.5%   4.2%
      test    3.2%      1.8%       3.2%     3.7%   2.6%   3.4%    3.1%   3.0%
      jmp     3.0%      4.1%       3.8%     3.4%   3.0%   3.4%    2.7%   4.5%
      add     3.0%      5.8%       3.7%     3.4%   2.5%   3.0%    3.5%   3.0%
      jnz     2.6%      3.7%       3.1%     3.4%   2.2%   2.6%    3.2%   3.2%
      retn    2.2%      1.7%       2.3%     2.9%   3.0%   3.2%    2.0%   2.3%
      xor     1.9%      1.1%       2.3%     2.1%   3.2%   2.7%    2.1%   2.3%
      and     1.3%      1.5%       1.0%     1.3%   0.5%   0.6%    1.5%   1.6%
  • 11. Comparison Opcode Frequencies (repeats the top-14 opcode frequency table of slide 10). Perform distribution tests for top 14 opcodes on 7 classes of malware: Rootkit (kernel + user), Virus and Worms, Trojan and Tools, Bots. Investigate: Which, if any, opcode frequency is significantly different for malware?
  • 12. Top 14 Opcode Testing (z-scores)
      Opcode  Kernel RK  User RK  Tools  Bot    Trojan  Virus  Worms
      mov     36.8       20.6     2.0    70.1   28.7    -27.9  -20.1
      push    -15.5      -21.0    4.6    -59.9  -31.2   12.1   6.9
      call    -17.0      1.2      5.2    26.0   10.6    2.6    -0.3
      pop     -22.0      -13.5    4.9    5.1    9.8     4.8    -1.1
      cmp     7.4        -3.5     -0.6   -30.8  -21.2   4.7    -1.8
      jz      -7.4       -6.1     0.9    -20.9  -11.0   1.4    -4.4
      lea     -16.2      -8.4     10.9   -29.2  -18.3   11.5   4.2
      test    -12.2      0.0      -6.6   -14.6  1.8     -0.2   -3.4
      jmp     8.5        11.7     -5.0   -2.2   5.0     -2.3   20.4
      add     22.9       10.8     -6.4   -13.5  -0.1    4.3    0.5
      jnz     8.7        7.4      -11.7  -12.2  -0.9    5.3    8.0
      retn    -5.5       2.5      -12.3  18.4   17.8    -1.4   2.6
      xor     -8.9       6.7      -2.6   29.5   15.3    2.7    7.7
      and     1.9        -7.3     -0.7   -33.6  -17.0   2.4    5.9
      Color legend (opcode frequency vs goodware): High, Higher, Similar, Lower, Low
      Tests suggest opcode frequencies are roughly 1/3 the same, 1/3 lower and 1/3 higher vs goodware
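The slides do not spell out how the z-scores were computed. One plausible reading, given that the deck's references list Haberman (1973) on residuals in cross-classified tables, is the adjusted standardized residual of each cell in a class-by-opcode contingency table. The sketch below works under that assumption; the counts are invented for illustration and the function name `adjusted_residuals` is hypothetical.

```python
import math

def adjusted_residuals(table):
    """Adjusted standardized residuals for an r x c table of counts.

    For cell (i, j): (O - E) / sqrt(E * (1 - row_i/N) * (1 - col_j/N)),
    where E = row_i * col_j / N. Under independence each residual is
    approximately standard normal, so it can be read as a z-score.
    """
    rows, cols = len(table), len(table[0])
    row_tot = [sum(r) for r in table]
    col_tot = [sum(table[i][j] for i in range(rows)) for j in range(cols)]
    n = sum(row_tot)
    result = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            expected = row_tot[i] * col_tot[j] / n
            variance = expected * (1 - row_tot[i] / n) * (1 - col_tot[j] / n)
            out_row.append((table[i][j] - expected) / math.sqrt(variance))
        result.append(out_row)
    return result

# Invented counts: rows = (goodware, one malware class),
# columns = (occurrences of one opcode, all other opcodes).
counts = [[250, 750],
          [300, 700]]
z = adjusted_residuals(counts)
print(z)
```

A positive residual in the malware row means the opcode occurs more often there than independence would predict, and a negative one means less often.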
  • 13. Top 14 Opcodes Results Interpretation
      (Color-coded opcode × class grid over mov, push, call, pop, cmp, jz, lea, test, jmp, add, jnz, retn, xor, and; legend: High, Higher, Similar, Lower, Low opcode frequency)
      Cramer’s V (in %): Kernel RK 10.3, User RK 6.1, Tools 4.0, Bot 15.0, Trojan 9.5, Virus 5.6, Worms 5.2
      Kernel-mode Rootkit: most # of deviations → handcoded assembly; ‘evasive’ opcodes?
      Virus + Worms: few # of deviations; more jumps → smaller size, simpler malicious function, more control flow?
      Tools: (almost) no deviation in top 5 opcodes → more ‘benign’ (i.e. similar to goodware)?
      Most frequent 14 opcodes weak predictor: explains just 5-15% of variation!
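Cramér's V, the association measure quoted above, is derived from the chi-square statistic of a contingency table: V = sqrt(chi² / (N · (min(r, c) − 1))). The following is a minimal sketch of that standard formula; the counts are invented and how the deck assembled its tables is not stated in the slides.

```python
import math

def cramers_v(table):
    """Cramér's V for an r x c contingency table of counts (0 = none, 1 = perfect)."""
    rows, cols = len(table), len(table[0])
    row_tot = [sum(r) for r in table]
    col_tot = [sum(table[i][j] for i in range(rows)) for j in range(cols)]
    n = sum(row_tot)
    chi2 = 0.0
    for i in range(rows):
        for j in range(cols):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (min(rows, cols) - 1)))

# Invented counts: rows = (goodware, one malware class),
# columns = (occurrences of one opcode, all other opcodes).
weak = cramers_v([[250, 750], [300, 700]])
perfect = cramers_v([[10, 0], [0, 10]])
print(f"{weak:.3f} {perfect:.3f}")
```

Values near 0 indicate that class membership says little about the opcode mix, which is how the deck reads the 5-15% figures for the frequent opcodes.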
  • 14. Rare 14 Opcodes (parts per million)
      Opcode  Goodware  Kernel RK  User RK  Tools  Bot  Trojan  Virus  Worms
      bt      30        0          34       47     70   83      0      118
      fdivp   37        0          0        35     52   52      0      59
      fild    357       0          45       0      133  115     0      438
      fstcw   11        0          0        0      22   21      0      12
      imul    1182      1629       1849     708    726  406     755    1126
      int     25        4028       981      921    0    0       108    0
      nop     216       136        101      71     7    42      647    83
      pushf   116       0          11       59     0    0       54     12
      rdtsc   12        0          0        0      11   0       108    0
      sbb     1078      588        1330     1523   431  458     1133   782
      setb    6         0          68       12     22   52      0      24
      setle   20        0          0        0      0    21      0      0
      shld    22        0          45       35     4    0       54     24
      std     20        272        56       35     48   31      0      95
  • 15. Rare 14 Opcode Testing (z-scores)
      Opcode  Kernel RK  User RK  Tools  Bot   Trojan  Virus  Worms
      bt      -1.2       -0.4     0.7    6.6   5.9     -0.7   4.8
      fdivp   -1.3       -2.2     -0.3   3.8   2.8     -0.8   1.3
      fild    -4.3       -6.5     -6.1   -1.5  -0.8    -2.6   2.1
      fstcw   -0.7       -1.2     -1.0   3.3   2.2     -0.4   0.2
      imul    -3.3       1.3      -5.9   4.4   -1.4    -1.7   0.9
      int     45.0       26.2     28.7   -1.8  -1.0    2.4    -1.4
      nop     -2.3       -3.6     -3.2   -5.0  -1.6    4.5    -2.3
      pushf   -2.4       -3.7     -1.8   -3.9  -2.2    -0.7   -2.6
      rdtsc   -0.7       -1.2     -1.1   1.1   -0.7    3.8    -0.9
      sbb     -6.5       -2.0     3.4    -2.2  0.3     0.8    -2.0
      setb    -0.5       4.7      0.6    4.6   7.9     -0.3   2.1
      setle   -1.0       -1.6     -1.4   -1.6  1.3     -0.6   -1.2
      shld    -1.0       0.6      0.6    -1.1  -0.9    1.0    0.2
      std     4.8        1.4      0.8    0.3   2.4     -0.6   4.8
      Color legend (opcode frequency vs goodware): High, Higher, Similar, Lower, Low
      Tests suggest opcode frequencies are roughly 1/10 lower, 1/5 higher and 7/10 the same vs goodware
  • 16. Rare 14 Opcodes: Interpretation
      (Color-coded opcode × class grid over bt, fdivp, fild, fstcw, imul, int, nop, pushf, rdtsc, sbb, setb, setle, shld, std; legend: High, Higher, Similar, Lower, Low opcode frequency)
      Cramer’s V (in %): Kernel RK 63, User RK 36, Tools 42, Bot 17, Trojan 16, Virus 10, Worms 12
      NOP: Virus makes use → NOP sled, padding?
      INT: Rootkits (and tools) make heavy use of software interrupts → tell-tale sign of RK?
      Infrequent 14 opcodes much better predictor! Explains 12-63% of variation
  • 17. Summary: Opcode Distribution
      Malware opcode frequency distribution seems to deviate significantly from non-malicious software
      ‘Rare’ opcodes explain more frequency variation than common ones
      Compare opcode fingerprints against various software classes for quick identification and classification
      Sample files: Giri.5209, Gobi.a, ---------.b, AFXRK2K4.root.exe, vanquish.dll
      Sample output: size: 12288; totalopcodes: 615; compiler: unknown; class: virus; 0001. 000112 18.21% mov; 0002. 000094 15.28% push; 0003. 000052 8.46% call; 0004. 000051 8.29% cmp; 0005. 000040 6.50% add
  • 18. Opcodes: Further directions Acquire more samples and software class differentiation. Investigate sophisticated tests for stronger control of false discovery rate and type I error. Study n-way association with more factors (compiler, type of opcodes, size). Go beyond isolated opcodes to semantic ‘nuggets’ (size-wise between isolated opcodes and basic blocks). Investigate equivalent opcode substitution effects.
  • 19. Related Work M. Weber (2002): PEAT – Toolkit for Detecting and Analyzing Malicious Software R. Chinchani (2005): A Fast Static Analysis Approach to Detect Exploit Code Inside Network Flows S. Stolfo (2005): Fileprint Analysis for Malware Detection
  • 20. Fingerprint: Win 32 API calls Goal: Classify malware quickly into a family (a set of variants makes up a family). Synopsis: Observe and record Win32 API calls made by malicious code during execution, then compare them to calls made by other malicious code to find similarities. Joint work with Chris Ries. Main result: Simple model yields > 80% correct classification; call vectors seem robust towards different packers
  • 21. Win 32 API call: System overview (Diagram components: Malicious Code, Data Collection, Log File, Vector Builder, Database, Vector Comparison, Family.) Data Collection: Run malicious code, recording the Win32 API calls it makes. Vector Builder: Build count vector from collected API call data and store it in the database. Comparison: Compare vector to all other vectors in the database to see if it is related to any of them
  • 22. Win 32 API Call: Data Collection (Diagram components: Win 2000, VMWare, WinXP, Linux, Relayer, Fake DNS Server, Honeyd, Malware, Logger.) Malware runs for short period of time on VMWare machine, can interact with fake network. API calls recorded by logger, passed on to Relayer. Relayer forwards logs to file, console
  • 23. Win 32 API Call: Call Recording Malicious process is started in suspended state. DLL is injected into process’s address space. When DLL’s DllMain() function is executed, it hooks the Win32 API function. (Diagram: function call before hooking involves Calling Function and Target Function; function call after hooking involves Calling Function, Hook, Trampoline and Target Function.) Hook records the call’s time and arguments, calls the target, records the return value, and then returns the target’s return value to the calling function.
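The hook's record, call, record, return sequence can be mimicked in plain Python. This is only an analogy for the control flow described above, not Win32 DLL injection or trampoline patching; `call_log`, `hook` and `find_close` are hypothetical stand-ins invented for the example.

```python
import functools
import time

call_log = []  # hypothetical in-memory stand-in for the logger/relayer

def hook(target):
    """Wrap target so every call is logged before the result is passed back."""
    @functools.wraps(target)
    def wrapper(*args, **kwargs):
        # Record the call's time and arguments.
        entry = {"name": target.__name__, "time": time.time(),
                 "args": args, "kwargs": kwargs}
        # Call the original target (the role the trampoline plays).
        result = target(*args, **kwargs)
        # Record the return value, then hand it to the caller unchanged.
        entry["return"] = result
        call_log.append(entry)
        return result
    return wrapper

def find_close(handle):
    """Hypothetical stand-in for a hooked API function."""
    return handle != 0

find_close = hook(find_close)  # install the hook
ok = find_close(42)
print(ok, call_log[-1]["name"], call_log[-1]["args"])
```

The caller sees the same return value with or without the hook, which is what lets the logger stay invisible to the monitored code's logic.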
  • 24. Win 32 API call: Call Vector
      Each column of the vector represents a hooked function and the # of times it was called
      1200+ different functions recorded during execution
      For each malware specimen, vector values recorded to database
      Function Name    FindClose  FindFirstFileA  CloseHandle  EndPath  …
      Number of Calls  62         12              156          0        …
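Building such a count vector from a call log is straightforward. A minimal sketch, assuming the log has already been reduced to a list of function names and that the hooked functions are held in a fixed order (both assumptions; the deck's actual log format is not shown):

```python
from collections import Counter

def build_call_vector(call_names, hooked_functions):
    """Return call counts in the fixed column order of hooked_functions."""
    counts = Counter(call_names)
    return [counts.get(name, 0) for name in hooked_functions]

# Column order taken from the slide's example; the log itself is invented.
hooked = ["FindClose", "FindFirstFileA", "CloseHandle", "EndPath"]
log = ["CloseHandle", "FindFirstFileA", "FindClose", "CloseHandle"]
vector = build_call_vector(log, hooked)
print(vector)
```

Using the same column order for every specimen is what makes the vectors comparable later on.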
  • 25. Win 32 API call: Comparison
      Computes cosine similarity measure csm between vector and each vector in the database:
      $\mathrm{csm}(\vec{v}_1, \vec{v}_2) = \frac{\vec{v}_1 \cdot \vec{v}_2}{\lVert \vec{v}_1 \rVert \, \lVert \vec{v}_2 \rVert}$
      If csm(vector, most similar vector in the database) > threshold → vector is classified as member of family_{most-similar-vector}
      Otherwise vector classified as member of family_{no-variants-yet}
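The comparison rule above can be sketched as follows. The database layout (a list of `(family, vector)` pairs), the example vectors and the function names are assumptions for illustration; the default threshold of 0.8 is one of the values reported in the deck's results.

```python
import math

def csm(v1, v2):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm1 * norm2)

def classify(vector, database, threshold=0.8):
    """Return the family of the most similar stored vector, or 'no-variants-yet'.

    database: list of (family_name, vector) pairs (hypothetical layout).
    """
    best_family, best_score = None, -1.0
    for family, known in database:
        score = csm(vector, known)
        if score > best_score:
            best_family, best_score = family, score
    if best_family is not None and best_score > threshold:
        return best_family
    return "no-variants-yet"

# Invented vectors over four hooked functions.
db = [("Netsky", [62, 12, 156, 0]), ("Beagle", [0, 3, 10, 40])]
print(classify([60, 10, 150, 1], db))
print(classify([5, 0, 0, 0], db))
```

Because cosine similarity ignores vector length, a specimen that makes the same mix of calls but runs longer or shorter still lands close to its family.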
  • 26. Win 32 API call: Results
      Collected 77 malware samples
      Classification made by 17 major AV scanners produced 21 families (some aliases)
      ~80% correct with csm threshold 0.8
      Family                 # of members  # correct
      Apost                  1             0
      Banker / Nibu / Tarno  4 / 1 / 2     4 / 1 / 2
      Beagle                 15            14
      Blaster                1             1
      Frethem                3             2
      Gibe                   1             1
      Inor                   2             0
      Klez                   1             1
      Mitglieder             2             2
      MyDoom                 10            8
      MyLife                 5             5
      Netsky                 8             8
      Sasser                 3             2
      Randex                 2             1
      Spybot                 1             0
      Pestlogger             1             1
      Welchia                6             6
      SDBot                  8             5
      Moega                  3             3
      Threshold  % correct  # correct  false fam.  both  miss. fam.
      0.7        0.8        62         5           8     2
      0.75       0.8        62         5           6     4
      0.8        0.82       63         3           6     5
      0.85       0.82       63         2           4     8
      0.9        0.79       61         1           4     10
      0.95       0.79       61         2           3     11
      0.99       0.62       48         0           2     27
      (Figure labels: Discrepancies, Misclassifications)
  • 27. Win 32 API call: Packers
      Wide variety of different packers used within same families
      Dynamic Win 32 API call fingerprint seems robust towards packer
      8 Netsky variants in sample, 7 identified
      Variant    Packer         Identified
      Netsky.AB  PECompact      yes
      Netsky.B   UPX            no
      Netsky.C   PEtite         yes
      Netsky.D   PEtite         yes
      Netsky.K   tElock         yes
      Netsky.P   FSG            yes
      Netsky.S   PE-Patch, UPX  yes
      Netsky.Y   PE Pack        yes
  • 28. Summary: Win 32 API calls Simple model yields > 80% correct classification. Resolved discrepancies between some AV scanners. Dynamic API call vectors seem robust towards different packers. Allows researchers and analysts to quickly identify variants reasonably well, without manual analysis. (Diagram components: Malicious Code, Data Collection, Log File, Vector Builder, Database, Vector Comparison, Family.)
  • 29. API call : Further directions Acquire more malware samples for better variant classification Explore resiliency to obfuscation techniques (substitutions of Win 32 API calls, call spamming) Investigate patterns of ‘call bundles’ instead of just isolated calls for richer identification Replace VSM with finite state automaton that captures rich set of call relations
  • 30. Related Work R. Sekar et al (2001): A Fast Automaton-Based Method for Detecting Anomalous Program Behaviour J. Rabek et al (2003): DOME – Detection of Injected, Dynamically Generated, and Obfuscated Malicious Code K. Rozinov (2005): Efficient Static Analysis of Executables for Detecting Malicious Behaviour
  • 31. Fingerprint: PDG measures Goal: Compare ‘graph structure’ fingerprint of unknown binaries across non-malicious software and malware classes for identification, classification and prediction purposes Synopsis: Represent binaries as a System Dependence Graph, extract graph features to construct ‘graph-structural’ fingerprints for particular software classes Main result: Work in progress
  • 32. Program Dependence Graph A PDG models intra-procedural dependences. Data Dependence: Program statements compute data that are used by other statements. Control Dependence: Arises from the ordered flow of control in a program. Picture from J. Stafford (Colorado, Boulder)
  • 34. System Dependence Graph An SDG models control, data, and call dependences in a program. The SDG of a program is the aggregate of its PDGs, augmented with the calls between functions. Picture from J. Stafford (Colorado, Boulder)
  • 36. Graph measures as a fingerprint
      Binaries represented as a System Dependence Graph (SDG)
      Tally distributions on graph measures:
      Edge weights (weight is jump distance)
      Node weight (weight is number of statements in basic block)
      Centrality (“How important is the node”)
      Clustering Coefficient (“probability of connected neighbours”)
      Motifs (“recurring patterns”)
      → Statistical structural fingerprint
      Example of a graph-structural measure: H. Flake (BH 05)
  • 37. Primer: Graphs Graphs (Networks) are made up of vertices (nodes) and edges An edge connects two vertices Nodes and edges can have different weights (b,c) Edges can have directions (d) Picture from Mark Newman (2003)
  • 38. Modeling with Graphs
      Category (description): Example (nodes, edge)
      Social (set of people/groups with some interaction between them): Friendship (people, friendship bond); Business (companies, business dealings); Movies (actors, collaboration); Phone calls (number, call)
      Information (information linked together): Citation (paper, cited); Thesaurus (words, synonym); www (html pages, URL links)
      Technology / Transmission (resource or commodity distribution/transmission): Power grid (power station, lines); Telephone; Internet (routers, physical links)
      Physical Science (physical ‘entities’ interacting): Chemistry (conformation of polymers, transitions)
      Biological (biological ‘entities’ interacting): Genetic regulatory network (proteins, dependence); Cardiovascular (organs, veins) [also transmission]; Food (predator, prey); Neural (neurons, axons)
  • 39. Measures: Centrality Centrality tries to measure the ‘importance’ of a vertex. Degree centrality: “How many nodes are connected to me?” Closeness centrality: “How close am I to all other nodes?” Betweenness centrality: “How important am I for any two nodes?” Freeman metric computes centralization for entire graph
  • 40. Measure: Degree centrality
      Degree centrality: “How many nodes are connected to me?”
      $M_d(n_i) = \sum_{j=1,\, j \neq i}^{N} \mathrm{edge}(n_i, n_j)$
      Normalized: $M'_d(n_i) = \frac{\sum_{j=1,\, j \neq i}^{N} \mathrm{edge}(n_i, n_j)}{N - 1}$
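As a small worked example of the degree formula, the sketch below treats a graph as a dictionary mapping each node to the set of its neighbours. That representation, the undirected star graph and the function name are assumptions for illustration only.

```python
def degree_centrality(adj, node, normalized=False):
    """Count the nodes adjacent to `node`; optionally divide by N - 1.

    adj: dict mapping node -> set of neighbours (undirected, N >= 2).
    """
    n = len(adj)
    # edge(n_i, n_j) is 1 when j is a neighbour of i, summed over j != i.
    degree = sum(1 for other in adj if other != node and other in adj[node])
    return degree / (n - 1) if normalized else degree

# Star graph: "a" is connected to every other node.
star = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"}}
print(degree_centrality(star, "a"), degree_centrality(star, "a", normalized=True))
print(degree_centrality(star, "b"), degree_centrality(star, "b", normalized=True))
```

The normalized value is 1 exactly when a node is connected to all N − 1 others, which makes graphs of different sizes comparable.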
  • 41. Measure: Closeness centrality
      Closeness centrality: “How close am I to all other nodes?”
      $M_c(n_i) = \left[ \sum_{j=1}^{N} d(n_i, n_j) \right]^{-1}$
      Normalized: $M'_c(n_i) = M_c(n_i)\,(N - 1)$
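The closeness formula needs the shortest-path distance d(n_i, n_j) to every other node. A minimal sketch for unweighted, undirected, connected graphs using breadth-first search follows; the graph representation and function names are assumptions, and weighted SDG edges would need a different distance routine.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness_centrality(adj, node, normalized=False):
    """Inverse of the summed distances from node; optionally times N - 1."""
    dist = bfs_distances(adj, node)
    if len(dist) < len(adj) or len(adj) < 2:
        raise ValueError("needs a connected graph with at least two nodes")
    value = 1 / sum(dist.values())
    return value * (len(adj) - 1) if normalized else value

# Path graph a - b - c: the middle node is closest to everyone.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(closeness_centrality(path, "b", normalized=True))
print(closeness_centrality(path, "a", normalized=True))
```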
  • 42. Measure: Betweenness centrality
      Betweenness centrality: “How important am I for any two nodes?”
      $C_B(n_i) = \sum_{j \neq k \neq i} g_{jk}(n_i) / g_{jk}$
      Normalized: $C'_B(n_i) = \frac{C_B(n_i)}{(N - 1)(N - 2) / 2}$
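In the usual reading of this formula, g_jk is the number of shortest paths between n_j and n_k and g_jk(n_i) is the number of those that pass through n_i. The sketch below computes it by brute force for small unweighted, undirected graphs, summing over unordered pairs so that the (N − 1)(N − 2)/2 normalization applies. The representation and names are assumptions, and larger graphs would call for Brandes' algorithm instead.

```python
from collections import deque
from itertools import combinations

def shortest_path_counts(adj, source):
    """BFS returning (distance, number of shortest paths) from source."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]  # u is a predecessor of v
    return dist, sigma

def betweenness_centrality(adj, node, normalized=False):
    """Sum over unordered pairs {j, k} of g_jk(node) / g_jk (needs N >= 3)."""
    info = {s: shortest_path_counts(adj, s) for s in adj}
    dist_i, sigma_i = info[node]
    total = 0.0
    others = [v for v in adj if v != node]
    for j, k in combinations(others, 2):
        dist_j, sigma_j = info[j]
        if k not in dist_j or node not in dist_j or k not in dist_i:
            continue  # pair or node unreachable
        # node lies on a shortest j-k path iff the distances add up.
        if dist_j[node] + dist_i[k] == dist_j[k]:
            total += sigma_j[node] * sigma_i[k] / sigma_j[k]
    if normalized:
        n = len(adj)
        total /= (n - 1) * (n - 2) / 2
    return total

star = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"}}
cycle = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
print(betweenness_centrality(star, "a", normalized=True))
print(betweenness_centrality(cycle, "b"))
```

In the 4-cycle, node b carries one of the two shortest a-c paths, so it scores 0.5, while the hub of the star carries every leaf-to-leaf path.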
  • 43. Local structure: Clusters and Motifs Graphs can be decomposed into constitutive subgraphs Subgraph: A subset of nodes of the original graph and of edges connecting them (does not have to contain all the edges of a node) Cluster: Connected subgraph Motif: Recurring subgraph in networks at higher frequencies than expected by random chance
  • 44. Measure: Clustering Coefficient Slide from Albert (Penn State)
  • 45. Measure: Network motifs A motif in a network is a subgraph ‘pattern’ Recurs in networks at higher frequencies than expected by random chance Motif may reflect underlying generative processes, design principles and constraints and driving dynamics of the network
  • 46. Motif example: Feed-forward loop: X regulates Y and Z; Y regulates Z. Found in Biochemistry (Transcriptional regulation), Neurobiology (Neuron connectivity), Ecology (food web), Engineering (electronic circuits) .. and maybe Computer Science (PDGs) ??
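Counting occurrences of the feed-forward loop in a directed graph is a direct translation of its definition. The sketch below assumes the graph is a dictionary mapping each node to the set of nodes it points to (a hypothetical representation). It only counts occurrences; deciding whether the pattern is a motif would additionally require comparing the count against randomized networks, which is not shown.

```python
def feed_forward_loops(succ):
    """Return all (x, y, z) with edges x -> y, x -> z and y -> z, nodes distinct.

    succ: dict mapping node -> set of successor nodes (directed graph).
    """
    loops = []
    for x, x_targets in succ.items():
        for y in x_targets:
            if y == x:
                continue
            for z in succ.get(y, ()):
                if z != x and z != y and z in x_targets:
                    loops.append((x, y, z))
    return loops

# X regulates Y and Z; Y regulates Z; W is an unrelated extra node.
graph = {"X": {"Y", "Z"}, "Y": {"Z"}, "Z": set(), "W": {"X"}}
print(feed_forward_loops(graph))
```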
  • 47. Graph measures as a fingerprint
      Binaries represented as a System Dependence Graph (SDG)
      Tally distributions on graph measures:
      Edge weights (weight is jump distance)
      Node weight (weight is number of statements in basic block)
      Centrality (“How important is the node”)
      Clustering Coefficient (“probability of connected neighbours”)
      Motifs (“recurring patterns”)
      → Statistical structural fingerprint
  • 48. Summary: SDG measures Goal: Compare ‘graph structure’ fingerprint of unknown binaries across non-malicious software and malware classes for identification, classification and prediction purposes Synopsis: Represent binaries as a System Dependence Graph, extract graph features to construct ‘graph-structural’ fingerprints for particular software classes Main result: Work in progress
  • 49. Related Work H. Flake (2005): Compare, Port, Navigate M. Christodorescu (2003): Static Analysis of Executables to Detect Malicious Patterns A. Kiss (2005): Using Dynamic Information in the Interprocedural Static Slicing of Binary Executables
  • 50. References
      Statistical testing:
      S. Haberman (1973): The Analysis of Residuals in Cross-Classified Tables, pp. 205-220
      B.S. Everitt (1992): The Analysis of Contingency Tables (2nd Ed.)
      Network Graph Measures and Network Motifs:
      L. Amaral et al (2000): Classes of Small-World Networks
      R. Milo, U. Alon et al (2002): Network Motifs: Simple Building Blocks of Complex Networks
      M. Newman (2003): The structure and function of complex networks
      D. Bilar (2006): Science of Networks. http://cs.colby.edu/courses/cs298
      System Dependence Graphs:
      GrammaTech Inc.: Static Program Dependence Analysis via Dependence Graphs. http://www.codesurfer.com/papers/
      Á. Kiss et al (2003): Interprocedural Static Slicing of Binary Executables