SlideShare a Scribd company logo
1 of 9
Download to read offline
Architecture of a Morphological Malware Detector
Guillaume Bonfante, Matthieu Kaczmarek and Jean-Yves Marion
Nancy-Universit´e - Loria - INPL - Ecole Nationale Sup´erieure des Mines de Nancy
B.P. 239, 54506 Vandœuvre-l`es-Nancy C´edex, France
July 15, 2008
Abstract
Most of malware detectors are based on syntactic
signatures that identify known malicious programs.
Up to now this architecture has been sufficiently ef-
ficient to overcome most of malware attacks. Nev-
ertheless, the complexity of malicious codes still in-
crease. As a result the time required to reverse
engineer malicious programs and to forge new sig-
natures is increasingly longer.
This study proposes an efficient construction of a
morphological malware detector, that is a detector
which associates syntactic and semantic analysis.
It aims at facilitating the task of malware analysts
providing some abstraction on the signature rep-
resentation which is based on control flow graphs
(CFG).
We build an efficient signature matching engine
over tree automata techniques. Moreover we de-
scribe a generic graph rewriting engine in order to
deal with classic mutations techniques. Finally, we
provide a preliminary evaluation of the strategy de-
tection carrying out experiments on a malware col-
lection.
Introduction
The identification of malicious behavior is a diffi-
cult task. Until now, no technologies have been able
to automatically prevent the spread of malware.
Several approaches have been considered but nei-
ther syntactic analysis nor behavioral consideration
were really effective. Presently, human analysis of
malware seems to be the best strategy, next mal-
ware detectors based on string signature remains
the most reliable solution. From this point of view,
we have tried to easier the task which consist in
finding a good signature within a malicious pro-
grams. Our technique has been inspired from the
article [6] where control flow graphs are used to de-
tect the different instances of the computer virus
MetaPHOR.
Generally speaking, detection strategies based on
string signatures uses a database of regular expres-
sions and a string matching engine to scan files and
to detect infected ones. Each regular expression
of the database is designed to identify a known
malicious program. There are at least three dif-
ficulties tied to this approach. First, the identi-
fication of a malware signature requires a human
expert and the time to forge a reliable signature
is long compared to the time related to a malware
attack. Second, string signature approach can be
easily bypassed by obfuscation methods. Among
recent work treating this subject, we propose to
see for example [4, 7, 14]. Third, as the quantity
of malware increase, the ratio of false positives be-
comes a crucial issue. And removing old malware
signatures would open doors for outbreaks of re-
engineered malware.
Thus, a current trend in the community proposes
to design next generation of malware detectors over
semantic aspects. [11, 9, 20]. However, most of se-
mantic properties are difficult to decide and even
heuristics can be very complex as it is illustrated
in the field of computer safety. For those reasons,
in [5] we try to propose and to construct a morpho-
logical analysis in order to detect malware. The
idea is to recognize the shape of a malicious pro-
gram. That is, unlike string signature detection, we
are not only considering a program as a flat text,
but rather as a semantics object, so adding in some
sense a dimension to the analysis. Our approach
tries to combine several features: (a) to associate
1
inria-00330022,version1-13Oct2008 Author manuscript, published in "Journal in Computer Virology (2008)"
DOI : 10.1007/s11416-008-0102-4
syntactic and semantic analysis, (b) to be efficient
and (c) to be as automatic as possible.
Our morphological detector uses a set of CFG
which plays the role of a malware signature
database. Next, the detection consists in scanning
files in order to recognize the shape of a malware.
This design is closed to a string signature based de-
tector and so we think that both approaches may be
combined in a near future. Moreover, it is impor-
tant to notice that this framework make the signa-
ture extraction easier. Indeed, either the extraction
is fully automatic when the malware CFG is rele-
vant or the task of signature makers is facilitated
since they can work on an abstract representation
of malicious programs.
This detection strategy is close to the ones pre-
sented in [9, 6] but we put our strengths to opti-
mize the efficiency of algorithms. For that sake,
we use tree automata, a generalization to trees of
finite state automata over strings [10]. Intuitively,
we transform CFG into trees with pointers in or-
der to represent back edges and cross edges. Then,
the collection of malware signatures is a finite set
of trees and so a regular tree language. Thanks to
the construction of Myhill-Nerode, the minimal au-
tomaton gives us a compact and efficient database.
Notice that the construction of the database is it-
erative and it is easy to add the CFG of a newly
discovered malicious program.
Another issue of malware detections is the sound-
ness with respect to classic mutation techniques.
Here, we detect isomorphic CFG and so several
comon obfuscation methods are canonically re-
moved. Moreover, we add a rewriting engine which
normalizes CFG in order to have a robust represen-
tation of the control flow with respect to mutations.
Related works are [6, 8, 20] where data flow of pro-
grams is also considered.
The design of this complete chain of process is
summarized by Figure 1.
We also provide large scale experiments, with a
collection of 10156 malicious programs and 2653
sane programs. Those results are promising, with
a completely automatic method for the signature
extraction we have obtained a false positive ratio
of 0.1%.
This study is organized as follows. First we ex-
pose the principles of CFG extraction and normal-
ization. Then, we present a matching engine for
CFG that is based on tree automata. Finally we
Figure 1: Design of the control flow detector
carry out some experiments to validate our method.
1 CFG in x86 languages
Road-map. Since we focus on practical aspects
we choose to work on a concrete assembly language.
This language is close to the x86 assembly lan-
guage. We detail how to extract CFG from pro-
grams, we underline the difficulties that can be en-
countered and we outline how they can be overcome
with classic methods. Finally, we study the prob-
lem of CFG mutations. We propose to normalize
the extracted CFG according to rewriting rules in
order to remove common mutations.
An x86 assembly language. We present the
grammar of the studied programming language.
The computation domain is the integers and we use
a subset of the commands of the x86 assembly lan-
guage. The important feature is that we consider
the same flow instructions as in x86 achitectures, as
a result the method that we develop can be directly
applied to concrete programs.
2
inria-00330022,version1-13Oct2008
Addresses N
Offsets Z
Registers R
Expressions E ::= Z | N | R | [N] | [R]
Flow
instructions
If
::= jmp E | call E | ret | jcc Z
Sequential
instructions
Id
::= mov E E | comp E E | . . .
Programs P ::= Id
| If
| P ; P
Next, a program is a sequence of instructions p =
i0 ; . . . ; in−1. The address of the instruction ik is
k. In order to ease the reading and without loss of
generality, we suppose that i0 is the first instruction
to be executed, the address 0 is the so called entry
point of the program.
We observe that the control flow of programs
is driven by only four kinds of flow instructions.
Given an instruction ik ∈ If
, the possible transfers
of control are the following.
• If ik is an unconditional jump jmp e. The con-
trol is transferred to the address given by the
value of the expression e.
• If ik is a conditional jump jcc x. If its asso-
ciated condition is true, the control is trans-
ferred to the address k + x. Otherwise, the
control is transferred to the address k + 1.
• If ik is a function call call e. The address
k + 1 is pushed on the stack and the control is
transferred to the the value of the expression
e.
• If ik is a function return ret. An address is
popped from the stack and the control is trans-
ferred to this address.
Prerequisites. The extraction of the CFG from
a program is tied to several difficulties. First, we
need access to the instructions of the program.
As a result packing and encryption techniques can
thwart the extraction. This problem is part of
the folklore, indeed classical string signature de-
tectors also have to face those techniques. Many
solutions such as sand-boxes and generic unpack-
ers have been developed to overcome this difficulty.
The presentation of those solutions exceeds the
scope of the current study then we refer to the text-
books [13, 12, 18].
Second, we are confronted to obscure sequenceq
of instructions such as push a; ret which have the
same behavior as the instruction jmp a. This is also
part of the folklore and we will suppose that such
sequence of instructions are normalized during the
disassembly phase of the extraction.
Third, the target addresses of jumps and function
calls have to be dynamically computed. For exam-
ple, when we encounter the instruction jmp eax we
need the value of the register eax in order to follow
the control flow. In such cases we rely on an heuris-
tic (|e|) which provides the value of the expression e
if it can be statically computed. If the value cannot
be computed then (|e|) = ⊥. Such an heuristic can
be based on partial evaluation, emulation or any
other static analysis technique.
The extraction procedure. The control flow
consists in the different paths that might be tra-
versed through the program during its execution.
It is frequently represented by a graph named a
control flow graph (CFG). The vertices stand for
addresses of instructions and the edges represent
the possible paths that the control flow can follow.
We suppose that we have access to the code of
programs and that we have an heuristic (| |) to eval-
uate expressions. Table 1 presents a procedure to
extract CFG from programs. We observe that this
procedure closely follows the semantics of flow in-
structions. Indeed, the vertices of the CFG are la-
beled accordingly to the instruction at the night
address and the nodes are linked according to the
possible control transfers.
• The symbol inst of arity 1 labels addresses
of sequential instructions. There is only one
successor: the address of the next instruction.
• The symbol jmp of arity 1 labels addresses of
unconditional jumps. There is only one suc-
cessor: the address to jump to.
• The symbol jcc of arity 2 labels addresses of
conditional jumps. There is two successors:
the address to jump to when the condition is
true and the address of the next instruction
where the control is transfered when the con-
dition is false.
• The symbol call of arity 2 labels addresses of
function calls. There is two successors: the
address of the function to call and the return
3
inria-00330022,version1-13Oct2008
address, that is the address of the next instruc-
tion.
• The symbol end of arity 0 labels addresses of
function returns and undefined instructions.
There is no successor.
The entry point of the program correspond to the
root of the CFG.
Instruction Graph
in ∈ Id
in = jmp e
(|e|) = k
in = call e
(|e|) = k
in = jcc x
Otherwise
Table 1: Control flow graph extraction
Normalizing mutations. Our CFG representa-
tion is a rough abstraction of programs. Indeed
we do not make any distinction between the dif-
ferent kinds of sequential instruction, all of them
are represented by nodes labeled with the symbol
inst. This first abstraction level makes the CFG
sound with respect to mutations which substitutes
instructions with the same behavior. For example
the replacement of the instruction mov eax 0 by the
instruction xor eax eax does not impact our CFG
representation.
However, the soundness with respect to classic
mutations techniques remains an important issue.
Indeed, some well know mutation techniques can
alter the CFG of malicious programs. In order to
recover a sound representation of the control flow
we apply reductions on CFG. A reduction is defined
by a graph rewriting rule. As a case study, we con-
sider three reductions associated to classic muta-
tion techniques. Of course several other reductions
can be defined in order to handle more mutations
techniques. We use the following reductions
• Concatenate consecutive instructions into
blocks of instructions.
• Realign code removing superfluous uncondi-
tional jumps.
• Merge consecutive conditional jumps.
Those abstractions can be defined through the
graph rewriting rules of Table 2. Figure 2 presents
an assembly program and its reduced CFG.
0: cmp eax 0
1: jne +7
2: mov ecx eax
3: dec ecx
4: mul eax ecx
5: cmp ecx 1
6: jne −3
7: jmp +2
8: inc ecx
9: ret
Figure 2: A program and its CFG
We remark that each rewriting rule impose a
diminution of the size of the rewritten graph then
the reduction clearly terminates. Moreover, since
there is no critical pair we have no problem of
confluence. Nevertheless, normalizing mutations
through rewriting rules is a generic principle that
could be applied on sophisticated cases. Then, the
issues of termination and confluence shall be care-
fully considered.
Table 3 presents mutations of the program of Fig-
ure 2. All of them have the same reduced CFG as
the original program.
2 Efficient database
Road-map. Morphological detection is based on
a set of malware CFG which plays the role of mal-
ware signatures. This collection of CFG is compiled
into a tree automaton thanks to a term represen-
tation. Since tree automata fulfill a minimization
property, we obtain an efficient representation of
the database. Next, we apply this framework for
the sub-CFG isomorphism problem in order to de-
tect malware infections.
From graphs to terms. A path is a word over
{1, 2}∗
, we write the empty path. We define the
4
inria-00330022,version1-13Oct2008
Concatenate
instructions
Realign code Merge jcc
→ → →
Table 2: Control flow graph reductions
Instruction
substitution
Block
substitution
Block
permutation
jcc
obfuscation
All in one
0: cmp eax 0
1: jne +7
2: mov ecx eax
3: sub ecx 1
4: mul eax ecx
5: cmp ecx 1
6: jne −3
7: jmp +2
8: mov eax 1
9: ret
0: cmp eax 0
1: jne +8
2: push eax
3: pop ecx
4: dec ecx
5: mul eax ecx
6: cmp ecx 1
7: jne −3
8: jmp +2
9: inc ecx
10: ret
0: cmp eax 0
1: jne +7
2: mov ecx eax
3: dec ecx
4: mul eax ecx
5: cmp ecx 1
6: jne −3
9: ret
8: inc ecx
9: jmp −2
0: cmp eax 0
1: jne +9
2: mov ecx eax
3: dec ecx
4: mul eax ecx
5: cmp ecx 2
6: ja −3
7: cmp ecx 1
8: jne −5
9: jmp +2
10: inc ecx
11: ret
0: cmp eax 0
1: je +2
2: jmp +10
2: push eax
3: pop ecx
4: sub ecx 1
5: mul eax ecx
6: cmp ecx 2
7: ja −3
8: cmp ecx 1
9: jne −5
10: ret
11: mov ecx 1
12: jmp −2
Table 3: Control flow graph mutations
path order for any path ρ, τ ∈ {1, 2}∗
and any in-
teger i ∈ {1, 2} as follows.
ρ1 < ρ2 ρ < ρi ρ < τ ⇒ ρρ < ττ
A tree domain is a set d ⊂ {1, 2}∗
such that for
any path ρ ∈ {1, 2}∗
and any integer i ∈ {1, 2} we
have
ρi ∈ d ⇒ ρ ∈ d
A tree over a set of symbols F is a pair t =
(d(t), ˆt) where d(t) is a tree domain and ˆt is a func-
tion from d(t) to F.
We consider the set of symbols F =
{inst, jmp, call, jcc, ret} ∪ {1, 2}∗
and the trees
overs this set. In such trees, a nodes labeled by
a path ρ = {1, 2}∗
is thought as pointers to the
night node of the tree. Then, a tree have two kinds
of nodes: the inner nodes labeled by symbols of
{inst, jmp, call, jcc, ret} and the pointer nodes
labeled by path in {1, 2}ρ
. In the following we
write ˚d(t) the set of inner nodes of the tree t, that
is
˚d(t) = ρ
ρ ∈ d(t)
ˆt(ρ) ∈ {inst, jmp, call, jcc, ret}
Next a tree t is well formed if for any paths ρ, τ ∈
d(t)
ˆt(ρ) = τ ⇒ τ ∈ ˚d(t) and ρ ≤ τ
We observe that any CFG can be represented by
a unique well formed tree.
Tree automata. A finite tree automaton is a tu-
ple A = (Q, F, Qf , ∆), where Q is a finite set of
states, F is a set of symbols, Qf ⊂ Q is a set of
final states and ∆ is a finite set of transition rules
of the type a(q1 . . . qi) → q with a ∈ F has arity i
and q, q1, . . . , qi ∈ Q.
A run of an automaton on a tree t starts at the
leaves and moves upward, associating a state with
5
inria-00330022,version1-13Oct2008
each sub-tree. Any symbol a of arity 0 is labeled
by q if a → q ∈ ∆. Next, if the direct sub-trees
t1, . . . , tn of a tree t = a(t1, . . . , tn) are respectively
labeled by states q1, . . . , qn then the tree t is labeled
by the state q if a(q1, . . . , qn) → q ∈ ∆. A tree t is
accepted by the automaton if the run labels t with
a final state. We observe that a run on a tree t can
be computed in linear time, that is O(n) where n
is the size of t, that is the number of its nodes.
For any automaton A, we write L(A) the set
of trees accepted by A. A language of trees L is
recognizable if there is a tree automaton A such that
L = L(A). We define the size |A| of an automaton
A as the number of its rules.
Tree automata have interesting properties. First,
it is easy to build an automaton which recognize a
given finite set of trees. This operation can be done
in linear time, that is O(n) where n is the sum
of the sizes of the trees in the language. Second,
we can add new trees to the language recognized
by an automaton computing a union of automata,
see [10]. Given an automaton A, the union of A
with an automaton A can be computed in linear
time, that is O(|A |).
Finally, for a given recognizable tree language,
there exists a unique minimal automaton in the
number of states which recognizes this language.
This property ensures that the minimal automaton
is the best representation of the tree language.
Theorem 1 (From [10]). For any tree automaton
A which recognize a tree language L we can com-
pute in quadratic time (O(|A|2
)) a tree automaton
A which is the minimum tree automaton recogniz-
ing L up to a renaming of the states.
Building the database. We explain how this
framework can be used to detect malware infec-
tions. Suppose that we have a set {t1, . . . , tn} of
malware CFG represented by trees. Since this set
is finite, there is a tree automaton A which recog-
nizes it.
Next, consider the tree representation t of a given
program. Computing a run of A on t, we can decide
in linear time if this tree is one of the the trees
obtained from malware CFG. This means that that
we can efficiently decide if a program has the same
CFG as a known malware.
Finally, we can speed-up the detection comput-
ing the minimal automaton which recognize the
language {t1, . . . , tn}. From a practical point of
view this is the most efficient representation of the
malware CFG database.
Detecting infections. Actually, when a mali-
cious program infects an other program, it includes
its own code within the program of its host. Then,
we can reasonably suppose that the CFG of the ma-
licious program appears as a subgraph of the global
CFG of the infected program. As a result, we can
detect such an infection by deciding the subgraph
isomorphism problem within the context of CFG.
First we have to observe that we are not con-
fronted with the general sub-graph isomorphism
since CFG are graphs with strong constraints. In
particular the edge labeling property implies that
a CFG composed of n nodes accepts at most n
subgraphs. As a result, the sub-CFG isomorphism
problem is not NP-complete. Then to detect sub-
CFG it is sufficient to run the automaton on the
tree representations of any sub-CFG.
3 Experiments
Road-map. We consider the win32 binaries of
VX Heavens malware collection [2]. This collection
is composed of 10156 malicious programs. Then,
we have collected 2653 win32 binaries from a fresh
installation of Windows VistaTM
. This second col-
lection is considered as sane programs.
Using those samples we experiments with our im-
plementation of the morphological detector. We
focus our attention on false positive ratios in order
to validate the our method. Indeed, we have to
know if it is possible to discriminate sane programs
from malicious ones only considering their CFG.
The following experimental results agree with this
hypothesis.
CFG extraction in practice. To overcome the
difficulties of CFG extraction we have chosen the
following solutions
• In order to deal with crypted and packed
samples, we use the unpacking capabilities of
ClamAVTM
[3].
• We have implemented a dynamic disassembler
based on the disassembler library Udis86 [1].
Our module is able to follow the control flow
6
inria-00330022,version1-13Oct2008
and it keeps track of the stack in order to re-
move push a; ret sequences.
• The evaluation heuristic (|e|) proceed as follows.
When we encounter a dynamic flow instruc-
tion, we emulate the preceding block of sequen-
tial instructions in order to recover the value of
the expression e. Our emulation technology is
also build over Udis86. It is limited to a subset
of x86 assembly instruction, interruptions and
system calls are not taken into account.
• We reduce the obtained CFG according to the
rules of Table 2.
Figure 3 gives the sizes of the reduced CFG ex-
tracted from the programs of those collections. On
the X axis we have the upper bound on the size of
CFG and on the Y axis we have the percentage of
CFG whose size is lower than the bound.
Figure 3: Sizes of control flow graphs
We obeserve that about 5% of the database are
programs with a non valid PE header, they pro-
duce an empty graph. Then we are able to extract
a CFG of more that 15 nodes from about 65% of
the samples. The remaining 30% produces a CFG
which have between 1 and 15. We think that those
graphs are too small to be relevant. We are cur-
runtly working on this part of the samples to im-
prove our extraction procedure.
Building the database. The size of malware
control flow graphs clearly impact the accuracy
of the control flow detector. We have observed
that the graphs extracted from some malware were
too small to be relevant and the resulting detec-
tor makes many false alerts because of a few such
graphs. As a result, we impose a lower bound
on the size of the graphs that we include in the
database. Next, we have done several tests using
different lower bounds.
Let N ∈ N be the lower bound on the size
of CFG. We build the minimized automaton AN
M
which recognizes the set of tree representations of
reduced malware CFG that are composed of more
than N nodes. We define the morphological de-
tector DN
M as a predicate such that for any pro-
gram p ∈ P we have DN
M (p) = 1 if a malware
CFG appears as a subgraph of the CFG of p and
DN
M (p) = 0 otherwise. We have seen in the previ-
ous sections that DN
M can be decided using AN
M.
This design has several advantages. First, when a
new malicious program is discovered, one can easily
add its CFG to the database using the union of
tree automata and a new compilation to obtain a
minimal tree automaton.
The computation of the ‘not minimal’ automata
takes about 25 minutes. The minimization takes
several hours but this delay is not so important.
Indeed, within the context of an update of the mal-
ware database, during the minimization we can re-
lease the ‘not minimal’ automaton. Indeed, even if
this is not the best automaton it still recognize the
malware database and it could be used until the
minimization is terminated.
Evaluation. We are interested in false positives,
that is sane programs detected as malicious. For
that, we have collected 2653 programs from a fresh
installation of Windows VistaTM
. Let us note S this
set of programs. Let N ∈ N be a lower bound on
the size of malware CFG, we consider the following
approximation of the false positives of the detector
DN
M
False positives {p | DN
M (p) = 1 and p ∈ S}
We do not evaluate false negatives, that is unde-
tected malicious programs. Indeed, by construc-
tion all malicious programs of our malware col-
lection are detected by the morphological detec-
tor. Nevertheless, this methods seems promising
for this aspect. Indeed, the study [6] has shown
that a CFG based detection allows to detect the
high-obfuscating computer virus MetaPHOR with no
false negative.
However, the presence of the lower bound on the
size of malware CFG that enter in the database im-
plies that some malware are undetected. Those can
7
inria-00330022,version1-13Oct2008
be considered as false negatives, even if they more
related to the technical problem of CFG extraction
than to the methods of morphological detection.
The results of the experiments mention this ratio
of undetected malware.
Experimental results. We have built tree au-
tomata from the malware samples considering dif-
ferent lower bounds N on the size of CFG. Accord-
ing to the previous section we obtain the morpho-
logical detectors DN
M . We have tested those detec-
tors on the collection of saneware in order to eval-
uate the false positives. It takes about 5 h 30 min
to analyze the collection of saneware, this repre-
sents the analysis of 2 319 294 sub-CFG. Table 4
presents the results. The first column indicates the
considered the lower bound N. The second col-
umn indicates the ratio of false positives that is the
number of sane programs detected as malicious out
of the toalt number of sane programs. The third
column indicates the ratio of undetected malicious
programs that is the number of malware samples
with a CFG size lower than the bound out of the
total number of malware samples.
3.1 Analysis.
As expected, we observe that the false positives de-
crease with the lower bound on the size of CFG.
Over 15 nodes, the CFG seems to be a relevant
criterium to discriminate malware.
Concerning the remaining false positives. The li-
braries ir41 qc.dll and ir41 qcx.dll, and the mali-
cious program Trojan.Win32.Sechole have the same
CFG composed of more than 1 000 nodes. We
have tested those programs with commercial an-
tivirus software and the libraries ir41 qc.dll and
ir41 qcx.dll are not detected whereas the program
Trojan.Win32.Sechole is detected as malicious. The
malicious programs seems to be based on the dy-
namic library and the extraction algorithm was not
able to extract the CFG related to the malicious
program.
Concerning the ratio of undetected malware, The
only way to improve the detector is to implement
a better heuristic for control flow graph extraction.
In its current version our prototype only use a few
heuristics.
For comparison, statistical methods used in [16]
induce false negatives ratios between 36 % and 48 %
Lower Bound False positives Undetected
1 100.00% 4.80%
2 83.78% 5.43%
3 76.82% 16.43%
4 76.77% 16.66%
5 57.98% 20.01%
6 34.84% 21.50%
7 20.57% 23.34%
9 12.06% 24.43%
10 2.17% 26.47%
11 2.04% 27.78%
12 1.60% 29.35%
13 0.71% 30.74%
15 0.09% 36.52%
Table 4: Results of the experiments
and false positive ratios between 0.5 % and 34 %.
A detector based on artificial neural networks de-
veloped at IBM [19] presents false negatives ratios
between 15 % and 20 % and false positive ratios
lower than 1 %. The data mining methods surveyed
in [17] present false negatives ratios between 2.3 %
and 64.4 % and false positive ratios between 2.2 %
and 47.5 %. Heuristics methods from antivirus in-
dustry tested in [15] present false negatives ratios
between 20.0 % and 48.6 % and false positive ratios
lower than 0.2 %.
References
[1] http://udis86.sourceforge.net.
[2] http://vx.netlux.org.
[3] http://www.clamav.net.
8
inria-00330022,version1-13Oct2008
[4] Ph Beaucamps and E Filiol. On the possibility
of practically obfuscating programs towards a
unified perspective of code protection. Journal
in Computer Virology, 3(1):3–21, April 2007.
[5] G. Bonfante, M. Kaczmarek, and J.Y. Marion.
Control Flow Graphs as Malware Signatures.
WTCV, May, 2007.
[6] D. Bruschi, Martignoni, L., and M. Monga.
Detecting self-mutating malware using
control-flow graph matching. Technical
report, Universit`a degli Studi di Milano,
September 2006.
[7] M. Christodorescu and S. Jha. Testing mal-
ware detectors. ACM SIGSOFT Software En-
gineering Notes, 29(4):34–44, 2004.
[8] M. Christodorescu, S. Jha, J. Kinder,
S. Katzenbeisser, and H. Veith. Software
transformations to improve malware detec-
tion. Journal in Computer Virology, 3(4):253–
265, 2007.
[9] M. Christodorescu, S. Jha, S.A. Seshia,
D. Song, and R.E. Bryant. Semantics-aware
malware detection. IEEE Symposium on Se-
curity and Privacy, 2005.
[10] H. Comon, M. Dauchet, R. Gilleron,
F. Jacquemard, D. Lugiez, S. Tison, and
M. Tommasi. Tree automata techniques
and applications. Available on: http://www.
grappa. univ-lille3. fr/tata, 10, 1997.
[11] M. Dalla Preda, M. Christodorescu, S. Jha,
and S. Debray. A Semantics-Based Approach
to Malware Detection. In POPL’07, 2007.
[12] E. Filiol. Computer Viruses: from Theory to
Applications. Springer-Verlag, 2005.
[13] E. Filiol. Advanced viral techniques: mathe-
matical and algorithmic aspects. Berlin Hei-
delberg New York: Springer, 2006.
[14] E. Filiol. Malware pattern scanning schemes
secure against black-box analysis. In 15th
EICAR, 2006.
[15] D. Gryaznov. Scanners of the Year 2000:
Heuristics. Proceedings of the 5th Interna-
tional Virus Bulletin, 1999.
[16] J.O. Kephart and W.C. Arnold. Automatic
Extraction of Computer Virus Signatures.
4th Virus Bulletin International Conference,
pages 178–184, 1994.
[17] M.G. Schultz, E. Eskin, E. Zadok, and S.J.
Stolfo. Data Mining Methods for Detection
of New Malicious Executables. Proceedings of
the IEEE Symposium on Security and Privacy,
page 38, 2001.
[18] P. Sz¨or. The Art of Computer Virus Research
and Defense. Addison-Wesley Professional,
2005.
[19] GJ Tesauro, JO Kephart, and GB Sorkin.
Neural networks for computer virus recogni-
tion. Expert, IEEE [see also IEEE Intelligent
Systems and Their Applications], 11(4):5–6,
1996.
[20] Andrew Walenstein, Rachit Mathur, Mo-
hamed R. Chouchane, and Arun Lakhotia.
Normalizing metamorphic malware using term
rewriting. scam, 0:75–84, 2006.
9
inria-00330022,version1-13Oct2008

More Related Content

What's hot

An Analytical Approach To Analyze The Impact Of Gray Hole Attacks In Manet
An Analytical Approach To Analyze The Impact Of Gray Hole Attacks In ManetAn Analytical Approach To Analyze The Impact Of Gray Hole Attacks In Manet
An Analytical Approach To Analyze The Impact Of Gray Hole Attacks In Manetidescitation
 
Iaetsd implementation of aho corasick algorithm
Iaetsd implementation of aho corasick algorithmIaetsd implementation of aho corasick algorithm
Iaetsd implementation of aho corasick algorithmIaetsd Iaetsd
 
Capturing Monotonic Components from Input Patterns
Capturing Monotonic Components from Input PatternsCapturing Monotonic Components from Input Patterns
Capturing Monotonic Components from Input Patternsijeukens
 
Applying the Scientific Method to Simulation Experiments
Applying the Scientific Method to Simulation ExperimentsApplying the Scientific Method to Simulation Experiments
Applying the Scientific Method to Simulation ExperimentsFrank Bergmann
 
ADRISYA: A FLOW BASED ANOMALY DETECTION SYSTEM FOR SLOW AND FAST SCAN
ADRISYA: A FLOW BASED ANOMALY DETECTION SYSTEM FOR SLOW AND FAST SCANADRISYA: A FLOW BASED ANOMALY DETECTION SYSTEM FOR SLOW AND FAST SCAN
ADRISYA: A FLOW BASED ANOMALY DETECTION SYSTEM FOR SLOW AND FAST SCANIJNSA Journal
 
Ontology model for c overflow vulnerabilities attack
Ontology model for c overflow vulnerabilities attackOntology model for c overflow vulnerabilities attack
Ontology model for c overflow vulnerabilities attackNurul Haszeli Ahmad
 
Adequacy of checksum algorithms for computer virus detection
Adequacy of checksum algorithms for computer virus detectionAdequacy of checksum algorithms for computer virus detection
Adequacy of checksum algorithms for computer virus detectionUltraUploader
 
Parameter Estimation of Software Reliability Growth Models Using Simulated An...
Parameter Estimation of Software Reliability Growth Models Using Simulated An...Parameter Estimation of Software Reliability Growth Models Using Simulated An...
Parameter Estimation of Software Reliability Growth Models Using Simulated An...Editor IJCATR
 
Software Reliability Growth Model with Logistic- Exponential Testing-Effort F...
Software Reliability Growth Model with Logistic- Exponential Testing-Effort F...Software Reliability Growth Model with Logistic- Exponential Testing-Effort F...
Software Reliability Growth Model with Logistic- Exponential Testing-Effort F...IDES Editor
 
Algebraic specification of computer viruses and their environments
Algebraic specification of computer viruses and their environmentsAlgebraic specification of computer viruses and their environments
Algebraic specification of computer viruses and their environmentsUltraUploader
 
You Cannot Fix What You Cannot Find! --- An Investigation of Fault Localizati...
You Cannot Fix What You Cannot Find! --- An Investigation of Fault Localizati...You Cannot Fix What You Cannot Find! --- An Investigation of Fault Localizati...
You Cannot Fix What You Cannot Find! --- An Investigation of Fault Localizati...Dongsun Kim
 
LSRepair: Live Search of Fix Ingredients for Automated Program Repair
LSRepair: Live Search of Fix Ingredients for Automated Program RepairLSRepair: Live Search of Fix Ingredients for Automated Program Repair
LSRepair: Live Search of Fix Ingredients for Automated Program RepairDongsun Kim
 
Dissertation Defense
Dissertation DefenseDissertation Defense
Dissertation DefenseSung Kim
 
Applications of Machine Learning and Metaheuristic Search to Security Testing
Applications of Machine Learning and Metaheuristic Search to Security TestingApplications of Machine Learning and Metaheuristic Search to Security Testing
Applications of Machine Learning and Metaheuristic Search to Security TestingLionel Briand
 
White space steganography on text
White space steganography on textWhite space steganography on text
White space steganography on textIJCNCJournal
 
Network Measurement and Monitori - Assigment 1, Group3, "Classification"
Network Measurement and Monitori - Assigment 1, Group3, "Classification"Network Measurement and Monitori - Assigment 1, Group3, "Classification"
Network Measurement and Monitori - Assigment 1, Group3, "Classification"Valentin Thirion
 
System Testing of Timing Requirements based on Use Cases and Timed Automata
System Testing of Timing Requirements based on Use Cases and Timed AutomataSystem Testing of Timing Requirements based on Use Cases and Timed Automata
System Testing of Timing Requirements based on Use Cases and Timed AutomataLionel Briand
 
An implementation of RSA policy
An implementation of RSA policyAn implementation of RSA policy
An implementation of RSA policySM NAZMUS SALEHIN
 
Feature Selection Techniques for Software Fault Prediction (Summary)
Feature Selection Techniques for Software Fault Prediction (Summary)Feature Selection Techniques for Software Fault Prediction (Summary)
Feature Selection Techniques for Software Fault Prediction (Summary)SungdoGu
 

What's hot (20)

An Analytical Approach To Analyze The Impact Of Gray Hole Attacks In Manet
An Analytical Approach To Analyze The Impact Of Gray Hole Attacks In ManetAn Analytical Approach To Analyze The Impact Of Gray Hole Attacks In Manet
An Analytical Approach To Analyze The Impact Of Gray Hole Attacks In Manet
 
Iaetsd implementation of aho corasick algorithm
Iaetsd implementation of aho corasick algorithmIaetsd implementation of aho corasick algorithm
Iaetsd implementation of aho corasick algorithm
 
Capturing Monotonic Components from Input Patterns
Capturing Monotonic Components from Input PatternsCapturing Monotonic Components from Input Patterns
Capturing Monotonic Components from Input Patterns
 
Applying the Scientific Method to Simulation Experiments
Applying the Scientific Method to Simulation ExperimentsApplying the Scientific Method to Simulation Experiments
Applying the Scientific Method to Simulation Experiments
 
ADRISYA: A FLOW BASED ANOMALY DETECTION SYSTEM FOR SLOW AND FAST SCAN
ADRISYA: A FLOW BASED ANOMALY DETECTION SYSTEM FOR SLOW AND FAST SCANADRISYA: A FLOW BASED ANOMALY DETECTION SYSTEM FOR SLOW AND FAST SCAN
ADRISYA: A FLOW BASED ANOMALY DETECTION SYSTEM FOR SLOW AND FAST SCAN
 
Ontology model for c overflow vulnerabilities attack
Ontology model for c overflow vulnerabilities attackOntology model for c overflow vulnerabilities attack
Ontology model for c overflow vulnerabilities attack
 
Adequacy of checksum algorithms for computer virus detection
Adequacy of checksum algorithms for computer virus detectionAdequacy of checksum algorithms for computer virus detection
Adequacy of checksum algorithms for computer virus detection
 
Parameter Estimation of Software Reliability Growth Models Using Simulated An...
Parameter Estimation of Software Reliability Growth Models Using Simulated An...Parameter Estimation of Software Reliability Growth Models Using Simulated An...
Parameter Estimation of Software Reliability Growth Models Using Simulated An...
 
Software Reliability Growth Model with Logistic- Exponential Testing-Effort F...
Software Reliability Growth Model with Logistic- Exponential Testing-Effort F...Software Reliability Growth Model with Logistic- Exponential Testing-Effort F...
Software Reliability Growth Model with Logistic- Exponential Testing-Effort F...
 
Algebraic specification of computer viruses and their environments
Algebraic specification of computer viruses and their environmentsAlgebraic specification of computer viruses and their environments
Algebraic specification of computer viruses and their environments
 
You Cannot Fix What You Cannot Find! --- An Investigation of Fault Localizati...
You Cannot Fix What You Cannot Find! --- An Investigation of Fault Localizati...You Cannot Fix What You Cannot Find! --- An Investigation of Fault Localizati...
You Cannot Fix What You Cannot Find! --- An Investigation of Fault Localizati...
 
LSRepair: Live Search of Fix Ingredients for Automated Program Repair
LSRepair: Live Search of Fix Ingredients for Automated Program RepairLSRepair: Live Search of Fix Ingredients for Automated Program Repair
LSRepair: Live Search of Fix Ingredients for Automated Program Repair
 
CORRECT-ICSE2016
CORRECT-ICSE2016CORRECT-ICSE2016
CORRECT-ICSE2016
 
Dissertation Defense
Dissertation DefenseDissertation Defense
Dissertation Defense
 
Applications of Machine Learning and Metaheuristic Search to Security Testing
Applications of Machine Learning and Metaheuristic Search to Security TestingApplications of Machine Learning and Metaheuristic Search to Security Testing
Applications of Machine Learning and Metaheuristic Search to Security Testing
 
White space steganography on text
White space steganography on textWhite space steganography on text
White space steganography on text
 
Network Measurement and Monitori - Assigment 1, Group3, "Classification"
Network Measurement and Monitori - Assigment 1, Group3, "Classification"Network Measurement and Monitori - Assigment 1, Group3, "Classification"
Network Measurement and Monitori - Assigment 1, Group3, "Classification"
 
System Testing of Timing Requirements based on Use Cases and Timed Automata
System Testing of Timing Requirements based on Use Cases and Timed AutomataSystem Testing of Timing Requirements based on Use Cases and Timed Automata
System Testing of Timing Requirements based on Use Cases and Timed Automata
 
An implementation of RSA policy
An implementation of RSA policyAn implementation of RSA policy
An implementation of RSA policy
 
Feature Selection Techniques for Software Fault Prediction (Summary)
Feature Selection Techniques for Software Fault Prediction (Summary)Feature Selection Techniques for Software Fault Prediction (Summary)
Feature Selection Techniques for Software Fault Prediction (Summary)
 

Viewers also liked

A bypass of cohen's impossibility result
A bypass of cohen's impossibility resultA bypass of cohen's impossibility result
A bypass of cohen's impossibility resultUltraUploader
 
Anomalous payload based worm detection and signature generation
Anomalous payload based worm detection and signature generationAnomalous payload based worm detection and signature generation
Anomalous payload based worm detection and signature generationUltraUploader
 
Automatically generating signatures for polymorphic worms
Automatically generating signatures for polymorphic wormsAutomatically generating signatures for polymorphic worms
Automatically generating signatures for polymorphic wormsUltraUploader
 
Automated defense from rootkit attacks
Automated defense from rootkit attacksAutomated defense from rootkit attacks
Automated defense from rootkit attacksUltraUploader
 
An undetectable computer virus
An undetectable computer virusAn undetectable computer virus
An undetectable computer virusUltraUploader
 
Antivirus update reaction times of major antivirus vendors
Antivirus update reaction times of major antivirus vendorsAntivirus update reaction times of major antivirus vendors
Antivirus update reaction times of major antivirus vendorsUltraUploader
 
Attack of the killer virus!
Attack of the killer virus!Attack of the killer virus!
Attack of the killer virus!UltraUploader
 
Artificial immune system against viral attack
Artificial immune system against viral attackArtificial immune system against viral attack
Artificial immune system against viral attackUltraUploader
 
(Ebook ita - inform - access) guida al database access (doc)
(Ebook   ita - inform - access) guida al database access (doc)(Ebook   ita - inform - access) guida al database access (doc)
(Ebook ita - inform - access) guida al database access (doc)UltraUploader
 
Automatic comparison of malware
Automatic comparison of malwareAutomatic comparison of malware
Automatic comparison of malwareUltraUploader
 
Anti spyware coalition definitions and supporting documents
Anti spyware coalition definitions and supporting documentsAnti spyware coalition definitions and supporting documents
Anti spyware coalition definitions and supporting documentsUltraUploader
 
Asm based modelling of self-replicating programs
Asm based modelling of self-replicating programsAsm based modelling of self-replicating programs
Asm based modelling of self-replicating programsUltraUploader
 
Botnetsand applications
Botnetsand applicationsBotnetsand applications
Botnetsand applicationsUltraUploader
 
Applications of immune system computing
Applications of immune system computingApplications of immune system computing
Applications of immune system computingUltraUploader
 
Analysis of web application worms and viruses
Analysis of web application worms and virusesAnalysis of web application worms and viruses
Analysis of web application worms and virusesUltraUploader
 

Viewers also liked (17)

Bad transfer
Bad transferBad transfer
Bad transfer
 
A bypass of cohen's impossibility result
A bypass of cohen's impossibility resultA bypass of cohen's impossibility result
A bypass of cohen's impossibility result
 
Anomalous payload based worm detection and signature generation
Anomalous payload based worm detection and signature generationAnomalous payload based worm detection and signature generation
Anomalous payload based worm detection and signature generation
 
Automatically generating signatures for polymorphic worms
Automatically generating signatures for polymorphic wormsAutomatically generating signatures for polymorphic worms
Automatically generating signatures for polymorphic worms
 
Automated defense from rootkit attacks
Automated defense from rootkit attacksAutomated defense from rootkit attacks
Automated defense from rootkit attacks
 
An undetectable computer virus
An undetectable computer virusAn undetectable computer virus
An undetectable computer virus
 
Antivirus update reaction times of major antivirus vendors
Antivirus update reaction times of major antivirus vendorsAntivirus update reaction times of major antivirus vendors
Antivirus update reaction times of major antivirus vendors
 
Attack of the killer virus!
Attack of the killer virus!Attack of the killer virus!
Attack of the killer virus!
 
Attacking antivirus
Attacking antivirusAttacking antivirus
Attacking antivirus
 
Artificial immune system against viral attack
Artificial immune system against viral attackArtificial immune system against viral attack
Artificial immune system against viral attack
 
(Ebook ita - inform - access) guida al database access (doc)
(Ebook   ita - inform - access) guida al database access (doc)(Ebook   ita - inform - access) guida al database access (doc)
(Ebook ita - inform - access) guida al database access (doc)
 
Automatic comparison of malware
Automatic comparison of malwareAutomatic comparison of malware
Automatic comparison of malware
 
Anti spyware coalition definitions and supporting documents
Anti spyware coalition definitions and supporting documentsAnti spyware coalition definitions and supporting documents
Anti spyware coalition definitions and supporting documents
 
Asm based modelling of self-replicating programs
Asm based modelling of self-replicating programsAsm based modelling of self-replicating programs
Asm based modelling of self-replicating programs
 
Botnetsand applications
Botnetsand applicationsBotnetsand applications
Botnetsand applications
 
Applications of immune system computing
Applications of immune system computingApplications of immune system computing
Applications of immune system computing
 
Analysis of web application worms and viruses
Analysis of web application worms and virusesAnalysis of web application worms and viruses
Analysis of web application worms and viruses
 

Similar to Architecture of a morphological malware detector

A fast static analysis approach to detect exploit code inside network flows
A fast static analysis approach to detect exploit code inside network flowsA fast static analysis approach to detect exploit code inside network flows
A fast static analysis approach to detect exploit code inside network flowsUltraUploader
 
DETECTION OF CONTROL FLOW ERRORS IN PARALLEL PROGRAMS AT COMPILE TIME
DETECTION OF CONTROL FLOW ERRORS IN PARALLEL PROGRAMS AT COMPILE TIME DETECTION OF CONTROL FLOW ERRORS IN PARALLEL PROGRAMS AT COMPILE TIME
DETECTION OF CONTROL FLOW ERRORS IN PARALLEL PROGRAMS AT COMPILE TIME ijdpsjournal
 
Machine learning techniques applied to detect cyber attacks on web applications
Machine learning techniques applied to detect cyber attacks on web applicationsMachine learning techniques applied to detect cyber attacks on web applications
Machine learning techniques applied to detect cyber attacks on web applicationsVenkat Projects
 
Machine learning techniques applied to detect cyber attacks on web applications
Machine learning techniques applied to detect cyber attacks on web applicationsMachine learning techniques applied to detect cyber attacks on web applications
Machine learning techniques applied to detect cyber attacks on web applicationsVenkat Projects
 
Static Slicing Technique with Algorithmic Approach
Static Slicing Technique with Algorithmic ApproachStatic Slicing Technique with Algorithmic Approach
Static Slicing Technique with Algorithmic ApproachIOSR Journals
 
Cl32990995
Cl32990995Cl32990995
Cl32990995IJMER
 
IRJET - Survey on Malware Detection using Deep Learning Methods
IRJET -  	  Survey on Malware Detection using Deep Learning MethodsIRJET -  	  Survey on Malware Detection using Deep Learning Methods
IRJET - Survey on Malware Detection using Deep Learning MethodsIRJET Journal
 
A semantics based approach to malware detection
A semantics based approach to malware detectionA semantics based approach to malware detection
A semantics based approach to malware detectionUltraUploader
 
Unveiling Metamorphism by Abstract Interpretation of Code Properties
Unveiling Metamorphism by Abstract Interpretation of Code PropertiesUnveiling Metamorphism by Abstract Interpretation of Code Properties
Unveiling Metamorphism by Abstract Interpretation of Code PropertiesFACE
 
A Literature Review on Plagiarism Detection in Computer Programming Assignments
A Literature Review on Plagiarism Detection in Computer Programming AssignmentsA Literature Review on Plagiarism Detection in Computer Programming Assignments
A Literature Review on Plagiarism Detection in Computer Programming AssignmentsIRJET Journal
 
Automatic reverse engineering of malware emulators
Automatic reverse engineering of malware emulatorsAutomatic reverse engineering of malware emulators
Automatic reverse engineering of malware emulatorsUltraUploader
 
A method for detecting obfuscated calls in malicious binaries
A method for detecting obfuscated calls in malicious binariesA method for detecting obfuscated calls in malicious binaries
A method for detecting obfuscated calls in malicious binariesUltraUploader
 
Enhancement of Error Correction in Quantum Cryptography BB84 ...
Enhancement of Error Correction in Quantum Cryptography BB84 ...Enhancement of Error Correction in Quantum Cryptography BB84 ...
Enhancement of Error Correction in Quantum Cryptography BB84 ...butest
 
Verification of the protection services in antivirus systems by using nusmv m...
Verification of the protection services in antivirus systems by using nusmv m...Verification of the protection services in antivirus systems by using nusmv m...
Verification of the protection services in antivirus systems by using nusmv m...ijfcstjournal
 
A model for detecting the existence of unknown computer viruses in real time
A model for detecting the existence of unknown computer viruses in real timeA model for detecting the existence of unknown computer viruses in real time
A model for detecting the existence of unknown computer viruses in real timeUltraUploader
 
Testing parallel programs
Testing parallel programsTesting parallel programs
Testing parallel programsPVS-Studio
 
A hybrid model to detect malicious executables
A hybrid model to detect malicious executablesA hybrid model to detect malicious executables
A hybrid model to detect malicious executablesUltraUploader
 
Software CrashLocator: Locating the Faulty Functions by Analyzing the Crash S...
Software CrashLocator: Locating the Faulty Functions by Analyzing the Crash S...Software CrashLocator: Locating the Faulty Functions by Analyzing the Crash S...
Software CrashLocator: Locating the Faulty Functions by Analyzing the Crash S...INFOGAIN PUBLICATION
 
Obfuscated computer virus detection using machine learning algorithm
Obfuscated computer virus detection using machine learning algorithmObfuscated computer virus detection using machine learning algorithm
Obfuscated computer virus detection using machine learning algorithmjournalBEEI
 

Similar to Architecture of a morphological malware detector (20)

A fast static analysis approach to detect exploit code inside network flows
A fast static analysis approach to detect exploit code inside network flowsA fast static analysis approach to detect exploit code inside network flows
A fast static analysis approach to detect exploit code inside network flows
 
DETECTION OF CONTROL FLOW ERRORS IN PARALLEL PROGRAMS AT COMPILE TIME
DETECTION OF CONTROL FLOW ERRORS IN PARALLEL PROGRAMS AT COMPILE TIME DETECTION OF CONTROL FLOW ERRORS IN PARALLEL PROGRAMS AT COMPILE TIME
DETECTION OF CONTROL FLOW ERRORS IN PARALLEL PROGRAMS AT COMPILE TIME
 
Machine learning techniques applied to detect cyber attacks on web applications
Machine learning techniques applied to detect cyber attacks on web applicationsMachine learning techniques applied to detect cyber attacks on web applications
Machine learning techniques applied to detect cyber attacks on web applications
 
Machine learning techniques applied to detect cyber attacks on web applications
Machine learning techniques applied to detect cyber attacks on web applicationsMachine learning techniques applied to detect cyber attacks on web applications
Machine learning techniques applied to detect cyber attacks on web applications
 
Static Slicing Technique with Algorithmic Approach
Static Slicing Technique with Algorithmic ApproachStatic Slicing Technique with Algorithmic Approach
Static Slicing Technique with Algorithmic Approach
 
Cl32990995
Cl32990995Cl32990995
Cl32990995
 
IRJET - Survey on Malware Detection using Deep Learning Methods
IRJET -  	  Survey on Malware Detection using Deep Learning MethodsIRJET -  	  Survey on Malware Detection using Deep Learning Methods
IRJET - Survey on Malware Detection using Deep Learning Methods
 
A semantics based approach to malware detection
A semantics based approach to malware detectionA semantics based approach to malware detection
A semantics based approach to malware detection
 
Unveiling Metamorphism by Abstract Interpretation of Code Properties
Unveiling Metamorphism by Abstract Interpretation of Code PropertiesUnveiling Metamorphism by Abstract Interpretation of Code Properties
Unveiling Metamorphism by Abstract Interpretation of Code Properties
 
A Literature Review on Plagiarism Detection in Computer Programming Assignments
A Literature Review on Plagiarism Detection in Computer Programming AssignmentsA Literature Review on Plagiarism Detection in Computer Programming Assignments
A Literature Review on Plagiarism Detection in Computer Programming Assignments
 
Automatic reverse engineering of malware emulators
Automatic reverse engineering of malware emulatorsAutomatic reverse engineering of malware emulators
Automatic reverse engineering of malware emulators
 
A method for detecting obfuscated calls in malicious binaries
A method for detecting obfuscated calls in malicious binariesA method for detecting obfuscated calls in malicious binaries
A method for detecting obfuscated calls in malicious binaries
 
Enhancement of Error Correction in Quantum Cryptography BB84 ...
Enhancement of Error Correction in Quantum Cryptography BB84 ...Enhancement of Error Correction in Quantum Cryptography BB84 ...
Enhancement of Error Correction in Quantum Cryptography BB84 ...
 
Verification of the protection services in antivirus systems by using nusmv m...
Verification of the protection services in antivirus systems by using nusmv m...Verification of the protection services in antivirus systems by using nusmv m...
Verification of the protection services in antivirus systems by using nusmv m...
 
A model for detecting the existence of unknown computer viruses in real time
A model for detecting the existence of unknown computer viruses in real timeA model for detecting the existence of unknown computer viruses in real time
A model for detecting the existence of unknown computer viruses in real time
 
J034057065
J034057065J034057065
J034057065
 
Testing parallel programs
Testing parallel programsTesting parallel programs
Testing parallel programs
 
A hybrid model to detect malicious executables
A hybrid model to detect malicious executablesA hybrid model to detect malicious executables
A hybrid model to detect malicious executables
 
Software CrashLocator: Locating the Faulty Functions by Analyzing the Crash S...
Software CrashLocator: Locating the Faulty Functions by Analyzing the Crash S...Software CrashLocator: Locating the Faulty Functions by Analyzing the Crash S...
Software CrashLocator: Locating the Faulty Functions by Analyzing the Crash S...
 
Obfuscated computer virus detection using machine learning algorithm
Obfuscated computer virus detection using machine learning algorithmObfuscated computer virus detection using machine learning algorithm
Obfuscated computer virus detection using machine learning algorithm
 

More from UltraUploader

01 le 10 regole dell'hacking
01   le 10 regole dell'hacking01   le 10 regole dell'hacking
01 le 10 regole dell'hackingUltraUploader
 
00 the big guide sz (by dr.to-d)
00   the big guide sz (by dr.to-d)00   the big guide sz (by dr.to-d)
00 the big guide sz (by dr.to-d)UltraUploader
 
[E book ita] php manual
[E book   ita] php manual[E book   ita] php manual
[E book ita] php manualUltraUploader
 
[Ebook ita - security] introduzione alle tecniche di exploit - mori - ifoa ...
[Ebook   ita - security] introduzione alle tecniche di exploit - mori - ifoa ...[Ebook   ita - security] introduzione alle tecniche di exploit - mori - ifoa ...
[Ebook ita - security] introduzione alle tecniche di exploit - mori - ifoa ...UltraUploader
 
[Ebook ita - database] access 2000 manuale
[Ebook   ita - database] access 2000 manuale[Ebook   ita - database] access 2000 manuale
[Ebook ita - database] access 2000 manualeUltraUploader
 
(E book) cracking & hacking tutorial 1000 pagine (ita)
(E book) cracking & hacking tutorial 1000 pagine (ita)(E book) cracking & hacking tutorial 1000 pagine (ita)
(E book) cracking & hacking tutorial 1000 pagine (ita)UltraUploader
 
(Ebook computer - ita - pdf) fondamenti di informatica - teoria
(Ebook   computer - ita - pdf) fondamenti di informatica - teoria(Ebook   computer - ita - pdf) fondamenti di informatica - teoria
(Ebook computer - ita - pdf) fondamenti di informatica - teoriaUltraUploader
 
Broadband network virus detection system based on bypass monitor
Broadband network virus detection system based on bypass monitorBroadband network virus detection system based on bypass monitor
Broadband network virus detection system based on bypass monitorUltraUploader
 
Bot software spreads, causes new worries
Bot software spreads, causes new worriesBot software spreads, causes new worries
Bot software spreads, causes new worriesUltraUploader
 
Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...
Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...
Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...UltraUploader
 
Bird binary interpretation using runtime disassembly
Bird binary interpretation using runtime disassemblyBird binary interpretation using runtime disassembly
Bird binary interpretation using runtime disassemblyUltraUploader
 
Biologically inspired defenses against computer viruses
Biologically inspired defenses against computer virusesBiologically inspired defenses against computer viruses
Biologically inspired defenses against computer virusesUltraUploader
 
Biological versus computer viruses
Biological versus computer virusesBiological versus computer viruses
Biological versus computer virusesUltraUploader
 
Biological aspects of computer virology
Biological aspects of computer virologyBiological aspects of computer virology
Biological aspects of computer virologyUltraUploader
 
Biological models of security for virus propagation in computer networks
Biological models of security for virus propagation in computer networksBiological models of security for virus propagation in computer networks
Biological models of security for virus propagation in computer networksUltraUploader
 
Binary obfuscation using signals
Binary obfuscation using signalsBinary obfuscation using signals
Binary obfuscation using signalsUltraUploader
 
Beyond layers and peripheral antivirus security
Beyond layers and peripheral antivirus securityBeyond layers and peripheral antivirus security
Beyond layers and peripheral antivirus securityUltraUploader
 

More from UltraUploader (20)

1 (1)
1 (1)1 (1)
1 (1)
 
01 intro
01 intro01 intro
01 intro
 
01 le 10 regole dell'hacking
01   le 10 regole dell'hacking01   le 10 regole dell'hacking
01 le 10 regole dell'hacking
 
00 the big guide sz (by dr.to-d)
00   the big guide sz (by dr.to-d)00   the big guide sz (by dr.to-d)
00 the big guide sz (by dr.to-d)
 
[E book ita] php manual
[E book   ita] php manual[E book   ita] php manual
[E book ita] php manual
 
[Ebook ita - security] introduzione alle tecniche di exploit - mori - ifoa ...
[Ebook   ita - security] introduzione alle tecniche di exploit - mori - ifoa ...[Ebook   ita - security] introduzione alle tecniche di exploit - mori - ifoa ...
[Ebook ita - security] introduzione alle tecniche di exploit - mori - ifoa ...
 
[Ebook ita - database] access 2000 manuale
[Ebook   ita - database] access 2000 manuale[Ebook   ita - database] access 2000 manuale
[Ebook ita - database] access 2000 manuale
 
(E book) cracking & hacking tutorial 1000 pagine (ita)
(E book) cracking & hacking tutorial 1000 pagine (ita)(E book) cracking & hacking tutorial 1000 pagine (ita)
(E book) cracking & hacking tutorial 1000 pagine (ita)
 
(Ebook computer - ita - pdf) fondamenti di informatica - teoria
(Ebook   computer - ita - pdf) fondamenti di informatica - teoria(Ebook   computer - ita - pdf) fondamenti di informatica - teoria
(Ebook computer - ita - pdf) fondamenti di informatica - teoria
 
Broadband network virus detection system based on bypass monitor
Broadband network virus detection system based on bypass monitorBroadband network virus detection system based on bypass monitor
Broadband network virus detection system based on bypass monitor
 
Bot software spreads, causes new worries
Bot software spreads, causes new worriesBot software spreads, causes new worries
Bot software spreads, causes new worries
 
Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...
Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...
Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...
 
Blast off!
Blast off!Blast off!
Blast off!
 
Bird binary interpretation using runtime disassembly
Bird binary interpretation using runtime disassemblyBird binary interpretation using runtime disassembly
Bird binary interpretation using runtime disassembly
 
Biologically inspired defenses against computer viruses
Biologically inspired defenses against computer virusesBiologically inspired defenses against computer viruses
Biologically inspired defenses against computer viruses
 
Biological versus computer viruses
Biological versus computer virusesBiological versus computer viruses
Biological versus computer viruses
 
Biological aspects of computer virology
Biological aspects of computer virologyBiological aspects of computer virology
Biological aspects of computer virology
 
Biological models of security for virus propagation in computer networks
Biological models of security for virus propagation in computer networksBiological models of security for virus propagation in computer networks
Biological models of security for virus propagation in computer networks
 
Binary obfuscation using signals
Binary obfuscation using signalsBinary obfuscation using signals
Binary obfuscation using signals
 
Beyond layers and peripheral antivirus security
Beyond layers and peripheral antivirus securityBeyond layers and peripheral antivirus security
Beyond layers and peripheral antivirus security
 

Architecture of a morphological malware detector

  • 1. Architecture of a Morphological Malware Detector Guillaume Bonfante, Matthieu Kaczmarek and Jean-Yves Marion Nancy-Universit´e - Loria - INPL - Ecole Nationale Sup´erieure des Mines de Nancy B.P. 239, 54506 Vandœuvre-l`es-Nancy C´edex, France July 15, 2008 Abstract Most of malware detectors are based on syntactic signatures that identify known malicious programs. Up to now this architecture has been sufficiently ef- ficient to overcome most of malware attacks. Nev- ertheless, the complexity of malicious codes still in- crease. As a result the time required to reverse engineer malicious programs and to forge new sig- natures is increasingly longer. This study proposes an efficient construction of a morphological malware detector, that is a detector which associates syntactic and semantic analysis. It aims at facilitating the task of malware analysts providing some abstraction on the signature rep- resentation which is based on control flow graphs (CFG). We build an efficient signature matching engine over tree automata techniques. Moreover we de- scribe a generic graph rewriting engine in order to deal with classic mutations techniques. Finally, we provide a preliminary evaluation of the strategy de- tection carrying out experiments on a malware col- lection. Introduction The identification of malicious behavior is a diffi- cult task. Until now, no technologies have been able to automatically prevent the spread of malware. Several approaches have been considered but nei- ther syntactic analysis nor behavioral consideration were really effective. Presently, human analysis of malware seems to be the best strategy, next mal- ware detectors based on string signature remains the most reliable solution. From this point of view, we have tried to easier the task which consist in finding a good signature within a malicious pro- grams. Our technique has been inspired from the article [6] where control flow graphs are used to de- tect the different instances of the computer virus MetaPHOR. Generally speaking, detection strategies based on string signatures uses a database of regular expres- sions and a string matching engine to scan files and to detect infected ones. Each regular expression of the database is designed to identify a known malicious program. There are at least three dif- ficulties tied to this approach. First, the identi- fication of a malware signature requires a human expert and the time to forge a reliable signature is long compared to the time related to a malware attack. Second, string signature approach can be easily bypassed by obfuscation methods. Among recent work treating this subject, we propose to see for example [4, 7, 14]. Third, as the quantity of malware increase, the ratio of false positives be- comes a crucial issue. And removing old malware signatures would open doors for outbreaks of re- engineered malware. Thus, a current trend in the community proposes to design next generation of malware detectors over semantic aspects. [11, 9, 20]. However, most of se- mantic properties are difficult to decide and even heuristics can be very complex as it is illustrated in the field of computer safety. For those reasons, in [5] we try to propose and to construct a morpho- logical analysis in order to detect malware. The idea is to recognize the shape of a malicious pro- gram. That is, unlike string signature detection, we are not only considering a program as a flat text, but rather as a semantics object, so adding in some sense a dimension to the analysis. Our approach tries to combine several features: (a) to associate 1 inria-00330022,version1-13Oct2008 Author manuscript, published in "Journal in Computer Virology (2008)" DOI : 10.1007/s11416-008-0102-4
  • 2. syntactic and semantic analysis, (b) to be efficient and (c) to be as automatic as possible. Our morphological detector uses a set of CFG which plays the role of a malware signature database. Next, the detection consists in scanning files in order to recognize the shape of a malware. This design is closed to a string signature based de- tector and so we think that both approaches may be combined in a near future. Moreover, it is impor- tant to notice that this framework make the signa- ture extraction easier. Indeed, either the extraction is fully automatic when the malware CFG is rele- vant or the task of signature makers is facilitated since they can work on an abstract representation of malicious programs. This detection strategy is close to the ones pre- sented in [9, 6] but we put our strengths to opti- mize the efficiency of algorithms. For that sake, we use tree automata, a generalization to trees of finite state automata over strings [10]. Intuitively, we transform CFG into trees with pointers in or- der to represent back edges and cross edges. Then, the collection of malware signatures is a finite set of trees and so a regular tree language. Thanks to the construction of Myhill-Nerode, the minimal au- tomaton gives us a compact and efficient database. Notice that the construction of the database is it- erative and it is easy to add the CFG of a newly discovered malicious program. Another issue of malware detections is the sound- ness with respect to classic mutation techniques. Here, we detect isomorphic CFG and so several comon obfuscation methods are canonically re- moved. Moreover, we add a rewriting engine which normalizes CFG in order to have a robust represen- tation of the control flow with respect to mutations. Related works are [6, 8, 20] where data flow of pro- grams is also considered. The design of this complete chain of process is summarized by Figure 1. We also provide large scale experiments, with a collection of 10156 malicious programs and 2653 sane programs. Those results are promising, with a completely automatic method for the signature extraction we have obtained a false positive ratio of 0.1%. This study is organized as follows. First we ex- pose the principles of CFG extraction and normal- ization. Then, we present a matching engine for CFG that is based on tree automata. Finally we Figure 1: Design of the control flow detector carry out some experiments to validate our method. 1 CFG in x86 languages Road-map. Since we focus on practical aspects we choose to work on a concrete assembly language. This language is close to the x86 assembly lan- guage. We detail how to extract CFG from pro- grams, we underline the difficulties that can be en- countered and we outline how they can be overcome with classic methods. Finally, we study the prob- lem of CFG mutations. We propose to normalize the extracted CFG according to rewriting rules in order to remove common mutations. An x86 assembly language. We present the grammar of the studied programming language. The computation domain is the integers and we use a subset of the commands of the x86 assembly lan- guage. The important feature is that we consider the same flow instructions as in x86 achitectures, as a result the method that we develop can be directly applied to concrete programs. 2 inria-00330022,version1-13Oct2008
  • 3. Addresses N Offsets Z Registers R Expressions E ::= Z | N | R | [N] | [R] Flow instructions If ::= jmp E | call E | ret | jcc Z Sequential instructions Id ::= mov E E | comp E E | . . . Programs P ::= Id | If | P ; P Next, a program is a sequence of instructions p = i0 ; . . . ; in−1. The address of the instruction ik is k. In order to ease the reading and without loss of generality, we suppose that i0 is the first instruction to be executed, the address 0 is the so called entry point of the program. We observe that the control flow of programs is driven by only four kinds of flow instructions. Given an instruction ik ∈ If , the possible transfers of control are the following. • If ik is an unconditional jump jmp e. The con- trol is transferred to the address given by the value of the expression e. • If ik is a conditional jump jcc x. If its asso- ciated condition is true, the control is trans- ferred to the address k + x. Otherwise, the control is transferred to the address k + 1. • If ik is a function call call e. The address k + 1 is pushed on the stack and the control is transferred to the the value of the expression e. • If ik is a function return ret. An address is popped from the stack and the control is trans- ferred to this address. Prerequisites. The extraction of the CFG from a program is tied to several difficulties. First, we need access to the instructions of the program. As a result packing and encryption techniques can thwart the extraction. This problem is part of the folklore, indeed classical string signature de- tectors also have to face those techniques. Many solutions such as sand-boxes and generic unpack- ers have been developed to overcome this difficulty. The presentation of those solutions exceeds the scope of the current study then we refer to the text- books [13, 12, 18]. Second, we are confronted to obscure sequenceq of instructions such as push a; ret which have the same behavior as the instruction jmp a. This is also part of the folklore and we will suppose that such sequence of instructions are normalized during the disassembly phase of the extraction. Third, the target addresses of jumps and function calls have to be dynamically computed. For exam- ple, when we encounter the instruction jmp eax we need the value of the register eax in order to follow the control flow. In such cases we rely on an heuris- tic (|e|) which provides the value of the expression e if it can be statically computed. If the value cannot be computed then (|e|) = ⊥. Such an heuristic can be based on partial evaluation, emulation or any other static analysis technique. The extraction procedure. The control flow consists in the different paths that might be tra- versed through the program during its execution. It is frequently represented by a graph named a control flow graph (CFG). The vertices stand for addresses of instructions and the edges represent the possible paths that the control flow can follow. We suppose that we have access to the code of programs and that we have an heuristic (| |) to eval- uate expressions. Table 1 presents a procedure to extract CFG from programs. We observe that this procedure closely follows the semantics of flow in- structions. Indeed, the vertices of the CFG are la- beled accordingly to the instruction at the night address and the nodes are linked according to the possible control transfers. • The symbol inst of arity 1 labels addresses of sequential instructions. There is only one successor: the address of the next instruction. • The symbol jmp of arity 1 labels addresses of unconditional jumps. There is only one suc- cessor: the address to jump to. • The symbol jcc of arity 2 labels addresses of conditional jumps. There is two successors: the address to jump to when the condition is true and the address of the next instruction where the control is transfered when the con- dition is false. • The symbol call of arity 2 labels addresses of function calls. There is two successors: the address of the function to call and the return 3 inria-00330022,version1-13Oct2008
  • 4. address, that is the address of the next instruc- tion. • The symbol end of arity 0 labels addresses of function returns and undefined instructions. There is no successor. The entry point of the program correspond to the root of the CFG. Instruction Graph in ∈ Id in = jmp e (|e|) = k in = call e (|e|) = k in = jcc x Otherwise Table 1: Control flow graph extraction Normalizing mutations. Our CFG representa- tion is a rough abstraction of programs. Indeed we do not make any distinction between the dif- ferent kinds of sequential instruction, all of them are represented by nodes labeled with the symbol inst. This first abstraction level makes the CFG sound with respect to mutations which substitutes instructions with the same behavior. For example the replacement of the instruction mov eax 0 by the instruction xor eax eax does not impact our CFG representation. However, the soundness with respect to classic mutations techniques remains an important issue. Indeed, some well know mutation techniques can alter the CFG of malicious programs. In order to recover a sound representation of the control flow we apply reductions on CFG. A reduction is defined by a graph rewriting rule. As a case study, we con- sider three reductions associated to classic muta- tion techniques. Of course several other reductions can be defined in order to handle more mutations techniques. We use the following reductions • Concatenate consecutive instructions into blocks of instructions. • Realign code removing superfluous uncondi- tional jumps. • Merge consecutive conditional jumps. Those abstractions can be defined through the graph rewriting rules of Table 2. Figure 2 presents an assembly program and its reduced CFG. 0: cmp eax 0 1: jne +7 2: mov ecx eax 3: dec ecx 4: mul eax ecx 5: cmp ecx 1 6: jne −3 7: jmp +2 8: inc ecx 9: ret Figure 2: A program and its CFG We remark that each rewriting rule impose a diminution of the size of the rewritten graph then the reduction clearly terminates. Moreover, since there is no critical pair we have no problem of confluence. Nevertheless, normalizing mutations through rewriting rules is a generic principle that could be applied on sophisticated cases. Then, the issues of termination and confluence shall be care- fully considered. Table 3 presents mutations of the program of Fig- ure 2. All of them have the same reduced CFG as the original program. 2 Efficient database Road-map. Morphological detection is based on a set of malware CFG which plays the role of mal- ware signatures. This collection of CFG is compiled into a tree automaton thanks to a term represen- tation. Since tree automata fulfill a minimization property, we obtain an efficient representation of the database. Next, we apply this framework for the sub-CFG isomorphism problem in order to de- tect malware infections. From graphs to terms. A path is a word over {1, 2}∗ , we write the empty path. We define the 4 inria-00330022,version1-13Oct2008
  • 5. Concatenate instructions Realign code Merge jcc → → → Table 2: Control flow graph reductions Instruction substitution Block substitution Block permutation jcc obfuscation All in one 0: cmp eax 0 1: jne +7 2: mov ecx eax 3: sub ecx 1 4: mul eax ecx 5: cmp ecx 1 6: jne −3 7: jmp +2 8: mov eax 1 9: ret 0: cmp eax 0 1: jne +8 2: push eax 3: pop ecx 4: dec ecx 5: mul eax ecx 6: cmp ecx 1 7: jne −3 8: jmp +2 9: inc ecx 10: ret 0: cmp eax 0 1: jne +7 2: mov ecx eax 3: dec ecx 4: mul eax ecx 5: cmp ecx 1 6: jne −3 9: ret 8: inc ecx 9: jmp −2 0: cmp eax 0 1: jne +9 2: mov ecx eax 3: dec ecx 4: mul eax ecx 5: cmp ecx 2 6: ja −3 7: cmp ecx 1 8: jne −5 9: jmp +2 10: inc ecx 11: ret 0: cmp eax 0 1: je +2 2: jmp +10 2: push eax 3: pop ecx 4: sub ecx 1 5: mul eax ecx 6: cmp ecx 2 7: ja −3 8: cmp ecx 1 9: jne −5 10: ret 11: mov ecx 1 12: jmp −2 Table 3: Control flow graph mutations path order for any path ρ, τ ∈ {1, 2}∗ and any in- teger i ∈ {1, 2} as follows. ρ1 < ρ2 ρ < ρi ρ < τ ⇒ ρρ < ττ A tree domain is a set d ⊂ {1, 2}∗ such that for any path ρ ∈ {1, 2}∗ and any integer i ∈ {1, 2} we have ρi ∈ d ⇒ ρ ∈ d A tree over a set of symbols F is a pair t = (d(t), ˆt) where d(t) is a tree domain and ˆt is a func- tion from d(t) to F. We consider the set of symbols F = {inst, jmp, call, jcc, ret} ∪ {1, 2}∗ and the trees overs this set. In such trees, a nodes labeled by a path ρ = {1, 2}∗ is thought as pointers to the night node of the tree. Then, a tree have two kinds of nodes: the inner nodes labeled by symbols of {inst, jmp, call, jcc, ret} and the pointer nodes labeled by path in {1, 2}ρ . In the following we write ˚d(t) the set of inner nodes of the tree t, that is ˚d(t) = ρ ρ ∈ d(t) ˆt(ρ) ∈ {inst, jmp, call, jcc, ret} Next a tree t is well formed if for any paths ρ, τ ∈ d(t) ˆt(ρ) = τ ⇒ τ ∈ ˚d(t) and ρ ≤ τ We observe that any CFG can be represented by a unique well formed tree. Tree automata. A finite tree automaton is a tu- ple A = (Q, F, Qf , ∆), where Q is a finite set of states, F is a set of symbols, Qf ⊂ Q is a set of final states and ∆ is a finite set of transition rules of the type a(q1 . . . qi) → q with a ∈ F has arity i and q, q1, . . . , qi ∈ Q. A run of an automaton on a tree t starts at the leaves and moves upward, associating a state with 5 inria-00330022,version1-13Oct2008
  • 6. each sub-tree. Any symbol a of arity 0 is labeled by q if a → q ∈ ∆. Next, if the direct sub-trees t1, . . . , tn of a tree t = a(t1, . . . , tn) are respectively labeled by states q1, . . . , qn then the tree t is labeled by the state q if a(q1, . . . , qn) → q ∈ ∆. A tree t is accepted by the automaton if the run labels t with a final state. We observe that a run on a tree t can be computed in linear time, that is O(n) where n is the size of t, that is the number of its nodes. For any automaton A, we write L(A) the set of trees accepted by A. A language of trees L is recognizable if there is a tree automaton A such that L = L(A). We define the size |A| of an automaton A as the number of its rules. Tree automata have interesting properties. First, it is easy to build an automaton which recognize a given finite set of trees. This operation can be done in linear time, that is O(n) where n is the sum of the sizes of the trees in the language. Second, we can add new trees to the language recognized by an automaton computing a union of automata, see [10]. Given an automaton A, the union of A with an automaton A can be computed in linear time, that is O(|A |). Finally, for a given recognizable tree language, there exists a unique minimal automaton in the number of states which recognizes this language. This property ensures that the minimal automaton is the best representation of the tree language. Theorem 1 (From [10]). For any tree automaton A which recognize a tree language L we can com- pute in quadratic time (O(|A|2 )) a tree automaton A which is the minimum tree automaton recogniz- ing L up to a renaming of the states. Building the database. We explain how this framework can be used to detect malware infec- tions. Suppose that we have a set {t1, . . . , tn} of malware CFG represented by trees. Since this set is finite, there is a tree automaton A which recog- nizes it. Next, consider the tree representation t of a given program. Computing a run of A on t, we can decide in linear time if this tree is one of the the trees obtained from malware CFG. This means that that we can efficiently decide if a program has the same CFG as a known malware. Finally, we can speed-up the detection comput- ing the minimal automaton which recognize the language {t1, . . . , tn}. From a practical point of view this is the most efficient representation of the malware CFG database. Detecting infections. Actually, when a mali- cious program infects an other program, it includes its own code within the program of its host. Then, we can reasonably suppose that the CFG of the ma- licious program appears as a subgraph of the global CFG of the infected program. As a result, we can detect such an infection by deciding the subgraph isomorphism problem within the context of CFG. First we have to observe that we are not con- fronted with the general sub-graph isomorphism since CFG are graphs with strong constraints. In particular the edge labeling property implies that a CFG composed of n nodes accepts at most n subgraphs. As a result, the sub-CFG isomorphism problem is not NP-complete. Then to detect sub- CFG it is sufficient to run the automaton on the tree representations of any sub-CFG. 3 Experiments Road-map. We consider the win32 binaries of VX Heavens malware collection [2]. This collection is composed of 10156 malicious programs. Then, we have collected 2653 win32 binaries from a fresh installation of Windows VistaTM . This second col- lection is considered as sane programs. Using those samples we experiments with our im- plementation of the morphological detector. We focus our attention on false positive ratios in order to validate the our method. Indeed, we have to know if it is possible to discriminate sane programs from malicious ones only considering their CFG. The following experimental results agree with this hypothesis. CFG extraction in practice. To overcome the difficulties of CFG extraction we have chosen the following solutions • In order to deal with crypted and packed samples, we use the unpacking capabilities of ClamAVTM [3]. • We have implemented a dynamic disassembler based on the disassembler library Udis86 [1]. Our module is able to follow the control flow 6 inria-00330022,version1-13Oct2008
  • 7. and it keeps track of the stack in order to re- move push a; ret sequences. • The evaluation heuristic (|e|) proceed as follows. When we encounter a dynamic flow instruc- tion, we emulate the preceding block of sequen- tial instructions in order to recover the value of the expression e. Our emulation technology is also build over Udis86. It is limited to a subset of x86 assembly instruction, interruptions and system calls are not taken into account. • We reduce the obtained CFG according to the rules of Table 2. Figure 3 gives the sizes of the reduced CFG ex- tracted from the programs of those collections. On the X axis we have the upper bound on the size of CFG and on the Y axis we have the percentage of CFG whose size is lower than the bound. Figure 3: Sizes of control flow graphs We obeserve that about 5% of the database are programs with a non valid PE header, they pro- duce an empty graph. Then we are able to extract a CFG of more that 15 nodes from about 65% of the samples. The remaining 30% produces a CFG which have between 1 and 15. We think that those graphs are too small to be relevant. We are cur- runtly working on this part of the samples to im- prove our extraction procedure. Building the database. The size of malware control flow graphs clearly impact the accuracy of the control flow detector. We have observed that the graphs extracted from some malware were too small to be relevant and the resulting detec- tor makes many false alerts because of a few such graphs. As a result, we impose a lower bound on the size of the graphs that we include in the database. Next, we have done several tests using different lower bounds. Let N ∈ N be the lower bound on the size of CFG. We build the minimized automaton AN M which recognizes the set of tree representations of reduced malware CFG that are composed of more than N nodes. We define the morphological de- tector DN M as a predicate such that for any pro- gram p ∈ P we have DN M (p) = 1 if a malware CFG appears as a subgraph of the CFG of p and DN M (p) = 0 otherwise. We have seen in the previ- ous sections that DN M can be decided using AN M. This design has several advantages. First, when a new malicious program is discovered, one can easily add its CFG to the database using the union of tree automata and a new compilation to obtain a minimal tree automaton. The computation of the ‘not minimal’ automata takes about 25 minutes. The minimization takes several hours but this delay is not so important. Indeed, within the context of an update of the mal- ware database, during the minimization we can re- lease the ‘not minimal’ automaton. Indeed, even if this is not the best automaton it still recognize the malware database and it could be used until the minimization is terminated. Evaluation. We are interested in false positives, that is sane programs detected as malicious. For that, we have collected 2653 programs from a fresh installation of Windows VistaTM . Let us note S this set of programs. Let N ∈ N be a lower bound on the size of malware CFG, we consider the following approximation of the false positives of the detector DN M False positives {p | DN M (p) = 1 and p ∈ S} We do not evaluate false negatives, that is unde- tected malicious programs. Indeed, by construc- tion all malicious programs of our malware col- lection are detected by the morphological detec- tor. Nevertheless, this methods seems promising for this aspect. Indeed, the study [6] has shown that a CFG based detection allows to detect the high-obfuscating computer virus MetaPHOR with no false negative. However, the presence of the lower bound on the size of malware CFG that enter in the database im- plies that some malware are undetected. Those can 7 inria-00330022,version1-13Oct2008
  • 8. be considered as false negatives, even if they more related to the technical problem of CFG extraction than to the methods of morphological detection. The results of the experiments mention this ratio of undetected malware. Experimental results. We have built tree au- tomata from the malware samples considering dif- ferent lower bounds N on the size of CFG. Accord- ing to the previous section we obtain the morpho- logical detectors DN M . We have tested those detec- tors on the collection of saneware in order to eval- uate the false positives. It takes about 5 h 30 min to analyze the collection of saneware, this repre- sents the analysis of 2 319 294 sub-CFG. Table 4 presents the results. The first column indicates the considered the lower bound N. The second col- umn indicates the ratio of false positives that is the number of sane programs detected as malicious out of the toalt number of sane programs. The third column indicates the ratio of undetected malicious programs that is the number of malware samples with a CFG size lower than the bound out of the total number of malware samples. 3.1 Analysis. As expected, we observe that the false positives de- crease with the lower bound on the size of CFG. Over 15 nodes, the CFG seems to be a relevant criterium to discriminate malware. Concerning the remaining false positives. The li- braries ir41 qc.dll and ir41 qcx.dll, and the mali- cious program Trojan.Win32.Sechole have the same CFG composed of more than 1 000 nodes. We have tested those programs with commercial an- tivirus software and the libraries ir41 qc.dll and ir41 qcx.dll are not detected whereas the program Trojan.Win32.Sechole is detected as malicious. The malicious programs seems to be based on the dy- namic library and the extraction algorithm was not able to extract the CFG related to the malicious program. Concerning the ratio of undetected malware, The only way to improve the detector is to implement a better heuristic for control flow graph extraction. In its current version our prototype only use a few heuristics. For comparison, statistical methods used in [16] induce false negatives ratios between 36 % and 48 % Lower Bound False positives Undetected 1 100.00% 4.80% 2 83.78% 5.43% 3 76.82% 16.43% 4 76.77% 16.66% 5 57.98% 20.01% 6 34.84% 21.50% 7 20.57% 23.34% 9 12.06% 24.43% 10 2.17% 26.47% 11 2.04% 27.78% 12 1.60% 29.35% 13 0.71% 30.74% 15 0.09% 36.52% Table 4: Results of the experiments and false positive ratios between 0.5 % and 34 %. A detector based on artificial neural networks de- veloped at IBM [19] presents false negatives ratios between 15 % and 20 % and false positive ratios lower than 1 %. The data mining methods surveyed in [17] present false negatives ratios between 2.3 % and 64.4 % and false positive ratios between 2.2 % and 47.5 %. Heuristics methods from antivirus in- dustry tested in [15] present false negatives ratios between 20.0 % and 48.6 % and false positive ratios lower than 0.2 %. References [1] http://udis86.sourceforge.net. [2] http://vx.netlux.org. [3] http://www.clamav.net. 8 inria-00330022,version1-13Oct2008
  • 9. [4] Ph Beaucamps and E Filiol. On the possibility of practically obfuscating programs towards a unified perspective of code protection. Journal in Computer Virology, 3(1):3–21, April 2007. [5] G. Bonfante, M. Kaczmarek, and J.Y. Marion. Control Flow Graphs as Malware Signatures. WTCV, May, 2007. [6] D. Bruschi, Martignoni, L., and M. Monga. Detecting self-mutating malware using control-flow graph matching. Technical report, Universit`a degli Studi di Milano, September 2006. [7] M. Christodorescu and S. Jha. Testing mal- ware detectors. ACM SIGSOFT Software En- gineering Notes, 29(4):34–44, 2004. [8] M. Christodorescu, S. Jha, J. Kinder, S. Katzenbeisser, and H. Veith. Software transformations to improve malware detec- tion. Journal in Computer Virology, 3(4):253– 265, 2007. [9] M. Christodorescu, S. Jha, S.A. Seshia, D. Song, and R.E. Bryant. Semantics-aware malware detection. IEEE Symposium on Se- curity and Privacy, 2005. [10] H. Comon, M. Dauchet, R. Gilleron, F. Jacquemard, D. Lugiez, S. Tison, and M. Tommasi. Tree automata techniques and applications. Available on: http://www. grappa. univ-lille3. fr/tata, 10, 1997. [11] M. Dalla Preda, M. Christodorescu, S. Jha, and S. Debray. A Semantics-Based Approach to Malware Detection. In POPL’07, 2007. [12] E. Filiol. Computer Viruses: from Theory to Applications. Springer-Verlag, 2005. [13] E. Filiol. Advanced viral techniques: mathe- matical and algorithmic aspects. Berlin Hei- delberg New York: Springer, 2006. [14] E. Filiol. Malware pattern scanning schemes secure against black-box analysis. In 15th EICAR, 2006. [15] D. Gryaznov. Scanners of the Year 2000: Heuristics. Proceedings of the 5th Interna- tional Virus Bulletin, 1999. [16] J.O. Kephart and W.C. Arnold. Automatic Extraction of Computer Virus Signatures. 4th Virus Bulletin International Conference, pages 178–184, 1994. [17] M.G. Schultz, E. Eskin, E. Zadok, and S.J. Stolfo. Data Mining Methods for Detection of New Malicious Executables. Proceedings of the IEEE Symposium on Security and Privacy, page 38, 2001. [18] P. Sz¨or. The Art of Computer Virus Research and Defense. Addison-Wesley Professional, 2005. [19] GJ Tesauro, JO Kephart, and GB Sorkin. Neural networks for computer virus recogni- tion. Expert, IEEE [see also IEEE Intelligent Systems and Their Applications], 11(4):5–6, 1996. [20] Andrew Walenstein, Rachit Mathur, Mo- hamed R. Chouchane, and Arun Lakhotia. Normalizing metamorphic malware using term rewriting. scam, 0:75–84, 2006. 9 inria-00330022,version1-13Oct2008