My name is Silvio Cesare. My coauthor for this paper is Dr Yang Xiang. We are both from Central Queensland University, and our research investigates malware classification using structured control flow.
The first topic I’d like to address is the motivation for this research, and why it’s a significant topic to investigate. Our research focuses on better methods to detect and classify malware, but what is malware? Malware is characterized as hostile, intrusive or annoying software, and it’s a pervasive problem in distributed and networked computing. The global scale of the malware problem motivates malware detection, and detection is necessary for a secure environment. Identifying malware variants enables early detection and provides a useful defense against malware threats.
A variety of schemes exist to statically classify malware. In a purely static approach, the malware is never executed. Static approaches have been applied using statistical measures such as n-grams, or dissimilarity measures such as the edit distance of the malware’s raw content. Classification using control flow is considered superior to n-grams and edit distances over the raw malware content, because control flow is largely invariant across strains in a family of malware. The alternative techniques perform poorly because small changes in the malware source code can significantly affect the byte-level content. This is not true, however, of control flow. Control flow is an effective feature to fingerprint malware, but the extraction of this feature can be hindered when the malware hides its real content using the code packing transformation.
The code packing transformation is an obfuscation method applied to malware as a post-processing stage to hide its real content. Some legitimate software is packed, and so is the majority of malware. In one study, 79% of the malware seen in a single month was found to be packed. In 2006, it was reported that 50% of malware from that year consisted of repacked versions of existing malware. The typical behaviour of a packed program is to dynamically generate the hidden code at runtime, and then execute it. The goal of automated unpacking is to reverse the code packing transformation so that the hidden content is revealed.
Silvio Cesare and Yang Xiang, School of Management and Information Systems, Centre for Intelligent and Networked Systems, Central Queensland University
A novel system for approximate identification of control flow (flowgraph) signatures using the decompilation technique of structuring, and then using those signatures to classify a query program against a malware database.
A fast application-level emulator that provides automated unpacking and is capable of real-time desktop use.
A novel algorithm to determine when to stop emulation, using entropy analysis.
We implement and evaluate our ideas in a prototype system that performs automated unpacking and malware classification.
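Packed data is usually compressed or encrypted, which tends to push byte entropy towards 8 bits per byte, while ordinary code scores lower. The sketch below shows the measurement that an entropy-based stopping test could build on. The `entropy_dropped` helper, its arguments and the 1.0-bit margin are hypothetical, chosen only for illustration, and are not the stopping criterion used by the prototype.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 for constant data, 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    # Sum p * log2(1/p) over the observed byte values.
    return sum((n / total) * math.log2(total / n) for n in Counter(data).values())


def entropy_dropped(packed: bytes, current: bytes, min_drop: float = 1.0) -> bool:
    """Hypothetical check: has entropy fallen by at least min_drop bits per byte?

    'packed' stands for the original packed content and 'current' for the
    memory the emulator has written so far; both names are assumptions.
    """
    return shannon_entropy(packed) - shannon_entropy(current) >= min_drop
```

An emulator loop might call `entropy_dropped` periodically and treat a sustained drop as a hint that the hidden code has been generated.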
Tested classification of the Klez (shown bottom left), Netsky (shown bottom right) and Roron families of malware.
Results show high similarities between malware variants.
Klez variants (left):

      a     b     c     d     g     h
a     -     0.84  1.00  0.76  0.47  0.47
b     0.84  -     0.84  0.87  0.46  0.46
c     1.00  0.84  -     0.76  0.47  0.47
d     0.76  0.87  0.76  -     0.46  0.45
g     0.47  0.46  0.47  0.46  -     0.83
h     0.47  0.46  0.47  0.45  0.83  -

Netsky variants (right):

      aa    ac    f     j     p     t     x     y
aa    -     0.78  0.61  0.70  0.47  0.67  0.44  0.81
ac    0.78  -     0.66  0.75  0.41  0.53  0.35  0.64
f     0.61  0.66  -     0.86  0.46  0.59  0.39  0.72
j     0.70  0.75  0.86  -     0.52  0.67  0.44  0.83
p     0.47  0.41  0.46  0.52  -     0.61  0.79  0.56
t     0.67  0.53  0.59  0.67  0.61  -     0.61  0.79
x     0.44  0.35  0.39  0.44  0.79  0.61  -     0.49
y     0.81  0.64  0.72  0.83  0.56  0.79  0.49  -
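To make the first matrix (variants a to h) easier to read, the sketch below groups variants whose pairwise similarity meets a cut-off. The scores are copied from that matrix; the 0.6 threshold and the helper name `group_variants` are assumptions for illustration, not values taken from the evaluation.

```python
# Upper triangle of the first similarity matrix (variants a to h).
SIMILARITY = {
    ("a", "b"): 0.84, ("a", "c"): 1.00, ("a", "d"): 0.76, ("a", "g"): 0.47, ("a", "h"): 0.47,
    ("b", "c"): 0.84, ("b", "d"): 0.87, ("b", "g"): 0.46, ("b", "h"): 0.46,
    ("c", "d"): 0.76, ("c", "g"): 0.47, ("c", "h"): 0.47,
    ("d", "g"): 0.46, ("d", "h"): 0.45,
    ("g", "h"): 0.83,
}
VARIANTS = ["a", "b", "c", "d", "g", "h"]


def group_variants(variants, similarity, threshold=0.6):
    """Link variants whose similarity is at least the (assumed) threshold."""
    parent = {v: v for v in variants}

    def find(v):
        # Follow parent links up to the representative of v's group.
        while parent[v] != v:
            v = parent[v]
        return v

    for (x, y), score in similarity.items():
        if score >= threshold:
            parent[find(x)] = find(y)

    groups = {}
    for v in variants:
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


print(group_variants(VARIANTS, SIMILARITY))  # [['a', 'b', 'c', 'd'], ['g', 'h']]
```

Under this assumed threshold the six variants fall into two groups, which matches the two blocks of high scores visible in the matrix.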
Evaluation of Flowgraph Based Classification (cont)
Examined similarities between unrelated malware and programs (left).
Evaluated likely occurrence of false positives by calculating the similarities between the set of Windows Vista system programs, which are mostly not similar to each other (right).
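The false positive check described above amounts to scoring every pair of programs in a set and counting the pairs that look alike. The sketch below assumes each program has already been reduced to a single signature string and scores two signatures with an edit-distance ratio. The signature strings, the file names, the 0.6 threshold and the helper names are all hypothetical, and the similarity measure is a stand-in rather than the one used in the prototype.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def similarity(sig_a: str, sig_b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means the signatures are identical."""
    longest = max(len(sig_a), len(sig_b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(sig_a, sig_b) / longest


def similar_pairs(signatures: dict, threshold: float = 0.6) -> list:
    """Return the pairs of program names whose signatures meet the threshold."""
    return [(x, y) for x, y in combinations(sorted(signatures), 2)
            if similarity(signatures[x], signatures[y]) >= threshold]


# Hypothetical signatures: W = while, I = if, E = else, braces mark nesting.
programs = {"a.exe": "W{I{}}", "b.exe": "W{I{}E{}}", "c.exe": "IIII"}
print(similar_pairs(programs))  # [('a.exe', 'b.exe')]
```

Run over a set of unrelated benign programs, a short list of returned pairs would suggest a low likelihood of false positives at the chosen threshold.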