Malware_SmartCom_2017
1. Effective Malware Detection based on
Behavior and Data Features
Zhiwu Xu, Cheng Wen, Shengchao Qin, and Zhong Ming
College of Computer Science and Software Engineering,
Shenzhen University, China
3. Malware
Malicious software:
Computer viruses, worms, Trojan
horses, ransomware, spyware, adware,
scareware, and other intrusive codes
Recent report from McAfee:
More than 650 million malware samples
detected in Q1 2017, of which more than
30 million are new.
4. Signature-based method
Compares files against known signatures
Comodo, McAfee, Kaspersky, Kingsoft, and
Symantec
Can be easily evaded by evasion techniques:
packing, variable-renaming, and
polymorphism.
5. Heuristic-based method
To identify malicious patterns through either
static analysis or dynamic analysis
However, heavyweight or inefficient
6. Machine learning approaches
Most existing work focuses on behaviour
features, without data information
binary codes, opcodes and API calls
Can be easily evaded
previously-unseen behaviors
obfuscation
8. Our approaches
Based on machine learning
Consider both the behaviour information and
the data information.
Consider the time-split samples and obfuscated
samples
13. Feature Extractor
Decompilation
Information Extraction
Feature Selection and representation
Selection:
Term Frequency and Inverse Document Frequency (TF-IDF)
Representation:
15. Classifier
Classifier Training
An executable $e$ can be represented as a vector $x$. $D_0$
represents the available dataset with known categories. Our
training problem is to find a classifier $C: X \to [0,1]$ such that
$$\min \sum_{(x,c) \in D_0} d(C(x) - c)$$
Malware Detection
Given an executable $e$ and its vector representation $x$, the
goal of the detection is to find $c$ such that
$$\min d(C(x) - c)$$
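The training and detection problems above can be sketched in a few lines. This is only an illustration under stated assumptions: the slide does not define $d$, so the absolute value is assumed here; the encoding 0 = benign and 1 = malware is assumed; and the logistic classifier with hand-picked weights (`make_classifier`, `good`, `bad`) is a hypothetical stand-in, not the classifier used in the paper.

```python
import math

def d(z: float) -> float:
    """Assumed distance: absolute value of the difference."""
    return abs(z)

def make_classifier(weights, bias):
    """Build a toy C: X -> [0, 1] as a logistic function of a weighted sum."""
    def C(x):
        s = sum(w * xi for w, xi in zip(weights, x)) + bias
        return 1.0 / (1.0 + math.exp(-s))
    return C

def training_loss(C, D0):
    """Sum of d(C(x) - c) over the labelled dataset D0."""
    return sum(d(C(x) - c) for x, c in D0)

def detect(C, x, categories=(0, 1)):
    """Pick the category c that minimises d(C(x) - c)."""
    return min(categories, key=lambda c: d(C(x) - c))

# Hypothetical labelled vectors: 0 = benign, 1 = malware (an assumed encoding).
D0 = [([0.0, 0.1], 0), ([0.1, 0.0], 0), ([0.9, 0.8], 1), ([0.7, 1.0], 1)]

# Two candidate classifiers with hand-picked, hypothetical parameters.
good = make_classifier([4.0, 4.0], -3.5)
bad = make_classifier([-4.0, -4.0], 3.5)
```

Training amounts to preferring the candidate with the smaller `training_loss`; detection then rounds the classifier output to the nearest category.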
21. Time-Split Experiment
We use fresh malware samples, collected
from January 2017 to July 2017,
from the DAS MALWERK website.
22. Obfuscation Experiments
Obfuscation tool: Obfuscator
Change code execution flow
Obfuscation tool: Unest
rewriting digital changes equivalently
confusing the output string
pushing the target code segment into the stack and jumping
to it to confuse the target code
obfuscating the static libraries
24. Conclusion
Machine learning methods based on the opcodes,
data types and system libraries.
Carried out some interesting experiments.
Capable of detecting some fresh malware
Has a resistance to some obfuscation techniques
Thank you to the Session Chair. I am very honored to have this opportunity to attend this conference. The topic of my paper is “Effective Malware Detection based on Behavior and Data Features”. I am the speaker, Cheng Wen. This work was done with Zhiwu Xu, Shengchao Qin, and Zhong Ming.
The outline of my talk is as follows. In the first part, I will introduce the background of this research. The second part presents our approach, followed by the experiments. Finally, a brief conclusion is given.
Well, let’s move on to the first part of this topic.
Malware is a generic term that encompasses viruses, worms, spyware, and other intrusive code. Malware is spreading all over the world and increasing day by day, and has thus become a serious threat. According to a recent report from McAfee, more than 650 million malware samples were detected in the first quarter of 2017, of which more than 30 million were new. So the detection of malware is of major concern to both the anti-malware industry and researchers.
To protect users from these threats, anti-malware software products from different companies, such as Comodo, McAfee and so on, provide the major defense against malware. These products employ the signature-based method. However, this method can be easily evaded by malware writers through evasion techniques such as packing, variable renaming, and polymorphism.
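To illustrate why signature matching is brittle, here is a minimal sketch that assumes the simplest possible signature: a SHA-256 digest of the whole file. Real products use richer signatures than this; the sample bytes and the names `KNOWN_SIGNATURES` and `is_known_malware` are hypothetical.

```python
import hashlib

def signature(data: bytes) -> str:
    """Return a whole-file SHA-256 digest, used here as a toy 'signature'."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical bytes standing in for a known malicious file.
original = b"\x90\x90\xeb\xfe payload"
KNOWN_SIGNATURES = {signature(original)}

def is_known_malware(data: bytes) -> bool:
    """Flag a file only if its signature is already in the database."""
    return signature(data) in KNOWN_SIGNATURES

# A trivially modified variant (one appended byte) no longer matches.
variant = original + b"\x00"
detected_original = is_known_malware(original)
detected_variant = is_known_malware(variant)
```

The unchanged file is flagged, while the one-byte variant is not, even though its behaviour would be the same.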
To overcome the limitation of the signature-based method, heuristic-based methods have been proposed, which focus on identifying malicious behavior patterns through either static analysis or dynamic analysis. But with the increasing number of malware samples, this method is no longer considered effective.
Recently, various machine learning approaches have been proposed for detecting malware. Although some approaches achieve high accuracy (on stationary data sets), this is still not enough for malware detection. Most existing work focuses on behaviour features such as binary codes, opcodes and API calls, leaving the data information out of consideration.
Also, these approaches can be easily evaded. Malware evolves rapidly, so it becomes hard to generalize learning models to reflect future, previously-unseen behaviors. And most of the work did not consider resistance to obfuscation techniques.
Next, let’s move to the second part.
In this paper, we propose an effective approach to detect malware based on machine learning. Different from most existing work, we take into account not only the behaviour information but also the data information. We also consider time-split samples and obfuscated samples.
Generally, the behaviour information reflects which behaviours a piece of software intends to perform, while the data information indicates which data the software intends to operate on and how that data is organized.
This figure shows the framework of our approach, which consists of two components, namely the feature extractor and the malware classifier. The feature extractor extracts the feature information from the executables and represents them as vectors. The malware classifier is first trained on an available dataset of executables, and can then be used to detect new, unseen executables. In the following, we describe both components in more detail.
The feature extractor consists of three steps: decompilation, information extraction, and feature selection and representation.
An instruction or a piece of data in an executable file is represented as a series of binary codes, which are clearly not easy to read. So the first step is to transform the binary code into a readable intermediate representation, such as assembly code, using a decompilation tool.
Next, the extractor parses the asm files to extract the information, namely opcodes, data types, and system libraries. Generally, the opcodes used in an executable represent its intended behaviours, while the data types indicate the structures of the data it may operate on. In addition, the imported system libraries reflect the interaction between the executable and the system. All this information describes the possible mission of an executable in some sense, and similar executables share similar information.
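The extraction step can be sketched as a small parser that counts the three kinds of terms. This is a rough sketch only: the listing format below (including the `; Imports from` comment lines), the short `OPCODES` and `DATA_TYPES` sets, and the `op:`/`type:`/`lib:` prefixes are all assumptions made for illustration, and real decompiler output is more varied.

```python
import re
from collections import Counter

# Hypothetical, simplified disassembly listing; real decompiler output differs.
ASM = """\
; Imports from KERNEL32.dll
; Imports from USER32.dll
push ebp
mov ebp, esp
mov eax, dword ptr [ebp+8]
mov byte ptr [eax], 0
call ds:ExitProcess
var_4 dd 0
"""

# Small illustrative vocabularies, not complete instruction or type sets.
OPCODES = {"push", "mov", "call", "pop", "ret", "jmp", "add", "sub"}
DATA_TYPES = {"byte", "word", "dword", "qword", "db", "dw", "dd", "dq"}

def extract_features(asm_text: str) -> Counter:
    """Count opcode, data-type, and library terms in one asm listing."""
    counts = Counter()
    for line in asm_text.splitlines():
        lib = re.match(r";\s*Imports from\s+(\S+)", line)
        if lib:
            counts["lib:" + lib.group(1).lower()] += 1
            continue
        # Drop trailing comments, then split the rest into identifier tokens.
        tokens = re.findall(r"[A-Za-z_][A-Za-z_0-9]*", line.split(";")[0])
        if tokens and tokens[0] in OPCODES:
            counts["op:" + tokens[0]] += 1
        for tok in tokens:
            if tok in DATA_TYPES:
                counts["type:" + tok] += 1
    return counts

features = extract_features(ASM)
```

Each executable then yields one bag of terms, which is the input to the selection step.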
We use the well-known TF-IDF scheme to measure the statistical dependence. Next, the extractor selects the k terms with the highest weights. Each executable can then be represented as a vector. An example vector is shown in the following.
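A minimal sketch of the selection and representation step follows. It assumes one common TF-IDF variant (term frequency normalised by document length, times log(N / df)) and ranks terms by their summed weight across all executables; the talk does not specify either choice, and the term counts in `docs` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical term counts per executable (opcode, type, and library terms).
docs = [
    Counter({"op:mov": 4, "op:push": 2, "lib:kernel32.dll": 1}),
    Counter({"op:mov": 3, "op:xor": 5, "type:dword": 2}),
    Counter({"op:mov": 1, "op:xor": 1, "lib:kernel32.dll": 1}),
]

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)  # document frequency
    weights = []
    for doc in docs:
        total = sum(doc.values())
        weights.append({t: (c / total) * math.log(n / df[t]) for t, c in doc.items()})
    return weights

def top_k_terms(weights, k):
    """Rank terms by summed weight over all documents and keep the top k."""
    score = {}
    for w in weights:
        for term, value in w.items():
            score[term] = score.get(term, 0.0) + value
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

weights = tfidf(docs)
vocab = top_k_terms(weights, k=3)
# Each executable becomes a fixed-length vector over the selected terms.
vectors = [[w.get(term, 0.0) for term in vocab] for w in weights]
```

Note that `op:mov` occurs in every document, so its inverse document frequency is zero and it is never selected, which is the intended effect of the IDF factor.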
The other component is the malware classifier.
As mentioned before, we first train our malware classifier on an available dataset of executables with known categories using a supervised machine-learning method, and then use it to detect new, unseen executables.
Next come the experiments.
Our dataset consists of malware and benign samples. The malware samples come from the BIG 2015 Challenge and from theZoo aka Malware DB, while the benign software was collected from the 360 software company. We used various machine learning methods to train classifiers and performed experiments to test our approach’s ability.
To evaluate the performance of our approach, we conducted 10-fold cross-validation experiments. The learning methods we used in our experiments are listed in the table. In terms of the ROC curves, most classifiers produce good classification results.
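For readers unfamiliar with the procedure, 10-fold cross-validation can be sketched as follows. The nearest-centroid learner and the synthetic one-dimensional data are hypothetical stand-ins chosen to keep the example self-contained; they are not the learning methods or the dataset from the experiments.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def train_centroids(samples):
    """Toy learner: mean feature value per class (a stand-in learner)."""
    sums, counts = {}, {}
    for x, c in samples:
        sums[c] = sums.get(c, 0.0) + x
        counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda c: abs(centroids[c] - x))

def cross_validate(samples, k=10):
    """Train on k-1 folds, test on the held-out fold, and repeat k times."""
    accuracies = []
    for held_out in k_fold_indices(len(samples), k):
        test_set = set(held_out)
        train = [s for i, s in enumerate(samples) if i not in test_set]
        model = train_centroids(train)
        hits = sum(predict(model, samples[i][0]) == samples[i][1] for i in held_out)
        accuracies.append(hits / len(held_out))
    return accuracies

# Hypothetical 1-D data: benign (0) lies in [0.0, 0.4], malware (1) in [0.6, 1.0].
rng = random.Random(1)
data = [(rng.uniform(0.0, 0.4), 0) for _ in range(50)] + \
       [(rng.uniform(0.6, 1.0), 1) for _ in range(50)]
scores = cross_validate(data, k=10)
mean_accuracy = sum(scores) / len(scores)
```

Every sample is used for testing exactly once, so the mean over the ten folds estimates performance on data the model has not seen during training.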
Meanwhile, we measured the training times and the testing times in seconds for each cross-validation experiment. The results are shown in this table. We also evaluated how the feature extractor performs. Both the decompilation time and the extraction time are acceptable.
Next, we also performed experiments on each kind of feature to see how effective it is. For that, we conducted the same experiments as above for each kind of feature. From the results we can see that all the features are effective for detecting malware, and using all of the features together produced the best results. The opcode and library features have been used in a lot of work in practice, so we believe that the type information can benefit malware detection in practice as well.
In this section, to test our approach’s ability to detect genuinely new malware or new malware versions, we ran a time-split experiment. First, we downloaded malware samples collected from January 2017 to July 2017 from the DAS MALWERK website. That is to say, all of these samples are newer than the ones in our data set.
About 81% of the samples can be detected by our classifier, which suggests that our approach can detect some new malware samples or new versions of existing malware samples. However, the results also indicate that the classifier becomes less effective as time passes. This suggests that malware classifiers should be updated often with new data or new features in order to maintain the classification accuracy.
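The shape of a time-split evaluation can be sketched as below. All records, scores, the cutoff date, and the 0.5 threshold are hypothetical values invented for illustration; they are not the experimental data or results.

```python
from datetime import date

# Hypothetical records: (first-seen date, classifier score in [0, 1], is_malware).
samples = [
    (date(2015, 3, 1), 0.95, True),
    (date(2016, 8, 9), 0.10, False),
    (date(2017, 2, 14), 0.91, True),
    (date(2017, 4, 2), 0.35, True),
    (date(2017, 6, 30), 0.77, True),
]

def time_split(samples, cutoff):
    """Separate samples seen before the cutoff from those seen on or after it."""
    old = [s for s in samples if s[0] < cutoff]
    new = [s for s in samples if s[0] >= cutoff]
    return old, new

def detection_rate(samples, threshold=0.5):
    """Fraction of malware samples whose score reaches the threshold."""
    malware = [s for s in samples if s[2]]
    detected = [s for s in malware if s[1] >= threshold]
    return len(detected) / len(malware)

# Train only on the older samples; evaluate only on the fresh ones.
train_set, fresh_set = time_split(samples, cutoff=date(2017, 1, 1))
rate = detection_rate(fresh_set)
```

Unlike cross-validation, no sample newer than the cutoff may influence training, which is what makes the measured rate a fair estimate for genuinely new malware.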
One reason malware detection is difficult is that malware writers can use obfuscation techniques. In this section, we performed some experiments to test our approach’s ability to detect new malware samples that are obtained by obfuscating existing ones.
We use two commercial tools, Obfuscator and Unest, to obfuscate some malware samples, which are randomly selected from our data set.
The results show that all the obfuscated malware samples can be detected by our classifier. That is to say, our classifier is resistant to some obfuscation techniques.
Finally, let me conclude the talk.
In this work, we have proposed a malware detection approach using various machine learning methods based on the opcodes, data types and system libraries.
To evaluate the proposed approach, we have carried out some interesting experiments.
The experimental results have demonstrated that our classifier is capable of detecting some fresh malware and is resistant to some obfuscation techniques.
We use static analysis. In malware detection, both static analysis and dynamic analysis have their own advantages and limitations.
In real applications, we suggest using static analysis first. If the file cannot be well represented, then we can try dynamic analysis.