Malware_SmartCom_2017

FoodieBuddy

malware analysis cyber security digital forensics

Engineering

Effective Malware Detection based on
Behavior and Data Features
Zhiwu Xu, Cheng Wen, Shengchao Qin, and Zhong Ming
College of Computer Science and Software Engineering,
Shenzhen University, China

Introduction
Approach
Experiments
Conclusion

Malware
• Malicious software:
  - Computer viruses, worms, Trojan horses, ransomware, spyware, adware, scareware, and other intrusive code
• Recent report from McAfee:
  - More than 650 million malware samples detected in Q1 2017, of which more than 30 million were new.

Signature-based method
• Compares executables against known signatures
  - Employed by products from Comodo, McAfee, Kaspersky, Kingsoft, and Symantec
• Can be easily evaded with evasion techniques
  - packing, variable renaming, and polymorphism

Heuristic-based method
• Identifies malicious patterns through either static analysis or dynamic analysis
• However, heavyweight or inefficient

Machine learning approaches
• Most existing work focuses on behaviour features, without data information
  - binary codes, opcodes, and API calls
• Can be easily evaded
  - previously unseen behaviors
  - obfuscation

Our approach
• Based on machine learning
• Considers both the behaviour information and the data information
• Considers time-split samples and obfuscated samples

Feature Extractor
• Decompilation
  - A decompilation tool turns the executable into ASM code
• Information Extraction
  - Opcode
  - System call
  - Data type (e.g., int *)
• Feature Selection and Representation
  - Selection: Term Frequency and Inverse Document Frequency (TF-IDF)
  - Representation: [example feature vector shown as a figure on the slide]

Classifier
Classifier Training
• An executable $e$ can be represented as a vector $x$. $D_0$ represents the available dataset with known categories. Our training problem is to find a classifier $C: X \to [0,1]$ such that
$$\min \sum_{(x,c) \in D_0} d(C(x) - c)$$
Malware Detection
• Given an executable $e$ and its vector representation $x$, the goal of detection is to find $c$ such that
$$\min_{c}\, d(C(x) - c)$$

Experiments
• Malware dataset (11,376 samples)
  - BIG 2015 Challenge
  - theZoo aka Malware DB
• Benign dataset (8,003 samples)
  - QIHU 360 software
(with a total size of 250 GB)

Cross Validation Experiments
10-fold cross validation

Runtime performance

Feature extractor:
• Decompilation: 250 GB in 15.6 hours (0.22 s/MB)
• Feature extraction: 182 GB in 10.5 hours (0.20 s/MB)

Classifiers:

| Classifier | Training Time (s) | Testing Time (s) |
|---|---|---|
| KNN (k = 1) | 0 + (16.477) | 178.789 |
| KNN (k = 3) | 0 + (16.369) | 199.474 |
| KNN (k = 5) | 0 + (16.517) | 207.052 |
| KNN (k = 7) | 0 + (16.238) | 210.557 |
| DT (criterion = ‘gini’) | 23.442 | 0.067 |
| DT (criterion = ‘entropy’) | 13.485 | 0.066 |
| RF (n = 10, gini) | 4.115 | 0.086 |
| RF (n = 10, entropy) | 3.791 | 0.077 |
| Gaussian Naïve Bayes | 3.093 | 0.480 |
| Multinomial Naïve Bayes | 1.535 | 0.035 |
| Bernoulli Naïve Bayes | 1.826 | 0.828 |
| SVM (kernel = ‘linear’) | 150.022 | 14.494 |
| SVM (kernel = ‘rbf’) | 799.310 | 50.196 |
| SVM (kernel = ‘sigmoid’) | 1303.607 | 130.178 |
| SGD Classifier | 22.569 | 0.048 |

Time-Split Experiment
We use fresh malware samples from the DAS MALWERK website,
collected from January 2017 to July 2017.

Obfuscation Experiments
Obfuscation tool: Obfuscator
• Changes the code execution flow
Obfuscation tool: Unest
• Rewrites digital changes equivalently
• Confuses the output strings
• Pushes the target code segment onto the stack and jumps to it to confuse the target code
• Obfuscates the static libraries

Conclusion
• Machine learning methods based on opcodes, data types, and system libraries
• Evaluated with cross validation, time-split, and obfuscation experiments
• Capable of detecting some fresh malware
• Has resistance to some obfuscation techniques

Editor's Notes

Thank you to the session chair. I am very honored to have the opportunity to attend this conference. The topic of my paper is “Effective Malware Detection based on Behavior and Data Features”. I am the speaker, Cheng Wen. This work was done with Zhiwu Xu, Shengchao Qin, and Zhong Ming.
The outline of my talk is as follows. In the first part, I introduce the background of this research. The second part presents our approach, followed by the experiments. Finally, a brief conclusion is given. Let’s move on to the first part.
Malware is a generic term that encompasses viruses, worms, spyware, and other intrusive code. It is spreading all over the world and increasing day by day, and has thus become a serious threat. According to a recent report from McAfee, more than 650 million malware samples were detected in the first quarter of 2017, of which more than 30 million were new. The detection of malware is therefore of major concern to both the anti-malware industry and researchers.
To protect users from these threats, anti-malware products from companies such as Comodo and McAfee provide the major defense against malware, and they employ the signature-based method. However, malware writers can easily evade this method through evasion techniques.
To overcome the limitations of the signature-based method, heuristic-based methods have been proposed, which focus on identifying malicious behavior patterns through either static analysis or dynamic analysis. But with the increasing number of malware samples, this method is no longer considered effective.
Recently, various machine learning approaches have been proposed for detecting malware. Although some approaches achieve high accuracy (on stationary data sets), this is still not enough for malware detection. Most existing work focuses on behaviour features such as binary codes, opcodes, and API calls, leaving the data information out of consideration. Such approaches can also be easily evaded: malware evolves rapidly, so it becomes hard to generalize learning models to reflect future, previously unseen behaviors. And most of the work did not consider resistance to obfuscation techniques.
Next, let’s move to the second part.
In this paper, we propose an effective approach to detecting malware based on machine learning. Unlike most existing work, we take into account not only the behaviour information but also the data information. We also consider time-split samples and obfuscated samples. Generally, the behaviour information reflects which behaviours a piece of software intends to perform, while the data information indicates which data it intends to operate on or how the data are organized.
This figure shows the framework of our approach, which consists of two components: the feature extractor and the malware classifier. The feature extractor extracts feature information from the executables and represents them as vectors. The malware classifier is first trained on an available dataset of executables and can then be used to detect new, unseen executables. In the following, we describe both components in more detail.
The feature extractor consists of three steps: decompilation, information extraction, and feature selection and representation.
An instruction or a piece of data in an executable file is represented as a series of binary codes, which are clearly not easy to read. So the first step is to transform the binary codes into a readable intermediate representation, such as assembly code, using a decompilation tool.
Next, the extractor parses the asm files to extract three kinds of information: opcodes, data types, and system libraries. Generally, the opcodes used in an executable represent its intended behaviours, while the data types indicate the structures of the data it may operate on. In addition, the imported system libraries reflect the interaction between the executable and the system. All this information describes the possible mission of an executable in some sense, and similar executables share similar information.
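As a rough illustration of this extraction step, the sketch below counts opcodes, operand data types, and imported libraries in a tiny listing. The listing format, the `; imports` comment convention, the use of operand-size keywords as "data types", and the `extract` function are all assumptions made for this example; the talk does not specify the decompiler's output format or how types are recovered.

```python
import re
from collections import Counter

# Hypothetical, simplified listing; real disassembler output differs by tool.
ASM = """\
; imports kernel32.dll
; imports user32.dll
mov eax, dword ptr [ebp+8]
push eax
call ds:CreateFileA
mov byte ptr [ecx], 0
ret
"""

IMPORT_RE = re.compile(r"^;\s*imports\s+(\S+)")           # assumed import marker
TYPE_RE = re.compile(r"\b(byte|word|dword|qword)\s+ptr\b")  # operand-size keywords


def extract(asm_text):
    """Return (opcode, data type, library) frequency counters for one listing."""
    opcodes, types, libs = Counter(), Counter(), Counter()
    for line in asm_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = IMPORT_RE.match(line)
        if match:
            libs[match.group(1).lower()] += 1
            continue
        if line.startswith(";"):
            continue  # any other comment line is ignored
        opcodes[line.split()[0].lower()] += 1  # first token is the mnemonic
        for type_name in TYPE_RE.findall(line):
            types[type_name] += 1
    return opcodes, types, libs


opcodes, types, libs = extract(ASM)
```

The three counters are kept separate so that each kind of feature can later be weighted or evaluated on its own.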
We use the well-known TF-IDF scheme to measure the statistical dependence. Next, the extractor selects the k terms with the highest weights. Each executable can then be represented as a vector. An example vector is shown on the slide.
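A minimal sketch of TF-IDF weighting followed by top-k term selection, using only the standard library. The exact TF-IDF variant and the ranking rule (here, a term's highest weight in any document) are assumptions, since the talk does not state them; the function names and the toy opcode lists are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: one list of terms per executable. Returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


def top_k_terms(weights, k):
    """Rank terms by their highest weight in any document and keep the first k."""
    best = {}
    for doc_weights in weights:
        for term, value in doc_weights.items():
            best[term] = max(best.get(term, 0.0), value)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(best, key=lambda term: (-best[term], term))[:k]


def vectorize(doc_weights, vocab):
    """Represent one executable as a vector over the selected terms."""
    return [doc_weights.get(term, 0.0) for term in vocab]


docs = [
    ["mov", "push", "call", "mov"],
    ["mov", "xor", "xor", "jmp"],
    ["mov", "push", "ret"],
]
weights = tf_idf(docs)
vocab = top_k_terms(weights, k=3)
vectors = [vectorize(w, vocab) for w in weights]
```

In this toy corpus `mov` occurs in every document, so its weight is zero and it is not selected; terms concentrated in few documents rank highest.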
The other component is the malware classifier.
As mentioned before, we first train our malware classifier on an available dataset of executables with known categories using a supervised machine learning method, and then use it to detect new, unseen executables.
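To make the train-then-detect flow concrete, here is a small k-nearest-neighbours sketch in pure Python. It is only one possible instance of a classifier C: X → [0,1]; the label convention (1 = malware, 0 = benign), the use of the absolute difference as the distance d, and the toy vectors are assumptions for illustration.

```python
import math


def knn_classifier(train, k=1):
    """train: list of (vector, label) pairs with label 1 = malware, 0 = benign.

    Returns C, where C(x) in [0, 1] is the fraction of the k nearest
    training vectors that are labelled malware.
    """
    def classify(x):
        nearest = sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]
        return sum(label for _, label in nearest) / k
    return classify


def detect(classifier, x):
    """Choose the category c in {0, 1} that minimises |C(x) - c|."""
    score = classifier(x)
    return min((0, 1), key=lambda c: abs(score - c))


# Toy dataset D0 with known categories (hypothetical feature vectors).
D0 = [
    ([0.9, 0.1], 1),
    ([0.8, 0.2], 1),
    ([0.1, 0.9], 0),
    ([0.2, 0.7], 0),
]
C = knn_classifier(D0, k=3)
verdict_a = detect(C, [0.85, 0.15])  # close to the malware vectors
verdict_b = detect(C, [0.15, 0.80])  # close to the benign vectors
```

An odd k avoids ties between the two categories; with an even k, `detect` as written would fall back to the benign label on a tie.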
Next come the experiments.
Our dataset consists of malware and benign software. The malware dataset consists of samples from the BIG 2015 Challenge and from theZoo aka Malware DB, while the benign software was collected from the 360 software company. We used various machine learning methods to train classifiers and performed several experiments to test our approach’s ability.
To evaluate the performance of our approach, we conducted 10-fold cross validation experiments. The learning methods we used in our experiments are listed in the table. As the ROC curves show, most of the classifiers produce good classification results.
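The 10-fold procedure can be sketched as follows: the samples are shuffled, cut into ten folds, and each fold serves once as the test set while the other nine are used for training. The helper names, the fixed seed, and the trivial majority-label learner are stand-ins chosen for this example, not the setup used in the experiments.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_validate(samples, fit, k=10, seed=0):
    """samples: list of (vector, label); fit(train) returns a predict(x) function.

    Returns the accuracy measured on each of the k held-out folds.
    """
    scores = []
    for train_idx, test_idx in k_fold_splits(len(samples), k, seed):
        predict = fit([samples[i] for i in train_idx])
        hits = sum(predict(samples[i][0]) == samples[i][1] for i in test_idx)
        scores.append(hits / len(test_idx))
    return scores


def fit_majority(train):
    """Stand-in learner: always predicts the most common training label."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


# 40 toy samples: 30 labelled 1 and 10 labelled 0.
data = [([float(i)], 1 if i % 4 else 0) for i in range(40)]
scores = cross_validate(data, fit_majority, k=10)
mean_accuracy = sum(scores) / len(scores)
```

Any learner with the same `fit(train) -> predict` shape can be dropped in place of `fit_majority`.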
Meanwhile, we recorded the training and testing times, in seconds, for each cross validation experiment. The results are shown in this table. We also evaluated how the feature extractor performs. Both the decompilation time and the extraction time are acceptable.
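The per-megabyte rates on the runtime slide (250 GB decompiled in 15.6 hours, 182 GB processed for feature extraction in 10.5 hours) can be checked with a short calculation. The conversion of 1 GB to 1024 MB is an assumption; it is the one that reproduces both rounded figures.

```python
def seconds_per_mb(gigabytes, hours, mb_per_gb=1024):
    """Average processing time per megabyte for a job of the given size."""
    return hours * 3600 / (gigabytes * mb_per_gb)


decompile_rate = seconds_per_mb(250, 15.6)  # decompilation
extract_rate = seconds_per_mb(182, 10.5)    # feature extraction
print(f"{decompile_rate:.2f} s/MB, {extract_rate:.2f} s/MB")  # prints 0.22 s/MB, 0.20 s/MB
```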
Next, we performed experiments on each kind of feature to see its effectiveness. For that, we conducted the same experiments as above for each kind of feature separately. The results show that every kind of feature is effective for detecting malware, and that using all of the features together produces the best results. The opcode and library features have been used in a lot of work in practice, so we believe that the type information can benefit malware detection in practice as well.
In this section, to test our approach’s ability to detect genuinely new malware or new malware versions, we ran a time-split experiment. First, we downloaded malware samples that were collected from January 2017 to July 2017. That is to say, all of these samples are newer than the ones in our data set. About 81% of the samples were detected by our classifier, which suggests that our approach can detect some new malware samples or new versions of existing malware. However, the results also indicate that the classifier becomes less effective as time passes. This suggests that malware classifiers should be updated often with new data or new features in order to maintain classification accuracy.
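A time-split evaluation can be sketched like this: samples first seen before a cutoff date are used for training, and the detection rate is measured only on newer malware. The dates, feature values, and the threshold rule standing in for a trained classifier are all hypothetical.

```python
from datetime import date


def time_split(samples, cutoff):
    """samples: list of (first_seen, vector, label). Split at the cutoff date."""
    train = [(x, y) for seen, x, y in samples if seen < cutoff]
    test = [(x, y) for seen, x, y in samples if seen >= cutoff]
    return train, test


def detection_rate(predict, malware_vectors):
    """Fraction of known-malicious vectors that predict() flags as malware (1)."""
    detected = sum(1 for x in malware_vectors if predict(x) == 1)
    return detected / len(malware_vectors)


samples = [
    (date(2016, 5, 1), [0.9], 1),
    (date(2016, 8, 1), [0.1], 0),
    (date(2017, 2, 1), [0.8], 1),
    (date(2017, 6, 1), [0.3], 1),
]
train, test = time_split(samples, cutoff=date(2017, 1, 1))


def predict(x):
    """Stand-in for a classifier trained on `train`: a fixed threshold."""
    return 1 if x[0] > 0.5 else 0


fresh_malware = [x for x, label in test if label == 1]
rate = detection_rate(predict, fresh_malware)
```

In this toy run one of the two fresh samples has drifted below the threshold and is missed, which mirrors the observation that a classifier loses effectiveness as malware evolves.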
One reason malware detection is difficult is that malware writers can use obfuscation techniques. In this section, we performed experiments to test our approach’s ability to detect new malware samples obtained by obfuscating existing ones. We used two commercial tools, Obfuscator and Unest, to obfuscate some malware samples randomly selected from our data set. The results show that all the obfuscated malware samples were detected by our classifier. That is to say, our classifier has resistance to some obfuscation techniques.
Finally, I conclude the talk.
In this work, we have proposed a malware detection approach using various machine learning methods based on opcodes, data types, and system libraries. To evaluate the proposed approach, we carried out several experiments. The experimental results demonstrate that our classifier is capable of detecting some fresh malware and has resistance to some obfuscation techniques.
We use static analysis. In malware detection, both static analysis and dynamic analysis have their own advantages and limitations. In real applications, we suggest using static analysis first. If the file cannot be well represented, then dynamic analysis can be tried.
