Effective Malware Detection based on
Behavior and Data Features
Zhiwu Xu, Cheng Wen, Shengchao Qin, and Zhong Ming
College of Computer Science and Software Engineering,
Shenzhen University, China
Introduction
Approach
Experiments
Conclusion
Malware
• Malicious software:
  - Computer viruses, worms, Trojan horses, ransomware, spyware, adware, scareware, and other intrusive code
• Recent report from McAfee:
  - More than 650 million malware samples detected in Q1 2017, of which more than 30 million are new.
Signature-based method
• Compares samples against known signatures
  - Comodo, McAfee, Kaspersky, Kingsoft, and Symantec
• Can be easily evaded by evasion techniques
  - packing, variable-renaming, and polymorphism
Heuristic-based method
• Identifies malicious patterns through either static analysis or dynamic analysis
• However, heavyweight or inefficient
Machine learning approaches
• Most existing work focuses on behaviour features, without data information
  - binary codes, opcodes and API calls
• Can be easily evaded
  - previously-unseen behaviors
  - obfuscation
Introduction
Approach
Experiments
Conclusion
Our approach
• Based on machine learning
• Considers both the behaviour information and the data information
• Considers time-split samples and obfuscated samples
Framework
[Figure: framework of the approach; Feature Extractor component]
Feature Extractor
• Decompilation
• Information Extraction
• Feature Selection and Representation
Decompilation
[Figure: a decompilation tool turns the executable into ASM codes]
[Figure: ASM codes annotated with the extracted information: opcode, system call, data type (int *)]
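The information-extraction step can be sketched as follows. This is only an illustration under stated assumptions: the slides do not show the decompiler's output format, so the listing `SAMPLE_ASM`, the three regular expressions, and the function name `extract_terms` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, simplified ASM listing; real decompiler output is richer.
SAMPLE_ASM = """\
.text:00401000  push    ebp
.text:00401001  mov     ebp, esp
.text:00401003  call    ds:CreateFileA
.text:00401009  mov     eax, [ebp+8]   ; int *
.text:0040100C  retn
.idata:00402000 ; Imports from KERNEL32.dll
"""

# Assumed line shapes: an opcode follows the address in the .text section,
# a data type appears as a trailing comment, and imports name a DLL.
OPCODE_RE = re.compile(r"^\.text:[0-9A-Fa-f]+\s+([a-z]+)\b")
TYPE_RE = re.compile(r";\s*((?:unsigned\s+)?(?:int|char|short|long|void)\s*\**)\s*$")
LIB_RE = re.compile(r"Imports from\s+(\S+\.dll)", re.IGNORECASE)


def extract_terms(asm_text):
    """Count opcodes, data types and imported libraries in an ASM listing."""
    terms = Counter()
    for line in asm_text.splitlines():
        opcode = OPCODE_RE.match(line)
        if opcode:
            terms["op:" + opcode.group(1)] += 1
        data_type = TYPE_RE.search(line)
        if data_type:
            # Drop inner whitespace so "int *" and "int*" count as one term.
            terms["type:" + "".join(data_type.group(1).split())] += 1
        library = LIB_RE.search(line)
        if library:
            terms["lib:" + library.group(1).lower()] += 1
    return terms


terms = extract_terms(SAMPLE_ASM)
print(sorted(terms.items()))
```

Prefixing each term with its kind (`op:`, `type:`, `lib:`) keeps the three feature families apart, so they can later be weighted together or evaluated one family at a time.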
Selection: Term Frequency and Inverse Document Frequency (TF-IDF)
Representation: [Figure: example feature vector of an executable]
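A minimal sketch of TF-IDF selection and vector representation, under assumptions the slides do not settle: it uses one common TF-IDF variant (relative term frequency times log(N/df)), ranks each term by its maximum weight over the corpus, and the function name `tf_idf_top_k` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf_top_k(docs, k):
    """docs: one list of terms per executable. Returns (vocabulary, vectors)."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Document frequency: in how many executables each term occurs.
    doc_freq = Counter(term for c in counts for term in c)

    weights = []
    best = Counter()  # assumed ranking score: max weight over the corpus
    for doc, c in zip(docs, counts):
        total = len(doc) or 1
        w = {t: (n / total) * math.log(n_docs / doc_freq[t]) for t, n in c.items()}
        weights.append(w)
        for term, value in w.items():
            best[term] = max(best[term], value)

    # Keep the k highest-weighted terms; ties are broken alphabetically.
    vocabulary = sorted(best, key=lambda t: (-best[t], t))[:k]
    vectors = [[w.get(t, 0.0) for t in vocabulary] for w in weights]
    return vocabulary, vectors


corpus = [
    ["op:mov", "op:mov", "op:xor", "lib:kernel32.dll"],
    ["op:mov", "op:push", "lib:kernel32.dll", "type:int*"],
    ["op:mov", "op:call", "lib:user32.dll"],
]
vocabulary, vectors = tf_idf_top_k(corpus, k=2)
print(vocabulary)
print(vectors)
```

In this toy corpus `op:mov` occurs in every executable, so its weight is zero and it is never selected, which is the intended effect of the inverse document frequency.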
Framework
[Figure: framework of the approach; Classifier component]
Classifier
Classifier Training
• An executable $e$ can be represented as a vector $x$. $D_0$ represents the available dataset with known categories. Our training problem is to find a classifier $C: X \to [0,1]$ such that
  $$\min \sum_{(x,c) \in D_0} d(C(x) - c)$$
Malware Detection
• Given an executable $e$ and its vector representation $x$, the goal of the detection is to find $c$ such that
  $$\min d(C(x) - c)$$
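The two objectives above can be illustrated with a small sketch. It is not the authors' implementation: it assumes the label encoding c = 1 for malware and c = 0 for benign, takes d to be the absolute value, and uses a plain k-nearest-neighbour score as the classifier C. The names `make_knn_classifier`, `training_loss`, `detect` and the toy dataset `D0` are hypothetical.

```python
import math


def make_knn_classifier(train, k=3):
    """train: list of (vector, label) pairs, label 1 = malware, 0 = benign (assumed).

    Returns C, mapping a vector to a score in [0, 1]: the fraction of
    malware among the k nearest training vectors.
    """
    def classify(x):
        nearest = sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]
        return sum(label for _, label in nearest) / len(nearest)
    return classify


def training_loss(classifier, dataset):
    """Sum of d(C(x) - c) over the dataset, with d assumed to be |.|."""
    return sum(abs(classifier(x) - c) for x, c in dataset)


def detect(classifier, x):
    """Pick the category c in {0, 1} that minimises d(C(x) - c)."""
    score = classifier(x)
    return min((0, 1), key=lambda c: abs(score - c))


D0 = [
    ([0.9, 0.1], 1),
    ([0.8, 0.2], 1),
    ([0.7, 0.3], 1),
    ([0.1, 0.9], 0),
    ([0.2, 0.7], 0),
]
C = make_knn_classifier(D0, k=3)
print(training_loss(C, D0))
print(detect(C, [0.85, 0.15]), detect(C, [0.15, 0.8]))
```

With an odd k the score is never exactly 0.5, so the choice of c in `detect` is unambiguous for this classifier.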
Introduction
Approach
Experiments
Conclusion
Experiments
Malware dataset (11376 samples)
BIG 2015 Challenge
theZoo aka Malware DB
Benign dataset (8003 samples)
QIHU 360 software
(with the total size of 250 GB)
Cross Validation Experiments
10-fold cross validation
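A sketch of the 10-fold protocol named above, assuming a simple shuffled, unstratified split (the slides do not say whether the folds were stratified). The helper names `k_fold_indices`, `cross_validate` and the stand-in learner `fit_majority` are hypothetical.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) for each of the n_folds folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every sample lands in exactly one fold.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(samples, labels, fit, n_folds=10):
    """Average accuracy over the folds; fit(xs, ys) must return a predict(x) function."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(samples), n_folds):
        predict = fit([samples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct = sum(predict(samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


def fit_majority(xs, ys):
    """Stand-in learner: always predicts the most frequent training label."""
    majority = max(set(ys), key=ys.count)
    return lambda x: majority


print(cross_validate([[0.0]] * 20, [1] * 20, fit_majority))
```

Each sample is used for testing exactly once and for training in the other nine folds, so the averaged score uses the whole dataset.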
Decompilation: 250 GB in 15.6 hours (0.22 s/MB)
Feature extraction: 182 GB in 10.5 hours (0.20 s/MB)
Runtime performance
Classifier                    Training Time (s)   Testing Time (s)
KNN (k = 1)                   0 + (16.477)        178.789
KNN (k = 3)                   0 + (16.369)        199.474
KNN (k = 5)                   0 + (16.517)        207.052
KNN (k = 7)                   0 + (16.238)        210.557
DT (criterion = ‘gini’)       23.442              0.067
DT (criterion = ‘entropy’)    13.485              0.066
RF (n = 10, gini)             4.115               0.086
RF (n = 10, entropy)          3.791               0.077
Gaussian Naïve Bayes          3.093               0.480
Multinomial Naïve Bayes       1.535               0.035
Bernoulli Naïve Bayes         1.826               0.828
SVM (kernel = ‘linear’)       150.022             14.494
SVM (kernel = ‘rbf’)          799.310             50.196
SVM (kernel = ‘sigmoid’)      1303.607            130.178
SGD Classifier                22.569              0.048
Feature Experiment
Time-Split Experiment
We use fresh malware samples from the DAS MALWERK website, collected from January 2017 to July 2017.
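The time-split protocol can be sketched as below: train only on samples seen before a cutoff date and measure how many later malware samples are detected. The cutoff of 1 January 2017, the toy samples, and the names `time_split`, `detection_rate` and `toy_predict` are assumptions made for illustration.

```python
from datetime import date


def time_split(samples, cutoff):
    """samples: (first_seen, vector, label) tuples. Older ones train, newer ones test."""
    train = [s for s in samples if s[0] < cutoff]
    test = [s for s in samples if s[0] >= cutoff]
    return train, test


def detection_rate(predict, test):
    """Fraction of malware samples (label 1) in the test set flagged as malware."""
    malware = [s for s in test if s[2] == 1]
    if not malware:
        return 0.0
    return sum(predict(s[1]) == 1 for s in malware) / len(malware)


# Hypothetical samples; the dates and vectors are made up for the sketch.
samples = [
    (date(2015, 4, 1), [0.9, 0.1], 1),
    (date(2016, 6, 1), [0.1, 0.9], 0),
    (date(2016, 11, 15), [0.8, 0.3], 1),
    (date(2017, 2, 3), [0.7, 0.2], 1),
    (date(2017, 5, 20), [0.3, 0.6], 1),
]
train, test = time_split(samples, cutoff=date(2017, 1, 1))


def toy_predict(x):
    """Stand-in for a classifier trained on `train` only."""
    return 1 if x[0] > 0.5 else 0


print(len(train), len(test), detection_rate(toy_predict, test))
```

Because no test sample predates the cutoff, the measured rate reflects how well a model built from older data handles malware it could not have seen during training.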
Obfuscation Experiments
Obfuscation tool: Obfuscator
• Changes the code execution flow
Obfuscation tool: Unest
• rewriting numeric values in equivalent forms
• obfuscating output strings
• pushing the target code segment onto the stack and jumping to it to confuse the target code
• obfuscating the static libraries
Introduction
Approach
Experiments
Conclusion
Conclusion
Machine learning methods based on the opcodes,
data types and system libraries.
Carried out some interesting experiments.
Capable of detecting some fresh malware
Has a resistance to some obfuscation techniques
That's all.
Thank you very much!


Editor's Notes

  1. Thank you, Session Chair. I am very honored to have this opportunity to attend this conference. The topic of my paper is “Effective Malware Detection based on Behavior and Data Features”. I am the speaker, Cheng Wen. This work was done with Zhiwu Xu, Shengchao Qin and Zhong Ming.
  2. The outline of my talk is as follows. In the first part, I will introduce the background of this research. The second part presents our approach, followed by the experiments. Finally, a brief conclusion is given. Well, let's move on to the first part.
  3. Malware is a generic term that encompasses viruses, worms, spyware and other intrusive code. It is spreading all over the world and increasing day by day, thus becoming a serious threat. According to a recent report from McAfee, more than 650 million malware samples were detected in the first quarter of 2017, of which more than 30 million were new. So the detection of malware is of major concern to both the anti-malware industry and researchers.
  4. To protect users from these threats, anti-malware software products from different companies, such as Comodo, McAfee and so on, provide the major defense against malware, and they employ the signature-based method. However, malware writers can easily evade this method through evasion techniques.
  5. To overcome the limitation of the signature-based method, heuristic-based methods have been proposed, which focus on identifying malicious behavior patterns through either static analysis or dynamic analysis. But with the increasing number of malware samples, this method is no longer considered effective.
  6. Recently, various machine learning approaches have been proposed for detecting malware. Although some approaches can achieve high accuracy (on stationary data sets), that is still not enough for malware detection. Most existing work focuses on behaviour features such as binary codes, opcodes and API calls, leaving the data information out of consideration. These approaches can also be easily evaded: malware evolves rapidly, so it becomes hard to generalize learning models to future, previously unseen behaviors. And most of the work did not consider resistance to obfuscation techniques.
  7. Next, let's move to the second part.
  8. In this paper, we propose an effective approach to detect malware based on machine learning. Different from most existing work, we take into account not only the behaviour information but also the data information. We also consider time-split samples and obfuscated samples. Generally, the behaviour information reflects which behaviours a piece of software intends to perform, while the data information indicates which data it intends to operate on or how the data are organized.
  9. This figure shows the framework of our approach, which consists of two components, namely the feature extractor and the malware classifier. The feature extractor extracts the feature information from the executables and represents them as vectors, while the malware classifier is first trained on an available dataset of executables and can then be used to detect new, unseen executables. In the following, we describe both components in more detail.
  10. The feature extractor consists of three steps: decompilation, information extraction, and feature selection and representation.
  11. An instruction or a piece of data in an executable file is represented as a series of binary codes, which are clearly not easy to read. So the first step is to transform the binary codes into a readable intermediate representation, such as assembly code, with a decompilation tool.
  12. Next, the extractor parses the asm files to extract the information, namely opcodes, data types and system libraries. Generally, the opcodes used in an executable represent its intended behaviours, while the data types indicate the structures of the data it may operate on. In addition, the imported system libraries reflect the interaction between the executable and the system. All this information describes the possible mission of an executable in some sense, and similar executables share similar information.
  13. We use the well-known TF-IDF scheme to measure the statistical dependence. Next, the extractor selects the top k terms by weight. Each executable can then be represented as a vector. An example of such a vector is shown here.
  14. The other component is the malware classifier.
  15. As mentioned before, we first train our malware classifier on an available dataset of executables with known categories using a supervised machine learning method, and then use it to detect new, unseen executables.
  16. Next come the experiments.
  17. Our dataset consists of malware and benign samples. The malware dataset consists of samples from the BIG 2015 Challenge and from theZoo aka Malware DB, while the benign software was collected from the 360 software company. We used various machine learning methods to train classifiers and performed several experiments to test our approach's ability.
  18. To evaluate the performance of our approach, we conducted 10-fold cross validation experiments. The learning methods we used in our experiments are listed in the table. As the ROC curves show, most of the classifiers produce good classification results.
  19. Meanwhile, we recorded the training times and the testing times in seconds for each cross validation experiment. The results are shown in this table. We also evaluated how the feature extractor performs. Both the decompilation time and the extraction time are acceptable.
  20. Next, we also performed experiments on each kind of feature to see its effectiveness. For that, we conducted the same experiments as above for each kind of feature. From the results we can see that all the features are effective for detecting malware, and that using all of the features together produces the best results. The opcode and library features have been used by a lot of work in practice, so we believe that the type information can benefit malware detection in practice as well.
  21. In this section, to test our approach's ability to detect genuinely new malware or new malware versions, we ran a time-split experiment. First, we downloaded malware samples that were collected from January 2017 to July 2017. That is to say, all these malware samples are newer than the ones in our data set. About 81% of the samples can be detected by our classifier, which indicates that our approach can detect some new malware samples or new versions of existing malware samples. However, the results also indicate that the classifier becomes less effective as time passes. This suggests that malware classifiers should be updated often with new data or new features in order to maintain classification accuracy.
  22. One reason malware detection is difficult is that malware writers can use obfuscation techniques. In this section, we performed some experiments to test our approach's ability to detect new malware samples that are obtained by obfuscating existing ones. We used two commercial tools, Obfuscator and Unest, to obfuscate some malware samples randomly selected from our data set. The results show that all the obfuscated malware samples can be detected by our classifier. That is to say, our classifier has resistance to some obfuscation techniques.
  23. Finally, I conclude the talk.
  24. In this work, we have proposed a malware detection approach using various machine learning methods based on the opcodes, data types and system libraries. To evaluate the proposed approach, we have carried out some interesting experiments. The experimental results have demonstrated that our classifier is capable of detecting some fresh malware and has resistance to some obfuscation techniques.
  25. We use static analysis. In malware detection, both static analysis and dynamic analysis have their own advantages and limitations. In real applications, we suggest using static analysis first. If the file cannot be well represented, then we can try dynamic analysis.