Open University, Data Mining Seminar 13802
Semester 2015b
Malware Detection via Data Mining
Prof Roy Gelbard
David Zivi 204785638
Contents
Terminology and Definitions
Introduction
Research Question
Study goal:
Study importance:
Mapping of Knowledge Elements
Bibliography Review
Signature-based detection:
Heuristic-based detection:
Behavioral-based detection:
Sandbox detection:
Data mining techniques:
Research methodology
Raw Data Acquisition
Extraction of Significant Data
Opcode Relevance in malwares
Average Calculation
Results Export
Weka
Prediction parameters
Noise method
Results
Recurrent WEKA process
First run
Second run
Third run
Fourth run
Fifth run
Sixth run
Seventh run
Results summary per round
Rules generated by WEKA
Noise on model
Result Discussion and Future Research
Future Research
Extension of the opcodes set
Sensitive system call
PE header analysis
Resources List
Terminology and Definitions
- Virus: A computer virus is a type of malware that propagates by inserting a copy
of itself into, and becoming part of, another program. It spreads from one
computer to another, leaving infections as it travels. Viruses can range in severity
from causing mildly annoying effects to damaging data or software and causing
denial-of-service (DoS) conditions. Almost all viruses are attached to an
executable file, which means the virus may exist on a system, but will not be
active or able to spread until a user runs or opens the malicious host file or
program. When the host code is executed, the viral code is executed as well. [1]
- Disassembler: a computer program that translates machine language into
assembly language—the inverse operation to that of an assembler. Disassembly,
the output of a disassembler, is often formatted for human-readability rather
than suitability for input to an assembler, making it principally a reverse-
engineering tool. [2]
- Opcode: In computing, an opcode (abbreviated from operation code) is the
portion of a machine language instruction that specifies the operation to be
performed. Besides the opcode itself, instructions usually specify the data they will
process, in the form of operands. In addition to opcodes used in instruction set
architectures of various CPUs, which are hardware devices, opcodes can also be
used in abstract computing machines as part of their byte code specifications. [3]
- x86 instruction set: x86 is a family of backward compatible instruction set
architectures based on the Intel 8086 CPU and its Intel 8088 variant. The 8086
was introduced in 1978 as a fully 16-bit extension of Intel's 8-bit based 8080
microprocessor, with memory segmentation as a solution for addressing more
memory than can be covered by a plain 16-bit address. The term "x86" came into
being because the names of several successors to the Intel's 8086 processor
ended in "86", including 80186, 80286, 80386 and 80486 processors. [4]
- WEKA: a workbench that contains a collection of visualization tools and
algorithms for data analysis and predictive modeling, together with graphical user
interfaces for easy access to this functionality. All of Weka's techniques are
predicated on the assumption that the data is available as a single flat file or relation,
where each data point is described by a fixed number of attributes. [17]
Introduction
In this study I present a technique I developed to determine, using data mining, whether an
application is malware. The underlying method is to teach the system to differentiate between
malware and benign software, using a dataset built from a list of instructions that potentially
characterize malware. The list of instructions was collected from previous research on this topic.
The technique has two steps: the first consists of disassembly and opcode frequency
calculation; the second applies the J48 learning algorithm provided by the WEKA library.
For the dataset, I extracted relevant data from 300 known malware samples [6] and 150
benign programs typically found on a home computer under “c:\program files”. This data
was fed to WEKA, which generated rules for determining whether a piece of software is
malware.
These rules were evaluated on the dataset I compiled (cross-validation) and predicted,
with an accuracy of 96%, whether a piece of software is malware.
To check the robustness of the rules, random noise with an intensity from 2% to 50% was
applied to the relevant data, without significant regression in the score mentioned above.
With noise of up to 50%, the prediction score decreased to 91%.
Research Question
Study goal:
Today, with the exponential growth of “freeware” software [7], users and
corporations can find a large variety of applications and utilities that can be installed for
free. Since those applications come from unknown sources, the question raised is: can a
user or corporation benefit from free applications without compromising their
entire system?
The goal of this study is to build rules that determine whether an executable or
library received from a third party can be trusted. This study does not purport to
replace existing anti-virus programs, but to propose a complementary mechanism that
makes up for their weaknesses.
Limitations of current methods:
The common technique used by anti-virus programs to determine whether an executable is
malware is scanning. A scanner searches all files in memory and on disk for code
snippets that uniquely identify a file as malware. Such mechanisms have two main
weaknesses:
- Attackers interested in propagating known malware can simply change the code
snippets that the anti-virus is looking for.
- New malware that is not yet classified is treated as benign software until it is
analyzed and classified. In the meantime the malware continues to infect
the system and spread to new systems until the anti-virus is updated.
Study importance:
According to a report sponsored by McAfee, global cyber activity costs up to
$500 billion each year, which is almost as much as the estimated cost of
drug trafficking [5]. In the third quarter of 2015, McAfee Labs detected more than 307
new threats every minute, or more than five every second, with mobile malware samples
growing by 16 percent during the quarter and overall malware surging by 76 percent
year over year [8].
Malware is becoming more and more sophisticated and provides high revenues to its
owners. Malware authors are becoming well organized and structured, with impressive skills
and highly qualified resources. Due to the exponential growth of malware and its agility in
camouflaging itself, keeping a system clean is becoming a tremendous task for users and
corporations. Since current anti-virus programs rely on known-malware databases
that are updated once a day at best, malware has up to an entire day to infect
a system before it is caught.
Mapping of Knowledge Elements
Characteristic/Process | In Human World | In Machine World
Data | Software editor; software behavior | All the data is saved in an automatic way; every malware has its own signature; malware characteristics are saved in a database
Information | Collect information about known malware; collect information about the software to check | Basic statistical calculation on raw data
Knowledge | Can get a global feeling about the software we want to check; a tendency can be deduced | Run an algorithm on the software in order to determine whether we are dealing with malware
Data transformation to Information & Knowledge | With the data we have, we can deduce which software we have to check | Installation validation based on the decision tree
Information & Knowledge transformation to Data | The knowledge can be transformed to data. For example, the software signature can be saved in a database | Exists in a learning system, where the conclusions and knowledge are automatically translated to data
Transformation of tacit knowledge to explicit knowledge | The malware analyst writes down in a formal way the conclusions reached about a malware pattern | Does not exist
Transformation of explicit knowledge to tacit knowledge | The malware analyst learns from explicit knowledge | Does not exist
Knowledge contribution in decision | Based on explicit & tacit knowledge, the analyst decides to accept or reject the software | According to the decision tree
Knowledge contribution to innovation | Update of anti-virus engines based on knowledge | Does not exist
Learning and knowledge sharing | Learning of new techniques used by malware | The system automatically updates its research criteria
Bibliography Review
The exponential growth of malware encourages security researchers to invent new techniques to
protect computers and networks. The various techniques used for malware detection [9] are
described below:
Signature-based detection:
This technique is the most common method used to identify viruses and other malware. The anti-
virus engine compares the contents of a file to its database of known malware signatures. This
technique requires daily updates of the malware database.
Heuristic-based detection:
This technique is generally used together with signature-based detection. It detects malware
based on characteristics typically used in known malware code.
Behavioral-based detection:
This technique is similar to heuristic-based detection and is also used in Intrusion Detection Systems.
The main difference is that, instead of characteristics hardcoded in the malware code itself, it is
based on the behavioral fingerprint of the malware at run-time. Clearly, this technique is able to
detect (known or unknown) malware only after it has started performing its malicious actions.
Sandbox detection:
This technique is a particular form of behavioral-based detection that, instead of detecting the
behavioral fingerprint at run time, executes the program in a virtual environment, logging whatever
actions the program performs. Depending on the actions logged, the anti-virus engine can determine
whether the program is malicious. If not, the program is executed in the real environment. Even
though this technique has been shown to be quite effective, it is heavy and slow, so it is rarely used in
end-user anti-virus solutions.
Data mining techniques:
Data mining techniques are among the latest approaches applied to malware detection. Data
mining and machine learning algorithms are used to classify a file (as either
malicious or benign) given a series of features that are extracted from the file itself. This
study focuses on the following techniques:
- Malware detection via analysis of number of strings, call and binary patterns [10]
- Malware detection via analysis of program executable header [11]
- Malware prediction via function call frequency, usage of non-standard
instructions and use of suspicious system calls [12]
- Malware detection via statistical analysis of opcode distributions [13]
Research methodology
The following picture describes the flow used in this study to generate rules that will be able to
catch malware:
Raw Data Acquisition
In order to learn from malware and benign software behavior, a large database of samples of
both is needed. Since malware has been analyzed and classified
by security researchers, it is quite easy to find malware databases on the internet [14]. In this
study, all the known and classified malware from the year 2014 is used. In order to prevent noise
in the dataset, derivatives of the same malware are not included in the sample. All 400 families
of malware found in 2014 were classified and used in this study. Our benign software dataset
consists of standard applications located under “c:\program files (x86)”, such as “Outlook”,
“Word”, “Excel”, and “Calculator”. These were taken from a non-infected computer.
Note: For security reasons, the malware database is protected by a password that can be
retrieved from [14]. All the research and access to malware for this study were done on
dedicated virtual machines in order to prevent unintentional infection.
Extraction of Significant Data
According to the aforementioned bibliography on malware detection via data mining, this study
focuses on the following opcodes: call, nop, int, rdtsc, sbb, shld, fdivp, imul, pushf, setb, fild and
xor. In this study those opcodes represent a criterion for malware detection. All the above
opcodes are extracted from executables and libraries using the IDA [15] disassembler.
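As a rough illustration of this extraction step, the sketch below counts the tracked opcodes in a plain-text disassembly listing. The listing format (one instruction per line, with the mnemonic as the first token after an address and a colon) and the name count_opcodes are assumptions made for the example; real IDA output is formatted differently and would need its own parser.

```python
from collections import Counter

# The twelve opcodes tracked in this study.
TRACKED_OPCODES = ("call", "nop", "int", "rdtsc", "sbb", "shld",
                   "fdivp", "imul", "pushf", "setb", "fild", "xor")


def count_opcodes(listing_lines):
    """Count tracked opcodes in a simplified disassembly listing.

    Assumed (hypothetical) line format: "<address>: <mnemonic> <operands>".
    Lines whose mnemonic is not tracked are ignored.
    """
    counts = Counter({op: 0 for op in TRACKED_OPCODES})
    for line in listing_lines:
        # Drop a trailing comment and the address prefix, if present.
        code = line.split(";", 1)[0]
        if ":" in code:
            code = code.split(":", 1)[1]
        tokens = code.split()
        if tokens and tokens[0].lower() in counts:
            counts[tokens[0].lower()] += 1
    return dict(counts)


sample = [
    "00401000: push ebp",
    "00401001: xor eax, eax",
    "00401003: call 00401100 ; helper",
    "00401008: imul eax, ecx",
    "0040100b: xor ebx, ebx",
]
opcode_counts = count_opcodes(sample)
print(opcode_counts["xor"], opcode_counts["call"], opcode_counts["imul"])
```

Opcodes that never occur stay in the result with a count of zero, so every file yields the same twelve attributes.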
Opcode Relevance in malwares
One of the main characteristics of malware is its capacity for “camouflage”. In order to survive
anti-virus programs and analysis by security researchers, malware must hide itself, since once it
is discovered it is automatically removed from the infected computer. Moreover, when
malware is discovered, its characteristics are shared with the entire community. In order to hide
itself, malware uses a technique called “packing” [16], which consists of
compressing/encrypting the original executable. When this compressed executable is
executed, the decompression/decryption code recreates the original code from the
compressed/encrypted code before executing it. Executable compression is also frequently used
to deter reverse engineering or to obfuscate the contents of the executable, for example to hide
the presence of malware from anti-virus scanners. Executable compression can be used to
prevent direct disassembly; it masks string literals and modifies signatures.
Although this does not eliminate the chance of reverse engineering, it can make the process
more costly. The following picture illustrates how the “packing” mechanism works:
Average Calculation
A Python script is used to count the number of occurrences of each of the relevant
opcodes listed above. A value is then calculated for every relevant opcode according to the
following formula (example for the “call” opcode):
Call percentage = (Number of call instructions * Size of call opcode) / Size of the whole text section
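The formula can be written as a small function. This is a minimal sketch: the function name, the fixed 5-byte size for “call”, and the sample numbers are assumptions for illustration only, since x86 instruction lengths vary with the encoding.

```python
def opcode_percentage(count, opcode_size, text_section_size):
    """Share of the text section occupied by one opcode.

    Implements: (number of occurrences * size of opcode) / size of text section.
    Sizes are in bytes; returns 0.0 for an empty text section.
    """
    if text_section_size <= 0:
        return 0.0
    return (count * opcode_size) / text_section_size


# Hypothetical numbers: 120 "call" instructions of 5 bytes each
# in a 4096-byte text section.
call_share = opcode_percentage(count=120, opcode_size=5, text_section_size=4096)
print(f"{call_share:.4f}")
```

Dividing by the size of the text section normalizes the value, so large and small executables can be compared on the same scale.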
Results Export
The same Python script now exports an Excel table of all the opcodes, averaged for every
disassembled file (benign software & malware).
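A possible shape for the export step is sketched below. It writes a CSV table rather than an Excel file (an assumption made here to keep the example free of third-party libraries; WEKA can also load CSV). The function name write_feature_table and the sample values are hypothetical.

```python
import csv
import io

OPCODES = ["call", "nop", "int", "rdtsc", "sbb", "shld",
           "fdivp", "imul", "pushf", "setb", "fild", "xor"]


def write_feature_table(rows, out):
    """Write one line per file: opcode shares followed by the class label.

    `rows` is an iterable of (shares_by_opcode, label) pairs, where label
    is "virus" or "clean" (the class names used in the confusion matrix).
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(OPCODES + ["class"])
    for shares, label in rows:
        writer.writerow([shares.get(op, 0.0) for op in OPCODES] + [label])


buffer = io.StringIO()
write_feature_table(
    [({"call": 0.12, "xor": 0.03}, "clean"),
     ({"imul": 0.004, "pushf": 0.001}, "virus")],
    buffer,
)
csv_text = buffer.getvalue()
print(csv_text.splitlines()[0])
```

Placing the nominal class label in the last column matches WEKA's default choice of the last attribute as the class.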
Weka
The generated Excel file is used for analysis with the WEKA data mining tool. Since the target
field is nominal, the J48 and Kstar algorithms are used. The test option used to validate the model
is cross validation with percentage split of 66%.
Prediction parameters
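TP (true positive): number of malware samples correctly predicted as malware
TN (true negative): number of benign samples correctly predicted as benign software
FP (false positive): missed malware prediction, i.e. malware was predicted as benign software
FN (false negative): missed benign software prediction, i.e. benign software was predicted as
malware
Note that FP and FN are used here in the reverse of the conventional sense when malware is the
positive class: conventionally, malware predicted as benign is a false negative, and benign
software predicted as malware is a false positive. The results tables follow the definitions given
above.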
Noise method
Random noise with an intensity ranging from 2% to 50% is applied to the generated Excel
table. The noise is applied only to the head of the tree, which in our case is the “imul” instruction.
Applying such noise tests the robustness of the model: in other words, whether ordinary
noise undermines the study's conclusion. Since noise can result from non-standard code,
for example code written directly in assembler by programmers, we
assume that typical noise can have an intensity of 30%. If the malware prediction score is still
greater than 90%, we can conclude that the generated model is robust enough.
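One way such noise could be generated is sketched below. The noise model is an assumption: each value in the column is multiplied by a random factor drawn uniformly from within plus or minus the intensity. The study does not spell out how its noise was produced, and add_noise and the sample values are hypothetical.

```python
import random


def add_noise(values, intensity, rng):
    """Perturb each value by a random factor within +/- intensity.

    `intensity` is a fraction (0.30 for 30%). This multiplicative,
    uniform noise model is an assumption, not the study's documented method.
    """
    return [v * (1.0 + rng.uniform(-intensity, intensity)) for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
imul_share = [0.010, 0.000, 0.025, 0.004]
noisy = add_noise(imul_share, 0.30, rng)
print(noisy)
```

Under this model a value of zero stays zero, so files that contain no “imul” instruction are unaffected; an additive noise model would behave differently.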
Results
In this research, a deterministic method for malware prediction was formulated and tested. This
method proved to be highly successful in differentiating between malware and benign software. A
key factor in identifying malware efficiently was having an appropriate set of instructions. To find
the ideal instruction set, I performed the investigation in a recurrent manner, as described below:
Recurrent WEKA process
Unlike previous research on malware detection via data mining, such as Sanjam
Singla et al. [12], in this study the analysis did not stop when a high level of predictability
was achieved. In every WEKA iteration the “head of the tree” was removed and a new analysis was
performed to determine whether the “head of the tree” is a key opcode in our prediction. Getting
a good result after removing the “head of the tree” means that the “head of the tree” is not the
key opcode for prediction. The following describes the recurrent WEKA process used:
First run
In the first WEKA run, all the potential opcodes are taken into account, i.e. call, nop, int, rdtsc,
sbb, shld, fdivp, imul, pushf, setb, fild and xor. The prediction score was 98%, and the head of
the tree was the “call” opcode.
Second run
The call opcode was removed, so WEKA was run with: nop, int, rdtsc, sbb, shld, fdivp, imul,
pushf, setb, fild and xor. The prediction score was 97%, and the head of the tree was the “xor”
opcode.
Third run
The xor opcode was removed, so WEKA was run with: nop, int, rdtsc, sbb, shld, fdivp, imul,
pushf, setb and fild. The prediction score was 97%, and the head of the tree was the “int” opcode.
Fourth run
The int opcode was removed, so WEKA was run with: nop, rdtsc, sbb, shld, fdivp, imul, pushf,
setb and fild. The prediction score was 96%, and the head of the tree was the “rdtsc” opcode.
Fifth run
The rdtsc opcode was removed, so WEKA was run with: nop, sbb, shld, fdivp, imul, pushf, setb
and fild. The prediction score was 95%, and the head of the tree was the “sbb” opcode.
Sixth run
The sbb opcode was removed, so WEKA was run with: nop, shld, fdivp, imul, pushf, setb and fild.
The prediction score was 91%, and the head of the tree was the “imul” opcode.
Seventh run
The imul opcode was removed, so WEKA was run with: nop, shld, fdivp, pushf, setb and fild. The
prediction score decreased to 78%. At this point the recurrent processing was stopped due to
the significant decrease in prediction score. The last acceptable score was 91%, with opcodes: nop,
shld, fdivp, imul, pushf, setb and fild.
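The control flow of the recurrent process above can be sketched in a few lines. This is a simplified stand-in, not the study's WEKA/J48 setup: the “head of the tree” is approximated by the single feature whose best threshold split is most accurate, and that split's accuracy is used as the score. All names and the toy data are hypothetical.

```python
def best_stump(rows, labels, features):
    """Return (feature, accuracy) of the best single-threshold split.

    Stand-in for "head of the tree": the feature whose best threshold
    separates the two classes most accurately.
    """
    best = (None, 0.0)
    for f in features:
        for threshold in sorted({r[f] for r in rows}):
            # Predict "virus" above the threshold; max() also covers the reverse rule.
            hits = sum((r[f] > threshold) == (y == "virus")
                       for r, y in zip(rows, labels))
            acc = max(hits, len(rows) - hits) / len(rows)
            if acc > best[1]:
                best = (f, acc)
    return best


def recurrent_removal(rows, labels, features, min_score):
    """Repeatedly drop the top feature until the score falls below min_score."""
    remaining, history = list(features), []
    while remaining:
        head, score = best_stump(rows, labels, remaining)
        history.append((head, score))
        if score < min_score:
            break  # significant drop: stop, as in the seventh run
        remaining.remove(head)
    return history


# Toy data (made up): opcode shares for two clean and two infected files.
rows = [
    {"call": 0.30, "xor": 0.10, "imul": 0.010},
    {"call": 0.28, "xor": 0.12, "imul": 0.020},
    {"call": 0.10, "xor": 0.05, "imul": 0.015},
    {"call": 0.12, "xor": 0.11, "imul": 0.012},
]
labels = ["clean", "clean", "virus", "virus"]
history = recurrent_removal(rows, labels, ["call", "xor", "imul"], min_score=0.9)
print(history)
```

Each entry in the history pairs the removed head with the score obtained in that round, mirroring the per-run summary above.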
Results summary per round
Dataset | Algorithm | TP | FP | TN | FN | RMSE | Fscore
All | J48 | 307 | 4 | 102 | 4 | 0.13 | 0.98
All | Kstar | 310 | 1 | 99 | 7 | 0.13 | 0.98
call removed | J48 | 306 | 5 | 100 | 6 | 0.15 | 0.97
call removed | Kstar | 311 | 0 | 95 | 11 | 0.14 | 0.97
call & xor removed | J48 | 308 | 3 | 99 | 7 | 0.15 | 0.97
call & xor removed | Kstar | 311 | 0 | 89 | 17 | 0.19 | 0.95
call, xor & int removed | J48 | 307 | 4 | 101 | 5 | 0.14 | 0.97
call, xor & int removed | Kstar | 309 | 2 | 92 | 14 | 0.18 | 0.96
call, xor, int & rdtsc removed | J48 | 304 | 7 | 97 | 9 | 0.19 | 0.96
call, xor, int & rdtsc removed | Kstar | 309 | 2 | 92 | 14 | 0.18 | 0.96
call, xor, int, rdtsc & sbb removed | J48 | 306 | 5 | 98 | 8 | 0.17 | 0.96
call, xor, int, rdtsc & sbb removed | Kstar | 311 | 0 | 71 | 35 | 0.24 | 0.91
call, xor, int, rdtsc, sbb & imul removed | J48 | 308 | 3 | 33 | 73 | 0.37 | 0.78
call, xor, int, rdtsc, sbb & imul removed | Kstar | 310 | 1 | 45 | 61 | 0.31 | 0.82
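The Fscore column appears consistent with a class-size-weighted average of the per-class F-measures, computed from the four counts as this study defines them. The sketch below is an inference from the numbers in the table, not a documented description of how the scores were produced; the function names are hypothetical.

```python
def f1(correct, missed, wrongly_assigned):
    """F1 for one class from its hits, misses and false alarms."""
    denom = 2 * correct + missed + wrongly_assigned
    return 2 * correct / denom if denom else 0.0


def weighted_fscore(tp, fp, tn, fn):
    """Class-size-weighted F-measure using this study's column meanings:

    tp: malware predicted as malware   fp: malware predicted as benign
    tn: benign predicted as benign     fn: benign predicted as malware
    """
    n_malware, n_benign = tp + fp, tn + fn
    f_malware = f1(tp, missed=fp, wrongly_assigned=fn)
    f_benign = f1(tn, missed=fn, wrongly_assigned=fp)
    return (n_malware * f_malware + n_benign * f_benign) / (n_malware + n_benign)


print(round(weighted_fscore(307, 4, 102, 4), 2))   # first row of the table
print(round(weighted_fscore(308, 3, 33, 73), 2))   # last J48 row
```

For the J48 row with call, xor, int, rdtsc and sbb removed (306, 5, 98, 8), the same function gives about 0.969, which agrees with the zero-noise value reported for that data set in the noise table.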
Rules generated by WEKA
The following graph represents the rules generated by WEKA for malware prediction with a score of
96%.
The following table represents the Confusion Matrix for the J48 algorithm:
 | clean | virus
clean | 102 | 4
virus | 4 | 307
Noise on model
The following table summarizes the noise intensity at the head of the tree, with the corresponding
Fscore (data set: call, xor, int, rdtsc & sbb removed):
Noise intensity in % on "imul" variable | Fscore
0 | 0.969
2 | 0.966
5 | 0.964
8 | 0.966
10 | 0.966
13 | 0.964
16 | 0.966
20 | 0.954
25 | 0.961
30 | 0.961
35 | 0.946
40 | 0.954
45 | 0.939
50 | 0.916
[Figure: graphical representation of the above results, Fscore versus noise intensity on the "imul" variable]
Result Discussion and Future Research
The goal of this study was to demonstrate that malware can be caught via analysis of executable
opcodes. I have shown that malware detection via data mining is a very promising method, since
the prediction score achieved is 96% for 2014 malware. This study found a different set of
instructions that point to code being malware, compared to previous research that was done on
this topic. The final instructions found in our generated tree are: “imul”, “pushf” and “fild”. As
explained before, those instructions are commonly used by “packer” and “protector” software in
order to unpack/decrypt the malware code.
The novel approaches of the study, as opposed to previous research done in this field, are:
- Analysis of the malware surface: Since most malware uses packers and protectors
to hide itself from security researchers, only a small portion of
the malware code can be analyzed after disassembly. This study showed that analysis of a small
portion of code (the loader) is enough to detect whether the executable is malware.
- Recurrent run of WEKA: As opposed to previous research, the WEKA data mining tool
was run many times in order to find the instructions that really influence the
prediction. Originally the “call” instruction was singled out as identifying software as
malware or not, as in the research by Sanjam Singla et al. [12], but after the recurrent
run of WEKA the “imul” instruction was found to be the identifier of malware.
Future Research
Extension of the opcodes set
Since malware changes very frequently, in future research the instruction set used for prediction
must be enlarged and updated according to new malware. Furthermore, malware commonly
uses rare instructions that are never generated by a compiler, so including such rare
instructions in our instruction dataset will help to recognize it.
Sensitive system call
In this study, only malware instructions were checked. As claimed in the above bibliography,
malware commonly uses sensitive calls such as “VirtualAllocEx” and “IsDebuggerPresent”. Use of such
calls can point to malware as well.
PE header analysis
Another approach to detecting malware is to check the program header format of the
executable. Since the “entry point” of the program is modified during the packing/protection
mechanism, we can find PE header malformations that point to malware as well.
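One simple check along these lines, offered as an illustrative assumption rather than a method from this study, is to find which section the entry point falls in: an entry point outside the usual code section can hint at a modified loader. The sketch reads only the header fields it needs with Python's struct module; entry_point_section and the toy image builder are hypothetical names, and the toy image contains headers only.

```python
import struct


def entry_point_section(pe_bytes):
    """Return the name of the section containing the entry point, or None.

    Reads only the fields needed: e_lfanew, the COFF header, the
    AddressOfEntryPoint field and the section table.
    """
    if pe_bytes[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    (pe_offset,) = struct.unpack_from("<I", pe_bytes, 0x3C)
    if pe_bytes[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    coff = pe_offset + 4
    (num_sections,) = struct.unpack_from("<H", pe_bytes, coff + 2)
    (opt_size,) = struct.unpack_from("<H", pe_bytes, coff + 16)
    optional = coff + 20
    (entry_rva,) = struct.unpack_from("<I", pe_bytes, optional + 16)
    table = optional + opt_size
    for i in range(num_sections):
        base = table + 40 * i  # each section header is 40 bytes
        name = pe_bytes[base:base + 8].rstrip(b"\0").decode("ascii", "replace")
        virt_size, virt_addr = struct.unpack_from("<II", pe_bytes, base + 8)
        if virt_addr <= entry_rva < virt_addr + virt_size:
            return name
    return None


def toy_pe(entry_rva):
    """Build a header-only toy image with .text and .data sections."""
    def section(name, vaddr, vsize):
        return name.ljust(8, b"\0") + struct.pack("<II", vsize, vaddr) + b"\0" * 24

    dos = b"MZ" + b"\0" * 58 + struct.pack("<I", 64)
    coff = struct.pack("<HHIIIHH", 0x14C, 2, 0, 0, 0, 24, 0)
    optional = struct.pack("<HBBIIII", 0x10B, 0, 0, 0, 0, 0, entry_rva) + b"\0" * 4
    return (dos + b"PE\0\0" + coff + optional
            + section(b".text", 0x1000, 0x500)
            + section(b".data", 0x2000, 0x300))


print(entry_point_section(toy_pe(0x1010)))
print(entry_point_section(toy_pe(0x2010)))
```

A result of None means the entry point lies outside every declared section, which is the kind of header malformation the paragraph above describes.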
Resources List
[1] http://www.cisco.com/web/about/security/intelligence/virus-worm-diffs.html
[2] https://en.wikipedia.org/wiki/Disassembler
[3] https://en.wikipedia.org/wiki/Opcode
[4] http://www.felixcloutier.com/x86/
[5] http://www.foxbusiness.com/technology/2013/07/22/report-cyber-crime-costs-global-economy-up-to-1-trillion-year/
[6] http://www.nothink.org/honeypots/malware-archives/
[7] https://en.wikipedia.org/wiki/Freeware
[8] http://www.mcafee.com/us/about/news/2014/q4/20141209-01.aspx
[9] https://en.wikipedia.org/wiki/Antivirus_software
[10] M. Schultz, M. Eskin, E. Zadok, Data Mining Methods for Detection of New Malicious Executables, 2001
[11] Usukhbayar Baldangombo et al., A Static Malware Detection System Using Data Mining
[12] Sanjam Singla et al., A Novel Approach to Malware Detection using Static Classification, 2015
[13] Daniel Bilar, Opcodes as predictor for malware, International Journal of Electronic Security, 2007
[14] http://www.nothink.org/honeypots/malware-archives/
[15] https://www.hex-rays.com/products/ida/index.shtml
[16] https://en.wikipedia.org/wiki/Executable_compression
[17] https://en.wikipedia.org/wiki/Weka_%28machine_learning%29

Intrusion Detection System Based on K-Star Classifier and Feature Set ReductionIntrusion Detection System Based on K-Star Classifier and Feature Set Reduction
Intrusion Detection System Based on K-Star Classifier and Feature Set Reduction
 
Retrieving Secure Data from Cloud Using OTP
Retrieving Secure Data from Cloud Using OTPRetrieving Secure Data from Cloud Using OTP
Retrieving Secure Data from Cloud Using OTP
 
IRJET- Netreconner: An Innovative Method to Intrusion Detection using Regular...
IRJET- Netreconner: An Innovative Method to Intrusion Detection using Regular...IRJET- Netreconner: An Innovative Method to Intrusion Detection using Regular...
IRJET- Netreconner: An Innovative Method to Intrusion Detection using Regular...
 
Privacy Preserving Data Leak Detection for Sensitive Data
Privacy Preserving Data Leak Detection for Sensitive DataPrivacy Preserving Data Leak Detection for Sensitive Data
Privacy Preserving Data Leak Detection for Sensitive Data
 
Data integrity proof techniques in cloud storage
Data integrity proof techniques in cloud storageData integrity proof techniques in cloud storage
Data integrity proof techniques in cloud storage
 
IRJET - Securing Computers from Remote Access Trojans using Deep Learning...
IRJET -  	  Securing Computers from Remote Access Trojans using Deep Learning...IRJET -  	  Securing Computers from Remote Access Trojans using Deep Learning...
IRJET - Securing Computers from Remote Access Trojans using Deep Learning...
 
Using Learning Vector Quantization in IDS Alert Management System
Using Learning Vector Quantization in IDS Alert Management SystemUsing Learning Vector Quantization in IDS Alert Management System
Using Learning Vector Quantization in IDS Alert Management System
 
Data Security In Relational Database Management System
Data Security In Relational Database Management SystemData Security In Relational Database Management System
Data Security In Relational Database Management System
 
Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...
Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...
Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...
 
Data Hiding In Medical Images by Preserving Integrity of ROI Using Semi-Rever...
Data Hiding In Medical Images by Preserving Integrity of ROI Using Semi-Rever...Data Hiding In Medical Images by Preserving Integrity of ROI Using Semi-Rever...
Data Hiding In Medical Images by Preserving Integrity of ROI Using Semi-Rever...
 
Efficient Similarity Search over Encrypted Data
Efficient Similarity Search over Encrypted DataEfficient Similarity Search over Encrypted Data
Efficient Similarity Search over Encrypted Data
 
Yolinda chiramba Survey Paper
Yolinda chiramba Survey PaperYolinda chiramba Survey Paper
Yolinda chiramba Survey Paper
 
Design and implementation of secured scan based attacks on ic’s by using on c...
Design and implementation of secured scan based attacks on ic’s by using on c...Design and implementation of secured scan based attacks on ic’s by using on c...
Design and implementation of secured scan based attacks on ic’s by using on c...
 
Safe machinelearning
Safe machinelearningSafe machinelearning
Safe machinelearning
 
Efficiently Detecting and Analyzing Spam Reviews Using Live Data Feed
Efficiently Detecting and Analyzing Spam Reviews Using Live Data FeedEfficiently Detecting and Analyzing Spam Reviews Using Live Data Feed
Efficiently Detecting and Analyzing Spam Reviews Using Live Data Feed
 

Viewers also liked

Data processing in Industrial Systems course notes after week 5
Data processing in Industrial Systems course notes after week 5Data processing in Industrial Systems course notes after week 5
Data processing in Industrial Systems course notes after week 5Ufuk Cebeci
 
Predictive Security in the 3rd Platform Era
Predictive Security in the 3rd Platform EraPredictive Security in the 3rd Platform Era
Predictive Security in the 3rd Platform EraIDC Italy
 
Artificial intelligence in information security
Artificial intelligence in information securityArtificial intelligence in information security
Artificial intelligence in information securitypradnya patil
 
Artificial intelligence bsc - iso 27001 information security
Artificial intelligence   bsc - iso 27001 information securityArtificial intelligence   bsc - iso 27001 information security
Artificial intelligence bsc - iso 27001 information securityUfuk Cebeci
 
November 2013 HUG: Cyber Security with Hadoop
November 2013 HUG: Cyber Security with HadoopNovember 2013 HUG: Cyber Security with Hadoop
November 2013 HUG: Cyber Security with HadoopYahoo Developer Network
 
Using Machine Learning in Networks Intrusion Detection Systems
Using Machine Learning in Networks Intrusion Detection SystemsUsing Machine Learning in Networks Intrusion Detection Systems
Using Machine Learning in Networks Intrusion Detection SystemsOmar Shaya
 
Jisheng Wang at AI Frontiers: Deep Learning in Security
Jisheng Wang at AI Frontiers: Deep Learning in SecurityJisheng Wang at AI Frontiers: Deep Learning in Security
Jisheng Wang at AI Frontiers: Deep Learning in SecurityAI Frontiers
 
Machine Learning for Threat Detection
Machine Learning for Threat DetectionMachine Learning for Threat Detection
Machine Learning for Threat DetectionNapier University
 
AWS re:Invent 2016: Predictive Security: Using Big Data to Fortify Your Defen...
AWS re:Invent 2016: Predictive Security: Using Big Data to Fortify Your Defen...AWS re:Invent 2016: Predictive Security: Using Big Data to Fortify Your Defen...
AWS re:Invent 2016: Predictive Security: Using Big Data to Fortify Your Defen...Amazon Web Services
 
To use the concept of Data Mining and machine learning concept for Cyber secu...
To use the concept of Data Mining and machine learning concept for Cyber secu...To use the concept of Data Mining and machine learning concept for Cyber secu...
To use the concept of Data Mining and machine learning concept for Cyber secu...Nishant Mehta
 

Viewers also liked (12)

Malicious url detection using machine learning
Malicious url detection using machine learningMalicious url detection using machine learning
Malicious url detection using machine learning
 
Data processing in Industrial Systems course notes after week 5
Data processing in Industrial Systems course notes after week 5Data processing in Industrial Systems course notes after week 5
Data processing in Industrial Systems course notes after week 5
 
Malicious Client Detection using Machine learning
Malicious Client Detection using Machine learningMalicious Client Detection using Machine learning
Malicious Client Detection using Machine learning
 
Predictive Security in the 3rd Platform Era
Predictive Security in the 3rd Platform EraPredictive Security in the 3rd Platform Era
Predictive Security in the 3rd Platform Era
 
Artificial intelligence in information security
Artificial intelligence in information securityArtificial intelligence in information security
Artificial intelligence in information security
 
Artificial intelligence bsc - iso 27001 information security
Artificial intelligence   bsc - iso 27001 information securityArtificial intelligence   bsc - iso 27001 information security
Artificial intelligence bsc - iso 27001 information security
 
November 2013 HUG: Cyber Security with Hadoop
November 2013 HUG: Cyber Security with HadoopNovember 2013 HUG: Cyber Security with Hadoop
November 2013 HUG: Cyber Security with Hadoop
 
Using Machine Learning in Networks Intrusion Detection Systems
Using Machine Learning in Networks Intrusion Detection SystemsUsing Machine Learning in Networks Intrusion Detection Systems
Using Machine Learning in Networks Intrusion Detection Systems
 
Jisheng Wang at AI Frontiers: Deep Learning in Security
Jisheng Wang at AI Frontiers: Deep Learning in SecurityJisheng Wang at AI Frontiers: Deep Learning in Security
Jisheng Wang at AI Frontiers: Deep Learning in Security
 
Machine Learning for Threat Detection
Machine Learning for Threat DetectionMachine Learning for Threat Detection
Machine Learning for Threat Detection
 
AWS re:Invent 2016: Predictive Security: Using Big Data to Fortify Your Defen...
AWS re:Invent 2016: Predictive Security: Using Big Data to Fortify Your Defen...AWS re:Invent 2016: Predictive Security: Using Big Data to Fortify Your Defen...
AWS re:Invent 2016: Predictive Security: Using Big Data to Fortify Your Defen...
 
To use the concept of Data Mining and machine learning concept for Cyber secu...
To use the concept of Data Mining and machine learning concept for Cyber secu...To use the concept of Data Mining and machine learning concept for Cyber secu...
To use the concept of Data Mining and machine learning concept for Cyber secu...
 

Similar to malware_detection_data_mining

Parallel and Distributed Algorithms for Large Text Datasets Analysis
Parallel and Distributed Algorithms for Large Text Datasets AnalysisParallel and Distributed Algorithms for Large Text Datasets Analysis
Parallel and Distributed Algorithms for Large Text Datasets AnalysisIllia Ovchynnikov
 
Show and tell program 04 2014-09-04
Show and tell program 04 2014-09-04Show and tell program 04 2014-09-04
Show and tell program 04 2014-09-04nihshowandtell
 
BrownResearch_CV
BrownResearch_CVBrownResearch_CV
BrownResearch_CVAbby Brown
 
Akash final-year-project report
Akash final-year-project reportAkash final-year-project report
Akash final-year-project reportAkash Rajguru
 
Data Gaurd Final Thesis for University in Progress (2).docx
Data Gaurd Final Thesis for University in Progress (2).docxData Gaurd Final Thesis for University in Progress (2).docx
Data Gaurd Final Thesis for University in Progress (2).docxMohdKashif82
 
Tracing-for-fun-and-profit.pptx
Tracing-for-fun-and-profit.pptxTracing-for-fun-and-profit.pptx
Tracing-for-fun-and-profit.pptxHai Nguyen Duy
 
Venice boats classification
Venice boats classificationVenice boats classification
Venice boats classificationRoberto Falconi
 
WSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product OverviewWSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product OverviewWSO2
 
employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docxrohithprabhas1
 
Automated Validation of Internet Security Protocols and Applications (AVISPA)
Automated Validation of Internet Security Protocols and Applications (AVISPA) Automated Validation of Internet Security Protocols and Applications (AVISPA)
Automated Validation of Internet Security Protocols and Applications (AVISPA) Krassen Deltchev
 
Record matching over query results
Record matching over query resultsRecord matching over query results
Record matching over query resultsambitlick
 
MICRE: Microservices In MediCal Research Environments
MICRE: Microservices In MediCal Research EnvironmentsMICRE: Microservices In MediCal Research Environments
MICRE: Microservices In MediCal Research EnvironmentsMartin Chapman
 
Expert System Full Details
Expert System Full DetailsExpert System Full Details
Expert System Full Detailsssbd6985
 

Similar to malware_detection_data_mining (20)

Parallel and Distributed Algorithms for Large Text Datasets Analysis
Parallel and Distributed Algorithms for Large Text Datasets AnalysisParallel and Distributed Algorithms for Large Text Datasets Analysis
Parallel and Distributed Algorithms for Large Text Datasets Analysis
 
Data mining weka
Data mining wekaData mining weka
Data mining weka
 
Show and tell program 04 2014-09-04
Show and tell program 04 2014-09-04Show and tell program 04 2014-09-04
Show and tell program 04 2014-09-04
 
BrownResearch_CV
BrownResearch_CVBrownResearch_CV
BrownResearch_CV
 
Malware analysis
Malware analysisMalware analysis
Malware analysis
 
Akash final-year-project report
Akash final-year-project reportAkash final-year-project report
Akash final-year-project report
 
Proposal with sdlc
Proposal with sdlcProposal with sdlc
Proposal with sdlc
 
Data Gaurd Final Thesis for University in Progress (2).docx
Data Gaurd Final Thesis for University in Progress (2).docxData Gaurd Final Thesis for University in Progress (2).docx
Data Gaurd Final Thesis for University in Progress (2).docx
 
Tracing-for-fun-and-profit.pptx
Tracing-for-fun-and-profit.pptxTracing-for-fun-and-profit.pptx
Tracing-for-fun-and-profit.pptx
 
Venice boats classification
Venice boats classificationVenice boats classification
Venice boats classification
 
WSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product OverviewWSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product Overview
 
Introduction
IntroductionIntroduction
Introduction
 
employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docx
 
Automated Validation of Internet Security Protocols and Applications (AVISPA)
Automated Validation of Internet Security Protocols and Applications (AVISPA) Automated Validation of Internet Security Protocols and Applications (AVISPA)
Automated Validation of Internet Security Protocols and Applications (AVISPA)
 
Dimensions iom training
Dimensions iom trainingDimensions iom training
Dimensions iom training
 
Record matching over query results
Record matching over query resultsRecord matching over query results
Record matching over query results
 
A035401010
A035401010A035401010
A035401010
 
MICRE: Microservices In MediCal Research Environments
MICRE: Microservices In MediCal Research EnvironmentsMICRE: Microservices In MediCal Research Environments
MICRE: Microservices In MediCal Research Environments
 
Practical power systems protection
Practical power systems protectionPractical power systems protection
Practical power systems protection
 
Expert System Full Details
Expert System Full DetailsExpert System Full Details
Expert System Full Details
 

malware_detection_data_mining

  • 1. Open University, Data Mining Seminar 13802 Semester 2015b Malware Detection via Data Mining Prof Roy Gelbard David Zivi 204785638
  • 2. Contents
       Terminology and Definitions
       Introduction
       Research Question
       Study goal:
       Study importance:
       Mapping of Knowledge Elements
       Bibliography Review
       Signature-based detection:
       Heuristic-based detection:
       Behavioral-based detection:
       Sandbox detection:
       Data mining techniques:
       Research methodology
       Raw Data Acquisition
       Extraction of Significant Data
       Opcode Relevance in malwares
       Average Calculation
       Results Export
       Weka
       Prediction parameters
       Noise method
       Results
       Recurrent WEKA process
       First run
       Second run
       Third run
       Fourth run
       Fifth run
       Sixth run
       Seventh run
       Results summary per round
       Rules generated by WEKA
  • 3. Noise on model
       Result Discussion and Future Research
       Future Research
       Extension of the opcodes set
       Sensitive system call
       PE header analysis
       Resources List
  • 4. Terminology and Definitions
       - Virus: A computer virus is a type of malware that propagates by inserting a copy of itself into, and becoming part of, another program. It spreads from one computer to another, leaving infections as it travels. Viruses can range in severity from causing mildly annoying effects to damaging data or software and causing denial-of-service (DoS) conditions. Almost all viruses are attached to an executable file, which means the virus may exist on a system, but will not be active or able to spread until a user runs or opens the malicious host file or program. When the host code is executed, the viral code is executed as well. [1]
       - Disassembler: A computer program that translates machine language into assembly language, the inverse operation to that of an assembler. Disassembly, the output of a disassembler, is often formatted for human readability rather than suitability for input to an assembler, making it principally a reverse-engineering tool. [2]
       - Opcode: In computing, an opcode (abbreviated from operation code) is the portion of a machine language instruction that specifies the operation to be performed. Besides the opcode itself, instructions usually specify the data they will process, in the form of operands. In addition to opcodes used in instruction set architectures of various CPUs, which are hardware devices, opcodes can also be used in abstract computing machines as part of their byte code specifications. [3]
       - x86 instruction set: x86 is a family of backward compatible instruction set architectures based on the Intel 8086 CPU and its Intel 8088 variant. The 8086 was introduced in 1978 as a fully 16-bit extension of Intel's 8-bit based 8080 microprocessor, with memory segmentation as a solution for addressing more memory than can be covered by a plain 16-bit address. The term "x86" came into being because the names of several successors to Intel's 8086 processor ended in "86", including the 80186, 80286, 80386 and 80486 processors. [4]
       - WEKA: A workbench that contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to this functionality. All of Weka's techniques are predicated on the assumption that the data is available as a single flat file or relation, where each data point is described by a fixed number of attributes. [17]
  • 5. Introduction
       In this study I present a technique I developed to determine, using data mining, whether an application is malware or not. The underlying method is to teach the system how to differentiate between malware and benign software, using a dataset built from a list of instructions that potentially characterize malware. The list of instructions was collected from previous research on this topic.
       The technique has two steps: the first step consists of disassembly and opcode frequency calculation; the second step applies the J48 learning algorithm provided by the WEKA library. For the dataset, I extracted relevant data from 300 known malware samples [6] and 150 benign programs typically found on a home computer under “c:\program files”. This data was fed to WEKA, which generated rules to be used to determine whether a piece of software is malware or not. These rules were run on the dataset I compiled (cross-validation), and were able to predict, with an accuracy of 96%, whether a piece of software is malware or not.
       In order to check the robustness of the rules, noise with an intensity ranging from 2% to 50% was randomly applied to the relevant data, without a significant regression in the score mentioned above. With noise of up to 50%, the prediction score decreased to 91%.
  • 6. Research Question
       Study goal: Today, with the exponential growth of “freeware” software [7], users and corporations can find a large variety of applications and utilities that can be installed for free. Since those applications come from unknown sources, the question raised is: can a user or corporation benefit from free applications without compromising their entire system? The goal of this study is to build rules that determine whether an executable or library received from a third party can be trusted or not. This study does not purport to replace the known anti-viruses, but to propose a complementary mechanism that makes up for the weaknesses of known anti-virus programs.
       Limitations of current methods: The common technique used by anti-viruses to determine whether an executable is malware is scanning. A scanner searches all files in memory and on disk for code snippets that uniquely identify a file as malware. Such mechanisms have two main weaknesses:
       - Attackers interested in propagating a known malware can simply change the code snippets that the anti-virus is looking for.
       - New malware not yet classified will be considered benign software until the malware is analyzed and classified. In such a case the malware will continue to infect the system and spread to new systems until the anti-virus is updated.
       Study importance: According to a newly released report sponsored by McAfee, global cyber activity is costing up to $500 billion each year, which is almost as much as the estimated cost of drug trafficking [5]. In the third quarter of 2015, McAfee Labs detected more than 307 new threats every minute, or more than five every second, with mobile malware samples growing by 16 percent during the quarter, and overall malware surging by 76 percent year over year [8]. Malware is becoming more and more sophisticated and provides high revenues to its owners. Malware authors are becoming well organized and structured, with impressive skills and highly qualified resources. Due to the exponential growth of malware and its ability to camouflage itself, keeping a system sterile has become a tremendous task for users and corporations. Since current anti-viruses rely on databases of known malware that are updated once a day in the best case, malware has an entire day to infect the system before it is caught.
  • 7. Mapping of Knowledge Elements
       Characteristic/Process | In Human World | In Machine World
       Data | Software editor; software behavior | All the data is saved in an automatic way; every malware has its own signature; malware characteristics are saved in a database
       Information | Collect information about known malware; collect information about the software to check | Basic statistical calculation on raw data
       Knowledge | Can get a global feeling about the software we want to check; a tendency can be deduced | Run an algorithm on the software in order to determine whether we are dealing with malware or not
       Data transformation to Information & Knowledge | With the data we have, we can deduce which software we have to check | Installation validation based on decision tree
       Information & Knowledge transformation to Data | The knowledge can be transformed to data. For example, the software signature can be saved in a database | Exists in a learning system, where the conclusions and knowledge are automatically translated to data
       Transformation of tacit knowledge to explicit knowledge | Malware analyst formally writes down the conclusions reached about malware patterns | Does not exist
       Transformation of explicit knowledge to tacit knowledge | Malware analyst learns from explicit knowledge | Does not exist
       Knowledge contribution in decision | Based on explicit & tacit knowledge, the analyst decides to accept or reject software | According to decision tree
       Knowledge contribution to innovation | Update of anti-virus engines based on knowledge | Does not exist
       Learning and knowledge sharing | Learning of new techniques used by malware | The system automatically updates its search criteria
  • 8. Bibliography Review
       The exponential growth of malware encourages security researchers to invent new techniques to protect computers and networks. The various techniques used for malware detection [9] are described below:
       Signature-based detection: This technique is the most common method used to identify viruses and other malware. The anti-virus engine compares the contents of a file to its database of known malware signatures. This technique requires a daily update of the malware database.
       Heuristic-based detection: This technique is generally used together with signature-based detection. It detects malware based on characteristics typically found in known malware code.
       Behavioral-based detection: This technique is similar to heuristic-based detection and is also used in Intrusion Detection Systems. The main difference is that, instead of characteristics hardcoded in the malware code itself, it is based on the behavioral fingerprint of the malware at run time. Clearly, this technique is able to detect (known or unknown) malware only after it has started performing its malicious actions.
       Sandbox detection: This technique is a particular form of behavioral-based detection that, instead of detecting the behavioral fingerprint at run time, executes the program in a virtual environment, logging whatever actions the program performs. Depending on the actions logged, the anti-virus engine can determine whether the program is malicious or not. If not, the program is executed in the real environment. Even though this technique has been shown to be quite effective, it is heavy and slow, so it is rarely used in end-user anti-virus solutions.
       Data mining techniques: Data mining techniques are among the latest approaches applied in malware detection. Data mining and machine learning algorithms are used to try to classify the behavior of a file (as either malicious or benign) given a series of features that are extracted from the file itself. This study focuses on the following techniques:
       - Malware detection via analysis of number of strings, calls and binary patterns [10]
       - Malware detection via analysis of the program executable header [11]
       - Malware prediction via function call frequency, usage of non-standard instructions and use of suspicious system calls [12]
       - Malware detection via statistical analysis of opcode distributions [13]
  • 9. Research methodology
       The following picture describes the flow used in this study to generate rules that will be able to catch malware:
       Raw Data Acquisition: In order to learn from the behavior of malware and benign software, a large database of samples of both malware and benign software is needed. Since malware has been analyzed and classified by security researchers, it is quite easy to find malware databases on the internet [14]. In this study, all the known and classified malware from the year 2014 is used. In order to prevent noise in the dataset, derivatives of the same malware are not included in the sample. All 400 families of malware found in 2014 were classified and used in this study. Our benign software dataset is represented by standard applications located under “c:\program files (x86)” such as “Outlook”, “Word”, “Excel”, and “Calculator”. These were taken from a non-infected computer.
       Note: For security reasons, the malware database is protected by a password that can be retrieved from [14]. All the research and access to malware for this study were done on dedicated virtual machines in order to prevent unintentional infection.
       Extraction of Significant Data: According to the aforementioned bibliography on malware detection via data mining, this study focuses on the following opcodes: call, nop, int, rdtsc, sbb, shld, fdivp, imul, pushf, setb, fild and xor. In this study those opcodes represent the criteria for malware detection. All the above opcodes are extracted from executables and libraries using the IDA [15] disassembler.
  • 10. Opcode Relevance in malware
One of the main challenges for malware is “camouflage”. In order to survive anti-virus software and analysis by security researchers, malware must hide itself, since once it is discovered it is automatically removed from the infected computer. Moreover, when a piece of malware is discovered, its characteristics are shared with the entire community. In order to hide itself, malware uses a technique called “packing” [16], which consists of compressing or encrypting the original executable. When the packed executable is run, the decompression/decryption code recreates the original code from the compressed/encrypted code before executing it. Executable compression is also frequently used to deter reverse engineering or to obfuscate the contents of the executable, for example, to hide the presence of malware from anti-virus scanners. Executable compression can be used to prevent direct disassembly, since it masks string literals and modifies signatures. Although this does not eliminate the chance of reverse engineering, it can make the process more costly.
The following picture illustrates how the “packing” mechanism works:
Average Calculation
A Python script is used to count the number of instances found for each of the relevant opcodes listed above. Then a value is calculated for every relevant opcode, according to the following formula (example for the “call” opcode):
Call percentage = (Number of Call * Size of Call opcode) / Size of all text section
Results Export
The same Python script then exports an Excel table of all the opcode values, one row for every disassembled file (benign software and malware).
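The calculation described above can be sketched as follows. This is a minimal illustration, not the script used in the study: the function name `opcode_percentages`, the `opcode_sizes` mapping and the sample listing are assumptions introduced here, and the byte sizes are made-up values.

```python
from collections import Counter

# Opcodes the study treats as relevant (taken from the text above).
RELEVANT_OPCODES = ("call", "nop", "int", "rdtsc", "sbb", "shld",
                    "fdivp", "imul", "pushf", "setb", "fild", "xor")


def opcode_percentages(mnemonics, opcode_sizes, text_section_size):
    """Apply (count * opcode size) / text-section size to each relevant opcode.

    mnemonics: iterable of mnemonic strings from a disassembly listing.
    opcode_sizes: hypothetical mapping of opcode -> size in bytes.
    text_section_size: size of the text section in bytes.
    """
    if text_section_size <= 0:
        raise ValueError("text_section_size must be positive")
    counts = Counter(m.lower() for m in mnemonics)
    return {
        op: counts[op] * opcode_sizes[op] / text_section_size
        for op in RELEVANT_OPCODES
    }


# Illustrative input only: these sizes and mnemonics are made up.
example_sizes = {op: 2 for op in RELEVANT_OPCODES}
example_sizes["call"] = 5
listing = ["push", "call", "mov", "xor", "call", "ret"]
result = opcode_percentages(listing, example_sizes, text_section_size=100)
print(result["call"])  # 2 calls * 5 bytes / 100 bytes
```

Each returned dictionary corresponds to one row of the exported table, with one column per relevant opcode.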
  • 11. Weka
The generated Excel file is used for analysis with the WEKA data mining tool. Since the target field is nominal, the J48 and Kstar algorithms are used. The test option used to validate the model is cross-validation with a percentage split of 66%.
Prediction parameters
TP (true positive): number of malware samples correctly predicted as malware
TN (true negative): number of benign software samples correctly predicted as benign
FP (false positive): missed malware prediction, i.e. malware was predicted as benign software
FN (false negative): missed benign software prediction, i.e. benign software was predicted as malware
Note that the FP and FN labels follow this study's own usage. Under the standard convention, with malware as the positive class, missed malware would be called a false negative and misclassified benign software a false positive.
Noise method
Randomized noise with an intensity ranging from 5% to 50% is applied to the generated Excel table. The noise is applied only to the head of the tree, i.e. in our case to the “imul” instruction. Applying such noise tests the robustness of the model, that is, whether typical noise undermines the study's conclusion. Since noise can result from non-standard code, for example code written directly in assembler by programmers, we assume that typical noise can have an intensity of 30%. If the malware prediction score is still greater than 90%, we can conclude that the generated model is robust enough.
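The noise method can be sketched as follows. The text does not specify how an intensity of, say, 30% is translated into a perturbation, so this sketch assumes a uniform relative perturbation within plus or minus the intensity; the function name `add_noise` and the sample values are hypothetical.

```python
import random


def add_noise(values, intensity, seed=None):
    """Perturb each value by a random relative amount within +/- intensity.

    Assumption: the noise is multiplicative and uniformly distributed;
    the study does not state which noise model it used.
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must be between 0 and 1")
    rng = random.Random(seed)
    return [v * (1.0 + rng.uniform(-intensity, intensity)) for v in values]


# Made-up "imul" column values, perturbed with 30% noise intensity.
imul_column = [0.010, 0.025, 0.000, 0.040]
noisy = add_noise(imul_column, intensity=0.30, seed=0)
```

Only the column at the head of the tree is perturbed; the other opcode columns are left unchanged before the model is evaluated again.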
  • 12. Results
In this research a deterministic method for malware prediction was formulated and tested. This method proved to be highly successful in differentiating between malware and benign software. A key factor in identifying malware efficiently was having an appropriate set of instructions. To find the ideal instruction set, I performed an iterative investigation, as described below:
Recurrent WEKA process
As opposed to previous research on malware detection via data mining, such as Sanjam Singla et al. [12], in this study the analysis did not stop when a high level of predictability was achieved. In every WEKA iteration the “head of the tree” was removed and a new analysis was performed to determine whether the “head of the tree” is a key opcode in our prediction. A good result after removal of the “head of the tree” means that it is not a key opcode for the prediction. The following describes the recurrent WEKA process used:
First run: In the first WEKA run, all the potential opcodes are taken into account, i.e. call, nop, int, rdtsc, sbb, shld, fdivp, imul, pushf, setb, fild and xor. The prediction score was 98% and the head of the tree was the “call” opcode.
Second run: The call opcode was removed, so WEKA was run with: nop, int, rdtsc, sbb, shld, fdivp, imul, pushf, setb, fild and xor. The prediction score was 97% and the head of the tree was the “xor” opcode.
Third run: The xor opcode was removed, so WEKA was run with: nop, int, rdtsc, sbb, shld, fdivp, imul, pushf, setb and fild. The prediction score was 97% and the head of the tree was the “int” opcode.
Fourth run: The int opcode was removed, so WEKA was run with: nop, rdtsc, sbb, shld, fdivp, imul, pushf, setb and fild. The prediction score was 96% and the head of the tree was the “rdtsc” opcode.
Fifth run: The rdtsc opcode was removed, so WEKA was run with: nop, sbb, shld, fdivp, imul, pushf, setb and fild. The prediction score was 95% and the head of the tree was the “sbb” opcode.
Sixth run: The sbb opcode was removed, so WEKA was run with: nop, shld, fdivp, imul, pushf, setb and fild. The prediction score was 91% and the head of the tree was the “imul” opcode.
Seventh run: The imul opcode was removed, so WEKA was run with: nop, shld, fdivp, pushf, setb and fild. The prediction score decreased to 78%.
At this point the recurrent processing was stopped due to the significant decrease in prediction score. The last acceptable score was 91%, with the opcodes: nop, shld, fdivp, imul, pushf, setb and fild.
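The recurrent process above can be sketched as a loop. This is a minimal illustration under stated assumptions: `recurrent_removal` and `reported_run` are hypothetical names, the 0.90 stopping threshold is an assumption (the text only says the process stopped after a significant decrease), and the stub replays the scores reported above instead of training a model in WEKA.

```python
def recurrent_removal(opcodes, evaluate, min_score):
    """Repeatedly drop the root attribute until the score falls below min_score.

    evaluate(opcodes) is a caller-supplied function returning
    (score, head_of_tree) for a model trained on the given opcodes.
    Returns the run history and the last opcode set with an acceptable score.
    """
    remaining = list(opcodes)
    history = []
    last_acceptable = None
    while remaining:
        score, head = evaluate(remaining)
        history.append((tuple(remaining), score, head))
        if score < min_score:
            break  # significant drop: the previously removed head was essential
        last_acceptable = tuple(remaining)
        remaining.remove(head)
    return history, last_acceptable


ALL_OPCODES = ["call", "nop", "int", "rdtsc", "sbb", "shld",
               "fdivp", "imul", "pushf", "setb", "fild", "xor"]

# Scores and tree heads as reported in the seven runs above.
REPORTED = [(0.98, "call"), (0.97, "xor"), (0.97, "int"), (0.96, "rdtsc"),
            (0.95, "sbb"), (0.91, "imul"), (0.78, None)]


def reported_run(opcodes):
    """Stub standing in for a WEKA run: replays the reported results."""
    return REPORTED[len(ALL_OPCODES) - len(opcodes)]


history, last_acceptable = recurrent_removal(ALL_OPCODES, reported_run, 0.90)
print(last_acceptable)
```

In practice `evaluate` would train a decision tree on the selected columns and return its score together with the attribute at the root.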
  • 13. Results summary per round
Dataset                                     Algorithm   TP    FP   TN    FN   RMSE   Fscore
All                                         J48         307   4    102   4    0.13   0.98
All                                         Kstar       310   1    99    7    0.13   0.98
call removed                                J48         306   5    100   6    0.15   0.97
call removed                                Kstar       311   0    95    11   0.14   0.97
call & xor removed                          J48         308   3    99    7    0.15   0.97
call & xor removed                          Kstar       311   0    89    17   0.19   0.95
call, xor & int removed                     J48         307   4    101   5    0.14   0.97
call, xor & int removed                     Kstar       309   2    92    14   0.18   0.96
call, xor, int & rdtsc removed              J48         304   7    97    9    0.19   0.96
call, xor, int & rdtsc removed              Kstar       309   2    92    14   0.18   0.96
call, xor, int, rdtsc & sbb removed         J48         306   5    98    8    0.17   0.96
call, xor, int, rdtsc & sbb removed         Kstar       311   0    71    35   0.24   0.91
call, xor, int, rdtsc, sbb & imul removed   J48         308   3    33    73   0.37   0.78
call, xor, int, rdtsc, sbb & imul removed   Kstar       310   1    45    61   0.31   0.82
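The Fscore column appears to be consistent with an average of the per-class F-measures weighted by class size. This is an inference from the numbers in the table, not something the study states. The sketch below recomputes the value for two rows under that assumption; the function names are hypothetical, and the arguments follow the study's own column definitions, where FP counts malware predicted as benign and FN counts benign software predicted as malware.

```python
def f1(tp, fp, fn):
    """F1 for one class from its true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_fscore(malware_hit, malware_missed, benign_hit, benign_missed):
    """Class-size-weighted average of the per-class F1 values.

    Arguments map to the table columns TP, FP, TN, FN as the study defines them.
    """
    n_malware = malware_hit + malware_missed
    n_benign = benign_hit + benign_missed
    # Malware class: benign files flagged as malware lower its precision.
    f_malware = f1(malware_hit, benign_missed, malware_missed)
    # Benign class: missed malware lowers its precision.
    f_benign = f1(benign_hit, malware_missed, benign_missed)
    return (n_malware * f_malware + n_benign * f_benign) / (n_malware + n_benign)


all_j48 = weighted_fscore(307, 4, 102, 4)
imul_removed_j48 = weighted_fscore(308, 3, 33, 73)
print(round(all_j48, 2), round(imul_removed_j48, 2))
```

The last row pair shows why the score collapses once imul is removed: almost all malware is still caught, but most benign files are flagged as malware, which drags down the benign-class F-measure.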
  • 14. Rules generated by WEKA
The following graph represents the rules generated by WEKA for malware prediction with a score of 96%.
The following table represents the confusion matrix for the J48 algorithm:
         clean   virus
clean    102     4
virus    4       307
  • 15. Noise on model
The following table summarizes the noise intensity applied at the head of the tree, with the corresponding Fscore:
Noise intensity in % on "imul" variable   Fscore
0                                         0.969
2                                         0.966
5                                         0.964
8                                         0.966
10                                        0.966
13                                        0.964
16                                        0.966
20                                        0.954
25                                        0.961
30                                        0.961
35                                        0.946
40                                        0.954
45                                        0.939
50                                        0.916
Data set: call, xor, int, rdtsc & sbb removed
The following gives a graphical representation of the above results:
  • 16. Result Discussion and Future Research
The goal of this study was to demonstrate that malware can be caught via analysis of executable opcodes. I have shown that malware detection via data mining is a very promising method, since the prediction score achieved is 96% for 2014 malware. Compared to previous research on this topic, this study found a different set of instructions that indicate that code is malware. The final instructions found in our generated tree are: “imul”, “pushf” and “fild”. As explained before, those instructions are commonly used by “packer” and “protector” software in order to unpack/decrypt the malware code.
The novel approaches of the study, as opposed to previous research done in this field, are:
- Analysis of the malware surface: Since most malware uses packers and protectors to hide itself from security researchers, only a small portion of the malware code can be analyzed after disassembly. This study showed that analysis of a small portion of code (the loader) is enough to detect whether the executable is malware.
- Recurrent run of WEKA: As opposed to previous research, the WEKA data mining tool was run many times, in order to find the instructions that really influence the prediction. Originally the “call” instruction was singled out as the one identifying software as malware or not, as in the research by Sanjam Singla et al. [12], but after recurrent runs of WEKA the “imul” instruction was found to be the identifier of malware.
Future Research
Extension of the opcode set
Since malware changes very frequently, in future research the instruction set used for prediction must be enlarged and updated according to new malware. Furthermore, malware commonly uses rare instructions that are never generated by a compiler, so including such rare instructions in our instruction dataset will help to recognize malware.
Sensitive system calls
In this study, only malware instructions were checked. As claimed in the above bibliography, malware commonly uses sensitive calls such as “VirtualAllocEx” and “IsDebuggerPresent”. Use of such calls can point to malware as well.
PE header analysis
Another approach to detecting malware is to check the program header format of the executable. Since the “entry point” of the program is modified during the packing/protection mechanism, we can find PE header malformations that point to malware as well.
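As a starting point for the PE header analysis, the entry point can be read directly from the header and matched against the section table. The sketch below is an illustration only: the function names, the synthetic header and the section name “.packed” are hypothetical, and treating an entry point outside “.text” as suspicious is an assumed heuristic, not a result of this study.

```python
import struct


def entry_point_section(data):
    """Return (entry_rva, section_name) for raw PE bytes.

    section_name is None when no section contains the entry point.
    """
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)  # e_lfanew
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("PE signature not found")
    coff = pe_offset + 4  # COFF file header follows the signature
    (num_sections,) = struct.unpack_from("<H", data, coff + 2)
    (opt_size,) = struct.unpack_from("<H", data, coff + 16)
    optional = coff + 20  # optional header follows the 20-byte COFF header
    (entry_rva,) = struct.unpack_from("<I", data, optional + 16)
    table = optional + opt_size  # section table follows the optional header
    for i in range(num_sections):
        base = table + 40 * i  # each section header is 40 bytes
        name = data[base:base + 8].rstrip(b"\x00").decode("ascii", "replace")
        virtual_size, virtual_address = struct.unpack_from("<II", data, base + 8)
        if virtual_address <= entry_rva < virtual_address + virtual_size:
            return entry_rva, name
    return entry_rva, None


def fake_pe(entry_rva):
    """Build a tiny synthetic header for demonstration (not a runnable program)."""
    dos = b"MZ" + b"\x00" * 0x3A + struct.pack("<I", 0x40)
    coff = struct.pack("<HHIIIHH", 0x14C, 2, 0, 0, 0, 20, 0)
    optional = b"\x00" * 16 + struct.pack("<I", entry_rva)
    sections = b"".join(
        struct.pack("<8sII", name, 0x1000, address) + b"\x00" * 24
        for name, address in ((b".text", 0x1000), (b".packed", 0x2000))
    )
    return dos + b"PE\x00\x00" + coff + optional + sections


rva, section = entry_point_section(fake_pe(0x2010))
suspicious = section != ".text"  # assumed heuristic, for illustration only
print(hex(rva), section, suspicious)
```

A feature such as `suspicious` could be added as one more column in the exported table alongside the opcode values.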
  • 17. Resources List
[1] http://www.cisco.com/web/about/security/intelligence/virus-worm-diffs.html
[2] https://en.wikipedia.org/wiki/Disassembler
[3] https://en.wikipedia.org/wiki/Opcode
[4] http://www.felixcloutier.com/x86/
[5] http://www.foxbusiness.com/technology/2013/07/22/report-cyber-crime-costs-global-economy-up-to-1-trillion-year/
[6] http://www.nothink.org/honeypots/malware-archives/
[7] https://en.wikipedia.org/wiki/Freeware
[8] http://www.mcafee.com/us/about/news/2014/q4/20141209-01.aspx
[9] https://en.wikipedia.org/wiki/Antivirus_software
[10] M. Schultz, M. Eskin, E. Zadok, Data Mining Methods for Detection of New Malicious Executables, 2001
[11] Usukhbayar Baldangombo et al., A Static Malware Detection System Using Data Mining
[12] Sanjam Singla et al., A Novel Approach to Malware Detection using Static Classification, 2015
[13] Daniel Bilar, Opcodes as predictor for malware, International Journal of Electronic Security, 2007
[14] http://www.nothink.org/honeypots/malware-archives/
[15] https://www.hex-rays.com/products/ida/index.shtml
[16] https://en.wikipedia.org/wiki/Executable_compression
[17] https://en.wikipedia.org/wiki/Weka_%28machine_learning%29