Malware Detection Using Machine Learning

By: Shubham Dubey (14ucs114)

Malware overview
●Malicious software that tries to damage your system or gain
unauthorized access to it.
●Can be of different types:
Virus | Trojan | Adware | Worm etc.
●More than 1 lakh (100,000) new samples are found by AV companies every day.
●Most of them are variants of one another or of older samples.

Current status of Detection
●Currently, antivirus companies use signature-based detection.
●A signature can be anything from strings to assembly code
snippets.

Problem with current method
●Polymorphic malware can change its code on every execution.
●Most malware can encrypt or pack itself using packers.
●Detecting such malware with signatures does not work all the time.
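
The problem can be illustrated with a minimal Python sketch. The signature bytes, the sample bytes and the single-byte XOR "packer" are all hypothetical stand-ins chosen for illustration, not a real AV engine or real malware. The sketch only shows that a fixed byte signature stops matching once the payload is re-encoded.

```python
# Hypothetical signature: a byte string a scanner might search for.
SIGNATURE = b"EVIL_PAYLOAD"


def matches_signature(data, signature=SIGNATURE):
    """Return True if the raw bytes contain the signature."""
    return signature in data


def xor_pack(data, key):
    """Toy 'packer' (an assumption for illustration): XOR every byte with a one-byte key."""
    return bytes(b ^ key for b in data)


original = b"\x90\x90" + SIGNATURE + b"\xc3"
packed = xor_pack(original, key=0x5A)

print(matches_signature(original))             # True: signature found
print(matches_signature(packed))               # False: same payload, re-encoded
print(xor_pack(packed, key=0x5A) == original)  # True: the payload is recoverable at run time
```

Changing the key yields yet another byte pattern, which is why one static signature per variant does not scale.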

Solution using ML
●API call sequence features can be used to decide whether a file is
malicious or not.
●API calls are a robust basis for analysis, as they cannot be altered easily.
●They outline everything that happens to the operating system,
including operations on files, the registry, mutexes, processes
and other system objects.
●For example, OpenFile and CreateFile describe file operations, while
OpenMutex and CreateMutex describe the mutexes opened or created.

Our System Description
●Cuckoo Sandbox is used to analyze each sample and record all of its API calls.
●The report for each file is saved as a JSON file.
●The calls are parsed and saved to a CSV file as a matrix of feature vectors.
●Samples with fewer than 10 calls (or any other user-set value) are
ignored.
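
The parsing step could look roughly like the sketch below. It assumes the report layout `report["behavior"]["processes"][i]["calls"][j]["api"]`, which should be checked against the Cuckoo version in use. The function names and the demo report are hypothetical.

```python
import json

MIN_CALLS = 10  # user-set threshold from the slide


def extract_api_calls(report):
    """Collect API names, in order, from a Cuckoo-style report dict.

    Assumed layout: report["behavior"]["processes"][i]["calls"][j]["api"].
    """
    calls = []
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            api = call.get("api")
            if api:
                calls.append(api)
    return calls


def load_report(path):
    """Read one JSON report from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def keep_sample(calls, min_calls=MIN_CALLS):
    """Samples with fewer than min_calls API calls are ignored."""
    return len(calls) >= min_calls


# Tiny hand-written report standing in for a real one.
demo_report = {
    "behavior": {
        "processes": [
            {"calls": [{"api": "CreateFileW"}, {"api": "WriteFile"}]},
            {"calls": [{"api": "OpenMutexW"}]},
        ]
    }
}
demo_calls = extract_api_calls(demo_report)
print(demo_calls)               # ['CreateFileW', 'WriteFile', 'OpenMutexW']
print(keep_sample(demo_calls))  # False: only 3 calls, below the threshold
```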

Feature extraction
●The frequency representation approach has been taken: each entry
counts how many times a sample called a given API.
●S1, S2, ... are sample numbers. API1, API2, ... are the calls made to each API.
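
A minimal sketch of the frequency representation, assuming each sample is already available as a list of API names. The sample data, the function names and the `sample` column header are hypothetical.

```python
import csv
import io
from collections import Counter


def frequency_matrix(samples):
    """samples: dict mapping sample name -> list of API names.

    Returns (sorted API column names, rows of [name, count, count, ...]).
    """
    apis = sorted({api for calls in samples.values() for api in calls})
    rows = []
    for name, calls in samples.items():
        counts = Counter(calls)  # missing APIs count as 0
        rows.append([name] + [counts[api] for api in apis])
    return apis, rows


def write_csv(apis, rows, handle):
    """Write the matrix with a header row: sample, API1, API2, ..."""
    writer = csv.writer(handle)
    writer.writerow(["sample"] + apis)
    writer.writerows(rows)


samples = {
    "S1": ["CreateFile", "OpenFile", "CreateFile"],
    "S2": ["OpenMutex", "CreateFile"],
}
apis, rows = frequency_matrix(samples)

buffer = io.StringIO()  # stands in for an open CSV file
write_csv(apis, rows, buffer)
print(buffer.getvalue())
```

Each row is one sample and each column one API, so the rows can be passed to a classifier as feature vectors.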

Redundant subsequence removal methods
●A large number of the API call sequences are uninformative.
●They can be removed using N-gram subsequence extraction over
the samples.
●If an API call pattern is present in many samples, remove it
(this works like a sliding window).
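
One way to read this step is sketched below: slide a window of length n over each call sequence, count in how many samples each n-gram occurs, and drop the n-grams that occur in more than a chosen fraction of samples. The `max_fraction` threshold, the function names and the toy sequences are assumptions, not values from the slides.

```python
from collections import Counter


def ngrams(calls, n):
    """Slide a window of length n over the call sequence."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


def common_ngrams(samples, n, max_fraction):
    """N-grams that occur in more than max_fraction of the samples."""
    sample_freq = Counter()
    for calls in samples:
        sample_freq.update(set(ngrams(calls, n)))  # count each sample once
    limit = max_fraction * len(samples)
    return {gram for gram, count in sample_freq.items() if count > limit}


def filtered_ngrams(calls, n, redundant):
    """Keep only the n-grams not flagged as redundant."""
    return [gram for gram in ngrams(calls, n) if gram not in redundant]


samples = [
    ["LdrLoadDll", "LdrGetProcedureAddress", "CreateFile", "WriteFile"],
    ["LdrLoadDll", "LdrGetProcedureAddress", "OpenMutex", "CreateMutex"],
    ["LdrLoadDll", "LdrGetProcedureAddress", "RegOpenKey", "RegSetValue"],
]
redundant = common_ngrams(samples, n=2, max_fraction=0.5)
print(redundant)
print(filtered_ngrams(samples[0], 2, redundant))
```

Here the loader pair shared by all three samples is dropped, while the pairs specific to each sample are kept.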

Redundant subsequence removal methods
●Another method is to use information gain.
●C is the class variable of the malware detection system, and
H(C) is its information entropy.
●The information gain of the subsequence T with respect to class C is
the reduction in entropy, H(C) - H(C|T).
●p(ti) is the probability that the feature appears, and p(tj) is the
probability that it does not.
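
The formulas on this slide did not survive, so what follows is the standard textbook definition of information gain for a present/absent feature, written with the slide's symbols. The class index k and the base-2 logarithm are assumptions. t_i denotes "subsequence T present" and t_j "subsequence T absent".

```latex
\begin{align}
H(C)  &= -\sum_{k} p(c_k)\,\log_2 p(c_k) \\
IG(T) &= H(C) - H(C \mid T) \\
      &= H(C) + p(t_i)\sum_{k} p(c_k \mid t_i)\,\log_2 p(c_k \mid t_i)
              + p(t_j)\sum_{k} p(c_k \mid t_j)\,\log_2 p(c_k \mid t_j)
\end{align}
```

Subsequences with low IG(T) say little about the class and are candidates for removal.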

Using machine learning methods
●After the features have been extracted and selected, machine
learning methods can be applied to the resulting data.
●The R packages used to implement the algorithms are:
Random Forest – randomForest
K-Nearest Neighbours – class
Support Vector Machines – kernlab
J48 Decision Tree – RWeka
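
To show what the simplest of these classifiers does with frequency vectors, here is a minimal k-nearest-neighbours sketch in plain Python. It is not the implementation in the `class` package. The Euclidean distance, the majority vote and the toy vectors are assumptions made for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_vectors, train_labels, query, k=1):
    """Label the query by majority vote among its k nearest training vectors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical API call counts for [CreateFile, OpenMutex, RegSetValue].
train_vectors = [[9, 0, 7], [8, 1, 6], [1, 0, 0], [2, 1, 0]]
train_labels = ["malicious", "malicious", "benign", "benign"]

print(knn_predict(train_vectors, train_labels, [7, 0, 5], k=1))  # malicious
print(knn_predict(train_vectors, train_labels, [1, 1, 0], k=3))  # benign
```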

Comparison method
●The Cuckoo analysis score indicates how malicious an
analyzed file is.
●There are three levels of severity, each with its own score:
1 for low, 2 for medium and 3 for high.
●It is hard to measure detection accuracy from this score, since
there is no threshold value indicating whether a sample is malicious
or not.
●The score can be compared with the results obtained from the ML algorithms.

Results
●The accuracy of detection is measured as the
percentage of correctly identified instances:
(correctly classified samples / total samples) × 100.
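
The accuracy measure can be written as a short function. The label lists are made-up examples, not data from the experiments.

```python
def accuracy(true_labels, predicted_labels):
    """Percentage of instances whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one instance is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


true_labels = ["malicious", "benign", "malicious", "benign"]
predicted = ["malicious", "benign", "benign", "benign"]
print(accuracy(true_labels, predicted))  # 75.0 (3 of 4 correct)
```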

Support Vector Machines Results
●The overall accuracy achieved was 87.6% for multi-class
classification and 94.6% for binary classification.

Random Forest Results
●The algorithm achieved good prediction accuracy:
95.69% for multi-class classification and
96.8% for binary classification.

KNN Results
●The best accuracy was achieved with k=1:
87% for multi-class classification and 94.6% for two-class
classification.

Conclusion
Experiments show that the integrated machine learning
classifier performs better than signature-based
detection on its own.

Conclusion
Different models gave different results on the classification
problems. The lowest accuracy was achieved by Naive Bayes
(72.34% and 55%), followed by k-Nearest-Neighbors
(87% multi-class, 94.6% binary) and Support Vector Machines
(87.6% multi-class, 94.6% binary). The highest accuracy was
achieved by the J48 and Random Forest models:
93.3% and 95.69% for multi-class classification and
94.6% and 96.8% for binary classification, respectively.
