Malware detection is an important part of computer system security. However, the signature-based methods in use today cannot reliably detect zero-day attacks and polymorphic viruses, which is why machine learning-based detection is needed.
2. Malware overview
●Malicious software that tries to damage your system or gain unauthorized access to it.
●Comes in different types:
Virus | Trojan | Adware | Worm, etc.
●More than 100,000 (1 lakh) new samples are found by AV companies every day.
●Most of them are variants of each other or of older samples.
3. Current status of Detection
●Antivirus companies currently use signature-based detection.
●A signature can be anything from strings to assembly code snippets.
4. Problem with current method
●Polymorphic malware can change its code on every execution.
●Most malware can encrypt or pack itself using packers.
●Detecting such malware with signatures does not always work.
5. Solution using ML
●API call sequence features can be used to decide whether a file is malicious.
●API calls are a robust basis for analysis because they cannot be altered easily.
●They outline everything happening to the operating system, including operations on files, the registry, mutexes, processes and other features mentioned earlier.
●For example, OpenFile and CreateFile describe file operations, while OpenMutex and CreateMutex describe mutexes opened or created.
6. Our System Description
●Cuckoo Sandbox is used to analyze each sample and record all of its API calls.
●The report for each file is saved as a JSON file.
●The calls are parsed and saved to a CSV file as a matrix of feature vectors.
●Samples with fewer than 10 calls (or any other user-set value) are ignored.
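The parsing and filtering steps above can be sketched as follows. This is a minimal illustration, not the authors' parser: it assumes the report layout `behavior -> processes -> calls -> api` (as in Cuckoo 2.x JSON reports), and the names `extract_api_calls`, `load_samples` and `MIN_CALLS` are hypothetical.

```python
import json

MIN_CALLS = 10  # user-set threshold; samples with fewer calls are ignored


def extract_api_calls(report):
    """Return the ordered list of API names from one parsed report.

    Assumed layout: report["behavior"]["processes"][i]["calls"][j]["api"].
    Missing keys simply yield an empty list.
    """
    calls = []
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            api = call.get("api")
            if api:
                calls.append(api)
    return calls


def load_samples(report_paths, min_calls=MIN_CALLS):
    """Map each report path to its API call list, skipping short samples."""
    samples = {}
    for path in report_paths:
        with open(path, encoding="utf-8") as fh:
            report = json.load(fh)
        calls = extract_api_calls(report)
        if len(calls) >= min_calls:
            samples[path] = calls
    return samples
```

Calling `load_samples(["report1.json", "report2.json"])` (hypothetical file names) would return only the samples that made at least `MIN_CALLS` API calls.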
7. Feature extraction
●The frequency representation approach is used.
●S1, S2, ... are sample numbers; API1, API2, ... are the API calls, and each cell holds how many times that sample made that call.
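As an illustration of the frequency representation, the sketch below builds one row per sample and one column per API name, then writes the matrix to CSV. The function names and the toy samples `S1` and `S2` are hypothetical and not taken from the original system.

```python
import csv
from collections import Counter


def frequency_matrix(samples):
    """samples: dict of sample id -> list of API names.

    Returns (header, rows), where each row holds the sample id followed by
    the number of times the sample called each API in the header.
    """
    apis = sorted({api for calls in samples.values() for api in calls})
    header = ["sample"] + apis
    rows = []
    for sample_id, calls in samples.items():
        counts = Counter(calls)  # missing APIs count as 0
        rows.append([sample_id] + [counts[api] for api in apis])
    return header, rows


def write_csv(path, header, rows):
    """Save the matrix so it can be loaded by the learning step."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(rows)


# Toy data for illustration only.
samples = {
    "S1": ["CreateFile", "OpenFile", "CreateFile"],
    "S2": ["OpenMutex", "CreateFile"],
}
header, rows = frequency_matrix(samples)
```

Here `S1` gets the vector (2, 1, 0) and `S2` gets (1, 0, 1) over the columns CreateFile, OpenFile, OpenMutex.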
8. Redundant subsequence removal methods
●A large number of uninformative API call sequences are present.
●They can be removed using N-gram subsequence extraction.
●If an API call pattern is present in many samples, remove it (the matching works like a sliding window).
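One possible reading of the N-gram removal step is sketched below. The slides do not define how many samples count as "many", so the `max_share` cutoff is an assumption, as are the function names and the toy sequences.

```python
from collections import Counter


def ngrams(calls, n):
    """Slide a window of length n over the call sequence."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


def common_ngrams(samples, n, max_share):
    """N-grams that occur in more than max_share of all samples."""
    doc_freq = Counter()
    for calls in samples.values():
        doc_freq.update(set(ngrams(calls, n)))  # count each sample once
    limit = max_share * len(samples)
    return {gram for gram, count in doc_freq.items() if count > limit}


def remove_ngrams(calls, redundant, n):
    """Drop every call covered by a window that matches a redundant n-gram."""
    drop = set()
    for i in range(len(calls) - n + 1):
        if tuple(calls[i:i + n]) in redundant:
            drop.update(range(i, i + n))
    return [call for i, call in enumerate(calls) if i not in drop]


# Toy data: the pair (A, B) appears in every sample, so it carries no signal.
samples = {
    "S1": ["A", "B", "C", "D"],
    "S2": ["A", "B", "E"],
    "S3": ["A", "B", "F", "G"],
}
redundant = common_ngrams(samples, n=2, max_share=0.8)
cleaned = {sid: remove_ngrams(calls, redundant, 2) for sid, calls in samples.items()}
```

A stricter or looser `max_share` changes how aggressively shared patterns are pruned; the right value would have to be tuned on real data.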
9. Redundant subsequence removal methods
●Another method uses information gain.
●C is the class of a sample in the malware detection system.
●H(C) is the information entropy of C.
●The information gain of the subsequence T to class C is:
●p(ti) is the probability that the feature appears and p(tj) is the probability that it does not.
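The formula itself is not reproduced in this text. The sketch below therefore assumes the standard definition of information gain, IG(T) = H(C) - [p(t) H(C | T present) + p(not t) H(C | T absent)], which may differ in notation from the original slide. The function names and labels are hypothetical.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * log2(p)
    return result


def information_gain(present, labels):
    """present[i] is True when subsequence T occurs in sample i."""
    with_t = [c for has, c in zip(present, labels) if has]
    without_t = [c for has, c in zip(present, labels) if not has]
    total = len(labels)
    conditional = (len(with_t) / total) * entropy(with_t) \
        + (len(without_t) / total) * entropy(without_t)
    return entropy(labels) - conditional


labels = ["malware", "malware", "benign", "benign"]
perfect = information_gain([True, True, False, False], labels)  # T separates the classes
useless = information_gain([True, False, True, False], labels)  # T tells us nothing
```

Subsequences whose information gain is close to zero are candidates for removal, since knowing whether they occur says little about the class.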
10. Using machine learning methods
●After the features have been extracted and selected, machine learning methods can be applied to the resulting data.
●The R packages used to implement the algorithms are:
Random Forest – randomForest
K-Nearest Neighbours – class
Support Vector Machines – kernlab
J48 Decision Tree – RWeka
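To show how a classifier consumes the frequency vectors, here is a plain Python sketch of k-nearest-neighbours classification. It is a stand-in for illustration only: the work described here used the R packages listed above, and the function name, vectors and labels below are hypothetical.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def knn_predict(train_vectors, train_labels, query, k=1):
    """Label the query by majority vote among its k nearest training vectors."""
    order = sorted(range(len(train_vectors)),
                   key=lambda i: dist(train_vectors[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy API-frequency vectors (columns could be CreateFile, OpenFile, OpenMutex).
train_vectors = [[5, 0, 1], [4, 1, 0], [0, 3, 6], [1, 4, 5]]
train_labels = ["benign", "benign", "malware", "malware"]

first = knn_predict(train_vectors, train_labels, [0, 4, 6], k=1)
second = knn_predict(train_vectors, train_labels, [5, 1, 1], k=1)
```

With these toy vectors, `first` is labelled "malware" and `second` is labelled "benign".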
12. Comparison method
●The Cuckoo analysis score indicates how malicious an analyzed file is.
●There are three severity levels, each with its own score: 1 for low, 2 for medium and 3 for high.
●It is hard to measure the accuracy of this detection because there is no threshold value indicating whether a sample is malicious or not.
●The score can be compared with the results obtained from the ML algorithms.
13. Results
●The accuracy of detection is measured as the percentage of correctly identified instances:
accuracy = (correctly identified instances / total instances) × 100%
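The accuracy measure can be computed as in the short sketch below. The function name and the example labels are hypothetical and are not results from the experiments.

```python
def accuracy(predicted, actual):
    """Percentage of instances whose predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one instance is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


# Toy labels: three of four predictions are correct.
predicted = ["trojan", "worm", "benign", "virus"]
actual = ["trojan", "worm", "benign", "worm"]
score = accuracy(predicted, actual)
```

The same function applies to both the multi-class and the binary setting; only the label set changes.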
14. Support Vector Machines Results
●The overall accuracy achieved was 87.6% for multi-class classification and 94.6% for binary classification.
15. Random Forest Results
●The algorithm achieved good prediction accuracy: 95.69% for multi-class classification and 96.8% for binary classification.
16. KNN Results
●The best accuracy was achieved with k=1. The algorithm reached 87% for multi-class classification and 94.6% for two-class classification.
17. Conclusion
Experiments show that the integrated machine learning classifier performs better than signature-based detection alone.
18. Conclusion
In the classification problems, different models gave different results. The lowest accuracy was achieved by Naive Bayes (72.34% and 55%), followed by k-Nearest Neighbours (87% multi-class, 94.6% binary) and Support Vector Machines (87.6% multi-class, 94.6% binary). The highest accuracy was achieved by the J48 and Random Forest models: 93.3% and 95.69% respectively for multi-class classification, and 94.6% and 96.8% respectively for binary classification.