Presentation_Malware Analysis.pptx

Malware Analysis and Detection Using Machine Learning

Introduction
• Malware is a generic term for all kinds of computer threats. A simple
classification divides malware into file infectors and stand-alone
malware. Another way of classifying malware is by its particular action:
worms, backdoors, trojans, rootkits, spyware, adware, etc.
• This project combines two fields, cyber security and machine learning.
It gives an overview of machine learning methods that have been proposed
for malware detection, such as the Random Forest Classifier, Gradient
Boosting, and AdaBoost.

CONTENTS:-
1. Aim and Objectives
2. Dataset
3. Algorithms
4. Model building
5. Accuracy
6. Testing
7. Conclusion

Problem Statement:
Analyse the data and build a classifier model that distinguishes between
legitimate (clean) files and malicious (malware) files. We have used
ensemble techniques, namely bagging and boosting.
OBJECTIVES:
• To use machine learning algorithms to distinguish legitimate files
from malware files.
• To achieve the best possible accuracy on the test data.

DATASET:-
Our project uses 138048 samples (files), of which 41323 are legitimate
(clean) files, most of them .exe and .dll files. The remaining 96724
files, taken from the www.virusshare.com site, are treated as malware.
• Train Dataset: 30%
• Test Dataset: 70%
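
A minimal sketch of a split with the proportions stated above, using only the Python standard library. The helper name `train_test_split`, its parameters, and the toy `samples` list are assumptions made for illustration; the slides do not show the code that was actually used.

```python
import random

def train_test_split(rows, train_fraction=0.30, seed=0):
    """Shuffle rows reproducibly and cut them into train and test parts."""
    shuffled = list(rows)                     # copy, so the input is untouched
    random.Random(seed).shuffle(shuffled)     # fixed seed gives a repeatable split
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy stand-in for the dataset: 10 numbered samples (hypothetical data).
samples = list(range(10))
train, test = train_test_split(samples)
print(len(train), len(test))
```

Shuffling before cutting matters when the file is ordered by class, since an unshuffled cut could put only one class in the training part.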

MODEL BUILDING:-
The model-building pipeline consists of the following steps:
• Feature selection
• Separate the data into X (features) and Y (labels)
• Train/test split
• Fit the model
• Predict on the test data
• Find the model with the highest accuracy
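
The "separate the data into X and Y" step can be sketched as below. Every column name (`name`, `size`, `sections`, `legitimate`) and every row is hypothetical; the slides do not list the real feature columns.

```python
# Hypothetical rows standing in for the dataset; every column name here is
# an assumption made for illustration, not a name taken from the slides.
rows = [
    {"name": "a.exe", "size": 120, "sections": 4, "legitimate": 1},
    {"name": "b.dll", "size": 300, "sections": 6, "legitimate": 1},
    {"name": "c.exe", "size": 45, "sections": 9, "legitimate": 0},
]

LABEL = "legitimate"      # target column: 1 = clean file, 0 = malware
NON_FEATURES = {"name"}   # identifier columns are left out of X

# X holds one list of feature values per file, y holds the matching labels.
feature_names = [k for k in rows[0] if k != LABEL and k not in NON_FEATURES]
X = [[row[k] for k in feature_names] for row in rows]
y = [row[LABEL] for row in rows]

print(feature_names)
print(X)
print(y)
```

Keeping the label out of X is essential; a model that sees the label as a feature would report a misleadingly high accuracy.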

Model Building
Exploratory Data Analysis:
Exploratory Data Analysis refers to the critical process of performing initial
investigations on data so as to discover patterns, spot anomalies, test hypotheses and check
assumptions with the help of summary statistics and graphical representations.

Correlation
Pearson Correlation:

The Pearson correlation coefficient ranges from +1 to -1.
+1: highest positive correlation
-1: highest negative correlation
0: no linear correlation between the two variables (this alone does not prove that they are independent of each other)
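
As an illustration of the three cases, here is a small standard-library sketch of the Pearson coefficient. The function name `pearson` and the toy inputs are assumptions chosen for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])    # y rises with x
r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])  # y falls as x rises
r_zero = pearson([-1, 0, 1], [1, 0, 1])       # y = x*x: dependent, yet r is 0
print(r_up, r_down, r_zero)
```

The last case shows why a coefficient of 0 should be read as "no linear relationship": y is fully determined by x there, but the coefficient is still zero.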

Split the Data into X train and Y train

Algorithms :
Decision Tree:
The general motive of using a Decision Tree is to create a training model that can
predict the class or value of the target variable by learning decision rules inferred from
prior data (training data). Each internal node of the tree corresponds to an attribute,
and each leaf node corresponds to a class label.
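
To make "learning decision rules" concrete, the sketch below finds the single best attribute test for a root node by minimising Gini impurity. Gini is one common splitting criterion and is an assumption here, as are the function names and the toy data; the slides do not say which criterion was used.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means a pure node)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(X, y):
    """Return (feature_index, threshold) with the lowest weighted Gini impurity."""
    best, best_score = None, float("inf")
    n = len(y)
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [label for row, label in zip(X, y) if row[j] <= threshold]
            right = [label for row, label in zip(X, y) if row[j] > threshold]
            if not left or not right:
                continue  # a split must send samples to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best_score, best = score, (j, threshold)
    return best

# Toy data: feature 0 separates the classes cleanly, feature 1 does not.
X = [[1, 7], [2, 3], [8, 6], [9, 2]]
y = [0, 0, 1, 1]
print(best_split(X, y))  # (0, 2): test "feature 0 <= 2" at the root
```

A full tree repeats this search inside each child until the leaves are pure or a depth limit is reached.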

Random Forest Classifier
Random forest is a supervised learning algorithm. The "forest" it builds is an
ensemble of decision trees, usually trained with the "bagging" method. The
general idea of the bagging method is that a combination of learning
models improves the overall result.
Put simply: random forest builds multiple decision trees and merges them
together to get a more accurate and stable prediction.
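
The bagging idea can be sketched with very small trees (one-split "stumps") in place of full decision trees. This is a simplified illustration under stated assumptions, not the classifier used in the project: the stump always predicts class 1 above its threshold, and all names and data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, y, feature):
    """Pick the threshold on one feature that makes the fewest training errors.
    Simplification: the stump predicts 1 when the value exceeds the threshold."""
    best_threshold, best_errors = None, len(y) + 1
    for threshold in sorted({row[feature] for row in X}):
        errors = sum((row[feature] > threshold) != bool(label)
                     for row, label in zip(X, y))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return feature, best_threshold

def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bagging: each learner sees a bootstrap sample drawn with replacement.
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        # Random forests also randomise which features a learner may use.
        feature = rng.randrange(len(X[0]))
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict(forest, row):
    """Merge the learners by majority vote."""
    votes = Counter(int(row[feature] > threshold) for feature, threshold in forest)
    return votes.most_common(1)[0][0]

X = [[1, 2], [2, 1], [3, 3], [7, 8], [8, 6], [9, 9]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
print(predict(forest, [0, 0]), predict(forest, [10, 10]))
```

Because every learner is trained on a different resample, individual mistakes tend to differ, and the vote averages them out.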

Gradient Boosting:
Gradient boosting machines are a family of powerful machine-learning
techniques that have shown considerable success in a wide range of
practical applications. They are highly customizable to the particular
needs of the application, for example by being trained with respect to
different loss functions. The models are built sequentially: each new weak
learner, typically a small decision tree, is fitted to reduce the errors
left by the ensemble built so far.
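
A minimal sketch of the boosting loop, assuming squared loss, one-feature inputs and one-split regression stumps as weak learners. With squared loss the negative gradient is simply the residual, which keeps the example short; gradient boosting classifiers typically use a log loss instead. All names, parameters and data are hypothetical.

```python
def fit_regression_stump(xs, residuals):
    """Best single-threshold split of 1-D inputs under squared error."""
    best, best_loss = None, float("inf")
    for threshold in sorted(set(xs))[:-1]:   # skip the max so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        loss = (sum((r - left_mean) ** 2 for r in left)
                + sum((r - right_mean) ** 2 for r in right))
        if loss < best_loss:
            best_loss, best = loss, (threshold, left_mean, right_mean)
    return best

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_gradient_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)                 # start from a constant prediction
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of squared loss = what the ensemble still gets wrong.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_regression_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate

def boosted_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

xs = [1, 2, 3, 7, 8, 9]
ys = [0, 0, 0, 1, 1, 1]
model = fit_gradient_boosting(xs, ys)
print(boosted_predict(model, 2), boosted_predict(model, 8))
```

Thresholding the output at 0.5 turns the score into a class label. The small learning rate means each round corrects only part of the remaining error, which is why many rounds are used.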

ACCURACY:-
After applying these algorithms, we obtained the following accuracies:
• Decision Tree: 99.04%
• Random Forest Classifier: 99.35%
• Gradient Boosting Classifier: 98.62%
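
Accuracy here means the share of files whose predicted label matches the true label. A small sketch with made-up labels (the function name and both lists are assumptions for illustration, not project results):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels: 1 = legitimate, 0 = malware.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]
print(f"{accuracy(y_true, y_pred):.2%}")  # 87.50% (7 of 8 correct)
```

Since roughly 70% of the samples in this dataset are malware (96724 of 138048), a model that always answers "malware" would already score about 70% if the test set has the same mix, so the reported figures are best read against that baseline.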

CONCLUSION:-
• Machine learning models have great potential to surpass human performance when
sufficient data is provided.
• When test data is given as input to the trained system, it produces the desired output.
• The best model, the Random Forest Classifier, achieved 99.35% accuracy in classifying files as legitimate or malware.

Future Scope :
• Machine learning models have great potential to surpass human
performance when sufficient data is provided.
• The system achieved 99.35% accuracy in distinguishing
legitimate files from malware files.
• We can predict the amount of time the question will take to get answered.

